foundations of data science for everyone XI

dr. federica bianco | fbb.space | fedhere | fedhere

Artificial Neural Networks

this slide deck:

http://slides.com/federicabianco/fdsfe_11

Recap: Data Science

FDSFE

Data Science

The discipline that deals with extraction of information from data in a specific domain context, from data collection through inference

(Problem Identification and Planning)

Data Collection
Data Exploration
Data Preparation
Model Identification
Model Building
Model Evaluation
Model Deployment.

remote sensing

survey science

instrumental design and development

data retrieval

...

data types

identify correlation

missing variable

...

Scaling and whitening

tokenizing

...

what is the goal:

statistical analysis

anomaly detection

prediction

structure identification

....

what is the task:

regression

classification

SciPy

Data driven models for exploration of structure and prediction that learn parameters from data.

Machine Learning

Reinforcement Learning

Active Learning

unsupervised learning ------ supervised learning

Data driven models for exploration of structure and prediction that learn parameters from data.

unsupervised ------ supervised

set up: All features known for all observations

Goal: explore structure in the data

- data compression

- understanding structure

- anomaly detection

Algorithms: kMeans clustering, DBSCAN, Agglomerative clustering

Machine Learning

Data driven models for exploration of structure and prediction that learn parameters from data.

unsupervised ------ supervised

set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data

Goal: predicting missing feature

- classification

- regression

Algorithms: regression, (SVM), Classification and Regression Tree methods, k-nearest neighbors, neural networks, (...)

Learning relies on the definition of a loss function

learning type	loss / target
unsupervised	intra-cluster variance / inter cluster distance
supervised	distance between prediction and truth

Machine Learning

model parameters are learned by calculating a loss function for different parameter sets and trying to minimize the loss (or calculating a target function and trying to maximize it)

e.g. supervised

L1 = |target - prediction|

Learning relies on the definition of a loss function

Machine Learning

supervised and unsupervised

e.g. unsupervised

\mathrm{Inertia} = \sum_j \sum_{i \in j} (x_i - \bar{x}_j)^2
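A minimal sketch of the unsupervised loss, taking inertia to be the sum of squared distances from each point to the mean of its cluster (the usual k-means definition). The function name `inertia` and the toy points are assumptions made for illustration; they are not from the slides.

```python
def inertia(points, labels):
    """Sum of squared distances of each point to the mean of its cluster."""
    # group the points by cluster label
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        n_dim = len(members[0])
        # cluster mean, one coordinate at a time
        centroid = [sum(m[d] for m in members) / len(members) for d in range(n_dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(n_dim))
    return total


# toy data: two tight groups far apart along the first axis
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
good_inertia = inertia(points, [0, 0, 1, 1])  # clusters follow the groups
bad_inertia = inertia(points, [0, 1, 0, 1])   # clusters mix the groups
print(good_inertia, bad_inertia)
```

A clustering that follows the structure in the data gives a lower inertia than one that mixes the groups, which is what minimizing this loss rewards.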

Interaction with the environment builds a reward function

Machine Learning

reinforcement

The goal of the agent is to maximize a cumulative reward signal over time

The objective is not to predict a specific output but to learn a policy or strategy that maximizes the cumulative reward over time.

Supervised Learning tasks

regression ------ classification

Target Variable: CONTINUOUS

(age, income, temperature...)

Target Variable: Categorical

(color, shape, income class...)

The definition of a loss function requires the definition of distance or similarity

Machine Learning

Minkowski distance

Jaccard similarity

Great circle distance

J(A, B) = \frac{|A\cap B|}{|A\cup B|}
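The three measures named on this slide can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the great circle distance uses the haversine formula on a sphere, and the default radius of 6371 km (mean Earth radius) is my choice.

```python
import math


def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def great_circle(lat1, lon1, lat2, lon2, radius=6371.0):
    """Great circle distance (haversine formula), angles in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


euclidean = minkowski((0, 0), (3, 4), p=2)
manhattan = minkowski((0, 0), (3, 4), p=1)
similarity = jaccard({1, 2, 3}, {2, 3, 4})
quarter_equator = great_circle(0, 0, 0, 90)
```

Which one is appropriate depends on the data: numerical features, sets or categories, and positions on a sphere, respectively.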

NN:

Neural Networks

1.1

origins

1943

M-P Neuron McCulloch & Pitts 1943

M-P Neuron

it's a classifier

1 ~\mathrm{if} ~\sum_{i=1}^3x_i \geq\theta ~\mathrm{else}~ 0

M-P Neuron McCulloch & Pitts 1943

\sum_{i=1}^3x_i

M-P Neuron

1943

1 ~\mathrm{if} ~\sum_{i=1}^Nx_i \geq\theta ~\mathrm{else}~ 0

M-P Neuron McCulloch & Pitts 1943

M-P Neuron

what does \theta have to be if

x_1 = 0.1

x_2 = 0.6

x_3 = 0.2

and the target variable for this example is 1?

x_1+x_2+x_3 \geq \theta \\ 0.1 + 0.6 + 0.2 = 0.9 \geq \theta

M-P Neuron

1943

if x_i is Bool (True/False)

what value of \theta corresponds to logical AND?

M-P Neuron McCulloch & Pitts 1943

1 ~\mathrm{if} ~\sum_{i=1}^3x_i \geq\theta ~\mathrm{else}~ 0
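A small sketch of the M-P neuron rule above, useful for checking the two questions on these slides. The function name `mp_neuron` and the specific thresholds tried are assumptions made for illustration. With N binary inputs, setting the threshold to N makes the neuron fire only when all inputs are 1 (logical AND), while a threshold of 1 gives logical OR.

```python
from itertools import product


def mp_neuron(inputs, theta):
    """McCulloch-Pitts neuron: output 1 if the sum of the inputs reaches theta, else 0."""
    return 1 if sum(inputs) >= theta else 0


n_inputs = 3
binary_inputs = list(product([0, 1], repeat=n_inputs))  # (0,0,0) ... (1,1,1)

# theta = N: fires only when every input is 1 (logical AND)
and_outputs = [mp_neuron(x, theta=n_inputs) for x in binary_inputs]
# theta = 1: fires when at least one input is 1 (logical OR)
or_outputs = [mp_neuron(x, theta=1) for x in binary_inputs]

# the worked example: inputs sum to about 0.9, so a threshold below that
# gives output 1 and a threshold above it gives output 0
fires = mp_neuron([0.1, 0.6, 0.2], theta=0.5)
silent = mp_neuron([0.1, 0.6, 0.2], theta=1.0)
```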

M-P Neuron

The perceptron algorithm : 1958, Frank Rosenblatt

1958

Perceptron

1958

1 ~\mathrm{if} ~\sum_{i=1}^Nw_ix_i \geq\theta ~\mathrm{else}~ 0

M-P Neuron

1 ~\mathrm{if} ~\sum_{i=1}^3x_i \geq\theta ~\mathrm{else}~ 0

Perceptron

1943

1 ~\mathrm{if} ~\sum_{i=1}^Nw_ix_i \geq\theta ~\mathrm{else}~ 0

The perceptron algorithm : 1958, Frank Rosenblatt

1958

Perceptron

The perceptron algorithm : 1958, Frank Rosenblatt

x_1

x_2

x_N

output

weights

w_i

bias

linear regression:

w_2

w_1

w_N

1 ~\mathrm{if} ~\sum_{i=1}^Nw_ix_i \geq\theta ~\mathrm{else}~ 0

1958

Perceptron

A perceptron is a linear classifier: it makes its predictions based on a linear predictor function combining a set of weights (= parameters) with the feature vector.

The perceptron algorithm : 1958, Frank Rosenblatt

1958

y ~= ~\sum_i w_ix_i ~+~ b

Perceptron

y= \begin{cases} 1 ~\mathrm{if}~ \sum_i(x_i w_i) + b ~\geq~ Z\\ 0 ~\mathrm{if}~ \sum_i(x_i w_i) + b ~<~ Z \end{cases}

x_1

x_2

x_N

w_2

w_1

w_N

output

activation function

weights

w_i

bias

y ~= f(~\sum_i w_ix_i ~+~ b)

The perceptron algorithm : 1958, Frank Rosenblatt

Perceptron

The perceptron algorithm : 1958, Frank Rosenblatt

w_2

w_1

w_N

output

activation function

weights

w_i

bias

sigmoid

\sigma = \frac{1}{1 + e^{-z}}

x_1

x_2

x_N

y ~= f(~\sum_i w_ix_i ~+~ b)
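A hedged sketch of a single perceptron, y = f(sum_i w_i x_i + b), with two choices of activation function: a hard step and the sigmoid. The weights and bias are hand-picked for illustration (not learned) so that the step version reproduces logical AND on two binary inputs; the threshold is folded into the bias so that the step switches at zero. All names are hypothetical.

```python
import math


def step(z):
    """Hard threshold activation: 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Smooth activation between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(x, w, b, activation=step):
    """Weighted sum of the features plus bias, passed through an activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(z)


weights = [1.0, 1.0]
bias = -1.5  # only x = (1, 1) pushes the weighted sum above zero

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
hard_outputs = [perceptron(x, weights, bias, step) for x in inputs]
soft_outputs = [perceptron(x, weights, bias, sigmoid) for x in inputs]
```

The sigmoid version returns values between 0 and 1 that change smoothly with the weights, which is what makes gradient-based training possible.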

Perceptron

ANN examples of activation function

The perceptron algorithm : 1958, Frank Rosenblatt

Perceptron

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

The embryo - the Weather Bureau's $2,000,000 "704" computer - learned to differentiate between left and right after 50 attempts in the Navy demonstration.

NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser

July 8, 1958

Deep Learning

DNN:

Problem:

Single-layer perceptrons are only capable of learning linearly separable patterns.

Perceptrons

Marvin Minsky and Seymour Papert

1969

multilayer perceptron

x_2

x_3

output

x_1

layer of perceptrons

b_1

b_2

b_3

b_4

multilayer perceptron

x_2

x_3

output

input layer

hidden layer

output layer

1970: multilayer perceptron architecture

x_1

Fully connected: all nodes go to all nodes of the next layer.

b_1

b_2

b_3

b_4

Perceptrons by Marvin Minsky and Seymour Papert 1969

multilayer perceptron

x_2

x_3

output

x_1

layer of perceptrons

b_1

b_2

b_3

b_4

w_{11}

w_{12}

w_{13}

w_{14}

multilayer perceptron

x_2

x_3

output

x_1

layer of perceptrons

b_1

b_2

b_3

b_4

w_{21}

w_{22}

w_{23}

w_{24}

multilayer perceptron

layer of perceptrons

x_2

x_3

output

x_1

layer of perceptrons

b_1

b_2

b_3

b_4

w_{31}

w_{32}

w_{33}

w_{34}

multilayer perceptron

x_2

x_3

output

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons

x_1

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1

multilayer perceptron

x_2

x_3

output

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1

w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2

w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3

w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4

x_1

w: weight

sets the sensitivity of a neuron

b: bias:

up-down weights a neuron

learned parameters

multilayer perceptron

x_2

x_3

output

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons

f(w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1)

f(w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2)

f(w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3)

f(w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4)

x_1

w: weight

sets the sensitivity of a neuron

b: bias:

up-down weights a neuron

f: activation function:

turns neurons on-off
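The layer of perceptrons above can be sketched as one function that applies f(w_{j1}x_1 + w_{j2}x_2 + w_{j3}x_3 + b_j) to every neuron j. This is a minimal illustration: the weight and bias values are made up, the sigmoid is an assumed choice of activation, and `dense_layer` is a hypothetical name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(x, W, b, f):
    """Fully connected layer: W[j][i] is the weight from input i to neuron j."""
    return [f(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)]


x = [1.0, 0.5, -1.0]  # three input features x_1, x_2, x_3

# hidden layer: 4 neurons, each with 3 weights and 1 bias (illustrative values)
W_hidden = [[0.2, -0.4, 0.1],
            [0.5, 0.3, -0.2],
            [-0.3, 0.8, 0.6],
            [0.0, 0.1, 0.9]]
b_hidden = [0.1, -0.1, 0.0, 0.2]

# output layer: 1 neuron fed by the 4 hidden activations
W_out = [[0.7, -0.5, 0.3, 0.2]]
b_out = [0.05]

hidden = dense_layer(x, W_hidden, b_hidden, sigmoid)
output = dense_layer(hidden, W_out, b_out, sigmoid)
```

Stacking calls to the same function is all that "multilayer" means here: the output of one layer is the input of the next.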

DNN:

hyperparameters of DNN

EXERCISE

output

how many parameters?

input layer

hidden layer

output layer

hidden layer

output

input layer

hidden layer

output layer

hidden layer

(3x4)+4

(4x3)+3

how many parameters?

EXERCISE

(3)+1

output

input layer

hidden layer

output layer

hidden layer
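The parameter count in this exercise can be checked with a short sketch. It assumes the architecture implied by the counts on the slide: 3 inputs, a hidden layer of 4 neurons, a hidden layer of 3 neurons, and 1 output, all fully connected. The function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for each pair of consecutive fully connected layers."""
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


per_layer = count_parameters([3, 4, 3, 1])
total = sum(per_layer)
print(per_layer, total)  # (3x4)+4, (4x3)+3, (3)+1 and their sum
```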

number of layers - 1
number of neurons/layer - N_l
activation function/layer - N_l
layer connectivity - N_l ^ {~??}
optimization metric - 1
optimization method - 1
parameters in optimization - M

how many hyperparameters?

EXERCISE

GREEN: architecture hyperparameters

RED: training hyperparameters

EXERCISE

http://playground.tensorflow.org/

DNN:

training DNN

https://colab.research.google.com/drive/13c9uJ_fPGjszgsyEuYWafR2F4_n-IXeZ

deep neural net

Fully connected: all nodes go to all nodes of the next layer.

1986: Deep Neural Nets

\vec{y} = f_N(\ldots f_2(f_1(\vec{x} W_1 + b_1) W_2 + b_2) \ldots W_N + b_N)

f: activation function:

turns neurons on-off

w: weight

sets the sensitivity of a neuron

b: bias:

up-down weights a neuron

In a CNN these layers would not be fully connected except the last one

http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

Seminal paper

Y. LeCun 1998

x_1

x_2

x_N

\vec{y} = \vec{x}W + b

Any linear model:

w_2

w_1

w_N

y : prediction

ytrue : target

Error: e.g.

L_2~=~(y - y_\mathrm{true})^2

intercept

slope

Find the best parameters by finding the minimum of the L2 loss surface

at every step look around and choose the best direction
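"Look around and choose the best direction" is gradient descent. Below is a minimal sketch for a one-feature linear model with slope and intercept, minimizing the mean L2 loss. The toy data, learning rate and number of steps are assumptions chosen so the example converges; `fit_line` is a hypothetical name.

```python
def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = slope * x + intercept by gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # partial derivatives of mean((prediction - target)^2)
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        # step downhill, against the gradient
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1, no noise
slope, intercept = fit_line(xs, ys)
```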

back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize??

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}

x_1

x_N

https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf

w_2

output

\vec{y} = f_N(\ldots f_2(f_1(\vec{x} W_1 + b_1) W_2 + b_2) \ldots W_N + b_N)

Training models with this many parameters requires a lot of care:

. defining the metric

. optimization schemes

. training/validation/testing sets

But, just as in our simple linear regression case, small changes in the parameters lead to small changes in the output, for the right activation functions.

define a cost function, e.g.

C=\frac{1}{2}\|y-a^L\|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2

Training a DNN

feed data forward through network and calculate cost metric

for each layer, calculate effect of small changes on next layer

\vec{y} = f_N(\ldots f_2(f_1(\vec{x} W_1 + b_1) W_2 + b_2) \ldots W_N + b_N)

back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize??

think of applying the gradient to a function of a function of a function... use:

1) partial derivatives, 2) chain rule

http://neuralnetworksanddeeplearning.com/chap2.html
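A sketch of back-propagation on a tiny network (2 inputs, 2 sigmoid hidden neurons, 1 sigmoid output) with the quadratic cost C = 1/2 (y - a^L)^2. The architecture, parameter values and function names are assumptions made for illustration. The gradients come from partial derivatives and the chain rule, and one of them is compared against a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Feed x forward; return the hidden activations and the output a^L."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def cost(params, x, y):
    _, a = forward(params, x)
    return 0.5 * (y - a) ** 2


def backprop(params, x, y):
    """Gradients of the cost with respect to W1, b1, W2, b2 via the chain rule."""
    W1, b1, W2, b2 = params
    hidden, a = forward(params, x)
    # output layer: dC/da = (a - y), da/dz = a (1 - a) for the sigmoid
    delta_out = (a - y) * a * (1.0 - a)
    grad_W2 = [delta_out * h for h in hidden]
    # hidden layer: push delta_out back through W2 and the hidden sigmoid
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out


def gradient_step(params, grads, learning_rate):
    """Move every parameter a small step against its gradient."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    new_W1 = [[w - learning_rate * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, gW1)]
    new_b1 = [b - learning_rate * g for b, g in zip(b1, gb1)]
    new_W2 = [w - learning_rate * g for w, g in zip(W2, gW2)]
    return new_W1, new_b1, new_W2, b2 - learning_rate * gb2


params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
x, y = [1.0, 0.5], 1.0
grads = backprop(params, x, y)

# finite-difference estimate of dC/dW1[0][1] to check the chain-rule result
eps = 1e-6
W1_plus = [[0.5, -0.3 + eps], [0.8, 0.2]]
W1_minus = [[0.5, -0.3 - eps], [0.8, 0.2]]
numeric = (cost((W1_plus,) + params[1:], x, y)
           - cost((W1_minus,) + params[1:], x, y)) / (2 * eps)
analytic = grads[0][0][1]

cost_before = cost(params, x, y)
cost_after = cost(gradient_step(params, grads, learning_rate=0.1), x, y)
```

In practice a library computes these derivatives automatically; the point of the sketch is only that each layer's gradient reuses the one from the layer after it.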

define a cost function, e.g.

C=\frac{1}{2}\|y-a^L\|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2

Training a DNN

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

architecture - wide networks tend to overfit, deep networks are hard to train
number of epochs - the sweet spot is when learning slows down, but before you start overfitting... it may take DAYS! jumps may indicate bad initial choices (like in all gradient descent)
loss function - needs to be appropriate to the task, e.g. classification vs regression
activation functions - need to be consistent with the loss function
optimization scheme - needs to be appropriate to the task and data
learning rate in optimization - balance speed and accuracy
batch size - smaller batches make each update faster but noisier, and can lead to overtraining

An article that compares various DNNs

https://arxiv.org/pdf/1605.07678.pdf

accuracy comparison

batch size

What should I choose for the loss function and how does that relate to the activation function and optimization?

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

always check your loss function! it should go down smoothly and flatten out at the end of the training.

not flat? you are still learning!

too flat? you are overfitting...

loss (gallery of horrors)

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

jumps are not unlikely (and not necessarily a problem) if your activations have discontinuous derivatives (e.g. ReLU)

regularization (e.g. dropout) is applied during training but switched off when evaluating on the validation set, so the validation loss can be smaller than the training loss

loss and learning rate (note that the appropriate learning rate depends on the chosen optimization scheme!)
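A toy illustration of how the learning rate shapes the loss curve. It uses plain gradient descent on the one-parameter loss w^2, which is an assumption made to keep the example small; other optimization schemes behave differently, and the three learning rates are picked only to show the slow, well-behaved and diverging regimes.

```python
def descend(learning_rate, n_steps=50, start=5.0):
    """Plain gradient descent on loss(w) = w**2; returns the loss at every step."""
    w = start
    history = []
    for _ in range(n_steps):
        history.append(w ** 2)
        w -= learning_rate * 2.0 * w  # the gradient of w**2 is 2w
    return history


slow = descend(0.01)       # too small: the loss is still far from flat
good = descend(0.3)        # reasonable: the loss drops quickly and flattens out
diverging = descend(1.1)   # too large: every step overshoots and the loss grows
```

Plotting the three histories gives a small version of the gallery above: a curve that never flattens, one that converges, and one that blows up.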

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss	good for	activation last layer	size last layer
mean_squared_error	regression	linear	one node
mean_absolute_error	regression	linear	one node
mean_squared_logarithmic_error	regression	linear	one node
binary_crossentropy	binary classification	sigmoid	one node
categorical_crossentropy	multiclass classification	softmax	N nodes
kullback_leibler_divergence	multiclass classification, probabilistic interpretation	softmax	N nodes
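To make the pairing of loss and last-layer activation concrete, here is a standard-library sketch of three of the losses in the table, written out by hand rather than called from keras. The function names mirror the keras loss names for readability, but these are illustrative re-implementations; the `eps` clipping value is an assumption, and the multiclass example assumes a softmax last layer.

```python
import math


def sigmoid(z):
    """Last-layer activation for binary classification: one probability."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Last-layer activation for multiclass: N probabilities that sum to 1."""
    shift = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def categorical_crossentropy(one_hot, probs, eps=1e-12):
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])       # regression, linear output
bce = binary_crossentropy(1, sigmoid(0.0))              # binary, sigmoid output
class_probs = softmax([1.0, 2.0, 1.0])
cce = categorical_crossentropy([0, 1, 0], class_probs)  # multiclass, softmax output
```

The cross-entropy losses expect probabilities, which is why the last layer has to squash its output into the 0 to 1 range, while the regression losses work on unbounded linear outputs.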

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

On the interpretability of DNNs

https://distill.pub/2020/circuits/zoom-in/

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Punch Line

Deep Neural Nets are not some fancy-pants methods, they are just linear models with a bunch of parameters

Black Box?

Because they have many parameters they are difficult to "interpret" (no easy feature extraction)

that is ok because they are prediction machines

deep dreams

what is happening in DeepDream?

Deep Dream (DD) is a Google software, a pre-trained NN (originally created on the Caffe framework, now ported to many other platforms including TensorFlow).

The high level idea relies on training a convolutional NN to recognize common objects, e.g. dogs, cats, cars, in images. As the network learns to recognize those objects, it develops its layers to pick out "features" of the images, like lines at certain orientations, circles, etc.

The DD software runs this NN on an image you give it, and it loops on some layers, thus "manifesting" the things it knows how to recognize in the image.

Olague et al 2017