data science for (physical) scientists 13

dr. federica bianco | fbb.space | fedhere

Convolutional Neural Networks

 

this slide deck:

 
  • Machine Learning basic concepts
    • interpretability
    • parameters vs hyperparameters
    • supervised/unsupervised


  • CART
  • Clustering methods
  • Neural Networks

Neural Networks

 

  • the brain connection
  • perceptron
  • activation functions
  • shallow nets
  • deep nets architecture
  • back-propagation
  • convolutional NN
  • preprocessing and whitening (minibatch)

 

mean absolute error:

L_1 = \sum \left| y_\mathrm{true} - y_\mathrm{predicted}\right|

mean squared error:

L_2 = \sum \left( y_\mathrm{true} - y_\mathrm{predicted}\right)^2
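A minimal numpy sketch of these two losses (function and variable names are illustrative):

import numpy as np

def l1_loss(y_true, y_pred):
    # sum of absolute residuals (MAE up to the 1/N normalization)
    return np.abs(y_true - y_pred).sum()

def l2_loss(y_true, y_pred):
    # sum of squared residuals (MSE up to the 1/N normalization)
    return ((y_true - y_pred) ** 2).sum()

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(l1_loss(y_true, y_pred), l2_loss(y_true, y_pred))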

recap

 

How do we fit a model to data?

 

minimize loss function

optimization schemes

Gradient descent

[diagram: a single linear unit — inputs x_1, x_2, ..., x_N with weights w_1, w_2, ..., w_N and bias b, summed into the output y]

\vec{y} = \vec{x}W + b

Any linear model

[diagram: the same unit with an activation function f applied to the weighted sum — input layer, hidden layer, output layer]

\vec{y} = f(\vec{x}W + b)

perceptron or shallow NN

Any linear model:

\vec{y} = \vec{x}W + b ~~~~ \mathrm{i.e.} ~~~~ y = \sum_i w_i x_i + b

y : prediction
y_true : target

Error: e.g. L_2~=~(y - y_\mathrm{true})^2

[figure: the data (y vs features x) and the L2 loss surface as a function of the parameters (slope, intercept)]

Find the best parameters by finding the minimum of the L2 hyperplane

 


\vec{y} = a\vec{x}+b

\vec{L_2} = \frac{1}{N}\sum_i{(y_i - (a{x_i} + b))^2}

Find the best parameters by finding the minimum of the L2 hyperplane: start from an initial guess and, at every step, look around and choose the best direction toward the global minimum.

How do I know in which direction to go?

Find the direction in which L2 decreases by taking the gradient of the loss function with respect to the parameters (a, b).

[figure: the L2 surface over (slope, intercept), with a path of steps from the initial guess down to the global minimum]

Gradient Descent

Used to optimize the parameters (m, b) of a function f(m,b) fit to the data \vec{x}=x_1...x_N

  0. Choose a learning rate hyperparameter α
  1. Start at a random location in parameter space: m=m_0, b=b_0
  2. Calculate the partial derivative of f with respect to each parameter: f' = (\frac{\partial f}{\partial m}, ~ \frac{\partial f}{\partial b} )
  3. Update the parameters according to the (partial) derivatives:

m_\mathrm{new} = m_\mathrm{old} - \frac{\partial f}{\partial m}\cdot\frac{\alpha}{N}\\ b_\mathrm{new} = b_\mathrm{old} - \frac{\partial f}{\partial b}\cdot \frac{\alpha}{N}

[figure: the function f vs a parameter w for the model \vec{y} = a\vec{x}+b — where \frac{\partial{f}}{\partial{w}} > 0 the update rule above moves w down, where \frac{\partial{f}}{\partial{w}} < 0 it moves w up, so each step moves toward the minimum]


Gradient Descent

Used to optimize parameters of a function fit to data

\vec{x} = x_1...x_N\\ f(\vec{x},m,b) = m\vec{x} + b\\ L_2 = \frac{1}{N}\sum(y -( m\vec{x} + b))^2
def loss(m, b, x, y):
    # L2 loss: mean of the squared residuals (matches the formula above)
    return ((y - (m * x + b)) ** 2).mean()

def gradDesc(m, b, x, y, alpha):
    # one gradient descent step for the linear model y = m*x + b
    N = len(x)
    # partial derivatives of the summed squared residuals:
    # dL/dm = -2x(y - (mx + b)),  dL/db = -2(y - (mx + b))
    f_m = (-2 * x * (y - (m * x + b))).sum()
    f_b = (-2 * (y - (m * x + b))).sum()
    # subtract because the derivatives point in the direction of steepest ascent
    m -= f_m / float(N) * alpha
    b -= f_b / float(N) * alpha
    return m, b

# initial setup (x and y are the data arrays, assumed already defined)
m_new, b_new = 11, 11

# hyperparameters
epsilon = 1      # convergence threshold
alpha = 0.0005   # learning rate

print(loss(m_new, b_new, x, y))
while loss(m_new, b_new, x, y) > epsilon:
    m_old, b_old = m_new, b_new
    m_new, b_new = gradDesc(m_old, b_old, x, y, alpha)
    print(loss(m_new, b_new, x, y))

neural networks

recap

 

Perceptrons are linear classifiers: they make their predictions based on a linear predictor function, combining a set of weights (= parameters) with the feature vector.

y ~= ~\sum_i w_ix_i ~+~ b
y ~= ~wx ~+~ b
y ~= ~f(\sum_i w_ix_i ~+~ b)
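A minimal numpy sketch of a single perceptron forward pass, y = f(Σ w_i x_i + b) (the sigmoid choice and all numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, f=sigmoid):
    # y = f(sum_i w_i x_i + b)
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # feature vector
w = np.array([0.1, 0.4, -0.2])   # weights (parameters)
b = 0.05                         # bias
print(perceptron(x, w, b))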

[diagram: perceptron — inputs x_1, x_2, ..., x_N, weights w_i, bias b, activation function f, output]

recap

perceptrons → multilayer perceptron

1970: multilayer perceptron architecture

[diagram: inputs x_1, x_2, x_3 feeding a hidden layer, feeding the output — input layer, hidden layer, output layer]

Fully connected: all nodes go to all nodes of the next layer.

recap

multilayer perceptron: a layer of perceptrons

Fully connected: all nodes go to all nodes of the next layer. Each node in the hidden layer computes its own weighted sum of the inputs (x_1, x_2, x_3):

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1
w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2
w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3
w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4
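A minimal numpy sketch of one fully connected layer computed as a single matrix product (shapes and numbers are illustrative):

import numpy as np

x = np.array([1.0, -0.5, 2.0])   # 3 inputs
W = np.random.randn(4, 3)        # 4 hidden neurons x 3 inputs
b = np.random.randn(4)           # one bias per hidden neuron

z = W @ x + b                    # the four weighted sums listed above
y = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation of each neuron
print(z, y)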

activation functions


Back Propagation

Training ANN: back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}

[diagram: a deep network — inputs x_1 ... x_N passed through successive layers of weighted sums and activations f; seminal paper: Y. LeCun 1998]

\vec{y} = f_N(\dots f_1(\vec{x} W_1 + b_1)\dots W_N + b_N)

Written out explicitly for a 2-input, 3-hidden-neuron, 1-output network with sigmoid activations:

y = \frac{1}{ 1+e^{-\frac{w_7}{1+e^{-w_1x_1-w_4x_2 - b_1}} - \frac{w_8}{1+e^{-w_2x_1-w_5x_2 - b_2}}- \frac{w_9}{1+e^{-w_3x_1-w_6x_2 - b_3}}-b_4}}

Training models with this many parameters requires a lot of care:

- defining the metric

- choosing optimization schemes

- splitting into training/validation/testing sets

Small changes in the parameters lead to small changes in the output for the right activation functions.

Training a DNN

\vec{y} = f_N(\dots f_1(\vec{x} W_1 + b_1)\dots W_N + b_N)

define a cost function, e.g.

C=\frac{1}{2}|y−a^L|^2~=~\frac{1}{2}\sum_j(y_j−a^L_j)^2

feed the data forward through the network and calculate the cost metric

for each layer, calculate the effect of small changes on the next layer

back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

think of applying the gradient to a function of a function of a function... use:

1) partial derivatives, 2) the chain rule

define a cost function, e.g.

C=\frac{1}{2}|y−a^L|^2~=~\frac{1}{2}\sum_j(y_j−a^L_j)^2
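A minimal sketch of back-propagation via the chain rule on a tiny 1-input, 1-hidden-unit, 1-output network with sigmoid activations and the quadratic cost above (all numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_true = 0.5, 1.0                   # one training point
w1, b1, w2, b2 = 0.3, 0.1, -0.4, 0.2   # parameters

# forward pass
z1 = w1 * x + b1; a1 = sigmoid(z1)
z2 = w2 * a1 + b2; a2 = sigmoid(z2)
C = 0.5 * (y_true - a2) ** 2

# backward pass: chain rule, layer by layer
dC_da2 = -(y_true - a2)
da2_dz2 = a2 * (1 - a2)                # sigmoid derivative
dC_dz2 = dC_da2 * da2_dz2
dC_dw2, dC_db2 = dC_dz2 * a1, dC_dz2

dC_da1 = dC_dz2 * w2                   # propagate back through layer 2
da1_dz1 = a1 * (1 - a1)
dC_dz1 = dC_da1 * da1_dz1
dC_dw1, dC_db1 = dC_dz1 * x, dC_dz1

print(C, dC_dw1, dC_db1, dC_dw2, dC_db2)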

Training a DNN

ISSUES TO WATCH FOR

Training a DNN

Exploding Gradient

the gradients of the network's loss with respect to the parameters (weights) become excessively large.

The "explosion" of the gradient can lead to numerical instability and the inability of the network to converge.

Erratic learning, with the loss becoming NaN (not a number) or Inf (infinity)

see link

Vanishing Gradient

when the gradients are very small, they can diminish as they are propagated back through the network, leading to minimal or no updates to the weights in the initial layers.

Activation functions like the sigmoid or hyperbolic tangent (tanh) have small derivatives: at most 0.25 for the sigmoid and at most 1 for tanh (much smaller away from zero). Back-propagation multiplies many such factors together, so the gradients of the loss function with respect to the early-layer parameters can become very small.

Slow and stalled learning

see link
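A minimal numpy sketch of the vanishing-gradient effect: the gradient reaching the early layers contains a product of one activation derivative per layer, and for the sigmoid each factor is at most 0.25 (the depth and numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=50)                      # pre-activations at 50 successive layers
sig_grads = sigmoid(z) * (1 - sigmoid(z))    # each factor is <= 0.25
print(np.prod(sig_grads[:5]), np.prod(sig_grads[:20]), np.prod(sig_grads))
# the product shrinks rapidly toward 0 as the number of layers grows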

CNN

1

Convolutional Neural Nets

@akumadog

Brain Programming and the Random Search in Object Categorization

 

The visual cortex learns hierarchically: it first detects simple features, then more complex features and ensembles of features.

CNN

1a

Convolution

convolution is a mathematical operator on two functions f and g that produces a third function f ∗ g expressing how the shape of one is modified by the other.


Convolution Theorem

f * g= \mathcal{F}^{-1}\big\{\mathcal{F}\{f\}\cdot\mathcal{F}\{g\}\big\}

where \mathcal{F} is the Fourier transform:

\begin{aligned}F(\nu )&=\int _{\mathbb {R} ^{n}}f(x)e^{-2\pi ix\cdot \nu }\,dx,\\ G(\nu )&=\int _{\mathbb {R} ^{n}}g(x)e^{-2\pi ix\cdot \nu }\,dx\end{aligned}
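A minimal numpy sketch of the convolution theorem in 1-D: the circular convolution of two arrays equals the inverse FFT of the product of their FFTs (the array size is illustrative):

import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=32)
g = rng.normal(size=32)

# direct circular convolution
direct = np.array([sum(f[m] * g[(n - m) % 32] for m in range(32)) for n in range(32)])

# via the convolution theorem
via_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))

print(np.allclose(direct, via_fft))   # True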

In a CNN the convolution acts on two images: the input image and a small kernel (filter).

input image (a 5×5 "X" on a background of -1):

-1 -1 -1 -1 -1
-1  1 -1  1 -1
-1 -1  1 -1 -1
-1  1 -1  1 -1
-1 -1 -1 -1 -1

3×3 kernels, e.g. the two diagonal features:

 1 -1 -1        -1 -1  1
-1  1 -1        -1  1 -1
-1 -1  1         1 -1 -1

model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))

convolution: slide the kernel across the image; at each position multiply the overlapping entries element-wise and sum them.

For the top-left position with the first kernel:

(-1*1) + (-1*-1) + (-1*-1) + \\ (-1*-1)+(1*1)+(-1*-1)+ \\ (-1*-1)+(-1*-1)+(1*1) = 7

Sliding one pixel to the right gives -3, and repeating at every position fills in the feature map:

 7 -3  3
-3  5 -3
 3 -3  7

input layer → convolution layer → feature map

the feature map is "richer": we went from binary values to ℝ,

and it is reminiscent of the original layer: the diagonal of the X lights up (7, 5, 7).

Convolve with different features: each neuron is 1 feature (one kernel, one feature map).
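A minimal numpy check of the walk-through above, computing the "valid" convolution of the 5×5 X image with the diagonal kernel by explicit loops:

import numpy as np

image = np.array([[-1, -1, -1, -1, -1],
                  [-1,  1, -1,  1, -1],
                  [-1, -1,  1, -1, -1],
                  [-1,  1, -1,  1, -1],
                  [-1, -1, -1, -1, -1]])

kernel = np.array([[ 1, -1, -1],
                   [-1,  1, -1],
                   [-1, -1,  1]])

feature_map = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        # element-wise product of the kernel with the 3x3 patch, then sum
        feature_map[i, j] = (image[i:i+3, j:j+3] * kernel).sum()

print(feature_map)
# [[ 7. -3.  3.]
#  [-3.  5. -3.]
#  [ 3. -3.  7.]]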

CNN

1b

ReLU

ReLU (rectified linear unit): sets all negative values to 0

 7 -3  3        7  0  3
-3  5 -3   →    0  5  0
 3 -3  7        3  0  7

model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))

CNN

1c

Max-Pool

MaxPooling: reduces the image size & generalizes the result

2×2 max pool over the ReLU output:

7 0 3
0 5 0    →    7 5
3 0 7         5 7

By reducing the size and picking the maximum of a sub-region we make the network less sensitive to specific details

model.add(MaxPooling2D(pool_size=(2, 2)))
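A minimal numpy sketch of ReLU followed by a 2×2 max pool taken with stride 1, matching the walk-through above (note that Keras' MaxPooling2D uses stride = pool size by default):

import numpy as np

feature_map = np.array([[ 7, -3,  3],
                        [-3,  5, -3],
                        [ 3, -3,  7]])

relu = np.maximum(feature_map, 0)                 # negatives -> 0

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        pooled[i, j] = relu[i:i+2, j:j+2].max()   # max over each 2x2 window

print(relu)
print(pooled)    # [[7. 5.], [5. 7.]]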

CNN

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
model = Sequential()
# input_shape = (rows, cols, channels) of the input images, defined earlier
model.add(Conv2D(32, kernel_size=(10, 10),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

final layer:

the final layer is a fully connected MLP

[diagram: the last hidden layer, flattened, fully connected to the output layer]

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
model = Sequential()
model.add(Conv2D(32, kernel_size=(10, 10),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2))) 
model.add(Flatten())                         # flatten the feature maps into a vector
model.add(Dense(128, activation='relu'))     # fully connected layer
model.add(Dense(2, activation='softmax'))    # 2-class output

Stack multiple convolution layers
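A minimal sketch of how the model above could be compiled and trained; the optimizer, loss, batch size, epochs, and the x_train / y_train arrays (labels one-hot encoded to match the 2-unit softmax) are illustrative assumptions, not part of the original deck:

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train,          # x_train: images, y_train: one-hot labels (assumed prepared)
          batch_size=32,
          epochs=10,
          validation_split=0.2)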

overfitting

3

Minibatch

&

Dropout

Overfitting

What are the symptoms?

How can we fix it?

[figure: model performance (accuracy) vs tree depth, and vs ANN training epochs]

  • If one updates the model parameters only after processing the whole training data (i.e., once per epoch), it takes too long to get a model update, and the entire training data may not fit in memory.
  • If one updates the model parameters after processing every single instance (i.e., stochastic gradient descent), the updates are too noisy, and the process is not computationally efficient.
  • Therefore, minibatch gradient descent is introduced as a trade-off: frequent, memory-efficient updates that are still averaged over enough instances to be accurate and computationally efficient.

Split your training set into many smaller subsets and train on each small set separately
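A minimal numpy sketch of minibatch gradient descent on the earlier linear model: shuffle the data each epoch, split it into small batches, and update the parameters once per batch (all numbers are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 3 * x + 2 + rng.normal(0, 0.5, size=1000)    # noisy line

m, b, alpha, batch_size = 0.0, 0.0, 0.01, 32

for epoch in range(50):
    idx = rng.permutation(len(x))                # shuffle each epoch
    for start in range(0, len(x), batch_size):
        sel = idx[start:start + batch_size]      # one minibatch
        xb, yb = x[sel], y[sel]
        resid = yb - (m * xb + b)
        m += alpha * 2 * (xb * resid).mean()     # gradient step on this batch only
        b += alpha * 2 * resid.mean()

print(m, b)   # approaches (3, 2)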

overfitting

Dropout

Artificially remove (set to zero) some neurons for different minibatches to avoid overfitting

[diagram: a fully connected network with some of its neurons dropped]

from keras.models import Sequential
from keras.layers import Dropout
model.add(Dropout(0.5))   # randomly drop 50% of the previous layer's units during training

Class Imbalance

[figure: a 2-D scatter plot (x, y) with many points of one class and few of the other]

what is the simplest classifier you can build for this dataset? what is its accuracy?

If your dataset is imbalanced (more of one class than the other), your model will learn that it is better to guess the most common class, and this will contaminate the prediction.
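A minimal numpy sketch of why accuracy misleads on imbalanced data: the "simplest classifier", which always predicts the majority class, already scores high accuracy (the class fractions are illustrative):

import numpy as np

y = np.array([0] * 950 + [1] * 50)    # 95% class 0, 5% class 1
y_pred = np.zeros_like(y)             # simplest classifier: always predict class 0

accuracy = (y_pred == y).mean()
print(accuracy)                       # 0.95, yet class 1 is never detected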

 

key concepts

 

 

Architecture components: neurons, activation function

  • basically each neuron is a multivariate regression with an activation function that turns the output into a probability
  • changing the weights and biases in the linear regression gives different results

Single layer NN: perceptrons

  • perceptrons were developed in the 1950s, but a long time passed before people figured out how to build complex layered architectures and, especially, how to train them

Deep NN:

  • DNNs are multi-layer architectures of neurons. They can be fully connected (each neuron goes to each neuron of the next layer) or not (a neuron goes only to some neurons in the next layer)
  • DNNs have a lot of parameters (thousands or more!), which makes the interpretability and feature extraction of NNs difficult.

 

key concepts

 

 

Convolutional NN

  • convolutional NNs are DNNs with three types of layers: 
    • convolutional layers: run filters over an image to detect features like edges or colors
    • maxpool layers: decrease the size of the previous layer's output and remove some details 
    • ReLU (rectified linear units): rectifies the output of conv layers so that it is all positive (sets negatives to 0)
  • CNNs are great for the study of structure in large datasets (images are large datasets)

Training an NN:

  • most ML methods are trained by gradient descent: change the weights and biases based on the derivative of the loss (or cost) function 
  • DNNs are difficult to train because of their layered structure
  • back-propagation propagates the parameter updates back through the entire NN
  • Minibatch: split the training set into many (hundreds!) subsets and use these to train the NN one batch at a time
  • Dropout: set some neurons to zero to avoid overfitting

Galaxyzoo challenge on Kaggle
