Machine Learning for

Time Series Analysis XII

Neural Networks: CNN

Fall 2022 - UDel PHYS 667
dr. federica bianco 

 

@fedhere

this slide deck:

 
  • convolutional NN
  • preprocessing and whitening (minibatch)

 

neural networks

recap

 

Perceptrons are linear classifiers: a perceptron makes its predictions based on a linear predictor function,

combining a set of weights (= parameters) with the feature vector.

y = \sum_i w_i x_i + b
y = wx + b
y = f(\sum_i w_i x_i + b)
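A minimal sketch of the perceptron formula above (assumptions: numpy and a sigmoid as the example activation; neither is prescribed by the slides):

import numpy as np

# perceptron: weighted sum of the inputs plus a bias, passed through an activation f
def perceptron(x, w, b, f=lambda z: 1 / (1 + np.exp(-z))):   # sigmoid as example f
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])     # feature vector (example values)
w = np.array([0.1, 0.4, -0.2])     # weights
print(perceptron(x, w, b=0.3))     # a single number between 0 and 1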

[Diagram: a single perceptron — inputs x_1 … x_N, weights w_1 … w_N, bias b, activation function f, output]

recap

 

3

perceptrons

multilayer perceptron

[Diagram: input layer (x_1, x_2, x_3) → hidden layer → output layer]

Fully connected: all nodes go to all nodes of the next layer.

1970: multilayer perceptron architecture

recap

 

3

multilayer perceptron

[Diagram: inputs x_1, x_2, x_3 feeding a layer of four perceptrons, then the output]

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons:

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1
w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2
w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3
w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4
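The whole layer can be written as one matrix multiplication; a minimal sketch (assumed, not from the slides; random example values):

import numpy as np

x = np.array([0.5, -1.0, 2.0])    # inputs x_1, x_2, x_3 (example values)
W = np.random.randn(4, 3)         # w_{ij}: 4 perceptrons x 3 inputs
b = np.random.randn(4)            # one bias per perceptron

z = W @ x + b                     # the four linear combinations listed above
y = 1 / (1 + np.exp(-z))          # e.g. a sigmoid activation on each perceptron
print(y)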

activation functions

 back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}

[Diagram: a single perceptron (shallow NN) — inputs x_1 … x_N, weights w_1 … w_N, bias b, activation f, output]

perceptron or

shallow NN

\vec{y} = f(\vec{x}W + b)

[Diagram: deep NN — input layer, hidden layer, output layer]

a deep NN composes layers, each with its own weights W_i and biases b_i:

\vec{y} = f_N(\dots f_1(\vec{x}W_1 + b_1)\dots W_N + b_N)

written out explicitly for a network with 2 inputs, 3 hidden sigmoid neurons, and 1 sigmoid output:

y = \frac{1}{1+e^{-\frac{w_7}{1+e^{-w_1x_1 - w_4x_2 - b_1}} - \frac{w_8}{1+e^{-w_2x_1 - w_5x_2 - b_2}} - \frac{w_9}{1+e^{-w_3x_1 - w_6x_2 - b_3}} - b_4}}

Training models with this many parameters requires a lot of care:

- defining the metric

- choosing an optimization scheme

- splitting the data into training/validation/testing sets

But just as in our simple linear regression case, small changes in the parameters lead to small changes in the output, provided we choose the right activation functions.

Training a DNN

feed the data forward through the network and calculate the cost metric

for each layer, calculate the effect of small changes on the next layer

define a cost function, e.g.

C = \frac{1}{2}\|y - a^L\|^2 = \frac{1}{2}\sum_j (y_j - a^L_j)^2

\vec{y} = f_N(\dots f_1(\vec{x}W_1 + b_1)\dots W_N + b_N)

 back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

think of applying gradient descent to a function of a function of a function... use:

1) partial derivatives, 2) the chain rule

define a cost function, e.g.

C = \frac{1}{2}\|y - a^L\|^2 = \frac{1}{2}\sum_j (y_j - a^L_j)^2

Training a DNN
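A minimal sketch (assumed, not from the slides) of back-propagation for the 2-3-1 sigmoid network written out above, with the quadratic cost C = 1/2 |y - a|^2 and plain gradient descent:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2])            # toy input (example values)
y = np.array([1.0])                  # toy target

# weights and biases: layer 1 (2 -> 3), layer 2 (3 -> 1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

eta = 1.0                            # learning rate (arbitrary choice)
for step in range(1000):
    # forward pass
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    # backward pass: chain rule, layer by layer
    delta2 = (a2 - y) * a2 * (1 - a2)          # dC/dz2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dC/dz1
    # gradient descent update
    W2 -= eta * np.outer(delta2, a1); b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1

print(a2)   # the output approaches the target y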

a new frontier: NNs for irregularly sampled time series

0

CNN

1

Convolutional Neural Nets

Seminal paper 

Y. LeCun 1998

@akumadog

Brain Programming and the Random Search in Object Categorization

 

The visual cortex learns hierarchically: first detects simple features, then more complex features and ensembles of features

CNN

CNN

1a

Convolution

Convolution

convolution is a mathematical operation on two functions

f and g

that produces a third function

f * g

expressing how the shape of one is modified by the other.


Convolution Theorem

f * g= \mathcal{F}^{-1}\big\{\mathcal{F}\{f\}\cdot\mathcal{F}\{g\}\big\}
\mathcal{F}

fourier transform

F(\nu) = \int_{\mathbb{R}^n} f(x)\, e^{-2\pi i x \cdot \nu}\, dx, \qquad G(\nu) = \int_{\mathbb{R}^n} g(x)\, e^{-2\pi i x \cdot \nu}\, dx
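A minimal numerical check of the convolution theorem (a sketch, assuming numpy; the two signals are arbitrary examples): convolving directly gives the same result as multiplying the Fourier transforms and transforming back.

import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.0, 1.0, 0.5, 0.25])

direct = np.convolve(f, g, mode="full")                      # direct convolution
n = len(f) + len(g) - 1                                      # pad to avoid circular wrap-around
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(direct, via_fft))                          # True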

In a CNN the two functions are two images: the input image and a small "feature" (filter) that is slid across it.

input image (5x5, an X drawn with 1s on a background of -1s):

-1 -1 -1 -1 -1
-1  1 -1  1 -1
-1 -1  1 -1 -1
-1  1 -1  1 -1
-1 -1 -1 -1 -1

features (3x3 filters), e.g. the two diagonals:

 1 -1 -1       -1 -1  1
-1  1 -1       -1  1 -1
-1 -1  1        1 -1 -1

convolving the image with each feature produces the feature maps.

convolution: slide the 3x3 feature across the image; at each position multiply the overlapping values element-wise and sum.

first (top-left) position, image patch times the diagonal feature:

(-1*1) + (-1*-1) + (-1*-1) +
(-1*-1) + (1*1) + (-1*-1) +
(-1*-1) + (-1*-1) + (1*1) = 7

so the first entry of the feature map is 7.

second position (the feature shifted one column to the right):

(-1*1) + (-1*-1) + (-1*-1) +
(-1*1) + (-1*1) + (-1*1) +
(-1*-1) + (-1*1) + (-1*1) = -3

sliding the feature across all nine positions fills in the full 3x3 feature map.

input layer → convolution layer → feature map:

 7 -3  3
-3  5 -3
 3 -1  7

the feature map is "richer": we went from binary to R,

and it is reminiscent of the original layer: the diagonal of large values (7, 5, 7) traces the feature that was detected.

Convolve with different features: each neuron of the convolution layer corresponds to one feature.
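A minimal numpy sketch of the sliding-window operation above (a sketch, not the slides' code), using the 5x5 image and the 3x3 diagonal feature; it reproduces the feature map:

import numpy as np

image = np.array([[-1, -1, -1, -1, -1],
                  [-1,  1, -1,  1, -1],
                  [-1, -1,  1, -1, -1],
                  [-1,  1, -1,  1, -1],
                  [-1, -1, -1, -1, -1]])

feature = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

fmap = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        # element-wise product of the 3x3 patch with the feature, then sum
        fmap[i, j] = np.sum(image[i:i + 3, j:j + 3] * feature)

print(fmap)
# [[ 7. -3.  3.]
#  [-3.  5. -3.]
#  [ 3. -1.  7.]]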

CNN

1b

ReLU

ReLU (rectified linear unit): replaces negative values with 0s.

before ReLU:      after ReLU:
 7 -3  3           7  0  3
-3  5 -3           0  5  0
 3 -1  7           3  0  7

1c

Max-Pool

CNN

MaxPooling: reduces image size and generalizes the result.

2x2 max pool (stride 1) applied to the rectified feature map:

7 0 3
0 5 0      →      7 5
3 0 7             5 7

MaxPooling: reduces image size & generalizes the result

 

 

By reducing the size and picking the maximum of a sub-region we make the network less sensitive to specific details
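A minimal sketch (assumed, numpy) of ReLU followed by the 2x2 max pool (stride 1) on the feature map from the previous slides:

import numpy as np

fmap = np.array([[ 7, -3,  3],
                 [-3,  5, -3],
                 [ 3, -1,  7]])

relu = np.maximum(fmap, 0)                            # negatives -> 0

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        pooled[i, j] = relu[i:i + 2, j:j + 2].max()   # max of each 2x2 window

print(pooled)
# [[7. 5.]
#  [5. 7.]]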

CNN

final layer:

the final layer is fully connected

[Diagram: last hidden layer fully connected to the output layer, one output per class (x, O)]

Stack multiple convolution layers

CNN for time series analysis
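A minimal sketch of a 1D CNN for fixed-length, regularly sampled time series (assumptions: Keras/TensorFlow as the framework, 100 time steps, 1 channel, 2 classes; none of these choices come from the slides): stacked convolution + ReLU + max-pool layers, then a fully connected output layer.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv1D(16, kernel_size=5, activation="relu",
                  input_shape=(100, 1)),            # 100 time steps, 1 channel (assumed)
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),          # e.g. 2 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()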

overfitting

3

Minibatch

&

Dropout

  • If one updates the model parameters only after processing the whole training set (i.e., once per epoch), it takes too long to get a model update, and the entire training set probably won't fit in memory.
  • If one updates the model parameters after every instance (i.e., stochastic gradient descent), the updates are too noisy and the process is not computationally efficient.
  • Minibatch gradient descent is therefore introduced as a trade-off between fast, memory-efficient updates and accurate, computationally efficient updates.

Split your training set into many smaller subsets and train on each small set separately
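A minimal sketch (assumed, numpy; not from the slides) of splitting the training set into minibatches: shuffle the indices each epoch and update the model once per small batch.

import numpy as np

def minibatches(X, y, batch_size=32, rng=np.random.default_rng()):
    idx = rng.permutation(len(X))                 # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# usage: one gradient-descent update per minibatch instead of per epoch or per instance
# for X_batch, y_batch in minibatches(X_train, y_train, batch_size=64):
#     update_parameters(X_batch, y_batch)         # hypothetical update step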

overfitting

overfitting

Dropout

Artificially remove some neurons for different minibatches to avoid overfitting
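A minimal sketch (assumed, numpy) of inverted dropout: during training, zero out each neuron's activation with probability p and rescale the survivors so the expected output is unchanged.

import numpy as np

def dropout(activations, p=0.5, rng=np.random.default_rng()):
    mask = (rng.random(activations.shape) > p) / (1.0 - p)
    return activations * mask                    # dropped neurons output 0 for this minibatch

a = np.array([0.2, 1.5, 0.7, 0.9])               # example activations
print(dropout(a, p=0.5))                         # roughly half the entries set to 0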


EITHER:

 

we take a look at DeepDream

 

OR

 

we take a look at a convolutional NN code

deep dreams

deep dreams

what is happening in DeepDream?

Deep Dream (DD) is a Google software package, a pre-trained NN (originally created in the Caffe architecture, now ported to many other platforms including TensorFlow).

 

The high-level idea relies on training a convolutional NN to recognize common objects, e.g. dogs, cats, cars, in images. As the network learns to recognize those objects it develops layers that pick out "features", like lines at certain orientations, circles, etc.

 

The DD software runs this NN on an image you give it, and it loops on some layers, thus "manifesting" the things it knows how to recognize in the image.

 

 

CNN

Object detection

Naive model: we took different regions of the image and measured the probability of the presence of the object in each region

Object detection

Problem: we had to search the whole image, which is time consuming, and we could only find one kind of object at one scale

 CNN 

YOLO and R-CNN

Object detection

What if you do not know what is in the image?

The final dense layer has an undefined size (one output per kind of object in the region)

 

Objects can have different scales or axis ratios: how many regions can you search before the problem blows up computationally?

R-CNN

Girshick et al. 2013

Extract ~2000 regions from the image ("region proposals")

A feature-extraction CNN produces a 4096-dimensional feature vector in an output dense layer

An SVM classifies the presence of the object within each candidate region proposal

 
1. Generate an initial sub-segmentation: we generate many candidate regions
2. Use a greedy algorithm to recursively combine similar regions into larger ones
3. Use the generated regions to produce the final candidate region proposals

TOO SLOW (47 sec to test 1 image)

Fast R-CNN

Girshick et al. 2015

Use a CNN to generate convolutional feature maps

Use the Selective Search algorithm to identify the region proposals (RPs) and warp them into squares

Use an RoI pooling layer to reshape them to a fixed size so that they can be fed into a fully connected layer, which predicts the box offset

A softmax layer predicts the class of the proposed region

 


Faster R-CNN

Ren et al. 2015

Use a CNN to generate convolutional feature maps

Use a CNN to predict the region proposals (RPs) and warp them into squares

Use an RoI pooling layer to reshape them to a fixed size so that they can be fed into a fully connected layer, which predicts the box offset

A softmax layer predicts the class of the proposed region

 


Yolo

Redmon et al 2016

What if you looked at the whole image instead of RoIs in the image??

 

Split an image into a SxS grid

For each grid cell, predict B bounding boxes, confidence scores for those boxes, and C class probabilities.

The CNN outputs the probability that each bounding box contains an object (+ its offset)

High-probability bounding boxes are classified


 

https://arxiv.org/abs/1902.01466

CNNs for image sequences

 

Recurrent CNN 

 

key concepts

 

 

Architecture components: neurons, activation function

  • basically each neuron is a multivariate regression with an activation function that turns the output into a probability
  • changing the weights and biases in the linear regression gives different results

Single layer NN: perceptrons

  • perceptrons were developed in the 1950s, but a long time passed before people figured out how to build complex layered architectures and, especially, how to train them

Deep NN:

  • DNN are multi-layer architectures of neurons. They can be fully connected (each neuron goes to each neuron of the next layer) or not (a neuron goes only to some neurons in the next layer)
  • DNN have a lot of parameters (thousands!) which makes the interpretability and feature extraction of NN difficult.

 

key concepts

 

 

Convolutional NN

  • convolutional NN are DNN with three types of layers: 
    • convolutional layers: run filters through an image to detect features like edges or colors
    • maxpool layers: decrease the size of the previous layer's outputs and remove some details 
    • ReLU (rectified linear units): rectifies the output of conv layers so that it is all positive (sets negatives to 0)
  • CNNs are great for the study of structure in large datasets (images are large datasets)

Training an NN:

  • most ML methods are trained by gradient descent: change the weights and biases based on the derivative of the loss (or cost) function 
  • DNN are difficult to train because of the layered structure
  • backpropagation propagates changes of the weights through the entire NN
  • Minibatch: split the training set into many (100s!) subsets and use these to train the NN
  • Dropout: set some neurons to zero to avoid overfitting

recap

My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements.

the demarcation problem:

what is science? what is not?

the demarcation problem

in Bayesian context

p(M | D) = \frac{P(M)\, P(D | M)}{P(D)}

The probability that a belief is true given new evidence equals the probability that the belief is true regardless of that evidence, times the probability that the evidence is true given that the belief is true, divided by the probability that the evidence is true regardless of whether the belief is true.

Principle of Parsimony

Between two models with the same explanatory power choose the one with fewer parameters

Likelihood Ratio Test | AIC | BIC | Kullback-Leibler Divergence

Reproducible research in practice:

 

 

 

using the code and raw data provided by the analyst.

Reproducible research means:

 

the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.

Reproducibility

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!)

TAXONOMY

Distribution: a formula (a model)

Population: all of the elements of a "family"

Sample: a finite subset of the population that you observe

Descriptive Statistics

TAXONOMY

central tendency: mean, median, mode

spread: variance, interquartile range

distributions

N (r | \mu, \sigma) \sim \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(r - \mu)^2}{2\sigma^2}}

parameters (-0.1, 0.9)

support

P(k | \lambda) \sim \frac{\lambda^k e^{-\lambda}}{k!}

normal or Gaussian: continuous support, x \in (-\infty, +\infty)

Poisson: discrete support, k \in \{0, 1, 2, \dots\}

parameters (\lambda = 1)

m_n = \int_{-\infty}^{+\infty} (x - c)^n f(x)\, dx

Moments and frequentist probability

a distribution’s moments summarize its properties:

 

 

 

 

central tendency: mean (n=1), median, mode

spread: standard deviation/variance (n=2), interquartile range

symmetry: skewness (n=3)

cuspiness: kurtosis (n=4)

Monte Carlo.

and

MCMC

Why am I bothering with areas? - Expectation values are related to areas

The ratio of the area of a circle inscribed in a square to the area of the square is π / 4.

Calculate Pi

https://www.jstor.org/stable/2686489?seq=1

https://github.com/fedhere/DSPS_FBianco/tree/master/montecarlo
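A minimal sketch (assumed, numpy) of the Monte Carlo estimate of π described above: the fraction of uniform random points that land inside the inscribed circle approximates π / 4.

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
x, y = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)   # points in the square [-1, 1]^2
inside = (x**2 + y**2) <= 1                           # points inside the inscribed circle
print(4 * inside.mean())                              # ~3.14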

 

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0, b0)

 WHILE the convergence criterion is not met:

       calculate the current posterior pcurr = P(D|θcurr, f)

       //proposal

       choose a new set of parameters new = θnew = (mnew, bnew)

       calculate the proposed posterior pnew = P(D|θnew, f)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF pnew/pcurr > r:
                          current = new

               ELSE:

                          pass // do nothing

 

[Trace plot: feature value vs. step]

 

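A minimal Python sketch (assumed, not from the slides) of the Metropolis step in the pseudocode above, for a generic log-posterior function log_post(θ); the Gaussian proposal and its width are arbitrary choices here.

import numpy as np

def metropolis(log_post, theta0, nsteps=10000, step_size=0.1,
               rng=np.random.default_rng()):
    chain = [np.asarray(theta0, dtype=float)]
    lp_curr = log_post(chain[-1])
    for _ in range(nsteps):
        proposal = chain[-1] + step_size * rng.normal(size=len(chain[-1]))
        lp_new = log_post(proposal)
        # accept with probability min(1, p_new / p_curr), working in log space
        if np.log(rng.random()) < lp_new - lp_curr:
            chain.append(proposal)
            lp_curr = lp_new
        else:
            chain.append(chain[-1].copy())
    return np.array(chain)

# usage, e.g. for a 2D standard normal "posterior":
# chain = metropolis(lambda t: -0.5 * np.sum(t**2), theta0=[0.0, 0.0])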

 

Examples of how to choose the next point

affine invariant: the emcee package

ML

what is machine learning?

classification

prediction

feature selection

supervised learning

understanding structure

organizing/compressing data

anomaly detection dimensionality reduction

unsupervised learning

clustering

PCA

Apriori

k-Nearest Neighbors

Regression

Support Vector Machines

Classification/Regression Trees

Neural networks

 

Data    Preparation

Generic preprocessing

for each feature: subtract the mean and divide by the standard deviation

the mean of each feature should be 0, the standard deviation of each feature should be 1

change categorical to (integer) numerical — numerical encoding:

species  age  weight
1        7    32.3
2        1    0.3
3        3    8.1

one-hot encoding: change each category to a binary column:

cat  bird  dog  age  weight
0    0     1    7    32.3
0    1     0    1    0.3
1    0     0    3    8.1
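A minimal sketch of the two preprocessing steps above (assumptions: pandas and scikit-learn, which the slides do not prescribe): one-hot encode the categorical column and standardize the numerical ones.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"species": ["dog", "bird", "cat"],
                   "age":     [7, 1, 3],
                   "weight":  [32.3, 0.3, 8.1]})

onehot = pd.get_dummies(df, columns=["species"])       # species -> binary columns
onehot[["age", "weight"]] = StandardScaler().fit_transform(onehot[["age", "weight"]])
print(onehot)   # age and weight now have mean ~0 and std ~1; species_* columns are binary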

ML model performance

LR = _____________________________

 

True Negative

False Negative

                      H0 is True                H0 is False
H0 is falsified       Type I Error              True Positive
                      (False Positive)
H0 is not falsified   True Negative             Type II Error
                                                (False Negative)

Accuracy, Recall, Precision

Receiver operating characteristic

 

[ROC plot with regions marked GOOD and BAD]

what is the simplest classifier you can build for this dataset?

what is the accuracy?

[Scatter plot of the dataset: feature x vs. feature y]

Class Imbalance

If your dataset is imbalanced (more of one class than the other)

your model will learn that it is better to guess the most common class

this will contaminate the prediction
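A minimal sketch of this effect (assumptions: scikit-learn metrics and a toy 95/5 split, not from the slides): always guessing the most common class scores high accuracy while recalling none of the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)     # imbalanced labels: 95% class 0, 5% class 1
y_pred = np.zeros_like(y_true)            # "simplest classifier": always predict class 0

print(accuracy_score(y_true, y_pred))     # 0.95 -- looks great
print(recall_score(y_true, y_pred))       # 0.0  -- the minority class is never found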

ML

tasks

Partition (unsupervised)

Classification (supervised)

Regression (supervised)

A Data-Driven Evaluation of Delays in Criminal Prosecution

feature importance:

how soon was a feature chosen,

how many times was it used...

https://explained.ai/rf-importance/

RF

 

GBT

 

ML

models

It can be shown that the optimal parameters for a line fit to data without uncertainties are:

(X^T \cdot X)^{-1} \cdot X^T \cdot \vec{y} ~=~ \left(\substack{a\\b}\right)

dimensions: (2xN · Nx2)^{-1} · (2xN) · (Nx1) = 2x1

We can let sklearn solve the equation for us:

from sklearn.linear_model import LinearRegression
import numpy as np

lr = LinearRegression()
X = np.c_[np.ones(len(x)), x]    # design matrix: column of ones (intercept) + x
lr.fit(X, y)                     # x, y as generated in the next code block
lr.coef_, lr.intercept_

Linear Regression

Normal Equation

import numpy as np

# example values for the synthetic data (assumed; not specified on the slide)
N = 50
m_true, b_true, f_true = 2.0, 1.0, 0.1

x = np.sort(10 * np.random.rand(N))
y = x * m_true + b_true
yerr = 0.1 + 0.5 * np.random.rand(N)
y += np.abs(f_true * y) * np.random.randn(N) + yerr * np.random.randn(N)

# normal equation: theta = (X^T X)^{-1} X^T y
X = np.c_[np.ones(len(x)), x]
theta_best = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

resources

 

Neural Networks and Deep Learning, an excellent and free book on NN and DL: http://neuralnetworksanddeeplearning.com/index.html

History of NN https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history2.html

Raissi et al. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. arXiv 1711.10561

Raissi et al. Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations. arXiv 1711.10566

Raissi et al. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comp. Phys. 378 pp. 686-707 DOI: 10.1016/j.jcp.2018.10.045

 

resources

 
