Machine Learning for

Time Series Analysis XII

Neural Networks: CNN

Fall 2022 - UDel PHYS 667
dr. federica bianco 

 

@fedhere

this slide deck:

 
  • convolutional NN
  • preprocessing and whitening (minibatch)

 

neural networks

recap

 

Perceptrons are linear classifiers: a perceptron makes its predictions based on a linear predictor function,

combining a set of weights (= parameters) with the feature vector.

y = \sum_i w_i x_i + b
y = wx + b
y = f(\sum_i w_i x_i + b)
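A minimal sketch of the perceptron formula above (assumptions: numpy and a sigmoid as the example activation; neither is prescribed by the slides):

import numpy as np

# perceptron: weighted sum of the inputs plus a bias, passed through an activation f
def perceptron(x, w, b, f=lambda z: 1 / (1 + np.exp(-z))):   # sigmoid as example f
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])     # feature vector (example values)
w = np.array([0.1, 0.4, -0.2])     # weights
print(perceptron(x, w, b=0.3))     # a single number between 0 and 1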

[Diagram: a single perceptron — inputs x_1 … x_N, weights w_1 … w_N, bias b, activation function f, output]

recap

 

3

perceptrons

multilayer perceptron

[Diagram: input layer (x_1, x_2, x_3) → hidden layer → output layer]

Fully connected: all nodes go to all nodes of the next layer.

1970: multilayer perceptron architecture

recap

 

3

multilayer perceptron

[Diagram: inputs x_1, x_2, x_3 feeding a layer of four perceptrons, then the output]

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons:

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1
w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2
w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3
w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4
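The whole layer can be written as one matrix multiplication; a minimal sketch (assumed, not from the slides; random example values):

import numpy as np

x = np.array([0.5, -1.0, 2.0])    # inputs x_1, x_2, x_3 (example values)
W = np.random.randn(4, 3)         # w_{ij}: 4 perceptrons x 3 inputs
b = np.random.randn(4)            # one bias per perceptron

z = W @ x + b                     # the four linear combinations listed above
y = 1 / (1 + np.exp(-z))          # e.g. a sigmoid activation on each perceptron
print(y)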

activation functions

 back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}

[Diagram: a single perceptron (shallow NN) — inputs x_1 … x_N, weights w_1 … w_N, bias b, activation f, output]

perceptron or

shallow NN

\vec{y} = f(\vec{x}W + b)

[Diagram: deep NN — input layer, hidden layer, output layer]

a deep NN composes layers, each with its own weights W_i and biases b_i:

\vec{y} = f_N(\dots f_1(\vec{x}W_1 + b_1)\dots W_N + b_N)

written out explicitly for a network with 2 inputs, 3 hidden sigmoid neurons, and 1 sigmoid output:

y = \frac{1}{1+e^{-\frac{w_7}{1+e^{-w_1x_1 - w_4x_2 - b_1}} - \frac{w_8}{1+e^{-w_2x_1 - w_5x_2 - b_2}} - \frac{w_9}{1+e^{-w_3x_1 - w_6x_2 - b_3}} - b_4}}

Training models with this many parameters requires a lot of care:

- defining the metric

- choosing an optimization scheme

- splitting the data into training/validation/testing sets

But just as in our simple linear regression case, small changes in the parameters lead to small changes in the output, provided we choose the right activation functions.

Training a DNN

feed the data forward through the network and calculate the cost metric

for each layer, calculate the effect of small changes on the next layer

define a cost function, e.g.

C = \frac{1}{2}\|y - a^L\|^2 = \frac{1}{2}\sum_j (y_j - a^L_j)^2

\vec{y} = f_N(\dots f_1(\vec{x}W_1 + b_1)\dots W_N + b_N)

 back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

think of applying gradient descent to a function of a function of a function... use:

1) partial derivatives, 2) the chain rule

define a cost function, e.g.

C = \frac{1}{2}\|y - a^L\|^2 = \frac{1}{2}\sum_j (y_j - a^L_j)^2

Training a DNN
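A minimal sketch (assumed, not from the slides) of back-propagation for the 2-3-1 sigmoid network written out above, with the quadratic cost C = 1/2 |y - a|^2 and plain gradient descent:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2])            # toy input (example values)
y = np.array([1.0])                  # toy target

# weights and biases: layer 1 (2 -> 3), layer 2 (3 -> 1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

eta = 1.0                            # learning rate (arbitrary choice)
for step in range(1000):
    # forward pass
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    # backward pass: chain rule, layer by layer
    delta2 = (a2 - y) * a2 * (1 - a2)          # dC/dz2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dC/dz1
    # gradient descent update
    W2 -= eta * np.outer(delta2, a1); b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1

print(a2)   # the output approaches the target y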

a new frontier: NNs for irregularly sampled time series

0

CNN

1

Convolutional Neural Nets

Seminal paper 

Y. LeCun 1998

@akumadog

Brain Programming and the Random Search in Object Categorization

 

The visual cortex learns hierarchically: first detects simple features, then more complex features and ensembles of features

CNN

CNN

1a

Convolution

Convolution

convolution is a mathematical operation on two functions

f and g

that produces a third function

f * g

expressing how the shape of one is modified by the other.


Convolution Theorem

f * g= \mathcal{F}^{-1}\big\{\mathcal{F}\{f\}\cdot\mathcal{F}\{g\}\big\}
\mathcal{F}

fourier transform

F(\nu) = \int_{\mathbb{R}^n} f(x)\, e^{-2\pi i x \cdot \nu}\, dx, \qquad G(\nu) = \int_{\mathbb{R}^n} g(x)\, e^{-2\pi i x \cdot \nu}\, dx
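A minimal numerical check of the convolution theorem (a sketch, assuming numpy; the two signals are arbitrary examples): convolving directly gives the same result as multiplying the Fourier transforms and transforming back.

import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.0, 1.0, 0.5, 0.25])

direct = np.convolve(f, g, mode="full")                      # direct convolution
n = len(f) + len(g) - 1                                      # pad to avoid circular wrap-around
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(direct, via_fft))                          # True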

In a CNN the two functions are two images: the input image and a small "feature" (filter) that is slid across it.

input image (5x5, an X drawn with 1s on a background of -1s):

-1 -1 -1 -1 -1
-1  1 -1  1 -1
-1 -1  1 -1 -1
-1  1 -1  1 -1
-1 -1 -1 -1 -1

features (3x3 filters), e.g. the two diagonals:

 1 -1 -1       -1 -1  1
-1  1 -1       -1  1 -1
-1 -1  1        1 -1 -1

convolving the image with each feature produces the feature maps.

convolution: slide the 3x3 feature across the image; at each position multiply the overlapping values element-wise and sum.

first (top-left) position, image patch times the diagonal feature:

(-1*1) + (-1*-1) + (-1*-1) +
(-1*-1) + (1*1) + (-1*-1) +
(-1*-1) + (-1*-1) + (1*1) = 7

so the first entry of the feature map is 7.

second position (the feature shifted one column to the right):

(-1*1) + (-1*-1) + (-1*-1) +
(-1*1) + (-1*1) + (-1*1) +
(-1*-1) + (-1*1) + (-1*1) = -3

sliding the feature across all nine positions fills in the full 3x3 feature map.

input layer → convolution layer → feature map:

 7 -3  3
-3  5 -3
 3 -1  7

the feature map is "richer": we went from binary to R,

and it is reminiscent of the original layer: the diagonal of large values (7, 5, 7) traces the feature that was detected.

Convolve with different features: each neuron of the convolution layer corresponds to one feature.
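A minimal numpy sketch of the sliding-window operation above (a sketch, not the slides' code), using the 5x5 image and the 3x3 diagonal feature; it reproduces the feature map:

import numpy as np

image = np.array([[-1, -1, -1, -1, -1],
                  [-1,  1, -1,  1, -1],
                  [-1, -1,  1, -1, -1],
                  [-1,  1, -1,  1, -1],
                  [-1, -1, -1, -1, -1]])

feature = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

fmap = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        # element-wise product of the 3x3 patch with the feature, then sum
        fmap[i, j] = np.sum(image[i:i + 3, j:j + 3] * feature)

print(fmap)
# [[ 7. -3.  3.]
#  [-3.  5. -3.]
#  [ 3. -1.  7.]]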

CNN

1b

ReLU

ReLU (rectified linear unit): replaces negative values with 0s.

before ReLU:      after ReLU:
 7 -3  3           7  0  3
-3  5 -3           0  5  0
 3 -1  7           3  0  7

1c

Max-Pool

CNN

MaxPooling: reduces image size and generalizes the result.

2x2 max pool (stride 1) applied to the rectified feature map:

7 0 3
0 5 0      →      7 5
3 0 7             5 7

MaxPooling: reduces image size & generalizes the result

 

 

By reducing the size and picking the maximum of a sub-region we make the network less sensitive to specific details
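A minimal sketch (assumed, numpy) of ReLU followed by the 2x2 max pool (stride 1) on the feature map from the previous slides:

import numpy as np

fmap = np.array([[ 7, -3,  3],
                 [-3,  5, -3],
                 [ 3, -1,  7]])

relu = np.maximum(fmap, 0)                            # negatives -> 0

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        pooled[i, j] = relu[i:i + 2, j:j + 2].max()   # max of each 2x2 window

print(pooled)
# [[7. 5.]
#  [5. 7.]]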

CNN

final layer:

the final layer is fully connected

[Diagram: last hidden layer fully connected to the output layer, one output per class (x, O)]

Stack multiple convolution layers

CNN for time series analysis
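A minimal sketch of a 1D CNN for fixed-length, regularly sampled time series (assumptions: Keras/TensorFlow as the framework, 100 time steps, 1 channel, 2 classes; none of these choices come from the slides): stacked convolution + ReLU + max-pool layers, then a fully connected output layer.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv1D(16, kernel_size=5, activation="relu",
                  input_shape=(100, 1)),            # 100 time steps, 1 channel (assumed)
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),          # e.g. 2 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()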

overfitting

3

Minibatch

&

Dropout

  • If one updates the model parameters only after processing the whole training set (i.e., once per epoch), it takes too long to get a model update, and the entire training set probably won't fit in memory.
  • If one updates the model parameters after every instance (i.e., stochastic gradient descent), the updates are too noisy and the process is not computationally efficient.
  • Minibatch gradient descent is therefore introduced as a trade-off between fast, memory-efficient updates and accurate, computationally efficient updates.

Split your training set into many smaller subsets and train on each small set separately
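A minimal sketch (assumed, numpy; not from the slides) of splitting the training set into minibatches: shuffle the indices each epoch and update the model once per small batch.

import numpy as np

def minibatches(X, y, batch_size=32, rng=np.random.default_rng()):
    idx = rng.permutation(len(X))                 # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# usage: one gradient-descent update per minibatch instead of per epoch or per instance
# for X_batch, y_batch in minibatches(X_train, y_train, batch_size=64):
#     update_parameters(X_batch, y_batch)         # hypothetical update step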

overfitting

overfitting

Dropout

Artificially remove some neurons for different minibatches to avoid overfitting
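A minimal sketch (assumed, numpy) of inverted dropout: during training, zero out each neuron's activation with probability p and rescale the survivors so the expected output is unchanged.

import numpy as np

def dropout(activations, p=0.5, rng=np.random.default_rng()):
    mask = (rng.random(activations.shape) > p) / (1.0 - p)
    return activations * mask                    # dropped neurons output 0 for this minibatch

a = np.array([0.2, 1.5, 0.7, 0.9])               # example activations
print(dropout(a, p=0.5))                         # roughly half the entries set to 0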


EITHER:

 

we take a look at DeepDream

 

OR

 

we take a look at a convolutional NN code

deep dreams

deep dreams

what is happening in DeepDream?

Deep Dream (DD) is a Google software package, a pre-trained NN (originally created in the Caffe architecture, now ported to many other platforms including TensorFlow).

 

The high-level idea relies on training a convolutional NN to recognize common objects, e.g. dogs, cats, cars, in images. As the network learns to recognize those objects it develops layers that pick out "features", like lines at certain orientations, circles, etc.

 

The DD software runs this NN on an image you give it, and it loops on some layers, thus "manifesting" the things it knows how to recognize in the image.

 

 

CNN

Object detection

Naive model: we took different regions of the image and measured the probability of the presence of the object in each region

Object detection

Problem: we had to search the whole image, which is time consuming, and we could only find one kind of object at one scale

 CNN 

YOLO and R-CNN

Object detection

What if you do not know what is in the image?

The final dense layer has an undefined size (one output per kind of object in the region)

 

Objects can have different scales or axis ratios: how many regions can you search before the problem blows up computationally?

R-CNN

Girshick et al. 2013

Extract ~2000 regions from the image ("region proposals")

A feature-extraction CNN produces a 4096-dimensional feature vector in an output dense layer

An SVM classifies the presence of the object within each candidate region proposal

 
1. Generate an initial sub-segmentation: we generate many candidate regions
2. Use a greedy algorithm to recursively combine similar regions into larger ones
3. Use the generated regions to produce the final candidate region proposals

TOO SLOW (47 sec to test 1 image)

Fast R-CNN

Girshick et al. 2015

Use a CNN to generate convolutional feature maps

Use the Selective Search algorithm to identify the region proposals (RPs) and warp them into squares

Use an RoI pooling layer to reshape them to a fixed size so that they can be fed into a fully connected layer, which predicts the box offset

A softmax layer predicts the class of the proposed region

 


Faster R-CNN

Ren et al. 2015

Use a CNN to generate convolutional feature maps

Use a CNN to predict the region proposals (RPs) and warp them into squares

Use an RoI pooling layer to reshape them to a fixed size so that they can be fed into a fully connected layer, which predicts the box offset

A softmax layer predicts the class of the proposed region

 


Yolo

Redmon et al 2016

What if you looked at the whole image instead of RoIs in the image??

 

Split an image into a SxS grid

For each grid cell, predict B bounding boxes, confidence scores for those boxes, and C class probabilities.

The CNN outputs the probability that each bounding box contains an object (+ its offset)

High-probability bounding boxes are classified


 

https://arxiv.org/abs/1902.01466

CNNs for image sequences

 

Recurrent CNN 

 

key concepts

 

 

Architecture components: neurons, activation function

  • basically each neuron is a multivariate regression with an activation function that turns the output into a probability
  • changing the weights and biases in the linear regression gives different results

Single layer NN: perceptrons

  • perceptrons were developed in the 1950s, but a long time passed before people figured out how to build complex layered architectures and, especially, how to train them

Deep NN:

  • DNN are multi-layer architectures of neurons. They can be fully connected (each neuron goes to each neuron of the next layer) or not (a neuron goes only to some neurons in the next layer)
  • DNN have a lot of parameters (thousands!) which makes the interpretability and feature extraction of NN difficult.

 

key concepts

 

 

Convolutional NN

  • convolutional NN are DNN with three types of layers: 
    • convolutional layers: run filters through an image to detect features like edges or colors
    • maxpool layers: decrease the size of the previous layer's outputs and remove some details 
    • ReLU (rectified linear units): rectifies the output of conv layers so that it is all positive (sets negatives to 0)
  • CNNs are great for the study of structure in large datasets (images are large datasets)

Training an NN:

  • most ML methods are trained by gradient descent: change the weights and biases based on the derivative of the loss (or cost) function 
  • DNN are difficult to train because of the layered structure
  • backpropagation propagates changes of the weights through the entire NN
  • Minibatch: split the training set into many (100s!) subsets and use these to train the NN
  • Dropout: set some neurons to zero to avoid overfitting

recap

My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements.

the demarcation problem:

what is science? what is not?

the demarcation problem

in Bayesian context

p(M | D) = \frac{P(M)\, P(D | M)}{P(D)}

The probability that a belief is true given new evidence equals the probability that the belief is true regardless of that evidence, times the probability that the evidence is true given that the belief is true, divided by the probability that the evidence is true regardless of whether the belief is true.

Principle of Parsimony

Between two models with the same explanatory power choose the one with fewer parameters

Likelihood Ratio Test | AIC | BIC | Kullback-Leibler Divergence

Reproducible research in practice:

 

 

 

using the code and raw data provided by the analyst.

Reproducible research means:

 

the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.

Reproducibility

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!)

TAXONOMY

Distribution: a formula (a model)

Population: all of the elements of a "family"

Sample: a finite subset of the population that you observe

Descriptive Statistics

TAXONOMY

central tendency: mean, median, mode

spread: variance, interquartile range

distributions

N (r | \mu, \sigma) \sim \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(r - \mu)^2}{2\sigma^2}}

parameters (-0.1, 0.9)

support

P(k | \lambda) \sim \frac{\lambda^k e^{-\lambda}}{k!}

normal or Gaussian: continuous support, x \in (-\infty, +\infty)

Poisson: discrete support, k \in \{0, 1, 2, \dots\}

parameters (\lambda = 1)

m_n = \int_{-\infty}^{+\infty} (x - c)^n f(x)\, dx

Moments and frequentist probability

a distribution’s moments summarize its properties:

 

 

 

 

central tendency: mean (n=1), median, mode

spread: standard deviation/variance (n=2), interquartile range

symmetry: skewness (n=3)

cuspiness: kurtosis (n=4)

Monte Carlo.

and

MCMC

Why am I bothering with areas? - Expectation values are related to areas

The ratio of the area of a circle inscribed in a square to the area of the square is π / 4.

Calculate Pi

https://www.jstor.org/stable/2686489?seq=1

https://github.com/fedhere/DSPS_FBianco/tree/master/montecarlo
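A minimal sketch (assumed, numpy) of the Monte Carlo estimate of π described above: the fraction of uniform random points that land inside the inscribed circle approximates π / 4.

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
x, y = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)   # points in the square [-1, 1]^2
inside = (x**2 + y**2) <= 1                           # points inside the inscribed circle
print(4 * inside.mean())                              # ~3.14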

 

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0, b0)

 WHILE the convergence criterion is not met:

       calculate the current posterior pcurr = P(D|θcurr, f)

       //proposal

       choose a new set of parameters new = θnew = (mnew, bnew)

       calculate the proposed posterior pnew = P(D|θnew, f)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF pnew/pcurr > r:
                          current = new

               ELSE:

                          pass // do nothing

 

[Trace plot: feature value vs. step]

 

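A minimal Python sketch (assumed, not from the slides) of the Metropolis step in the pseudocode above, for a generic log-posterior function log_post(θ); the Gaussian proposal and its width are arbitrary choices here.

import numpy as np

def metropolis(log_post, theta0, nsteps=10000, step_size=0.1,
               rng=np.random.default_rng()):
    chain = [np.asarray(theta0, dtype=float)]
    lp_curr = log_post(chain[-1])
    for _ in range(nsteps):
        proposal = chain[-1] + step_size * rng.normal(size=len(chain[-1]))
        lp_new = log_post(proposal)
        # accept with probability min(1, p_new / p_curr), working in log space
        if np.log(rng.random()) < lp_new - lp_curr:
            chain.append(proposal)
            lp_curr = lp_new
        else:
            chain.append(chain[-1].copy())
    return np.array(chain)

# usage, e.g. for a 2D standard normal "posterior":
# chain = metropolis(lambda t: -0.5 * np.sum(t**2), theta0=[0.0, 0.0])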

 

Examples of how to choose the next point

affine invariant: the emcee package

ML

what is machine learning?

classification

prediction

feature selection

supervised learning

understanding structure

organizing/compressing data

anomaly detection dimensionality reduction

unsupervised learning

clustering

PCA

Apriori

k-Nearest Neighbors

Regression

Support Vector Machines

Classification/Regression Trees

Neural networks

 

Data    Preparation

Generic preprocessing

for each feature: subtract the mean and divide by the standard deviation

the mean of each feature should be 0, the standard deviation of each feature should be 1

change categorical to (integer) numerical — numerical encoding:

species  age  weight
1        7    32.3
2        1    0.3
3        3    8.1

one-hot encoding: change each category to a binary column:

cat  bird  dog  age  weight
0    0     1    7    32.3
0    1     0    1    0.3
1    0     0    3    8.1
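A minimal sketch of the two preprocessing steps above (assumptions: pandas and scikit-learn, which the slides do not prescribe): one-hot encode the categorical column and standardize the numerical ones.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"species": ["dog", "bird", "cat"],
                   "age":     [7, 1, 3],
                   "weight":  [32.3, 0.3, 8.1]})

onehot = pd.get_dummies(df, columns=["species"])       # species -> binary columns
onehot[["age", "weight"]] = StandardScaler().fit_transform(onehot[["age", "weight"]])
print(onehot)   # age and weight now have mean ~0 and std ~1; species_* columns are binary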

ML model performance

LR = _____________________________

 

True Negative

False Negative

                      H0 is True                H0 is False
H0 is falsified       Type I Error              True Positive
                      (False Positive)
H0 is not falsified   True Negative             Type II Error
                                                (False Negative)

Accuracy, Recall, Precision

Receiver operating characteristic

 

[ROC plot with regions marked GOOD and BAD]

what is the simplest classifier you can build for this dataset?

what is the accuracy?

[Scatter plot of the dataset: feature x vs. feature y]

Class Imbalance

If your dataset is imbalanced (more of one class than the other)

your model will learn that it is better to guess the most common class

this will contaminate the prediction
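A minimal sketch of this effect (assumptions: scikit-learn metrics and a toy 95/5 split, not from the slides): always guessing the most common class scores high accuracy while recalling none of the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)     # imbalanced labels: 95% class 0, 5% class 1
y_pred = np.zeros_like(y_true)            # "simplest classifier": always predict class 0

print(accuracy_score(y_true, y_pred))     # 0.95 -- looks great
print(recall_score(y_true, y_pred))       # 0.0  -- the minority class is never found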

ML

tasks

Partition (unsupervised)

Classification (supervised)

Regression (supervised)

A Data-Driven Evaluation of Delays in Criminal Prosecution

feature importance:

how soon was a feature chosen,

how many times was it used...

https://explained.ai/rf-importance/

RF

 

GBT

 

ML

models

It can be shown that the optimal parameters for a line fit to data without uncertainties are:

(X^T \cdot X)^{-1} \cdot X^T \cdot \vec{y} ~=~ \left(\substack{a\\b}\right)

dimensions: (2xN · Nx2)^{-1} · (2xN) · (Nx1) = 2x1

We can let sklearn solve the equation for us:

from sklearn.linear_model import LinearRegression
import numpy as np

lr = LinearRegression()
X = np.c_[np.ones(len(x)), x]    # design matrix: column of ones (intercept) + x
lr.fit(X, y)                     # x, y as generated in the next code block
lr.coef_, lr.intercept_

Linear Regression

Normal Equation

import numpy as np

# example values for the synthetic data (assumed; not specified on the slide)
N = 50
m_true, b_true, f_true = 2.0, 1.0, 0.1

x = np.sort(10 * np.random.rand(N))
y = x * m_true + b_true
yerr = 0.1 + 0.5 * np.random.rand(N)
y += np.abs(f_true * y) * np.random.randn(N) + yerr * np.random.randn(N)

# normal equation: theta = (X^T X)^{-1} X^T y
X = np.c_[np.ones(len(x)), x]
theta_best = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

resources

 

Neural Networks and Deep Learning, an excellent and free book on NN and DL: http://neuralnetworksanddeeplearning.com/index.html

History of NN https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history2.html

Raissi et al. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. arXiv 1711.10561

Raissi et al. Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations. arXiv 1711.10566

Raissi et al. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comp. Phys. 378 pp. 686-707 DOI: 10.1016/j.jcp.2018.10.045

 

resources

 
