Machine Learning for

Time Series Analysis X

Neural Networks: RNNs, LSTM

Fall 2022 - UDel PHYS 667
dr. federica bianco

@fedhere

fbianco@udel.edu

this slide deck:

https://slides.com/federicabianco/mltsa22_10

Deep Learning

1

MLTSA:

what we are doing, except for the activation function,

is exactly a series of matrix multiplications.

matrix shapes: 3x5, 5x2, 2x1

f^{(3)}(f^{(2)}(f^{(1)}(\vec{x} \cdot W_1 + \vec{b_1}) \cdot W_2 + \vec{b_2}) \cdot W_3 + \vec{b_3})~=~y


DeepNeuralNetwork

\phi(\vec{x}) ~\sim~f^{(3)}(f^{(2)}(f^{(1)}(\vec{x} \cdot W_1 + \vec{b_1}) \cdot W_2 + \vec{b_2}) \cdot W_3 + \vec{b_3})~=~y


The purpose is to approximate a function φ

y = φ(x)

which (in general) is not linear, using linear operations (matrix multiplications) combined with nonlinear activation functions


http://neuralnetworksanddeeplearning.com/chap4.html
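As a minimal sketch of the composition f3(f2(f1(x W1 + b1) W2 + b2) W3 + b3), the network can be written as three matrix multiplications with an activation applied after each. The shapes 3x5, 5x2 and 2x1 come from the slide; the random weights, the sigmoid in the hidden layers, the linear last layer and all function names are illustrative assumptions, not the course code.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, applied element by element."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Compute f3(f2(f1(x W1 + b1) W2 + b2) W3 + b3).

    The last activation f3 is taken to be linear here (an assumption).
    """
    (W1, b1), (W2, b2), (W3, b3) = params
    a1 = sigmoid(x @ W1 + b1)   # (3,) times (3, 5) gives (5,)
    a2 = sigmoid(a1 @ W2 + b2)  # (5,) times (5, 2) gives (2,)
    return a2 @ W3 + b3         # (2,) times (2, 1) gives (1,)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
shapes = [(3, 5), (5, 2), (2, 1)]
params = [(rng.normal(size=s), np.zeros(s[1])) for s in shapes]

x = np.array([0.5, -1.0, 2.0])  # one object with 3 features
y = forward(x, params)
print(y.shape)
```

Removing the two sigmoid calls would collapse the whole network into a single linear map, which is why the activation functions are what allow a non-linear φ to be approximated.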

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss	good for	activation last layer	size last layer
mean_squared_error	regression	linear	one node
mean_absolute_error	regression	linear	one node
mean_squared_logarithmic_error	regression	linear	one node
binary_crossentropy	binary classification	sigmoid	one node
categorical_crossentropy	multiclass classification	softmax	N nodes
kullback_leibler_divergence	multiclass classification, probabilistic interpretation	softmax	N nodes


DeepNeuralNetwork - loss functions

https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html

Binary Cross Entropy

-(y \log{(p)} + (1-y) \log{(1-p)})

(Multiclass) Cross Entropy

-\sum_{c=1}^M y_{o,c} \log{(p_{o,c})}

c = class

o = object

p = probability

y = label | truth

\hat{y} = prediction

Kullback-Leibler

\sum(\hat{y} \log {\frac{\hat{y}}{y}})

Mean Squared Error (L2)

Mean Absolute Error (L1)

Mean Squared Logarithmic Error

L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N}(\log(y_i + 1) - \log({\hat{y}}_i + 1))^2
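The loss formulas above can be checked on a few numbers. This is a sketch in plain Python for a single object (or a short list of objects); the function names are made up for illustration and are not the Keras implementations.

```python
import math

def binary_cross_entropy(y, p):
    """-(y log p + (1 - y) log(1 - p)) for one object; y is 0 or 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(y_onehot, p):
    """-sum_c y_c log p_c for one object with M classes."""
    return -sum(yc * math.log(pc) for yc, pc in zip(y_onehot, p) if yc > 0)

def mean_squared_log_error(y, y_hat):
    """Mean over objects of (log(y + 1) - log(y_hat + 1))^2."""
    n = len(y)
    return sum((math.log(a + 1) - math.log(b + 1)) ** 2 for a, b in zip(y, y_hat)) / n

# A confident correct prediction costs little, a confident wrong one costs a lot.
good = binary_cross_entropy(1, 0.9)
bad = binary_cross_entropy(1, 0.1)
multi = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
msle = mean_squared_log_error([1.0, 10.0], [1.0, 10.0])
print(good, bad, multi, msle)
```

With a one-hot label only the probability assigned to the true class enters the multiclass cross entropy, so `multi` equals -log(0.7).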

On the interpretability of DNNs

https://distill.pub/2020/circuits/zoom-in/

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

MLTSA:

training DNN

2

https://colab.research.google.com/drive/13c9uJ_fPGjszgsyEuYWafR2F4_n-IXeZ

Gradient Descent

Any linear model:

\vec{y} = \vec{x}W + b

[diagram: inputs x_1, x_2, ..., x_N multiplied by the weights w_1, w_2, ..., w_N and summed with the bias +b to give y]

y : prediction

ytrue : target

Error: e.g.

L_2(\theta)~=~|y(\theta) - y_\mathrm{true}|^2

[plot: L2 as a function of intercept and slope]

Find the best parameters by finding the minimum of the L2 surface

at every step look around and choose the best direction

Gradient Descent

p_2 = p_1 - e ~\nabla f(p_1)

p_2 : new position

p_1 : old position

\nabla f(p_1) : gradient at the old position

e : learning rate

with an activation function f the linear model becomes

y = f(\sum\vec{w}\vec{x} + {b})

adaptive lr: the learning rate depends on the loss

e = e(L_2(\theta))
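The update rule "new position = old position minus learning rate times gradient at the old position" can be sketched for the slope and intercept of a line. The toy data, the fixed learning rate of 0.05, the number of steps and the function names are assumptions chosen for illustration; the loss is the mean of the squared residuals.

```python
def l2_loss_gradient(p, xs, ys):
    """Gradient of the mean squared error with respect to p = (slope, intercept)."""
    slope, intercept = p
    n = len(xs)
    residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
    d_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    d_intercept = 2.0 / n * sum(residuals)
    return d_slope, d_intercept

def gradient_descent(p, xs, ys, e=0.05, n_steps=2000):
    """Repeat p_2 = p_1 - e * grad f(p_1) for a fixed number of steps."""
    for _ in range(n_steps):
        grad = l2_loss_gradient(p, xs, ys)
        p = (p[0] - e * grad[0], p[1] - e * grad[1])
    return p

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]   # noiseless toy data: slope 2, intercept 1
slope, intercept = gradient_descent((0.0, 0.0), xs, ys)
print(slope, intercept)
```

A learning rate that is too large makes the steps overshoot the minimum and diverge, while one that is too small makes convergence slow, which is the motivation for adaptive schemes.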

back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}

[diagram: inputs x_1 ... x_N, weighted, summed with the bias +b and passed through the activation f to give the output]

https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf

\vec{y} = f_N(\ldots f_1(\vec{x} \cdot W_1 + \vec{b_1}) \ldots \cdot W_N + \vec{b_N})

Training models with this many parameters requires a lot of care:

. defining the metric

. optimization schemes

. training/validation/testing sets

But just like in our simple linear regression case, small changes in the parameters lead to small changes in the output for the right activation functions.

define a cost function, e.g.

C=\frac{1}{2}|y-a^L|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2

[diagram: network with inputs x1, x2, weights w11, w12, w13, w21, w22, w23 and biases b1, b2, b3, b]

chain rule:

z = z(y)\\ y = y(x)\\ \frac{dz}{dx}=\frac{dz}{dy}\cdot\frac{dy}{dx}

Training a DNN

back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

think of applying the gradient to a function of a function of a function... use:

1) partial derivatives, 2) chain rule

feed data forward through the network and calculate the cost metric

for each layer, calculate the effect of small changes on the next layer

http://neuralnetworksanddeeplearning.com/chap2.html
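Back-propagation for the quadratic cost C = 1/2 |y - a^L|^2 can be sketched on a tiny network: feed the data forward, then apply the chain rule from the last layer backwards. The 2-3-1 architecture, the sigmoid activations, the random weights and the function names are assumptions for illustration. The result is compared with a finite-difference estimate of one derivative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Feed the data forward and keep the activations of every layer."""
    a1 = sigmoid(x @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return a1, a2

def cost(x, y, W1, b1, W2, b2):
    """Quadratic cost C = 1/2 |y - a^L|^2."""
    return 0.5 * np.sum((y - forward(x, W1, b1, W2, b2)[1]) ** 2)

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of C from the chain rule, last layer first."""
    a1, a2 = forward(x, W1, b1, W2, b2)
    delta2 = (a2 - y) * a2 * (1 - a2)         # dC/dz2, using sigmoid' = a (1 - a)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)  # dC/dz1, propagated back through W2
    return np.outer(x, delta1), delta1, np.outer(a1, delta2), delta2

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=1)
x, y = np.array([0.3, -0.7]), np.array([1.0])

dW1, db1, dW2, db2 = backprop(x, y, W1, b1, W2, b2)

# Central finite-difference check on one weight of the first layer.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
numeric = (cost(x, y, W1_plus, b1, W2, b2) - cost(x, y, W1_minus, b1, W2, b2)) / (2 * eps)
print(dW1[0, 1], numeric)
```

The finite-difference estimate needs two forward passes per parameter, while back-propagation returns all derivatives from one forward and one backward pass.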

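Training a DNN

Gradient Descent

at every step look around and choose the best direction

why do we not worry about local minima?

the curse of dimensionality is actually a blessing here! With so many parameters, a point where the gradient vanishes is much more likely to be a saddle point, which still has a downhill direction, than a true local minimum.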

Training a DNN

you need to pick

http://www.comp.hkbu.edu.hk/~markus/teaching/comp7650/tnn-94-gradient.pdf

1994

We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases

the magnitude of the derivative of the state of a dynamical system at time t with respect to the state at time 0 decreases exponentially as t increases.

MLTSA:

RNN

3

RNN architecture

Feed-forward NN architecture

input layer

hidden layers

output layer

Recurrent NN architecture

input layer

RNN hidden layers

output layer

Remember the state-space problem!

we want to process a sequence of vectors x applying a recurrence formula at every time step:

h_t = f_q(h_{t-1}, x_t)

h_t : current state

h_{t-1} : previous state

x_t : features (can be time dependent)

f_q : function with parameters q

MLTSA:

state space model (from week ~4)

Definition

A state-space model is a model to derive the value of a time-dependent variable x(t), the state, generated by a noisy Markovian process, from observations of a variable y(t), also subject to noise, linearly related to the target variable

y_t=Hx_t+\epsilon_t;~~\epsilon_t \sim N(0,\Sigma^2_\epsilon)

x_{t} =\Phi x_{t-1} + \nu_t;~~\nu_t \sim N(0,\Sigma^2_\nu)
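To make the parallel with the RNN recurrence concrete, here is a minimal simulation of a scalar state-space model: the hidden state is updated from its previous value, and the observation is a noisy linear function of the state. The parameter values, the seed and the function name are assumptions for illustration.

```python
import random

def simulate_state_space(n_steps, phi=0.9, H=2.0, sigma_nu=0.1, sigma_eps=0.5, x0=1.0, seed=0):
    """Scalar state-space model.

    x_t = phi * x_{t-1} + nu_t,   nu_t ~ N(0, sigma_nu^2)    (hidden state)
    y_t = H * x_t + eps_t,        eps_t ~ N(0, sigma_eps^2)  (observation)
    """
    rng = random.Random(seed)
    x, states, observations = x0, [], []
    for _ in range(n_steps):
        x = phi * x + rng.gauss(0.0, sigma_nu)
        states.append(x)
        observations.append(H * x + rng.gauss(0.0, sigma_eps))
    return states, observations

states, observations = simulate_state_space(100)
# Without noise the observation is exactly H times the state.
clean_states, clean_obs = simulate_state_space(5, sigma_nu=0.0, sigma_eps=0.0)
print(clean_states[:3], clean_obs[:3])
```

An RNN keeps the same structure (a state updated from the previous state, and an output read from the state) but replaces the fixed linear maps with learned, non-linear ones.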

RNN architecture

Simplest possible RNN

[diagram: input layer, RNN hidden layers, output layer, connected by the weights Wxh, Whh, Qhy]

h_t = f_q(h_{t-1}, x_t)

h_t = \tanh(W_{hh}\cdot h_{t-1} + W_{xh}\cdot x_t)

y_t = Q_{hy}\cdot h_{t}
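The simplest RNN can be sketched in a few lines: a tanh of the sum of the recurrent term W_hh h_{t-1} and the input term W_xh x_t, followed by a linear read-out through Q_hy. The layer sizes, the random weights and the function name are assumptions for illustration.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, Q_hy, h0):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = Q_hy h_t over a sequence."""
    h, hs, ys = h0, [], []
    for x_t in xs:                       # the same weights are used at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hs.append(h)
        ys.append(Q_hy @ h)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(2)
n_features, n_hidden, n_out, n_steps = 2, 4, 1, 6   # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
Q_hy = rng.normal(scale=0.5, size=(n_out, n_hidden))

xs = rng.normal(size=(n_steps, n_features))
hs, ys = rnn_forward(xs, W_xh, W_hh, Q_hy, h0=np.zeros(n_hidden))
print(hs.shape, ys.shape)
```

The loop makes the weight sharing explicit: the number of parameters does not depend on the length of the sequence.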

RNN architecture

Alternative graphical representation of RNN

h_t = f_q(h_{t-1}, x_t)

[diagram: RNN unrolled in time: input layer entering through Wxh, hidden states h(t-1), h(t), ..., h(t+4) linked by Whh, outputs y(t), ..., y(t+5) produced through Why]

the weights are the same! always the same Wxh, Whh and Why at every time step

RNN architecture

applications

image captioning: one image to a sequence of words

sentiment analysis: sequence of words to one sentiment

language translator: sequence of words to sequence of words

online: video classification frame by frame

RNN architecture

more complicated RNNs

Some layers will be recurrent, others will not. The network does not need to be fully connected

RNN architecture

[diagram: RNN unrolled in time: input layer entering through Wxh, hidden states h(t-1) ... h(t+4) linked by Whh, outputs y(t) ... y(t+5) produced through Why, each with its own error e(t) ... e(t+5)]

h_t = W_h\phi(h_{t-1}) + W_{x}x(t)

y_t = W_y\phi(h_t)

each output has its own loss

The cats that ate were full

The cat that ate was full

(the verb has to agree with a subject that appeared several words earlier)

LOSS

\frac{\partial e_t}{\partial \theta} =\sum_{k=1}^{t} \frac{\partial e_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial \theta}

Total loss:

\frac{\partial E}{\partial \theta} = \sum_{t=1}^{N}\frac{\partial e_t}{\partial \theta}

RNN architecture

the term \partial h_t / \partial h_k in the gradient of each loss e_t is itself a product of one-step derivatives:

\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}

\left| \frac{\partial h_t}{\partial h_{t-1}} \right|< 1 \rightarrow 0

\left|\frac{\partial h_t}{\partial h_{t-1}}\right| > 1 \rightarrow \infty
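The effect of multiplying many one-step derivatives can be seen in a scalar toy recurrence h_i = φ(w h_{i-1}), where each factor is w times the derivative of φ. This is a simplified sketch with no input term; the weights 0.5 and 1.5, the starting state and the function name are assumptions for illustration.

```python
import math

def jacobian_product(w, n_steps, h0=0.1, linear=False):
    """d h_t / d h_0 for the scalar recurrence h_i = phi(w * h_{i-1}).

    phi is tanh, or the identity when linear=True. The chain rule turns the
    derivative into a product of one-step factors d h_i / d h_{i-1}.
    """
    h, grad = h0, 1.0
    for _ in range(n_steps):
        z = w * h
        h = z if linear else math.tanh(z)
        d_phi = 1.0 if linear else 1.0 - h ** 2   # derivative of phi at z
        grad *= w * d_phi
    return grad

vanishing = [abs(jacobian_product(0.5, t)) for t in (1, 10, 50)]
exploding = [abs(jacobian_product(1.5, t, linear=True)) for t in (1, 10, 50)]
print(vanishing)
print(exploding)
```

With factors below 1 the derivative with respect to a state 50 steps back is numerically negligible, so that state barely contributes to the weight update; with factors above 1 the same product blows up.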

RNN architecture

vanishing gradient problem!

[diagram: unrolled RNN with hidden states h(t-1) ... h(t+4), outputs y(t) ... y(t+5), errors e(t) ... e(t+5) and weights Wxh, Whh, Why; labels: Learns Fast! / Learns slow!]

RNN obsesses over recent past, forgets remote past

the vanishing gradient problem is exacerbated by having the same set of weights (Whh) at every time step.

The vanishing gradient problem causes the early layers not to learn as effectively

The earlier layers learn from the remote past

As a result: a vanilla RNN would only have short term memory (only learn from recent states)

MLTSA:

LSTM

4

https://www.pluralsight.com/guides/introduction-to-lstm-units-in-rnn

LSTM: long short term memory

solution to the vanishing gradient problem

Ct : cell state (output)

h : hidden states

X : input

Ct-1 : previous cell state (previous output)

ht-1 : previous hidden state

xt : current state (input)

forget gate:

do I keep memory of this past step?

f^{(t)} = \sigma(W^f[h_{t-1},x_t] + b^f)

input gate:

do I update the current cell?

i^{(t)} = \sigma(W^i[h_{t-1},x_t] + b^i)

\hat{C}^{(t)} = \tanh(W^C[h_{t-1},x_t] + b^C)

cell state:

produces the prediction

C^{(t)} = C^{(t-1)} \times f^{(t)}+ i^{(t)} \times \hat{C}^{(t)}

output gate:

which part of the cell state goes into the hidden state

o^{(t)} = \sigma(W^o[h_{t-1},x_t] + b^o)

hidden state:

produces the new hidden state

h^{(t)} = o^{(t)} \times \tanh\left( C^{(t)}\right)
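One LSTM time step, combining the forget gate, input gate, cell state, output gate and hidden state, can be sketched as below. The candidate cell state uses tanh, as in the standard LSTM formulation, and all products between gates and states are element by element. The sizes, the random weights, the dictionary layout and the function names are assumptions for illustration, not the Keras internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step; W and b hold one entry per gate: 'f', 'i', 'C', 'o'."""
    hx = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ hx + b["f"])         # forget gate
    i = sigmoid(W["i"] @ hx + b["i"])         # input gate
    C_hat = np.tanh(W["C"] @ hx + b["C"])     # candidate cell state
    C = C_prev * f + i * C_hat                # new cell state
    o = sigmoid(W["o"] @ hx + b["o"])         # output gate
    h = o * np.tanh(C)                        # new hidden state
    return h, C

rng = np.random.default_rng(3)
n_hidden, n_features = 4, 2                   # illustrative sizes
W = {g: rng.normal(scale=0.5, size=(n_hidden, n_hidden + n_features)) for g in "fiCo"}
b = {g: np.zeros(n_hidden) for g in "fiCo"}

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(10, n_features)):  # a sequence of 10 time steps
    h, C = lstm_step(x_t, h, C, W, b)
print(h, C)
```

The cell state is updated additively (old state times forget gate, plus gated candidate), so when the forget gate is close to 1 information and gradients can pass through many time steps without being repeatedly squashed.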

LSTM: how to actually run it

even if you want to predict a single time series, you need many examples

split the time series into chunks

batch size: how many sequences you pass at once

timeseries: how many time stamps in a sequence

features: how many measurements in the time series

batch size: N

timeseries: 1000

features: 2

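from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(50, 2)))  # 32 LSTM units; each input sequence has 50 time stamps and 2 features
model.add(Dense(2))  # 2 output nodes, linear activation by default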
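Splitting a single time series into chunks can be sketched with a reshape. The example assumes a series of 1000 time stamps with 2 features cut into non-overlapping chunks of 50, which would match `input_shape=(50, 2)`; the chunk length, the sine and cosine toy data and the function name are assumptions for illustration.

```python
import numpy as np

def split_into_chunks(series, chunk_length):
    """Cut a (n_times, n_features) series into non-overlapping chunks.

    Returns an array of shape (n_chunks, chunk_length, n_features), the
    (batch, timesteps, features) layout that recurrent layers expect.
    Any leftover time stamps at the end are dropped.
    """
    n_chunks = series.shape[0] // chunk_length
    trimmed = series[: n_chunks * chunk_length]
    return trimmed.reshape(n_chunks, chunk_length, series.shape[1])

t = np.linspace(0, 20 * np.pi, 1000)
series = np.column_stack([np.sin(t), np.cos(t)])   # 1000 time stamps, 2 features
chunks = split_into_chunks(series, chunk_length=50)
print(chunks.shape)   # (20, 50, 2)
```

Overlapping (sliding) windows are a common alternative when the series is short, at the price of correlated training examples.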


To be or not to be? this is the question. Whether 'tis nobler in the mind

sequences of 12 letters

batch size: N

timeseries: 12

features: 1
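Turning a text into sequences of 12 letters with one feature per time stamp can be sketched as follows. The integer encoding of the characters, the use of overlapping windows, and the choice of the next character as the target are assumptions for illustration; the function name is hypothetical.

```python
def text_to_sequences(text, seq_length=12):
    """Encode a text as integers and cut it into overlapping letter sequences.

    Each input is seq_length consecutive characters (one feature per time
    stamp); the target is the character that follows.
    """
    vocabulary = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(vocabulary)}
    encoded = [char_to_int[c] for c in text]
    inputs, targets = [], []
    for start in range(len(encoded) - seq_length):
        inputs.append(encoded[start:start + seq_length])
        targets.append(encoded[start + seq_length])
    return inputs, targets, vocabulary

text = "To be or not to be? this is the question."
inputs, targets, vocabulary = text_to_sequences(text)
first = "".join(vocabulary[i] for i in inputs[0])
print(len(inputs), repr(first))
```

Before being passed to a recurrent layer the list of inputs would be reshaped to (N, 12, 1), that is N sequences, 12 time stamps, 1 feature.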

LSTM: how to actually run it

There is no homework on this because we are at the end of the semester, but if you want to learn more I will upload an exercise over the weekend where you will train an RNN to generate physics paper titles!

http://davidsd.org/2010/09/the-arxiv-according-to-arxiv-vs-snarxiv/

MLTSA:

visualizing NNs

5

Saliency Maps

Visualizing the predictions and the “neuron” firings in the RNN https://sungsoo.github.io/2017/01/08/recurrent-neural-networks.html

"The guesses are colored by their probability (so dark red = judged as very likely, white = not very likely).

...

The input character sequence (blue/green) is colored based on the firing of a randomly chosen neuron in the hidden representation of the RNN. Think about it as green = very excited and blue = not very excited (... these are values between [-1,1] in the hidden state vector, which is just the gated and tanh’d LSTM cell state).

Intuitively, this is visualizing the firing rate of some neuron in the “brain” of the RNN while it reads the input sequence. Different neurons might be looking for different patterns."

learning markdown syntax: URLs

learning markdown syntax: [[]]

Visualizing the predictions and the “neuron” firings in the RNN: Vanilla RNN

Visualizing the predictions and the “neuron” firings in the RNN: LSTM

reading

The Unreasonable Effectiveness of Recurrent Neural Networks

andrej karpathy

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

not mandatory

resources

Neural Network and Deep Learning

an excellent and free book on NN and DL

http://neuralnetworksanddeeplearning.com/index.html

Deep Learning An MIT Press book in preparation

Ian Goodfellow, Yoshua Bengio and Aaron Courville

https://www.deeplearningbook.org/lecture_slides.html

History of NN

https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history2.html

resources

Gradient Descent

https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html
