data science

for (physical) scientists 12

dr.federica bianco | fbb.space |    fedhere |    fedhere

PINNs and Generative AI

this slide deck:

https://slides.com/federicabianco/dsps_12

WRITTEN FINAL:

72 hours at home

Dec 11 8AM - Dec 14 8AM

WORK ALONE

USE AI AT YOUR OWN RISK

USE THE SLACK CHANNEL FOR QUESTIONS

DM ME IF YOUR QUESTION REQUIRES SHARING CODE OR SOLUTIONS

ORAL FINAL:

Schedule 30 minutes with me

in those 30 minutes I will ask you questions about things you did in your final

You shall demonstrate that:
- you understand what you did
- you made informed choices based on what you learned in the class
- you can answer questions about data science topics (e.g. what other method could you have used? or why would a NN not be a good choice here?)

https://calendly.com/fbbianco/30min

Deep Learning

0

recap

\vec{y} = f_N(\ldots f_1(\vec{x} \cdot W_1 + \vec{b_1}) \ldots \cdot W_N + \vec{b_N})

[diagram: a layer of perceptrons with inputs x1, x2, weights w11, w12, w13, w21, w22, w23, and biases b1, b2, b3]

multilayer perceptron

w: weight

sets the sensitivity of a neuron

b: bias:

up-down weights a neuron

multilayer perceptron

layer of perceptrons

[diagram: inputs x_1, x_2, x_3 feed four perceptrons with biases b_1, b_2, b_3, b_4, followed by the output]

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1

w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2

w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3

w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4

w: weight

sets the sensitivity of a neuron

b: bias:

up-down weights a neuron

f: activation function:

turns neurons on-off

\vec{y} = f_N(\ldots f_1(\vec{x} \cdot W_1 + \vec{b_1}) \ldots \cdot W_N + \vec{b_N})

multilayer perceptron

w: weight

sets the sensitivity of a neuron

b: bias:

up-down weights a neuron

http://neuralnetworksanddeeplearning.com/chap4.html

f: activation function:

turns neurons on-off

layer connectivity

[diagram: input layer (x_1, x_2, x_3), hidden layer (biases b_1, b_2, b_3, b_4), output layer (bias b_1), output]

Fully connected: all nodes go to all nodes of the next layer.

Sparsely connected: only some nodes go to nodes of the next layer.

The last layer is always connected

layer connectivity

how does it relate to matrix multiplication

each layer is a matrix

Except this is a very misleading representation:

there are no biases or activation functions

each layer should be a different shape

\vec{x}: 1x3, W_1: 3x5, W_2: 5x2, W_3: 2x1

DeepNeuralNetwork

what we are doing is exactly a series of matrix multiplications.

(((\vec{x} \cdot W_1) \cdot W_2) \cdot W_3)~=~y

adding biases and activation functions:

f^{(3)}(f^{(2)}(f^{(1)}(\vec{x} \cdot W_1 + \vec{b_1}) \cdot W_2 + \vec{b_2}) \cdot W_3 + \vec{b_3})~=~y

The purpose is to approximate a function φ

y = φ(x)

which (in general) is not linear with linear operations

\phi(\vec{x}) ~\sim~f^{(3)}(f^{(2)}(f^{(1)}(\vec{x} \cdot W_1 + \vec{b_1}) \cdot W_2 + \vec{b_2}) \cdot W_3 + \vec{b_3})~=~y

http://neuralnetworksanddeeplearning.com/chap4.html
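
The chain of matrix multiplications can be sketched in a few lines of NumPy. This is only an illustration: the shapes follow the slide (1x3 input, then 3x5, 5x2, 2x1 weights), while the random weights, the zero biases, the relu activation, and the choice to leave the last layer linear are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes matching the slide: x is 1x3, W1 3x5, W2 5x2, W3 2x1
x = rng.normal(size=(1, 3))
weights = [rng.normal(size=s) for s in [(3, 5), (5, 2), (2, 1)]]
biases = [np.zeros((1, s)) for s in (5, 2, 1)]

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, weights, biases, activation):
    """Apply f(a . W + b) layer by layer; the last layer is left linear."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        a = a @ W + b
        if i < len(weights) - 1:
            a = activation(a)
    return a

y = forward(x, weights, biases, relu)

# Without activations (and with zero biases) the network collapses
# to a single matrix: this is why the nonlinearity is needed
y_linear = forward(x, weights, biases, lambda a: a)
W_total = weights[0] @ weights[1] @ weights[2]
```

Here `y_linear` equals `x @ W_total`: three linear layers are no more expressive than one.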

output

input layer

hidden layer

output layer

hidden layer

32 parameters and

?? hyperparameters

activation functions -

loss function - 1

optimization method - 1

architecture - M

how many hyperparameters?

Parameters and hyperparameters

\sum_{l=1}^N N_{n_l}

\vec{y} = f_N(\ldots f_1(\vec{x} \cdot W_1 + \vec{b_1}) \ldots \cdot W_N + \vec{b_N})

Training models with this many parameters requires a lot of care:

. defining the metric

. optimization schemes

. training/validation/testing sets

But just like in our simple linear regression case, we can rely on the fact that, for the right activation functions, small changes in the parameters lead to small changes in the output.

define a cost function, e.g.

C=\frac{1}{2}|y-a^L|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2
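
A minimal sketch of the quadratic cost for a single training example, where `y` is the target and `a_L` the activation of the last layer. The function name and the example numbers are made up for illustration.

```python
import numpy as np

def quadratic_cost(y, a_L):
    """C = 1/2 * sum_j (y_j - a^L_j)^2 for one training example."""
    y = np.asarray(y, dtype=float)
    a_L = np.asarray(a_L, dtype=float)
    return 0.5 * np.sum((y - a_L) ** 2)

# target [1, 0], network output [0.5, 0.5]
cost = quadratic_cost([1.0, 0.0], [0.5, 0.5])
```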

proper care of your DNN

0

NN are a vast topic and we only have 2 weeks!

Some FREE references!

michael nielsen

better pedagogical approach, more basic, more clear

http://neuralnetworksanddeeplearning.com/index.html

ian goodfellow

mathematical approach, more advanced, unfinished

https://www.deeplearningbook.org/

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

architecture - wide networks tend to overfit, deep networks are hard to train
number of epochs - the sweet spot is when learning slows down, but before you start overfitting... it may take DAYS! jumps may indicate bad initial choices (like in all gradient descent)
loss function - needs to be appropriate to the task, e.g. classification vs regression
activation functions - needs to be consistent with the loss function
optimization scheme - needs to be appropriate to the task and data
learning rate in optimization - balance speed and accuracy
batch size - smaller batch size is faster but leads to overtraining

An article that compares various DNNs

https://arxiv.org/pdf/1605.07678.pdf

[figures from the article: accuracy comparison; batch size]

1

What should I choose for the loss function and how does that relate to the activation function and optimization?

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

always check your loss function! it should go down smoothly and flatten out at the end of the training.

not flat? you are still learning!

too flat? you are overfitting...

loss (gallery of horrors)

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

jumps are not unlikely (and not necessarily a problem) if your activations are discontinuous (e.g. relu)

when you use validation you are introducing regularizations (e.g. dropout) so the loss can be smaller than for the training set

loss and learning rate (note that the appropriate learning rate depends on the chosen optimization scheme!)
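
The effect of the learning rate on the loss curve can be reproduced with a toy problem. The sketch below uses plain gradient descent on the loss L(w) = w**2; the loss, the starting point, and the three learning rates are assumptions chosen for illustration, not values from the notebook.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimize the toy loss L(w) = w**2 and return the loss at each step."""
    w = w0
    losses = [w ** 2]
    for _ in range(steps):
        w = w - lr * 2.0 * w   # the gradient of w**2 is 2w
        losses.append(w ** 2)
    return losses

slow = gradient_descent(lr=0.01)      # still learning after 20 steps
good = gradient_descent(lr=0.3)       # goes down and flattens out
too_large = gradient_descent(lr=1.1)  # overshoots: the loss grows
```

Plotting the three lists against the step number gives a small version of the gallery above.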

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss	good for	activation last layer	size last layer
mean_squared_error	regression	linear	one node
mean_absolute_error	regression	linear	one node
mean_squared_logarithmic_error	regression	linear	one node
binary_crossentropy	binary classification	sigmoid	one node
categorical_crossentropy	multiclass classification	softmax	N nodes
kullback_leibler_divergence	multiclass classification, probabilistic interpretation	softmax	N nodes

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb
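
To see how the last-layer activation and the loss fit together, here is a NumPy sketch of the two classification cases. It assumes a sigmoid on one node for the binary case and a softmax on N nodes for the multiclass case (a common pairing with categorical cross-entropy); the function names mirror the Keras loss names but these are hand-written stand-ins, and the numbers are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

def binary_crossentropy(y_true, p):
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_crossentropy(y_onehot, p):
    return -np.mean(np.sum(y_onehot * np.log(p), axis=-1))

# binary classification: one output node with a sigmoid
p_binary = sigmoid(np.array([2.0, -1.0]))
bce = binary_crossentropy(np.array([1.0, 0.0]), p_binary)

# multiclass classification: N output nodes with a softmax
p_multi = softmax(np.array([[2.0, 0.5, -1.0]]))
cce = categorical_crossentropy(np.array([[1.0, 0.0, 0.0]]), p_multi)
```

Both activations output numbers in (0, 1) that can be read as probabilities, which is what the cross-entropy losses expect; the softmax additionally forces the N outputs to sum to 1.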

On the interpretability of DNNs

https://distill.pub/2020/circuits/zoom-in/

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Physics Informed NN

PiNN

1

Application regime:

PiNN

-infinity - 1950's

theory driven: little data, mostly theory, falsifiability and all that...

1980's - today

data driven: lots of data, drop theory and use associations, black-box models

lots of data yet not enough for entirely automated decision making

complex theory that cannot be solved analytically

combine it with some theory

PiNN

Non Linear PDEs are hard to solve!

u:[0,T] \times D \to \mathbb{R}

\partial_t u (t,x) + \mathcal{N}[u](t,x) = 0, \quad (t,x) \in (0,T] \times D

u(0,x) = u_0(x)

PiNN

Non Linear PDEs are hard to solve!

Existence and uniqueness of solutions

A fundamental question for any PDE is the existence and uniqueness of a solution for given boundary conditions.

E.g.: Open problem of existence (and smoothness) of solutions to the Navier–Stokes equations is one of the seven Millennium Prize problems in mathematics.

https://en.wikipedia.org/wiki/Nonlinear_partial_differential_equation

PiNN

Non Linear PDEs are hard to solve!

Linear approximation

The solutions in a neighborhood of a known solution can sometimes be studied by linearizing the PDE around the solution. This corresponds to studying the tangent space of a point of the moduli space of all solutions.

https://en.wikipedia.org/wiki/Nonlinear_partial_differential_equation

PiNN

Non Linear PDEs are hard to solve!

Exact solutions

It is often possible to write down some special solutions explicitly in terms of elementary functions (though it is rarely possible to describe all solutions like this). One way of finding such explicit solutions is to reduce the equations to equations of lower dimension, preferably ordinary differential equations, which can often be solved exactly.

https://en.wikipedia.org/wiki/Nonlinear_partial_differential_equation

PiNN

Non Linear PDEs are hard to solve!

Numerical solutions

Numerical solution on a computer is almost the only method that can be used for getting information about arbitrary systems of PDEs. There has been a lot of work done, but a lot of work still remains on solving certain systems numerically, especially for the Navier–Stokes and other equations related to weather prediction.

https://en.wikipedia.org/wiki/Nonlinear_partial_differential_equation

PiNN

Burgers' equation with viscosity

\partial_t u + u \, \partial_x u - (0.01/\pi) \, \partial_{xx} u = 0

Domain

(t,x) \in (0,1] \times (-1,1)

Boundary Conditions

u(0,x) = - \sin(\pi \, x), \quad x \in [-1,1]

u(t,-1) = u(t,1) = 0, \quad t \in (0,1]

How to solve analytically

https://www.youtube.com/watch?v=5ZrwxQr6aV4

PiNN

Non Linear PDEs are hard to solve!

Raissi et al. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. arXiv 1711.10561
Raissi et al. Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations. arXiv 1711.10566
Raissi et al. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comp. Phys. 378 pp. 686-707 DOI: 10.1016/j.jcp.2018.10.045

PiNN

Non Linear PDEs are hard to solve!

Provide training points at the boundary with the calculated solution (trivial because we have boundary conditions)

Provide the physical constraint: make sure the solution satisfies the PDE

via a modified loss function that includes the residuals of the prediction and the residual of the PDE

\mathrm{loss} = L2 + PDE = \sum(u_\theta - u)^2 + (\partial_t u_\theta + u_\theta \, \partial_x u_\theta - (0.01/\pi) \, \partial_{xx} u_\theta)^2
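
The structure of this loss can be sketched without a deep learning library. In the example below the derivatives are taken with central finite differences and the two terms are averaged rather than summed; a real PINN would get the derivatives of the network output by automatic differentiation. The function names, the step `h`, the number of points, and the two candidate functions are all assumptions made for illustration.

```python
import numpy as np

NU = 0.01 / np.pi  # viscosity coefficient of the Burgers' equation used here

def pde_residual(u, t, x, h=1e-3):
    """Burgers residual u_t + u u_x - nu u_xx via central finite differences."""
    u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_x = (u(t, x + h) - u(t, x - h)) / (2 * h)
    u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h ** 2
    return u_t + u(t, x) * u_x - NU * u_xx

def pinn_loss(u, t_data, x_data, u_data, t_col, x_col):
    """Data misfit on labelled points plus PDE residual on collocation points."""
    data_term = np.mean((u(t_data, x_data) - u_data) ** 2)
    pde_term = np.mean(pde_residual(u, t_col, x_col) ** 2)
    return data_term + pde_term

# labelled points from the initial condition u(0, x) = -sin(pi x)
x_data = np.linspace(-1.0, 1.0, 50)
t_data = np.zeros_like(x_data)
u_data = -np.sin(np.pi * x_data)

# random collocation points inside the domain, where only the PDE is checked
rng = np.random.default_rng(0)
t_col = rng.uniform(0.0, 1.0, 200)
x_col = rng.uniform(-1.0, 1.0, 200)

# hypothetical candidate 1: the initial condition frozen in time
# (fits the data exactly, violates the PDE)
candidate = lambda t, x: -np.sin(np.pi * x) + 0.0 * t
loss_candidate = pinn_loss(candidate, t_data, x_data, u_data, t_col, x_col)

# hypothetical candidate 2: u = 0 (satisfies the PDE, does not fit the data)
zero = lambda t, x: 0.0 * t + 0.0 * x
loss_zero = pinn_loss(zero, t_data, x_data, u_data, t_col, x_col)
```

Neither candidate reaches zero loss: the first is penalized by the PDE term, the second by the data term. Training a network u_θ means finding parameters that make both terms small at once.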

PiNN

generative AI

2

Applications

Image Generation (and 3D Shape Generation)
Semantic Image-to-Photo Translation
Image Resolution Increase
Text-to-Speech Generator
Speech-to-Speech Conversion
Text Generation (ChatGPT)
Music Generation
Image-to-Image Conversion

https://news.mit.edu/2024/scientists-use-generative-ai-complex-questions-physics-0516

GANs

VAE

Diffusion models

VAE

https://github.com/fedhere/MLPNS_FBianco/tree/main/generativeAI

Autoencoders

3

Unsupervised learning with

Neural Networks

What do NN do? approximate complex functions with series of linear functions

To do that they extract information from the data

Each layer of the DNN produces a representation of the data: a "latent representation".

The dimensionality of that latent representation is determined by the size of the layer (and its connectivity, but we will ignore this bit for now)

.... so if my layers are smaller what I have is a compact representation of the data

Autoencoder Architecture

Feed Forward DNN:

the size of the input is 5,

the size of the last layer is 2

Autoencoder Architecture

Encoder: outputs a lower dimensional representation z of the data x (similar to PCA, tSNE...)
Decoder: Learns how to reconstruct x given z: learns p(x|z)
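
Since the encoder is compared to PCA above, the simplest possible encoder/decoder pair can be sketched with a linear projection. The example below is not a neural network: it uses the SVD to build a linear encoder from 5 features to 2 latent dimensions and the matching decoder. The toy data (constructed to lie on a 2D plane) and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 200 samples with 5 features that really live on a 2D plane
z_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
x = z_true @ mixing

# PCA via SVD gives the best linear encoder/decoder pair
mean = x.mean(axis=0)
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
components = vt[:2]                 # shape (2, 5)

def encode(x):
    """Map 5 features to a 2-dimensional latent representation z."""
    return (x - mean) @ components.T

def decode(z):
    """Reconstruct the 5 features from z."""
    return z @ components + mean

z = encode(x)
x_reconstructed = decode(z)
reconstruction_error = np.mean((x - x_reconstructed) ** 2)
```

Because this toy data is exactly 2-dimensional, the reconstruction is essentially perfect. An autoencoder replaces the two linear maps with neural networks, so it can also compress data that lie on a curved surface.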

Autoencoder Architecture

https://link.springer.com/chapter/10.1007/978-981-13-6661-1_3

Building a DNN

with keras and tensorflow

Trivial to build, but the devil is in the details!

Building a DNN

with keras and tensorflow

Trivial to build, but the devil is in the details!

from keras.models import Sequential
# pretrained models can also be loaded from keras.models
from keras.layers import Dense

# number of input features (x_train is assumed to be a 2D array)
n_cols = x_train.shape[1]

# create model
model = Sequential()

# create the model architecture by adding model layers
model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))

# need to choose the loss function, metric, optimization scheme
model.compile(optimizer='adam', loss='mean_squared_error')

# need to learn what to look for - always plot the loss function!
history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=20, batch_size=100, verbose=1)
# note that the model accepts a validation set:
# this gives a 3-way split: train-validate-test

# predict on the held-out test set
test_y_predictions = model.predict(x_test)

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

[diagram: encoder, bottleneck, decoder]

This autoencoder model has a 64-neuron bottleneck. This means it will generate a compressed representation of the data out of that layer which is 64-dimensional (the original size is 784 pixels)

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

This simple model has 200K parameters!

My original choice is to train it with "adadelta" with a mean squared loss function, all activation functions are relu, appropriate for a linear regression

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss	good for	activation last layer	size last layer
mean_squared_error	regression	linear	one node
mean_absolute_error	regression	linear	one node
mean_squared_logarithmic_error	regression	linear	one node
binary_crossentropy	binary classification	sigmoid	one node
categorical_crossentropy	multiclass classification	softmax	N nodes
kullback_leibler_divergence	multiclass classification, probabilistic interpretation	softmax	N nodes

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

autoencoder for image reconstruction

# model 1: linear output layer, mean squared error loss
model_digits64.add(Dense(ndim, 
                        activation='linear'))
model_digits64.compile(optimizer="adadelta", 
                   loss="mean_squared_error")

# model 2: sigmoid output layer, mean squared error loss
model_digits64_sig.add(Dense(ndim, 
                             activation='sigmoid'))
model_digits64_sig.compile(optimizer="adadelta", 
                           loss="mean_squared_error")

# model 3: sigmoid output layer, binary cross-entropy loss
model_digits64_bce.add(Dense(ndim, 
                             activation='sigmoid'))
model_digits64_bce.compile(optimizer="adadelta", 
                           loss="binary_crossentropy")

loss function: did not finish learning, it is still decreasing rapidly

The predictions are far too detailed. While the input is not binary, it does not have a lot of details. Maybe approaching it as a binary problem (with a sigmoid and a binary cross entropy loss) will give better results

loss function: also did not finish learning, it is still decreasing rapidly

A sigmoid activation gives a much better result!

Binary cross entropy loss function: it is more appropriate when the output layer is sigmoid

Even better results!

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

[figures: original vs. predicted digits for each of the three models]

autoencoder for image reconstruction

A more ambitious model has a 16-neuron bottleneck: we are trying to extract 16 numbers to reconstruct the entire image! It's pretty remarkable! Those 16 numbers are features extracted from the data

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

predicted

original

latent

representation

models are neutral, the bias is in the data (or is it?)

https://www.theverge.com/21298762/face-depixelizer-ai-machine-learning-tool-pulse-stylegan-obama-bias

Why does this AI model whiten Obama's face?

Simple answer: the data is biased. The algorithm is fed more images of white people

But really, would the opposite have been acceptable? The bias is in society

Joy Buolamwini

comparing generative AI models

3

see also https://arxiv.org/pdf/2103.04922.pdf

VAE: the latent space is assumed to be Gaussian distributed - this causes inaccurate (blurry) generation

Normalizing Flows: similar to a VAE but with a NN in the middle that approximates the true distribution of the latent space

GANs: Generative Adversarial NN

have two networks trained at the same time that compete against each other in a minimax game.

The generator generates images, starting with pure noise.

The discriminator classifies the image from the generator as Real/Fake

the discriminator is trained not to be fooled by the generator.

the generator is trained to make better images

Ian Goodfellow et al., 2014 Generative Adversarial Networks

Minmax Loss Function: minimized by the generator, maximized by the discriminator

log(D(G(z))): change introduced to minimize generator saturation

https://danieltakeshi.github.io/2017/03/05/understanding-generative-adversarial-networks/
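
For reference, the minimax loss can be written out as follows. This is the standard form from Goodfellow et al. (2014), not a transcription of the slide image: D is the discriminator, G the generator, x a real sample, and z the input noise.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)]
  + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```

Early in training the discriminator easily rejects the generated images, so log(1 - D(G(z))) is nearly flat and the generator receives almost no gradient (it saturates). The change mentioned above trains the generator to maximize log(D(G(z))) instead, which has the same goal but a stronger gradient when D(G(z)) is small.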

DDPM:Denoising Diffusion Probabilistic Model

Ho, Jain, Abbeel 2020

https://arxiv.org/abs/2006.11239

Which generative AI is right for you??

Neural Networks: Transformers

Encoder + Decoder architecture

Attention mechanism

Multi-head attention

Attention is all you need: transformer model

transformer generalized architecture elements

Attention is all you need

Encoder + Decoder architecture

Encodes the past

Encoder + Decoder architecture

decodes the past and predicts the future

MHA acting on encoder (1)

Attention is all you need (2017)

each attention head learns relationships between elements of the series (i.e. words/punctuation in the sentence)
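
What a single attention head computes can be sketched in NumPy with scaled dot-product attention, the form used in "Attention is all you need". The sequence length, the embedding sizes, and the random weight matrices below are hypothetical; in a trained transformer the three matrices are learned.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise relevance of elements
    weights = softmax(scores)                 # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 2           # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))      # e.g. 4 embedded words
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = attention_head(x, w_q, w_k, w_v)
```

Row i of `weights` says how much element i of the series attends to every other element. Multi-head attention runs several such heads in parallel, each with its own matrices, and concatenates the outputs.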

resources

Neural Network and Deep Learning

an excellent and free book on NN and DL

http://neuralnetworksanddeeplearning.com/index.html

Deep Learning An MIT Press book in preparation

Ian Goodfellow, Yoshua Bengio and Aaron Courville

https://www.deeplearningbook.org/lecture_slides.html

History of NN

https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history2.html

DNN for time series

RNN

RNN architecture

input layer

output layer

hidden layers

Feed-forward architecture

RNN architecture

output layer

hidden layers

Feed-forward NN architecture

Recurrent NN architecture

input layer

output layer

RNN hidden layers

output layer

hidden layers

input layer

RNN architecture

[diagram: input layer, RNN hidden layers, output layer]

Remember the state-space problem!

we want to process a sequence of vectors x applying a recurrence formula at every time step:

h_t = f_q(h_{t-1}, x_t)

h_t: current state

h_{t-1}: previous state

x_t: features (can be time dependent)

f_q: function with parameters q

MLTSA:

state space model (from week 4)

y_t=Hx_t+\epsilon_t;~~\epsilon_t \sim N(0,\Sigma^2_\epsilon)

x_{t} =\Phi x_{t-1} + \nu_t;~~\nu_t \sim N(0,\Sigma^2_\nu)

A State-space model is a model to derive the value of a time-dependent variable x(t), the state, generated by a noisy Markovian process, from observations of a variable y(t), also subject to noise, linearly related to the target variable

Definition

RNN architecture

input layer

output layer

RNN hidden layers

Simplest possible RNN

h_t = f_q(h_{t-1}, x_t)

h_t = \tanh(W_{hh}\cdot h_{t-1} + W_{xh}\cdot x_t)

y_t = W_{hy}\cdot h_{t}
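
The simplest RNN fits in a short loop. The sketch below assumes the two terms inside the tanh are summed and that the output uses the Why matrix of the diagram; the sizes and the random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_out, n_steps = 3, 4, 2, 6   # hypothetical sizes

W_xh = rng.normal(scale=0.5, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.5, size=(n_out, n_hidden))

def rnn_forward(xs, h0):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t over a sequence."""
    h = h0
    hs, ys = [], []
    for x_t in xs:                 # the same weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hs.append(h)
        ys.append(W_hy @ h)
    return np.array(hs), np.array(ys)

xs = rng.normal(size=(n_steps, n_features))
hs, ys = rnn_forward(xs, h0=np.zeros(n_hidden))
```

The state `h` is the only thing carried from one step to the next: everything the network remembers about the past has to fit in it.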

Whh

Wxh

Why

RNN architecture

input layer

Alternative graphical representation of RNN

h_t = f_q(h_{t-1}, x_t)

[diagram: unrolled RNN with hidden states h(t-1) ... h(t+4), outputs y(t) ... y(t+5), and the weights Wxh, Whh, Why repeated at every step]

the weights are the same! always the same Whh and Why

RNN architecture

applications

image captioning: one image to a sequence of words

sentiment analysis: sequence of words to one sentiment

language translator: sequence of words to sequence of words

online: video classification frame by frame

RNN architecture

more complicated RNNs

Some layers will be recurrent, others will not. Does not need to be fully connected

RNN architecture

[diagram: unrolled RNN with an input layer, hidden states h(t-1) ... h(t+4), outputs y(t) ... y(t+5), per-output errors e(t) ... e(t+5), and shared weights Wxh, Whh, Why]

each output has its own loss

h_t = W_h\phi(h_{t-1}) + W_{x}x(t)

y_t = W_y\phi(h_t)

\frac{\partial E}{\partial \theta} = \sum_{t=1}^{N}\frac{\partial E_t}{\partial \theta}

\frac{\partial E_t}{\partial W} =\sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}

vanishing gradient problem!

[diagram: unrolled RNN with an input layer, hidden states h(t-1) ... h(t+4), outputs y(t) ... y(t+5), and shared weights Wxh, Whh, Why]

Learns Fast!

Learns slow!

RNN obsesses over recent past, forgets remote past

vanishing gradient problem!

[diagram: unrolled RNN with an input layer, hidden states h(t-1) ... h(t+4), outputs y(t) ... y(t+5), per-output errors e(t) ... e(t+5), and shared weights Wxh, Whh, Why]

vanishing gradient problem is exacerbated by having the same set of weights.

The vanishing gradient problem causes early layers to not learn as effectively

The earlier layers learn from the remote past

As a result: vanilla RNN would only have short term memory (only learn from recent states)
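
The short-term memory can be seen numerically in a scalar toy RNN. The sketch below assumes a one-dimensional state h_t = tanh(w_hh h_{t-1}) with no input, and tracks the derivative of h_t with respect to the initial state; the weight 0.9 and the 50 steps are made-up values.

```python
import numpy as np

def gradient_through_time(w_hh, n_steps, h0=0.5):
    """Return |d h_t / d h_0| for t = 1..n_steps in a scalar RNN with no input."""
    h = h0
    grad = 1.0
    grads = []
    for _ in range(n_steps):
        h = np.tanh(w_hh * h)
        grad *= w_hh * (1.0 - h ** 2)   # chain rule: one factor per time step
        grads.append(abs(grad))
    return np.array(grads)

grads = gradient_through_time(w_hh=0.9, n_steps=50)
```

Every step multiplies the gradient by the same weight times a tanh derivative that is at most 1, so with |w_hh| < 1 the product shrinks geometrically: after 50 steps it is below 0.9**50, about 0.005. The remote past barely contributes to the weight update.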

Whh

LSTM

LSTM: long short term memory

solution to the vanishing gradient problem

in one (or 4) slide(s)

input gate:

do I update the current cell?

i^{(t)} = \sigma(W^i[h_{t-1},x_t] + b^i)

forget gate:

do I keep memory of this past step?

f^{(t)} = \sigma(W^f[h_{t-1},x_t] + b^f)
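
The two gates can be computed directly in NumPy. The sketch below assumes the bias is added inside the sigmoid and that [h_{t-1}, x_t] denotes the concatenation of the previous state and the current input; the sizes, random weights, and zero biases are hypothetical. A full LSTM cell also has an output gate and a candidate cell state, which are not shown here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_hidden, n_features = 4, 3                      # hypothetical sizes

# one weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]
W_i = rng.normal(size=(n_hidden, n_hidden + n_features))
W_f = rng.normal(size=(n_hidden, n_hidden + n_features))
b_i = np.zeros(n_hidden)
b_f = np.zeros(n_hidden)

h_prev = rng.normal(size=n_hidden)
x_t = rng.normal(size=n_features)
hx = np.concatenate([h_prev, x_t])

i_t = sigmoid(W_i @ hx + b_i)   # input gate: how much to update the cell
f_t = sigmoid(W_f @ hx + b_f)   # forget gate: how much past memory to keep
```

Each gate is a vector of numbers between 0 and 1, one per hidden unit, so the answer to "do I update?" and "do I keep memory?" is a soft, learned one rather than a yes/no switch.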
