Tutorial Neural Networks

Emmanuel Roux, Rémi Emonet, Odyssée Merveille

CONTENT

  • Introduction
      • Examples of applications
      • Historical background
      • General pipeline
  • Data
  • Model
  • Method
  • Conclusion
DALL-E-2

 

What do you think?

  • Is a legal disclosure "AI generated synthetic media" needed?
  • Which metrics could evaluate possible harms and misuses?

Link to an article from the Stanford Institute for
Human-Centered Artificial Intelligence

Natural Language Processing (NLP)

BigScience


46 different languages

176B parameters

training time: ~3-4 months

Link to Tensorboard

Ethical responsibility!

F. Urbina, F. Lentzos, C. Invernizzi, and S. Ekins, “Dual use of artificial-intelligence-powered drug discovery,” Nat Mach Intell, vol. 4, no. 3, Art. no. 3, Mar. 2022, doi: 10.1038/s42256-022-00465-9.

"We have spent decades using computers and AI to improve human health — not to degrade it. We were naive in thinking about the potential misuse of our trade, [...]."

Biochemical weapons design

On March 4th, 2022, the FTC, the U.S. agency in charge of consumer protection, ruled that an app developed by WW International (the Kurbo app) did not respect data-collection laws:

collected data: age, gender, height, weight, and lifestyle choices

collected from children younger than 13 without permission from a parent

delete the data | destroy the models | $1.5 million fine

In 2021, the FTC made Everalbum destroy models that used images uploaded by users who hadn't consented to face recognition.

App vendors punished by the U.S. government for building algorithms based on illegally collected data.

Better communication!

 

 

include cultural diversity in AI

transcribe disappearing languages

link to the NY Times article

sign language recognition

link to the CVPR2021 challenge

Inclusive storyteller

https://sina.ivow.ai/

Better healthcare!

Personalized treatment (MRI)

Stroke Prevention (TCD)

by PhD student Vindas Y.

Medical Image Analysis, 2022, doi: 10.1016/j.media.2022.102437.

by PhD student Fraissenon A.

https://rhu-cosy.com/

by PhD student Penarrubia L.

Medical Physics, 2022, doi: 10.1002/mp.15347.

Ventilation imaging (CT)

classification tasks

Decision Boundary

binary

multi-class
(multi-label)

segmentation

regression tasks

a model \phi maps input values x to predicted continuous value(s), e.g. \hat{y} = a \times x + b
  • Examples of applications
  • Historical background
  • General pipeline
  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

historical background

[Venn diagram: Deep Learning ⊂ Machine Learning ⊂ Artificial Intelligence]

Inspired by (and simplified from) the deeplearningbook.org
(I. Goodfellow, Y. Bengio, and A. Courville, 2016)
and from Sebastian Raschka's deep-learning course

Artificial Intelligence

historical background

Artificial Intelligence

DEDUCTIVE

 

rule-based

no need of examples

INDUCTIVE

 

example based

adaptation

Symbolic AI

connectionism

historical background

Cybernetics (40’s to 60’s)

connectionism

Perceptron (Rosenblatt)

ADALINE (Widrow & Hoff)

Homeostat, 1948

(W. Ross Ashby)

source wikipedia

https://isl.stanford.edu/~widrow/papers/t1960anadaptive.pdf

historical background

Symbolic Artificial Intelligence (60’s to 80’s)

Symbolic AI

MYCIN (Shortliffe): medical diagnosis (bacteria identification)

GUIDON (Clancey): teaching medical diagnostic strategy

CADUCEUS (Pople): internal medicine expert system

historical background

connectionism (80's to 00's)

machine learning (00's to 10's)

deep learning (10's - now)

Image by Dake, Mysid

historical background

  • Examples of applications
  • Historical background
  • General pipeline
  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

General pipeline

Dataset: the collection of (data, label) pairs

data: one sample, e.g. \left[ \begin{matrix} 0.2, 0.5 \end{matrix} \right]

label: the expected output for that sample

neural network: a model with parameters \mathcal{\theta} that takes a sample as input and produces an output

Is the output equal to the label?

loss function: \mathcal{L} = ( \mathrm{output} - \mathrm{label} )^2

gradient: \frac{\partial \mathcal{L}}{\partial \theta} , how the loss \mathcal{L} varies with the parameters \mathcal{\theta}

update: \theta_{n+1} = \theta_{n} + ...

In PyTorch-style code, one pass over the Dataset reads:

for data, label in dataloader:
    optimizer.zero_grad()                      # reset the gradients
    label_pred = model(data)                   # forward pass
    loss = ((label - label_pred)**2).mean()    # loss
    loss.backward()                            # loss gradient
    optimizer.step()                           # model update

Model: the neural network and its parameters \mathcal{\theta}

Method: the loss function, the gradient, and the update rule
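
For completeness, here is a minimal runnable sketch of the same loop; it assumes PyTorch, and the toy dataset, model, optimizer, and learning rate are illustrative choices, not part of the original slides.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression dataset: 256 samples with 2 features (illustrative values)
X = torch.randn(256, 2)
y = (X @ torch.tensor([1.5, -0.7]) + 0.3).unsqueeze(1)
dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))  # small MLP
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)            # gradient descent

for epoch in range(20):
    for data, label in dataloader:
        optimizer.zero_grad()                       # reset gradients
        label_pred = model(data)                    # forward pass
        loss = ((label - label_pred) ** 2).mean()   # squared-error loss
        loss.backward()                             # gradient of the loss w.r.t. theta
        optimizer.step()                            # parameter update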

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion
  • Pre-processing
  • Notations (2-D example)

CONTENT

data - pre-processing

  1. resampling
  2. feature scaling
  3. data augmentation

data - pre-processing

1. resampling


data - pre-processing

2. feature scaling

normalization (MinMax): rescale each feature ( x_1 , x_2 ) to a fixed range, e.g. [0, 1]

standardization: rescale each feature to \mu=0 , \sigma=1
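
A small numpy sketch of both scalings, applied column-wise; the feature values below are made up for illustration.

import numpy as np

X = np.array([[0.2, 50.0],
              [0.5, 30.0],
              [0.8, 80.0]])   # 3 samples, 2 features on very different scales

# MinMax normalization: each feature mapped to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each feature mapped to mean 0 and standard deviation 1
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)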

data - pre-processing

3. data augmentation

images: scale, crop (patches), rotate, flip, perspectives, filtering/noise, ...

audio and time-frequency representations: time stretching, ...
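
A minimal numpy sketch of a few image-style augmentations (the toy image and noise level are illustrative; real pipelines would typically use a library such as torchvision).

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))                              # toy grayscale image

flipped = np.fliplr(image)                                # horizontal flip
rotated = np.rot90(image)                                 # 90-degree rotation
patch = image[4:28, 4:28]                                 # crop a patch
noisy = image + 0.05 * rng.standard_normal(image.shape)   # additive noise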

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion
  • Pre-processing
  • Notations (2-D example)

CONTENT

data - notations

2D points example: the input space has two features \mathrm{feat}_1 and \mathrm{feat}_2

\textbf{X} : the data, N \ \mathrm{samples} ; each sample is a point, e.g. X_0 = \left[ \begin{matrix} 0.2 \\ 0.5 \\ \end{matrix} \right] , X_1 = \left[ \begin{matrix} 0.2 \\ 0.2 \\ \end{matrix} \right] , ..., X_{N-1} = \left[ \begin{matrix} 0.35 \\ 0.3 \\ \end{matrix} \right]

y : the labels; each sample X_i comes with its label y_i

\textbf{(X}, y \textbf{)} = \{(X_i, y_i)\}_{i=0,...,N-1}

\textbf{(X}, y \textbf{)} \sim \mathcal{D} : the dataset is drawn from an unknown distribution

\textbf{X} \sim \mathcal{D_2} : a new domain (data drawn from a different distribution)
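
A tiny numpy sketch of these notations, using the 2D points shown above (the binary labels are illustrative).

import numpy as np

# X: N samples, each with 2 features (feat_1, feat_2)
X = np.array([[0.2, 0.5],
              [0.2, 0.2],
              [0.35, 0.3]])
# y: one label per sample (illustrative values)
y = np.array([1, 0, 1])

N = X.shape[0]
dataset = list(zip(X, y))   # {(X_i, y_i)} for i = 0, ..., N-1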

General pipeline (recap): the Model is the neural network with its parameters \mathcal{\theta} ; the Method is the loss function \mathcal{L} , the gradient \frac{\partial \mathcal{L}}{\partial \theta} , and the update \theta_{n+1} = \theta_{n} + ...

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Artificial Neuron
  • Neural Network (NN)
  • XOR (with CooLearning)
  • Convolutions (CNN)

model - artificial neuron

Input: x_0 , x_1 , x_2 (e.g. 0.2, 0.4, 0.6)

3 weights w_0 , w_1 , w_2 and 1 bias b

weighted sum: z = \sum_i{w_i x_i} + b , e.g. z = 0.2 \times w_0 + 0.4 \times w_1 + 0.6 \times w_2 + b

Activation function \sigma : the output of the neuron is a = \sigma(z)
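
A minimal numpy sketch of this neuron; the weight and bias values are made up, and a sigmoid is used as the activation.

import numpy as np

x = np.array([0.2, 0.4, 0.6])    # inputs
w = np.array([0.1, -0.3, 0.5])   # 3 weights (illustrative values)
b = 0.05                         # 1 bias (illustrative value)

z = np.dot(w, x) + b             # weighted sum: z = sum_i w_i x_i + b
a = 1.0 / (1.0 + np.exp(-z))     # activation: a = sigma(z)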

model - artificial neuron

Activation functions

ReLU: a=\sigma(z) = \begin{cases} 0 & z < 0 \\ z & z \ge 0 \end{cases}

tanh: a=\sigma(z) = \tanh (z)

Sigmoid (logistic): a=\sigma(z) = \frac{1}{1+e^{-z}}

Softmax (for each class c ): \sigma(z)_c = \frac{e^{z_c}}{\sum_i{e^{z_i}}} , pseudo-probabilities with \sum_c \sigma(z)_c = 1
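
A short numpy sketch of these activation functions (the input vector z is illustrative).

import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # 0 for z < 0, z otherwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes z into (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))          # shifted for numerical stability
    return e / e.sum()                 # pseudo-probabilities summing to 1

z = np.array([0.5, -1.2, 2.0])
print(relu(z), np.tanh(z), sigmoid(z), softmax(z))
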
model - artificial neuron

2 inputs ( x_0 , x_1 ), sigmoid activation: 3-D interactive visualization of the output a as a function of the inputs (surface and top view)

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Artificial Neuron
  • Neural Network (NN)
  • XOR (with CooLearning)
  • Convolutions (CNN)
model - neural network

2-neurons input layer: each neuron computes a weighted sum of x_0 and x_1 (weights w_{0,0} , w_{0,1} , w_{1,0} , w_{1,1} , biases b_0 , b_1 ) followed by \sigma

1-neuron output layer (weights w_{0,2} , w_{1,2} , bias b_2 )

Perceptron

deep neural network (MLP)

model - neural network

inputs x_0 , ..., x_4 ; 5-neurons input layer; 3-neurons first hidden layer; 3-neurons second hidden layer; 1-neuron output layer

12 neurons
BUT 64 parameters!

5x5 + 5 = 30
5x3 + 3 = 18
3x3 + 3 = 12
3x1 + 1 = 4

model - neural network
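
A quick check of that parameter count, as a hedged PyTorch sketch with the layer sizes from the slide.

import torch.nn as nn

# 5 inputs -> 5 -> 3 -> 3 -> 1 neurons, as on the slide
model = nn.Sequential(
    nn.Linear(5, 5), nn.ReLU(),   # 5x5 + 5 = 30 parameters
    nn.Linear(5, 3), nn.ReLU(),   # 5x3 + 3 = 18
    nn.Linear(3, 3), nn.ReLU(),   # 3x3 + 3 = 12
    nn.Linear(3, 1),              # 3x1 + 1 = 4
)
print(sum(p.numel() for p in model.parameters()))   # 64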

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Artificial Neuron
  • Neural Network (NN)
  • XOR (with CooLearning)
  • Convolutions (CNN)

model - neural network

XOR (with CooLearning): interactive demo

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Artificial Neuron
  • Neural Network (NN)
  • XOR (with CooLearning)
  • Convolutions (CNN)
model - convolutions

x : 2 0 2 2 0 4 2 1

\mathrm{kernel} : 0 4 2

the kernel slides along x ; at each position the aligned values are multiplied and summed:

output: 4 12 8 8 20 10

model - convolutions

activation \sigma applied to the convolution output 4 12 8 8 20 10 gives, for example, 0.2 0.5 0.3 0.3 0.9 0.4

\mathrm{maxpooling} over windows of 2 values gives the output y : 0.5 0.3 0.9

model - convolutions

(zero) padding: zeros are added at both ends of x : 0 2 0 2 2 0 4 2 1 0

the output then has the same length as x : 8 4 12 8 8 20 10 4

model - convolutions

STRIDE: the kernel moves by more than one position at each step, producing a shorter output
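
A small numpy sketch of this 1-D convolution with the values from the slide; the plain loop makes the padded and strided variants explicit.

import numpy as np

x = np.array([2, 0, 2, 2, 0, 4, 2, 1])
kernel = np.array([0, 4, 2])

def conv1d(signal, kernel, stride=1):
    # slide the kernel, multiply the aligned values and sum them
    return np.array([np.dot(signal[i:i + len(kernel)], kernel)
                     for i in range(0, len(signal) - len(kernel) + 1, stride)])

print(conv1d(x, kernel))                   # [ 4 12  8  8 20 10]
x_padded = np.pad(x, 1)                    # zero padding: [0 2 0 2 2 0 4 2 1 0]
print(conv1d(x_padded, kernel))            # [ 8  4 12  8  8 20 10  4]
print(conv1d(x_padded, kernel, stride=2))  # stride of 2: every other position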

model - convolutions

2-D convolutions: a kernel W slides over the image X

a_{0,0} = \sigma (\sum W \circ X[0:2, 0:2])
a_{0,1} = \sigma (\sum W \circ X[0:2, 1:3])
a_{1,0} = \sigma (\sum W \circ X[1:3, 0:2])
a_{1,1} = \sigma (\sum W \circ X[1:3, 1:3])

\mathrm{maxpooling} over the output: max_{i,j}(a_{i,j})

several kernels applied to the same X give several feature maps: a=[a_0, a_1, a_2]

stacking such layers produces successive activations a^{(0)} , a^{(1)} , a^{(2)}
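
A hedged numpy sketch of one such 2-D convolution step (3x3 input, 2x2 kernel, ReLU activation; all values are made up).

import numpy as np

X = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [2., 1., 0.]])           # toy 3x3 input
W = np.array([[1., -1.],
              [0.,  2.]])              # toy 2x2 kernel
sigma = lambda z: np.maximum(0.0, z)   # ReLU as the activation

a = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # a_{i,j} = sigma( sum( W o X[i:i+2, j:j+2] ) )
        a[i, j] = sigma(np.sum(W * X[i:i+2, j:j+2]))

pooled = a.max()                       # maxpooling over the 2x2 output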

General pipeline (recap): the Model is the neural network with its parameters \mathcal{\theta} ; the Method is the loss function \mathcal{L} =( \mathrm{output} - \mathrm{label} )^2 , the gradient \frac{\partial \mathcal{L}}{\partial \theta} , and the update \theta_{n+1} = \theta_{n} + ...

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Loss function
  • Gradient Backpropagation (chain rule)
  • Model update
method - loss function

binary classification: labels y_i \in \{0, 1\} , NN output a_i (pseudo proba)

Likelihood (Bernoulli distribution): a_i is the pseudo-proba of the label being 1 , and (1-a_i) the pseudo-proba of it being 0

\prod_i a_{i}^{y_i} \times (1- a_{i})^{(1-y_i)}

log-likelihood (to maximize with respect to \mathcal{\theta} ):

\sum_i y_i \ln a_i + (1-y_i) \ln (1-a_i)

negative log-likelihood (the loss, to minimize):

\mathcal{L}(y, \hat{y}) = - \sum_i \left[ y_i \ln a_i + (1-y_i) \ln (1-a_i) \right] = \sum_i y_i \ln \frac{1}{a_i} + (1-y_i) \ln \frac{1}{1-a_i}
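
A short numpy sketch of this negative log-likelihood (binary cross-entropy); the labels and network outputs are illustrative.

import numpy as np

y = np.array([1, 0, 1, 1])           # labels
a = np.array([0.9, 0.2, 0.7, 0.4])   # NN outputs (pseudo probas)

# negative log-likelihood / binary cross-entropy
loss = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))
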
  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Loss function
  • Gradient Backpropagation (chain rule)
  • Model update

Gradient of the loss with respect to the model parameters

method - gradient backpropagation

\frac{\partial \mathcal{L}}{\partial w_0} < 0 : we want to increase w_0

\frac{\partial \mathcal{L}}{\partial w_1} > 0 : we want to decrease w_1

\frac{\partial \mathcal{L}}{\partial b} >> 0 : we want to decrease b (a lot!)

How to compute the gradient of the loss with respect to each model parameter?

reminder: chain rule

\frac{\partial f(g(x)) }{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} \qquad (f \circ g)'(x)=f'(g(x))\times g'(x)

method - gradient backpropagation

Compute the gradient of the loss with respect to each model parameter using the chain rule (2-inputs example), with z = \sum_i{w_i x_i} + b , a = \sigma(z) and loss \mathcal{L}(a, y) :

\frac{\partial \mathcal{L}}{\partial w_0} = \frac{\partial \mathcal{L}}{\partial \sigma} \frac{\partial \mathcal{\sigma}}{\partial z} \frac{\partial \mathcal{z}}{\partial w_0}

\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial \sigma} \frac{\partial \mathcal{\sigma}}{\partial z} \frac{\partial \mathcal{z}}{\partial w_1}

\frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial \sigma} \frac{\partial \mathcal{\sigma}}{\partial z} \frac{\partial \mathcal{z}}{\partial b}

gradient backpropagation

gradient vector contains
1 value for each parameter

method - gradient backpropagation

\frac{\partial \mathcal{L}}{\partial \theta} = \left[ \begin{matrix} \frac{\partial \mathcal{L}}{\partial w_0} \\ \\ \frac{\partial \mathcal{L}}{\partial w_1} \\ \\ \frac{\partial \mathcal{L}}{\partial b} \end{matrix} \right]
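
A minimal numpy sketch of these three derivatives for one sigmoid neuron with a squared-error loss (inputs, weights, bias, and label are illustrative; an autograd library such as PyTorch computes this automatically with loss.backward()).

import numpy as np

x = np.array([0.2, 0.5])       # 2 inputs (illustrative)
w = np.array([0.1, -0.3])      # weights (illustrative)
b = 0.05                       # bias (illustrative)
y = 1.0                        # label

z = np.dot(w, x) + b           # weighted sum
a = 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
L = (a - y) ** 2               # squared-error loss

dL_da = 2 * (a - y)            # dL/da
da_dz = a * (1 - a)            # dsigma/dz for the sigmoid
dL_dw = dL_da * da_dz * x      # chain rule: dL/dw_i = dL/da * da/dz * x_i
dL_db = dL_da * da_dz          # dz/db = 1
grad = np.append(dL_dw, dL_db) # gradient vector: one value per parameter
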
  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Loss function
  • Gradient Backpropagation (chain rule)
  • Model update

gradient descent

\theta_{n+1} = \theta_n -\lambda \frac{\partial \mathcal{L}}{\partial \theta}

method - model update

\left[ \begin{matrix} w_{0,0} \\ w_{0,1} \\ b_0 \\ \vdots \\ w_{2,0} \\ b_2 \\ \end{matrix} \right]_{n+1} =
\left[ \begin{matrix} w_{0,0} \\ w_{0,1} \\ b_0 \\ \vdots \\ w_{2,0} \\ b_2 \\ \end{matrix} \right]_{n} -\lambda
\left[ \begin{matrix} 0.2 \\ -0.4 \\ 0.6 \\ \vdots \\ 0.4 \\ -0.8 \\ \end{matrix} \right]

gradient descent

\theta_{n+1} = \theta_n -\lambda \frac{\partial \mathcal{L}}{\partial \theta}

method - model update

each parameter moves in the direction opposite to its gradient, scaled by the learning rate \lambda : \frac{\partial \mathcal{L}}{\partial w_0} < 0 so w_0 increases; \frac{\partial \mathcal{L}}{\partial w_1} > 0 so w_1 decreases; \frac{\partial \mathcal{L}}{\partial b} >> 0 so b decreases a lot
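
A one-step numpy sketch of this update; the parameter and gradient values are illustrative (signs chosen to match the example above).

import numpy as np

theta = np.array([0.5, -0.1, 0.2])   # current parameters [w_0, w_1, b] (illustrative)
grad = np.array([-0.4, 0.6, 2.0])    # dL/dtheta (illustrative, signs as above)
lam = 0.1                            # learning rate lambda (illustrative)

theta = theta - lam * grad           # theta_{n+1} = theta_n - lambda * dL/dtheta
# w_0 increases, w_1 decreases, b decreases a lot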

method - model update

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

Conclusion

General pipeline (recap)

Dataset: data and labels

Model: the neural network and its parameters \mathcal{\theta}

Method: the loss function \mathcal{L} =( \mathrm{output} - \mathrm{label} )^2 , the gradient \frac{\partial \mathcal{L}}{\partial \theta} , and the update \theta_{n+1} = \theta_{n} + ...

Evaluation methodology

data split: split the dataset ( \textbf{X} , y ) into subsets; train the Model (neural network, \mathcal{\theta} ) on one subset, then lock parameters and evaluate on held-out data
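
A hedged numpy sketch of a simple hold-out split (the data, labels, and 80/20 fractions are illustrative).

import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.random((N, 2))                    # toy data
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # toy labels

idx = rng.permutation(N)                  # shuffle the sample indices
n_train = int(0.8 * N)                    # 80% train / 20% test
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
# train on (X_train, y_train); lock the parameters; evaluate on (X_test, y_test)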

little break!