Tutorial Neural Networks

Emmanuel Roux, Rémi Emonet, Odyssée Merveille

CONTENT

  • Introduction
      • Examples of applications
      • Historical background
      • General pipeline
  • Data
  • Model
  • Method
  • Conclusion
DALL-E-2

 

What do you think?

  • Is a legal disclosure "AI generated synthetic media" needed?
  • Which metrics could evaluate possible harms and misuses?

Link to an article from the Stanford Institute for
Human-Centered Artificial Intelligence

Natural Language Processing (NLP)

BigScience


46 different languages

176B parameters

training time: ~3-4 months

Link to Tensorboard

Ethical responsibility!

F. Urbina, F. Lentzos, C. Invernizzi, and S. Ekins, “Dual use of artificial-intelligence-powered drug discovery,” Nat Mach Intell, vol. 4, no. 3, Art. no. 3, Mar. 2022, doi: 10.1038/s42256-022-00465-9.

"We have spent decades using computers and AI to improve human health — not to degrade it. We were naive in thinking about the potential misuse of our trade, [...]."

Biochemical weapons design

On March 4th, 2022, the FTC, the U.S. agency in charge of consumer protection, ruled that an app developed by WW International (the Kurbo app) did not respect data-collection laws:

collected data: age, gender, height, weight, and lifestyle choices

collected from children younger than 13 without permission from a parent

delete the data | destroy the models | $1.5 million fine

In 2021, the FTC made Everalbum destroy models that used images uploaded by users who hadn't consented to face recognition.

App vendors punished by the U.S. government for building algorithms based on illegally collected data.

Better communication!

 

 

include cultural diversity in AI

transcribe disappearing languages

link to the NY Times article

sign language recognition

link to the CVPR2021 challenge

Inclusive storyteller

https://sina.ivow.ai/

Better healthcare!

Personalized treatment (MRI)

Stroke Prevention (TCD)

by PhD student Vindas Y.

Medical Image Analysis, 2022, doi: 10.1016/j.media.2022.102437.

by PhD student Fraissenon A.

https://rhu-cosy.com/

by PhD student Penarrubia L.

Medical Physics, 2022, doi: 10.1002/mp.15347.

Ventilation imaging (CT)

classification tasks

Decision Boundary

binary

multi-class
(multi-label)

segmentation

regression tasks

a model \phi maps input values x to predicted continuous value(s), e.g. \hat{y} = a \times x + b
  • Examples of applications
  • Historical background
  • General pipeline
  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

historical background

[Venn diagram: Deep Learning ⊂ Machine Learning ⊂ Artificial Intelligence]

Inspired by (and simplified from) the deeplearningbook.org
(I. Goodfellow, Y. Bengio, and A. Courville, 2016)
and from Sebastian Raschka's deep-learning course

Artificial Intelligence

historical background

Artificial Intelligence

DEDUCTIVE

 

rule-based

no need of examples

INDUCTIVE

 

example based

adaptation

Symbolic AI

connectionism

historical background

Cybernetics (40’s to 60’s)

connectionism

Perceptron (Rosenblatt)

ADALINE (Widrow & Hoff)

Homeostat, 1948

(W. Ross Ashby)

source wikipedia

https://isl.stanford.edu/~widrow/papers/t1960anadaptive.pdf

historical background

Symbolic Artificial Intelligence (60’s to 80’s)

Symbolic AI

MYCIN (Shortliffe): medical diagnosis (bacteria identification)

GUIDON (Clancey): teaching medical diagnostic strategy

CADUCEUS (Pople): internal medicine expert system

historical background

connectionism (80's to 00's)

machine learning (00's to 10's)

deep learning (10's - now)

Image by Dake, Mysid

historical background

  • Examples of applications
  • Historical background
  • General pipeline
  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

General pipeline

Dataset: the collection of (data, label) pairs

data: one sample, e.g. \left[ \begin{matrix} 0.2, 0.5 \end{matrix} \right]

label: the expected output for that sample

neural network: a model with parameters \mathcal{\theta} that takes a sample as input and produces an output

Is the output equal to the label?

loss function: \mathcal{L} = ( \mathrm{output} - \mathrm{label} )^2

gradient: \frac{\partial \mathcal{L}}{\partial \theta} , how the loss \mathcal{L} varies with the parameters \mathcal{\theta}

update: \theta_{n+1} = \theta_{n} + ...

In PyTorch-style code, one pass over the Dataset reads:

for data, label in dataloader:
    optimizer.zero_grad()                      # reset the gradients
    label_pred = model(data)                   # forward pass
    loss = ((label - label_pred)**2).mean()    # loss
    loss.backward()                            # loss gradient
    optimizer.step()                           # model update

Model: the neural network and its parameters \mathcal{\theta}

Method: the loss function, the gradient, and the update rule
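
For completeness, here is a minimal runnable sketch of the same loop; it assumes PyTorch, and the toy dataset, model, optimizer, and learning rate are illustrative choices, not part of the original slides.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression dataset: 256 samples with 2 features (illustrative values)
X = torch.randn(256, 2)
y = (X @ torch.tensor([1.5, -0.7]) + 0.3).unsqueeze(1)
dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))  # small MLP
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)            # gradient descent

for epoch in range(20):
    for data, label in dataloader:
        optimizer.zero_grad()                       # reset gradients
        label_pred = model(data)                    # forward pass
        loss = ((label - label_pred) ** 2).mean()   # squared-error loss
        loss.backward()                             # gradient of the loss w.r.t. theta
        optimizer.step()                            # parameter update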

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion
  • Pre-processing
  • Notations (2-D example)

CONTENT

data - pre-processing

  1. resampling
  2. feature scaling
  3. data augmentation

data - pre-processing

1. resampling


data - pre-processing

2. feature scaling

normalization (MinMax): rescale each feature ( x_1 , x_2 ) to a fixed range, e.g. [0, 1]

standardization: rescale each feature to \mu=0 , \sigma=1
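
A small numpy sketch of both scalings, applied column-wise; the feature values below are made up for illustration.

import numpy as np

X = np.array([[0.2, 50.0],
              [0.5, 30.0],
              [0.8, 80.0]])   # 3 samples, 2 features on very different scales

# MinMax normalization: each feature mapped to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each feature mapped to mean 0 and standard deviation 1
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)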

data - pre-processing

3. data augmentation

images: scale, crop (patches), rotate, flip, perspectives, filtering/noise, ...

audio and time-frequency representations: time stretching, ...
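
A minimal numpy sketch of a few image-style augmentations (the toy image and noise level are illustrative; real pipelines would typically use a library such as torchvision).

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))                              # toy grayscale image

flipped = np.fliplr(image)                                # horizontal flip
rotated = np.rot90(image)                                 # 90-degree rotation
patch = image[4:28, 4:28]                                 # crop a patch
noisy = image + 0.05 * rng.standard_normal(image.shape)   # additive noise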

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion
  • Pre-processing
  • Notations (2-D example)

CONTENT

data - notations

2D points example: the input space has two features \mathrm{feat}_1 and \mathrm{feat}_2

\textbf{X} : the data, N \ \mathrm{samples} ; each sample is a point, e.g. X_0 = \left[ \begin{matrix} 0.2 \\ 0.5 \\ \end{matrix} \right] , X_1 = \left[ \begin{matrix} 0.2 \\ 0.2 \\ \end{matrix} \right] , ..., X_{N-1} = \left[ \begin{matrix} 0.35 \\ 0.3 \\ \end{matrix} \right]

y : the labels; each sample X_i comes with its label y_i

\textbf{(X}, y \textbf{)} = \{(X_i, y_i)\}_{i=0,...,N-1}

\textbf{(X}, y \textbf{)} \sim \mathcal{D} : the dataset is drawn from an unknown distribution

\textbf{X} \sim \mathcal{D_2} : a new domain (data drawn from a different distribution)
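
A tiny numpy sketch of these notations, using the 2D points shown above (the binary labels are illustrative).

import numpy as np

# X: N samples, each with 2 features (feat_1, feat_2)
X = np.array([[0.2, 0.5],
              [0.2, 0.2],
              [0.35, 0.3]])
# y: one label per sample (illustrative values)
y = np.array([1, 0, 1])

N = X.shape[0]
dataset = list(zip(X, y))   # {(X_i, y_i)} for i = 0, ..., N-1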

General pipeline (recap): the Model is the neural network with its parameters \mathcal{\theta} ; the Method is the loss function \mathcal{L} , the gradient \frac{\partial \mathcal{L}}{\partial \theta} , and the update \theta_{n+1} = \theta_{n} + ...

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Artificial Neuron
  • Neural Network (NN)
  • XOR (with CooLearning)
  • Convolutions (CNN)

model - artificial neuron

Input: x_0 , x_1 , x_2 (e.g. 0.2, 0.4, 0.6)

3 weights w_0 , w_1 , w_2 and 1 bias b

weighted sum: z = \sum_i{w_i x_i} + b , e.g. z = 0.2 \times w_0 + 0.4 \times w_1 + 0.6 \times w_2 + b

Activation function \sigma : the output of the neuron is a = \sigma(z)
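
A minimal numpy sketch of this neuron; the weight and bias values are made up, and a sigmoid is used as the activation.

import numpy as np

x = np.array([0.2, 0.4, 0.6])    # inputs
w = np.array([0.1, -0.3, 0.5])   # 3 weights (illustrative values)
b = 0.05                         # 1 bias (illustrative value)

z = np.dot(w, x) + b             # weighted sum: z = sum_i w_i x_i + b
a = 1.0 / (1.0 + np.exp(-z))     # activation: a = sigma(z)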

model - artificial neuron

Activation functions

ReLU: a=\sigma(z) = \begin{cases} 0 & z < 0 \\ z & z \ge 0 \end{cases}

tanh: a=\sigma(z) = \tanh (z)

Sigmoid (logistic): a=\sigma(z) = \frac{1}{1+e^{-z}}

Softmax (for each class c ): \sigma(z)_c = \frac{e^{z_c}}{\sum_i{e^{z_i}}} , pseudo-probabilities with \sum_c \sigma(z)_c = 1
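
A short numpy sketch of these activation functions (the input vector z is illustrative).

import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # 0 for z < 0, z otherwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes z into (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))          # shifted for numerical stability
    return e / e.sum()                 # pseudo-probabilities summing to 1

z = np.array([0.5, -1.2, 2.0])
print(relu(z), np.tanh(z), sigmoid(z), softmax(z))
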
model - artificial neuron

2 inputs ( x_0 , x_1 ), sigmoid activation: 3-D interactive visualization of the output a as a function of the inputs (surface and top view)

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Artificial Neuron
  • Neural Network (NN)
  • XOR (with CooLearning)
  • Convolutions (CNN)
model - neural network

2-neurons input layer: each neuron computes a weighted sum of x_0 and x_1 (weights w_{0,0} , w_{0,1} , w_{1,0} , w_{1,1} , biases b_0 , b_1 ) followed by \sigma

1-neuron output layer (weights w_{0,2} , w_{1,2} , bias b_2 )

Perceptron

deep neural network (MLP)

model - neural network

inputs x_0 , ..., x_4 ; 5-neurons input layer; 3-neurons first hidden layer; 3-neurons second hidden layer; 1-neuron output layer

12 neurons
BUT 64 parameters!

5x5 + 5 = 30
5x3 + 3 = 18
3x3 + 3 = 12
3x1 + 1 = 4

model - neural network
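
A quick check of that parameter count, as a hedged PyTorch sketch with the layer sizes from the slide.

import torch.nn as nn

# 5 inputs -> 5 -> 3 -> 3 -> 1 neurons, as on the slide
model = nn.Sequential(
    nn.Linear(5, 5), nn.ReLU(),   # 5x5 + 5 = 30 parameters
    nn.Linear(5, 3), nn.ReLU(),   # 5x3 + 3 = 18
    nn.Linear(3, 3), nn.ReLU(),   # 3x3 + 3 = 12
    nn.Linear(3, 1),              # 3x1 + 1 = 4
)
print(sum(p.numel() for p in model.parameters()))   # 64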

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Artificial Neuron
  • Neural Network (NN)
  • XOR (with CooLearning)
  • Convolutions (CNN)

model - neural network

XOR (with CooLearning): interactive demo

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Artificial Neuron
  • Neural Network (NN)
  • XOR (with CooLearning)
  • Convolutions (CNN)
model - convolutions

x : 2 0 2 2 0 4 2 1

\mathrm{kernel} : 0 4 2

the kernel slides along x ; at each position the aligned values are multiplied and summed:

output: 4 12 8 8 20 10

model - convolutions

activation \sigma applied to the convolution output 4 12 8 8 20 10 gives, for example, 0.2 0.5 0.3 0.3 0.9 0.4

\mathrm{maxpooling} over windows of 2 values gives the output y : 0.5 0.3 0.9

model - convolutions

(zero) padding: zeros are added at both ends of x : 0 2 0 2 2 0 4 2 1 0

the output then has the same length as x : 8 4 12 8 8 20 10 4

model - convolutions

STRIDE: the kernel moves by more than one position at each step, producing a shorter output
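
A small numpy sketch of this 1-D convolution with the values from the slide; the plain loop makes the padded and strided variants explicit.

import numpy as np

x = np.array([2, 0, 2, 2, 0, 4, 2, 1])
kernel = np.array([0, 4, 2])

def conv1d(signal, kernel, stride=1):
    # slide the kernel, multiply the aligned values and sum them
    return np.array([np.dot(signal[i:i + len(kernel)], kernel)
                     for i in range(0, len(signal) - len(kernel) + 1, stride)])

print(conv1d(x, kernel))                   # [ 4 12  8  8 20 10]
x_padded = np.pad(x, 1)                    # zero padding: [0 2 0 2 2 0 4 2 1 0]
print(conv1d(x_padded, kernel))            # [ 8  4 12  8  8 20 10  4]
print(conv1d(x_padded, kernel, stride=2))  # stride of 2: every other position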

model - convolutions

2-D convolutions: a kernel W slides over the image X

a_{0,0} = \sigma (\sum W \circ X[0:2, 0:2])
a_{0,1} = \sigma (\sum W \circ X[0:2, 1:3])
a_{1,0} = \sigma (\sum W \circ X[1:3, 0:2])
a_{1,1} = \sigma (\sum W \circ X[1:3, 1:3])

\mathrm{maxpooling} over the output: max_{i,j}(a_{i,j})

several kernels applied to the same X give several feature maps: a=[a_0, a_1, a_2]

stacking such layers produces successive activations a^{(0)} , a^{(1)} , a^{(2)}
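
A hedged numpy sketch of one such 2-D convolution step (3x3 input, 2x2 kernel, ReLU activation; all values are made up).

import numpy as np

X = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [2., 1., 0.]])           # toy 3x3 input
W = np.array([[1., -1.],
              [0.,  2.]])              # toy 2x2 kernel
sigma = lambda z: np.maximum(0.0, z)   # ReLU as the activation

a = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # a_{i,j} = sigma( sum( W o X[i:i+2, j:j+2] ) )
        a[i, j] = sigma(np.sum(W * X[i:i+2, j:j+2]))

pooled = a.max()                       # maxpooling over the 2x2 output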

General pipeline (recap): the Model is the neural network with its parameters \mathcal{\theta} ; the Method is the loss function \mathcal{L} =( \mathrm{output} - \mathrm{label} )^2 , the gradient \frac{\partial \mathcal{L}}{\partial \theta} , and the update \theta_{n+1} = \theta_{n} + ...

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Loss function
  • Gradient Backpropagation (chain rule)
  • Model update
method - loss function

binary classification: labels y_i \in \{0, 1\} , NN output a_i (pseudo proba)

Likelihood (Bernoulli distribution): a_i is the pseudo-proba of the label being 1 , and (1-a_i) the pseudo-proba of it being 0

\prod_i a_{i}^{y_i} \times (1- a_{i})^{(1-y_i)}

log-likelihood (to maximize with respect to \mathcal{\theta} ):

\sum_i y_i \ln a_i + (1-y_i) \ln (1-a_i)

negative log-likelihood (the loss, to minimize):

\mathcal{L}(y, \hat{y}) = - \sum_i \left[ y_i \ln a_i + (1-y_i) \ln (1-a_i) \right] = \sum_i y_i \ln \frac{1}{a_i} + (1-y_i) \ln \frac{1}{1-a_i}
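
A short numpy sketch of this negative log-likelihood (binary cross-entropy); the labels and network outputs are illustrative.

import numpy as np

y = np.array([1, 0, 1, 1])           # labels
a = np.array([0.9, 0.2, 0.7, 0.4])   # NN outputs (pseudo probas)

# negative log-likelihood / binary cross-entropy
loss = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))
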
  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Loss function
  • Gradient Backpropagation (chain rule)
  • Model update

Gradient of the loss with respect to the model parameters

method - gradient backpropagation

\frac{\partial \mathcal{L}}{\partial w_0} < 0 : we want to increase w_0

\frac{\partial \mathcal{L}}{\partial w_1} > 0 : we want to decrease w_1

\frac{\partial \mathcal{L}}{\partial b} >> 0 : we want to decrease b (a lot!)

How to compute the gradient of the loss with respect to each model parameter?

reminder: chain rule

\frac{\partial f(g(x)) }{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} \qquad (f \circ g)'(x)=f'(g(x))\times g'(x)

method - gradient backpropagation

Compute the gradient of the loss with respect to each model parameter using the chain rule (2-inputs example), with z = \sum_i{w_i x_i} + b , a = \sigma(z) and loss \mathcal{L}(a, y) :

\frac{\partial \mathcal{L}}{\partial w_0} = \frac{\partial \mathcal{L}}{\partial \sigma} \frac{\partial \mathcal{\sigma}}{\partial z} \frac{\partial \mathcal{z}}{\partial w_0}

\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial \sigma} \frac{\partial \mathcal{\sigma}}{\partial z} \frac{\partial \mathcal{z}}{\partial w_1}

\frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial \sigma} \frac{\partial \mathcal{\sigma}}{\partial z} \frac{\partial \mathcal{z}}{\partial b}

gradient backpropagation

gradient vector contains
1 value for each parameter

method - gradient backpropagation

\frac{\partial \mathcal{L}}{\partial \theta} = \left[ \begin{matrix} \frac{\partial \mathcal{L}}{\partial w_0} \\ \\ \frac{\partial \mathcal{L}}{\partial w_1} \\ \\ \frac{\partial \mathcal{L}}{\partial b} \end{matrix} \right]
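
A minimal numpy sketch of these three derivatives for one sigmoid neuron with a squared-error loss (inputs, weights, bias, and label are illustrative; an autograd library such as PyTorch computes this automatically with loss.backward()).

import numpy as np

x = np.array([0.2, 0.5])       # 2 inputs (illustrative)
w = np.array([0.1, -0.3])      # weights (illustrative)
b = 0.05                       # bias (illustrative)
y = 1.0                        # label

z = np.dot(w, x) + b           # weighted sum
a = 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
L = (a - y) ** 2               # squared-error loss

dL_da = 2 * (a - y)            # dL/da
da_dz = a * (1 - a)            # dsigma/dz for the sigmoid
dL_dw = dL_da * da_dz * x      # chain rule: dL/dw_i = dL/da * da/dz * x_i
dL_db = dL_da * da_dz          # dz/db = 1
grad = np.append(dL_dw, dL_db) # gradient vector: one value per parameter
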
  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

  • Loss function
  • Gradient Backpropagation (chain rule)
  • Model update

gradient descent

\theta_{n+1} = \theta_n -\lambda \frac{\partial \mathcal{L}}{\partial \theta}

method - model update

\left[ \begin{matrix} w_{0,0} \\ w_{0,1} \\ b_0 \\ \vdots \\ w_{2,0} \\ b_2 \\ \end{matrix} \right]_{n+1} =
\left[ \begin{matrix} w_{0,0} \\ w_{0,1} \\ b_0 \\ \vdots \\ w_{2,0} \\ b_2 \\ \end{matrix} \right]_{n} -\lambda
\left[ \begin{matrix} 0.2 \\ -0.4 \\ 0.6 \\ \vdots \\ 0.4 \\ -0.8 \\ \end{matrix} \right]

gradient descent

\theta_{n+1} = \theta_n -\lambda \frac{\partial \mathcal{L}}{\partial \theta}

method - model update

each parameter moves in the direction opposite to its gradient, scaled by the learning rate \lambda : \frac{\partial \mathcal{L}}{\partial w_0} < 0 so w_0 increases; \frac{\partial \mathcal{L}}{\partial w_1} > 0 so w_1 decreases; \frac{\partial \mathcal{L}}{\partial b} >> 0 so b decreases a lot
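
A one-step numpy sketch of this update; the parameter and gradient values are illustrative (signs chosen to match the example above).

import numpy as np

theta = np.array([0.5, -0.1, 0.2])   # current parameters [w_0, w_1, b] (illustrative)
grad = np.array([-0.4, 0.6, 2.0])    # dL/dtheta (illustrative, signs as above)
lam = 0.1                            # learning rate lambda (illustrative)

theta = theta - lam * grad           # theta_{n+1} = theta_n - lambda * dL/dtheta
# w_0 increases, w_1 decreases, b decreases a lot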

method - model update

  • Introduction
  • Data
  • Model
  • Method
  • Conclusion

CONTENT

Conclusion

General pipeline (recap)

Dataset: data and labels

Model: the neural network and its parameters \mathcal{\theta}

Method: the loss function \mathcal{L} =( \mathrm{output} - \mathrm{label} )^2 , the gradient \frac{\partial \mathcal{L}}{\partial \theta} , and the update \theta_{n+1} = \theta_{n} + ...

Evaluation methodology

data split: split the dataset ( \textbf{X} , y ) into subsets; train the Model (neural network, \mathcal{\theta} ) on one subset, then lock parameters and evaluate on held-out data
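
A hedged numpy sketch of a simple hold-out split (the data, labels, and 80/20 fractions are illustrative).

import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.random((N, 2))                    # toy data
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # toy labels

idx = rng.permutation(N)                  # shuffle the sample indices
n_train = int(0.8 * N)                    # 80% train / 20% test
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
# train on (X_train, y_train); lock the parameters; evaluate on (X_test, y_test)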

little break!