Deep Learning

- Crash Course - part II -

UNIFEI - May, 2018

[See part I here]

Topics

  1. Why is deep learning so popular now?
  2. Introduction to Machine Learning
  3. Introduction to Neural Networks
  4. Deep Neural Networks
  5. Training and measuring the performance of the N.N.
  6. Regularization
  7. Creating Deep Learning Projects
  8. CNNs and RNNs
  9. (OPTIONAL): Tensorflow intro

7. DL Projects: Optimization tricks

1. Normalize the input data

x_1 = 10000

x_2 = 0.026

1. Normalize the input data

It helps to minimize the cost function

[Contour plots of the cost J: elongated contours when the inputs are not normalized, nearly circular contours when they are normalized]
2. Exploding Gradients

The derivatives can get too big

Weight initialization => prevents Z from blowing up

variance(w_i) = \frac{2}{n}

(for ReLU)
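A minimal sketch of this "He" initialization for a ReLU layer in NumPy (the layer sizes are just examples): the weights are drawn with variance 2/n, where n is the number of inputs feeding the layer.

import numpy as np

def init_layer(n_in, n_out):
    W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)  # variance(w_i) = 2/n
    b = np.zeros((n_out, 1))
    return W, b

W1, b1 = init_layer(n_in=1024, n_out=128)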

3. Gradient checking

"Debugger" for backprop

Whiteboard time
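A minimal sketch of gradient checking: compare the backprop gradient with a two-sided numerical estimate. Here `cost` and `grad` are hypothetical stand-ins for the real forward pass J(theta) and the backprop gradient.

import numpy as np

def gradient_check(cost, grad, theta, eps=1e-7):
    numerical = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        numerical[i] = (cost(plus) - cost(minus)) / (2 * eps)
    analytic = grad(theta)
    # Relative difference: ~1e-7 is fine, ~1e-3 or larger usually means a bug in backprop
    return np.linalg.norm(numerical - analytic) / (np.linalg.norm(numerical) + np.linalg.norm(analytic))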

4. Alternatives to the gradient descent

Is there any problem with gradient descent?

If m is large, what happens?

J(W, b) = \frac{1}{m} \sum_{i=1}^{m} [L(\hat{y}, y) ]

If m is large, will this computation be fast?

What can we do?

Split this large block of computation

\sum_{i=1}^{batch} [L(\hat{y}, y)] \qquad \sum_{i=1}^{batch} [L(\hat{y}, y)] \qquad \sum_{i=1}^{batch} [L(\hat{y}, y)]

J(W, b) = \frac{1}{m} \sum_{i=1}^{m} [batches]

Mini-batch gradient descent

1. Split the training data into batches

X \rightarrow X^{\{1\}}, X^{\{2\}}, X^{\{3\}}, \dots

Y \rightarrow Y^{\{1\}}, Y^{\{2\}}, Y^{\{3\}}, \dots

Example: Training data = 1 million

Split into batches of 1k elements

1k batches of 1k elements
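A minimal sketch of step 1 in NumPy: shuffle the training set and split it into mini-batches X^{1}, X^{2}, ... (columns are examples). With m = 1,000,000 and batch_size = 1000 this produces the 1k batches of 1k elements above.

import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)            # shuffle before splitting
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, t:t + batch_size], Y[:, t:t + batch_size])
            for t in range(0, m, batch_size)]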

2. Algorithm (whiteboard)

What is the size of the mini-batch?

mini-batch size == m ----> (batch) gradient descent

mini-batch size == 1 ----> stochastic gradient descent

We tend to select a value between 1 and m; usually a power of two (64, 128, 256, 512)

Why does mini-batch gradient descent work?

Benefits of vectorization

Making progress without needing to wait for too long

We tend to select a value between 1 and m; usually a power of two (64, 128, 256, 512)

Can you think of other strategies that could make gradient descent faster?

If we make the learning rate smaller, it reduces the oscillations, but it also slows down learning

What would be an ideal solution?

Slower learning in the direction of the oscillations

Faster learning in the direction of the minimum

Can we tell the gradient to go faster once it finds a good direction?

Gradient descent with momentum

V_{dW} = \beta V_{dW} + (1 - \beta) dW

It helps gradient descent take a straighter path toward the minimum

V_{db} = \beta V_{db} + (1 - \beta) db
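A minimal sketch of this update rule in plain Python (beta around 0.9 is a common default; dW and db are the gradients from backprop on the current mini-batch):

def momentum_update(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    # Exponentially weighted average of the gradients
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Step in the averaged direction instead of the raw gradient
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db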

What is the inconvenience of these strategies?

Now we have many more parameters to tune manually

\alpha, \beta, \text{batch size}

Hyperparameters

Are there MOAR strategies?

RMSProp

Adam (Momentum + RMSProp)

The previous strategies work with the derivatives.

Is there something else we could do to speed up our algorithms?

\alpha

with a variable value

Bigger in the earlier phases

Smaller in the later phases

Learning rate decay

\alpha = \frac{1}{1 + \text{decayRate} \cdot \text{epochNum}}\ \alpha_0

Did you notice something interesting about the previous formula?

\alpha = \frac{1}{1 + \text{decayRate} \cdot \text{epochNum}}\ \alpha_0
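A minimal sketch of this decay schedule (alpha_0 and decay_rate are hyperparameters you pick):

def decayed_learning_rate(epoch_num, alpha_0=0.2, decay_rate=1.0):
    return alpha_0 / (1.0 + decay_rate * epoch_num)

# epoch 0 -> 0.200, epoch 1 -> 0.100, epoch 2 -> 0.067, epoch 3 -> 0.050, ...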

How do we select hyperparameters? 

A: Batch normalization

It normalizes the hidden layers, and not only the input data

Batch Normalization

We normalize a, so that we train W and b faster

In practice, we compute a normalized value for Z:

Z^{(i)}_{norm} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}

\tilde{Z}^{(i)} = \gamma Z^{(i)}_{norm} + \beta
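A minimal sketch of the two equations above for one layer: Z is assumed to have shape (units, examples in the mini-batch), and gamma and beta are learnable, one value per unit.

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta   # Z tilde, fed into the activation g(.)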

Batch Normalization

[Diagram: a small network with inputs x_1, x_2, x_3 and output \hat{y}]

z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1, \quad a^{[1]}_1 = \sigma(z^{[1]}_1)

z^{[1]}_2 = w^{[1]T}_2 x + b^{[1]}_2, \quad a^{[1]}_2 = \sigma(z^{[1]}_2)

z^{[1]}_3 = w^{[1]T}_3 x + b^{[1]}_3, \quad a^{[1]}_3 = \sigma(z^{[1]}_3)

z^{[2]}_1 = w^{[2]T}_1 a^{[1]} + b^{[2]}_1, \quad a^{[2]}_1 = \sigma(z^{[2]}_1)

X \xrightarrow{W^{[1]},\, b^{[1]}} Z^{[1]} \xrightarrow[\text{Batch Norm}]{\gamma^{[1]},\, \beta^{[1]}} \tilde{Z}^{[1]} \rightarrow A^{[1]} = g^{[1]}(\tilde{Z}^{[1]})

Batch Normalization

We act on the cache Z

What are the parameters \gamma, \beta?

They are learnable parameters

Parameters of the Neural Net:

W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}

\gamma^{[1]}, \beta^{[1]}, \gamma^{[2]}, \beta^{[2]}

Batch Normalization

We often use B.N. with mini-batch/stochastic gradient descent

Why does this process work?

The values of the hidden units will be on the same scale

It also makes later layers more robust to changes in the weights of earlier layers

It reduces the internal covariate shift

Topics

  1. Why is deep learning so popular now?
  2. Introduction to Machine Learning
  3. Introduction to Neural Networks
  4. Deep Neural Networks
  5. Training and measuring the performance of the N.N.
  6. Regularization
  7. Creating Deep Learning Projects
  8. CNNs and RNNs
  9. (OPTIONAL): Tensorflow intro

How do we extract information from an image?

Speech/text

J'aime le Canada <3

Detect vertical edges

Detect horizontal edges

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

6x6

How can we obtain the "edges"?

We can "scan" the image

What is the mathematical operation that represents a scan?

Convolution


3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

6x6

*

1 0 -1
1 0 -1
1 0 -1

Filter

3x3

First element: 3*1 + 0*0 + 1*(-1) + 1*1 + 5*0 + 8*(-1) + 2*1 + 7*0 + 2*(-1) = -5

Sliding the filter over the whole image (shift = 1 unit):

=

-5  -4   0   8
-10 -2   2   3
 0  -2  -4  -7
-3  -2  -3 -16

4x4
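A minimal sketch of this scan in NumPy (as in most deep learning libraries, this is technically cross-correlation, i.e. the filter is not flipped):

import numpy as np

def conv2d(image, kernel, stride=1):
    h, w = image.shape
    f = kernel.shape[0]
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])
print(conv2d(image, vertical_edge))   # first row: -5 -4 0 8, as computed above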

Positive numbers = light colours

Negative numbers = dark colours

Whiteboard time - examples of filters

This is what we call a Convolutional Neural Network

Why does this work?

The filter works like a matrix of weights

w1 w2 w3
w4 w5 w6
w7 w8 w9

Do you notice any problem?

1. The image shrinks

2. The pixels at the edges are used less often than the pixels in the middle of the image

Padding


Can you detect anything else we could do differently?

shift = 1 unit

Stride

shift = 2 units
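A useful general formula (standard convention, not from the slides): for an n x n image, an f x f filter, padding p, and stride s, the output size is

\left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right)

With n = 6, f = 3, p = 0, s = 1 this gives the 4x4 output above.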

What about RGB Images?

We need an extra dimension

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

6x6x3

*

1 0 -1
1 0 -1
1 0 -1

1 0 -1
1 0 -1
1 0 -1

1 0 -1
1 0 -1
1 0 -1

Filter

3x3x3

Number of channels

It must be the same in the image and in the filter

6x6x3 * 3x3x3 = 4x4 (output dimension)

We can combine multiple filters

Whiteboard time

Let's try to make it look like a neural net

[Diagram: a simple neural net with inputs x_1, x_2, x_3 (X = a^{[0]}), a hidden layer a^{[1]} with units a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, and an output layer a^{[2]} with a^{[2]}_1 = \hat{y}]


Types of layers in a CNN

  1. Convolutional layer
  2. Pooling layer => reduces the size (ex: Max Pooling; it has no parameters; see the sketch after this list)
  3. Fully Connected layer
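A minimal sketch of max pooling in NumPy (a 2x2 window with stride 2 is a common choice; there is nothing to learn here):

import numpy as np

def max_pool(feature_map, f=2, stride=2):
    h, w = feature_map.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = window.max()   # keep only the strongest activation in the window
    return out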

 

LeNet (handwriting)

Advantages of convolutions

  1. Parameter sharing (reusing filters to detect features)
  2. Sparsity of connections (each output depends on a small number of values)

MOAR

AlexNet

Inception Network

There is much MOAR

  • Object detection
  • Face recognition 

Now it's time for text!

Given a sentence, identify the person's name

Professor Luiz Eduardo loves Canada

Professor   Luiz      Eduardo   loves     Canada
x^{<1>}     x^{<2>}   x^{<3>}   x^{<4>}   x^{<5>}

y^{<1>}=0   y^{<2>}=1   y^{<3>}=1   y^{<4>}=0   y^{<5>}=0

T_x = 5, \quad T_y = 5

How can we identify the names?

Obtain the words from a vocabulary (~10k words)

Compare each word
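A minimal sketch of that lookup: each word becomes a one-hot vector over the vocabulary. The tiny vocabulary here is just an illustration; the slides assume roughly 10k words.

import numpy as np

vocab = ["a", "canada", "eduardo", "loves", "luiz", "professor", "zebra"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # A vector with a single 1 at the word's index in the vocabulary
    v = np.zeros((len(vocab), 1))
    v[word_to_index[word.lower()]] = 1.0
    return v

x_1 = one_hot("Professor")   # x<1> for the example sentence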

What should we take into consideration to identify a name?

[Diagram: a first attempt where each word x^{<t>} goes into its own network and produces \hat{y}^{<t>}, with no connection between the words]

We must consider the context

[Diagram: the same networks, now connected: the activation a^{<1>} from step 1 is fed into step 2, and a^{<2>} into step 3, so each \hat{y}^{<t>} can use the context of the previous words]

Let's compare it with a simple neural net

[Diagram: inputs x_1, x_2, x_3 (X = a^{[0]}), a hidden layer a^{[1]} with units a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, and an output a^{[2]}_1 = \hat{y}]

(close enough)

[Diagram: the unrolled RNN; each step receives x^{<t>} through W_{ax}, passes the activation a^{<t>} to the next step through W_{aa}, and produces \hat{y}^{<t>} through W_{ay}]

This is the forward step

Whiteboard time
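A minimal sketch of one forward step, using the weight names above. The tanh hidden state and the sigmoid output ("is this word part of a name?") are assumptions that match common practice, not something fixed by the slides.

import numpy as np

def rnn_cell_forward(x_t, a_prev, W_ax, W_aa, W_ay, b_a, b_y):
    # New hidden state: combines the previous context a<t-1> with the current word x<t>
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    # Prediction for this word (probability that it is part of a name)
    y_hat_t = 1.0 / (1.0 + np.exp(-(W_ay @ a_t + b_y)))
    return a_t, y_hat_t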

We also have the backward propagation

L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})

[Diagram: the RNN unrolled over five steps; each step t produces \hat{y}^{<t>} and a loss L^{<t>}, and the gradients of the total loss flow backwards through the a^{<t>} connections (backpropagation through time)]

Different types of RNNs (whiteboard)

GRU and LSTM

Bidirectional RNN

Machine Translation

A few problems with these algorithms (bias)

Thank you :)

Questions?

 

hannelita@gmail.com

@hannelita