Deep Learning

- Crash Course - part II -

UNIFEI - May, 2018

[See part I here]

https://slides.com/hannelitavante-hannelita/deep-learning-unifei/

Topics

Why is deep learning so popular now?
Introduction to Machine Learning
Introduction to Neural Networks
Deep Neural Networks
Training and measuring the performance of the N.N.
Regularization
Creating Deep Learning Projects
CNNs and RNNs
(OPTIONAL): TensorFlow intro

7. DL Projects: Optimization tricks

1. Normalize the input data

x_1 = 10000

x_2 = 0.026

Features on a similar scale make the cost function easier to minimize

[Figure: cost function J, not normalized vs. normalized]
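A minimal sketch of input normalization with NumPy. Only the first row (10000 and 0.026) comes from the slides; the other rows are made-up values for illustration. Each feature is shifted to zero mean and scaled to unit variance.

```python
import numpy as np

# Rows are examples, columns are the features x_1 and x_2.
# Only the first row comes from the slides; the rest are hypothetical.
X = np.array([
    [10000.0, 0.026],
    [12000.0, 0.031],
    [8000.0, 0.019],
    [11000.0, 0.024],
])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation
X_norm = (X - mu) / sigma  # every feature now has mean 0 and std 1

print(X_norm.mean(axis=0), X_norm.std(axis=0))
```

The same mu and sigma computed on the training set should be reused to normalize the test set.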

2. Exploding Gradients

The derivatives can get too big

Weight initialization => prevents Z from blowing up

variance(w_i) = \frac{2}{n}

(for ReLU)
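A small sketch of this initialization, assuming n is the number of inputs feeding the layer. The helper name `he_init` and the layer sizes are illustrative, not from the slides.

```python
import numpy as np

def he_init(n_in, n_out, rng):
    """Draw weights with variance 2 / n_in (the rule on the slide, for ReLU)."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
W = he_init(n_in=500, n_out=300, rng=rng)  # hypothetical layer sizes

print(W.var(), 2.0 / 500)  # the empirical variance should be close to 2 / n
```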

3. Gradient checking

"Debugger" for backprop

Whiteboard time
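One way to sketch gradient checking: compare the gradient computed by hand (standing in for backprop) with a two-sided finite-difference estimate. The toy cost below (a linear model with squared error) and the function names are assumptions chosen to keep the example short.

```python
import numpy as np

def cost(w, X, y):
    """Toy cost: half the mean squared error of a linear model."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)

def analytic_grad(w, X, y):
    """Gradient of `cost` derived by hand; this is what we want to verify."""
    return X.T @ (X @ w - y) / len(y)

def numerical_grad(f, w, eps=1e-6):
    """Two-sided finite-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
w = rng.standard_normal(3)

g_backprop = analytic_grad(w, X, y)
g_numeric = numerical_grad(lambda v: cost(v, X, y), w)

# Relative difference between the two gradients; a tiny value means they agree.
rel_error = np.linalg.norm(g_backprop - g_numeric) / (
    np.linalg.norm(g_backprop) + np.linalg.norm(g_numeric))
print(rel_error)
```

A large relative error usually points to a bug in the hand-written gradient. The check is slow, so it is normally used only while debugging, not during training.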

4. Alternatives to the gradient descent

Is there any problem with the gradient descent?

If m is large, what happens?

J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})

If m is large, will this computation be fast?

What can we do?

Split this large block of computation

\sum_{i \in batch_1} L(\hat{y}^{(i)}, y^{(i)})

\sum_{i \in batch_2} L(\hat{y}^{(i)}, y^{(i)})

\sum_{i \in batch_3} L(\hat{y}^{(i)}, y^{(i)})

J(W, b) = \frac{1}{m} \sum_{batches} \sum_{i \in batch} L(\hat{y}^{(i)}, y^{(i)})

Mini-batch gradient descent

1. Split the training data into batches

X^{\{1\}}, X^{\{2\}}, X^{\{3\}}, ...

Y^{\{1\}}, Y^{\{2\}}, Y^{\{3\}}, ...

Example: training data = 1 million examples

Split into batches of 1k examples

=> 1k batches of 1k examples each

2. Algorithm (whiteboard)
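The whiteboard algorithm is not in the slides, so the following is only a sketch of mini-batch gradient descent under assumptions: a linear model with squared error, noiseless toy data, and a hypothetical function name. The point is the loop structure: shuffle, slice a batch, compute gradients on that batch only, update.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=64, lr=0.1, epochs=50, seed=0):
    """Fit y ~ X @ w + b by mini-batch gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)           # reshuffle the examples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            X_t, y_t = X[idx], y[idx]        # one mini-batch X^{t}, Y^{t}
            error = X_t @ w + b - y_t        # forward pass on this batch only
            dw = X_t.T @ error / len(idx)    # gradients from this batch only
            db = error.mean()
            w -= lr * dw                     # one parameter update per batch
            b -= lr * db
    return w, b

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 2))
true_w, true_b = np.array([2.0, -3.0]), 0.5  # hypothetical ground truth
y = X @ true_w + true_b                      # noiseless toy data

w, b = minibatch_gradient_descent(X, y)
print(w, b)
```

With batch_size equal to the number of examples this becomes plain gradient descent, and with batch_size=1 it becomes stochastic gradient descent.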

What is the size of the mini-batch?

t == m ----> batch gradient descent (the whole training set in one step)

t == 1 ----> stochastic gradient descent

We tend to select a value between 1 and m; usually a multiple of 64

Why does mini-batch gradient descent work?

Benefits of vectorization

Making progress without needing to wait for too long

Can you think of other strategies that could make gradient descent faster?

If we make the learning rate smaller, it could fix the problem

What would be an ideal solution?

Slower learning

Faster learning

Can we tell the gradient to go faster once it finds a good direction?

Gradient descent with momentum

V_{dW} = \beta V_{dW} + (1 - \beta) dW

V_{db} = \beta V_{db} + (1 - \beta) db

It helps gradient descent take a more direct path
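A sketch of gradient descent with momentum on a toy cost, assuming the usual update W = W - \alpha V_{dW} after the moving average is computed. The cost function, learning rate and step count are illustrative choices, not from the slides.

```python
import numpy as np

def grad(w):
    """Gradient of the toy cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([w[0], 25.0 * w[1]])

def momentum_descent(w0, lr=0.05, beta=0.9, steps=300):
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)                # V_dW starts at zero
    for _ in range(steps):
        dw = grad(w)
        v = beta * v + (1 - beta) * dw  # V_dW = beta * V_dW + (1 - beta) * dW
        w = w - lr * v                  # step along the averaged gradient
    return w

w_final = momentum_descent([1.0, 1.0])
print(np.linalg.norm(w_final))  # distance from the minimum at the origin
```

The moving average damps the steps in the steep direction, where the gradient keeps changing sign, and keeps them steady in the shallow direction.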

What is the inconvenience of these strategies?

Now we have many more parameters to tune manually

\alpha, \beta, batch size

Hyperparameters

Are there MOAR strategies?

RMSProp

Adam (Momentum + RMSProp)
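To make "Momentum + RMSProp" concrete, here is a sketch of the Adam update on a toy cost. The default values for beta1, beta2 and eps are the commonly used ones. The cost function and the names `grad` and `adam_minimize` are assumptions for illustration.

```python
import numpy as np

def grad(w):
    """Gradient of the toy cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([w[0], 25.0 * w[1]])

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)  # momentum part: moving average of the gradients
    s = np.zeros_like(w)  # RMSProp part: moving average of the squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g ** 2
        v_hat = v / (1 - beta1 ** t)  # bias correction for the first steps
        s_hat = s / (1 - beta2 ** t)
        w -= lr * v_hat / (np.sqrt(s_hat) + eps)
    return w

w_final = adam_minimize(grad, [1.0, 1.0])
print(np.linalg.norm(w_final))  # distance from the minimum at the origin
```

Dividing by the root of the squared-gradient average gives every parameter its own effective step size, which is the RMSProp idea.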

The previous strategies work with the derivatives.

Is there something else we could do to speed up our algorithms?

\alpha with a variable value

Bigger in the earlier phases

Smaller in the later phases

Learning rate decay

\alpha = \frac{1}{1 + decayRate \times epochNum} \alpha_0

Did you notice something interesting about the previous formula?
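A short sketch of learning rate decay, assuming the usual form alpha = alpha_0 / (1 + decay_rate * epoch_num). The values of alpha_0 and decay_rate are hypothetical.

```python
def decayed_learning_rate(alpha_0, decay_rate, epoch_num):
    """Learning rate for a given epoch: large at first, smaller later."""
    return alpha_0 / (1.0 + decay_rate * epoch_num)

alpha_0, decay_rate = 0.2, 1.0  # hypothetical hyperparameter values
schedule = [decayed_learning_rate(alpha_0, decay_rate, epoch) for epoch in range(5)]
print(schedule)  # starts at alpha_0 and shrinks every epoch
```

Note that alpha_0 and decay_rate are two more hyperparameters to choose.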

How do we select hyperparameters?

A: Batch normalization helps: it makes the network less sensitive to the choice of hyperparameters

It normalizes the hidden layers, and not only the input data

Batch Normalization

We normalize a, so that we train W and b faster

In practice, we compute a normalized value for Z:

Z^{(i)}_{norm} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}

\tilde{Z}^{(i)} = \gamma Z^{(i)}_{norm} + \beta
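The two formulas can be sketched as a forward pass in NumPy. This assumes \mu and \sigma^2 are the mean and variance of Z over the current mini-batch, with one column per example. The function name and shapes are illustrative.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Z has shape (units, batch_size): one column per example in the mini-batch."""
    mu = Z.mean(axis=1, keepdims=True)      # mean of each hidden unit over the batch
    var = Z.var(axis=1, keepdims=True)      # variance of each hidden unit
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * Z_norm + beta            # learnable rescale and shift

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 64)) * 50.0 + 10.0  # badly scaled pre-activations
gamma = np.ones((3, 1))                         # initial scale
beta = np.zeros((3, 1))                         # initial shift
Z_tilde = batch_norm_forward(Z, gamma, beta)

print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))
```

With gamma = 1 and beta = 0 the output is simply the normalized Z. During training both are updated by gradient descent like W and b.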

Batch Normalization

[Diagram: a network with inputs x_1, x_2, x_3 (X), three hidden units and one output unit producing \hat{y}]

z^{[1]}_j = w^{[1]T}_j x + b^{[1]}_j,  a^{[1]}_j = \sigma(z^{[1]}_j)  for j = 1, 2, 3

z^{[2]}_1 = w^{[2]T}_1 a^{[1]} + b^{[2]}_1,  a^{[2]}_1 = \sigma(z^{[2]}_1)

X --(W^{[1]}, b^{[1]})--> Z^{[1]} --(Batch Norm with \gamma^{[1]}, \beta^{[1]})--> \tilde{Z}^{[1]} --> A^{[1]} = g^{[1]}(\tilde{Z}^{[1]})

Batch Normalization

We act on the cache Z

What are the parameters \gamma, \beta?

They are learnable parameters

Parameters of the Neural Net:

W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}

\gamma^{[1]}, \beta^{[1]}, \gamma^{[2]}, \beta^{[2]}

Batch Normalization

We often use B.N. with mini-batch/stochastic gradient descent

Why does this process work?

The values of the hidden units will be on the same scale

It also makes the weights of later layers more robust to changes in the weights of earlier layers

It reduces covariate shift

8. CNNs and RNNs

How do we extract information from an image?

Speech/text

J'aime le Canada <3

Detect vertical edges

Detect horizontal edges

3	0	1	2	7	4
1	5	8	9	3	1
2	7	2	5	1	3
0	1	3	1	7	8
4	2	1	6	2	8
2	4	5	2	3	9

6x6

How can we obtain the "edges"?

We can "scan" the image

What is the mathematical operation that represents a scan?

Convolution

Filter (3x3), vertical edge detector:

1	0	-1
1	0	-1
1	0	-1

Place the filter over the top-left 3x3 block of the 6x6 image, multiply element by element, and add:

= 3*1 + 0*0 + 1*(-1) + 1*1 + 5*0 + 8*(-1) + 2*1 + 7*0 + 2*(-1)

= -5

Shift the filter one position at a time to fill in the 4x4 output:

-5	-4	0	8
-10	-2	2	3
0	-2	-4	-7
-3	-2	-3	-16
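The scan can be reproduced with a few lines of NumPy. This sketch uses the 6x6 image and the 3x3 filter from the slides. The function name is an assumption. As in most deep learning code, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (shift of 1, no padding) and sum the products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1), dtype=image.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + f_h, j:j + f_w]  # block currently under the filter
            out[i, j] = np.sum(window * kernel)   # multiply element-wise, then add
    return out

image = np.array([
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
])
vertical_edges = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

output = convolve2d_valid(image, vertical_edges)
print(output)  # 4x4 result; the top-left entry is -5
```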

shift = 1 unit

Positive numbers = light colours

Negative numbers = dark colours

Whiteboard time - examples of filters

This is what we call a Convolutional Neural Network

Why does this work?

The filter works like a matrix of weights

w1	w2	w3
w4	w5	w6
w7	w8	w9

Do you notice any problem?

1. The image shrinks

2. Pixels at the edges are used less often than pixels in the middle of the image

Padding

Add a border of zeros around the image before applying the filter

Can you detect anything else we could do differently?

shift = 1 unit

Stride

shift = 2 units
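A sketch that adds padding and stride to the scan, using the same 6x6 image and 3x3 filter. For an n x n image, an f x f filter, padding p and stride s, the output side is floor((n + 2p - f) / s) + 1. The function name is an assumption, and it handles only square inputs to keep it short.

```python
import numpy as np

def conv2d(image, kernel, pad=0, stride=1):
    """Scan a square `image` with a square `kernel`, with zero padding and a stride."""
    padded = np.pad(image, pad)       # zero border of width `pad`
    n, f = padded.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1  # floor((n + 2p - f) / s) + 1
    out = np.zeros((out_size, out_size), dtype=image.dtype)
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(padded[r:r + f, c:c + f] * kernel)
    return out

image = np.array([
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
])
kernel = np.array([[1, 0, -1]] * 3)

same_size = conv2d(image, kernel, pad=1)  # padding keeps the 6x6 size
strided = conv2d(image, kernel, stride=2) # shift = 2 units gives a 2x2 output
print(same_size.shape, strided.shape)
print(strided)
```

With a shift of 2 units the filter lands only on every other position, so the output keeps the entries at rows and columns 0 and 2 of the shift-1 result.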

What about RGB Images?

We need an extra dimension

Image: 6x6x3

Filter: 3x3x3

Number of channels

It must be the same in the image and in the filter

6x6x3 image * 3x3x3 filter = 4x4 (output dimension)

We can combine multiple filters
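A sketch of a convolution over an RGB volume with two filters. The random image, the choice of a vertical and a horizontal edge filter, and the function name are assumptions. Each filter produces one 4x4 map, and the maps are stacked along a new last axis.

```python
import numpy as np

def conv_volume(image, filters):
    """image: (H, W, C); filters: (n_filters, f, f, C) -> (H-f+1, W-f+1, n_filters)."""
    h, w, c = image.shape
    n_filters, f, _, c_f = filters.shape
    assert c == c_f, "image and filter must have the same number of channels"
    out = np.zeros((h - f + 1, w - f + 1, n_filters))
    for k in range(n_filters):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Multiply the 3D block by the 3D filter and add everything up.
                out[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(6, 6, 3))  # hypothetical 6x6 RGB image

# The same 3x3 pattern repeated on all 3 channels.
vertical = np.tile(np.array([[1, 0, -1]] * 3)[:, :, None], (1, 1, 3))
horizontal = np.transpose(vertical, (1, 0, 2))  # swap rows and columns
filters = np.stack([vertical, horizontal])      # shape (2, 3, 3, 3)

output = conv_volume(image, filters)
print(output.shape)  # one 4x4 map per filter
```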

Whiteboard time

Let's try to make it look like a neural net

[Diagram: inputs x_1, x_2, x_3 (X = a^{[0]}); hidden layers a^{[1]} (units a^{[1]}_1, a^{[1]}_2, a^{[1]}_3) and a^{[2]} (unit a^{[2]}_1); output \hat{y}]

Types of layers in a CNN

Convolutional layer
Pooling layer => reduces the size (e.g. max pooling; it has no learnable parameters)
Fully Connected layer
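A sketch of max pooling applied to the 4x4 output of the edge filter from the earlier slides. The 2x2 window, the stride of 2 and the function name are assumptions. There is nothing to learn here, only a maximum per window.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Keep the largest value in each `size` x `size` window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

feature_map = np.array([
    [-5, -4, 0, 8],
    [-10, -2, 2, 3],
    [0, -2, -4, -7],
    [-3, -2, -3, -16],
])

pooled = max_pool(feature_map)
print(pooled)  # 2x2 result
```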

LeNet (handwriting)

Advantages of convolutions

Parameter sharing (reusing filters to detect features)
Sparsity of connections (each output depends on a small number of values)
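To get a feeling for parameter sharing, this sketch counts parameters for one convolutional layer and for a fully connected layer producing the same number of outputs. The sizes (a 32x32x3 input, six 5x5 filters, a 28x28x6 output) are illustrative assumptions, not taken from the slides.

```python
def conv_layer_params(f, channels_in, n_filters):
    """Each filter has f * f * channels_in weights plus one bias."""
    return (f * f * channels_in + 1) * n_filters

def dense_layer_params(n_in, n_out):
    """A fully connected layer has one weight per input/output pair plus biases."""
    return n_in * n_out + n_out

conv_params = conv_layer_params(f=5, channels_in=3, n_filters=6)
dense_params = dense_layer_params(n_in=32 * 32 * 3, n_out=28 * 28 * 6)

print(conv_params)   # a few hundred parameters
print(dense_params)  # millions of parameters
```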

MOAR

AlexNet

Inception Network

There is much MOAR

Object detection
Face recognition

Now it's time for text!

Given a sentence, identify the person's name

Professor Luiz Eduardo loves Canada

x^{<1>}	x^{<2>}	x^{<3>}	x^{<4>}	x^{<5>}
Professor	Luiz	Eduardo	loves	Canada
y^{<1>}	y^{<2>}	y^{<3>}	y^{<4>}	y^{<5>}
0	1	1	0	0

T_x = 5

T_y = 5

How can we identify the names?

Obtain the words from a vocabulary (~10k words)

Compare each word

What should we take into consideration to identify a name?

[Diagram: each word x^{<1>}, x^{<2>}, x^{<3>} is mapped to its own prediction \hat{y}^{<1>}, \hat{y}^{<2>}, \hat{y}^{<3>}, with no connection between the words]

We must consider the context

[Diagram: the same inputs x^{<1>}, x^{<2>}, x^{<3>} and predictions \hat{y}^{<1>}, \hat{y}^{<2>}, \hat{y}^{<3>}, now with activations a^{<1>}, a^{<2>} passed from one word to the next]

Let's compare it with a simple neural net

[Diagram: inputs x_1, x_2, x_3 (X = a^{[0]}); hidden layers a^{[1]} (units a^{[1]}_1, a^{[1]}_2, a^{[1]}_3) and a^{[2]} (unit a^{[2]}_1); output \hat{y}]

(close enough)

[Diagram: unrolled RNN with inputs x^{<1>}, x^{<2>}, x^{<3>}, predictions \hat{y}^{<1>}, \hat{y}^{<2>}, \hat{y}^{<3>} and activations a^{<1>}, a^{<2>} passed between steps. Weights: W_{ax} (input to activation), W_{aa} (activation to activation), W_{ay} (activation to prediction)]

This is the forward step

Whiteboard time
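The whiteboard derivation is not in the slides, so this is only a sketch of a forward step using the weight names from the diagram. The tanh and sigmoid activations, the bias terms, the one-hot inputs and all sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_ax, W_aa, W_ay, b_a, b_y):
    """Run the same cell over every time step, carrying the activation forward."""
    a = np.zeros(W_aa.shape[0])  # a^{<0>}: no context before the first word
    activations, predictions = [], []
    for x_t in xs:
        a = np.tanh(W_ax @ x_t + W_aa @ a + b_a)  # mixes the word with the context
        y_hat = sigmoid(W_ay @ a + b_y)           # probability that the word is a name
        activations.append(a)
        predictions.append(y_hat)
    return np.array(activations), np.array(predictions)

rng = np.random.default_rng(0)
vocab_size, hidden = 10, 4        # tiny hypothetical sizes
word_ids = [3, 7, 1, 0, 9]        # hypothetical vocabulary indices of the 5 words
xs = np.eye(vocab_size)[word_ids] # one one-hot vector per word

W_ax = rng.standard_normal((hidden, vocab_size)) * 0.1
W_aa = rng.standard_normal((hidden, hidden)) * 0.1
W_ay = rng.standard_normal((1, hidden)) * 0.1
b_a, b_y = np.zeros(hidden), np.zeros(1)

a_seq, y_hat_seq = rnn_forward(xs, W_ax, W_aa, W_ay, b_a, b_y)
print(a_seq.shape, y_hat_seq.shape)
```

The same three weight matrices are reused at every time step. Only the activation changes from one word to the next.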

We also have backward propagation

L(\hat{y}^{<t>}, y^{<t>})

[Diagram: RNN unrolled over five time steps, with inputs x^{<1>} ... x^{<5>}, activations a^{<1>} ... a^{<4>}, predictions \hat{y}^{<1>} ... \hat{y}^{<5>} and one loss per time step, L^{<1>} ... L^{<5>}]
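A sketch of the per-step losses, assuming a binary cross-entropy loss at each time step and a total loss equal to their sum. The labels come from the name example on the slides. The predictions are made-up numbers.

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """Return the loss at each time step and their sum."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    per_step = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return per_step, per_step.sum()

y = np.array([0, 1, 1, 0, 0])                # Professor Luiz Eduardo loves Canada
y_hat = np.array([0.1, 0.8, 0.7, 0.2, 0.1])  # hypothetical predictions

per_step, total = sequence_loss(y_hat, y)
print(per_step)
print(total)
```

Backward propagation then flows from the total loss back through every time step, from the last word to the first.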

Different types of RNNs (whiteboard)

GRU and LSTM

Bidirectional RNN

Machine Translation

A few problems with these algorithms (bias)

Exercises

Feedback

Thank you :)

Questions?

hannelita@gmail.com

@hannelita