Deep Learning
- Crash Course - part II -
UNIFEI - May, 2018
[See part I here]
Topics
- Why is deep learning so popular now?
- Introduction to Machine Learning
- Introduction to Neural Networks
- Deep Neural Networks
- Training and measuring the performance of the N.N.
- Regularization
- Creating Deep Learning Projects
- CNNs and RNNs
- (OPTIONAL): Tensorflow intro
7. DL Projects: Optimization tricks
1. Normalize the input data
It helps gradient descent minimize the cost function faster
(Contour plots of the cost J: not normalized = elongated, normalized = more symmetric)
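A rough numpy sketch of this step (names are illustrative; X follows the course convention of shape (n_features, m_examples)):

```python
import numpy as np

def normalize_inputs(X):
    """Zero-center each feature and scale it to unit variance.
    X has shape (n_features, m_examples)."""
    mu = X.mean(axis=1, keepdims=True)       # per-feature mean
    sigma2 = X.var(axis=1, keepdims=True)    # per-feature variance
    X_norm = (X - mu) / np.sqrt(sigma2 + 1e-8)
    return X_norm, mu, sigma2                # reuse mu/sigma2 on the test set
```

Remember to normalize the test set with the same mu and sigma2 computed on the training set.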
2. Exploding Gradients
The derivatives can get too big
Weight initialization => prevents Z from blowing up
(He initialization, for ReLU)
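For ReLU, the usual recipe is He initialization: scale the random weights by sqrt(2 / n) of the previous layer. A rough numpy sketch:

```python
import numpy as np

def initialize_he(layer_dims):
    """layer_dims = [n_x, n_h1, ..., n_y]; returns W and b for every layer."""
    params = {}
    for l in range(1, len(layer_dims)):
        # scale by sqrt(2 / n of the previous layer) so Z neither explodes nor vanishes
        params[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * np.sqrt(2.0 / layer_dims[l - 1])
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```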
3. Gradient checking
"Debugger" for backprop
Whiteboard time
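The idea, as a numpy sketch: compare the backprop gradient with a two-sided numerical estimate (`cost` and `grad` here are hypothetical helpers working on a flattened parameter vector):

```python
import numpy as np

def gradient_check(cost, grad, theta, epsilon=1e-7):
    """cost(theta) -> scalar J; grad(theta) -> dJ/dtheta from backprop."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        approx[i] = (cost(plus) - cost(minus)) / (2 * epsilon)   # two-sided estimate
    backprop = grad(theta)
    diff = np.linalg.norm(backprop - approx) / (np.linalg.norm(backprop) + np.linalg.norm(approx))
    return diff   # around 1e-7 is great; bigger than 1e-3 usually means a bug
```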
4. Alternatives to gradient descent
Is there any problem with gradient descent?
If m is large, what happens?
If m is large, will this computation be fast?
What can we do?
Split this large block of computation
Mini-batch gradient descent
1. Split the training data (both X and Y) into batches
Example: Training data = 1 million
Split into batches of 1k elements
1k batches of 1k elements
2. Algorithm (whiteboard)
What is the size of the mini-batch?
size == m ----> batch gradient descent
size == 1 ----> stochastic gradient descent
We tend to select a value between 1 and m; usually a power of two (64, 128, 256, 512)
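A numpy sketch of step 1, splitting (and shuffling) the data into mini-batches; shapes follow the course convention X: (n_x, m), Y: (1, m):

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m examples, then cut them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # shuffle the columns (examples)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):         # the last batch may be smaller
        batches.append((X_shuf[:, t:t + batch_size], Y_shuf[:, t:t + batch_size]))
    return batches
```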
Why does mini-batch gradient descent work?
Benefits of vectorization
Making progress without needing to wait for too long
Can you think of other strategies that could make gradient descent faster?
If we make the learning rate smaller, it could reduce the oscillations, but then learning becomes slower
What would be an ideal solution?
Slower learning across the oscillations, faster learning towards the minimum
Can we tell the gradient to go faster once it finds a good direction?
Gradient descent with momentum
It helps the gradient take a more direct path towards the minimum
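A sketch of the momentum update for a single parameter matrix (beta around 0.9 is the common default):

```python
def momentum_update(W, dW, v_dW, learning_rate=0.01, beta=0.9):
    """One gradient-descent-with-momentum step; v_dW starts as zeros shaped like W."""
    v_dW = beta * v_dW + (1 - beta) * dW   # exponentially weighted average of past gradients
    W = W - learning_rate * v_dW           # step along the smoothed direction
    return W, v_dW
```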
What is the drawback of these strategies?
Now we have many more parameters to control manually
Hyperparameters
Are there MOAR strategies?
RMSProp
Adam (Momentum + RMSProp)
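A sketch of one Adam step (a momentum term plus an RMSProp term, with bias correction); the constants below are the commonly used defaults:

```python
import numpy as np

def adam_update(W, dW, v, s, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for a single parameter matrix; t is the iteration count (1-based)."""
    v = beta1 * v + (1 - beta1) * dW          # momentum-like average of gradients
    s = beta2 * s + (1 - beta2) * (dW ** 2)   # RMSProp-like average of squared gradients
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - learning_rate * v_hat / (np.sqrt(s_hat) + epsilon)
    return W, v, s
```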
The previous strategies work with the derivatives.
Is there something else we could do to speed up our algorithms?
A: use a learning rate with a variable value
Bigger in the earlier phases
Smaller in the later phases
=> Learning rate decay
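One common decay schedule, as a sketch (the exact formula on the original slide may differ):

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """The learning rate shrinks as the epoch number grows."""
    return alpha0 / (1 + decay_rate * epoch_num)

# e.g. alpha0 = 0.2, decay_rate = 1.0 -> 0.2, 0.1, 0.067, 0.05, ... over the epochs
```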
Did you notice something interesting about the previous formula?
How do we select hyperparameters?
A: Batch normalization
It normalizes the hidden layers, and not only the input data
Batch Normalization
We normalize a, so that we train W and b faster
In practice, we compute a normalized value for Z:
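Roughly, in numpy, for one layer over a mini-batch (gamma and beta are the learnable scale and shift):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Z has shape (n_units, m_batch); gamma and beta have shape (n_units, 1)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)   # zero mean, unit variance per unit
    Z_tilde = gamma * Z_norm + beta              # let the network choose the scale it wants
    return Z_tilde
```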
Batch Normalization
We act on the cache Z
What are the parameters gamma and beta?
They are learnable parameters
Parameters of the Neural Net: W[l], b[l], gamma[l], beta[l]
Batch Normalization
We often use B.N. with mini-batch/stochastic gradient descent
Why does this process work?
The values of the hidden units will be on the same scale
It also weakens the coupling between the weights of early layers and later layers
It reduces the covariate shift
Topics
- Why is deep learning so popular now?
- Introduction to Machine Learning
- Introduction to Neural Networks
- Deep Neural Networks
- Training and measuring the performance of the N.N.
- Regularization
- Creating Deep Learning Projects
- CNNs and RNNs
- (OPTIONAL): Tensorflow intro
How do we extract information from an image?
Speech/text
I love Canada <3
Detect vertical edges
Detect horizontal edges
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
6x6
How can we obtain the "edges"?
We can "scan" the image
What is the mathematical operation that represents a scan?
Convolution
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
6x6
*
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
Filter
3x3
=
-5 | -4 | 0 | 8 |
---|---|---|---|
-10 | -2 | 2 | 3 |
0 | -2 | -4 | -7 |
-3 | -2 | -3 | -16 |
4x4
The filter "scans" the image: at each position we multiply element-wise and sum everything, producing one number of the output (top-left: -5, then -4, 0, 8, then -10 at the start of the next row, and so on)
shift = 1 unit
Positive numbers = light colours
Negative numbers = dark colours
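A small numpy sketch of this scan (stride 1, no padding), reproducing the 4x4 output above; strictly speaking it is a cross-correlation, which is what deep learning libraries call "convolution":

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' convolution as in the slides: slide the kernel, multiply element-wise, sum."""
    h, w = image.shape
    f, _ = kernel.shape
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])
print(convolve2d(image, vertical_edge))   # first row: -5, -4, 0, 8
```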
Whiteboard time - examples of filters
This is what we call a Convolutional Neural Network
Why does this work?
The filter works like a matrix of weights
w1 | w2 | w3 |
---|---|---|
w4 | w5 | w6 |
w7 | w8 | w9 |
Do you notice any problem?
1. The image shrinks
2. Pixels near the edges are used much less often than pixels in the middle of the image
Padding
We add a border (typically of zeros) around the image before convolving, so the output does not shrink and the edge pixels get used more often
Can you detect anything else we could do differently?
Stride
shift = 1 unit => the filter moves one position at a time (here: a 4x4 output)
shift = 2 units => the filter jumps two positions at a time, and the output gets even smaller
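A quick sanity check of the output size, using the usual formula floor((n + 2p - f) / s) + 1, where n is the input size, f the filter size, p the padding and s the stride:

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial size of the output of a convolution."""
    return (n + 2 * padding - f) // stride + 1

print(conv_output_size(6, 3))              # 4 -> the 4x4 output above
print(conv_output_size(6, 3, stride=2))    # 2 -> shift = 2 units shrinks it further
print(conv_output_size(6, 3, padding=1))   # 6 -> padding keeps the original size
```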
What about RGB Images?
We need an extra dimension
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
6x6x3
*
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
(one 3x3 slice like this per channel: 3 slices in total)
Filter
3x3x3
Number of channels
It must be the same in the image and in the filter
= 4x4 (output dimension: the 27 products are summed into a single number per position, so the channel dimension disappears)
We can combine multiple filters
Whiteboard time
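To make the shapes concrete, a numpy sketch of convolving an RGB image with several filters (values are random; the point is the output shape, one map per filter):

```python
import numpy as np

def convolve_volume(image, filters):
    """image: (H, W, C); filters: (num_filters, f, f, C) -> output: (H-f+1, W-f+1, num_filters)."""
    H, W, C = image.shape
    n_f, f, _, _ = filters.shape
    out = np.zeros((H - f + 1, W - f + 1, n_f))
    for k in range(n_f):                       # one output channel per filter
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # sum over the fxf window AND over all input channels
                out[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[k])
    return out

rgb = np.random.rand(6, 6, 3)
filters = np.random.rand(8, 3, 3, 3)           # 8 filters, each 3x3x3
print(convolve_volume(rgb, filters).shape)     # (4, 4, 8): one 4x4 map per filter
```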
Let's try to make it look like a neural net
(Diagram: the convolution outputs stacked as hidden layers)
Types of layers in a CNN
- Convolutional layer
- Pooling layer => reduces the size (e.g. Max Pooling; it has no learnable parameters)
- Fully Connected layer
LeNet (handwriting)
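A minimal tf.keras sketch of a LeNet-style stack (conv -> pool -> conv -> pool -> fully connected); the layer sizes here are illustrative, not necessarily the original LeNet-5 values:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="relu",
                           input_shape=(28, 28, 1)),       # convolutional layer
    tf.keras.layers.MaxPooling2D(pool_size=2),             # pooling layer (no parameters)
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),         # fully connected layers
    tf.keras.layers.Dense(10, activation="softmax"),       # one output per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```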
Advantages of convolutions
- Parameter sharing (reusing filters to detect features)
- Sparsity of connections (each output depends on a small number of values)
MOAR
AlexNet
Inception Network
There is much MOAR
- Object detection
- Face recognition
Now it's time for text!
Given a sentence, identify the person's name
Professor Luiz Eduardo loves Canada
How can we identify the names?
Look each word up in a vocabulary (~10k words)
Compare each word
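A sketch of the simplest representation: one-hot vectors over a tiny, made-up vocabulary (real vocabularies have ~10k words):

```python
import numpy as np

vocab = ["professor", "luiz", "eduardo", "loves", "canada", "<unk>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_index.get(word.lower(), word_to_index["<unk>"])] = 1
    return vec

sentence = "Professor Luiz Eduardo loves Canada".split()
X = np.stack([one_hot(w) for w in sentence])   # one vector per word, fed to the network in order
print(X.shape)                                 # (5, 6)
```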
What should we take into consideration to identify a name?
We must consider the context
Let's compare it with a simple neural net
(Diagram: input words -> hidden layers -> outputs)
(close enough)
This is the forward step
Whiteboard time
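A numpy sketch of the forward step for one basic RNN cell (parameter names follow the usual Waa/Wax/Wya convention; shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One time step: the new hidden state depends on the previous one AND the current word."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # hidden state carries the context
    y_t = softmax(Wya @ a_t + by)                  # prediction for this word (e.g. name / not a name)
    return a_t, y_t
```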
We also have the backward propagation
Different types of RNNs (whiteboard)
GRU and LSTM
Bidirectional RNN
Machine Translation
A few problems with these algorithms (bias)
Thank you :)
Questions?
hannelita@gmail.com
@hannelita
Deep Learning
By Hanneli Tavante (hannelita)