- Crash Course - part II -
UNIFEI - May, 2018
It helps to minimize the cost function
Cost function J: not normalized vs. normalized
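A minimal sketch of the normalization being compared here, assuming NumPy and inputs stored as a matrix `X` with one column per example. The function name `normalize_inputs` and the toy data are hypothetical, not taken from the slides.

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Shift each feature to zero mean and scale it to unit variance.

    X has shape (n_features, m), one column per training example.
    """
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    return (X - mu) / (sigma + eps), mu, sigma

# Toy data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, 500), rng.normal(50, 1000, 500)])
X_norm, mu, sigma = normalize_inputs(X)
```

The same `mu` and `sigma` would be reused on validation and test data, so every split is scaled the same way.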
The derivatives can get too big
Weight initialization => prevents Z from blowing up
(for ReLU)
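The formula on this slide did not survive in the text. The sketch below assumes it is the usual ReLU-oriented scheme (He initialization), where weights are drawn with variance 2 / n_prev. The function name and layer sizes are hypothetical.

```python
import numpy as np

def init_weights_relu(layer_dims, seed=0):
    """He-style initialization (assumed): W ~ N(0, 2 / n_prev), b = 0."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Scaling by sqrt(2 / n_prev) keeps Z = W a + b at a similar scale in every layer.
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

# Hypothetical network: 1024 inputs, one hidden layer of 512 units, 10 outputs.
params = init_weights_relu([1024, 512, 10])
```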
"Debugger" for backprop
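One common way to "debug" backprop is gradient checking: compare the analytic gradient with a finite-difference estimate. The sketch below assumes that is what this slide refers to. The toy cost and all names are hypothetical.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Two-sided finite-difference estimate of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps
        grad.flat[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def relative_error(g_analytic, g_numeric):
    """Small values (e.g. below 1e-7) suggest the analytic gradient is right."""
    num = np.linalg.norm(g_analytic - g_numeric)
    den = np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric)
    return num / den

# Toy cost J(theta) = sum(theta^2), whose exact gradient is 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
J = lambda t: np.sum(t ** 2)
error = relative_error(2 * theta, numerical_gradient(J, theta))
```

In a real network, `theta` would be all weights and biases flattened into one vector, and the check would run only occasionally because it is slow.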
Whiteboard time
Is there any problem with gradient descent?
If m is large, what happens?
If m is large, will this computation be fast?
What can we do?
Split this large block of computation
1. Split the training data in batches
X
Y
Split into batches of 1k elements
1k batches of 1k elements
2. Algorithm (whiteboard)
t == m ----> gradient descent
t == 1 ----> stochastic gradient descent
We tend to select a value between 1 and m; usually a multiple of 64
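The whiteboard algorithm is not in the text, so this is only a sketch of the idea: shuffle, split into batches of size t, and update the parameters after every batch. A linear model with squared error stands in for the network, and all names are hypothetical.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m examples (columns) and cut them into mini-batches."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

def mini_batch_gd(X, Y, w, b, lr=0.5, epochs=20, batch_size=64):
    """Mini-batch gradient descent on a linear model with squared error."""
    for epoch in range(epochs):
        for X_t, Y_t in make_mini_batches(X, Y, batch_size, seed=epoch):
            t = X_t.shape[1]
            err = w @ X_t + b - Y_t          # forward pass on one batch
            dw = err @ X_t.T / t             # gradients from this batch only
            db = err.mean(axis=1, keepdims=True)
            w, b = w - lr * dw, b - lr * db  # update before seeing the next batch
    return w, b

# Toy problem (assumption): recover y = 3x + 1 from m = 1000 noiseless examples.
X = np.random.default_rng(1).uniform(-1, 1, (1, 1000))
Y = 3 * X + 1
w, b = mini_batch_gd(X, Y, w=np.zeros((1, 1)), b=np.zeros((1, 1)))
```

Setting `batch_size` to m gives plain gradient descent, and setting it to 1 gives stochastic gradient descent.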
Benefits of vectorization
Making progress without waiting too long
Can you think of other strategies that could make gradient descent faster?
If we make the learning rate smaller, it could fix the problem
What would be an ideal solution?
Slower learning
Faster learning
It helps gradient descent take a more direct path
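The slide title is missing from the text. The sketch below assumes the technique is gradient descent with momentum, which averages recent gradients so that oscillations cancel out and the consistent direction is kept. The toy cost and names are hypothetical.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One update of gradient descent with momentum."""
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
    w = w - lr * v
    return w, v

# Elongated bowl J(w) = 0.5 * (w0^2 + 25 * w1^2): steep in w1, shallow in w0.
grad_J = lambda w: np.array([1.0, 25.0]) * w
w, v = np.array([10.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad_J(w))
```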
Is there something else we could do to speed up our algorithms?
Use a learning rate with a variable value
Bigger in the earlier phases
Smaller in the later phases
Learning rate decay
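A minimal sketch of one possible schedule, assuming inverse-time decay per epoch. The slides do not say which schedule is meant, and `decay_rate` and the starting value are hypothetical.

```python
def decayed_learning_rate(lr0, epoch, decay_rate=1.0):
    """Inverse-time decay: bigger steps in early epochs, smaller steps later."""
    return lr0 / (1.0 + decay_rate * epoch)

# Learning rate for the first four epochs, starting from 0.2.
schedule = [decayed_learning_rate(0.2, epoch) for epoch in range(4)]
```

Exponential or step-wise schedules follow the same pattern: only the formula inside the function changes.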
It normalizes the hidden layers, and not only the input data
We normalize a, so that we train W and b faster
In practice, we compute a normalized value for Z:
Batch Norm
We act on the cache Z
What are the parameters?
They are learnable parameters
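The parameter symbols did not survive in the text. The sketch below assumes they are the usual batch norm scale and shift, called `gamma` and `beta` here, and shows only the training-time forward pass over one mini-batch.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z over the mini-batch, then rescale with learnable gamma and beta.

    Z has shape (n_units, batch_size); gamma and beta have shape (n_units, 1).
    """
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
Z = rng.normal(5.0, 3.0, size=(4, 64))           # pre-activations of one hidden layer
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))  # typical starting values
Z_tilde = batch_norm_forward(Z, gamma, beta)
```

Gradient descent would update `gamma` and `beta` together with W, so the network can choose the mean and variance each hidden unit ends up with.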
Parameters of the Neural Net:
We often use B.N. with mini-batch/stochastic gradient descent
Why does this process work?
The values of the hidden units will be on the same scale
It also makes later layers less sensitive to changes in the weights of earlier layers
It reduces the covariate shift
How do we extract information from an image?
I love Canada <3
Detect vertical edges
Detect horizontal edges
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
6x6
How can we obtain the "edges"?
We can "scan" the image
What is the mathematical operation that represents a scan?
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
6x6
*
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
Filter
3x3
=
-5 | -4 | 0 | 8 |
---|---|---|---|
-10 | -2 | 2 | 3 |
0 | -2 | -4 | -7 |
-3 | -2 | -3 | -16 |
4x4
shift = 1 unit
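The scan can be sketched in NumPy with the same 6x6 image and 3x3 filter. The function name `conv2d` is hypothetical. As in most deep learning code, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` one unit at a time (no padding)
    and sum the elementwise products at each position."""
    f = kernel.shape[0]
    out_h = image.shape[0] - f + 1
    out_w = image.shape[1] - f + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
vertical_edges = np.array([[1, 0, -1]] * 3)
result = conv2d(image, vertical_edges)
# result[0, 0] is -5: (3 + 1 + 2) - (1 + 8 + 2)
```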
Why does this work?
w1 | w2 | w3 |
---|---|---|
w4 | w5 | w6 |
w7 | w8 | w9 |
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
-5 | -4 | 0 | 8 |
---|---|---|---|
-10 | -2 | 2 | 3 |
0 | -2 | -4 | -7 |
-3 | -2 | -3 | -16 |
Do you notice any problem?
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
Can you detect anything else we could do differently?
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
-5 | -4 | 0 | 8 |
---|---|---|---|
-10 | -2 | 2 | 3 |
0 | -2 | -4 | -7 |
-3 | -2 | -3 | -16 |
shift = 1 unit
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
Stride
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
-5 | -4 | 0 | 8 |
---|---|---|---|
-10 | -2 | 2 | 3 |
0 | -2 | -4 | -7 |
-3 | -2 | -3 | -16 |
shift = 2 units
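With a shift of 2 units the filter visits only every other position, so the 6x6 image and 3x3 filter would give a 2x2 output instead of 4x4. A minimal sketch, assuming no padding and a square image and filter (the function names are hypothetical):

```python
import numpy as np

def output_size(n, f, stride):
    """Side length of the output for an n x n image and f x f filter, no padding."""
    return (n - f) // stride + 1

def conv2d_strided(image, kernel, stride):
    """Same scan as before, but the window jumps `stride` units each time."""
    f = kernel.shape[0]
    size = output_size(image.shape[0], f, stride)
    out = np.zeros((size, size), dtype=image.dtype)
    for i in range(size):
        for j in range(size):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + f, c:c + f] * kernel)
    return out

image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
kernel = np.array([[1, 0, -1]] * 3)
result_stride2 = conv2d_strided(image, kernel, stride=2)
```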
We need an extra dimension
3 | 0 | 1 | 2 | 7 | 4 |
---|---|---|---|---|---|
1 | 5 | 8 | 9 | 3 | 1 |
2 | 7 | 2 | 5 | 1 | 3 |
0 | 1 | 3 | 1 | 7 | 8 |
4 | 2 | 1 | 6 | 2 | 8 |
2 | 4 | 5 | 2 | 3 | 9 |
6x6x3
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
Filter
3x3x3
*
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
1 | 0 | -1 |
---|---|---|
1 | 0 | -1 |
1 | 0 | -1 |
Number of channels
It must be the same in the image and in the filter
6x6x3 * 3x3x3 (Filter) = 4x4 (output dimension)
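A minimal sketch of why the output is 4x4 and not 4x4x3: at each position the products are summed over all three channels at once. The random image values and the function name `conv_volume` are hypothetical.

```python
import numpy as np

def conv_volume(volume, kernel):
    """Convolve an (H, W, C) volume with an (f, f, C) filter; the result is 2D."""
    assert volume.shape[2] == kernel.shape[2], "channel counts must match"
    f = kernel.shape[0]
    out_h, out_w = volume.shape[0] - f + 1, volume.shape[1] - f + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # One number per position: sum over height, width and channels.
            out[i, j] = np.sum(volume[i:i + f, j:j + f, :] * kernel)
    return out

rng = np.random.default_rng(0)
volume = rng.integers(0, 10, size=(6, 6, 3))                  # toy 6x6 RGB image
kernel = np.stack([np.array([[1, 0, -1]] * 3)] * 3, axis=-1)  # 3x3x3 filter
out = conv_volume(volume, kernel)
```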
Whiteboard time
Hidden Layer
Hidden Layer
AlexNet
Inception Network
What should we take into consideration to identify a name?
We must consider the context
Hidden Layer
Hidden Layer
Hidden Layer
Hidden Layer
(close enough)
This is the forward step
Whiteboard time
Questions?
hannelita@gmail.com
@hannelita