# Deep Learning

- Crash Course - part II -

UNIFEI - May, 2018

# [See part I here]

## Topics

1. Why is deep learning so popular now?
2. Introduction to Machine Learning
3. Introduction to Neural Networks
4. Deep Neural Networks
5. Training and measuring the performance of the N.N.
6. Regularization
7. Creating Deep Learning Projects
8. CNNs and RNNs
9. (OPTIONAL): Tensorflow intro

# 1. Normalize the input data

$x_1 = 10000$

$x_2 = 0.026$


# 1. Normalize the input data

It helps to minimize the cost function

Not normalized: elongated contours of $J$

Normalized: round contours of $J$
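A minimal sketch of this normalization step, reusing the example scales from the slides ($x_1 \approx 10000$, $x_2 \approx 0.026$); the data values themselves are illustrative:

```python
import numpy as np

# Hypothetical training matrix: m examples in columns, two features on
# very different scales, like x_1 and x_2 in the slides.
X = np.array([[10000.0, 9000.0, 11000.0],
              [0.026,   0.030,  0.022]])

mu = X.mean(axis=1, keepdims=True)     # per-feature mean
sigma = X.std(axis=1, keepdims=True)   # per-feature standard deviation
X_norm = (X - mu) / sigma              # zero mean, unit variance per feature
```

After this, both features live on the same scale, so gradient descent does not zig-zag along the elongated axis.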

The derivatives can get too big

Weight initialization => prevents Z from blowing up

$variance(w_i) = \frac{2}{n}$

(for ReLU)
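This is He initialization: scale standard-normal weights so each $w_i$ has variance $2/n$, where $n$ is the number of inputs to the layer. A sketch with illustrative layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 500, 100  # hypothetical layer sizes

# He initialization for ReLU: Var(w_i) = 2 / n_prev,
# which keeps Z = W a + b from blowing up or vanishing.
W = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)
```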

"Debugger" for backprop

Whiteboard time
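The "debugger" for backprop is gradient checking: compare the analytic gradient with a two-sided numerical estimate. A toy sketch on $J(\theta) = \theta^2$, whose true gradient is $2\theta$:

```python
# Gradient checking: numerically estimate dJ/dtheta and compare it
# with the gradient that backprop would compute.
def J(theta):
    return theta ** 2

def numerical_grad(f, theta, eps=1e-7):
    # Two-sided difference: (f(t + eps) - f(t - eps)) / (2 eps)
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta = 3.0
approx = numerical_grad(J, theta)
exact = 2 * theta  # analytic gradient
```

If `approx` and `exact` disagree by more than a tiny tolerance, the backprop implementation has a bug.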

# 4. Alternatives to the gradient descent

Is there any problem with gradient descent?

If m is large, what happens?

$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

If m is large, will this computation be fast?

What can we do?

Split this large block of computation

$\sum_{i=1}^{batch} L(\hat{y}, y)$, $\sum_{i=1}^{batch} L(\hat{y}, y)$, $\sum_{i=1}^{batch} L(\hat{y}, y)$, ...

$J(W, b) = \frac{1}{m} \sum [\text{batches}]$

1. Split the training data in batches

$X \rightarrow X^{\{1\}}, X^{\{2\}}, X^{\{3\}}, \dots$

$Y \rightarrow Y^{\{1\}}, Y^{\{2\}}, Y^{\{3\}}, \dots$

# Example: Training data = 1 million

Split into batches of 1k elements

1k batches of 1k elements
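Step 1 can be sketched as slicing the columns of $X$ and $Y$; the shapes here (2 features, 1 output) are illustrative:

```python
import numpy as np

m = 1_000_000        # training examples (columns of X), as in the slide
batch_size = 1_000
X = np.zeros((2, m)) # hypothetical 2-feature training set
Y = np.zeros((1, m)) # hypothetical labels

# Split the training data into mini-batches (X^{t}, Y^{t})
batches = [(X[:, t:t + batch_size], Y[:, t:t + batch_size])
           for t in range(0, m, batch_size)]

len(batches)  # -> 1000 mini-batches of 1000 elements each
```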

2. Algorithm (whiteboard)

# What is the size of the mini-batch?

t == m ----> (batch) gradient descent

t == 1 ----> stochastic gradient descent

We tend to select a value between 1 and m, usually a power of two (64, 128, 256, 512, ...)

## Why does mini-batch gradient descent work?

Benefits of vectorization

Making progress without needing to wait for too long

We tend to select a value between 1 and m, usually a power of two (64, 128, 256, 512, ...)

Can you think of other strategies that could make gradient descent faster?

If we make the learning rate smaller, it could fix the problem

What would be an ideal solution?

Slower learning

Faster learning

# Can we tell the gradient to go faster once it finds a good direction?

$V_{dW} = \beta V_{dW} + (1 - \beta)\, dW$

$V_{db} = \beta V_{db} + (1 - \beta)\, db$

It helps the gradient take a straighter path: we then update with $W = W - \alpha V_{dW}$ and $b = b - \alpha V_{db}$

# Now we have many more hyperparameters to control manually

$\alpha, \beta, \text{batch size}$

# The previous strategies work with the derivatives.

Is there something else we could do to speed up our algorithms?

$\alpha$ with a variable value

Bigger in the earlier phases

Smaller in the later phases

Learning rate decay

$\alpha = \frac{1}{1 + decayRate \cdot epochNum}\ \alpha_0$

# Did you notice something interesting about the previous formula?

$\alpha = \frac{1}{1 + decayRate \cdot epochNum}\ \alpha_0$
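A sketch of this standard $1/t$ learning-rate decay, $\alpha = \frac{\alpha_0}{1 + decayRate \cdot epochNum}$; the numeric values are illustrative:

```python
# Learning rate decay: alpha shrinks as the epoch number grows,
# so we take big steps early and small steps near the minimum.
def decayed_lr(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

decayed_lr(0.2, 1.0, 0)  # -> 0.2  (full rate at epoch 0)
decayed_lr(0.2, 1.0, 1)  # -> 0.1  (half the rate after one epoch)
```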

# A: Batch normalization

It normalizes the hidden layers, and not only the input data

# Batch Normalization

We normalize $a$ so that we train $W$ and $b$ faster

In practice, we compute a normalized value for Z:

$Z^{(i)}_{norm} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$

$\tilde{Z}^{(i)} = \gamma Z^{(i)}_{norm} + \beta$
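The two formulas above can be sketched as a forward pass over a layer's pre-activations $Z$ (features in rows, examples in columns); here $\gamma = 1$, $\beta = 0$ for illustration, but in training they are learned:

```python
import numpy as np

# Batch-norm forward pass: normalize Z over the mini-batch, then
# rescale with the learnable parameters gamma and beta.
def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

Z = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
Z_tilde = batch_norm_forward(Z, gamma=1.0, beta=0.0)
```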

# Batch Normalization

(Diagram: inputs $x_1, x_2, x_3$, a hidden layer computing $z^{[1]}_j = w^{[1]T}_j x + b^{[1]}_j$, $a^{[1]}_j = \sigma(z^{[1]}_j)$ for $j = 1, 2, 3$, and an output unit computing $z^{[2]}_1 = w^{[2]T}_1 a^{[1]} + b^{[2]}_1$, $a^{[2]}_1 = \sigma(z^{[2]}_1)$, producing $\hat{y}$.)

With parameters $W^{[1]}, b^{[1]}$ and $\gamma^{[1]}, \beta^{[1]}$:

$Z^{[1]}$ ----> Batch Norm ----> $\tilde{Z}^{[1]}$ ----> $A^{[1]} = g^{[1]}(\tilde{Z}^{[1]})$

# Batch Normalization

We act on the cache $Z$

What are the parameters $\gamma, \beta$?

They are learnable parameters

Parameters of the Neural Net:

$W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots$

$\gamma^{[1]}, \beta^{[1]}, \gamma^{[2]}, \beta^{[2]}, \dots$

# Batch Normalization

We often use B.N. with mini-batch/stochastic gradient descent

Why does this process work?

The values of the hidden units will be on the same scale

It also makes later layers more robust to changes in the weights of earlier layers

It reduces the covariate shift

## Topics

1. Why is deep learning so popular now?
2. Introduction to Machine Learning
3. Introduction to Neural Networks
4. Deep Neural Networks
5. Training and measuring the performance of the N.N.
6. Regularization
7. Creating Deep Learning Projects
8. CNNs and RNNs
9. (OPTIONAL): Tensorflow intro

How do we extract information from an image?

# Speech/text

Detect vertical edges

Detect horizontal edges

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

6x6

How can we obtain the "edges"?

We can "scan" the image

What is the mathematical operation that represents a scan?

# Convolution

<3

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

6x6

Filter

3x3

*

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

6x6

1 0 -1
1 0 -1
1 0 -1

Filter

3x3

*

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

6x6

1 0 -1
1 0 -1
1 0 -1
$= 3 \cdot 1 + 0 \cdot 0 + 1 \cdot (-1) + 1 \cdot 1 + 5 \cdot 0 + 8 \cdot (-1) + 2 \cdot 1 + 7 \cdot 0 + 2 \cdot (-1) = -5$

Sliding the filter across the image fills in the whole 4x4 output:

3 0 1 2 7 4
1 5 8 9 3 1        1 0 -1         -5  -4   0   8
2 7 2 5 1 3    *   1 0 -1    =   -10  -2   2   3
0 1 3 1 7 8        1 0 -1          0  -2  -4  -7
4 2 1 6 2 8                       -3  -2  -3 -16
2 4 5 2 3 9

6x6                3x3            4x4

shift = 1 unit
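The scan above can be sketched directly; note that this is technically cross-correlation, which is what deep learning frameworks call "convolution":

```python
import numpy as np

# "Valid" 2-D convolution (no padding): slide the filter over the image
# and take the element-wise product-and-sum at each position.
def conv2d(image, kernel, stride=1):
    n, f = image.shape[0], kernel.shape[0]
    out = (n - f) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            result[i, j] = (patch * kernel).sum()
    return result

image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # vertical-edge filter

conv2d(image, kernel)[0, 0]  # -> -5.0, matching the worked example
```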

# This is what we call a Convolutional Neural Network

Why does this work?

# The filter works like a matrix of weights

w1 w2 w3
w4 w5 w6
w7 w8 w9

Do you notice any problem?

# 2. The edge pixels are used much less often than the middle pixels of the image

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

Can you detect anything else we could do differently?

1 0 -1
1 0 -1
1 0 -1
-5 -4 0 8
-10 -2 2 3
0 -2 -4 -7
-3 -2 -3 -16
=
$=$

shift = 1 unit

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

Stride

1 0 -1
1 0 -1
1 0 -1

=

-5  0
 0 -4

shift = 2 units (with stride 2, the output shrinks to 2x2)
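The output size follows a standard formula, $\lfloor (n + 2p - f)/s \rfloor + 1$, where $n$ is the image size, $f$ the filter size, $s$ the stride, and $p$ the padding (padding also addresses the edge-pixel problem above):

```python
# Output dimension of a "valid"/padded convolution:
# floor((n + 2p - f) / s) + 1
def conv_output_size(n, f, stride=1, pad=0):
    return (n + 2 * pad - f) // stride + 1

conv_output_size(6, 3, stride=1)         # -> 4 (the 4x4 result above)
conv_output_size(6, 3, stride=2)         # -> 2 (stride 2 halves the scan)
conv_output_size(6, 3, stride=1, pad=1)  # -> 6 ("same" padding)
```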

We need an extra dimension

3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

6x6x3

1 0 -1
1 0 -1
1 0 -1

Filter

3x3x3

*

1 0 -1
1 0 -1
1 0 -1
1 0 -1
1 0 -1
1 0 -1

Number of channels

It must be the same in the image and in the filter

6x6x3 * 3x3x3 = 4x4 (output dimension)

Whiteboard time

# Let's try to make it look like a neural net

(Diagram: inputs $x_1, x_2, x_3$, with $X = a^{[0]}$, a hidden layer with activations $a^{[1]}_1, a^{[1]}_2, a^{[1]}_3$, and an output layer $a^{[2]}_1$ producing $\hat{y}$.)

# Types of layers in a CNN

1. Convolutional layer
2. Pooling layer => reduces the size (e.g., Max Pooling; it has no learnable parameters)
3. Fully Connected layer
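Max pooling can be sketched like the convolution above, but taking the maximum of each window instead of a weighted sum; the 4x4 input here is illustrative:

```python
import numpy as np

# Max pooling with a 2x2 window and stride 2: halves the spatial size,
# keeps the strongest activation per window, and has no parameters.
def max_pool(A, f=2, stride=2):
    n = A.shape[0]
    out = (n - f) // stride + 1
    P = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            P[i, j] = A[i*stride:i*stride+f, j*stride:j*stride+f].max()
    return P

A = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [1., 2., 9., 7.],
              [0., 1., 3., 4.]])
max_pool(A)  # -> [[6., 5.], [2., 9.]]
```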

# LeNet (handwriting)

1. Parameter sharing (reusing filters to detect features)
2. Sparsity of connections (each output depends on a small number of values)

# MOAR

AlexNet

Inception Network

# There is much MOAR

• Object detection
• Face recognition

# Given a sentence, identify the person's name

### Professor Luiz Eduardo loves Canada

| $x^{<1>}$ | $x^{<2>}$ | $x^{<3>}$ | $x^{<4>}$ | $x^{<5>}$ |
|---|---|---|---|---|
| Professor | Luiz | Eduardo | loves | Canada |
| $y^{<1>} = 0$ | $y^{<2>} = 1$ | $y^{<3>} = 1$ | $y^{<4>} = 0$ | $y^{<5>} = 0$ |

$T_x = 5$, $T_y = 5$

# How can we identify the names?

## Compare each word

What should we take into consideration to identify a name?

First idea: map each word independently, $x^{<1>} \rightarrow \hat{y}^{<1>}$, $x^{<2>} \rightarrow \hat{y}^{<2>}$, $x^{<3>} \rightarrow \hat{y}^{<3>}$, each with its own network.

We must consider the context

(Diagram: now each step maps $x^{<t>}$ to $\hat{y}^{<t>}$ and also passes its activation forward: $a^{<1>}$ feeds into step 2, $a^{<2>}$ into step 3, ...)

# Let's compare it with a simple neural net

(Left: the feed-forward net, inputs $x_1, x_2, x_3$ with $X = a^{[0]}$, hidden activations $a^{[1]}_1, a^{[1]}_2, a^{[1]}_3$, output $a^{[2]}_1 = \hat{y}$.)

(Right, close enough: the recurrent net, where each step maps $x^{<t>}$ to $\hat{y}^{<t>}$ through $a^{<t>}$, sharing the weights $W_{ax}$ (input to hidden), $W_{aa}$ (hidden to hidden) and $W_{ay}$ (hidden to output).)

This is the forward step
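One forward step can be sketched with the weight names from the slides ($W_{ax}$, $W_{aa}$, $W_{ay}$); the layer sizes and the tanh/softmax choices are illustrative:

```python
import numpy as np

# One forward step of a basic RNN cell:
#   a<t> = tanh(W_ax x<t> + W_aa a<t-1> + b_a)
#   y_hat<t> = softmax(W_ay a<t> + b_y)
def rnn_step(x_t, a_prev, W_ax, W_aa, W_ay, b_a, b_y):
    a_t = np.tanh(W_ax @ x_t + W_aa @ a_prev + b_a)  # new hidden state
    z_y = W_ay @ a_t + b_y
    y_hat = np.exp(z_y) / np.exp(z_y).sum()          # softmax output
    return a_t, y_hat

rng = np.random.default_rng(0)
n_x, n_a, n_y = 4, 3, 2  # hypothetical input / hidden / output sizes
a, y = rnn_step(rng.standard_normal(n_x), np.zeros(n_a),
                rng.standard_normal((n_a, n_x)),
                rng.standard_normal((n_a, n_a)),
                rng.standard_normal((n_y, n_a)),
                np.zeros(n_a), np.zeros(n_y))
```

The same three weight matrices are reused at every time step, which is what lets the network handle sequences of any length.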

Whiteboard time

# We also have the backward propagation

(Diagram: the forward pass computes $a^{<1>}, \dots, a^{<4>}$ and the outputs $\hat{y}^{<1>}, \dots, \hat{y}^{<5>}$; each time step contributes a loss $L^{<t>}(\hat{y}^{<t>}, y^{<t>})$, and the gradients of the total loss $\sum_t L^{<t>}$ flow backwards through the time steps.)

# Feedback

## Thank you :)

Questions?

hannelita@gmail.com

@hannelita

#### Deep Learning

By Hanneli Tavante (hannelita)
