CS6910: Fundamentals of Deep Learning
Lecture 4: Feedforward Neural Networks, Backpropagation
Mitesh M. Khapra
Department of Computer Science and Engineering, IIT Madras
Learning Objectives
At the end of this lecture, the student will understand the fundamentals of feedforward neural networks and the mathematical formulation of the backpropagation algorithm
References/Acknowledgments
See the excellent videos by Hugo Larochelle on Backpropagation and Andrej Karpathy's lecture (CS231n, Winter 2016) on Backpropagation and Neural Networks
Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)
The input to the network is an n-dimensional vector
The network contains \(L-1\) hidden layers (2, in this case) having \(n\) neurons each
Finally, there is one output layer containing \(k\) neurons (say, corresponding to \(k\) classes)
Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (\(a_i\) and \(h_i\) are vectors)
The input layer can be called the \(0\)-th layer and the output layer can be called the \(L\)-th layer
\(W_i \in \R^{n \times n}\) and \(b_i \in \R^n\) are the weight and bias between layers \(i-1\) and \(i\) \((0 < i < L)\)
\(W_L \in \R^{n \times k}\) and \(b_L \in \R^k\) are the weight and bias between the last hidden layer and the output layer (\(L = 3\) in this case)
[Figure: a feedforward network with inputs \(x_1, x_2, \ldots, x_n\), pre-activations \(a_1, a_2, a_3\), hidden activations \(h_1, h_2\), output \(h_L=\hat {y} = f(x)\), and parameters \(W_1, b_1, W_2, b_2, W_3, b_3\)]
The pre-activation at layer \(i\) is given by
\(a_i(x) = b_i +W_ih_{i-1}(x)\)
The activation at layer \(i\) is given by
\(h_i(x) = g(a_i(x))\)
where \(g\) is called the activation function (for example, logistic, tanh, linear, etc.)
The activation at the output layer is given by
\(f(x) = h_L(x)=O(a_L(x))\)
where \(O\) is the output activation function (for example, softmax, linear, etc.)
To simplify notation we will refer to \(a_i(x)\) as \(a_i\) and \(h_i(x)\) as \(h_i\)
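To make the computation concrete, here is a minimal NumPy sketch of this forward pass; the layer sizes, the random initialization, and the choice of the logistic function for \(g\) are assumptions made for the example, not part of the lecture:

```python
import numpy as np

def logistic(z):
    # an example choice of activation: g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b, g=logistic):
    # Compute a_L for a network with L = len(W) layers.
    # W[i], b[i] store W_{i+1}, b_{i+1} in the lecture's 1-based notation.
    h = x
    for i in range(len(W) - 1):      # hidden layers 1 .. L-1
        a = b[i] + W[i] @ h          # pre-activation: a_i = b_i + W_i h_{i-1}
        h = g(a)                     # activation:     h_i = g(a_i)
    return b[-1] + W[-1] @ h         # a_L; apply O(.) to get h_L = y_hat

# Example: n = 4 inputs, two hidden layers of 4 neurons, k = 3 outputs (L = 3)
rng = np.random.default_rng(0)
W = [rng.standard_normal((4, 4)), rng.standard_normal((4, 4)), rng.standard_normal((3, 4))]
b = [np.zeros(4), np.zeros(4), np.zeros(3)]
print(forward(rng.standard_normal(4), W, b))
```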
Data: \(\lbrace x_i,y_i \rbrace_{i=1}^N\)

Model:
\(\hat y_i = f(x_i) = O(W_3\, g(W_2\, g(W_1 x_i + b_1) + b_2) + b_3)\)

Parameters:
\(\theta = W_1, ..., W_L, b_1, ..., b_L\ (L = 3)\)

Algorithm: Gradient Descent with Back-propagation (we will see soon)

Objective/Loss/Error function: Say,
\(\min \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)
In general, \(\min\ \mathscr{L}(\theta)\), where \(\mathscr{L}(\theta)\) is some function of the parameters
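As a sketch, the squared-error objective above can be computed as follows in NumPy (the toy numbers are made up for illustration):

```python
import numpy as np

def squared_error_loss(y_hat, y):
    # (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2, as in the objective above
    return np.sum((y_hat - y) ** 2) / y.shape[0]

# Toy check with N = 2 examples and k = 3 outputs
y_hat = np.array([[7.0, 8.0, 7.5], [6.0, 5.5, 6.2]])
y     = np.array([[7.5, 8.2, 7.7], [5.8, 5.9, 6.0]])
print(squared_error_loss(y_hat, y))  # 0.285
```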
Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)
The story so far...
We have introduced feedforward neural networks
We are now interested in finding an algorithm for learning the parameters of this model
Recall our gradient descent algorithm:

Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(w_0, b_0;\)
while \(t\)++ \(< max\_iterations\) do
\(\quad w_{t+1} \gets w_t - \eta \nabla w_t;\)
\(\quad b_{t+1} \gets b_t - \eta \nabla b_t;\)
end
We can write it more concisely as

Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(\theta_0 = [w_0,b_0];\)
while \(t\)++ \(< max\_iterations\) do
\(\quad \theta_{t+1} \gets \theta_t - \eta \nabla \theta_t;\)
end

where \(\nabla \theta_t = [\frac {\partial \mathscr{L}(\theta)}{\partial w_t},\frac {\partial \mathscr{L}(\theta)}{\partial b_t}]^T\)

Now, in this feedforward neural network, instead of \(\theta = [w,b]\) we have \(\theta = [W_1, ..., W_L, b_1, b_2, ..., b_L]\)

We can still use the same algorithm for learning the parameters of our model
Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)
while \(t\)++ \(< max\_iterations\) do
\(\quad \theta_{t+1} \gets \theta_t - \eta \nabla \theta_t;\)
end

where \(\nabla \theta_t = [\frac {\partial \mathscr{L}(\theta)}{\partial W_{1,t}},...,\frac {\partial \mathscr{L}(\theta)}{\partial W_{L,t}}, \frac {\partial \mathscr{L}(\theta)}{\partial b_{1,t}},...,\frac {\partial \mathscr{L}(\theta)}{\partial b_{L,t}}]^T\)
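In code, the concise update is just a loop over parameter arrays. A minimal sketch, where `grad_fn` is a placeholder for the gradient computation that backpropagation (derived below) will provide:

```python
import numpy as np

def gradient_descent(theta0, grad_fn, eta=0.1, max_iterations=1000):
    # theta is a list of parameter arrays [W_1, ..., W_L, b_1, ..., b_L];
    # grad_fn(theta) must return gradients of the same shapes.
    theta = [p.copy() for p in theta0]
    for t in range(max_iterations):
        grads = grad_fn(theta)                          # nabla theta_t
        theta = [p - eta * g for p, g in zip(theta, grads)]
    return theta

# Sanity check: minimize ||p||^2 (gradient 2p); p shrinks toward 0
print(gradient_descent([np.array([3.0, -2.0])], lambda th: [2 * th[0]]))
```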
Except that now our \(\nabla \theta \) looks much more nasty
\(\nabla \theta \) is thus composed of
\(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}, \nabla W_L \in \R^{n \times k},\)
\(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}\) and \(\nabla b_L \in \R^{k}\)

We need to answer two questions
How to choose the loss function \(\mathscr{L}(\theta)\)?
How to compute \(\nabla \theta\), which is composed of the gradients listed above?
Module 4.3: Output Functions and Loss Functions
We need to answer two questions
How to choose the loss function \(\mathscr{L}(\theta)\)?
How to compute \(\nabla \theta\), which is composed of
\(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}, \nabla W_L \in \R^{n \times k},\)
\(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}, \nabla b_L \in \R^{k}\)?
The choice of loss function depends on the problem at hand
We will illustrate this with the help of two examples
Consider our movie example again but this time we are interested in predicting ratings
Here \(y_i \in \R ^3\)
The loss function should capture how much \(\hat y_i\) deviates from \(y_i\)
If \(y_i \in \R ^n\) then the squared error loss can capture this deviation
\(\mathscr {L}(\theta) = \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^3 (\hat y_{ij} - y_{ij})^2\)
[Figure: a neural network with \(L-1\) hidden layers; the input \(x_i\) consists of movie features such as isActor Damon and isDirector Nolan; the outputs are the imdb, Critics and RT ratings, e.g. \(y_i = [7.5\ \ 8.2\ \ 7.7]\)]
A related question: What should the output function '\(O\)' be if \(y_i \in \R\)?
More specifically, can it be the logistic function?
No, because it restricts \(\hat y_i\) to a value between \(0\) & \(1\) but we want \(\hat y_i \in \R\)
So, in such cases it makes sense to have '\(O\)' as a linear function
\(f(x) = h_L = O(a_L) = W_O a_L + b_O \)
\(\hat y_i = f(x_i)\) is no longer bounded between \(0\) and \(1\)
[Figure: a neural network with \(L-1\) hidden layers classifying an image into one of four classes: Apple, Mango, Orange, Banana; the true label is \(y = [1\ 0\ 0\ 0]\)]
Now let us consider another problem for which a different loss function would be appropriate
Suppose we want to classify an image into 1 of \(k\) classes
Here again we could use the squared error loss to capture the deviation
But can you think of a better function?
Notice that \(y\) is a probability distribution
Therefore we should also ensure that \(\hat y\) is a probability distribution
What choice of the output activation '\(O\)' will ensure this?
\(a_L = W_Lh_{L-1} + b_L\)
\(\hat y_j = O(a_L)_j = \cfrac {e^{a_{L,j}}}{\sum_{i=1}^k e^{a_{L,i}}}\)
\(O(a_L)_j\) is the \(j^{th}\) element of \(\hat y\) and \(a_{L,j}\) is the \(j^{th}\) element of the vector \(a_L\).
This function is called the \(softmax\) function
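A common NumPy sketch of this function is below; subtracting the maximum before exponentiating is a standard numerical-stability trick (not from the slide) and does not change the output, since it cancels in the ratio:

```python
import numpy as np

def softmax(a_L):
    # O(a_L)_j = exp(a_L_j) / sum_i exp(a_L_i)
    e = np.exp(a_L - np.max(a_L))   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0, 4.0])))  # entries sum to 1
```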
Now that we have ensured that both \(y\) & \(\hat y\) are probability distributions can you think of a function which captures the difference between them?
Cross-entropy
\(\mathscr {L}(\theta) = - \displaystyle \sum_{c=1}^k y_c \log \hat y_c \)
Notice that
\(y_c = 1\) if \(c = \ell\) (the true class label)
\(y_c = 0\) otherwise
\(\therefore\ \mathscr {L}(\theta) = - \log \hat y_\ell\)
So, for a classification problem (where you have to choose \(1\) of \(k\) classes), we use the following objective function:
minimize \(\mathscr {L}(\theta) = - \log \hat y_\ell\)
or maximize \(- \mathscr {L}(\theta) = \log \hat y_\ell\)
But wait!
Is \(\hat y_\ell\) a function of \(\theta = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L]\)?
Yes, it is indeed a function of \(\theta\)
\(\hat y_\ell = [O(W_3 g(W_2 g(W_1 x + b_1) + b_2) + b_3)]_\ell\)
What does \(\hat y_\ell\) encode?
It is the probability that \(x\) belongs to the \(\ell^{th}\) class (so we want to bring it as close to \(1\) as possible).
\(\log \hat y_\ell\) is called the \(log\text -likelihood\) of the data.
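A small numeric sketch of this loss (the probabilities below are made up; index 0 plays the role of the true class \(\ell\)):

```python
import numpy as np

def cross_entropy(y_hat, ell):
    # L(theta) = -log y_hat_ell, the negative log-likelihood of the true class
    return -np.log(y_hat[ell])

y_hat = np.array([0.7, 0.1, 0.1, 0.1])  # e.g. a softmax output for the fruit example
print(cross_entropy(y_hat, ell=0))      # ~0.357; it goes to 0 as y_hat_ell -> 1
```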
| Outputs | Real Values | Probabilities |
|---|---|---|
| Output Activation | Linear | Softmax |
| Loss Function | Squared Error | Cross Entropy |

Of course, there could be other loss functions depending on the problem at hand but the two loss functions that we just saw are encountered very often

For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
Module 4.4: Backpropagation (Intuition)
We need to answer two questions
How to choose the loss function \(\mathscr{L}(\theta)\)?
How to compute \(\nabla \theta\), which is composed of
\(\nabla W_1, \nabla W_2, ..., \nabla W_{L-1} \in \R^{n \times n}, \nabla W_L \in \R^{n \times k},\)
\(\nabla b_1, \nabla b_2, ..., \nabla b_{L-1} \in \R^{n}, \nabla b_L \in \R^{k}\)?
Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(\theta_0 = [w_0,b_0];\)
while \(t\)++ \(< max\_iterations\) do
\(\quad \theta_{t+1} \gets \theta_t - \eta \nabla \theta_t;\)
end
Let us focus on this one weight (\(W_{112}\)).
To learn this weight using SGD we need a formula for \(\frac{\partial \mathscr{L}(\theta)}{\partial W_{112}}\).
We will see how to calculate this.
First let us take the simple case when we have a deep but thin network.
In this case it is easy to find the derivative by chain rule.
[Figure: a deep but thin network: \(x_1 \xrightarrow{W_{111}} a_{11} \to h_{11} \xrightarrow{W_{211}} a_{21} \to h_{21} \to \cdots \xrightarrow{W_{L11}} a_{L1} \to \hat y = f(x) \to \mathscr {L} (\theta)\)]
By the chain rule (just compressing the chain of local derivatives along the single path from \(W_{111}\) to the loss):

\(\cfrac{\partial \mathscr{L}(\theta)}{\partial W_{111}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial \hat y}\, \cfrac{\partial \hat y}{\partial a_{L1}}\, \cdots\, \cfrac{\partial h_{11}}{\partial a_{11}}\, \cfrac{\partial a_{11}}{\partial W_{111}}\)
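A quick numeric sketch of this chain rule on a deep, thin network (one neuron per layer). The logistic activation, squared-error loss, and the specific numbers are assumptions made for the example; the analytic chain product is checked against a finite-difference estimate:

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward_thin(w, x):
    # one neuron per layer: a_k = w_k * h_{k-1}, h_k = sigmoid(a_k)
    hs, h = [x], x
    for wk in w:
        h = sig(wk * h)
        hs.append(h)
    return hs                        # [h_0 = x, h_1, ..., h_L = y_hat]

w, x, y = np.array([0.4, -0.3, 0.7]), 1.5, 0.5
hs = forward_thin(w, x)
y_hat = hs[-1]

# chain rule along the single path, for the first weight W_111
grad = 2 * (y_hat - y)                          # dL/dy_hat for squared error
for k in range(len(w), 1, -1):                  # layers L .. 2
    grad *= hs[k] * (1 - hs[k]) * w[k - 1]      # dh_k/da_k * da_k/dh_{k-1}
grad *= hs[1] * (1 - hs[1]) * hs[0]             # dh_1/da_1 * da_1/dW_111

eps = 1e-6
wp, wm = w.copy(), w.copy(); wp[0] += eps; wm[0] -= eps
fd = ((forward_thin(wp, x)[-1] - y) ** 2 - (forward_thin(wm, x)[-1] - y) ** 2) / (2 * eps)
print(grad, fd)  # the two estimates should agree closely
```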
Let us see an intuitive explanation of backpropagation before we get into the mathematical details
We get a certain loss at the output and we try to figure out who is responsible for this loss
So, we talk to the output layer and say "Hey! You are not producing the desired output, better take responsibility".
The output layer says "Well, I take responsibility for my part but please understand that I am only as good as the hidden layer and weights below me". After all,
\(f(x) = \hat y = O(W_Lh_{L-1} + b_L)\)
So, we talk to \(W_L, b_L\) and \(h_{L-1}\) and ask them "What is wrong with you?"
\(W_L\) and \(b_L\) take full responsibility but \(h_{L-1}\) says "Well, please understand that I am only as good as the pre-activation layer"
The pre-activation layer in turn says that I am only as good as the hidden layer and weights below me.
We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e. all the parameters of the model)
But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do)
\(\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}}_{\substack{\text{Talk to the} \\ \text{weight directly}}} = \underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial \hat y} \frac{\partial \hat y}{\partial a_3}}_{\substack{\text{Talk to the} \\ \text{output layer}}}\ \underbrace{\frac{\partial a_3}{\partial h_2} \frac{\partial h_2}{\partial a_2}}_{\substack{\text{Talk to the previous} \\ \text{hidden layer}}}\ \underbrace{\frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1}}_{\substack{\text{Talk to the previous} \\ \text{hidden layer}}}\ \underbrace{\frac{\partial a_1}{\partial W_{111}}}_{\substack{\text{and now talk} \\ \text{to the weights}}}\)
Quantities of interest (roadmap for the remaining part):
Gradient w.r.t. output units
Gradient w.r.t. hidden units
Gradient w.r.t. weights and biases
Our focus is on cross-entropy loss and softmax output.
Module 4.5: Backpropagation: Computing Gradients w.r.t. the Output Units
Let us first consider the partial derivative w.r.t. the \(i\)-th output

\(\mathscr {L}(\theta) = - \log \hat y_\ell\) \quad (\(\ell =\) true class label)

\(\cfrac {\partial}{\partial \hat y_i}(\mathscr {L}(\theta)) = \cfrac {\partial}{\partial \hat y_i}(- \log \hat y_\ell) = - \cfrac{1}{\hat y_\ell}\) if \(i = \ell\)
\(\qquad\qquad\qquad = 0\) otherwise

More compactly,

\(\cfrac {\partial}{\partial \hat y_i}(\mathscr {L}(\theta)) = - \cfrac {\mathbb {1}_{i=\ell}}{\hat y_\ell} \)
We can now talk about the gradient w.r.t. the vector \(\hat y\)

\(\nabla_{\hat y} \mathscr {L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial \hat y_1} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial \hat y_k} \end{bmatrix} = - \cfrac{1}{\hat y_\ell}\, e(\ell)\)

where \(e(\ell)\) is a \(k\)-dimensional vector whose \(\ell\)-th element is \(1\) and all other elements are \(0\).
What we are actually interested in is

\(\cfrac {\partial \mathscr {L}(\theta)}{\partial a_{Li}} = \cfrac {\partial(- \log \hat y_\ell)}{\partial a_{Li}} = \cfrac {\partial(- \log \hat y_\ell)}{\partial \hat y_\ell} \cfrac {\partial \hat y_\ell}{\partial a_{Li}}\)

Does \(\hat y_\ell\) depend on \(a_{Li}\)? Indeed, it does:

\( \hat y_\ell = \cfrac {\exp (a_{L\ell})}{\sum_{i'} \exp (a_{Li'})}\)

Having established this, we will now derive the full expression
\(\cfrac {\partial \mathscr {L}(\theta)}{\partial a_{Li}} = - (\mathbb {1}_{(\ell=i)} - \hat y_i)\)

So far we have derived the partial derivative w.r.t. the \(i\)-th element of \(a_L\)

We can now write the gradient w.r.t. the vector \(a_L\)

\(\nabla_{a_L} \mathscr {L}(\theta) = - (e (\ell)-\hat y) = \hat y - e(\ell)\)
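In code, this final expression is a one-liner; a minimal sketch assuming a softmax output \(\hat y\) and true class index \(\ell\):

```python
import numpy as np

def grad_a_L(y_hat, ell):
    # nabla_{a_L} L(theta) = -(e(ell) - y_hat) = y_hat - e(ell)
    e = np.zeros_like(y_hat)
    e[ell] = 1.0
    return y_hat - e

print(grad_a_L(np.array([0.7, 0.1, 0.1, 0.1]), ell=0))  # [-0.3, 0.1, 0.1, 0.1]
```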
Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden Units
Chain rule along multiple paths: if a function \(p(z)\) can be written as a function of intermediate results \(q_i(z)\), then we have:

\(\cfrac{\partial p(z)}{\partial z} = \displaystyle \sum_{m} \cfrac{\partial p(z)}{\partial q_m(z)} \cfrac{\partial q_m(z)}{\partial z}\)

In our case:
\(p(z)\) is the loss function \(\mathscr{L} (\theta)\)
\(z=h_{ij}\)
\(q_m(z) = a_{i+1,m}\)
Now consider these two vectors,

\(\nabla_{a_{i+1}} \mathscr {L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,1}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,k}} \end{bmatrix}; \quad W_{i+1, \cdot ,j} = \begin{bmatrix} W_{i+1,1,j} \\ \vdots \\ W_{i+1,k,j} \end{bmatrix}\)

\(W_{i+1, \cdot ,j}\) is the \(j\)-th column of \(W_{i+1}\); see that,

\((W_{i+1, \cdot ,j})^T \nabla_{a_{i+1}} \mathscr {L} (\theta) = \displaystyle \sum_{m=1}^k \cfrac {\partial \mathscr {L} (\theta)}{\partial a_{i+1,m}} W_{i+1,m,j}\)

where \(a_{i+1} = W_{i+1}h_{i} + b_{i+1}\)
We have, \(\cfrac {\partial \mathscr {L} (\theta)}{\partial h_{ij}} = (W_{i+1, \cdot ,j})^T \nabla_{a_{i+1}} \mathscr {L} (\theta) \)

We can now write the gradient w.r.t. \(h_i\)

\(\nabla_{h_i} \mathscr {L}(\theta) = (W_{i+1})^T (\nabla_{a_{i+1}} \mathscr {L} (\theta)) \)

We are almost done except that we do not know how to calculate \(\nabla_{a_{i+1}} \mathscr {L} (\theta)\) for \(i < L-1\)

We will see how to compute that
\(\cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial h_{ij}} \cfrac{\partial h_{ij}}{\partial a_{ij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial h_{ij}}\, g'(a_{ij}) \qquad [\because h_{ij}=g(a_{ij})]\)

We can now write the gradient w.r.t. the vector \(a_i\)

\(\nabla_{a_i} \mathscr {L}(\theta) = \nabla_{h_i} \mathscr {L}(\theta) \odot [...,g'(a_{ij}),...]\)
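The two hidden-layer steps just derived, as a NumPy sketch; the sizes are made up and a logistic \(g\) is assumed (so \(g'(a) = g(a)(1 - g(a))\)):

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_next = rng.standard_normal((3, 4))    # W_{i+1}, mapping h_i in R^4 to a_{i+1} in R^3
grad_a_next = rng.standard_normal(3)    # nabla_{a_{i+1}} L(theta), assumed already known
a_i = rng.standard_normal(4)            # pre-activation of layer i

grad_h_i = W_next.T @ grad_a_next       # nabla_{h_i} L = (W_{i+1})^T nabla_{a_{i+1}} L
grad_a_i = grad_h_i * sig(a_i) * (1 - sig(a_i))   # elementwise product with g'(a_i)
print(grad_a_i)
```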
Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters
Recall that,
\(a_k = b_k + W_k h_{k-1}\)
We want to compute \(\nabla_{W_k} \mathscr {L}(\theta)\)
Let's take a simple example of a \(W_k \in \R^{3 \times 3}\) and see what each entry of \(\nabla_{W_k} \mathscr {L}(\theta)\) looks like. Each entry is

\(\cfrac{\partial \mathscr{L}(\theta)}{\partial W_{kij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ki}} \cfrac{\partial a_{ki}}{\partial W_{kij}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\, h_{k-1,j}\)

Collecting all the entries,

\(\nabla_{W_k} \mathscr {L}(\theta) = \nabla_{a_k} \mathscr {L}(\theta) \cdot h_{k-1}^T \)
Finally, coming to the biases,
\( a_{ki} = b_{ki} + \displaystyle\sum_{j} W_{kij}\, h_{k-1,j}\)
so \(\cfrac{\partial \mathscr{L}(\theta)}{\partial b_{ki}} = \cfrac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\)

We can now write the gradient w.r.t. the vector \(b_k\)

\(\nabla_{b_k} \mathscr {L}(\theta) = \nabla_{a_k} \mathscr {L}(\theta) \)
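Both parameter gradients in NumPy; the outer product implements \(\nabla_{a_k} \mathscr{L}(\theta) \cdot h_{k-1}^T\) (the sizes are made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
grad_a_k = rng.standard_normal(3)       # nabla_{a_k} L(theta), assumed already known
h_prev = rng.standard_normal(4)         # h_{k-1}

grad_W_k = np.outer(grad_a_k, h_prev)   # nabla_{W_k} L = nabla_{a_k} L . h_{k-1}^T
grad_b_k = grad_a_k                     # nabla_{b_k} L = nabla_{a_k} L
print(grad_W_k.shape)                   # (3, 4), the same shape as W_k
```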
Module 4.8: Backpropagation: Pseudo code
Finally, we have all the pieces of the puzzle
We can now write the full learning algorithm
\(\nabla_{a_L} \mathscr {L}(\theta) \) (gradient w.r.t. output layer)
\(\nabla_{h_k} \mathscr {L}(\theta), \nabla_{a_k} \mathscr {L}(\theta) \) (gradient w.r.t. hidden layers, \(1 \leq k < L\))
\(\nabla_{W_k} \mathscr {L}(\theta), \nabla_{b_k} \mathscr {L}(\theta) \) (gradient w.r.t. weights and biases, \(1 \leq k \leq L\))
Algorithm: gradient_descent()
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)
while \(t\)++ \(< max\_iterations\) do
\(\quad h_1,h_2,...,h_{L-1},a_1,a_2,...,a_L,\hat y = forward\)_\(propagation(\theta_t);\)
\(\quad \nabla \theta_t = backward\)_\(propagation(h_1,h_2,...,h_{L-1},a_1,a_2,...,a_L,y,\hat y);\)
\(\quad \theta_{t+1} \gets \theta_t - \eta \nabla \theta_t;\)
end
Algorithm: forward_propagation(\(\theta\))
for \(k = 1\) to \(L-1\) do
\(\quad a_k = b_k + W_k h_{k-1} ;\) // with \(h_0 = x\)
\(\quad h_k = g(a_k) ;\)
end
\(a_L = b_L + W_L h_{L-1} ;\)
\(\hat y = O(a_L) ;\)
Just do a forward propagation first and compute all \(h_i\)'s, \(a_i\)'s, and \(\hat y\)

Algorithm: back_propagation(\(h_1,h_2,...,h_{L-1},a_1,a_2,...,a_L,y, \hat y\))
// Compute output gradient ;
\(\nabla _ {a_L} \mathscr {L} (\theta) = - (e(y) - \hat y); \)
for \(k = L\) down to \(1\) do
\(\quad\)// Compute gradients w.r.t. parameters ;
\(\quad \nabla _ {W_k} \mathscr {L} (\theta) = \nabla _ {a_k} \mathscr {L} (\theta)\, h_{k-1}^T ;\)
\(\quad \nabla _ {b_k} \mathscr {L} (\theta) = \nabla _ {a_k} \mathscr {L} (\theta) ;\)
\(\quad\)// Compute gradients w.r.t. layer below ;
\(\quad \nabla _ {h_{k-1}} \mathscr {L} (\theta) = W_k^T \nabla _ {a_k} \mathscr {L} (\theta) ;\)
\(\quad\)// Compute gradients w.r.t. layer below (pre-activation) ;
\(\quad \nabla _ {a_{k-1}} \mathscr {L} (\theta) = \nabla _ {h_{k-1}} \mathscr {L} (\theta) \odot [...,g' (a_{k-1,j}),...];\)
end
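Putting the pseudo code together, here is a compact NumPy sketch of the whole loop. A logistic \(g\), softmax \(O\) and cross-entropy loss are assumed, and the sizes, learning rate and data are made up for illustration:

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, W, b):
    # returns h_list = [h_0 = x, h_1, ..., h_{L-1}] and y_hat = O(a_L)
    h_list, h = [x], x
    for k in range(len(W) - 1):
        h = sig(b[k] + W[k] @ h)
        h_list.append(h)
    a_L = b[-1] + W[-1] @ h
    e = np.exp(a_L - a_L.max())                 # softmax output
    return h_list, e / e.sum()

def back_propagation(h_list, y_onehot, y_hat, W):
    L = len(W)
    dW, db = [None] * L, [None] * L
    grad_a = -(y_onehot - y_hat)                # output gradient
    for k in range(L - 1, -1, -1):              # layers L .. 1 (0-based k)
        dW[k] = np.outer(grad_a, h_list[k])     # nabla_{W_k} = nabla_{a_k} h_{k-1}^T
        db[k] = grad_a                          # nabla_{b_k} = nabla_{a_k}
        if k > 0:
            grad_h = W[k].T @ grad_a            # nabla_{h_{k-1}}
            grad_a = grad_h * h_list[k] * (1 - h_list[k])  # odot g'(a_{k-1})
    return dW, db

# gradient descent on one toy example
rng = np.random.default_rng(0)
W = [0.5 * rng.standard_normal((4, 4)), 0.5 * rng.standard_normal((4, 4)), 0.5 * rng.standard_normal((3, 4))]
b = [np.zeros(4), np.zeros(4), np.zeros(3)]
x, y = rng.standard_normal(4), np.array([1.0, 0.0, 0.0])
for t in range(1000):
    h_list, y_hat = forward_propagation(x, W, b)
    dW, db = back_propagation(h_list, y, y_hat, W)
    W = [Wk - 0.1 * dWk for Wk, dWk in zip(W, dW)]
    b = [bk - 0.1 * dbk for bk, dbk in zip(b, db)]
print(np.round(y_hat, 3))  # approaches [1, 0, 0], i.e. -log y_hat_ell -> 0
```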
Module 4.9: Derivative of the activation function
Now, the only thing we need to figure out is how to compute \(g'\)
Logistic function:

\(g(z) = \sigma (z) =\cfrac {1}{1+e^{-z}}\)

\(g'(z) = (-1) \cfrac {1}{(1+e^{-z})^2} \cfrac {d}{dz} (1+e^{-z})\)
\(= (-1) \cfrac {1}{(1+e^{-z})^2} (-e^{-z})\)
\(= \cfrac {1}{(1+e^{-z})} \cfrac {1+e^{-z}-1}{1+e^{-z}}\)
\(=g(z) (1-g(z))\)
tanh function:

\(g(z) = \tanh (z) =\cfrac {e^z-e^{-z}}{e^z+e^{-z}}\)

\(g'(z) = \cfrac {\Bigg ((e^z+e^{-z}) \frac {d}{dz}(e^z-e^{-z}) - (e^z-e^{-z}) \frac {d}{dz} (e^z+e^{-z})\Bigg )}{(e^z+e^{-z})^2} \)
\(=\cfrac {(e^z+e^{-z})^2-(e^z-e^{-z})^2}{(e^z+e^{-z})^2}\)
\(=1- \cfrac {(e^z-e^{-z})^2}{(e^z+e^{-z})^2}\)
\(=1-(g(z))^2\)
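Both closed forms can be verified numerically; a quick finite-difference sketch:

```python
import numpy as np

def sigma(z):       return 1.0 / (1.0 + np.exp(-z))
def sigma_prime(z): return sigma(z) * (1 - sigma(z))   # g'(z) = g(z)(1 - g(z))
def tanh_prime(z):  return 1 - np.tanh(z) ** 2         # g'(z) = 1 - g(z)^2

z, eps = 0.3, 1e-6
print(sigma_prime(z), (sigma(z + eps) - sigma(z - eps)) / (2 * eps))
print(tanh_prime(z), (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps))
```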