Department of Computer Science and Engineering, IIT Madras
[Figure: a feedforward neural network with inputs \(x_1, x_2, \ldots, x_n\), pre-activations \(a_1, a_2, a_3\), hidden activations \(h_1, h_2\), weights \(W_1, W_2, W_3\), biases \(b_1, b_2, b_3\), and output \(h_L = \hat{y} = f(x)\)]
The input layer can be called the \(0\)-th layer and the output layer can be called the \(L\)-th layer
\(W_i \in \R^{n \times n}\) and \(b_i \in \R^n\) are the weight and bias between layers \(i-1\) and \(i\) \((0 < i < L)\)
\(W_L \in \R^{n \times k}\) and \(b_L \in \R^k\) are the weight and bias between the last hidden layer and the output layer (\(L = 3\) in this case)
\(a_i(x) = b_i +W_ih_{i-1}(x)\)
\(h_i(x) = g(a_i(x))\)
\(f(x) = h_L(x)=O(a_L(x))\)
\(a_i = b_i +W_ih_{i-1}\)
\(h_i = g(a_i)\)
\(f(x) = h_L=O(a_L)\)
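As a minimal sketch of this recursion in NumPy (the logistic choice of \(g\), the softmax output \(O\), and the toy layer sizes are assumptions of the example, not fixed by the text):

```python
import numpy as np

def g(a):                      # assumed hidden activation: logistic function
    return 1.0 / (1.0 + np.exp(-a))

def O(a):                      # assumed output function: softmax
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """h_0 = x; a_i = b_i + W_i h_{i-1}; h_i = g(a_i); f(x) = O(a_L)."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):   # hidden layers 1 .. L-1
        a = b + W @ h
        h = g(a)
    a_L = bs[-1] + Ws[-1] @ h            # output layer L
    return O(a_L)

# toy dimensions: n = 4 inputs, two hidden layers of width 4, k = 3 outputs
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 4)), rng.standard_normal((4, 4)), rng.standard_normal((3, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(3)]
print(forward(rng.standard_normal(4), Ws, bs))
```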
\(\hat y_i = f(x_i) = O(W_3 g(W_2 g(W_1 x_i + b_1) + b_2) + b_3)\)
\(\theta = [W_1, ..., W_L, b_1, b_2, ..., b_L]\) \((L = 3)\)
\(\min\limits_{\theta} \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^k (\hat y_{ij} - y_{ij})^2\)
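A small sketch of evaluating this objective, assuming the predictions and targets are already available as \(N \times k\) NumPy arrays (the numbers are made up for illustration):

```python
import numpy as np

def squared_error_objective(y_hat, y):
    # (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2
    return np.sum((y_hat - y) ** 2) / y.shape[0]

y_hat = np.array([[7.1, 8.0, 7.9], [6.2, 5.9, 6.4]])
y     = np.array([[7.5, 8.2, 7.7], [6.0, 6.1, 6.5]])
print(squared_error_objective(y_hat, y))
```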
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(w_0, b_0;\)
while \(t\)++ \(< max\_iterations\) do
\(w_{t+1} \gets w_t - \eta \nabla w_t\)
\(b_{t+1} \gets b_t - \eta \nabla b_t\)
end
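A runnable sketch of this loop for a single scalar weight and bias; the toy data point, the logistic model, and \(\eta = 0.1\) are illustrative assumptions:

```python
import numpy as np

x, y = 2.5, 0.9          # one toy training example (assumed)
eta = 0.1                # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(w, b):
    # gradients of (f(x) - y)^2 with f(x) = sigmoid(w*x + b)
    f = sigmoid(w * x + b)
    dw = 2 * (f - y) * f * (1 - f) * x
    db = 2 * (f - y) * f * (1 - f)
    return dw, db

w, b = 0.0, 0.0          # Initialize w_0, b_0
t, max_iterations = 0, 1000
while t < max_iterations:
    dw, db = gradients(w, b)
    w = w - eta * dw     # w_{t+1} <- w_t - eta * grad wrt w_t
    b = b - eta * db     # b_{t+1} <- b_t - eta * grad wrt b_t
    t += 1
print(w, b)
```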
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(\theta_0 = [w_0,b_0];\)
while \(t\)++ \(< max\_iterations\) do
\(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t\)
end
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)
while \(t\)++ \(< max\_iterations\) do
\(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t\)
end
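Written out explicitly, \(\nabla \theta_t\) stacks the gradient of the loss with respect to every parameter at step \(t\):

\(\nabla \theta_t = [\nabla_{W_1} \mathscr{L}(\theta_t), \ldots, \nabla_{W_L} \mathscr{L}(\theta_t), \nabla_{b_1} \mathscr{L}(\theta_t), \ldots, \nabla_{b_L} \mathscr{L}(\theta_t)]\)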
\(\mathscr {L}(\theta) = \cfrac {1}{N} \displaystyle \sum_{i=1}^N \sum_{j=1}^3 (\hat y_{ij} - y_{ij})^2\)
[Figure: Neural network with \(L - 1\) hidden layers; the input \(x_i\) encodes movie features such as isActor Damon and isDirector Nolan, and the target \(y_i = \{7.5, 8.2, 7.7\}\) collects the imdb Rating, Critics Rating, and RT Rating]
[Figure: Neural network with \(L - 1\) hidden layers for classification over the classes Apple, Mango, Orange, and Banana; the true label is one-hot encoded, e.g. \(y =\) [\(1\ 0\ 0\ 0\)] for Apple]
\(\mathscr {L}(\theta) = - \displaystyle \sum_{c=1}^k y_c \log \hat y_c \)
\(\hat y_\ell = [O(W_3 g(W_2 g(W_1 x + b_1) + b_2) + b_3)]_\ell\)
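A minimal sketch of this loss; with a one-hot \(y\), the sum collapses to \(-\log \hat y_\ell\) for the true class \(\ell\) (the probabilities below are made up):

```python
import numpy as np

def cross_entropy(y_hat, y):
    # -sum_c y_c * log(y_hat_c); with one-hot y this equals -log(y_hat_l)
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.7, 0.1, 0.1, 0.1])   # predicted class probabilities (assumed)
y     = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot truth: class "Apple"
print(cross_entropy(y_hat, y))           # equals -log(0.7)
```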
| Outputs | Real Values | Probabilities |
| ---| ---| ---|
| Output Activation | Linear | Softmax |
| Loss Function | Squared Error | Cross Entropy |
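A short sketch of the two output activations in the table (the input vector is an arbitrary example):

```python
import numpy as np

def linear(a_L):
    # real-valued outputs, typically paired with squared error
    return a_L

def softmax(a_L):
    # probabilities, typically paired with cross entropy
    e = np.exp(a_L - np.max(a_L))   # shift for numerical stability
    return e / np.sum(e)

a_L = np.array([2.0, 1.0, -1.0, 0.5])
print(linear(a_L))
print(softmax(a_L), softmax(a_L).sum())  # the softmax outputs sum to 1
```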
[Figure: the chain from the input to the loss, \(x_1 \rightarrow a_{11} \rightarrow h_{11} \rightarrow a_{21} \rightarrow h_{21} \rightarrow \cdots \rightarrow a_{L1} \rightarrow \hat y = f(x) \rightarrow \mathscr {L} (\theta)\), where the edges into \(a_{11}, a_{21}, \ldots, a_{L1}\) carry the weights \(W_{111}, W_{211}, \ldots, W_{L11}\)]
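A worked form of this chain (restricted to the single path shown in the figure; the full derivative sums such products over every path from \(W_{111}\) to the loss):

\(\cfrac {\partial \mathscr {L}(\theta)}{\partial W_{111}} = \cfrac {\partial \mathscr {L}(\theta)}{\partial \hat y} \cdot \cfrac {\partial \hat y}{\partial a_{L1}} \cdot \cfrac {\partial a_{L1}}{\partial h_{21}} \cdots \cfrac {\partial h_{11}}{\partial a_{11}} \cdot \cfrac {\partial a_{11}}{\partial W_{111}}\)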
[Figure: the same network with the loss \(-\log \hat y_\ell\) attached to the output \(\hat y = f(x)\)]
The annotations in the figure describe how the gradient is propagated: talk to the output layer; talk to the previous hidden layer, and now talk to the weights; then again talk to the previous hidden layer, and so on, until we can talk to the weight directly.
The first quantity we need is \( \cfrac {\partial}{\partial \hat y_i}(- \log \hat y_\ell) \), the derivative of the loss with respect to the \(i\)-th output.
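Since the loss involves only the probability of the true class \(\ell\), this derivative is nonzero only for \(i = \ell\):

\( \cfrac {\partial}{\partial \hat y_i}(- \log \hat y_\ell) = \begin{cases} -\cfrac{1}{\hat y_\ell} & \text{if } i = \ell \\ 0 & \text{otherwise} \end{cases} \)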
\(t \gets 0;\)
\(max\_iterations \gets 1000; \)
Initialize \(\theta_0 = [W_1^0,...,W_L^0,b_1^0,...,b_L^0];\)
while \(t\)++ \(< max\_iterations\) do
\(h_1,h_2,...,h_{L-1},a_1,a_2,...,a_L,\hat y = forward\_propagation(\theta_t)\)
\(\nabla \theta_t = backward\_propagation(h_1,h_2,...,h_{L-1},a_1,a_2,...,a_L,y,\hat y)\)
\(\theta_{t+1} \gets \theta_t - \eta \nabla \theta_t\)
end
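A minimal sketch of the update step inside this loop, assuming the parameters are kept as per-layer NumPy arrays (a layout chosen for the example); the forward and backward passes it relies on are spelled out in the pseudocode below:

```python
# theta_{t+1} <- theta_t - eta * grad theta_t, with theta stored as
# per-layer weight and bias arrays.
def gradient_step(Ws, bs, dWs, dbs, eta=0.1):
    Ws = [W - eta * dW for W, dW in zip(Ws, dWs)]
    bs = [b - eta * db for b, db in zip(bs, dbs)]
    return Ws, bs
```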
The subroutine \(forward\_propagation(\theta_t)\) computes:
for \(k = 1\) to \(L-1\) do
\(a_k = b_k + W_k h_{k-1} ;\)
\(h_k = g(a_k) ;\)
end
\(a_L = b_L + W_L h_{L-1} ;\)
\(\hat y = O(a_L) ;\)
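A NumPy sketch of this forward pass; the logistic \(g\), softmax \(O\), and the toy layer sizes are assumptions of the example:

```python
import numpy as np

def g(a):                                   # hidden activation (assumed logistic)
    return 1.0 / (1.0 + np.exp(-a))

def O(a):                                   # output function (assumed softmax)
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def forward_propagation(x, Ws, bs):
    """Return the lists [h_1..h_{L-1}], [a_1..a_L] and the prediction y_hat."""
    hs, as_ = [], []
    h = x
    for k in range(len(Ws) - 1):            # hidden layers k = 1 .. L-1
        a = bs[k] + Ws[k] @ h
        h = g(a)
        as_.append(a)
        hs.append(h)
    a_L = bs[-1] + Ws[-1] @ h                # output layer
    as_.append(a_L)
    return hs, as_, O(a_L)

# toy network: 4 inputs, one hidden layer of 3 units, 4 output classes (assumed sizes)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((4, 3))]
bs = [np.zeros(3), np.zeros(4)]
hs, as_, y_hat = forward_propagation(rng.standard_normal(4), Ws, bs)
print(y_hat)
```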
The subroutine \(backward\_propagation\) then computes:
\(\nabla _ {a_L} \mathscr {L} (\theta) = - (e(y) - \hat y); \)
for \(k = L\) to \(1\) do
\(\nabla _ {W_k} \mathscr {L} (\theta) = \nabla _ {a_k} \mathscr {L} (\theta) h_{k-1}^T ;\)
\(\nabla _ {b_k} \mathscr {L} (\theta) = \nabla _ {a_k} \mathscr {L} (\theta) ;\)
\(\nabla _ {h_{k-1}} \mathscr {L} (\theta) = W_k^T \nabla _ {a_k} \mathscr {L} (\theta) ;\)
\(\nabla _ {a_{k-1}} \mathscr {L} (\theta) = \nabla _ {h_{k-1}} \mathscr {L} (\theta) \odot [...,g' (a_{k-1,j}),...];\)
end
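A matching NumPy sketch of the backward pass, assuming a softmax output with the cross-entropy loss so that \(\nabla _ {a_L} \mathscr {L} (\theta) = -(e(y) - \hat y) = \hat y - y\); the two-layer network and random data in the demo are illustrative:

```python
import numpy as np

def g(a):                                    # logistic activation (assumed)
    return 1.0 / (1.0 + np.exp(-a))

def backward_propagation(x, hs, as_, y, y_hat, Ws):
    """Gradients of -log y_hat_l for a softmax output; returns (dWs, dbs)."""
    L = len(Ws)
    hs = [x] + hs                            # h_0 = x, so hs[k-1] is h_{k-1}
    dWs, dbs = [None] * L, [None] * L
    da = -(y - y_hat)                        # grad wrt a_L
    for k in range(L, 0, -1):                # k = L down to 1
        dWs[k - 1] = np.outer(da, hs[k - 1]) # grad wrt W_k = grad_a_k h_{k-1}^T
        dbs[k - 1] = da                      # grad wrt b_k
        if k > 1:
            dh = Ws[k - 1].T @ da            # grad wrt h_{k-1}
            da = dh * g(as_[k - 2]) * (1 - g(as_[k - 2]))  # grad wrt a_{k-1}
    return dWs, dbs

# tiny self-contained demo: 4 inputs -> 3 hidden -> 4 classes (assumed sizes)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((4, 3))]
bs = [np.zeros(3), np.zeros(4)]
x = rng.standard_normal(4)
a1 = bs[0] + Ws[0] @ x; h1 = g(a1)           # forward pass, written out inline
a2 = bs[1] + Ws[1] @ h1
e = np.exp(a2 - a2.max()); y_hat = e / e.sum()
y = np.array([1.0, 0.0, 0.0, 0.0])           # one-hot target ("Apple")
dWs, dbs = backward_propagation(x, [h1], [a1, a2], y, y_hat, Ws)
print(dWs[0].shape, dWs[1].shape)
```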
Logistic function
\(g(z) = \sigma (z) =\cfrac {1}{1+e^{-z}}\)
\(g'(z) = (-1) \cfrac {1}{(1+e^{-z})^2} \cfrac {d}{dz} (1+e^{-z})\)
\(= (-1) \cfrac {1}{(1+e^{-z})^2} (-e^{-z})\)
\(= \cfrac {1}{(1+e^{-z})} \cfrac {1+e^{-z}-1}{1+e^{-z}}\)
\(= g(z) (1-g(z))\)
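A quick numerical check of this identity against a centered finite difference (the evaluation points and step size are arbitrary):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
analytic = g(z) * (1 - g(z))                 # g'(z) = g(z)(1 - g(z))
eps = 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))    # tiny: the two agree
```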
tanh function
\(g(z) = \tanh (z) =\cfrac {e^z-e^{-z}}{e^z+e^{-z}}\)
\(g'(z) = \cfrac {\Bigg ((e^z+e^{-z}) \frac {d}{dz}(e^z-e^{-z}) \allowbreak - (e^z-e^{-z}) \frac {d}{dz} (e^z+e^{-z})\Bigg )}{(e^z+e^{-z})^2} \)
\(=\cfrac {(e^z+e^{-z})^2-(e^z-e^{-z})^2}{(e^z+e^{-z})^2}\)
\(=1- \cfrac {(e^z-e^{-z})^2}{(e^z+e^{-z})^2}\)
\(=1-(g(z))^2\)
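The same finite-difference check for \(g'(z) = 1 - (g(z))^2\):

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
analytic = 1 - np.tanh(z) ** 2               # g'(z) = 1 - (g(z))^2
eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))    # tiny: the two agree
```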