Machine Learning in Intelligent Transportation

Session 4: Deep FeedForward Model & Backpropagation

Ahmad Haj Mosa

PwC Austria & Alpen Adria Universität Klagenfurt

Klagenfurt 2020

Deep Feedforward Networks

Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning (Adaptive Computation and Machine Learning series). The MIT Press, p. 163.


  • Also called feedforward neural networks, or multilayer perceptrons (MLPs)

  • The goal of feedforward networks is to approximate some function \( y = f^*(x)\)

  • Information flows through the function being evaluated from \(x\), through the intermediate computations used to define \(f\), and finally to the output \(y\) (see the sketch below).

 

  •   There are no feedback connections in which outputs of the model are fed back into itself.
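For intuition, a feedforward network is simply a composition of functions evaluated from the input towards the output. The following minimal NumPy sketch (with made-up weights and input, not taken from the slides) shows information flowing from \(x\) through one intermediate computation to the output \(y\):

import numpy as np

# made-up input and weights, purely for illustration
x = np.array([1.0, 2.0, 3.0])
W1 = np.array([[0.1, 0.2, 0.3],
               [0.0, 0.1, 0.2]])   # first transformation (hidden layer)
W2 = np.array([[0.5, -0.4]])       # second transformation (output layer)

h = 1 / (1 + np.exp(-W1 @ x))      # intermediate computation
y = W2 @ h                         # output; no feedback connections anywhere
print(y)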

 

[Figure: a feedforward network with an input layer, hidden layers, and an output layer]

Deep Feedforward Model

[Figure: a three-layer network. Layer 1 holds the inputs \(x_1, x_2, x_3\) and a bias unit \(+1\); Layer 2 holds the hidden units \(a_1^{(2)}, a_2^{(2)}, a_3^{(2)}\) and a bias unit \(+1\); Layer 3 holds the output unit \(a_1^{(3)}\)]
a_1^{(2)}=f(W_{11}^{(1)}x_1+W_{12}^{(1)}x_2+W_{13}^{(1)}x_3+b_1^{(1)})
a_2^{(2)}=f(W_{21}^{(1)}x_1+W_{22}^{(1)}x_2+W_{23}^{(1)}x_3+b_2^{(1)})
a_3^{(2)}=f(W_{31}^{(1)}x_1+W_{32}^{(1)}x_2+W_{33}^{(1)}x_3+b_3^{(1)})
a_1^{(3)}=f(W_{11}^{(2)}a_1^{(2)}+W_{12}^{(2)}a_2^{(2)}+W_{13}^{(2)}a_3^{(2)}+b_1^{(2)})
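As a quick numerical illustration of these four equations, the short NumPy sketch below evaluates them once; all weights, biases, and inputs are made-up example values, not values from the slides:

import numpy as np

def f(z):
    return 1 / (1 + np.exp(-z))   # sigmoid activation

x = np.array([0.5, -1.0, 2.0])                    # x_1, x_2, x_3
W1 = np.array([[ 0.1, -0.2,  0.3],
               [ 0.4,  0.5, -0.6],
               [-0.7,  0.8,  0.9]])               # W^{(1)}
b1 = np.array([0.1, 0.2, 0.3])                    # b^{(1)}
W2 = np.array([[0.3, -0.1, 0.2]])                 # W^{(2)}
b2 = np.array([0.05])                             # b^{(2)}

a2 = f(W1 @ x + b1)     # a_1^{(2)}, a_2^{(2)}, a_3^{(2)}
a3 = f(W2 @ a2 + b2)    # a_1^{(3)}, the output of Layer 3
print(a2, a3)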

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)

Deep Feedforward Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)


  • Let \(z_i^{(L)}=\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)}) \) be the weighted input (pre-activation) of the neuron


  • \( f(z_i^{(L)})= \frac {1}{ 1+ e^{-z_i^{(L)}}} \) is the sigmoid activation function


  • \( a_i^{(L)}=f(z_i^{(L)})=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \) is the output model


  • \( y\) is the ground-truth (target) output, i.e. the desired value of the network output (\( a_{1}^{(3)}\) in the example above)

  • \(h_{W}(x)\) is the final output of the network

  • \(J(W)=\frac{1}{2}\parallel h_{W}(x) - y \parallel^2 \) is the cost function
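Putting these definitions together for a single output neuron, the sketch below (illustrative values only) computes the pre-activation \(z\), the activation \(a = h_W(x)\), and the cost \(J\) against a ground-truth \(y\):

import numpy as np

a_prev = np.array([0.2, 0.7, 0.1])   # activations a_j^{(L-1)} of the previous layer
W = np.array([0.5, -0.3, 0.8])       # weights W_{1j}^{(L-1)} into the neuron
y = 1.0                              # ground-truth output

z = W @ a_prev                       # z_1^{(L)}
a = 1 / (1 + np.exp(-z))             # a_1^{(L)} = f(z_1^{(L)}) = h_W(x)
J = 0.5 * (a - y) ** 2               # J(W) = 1/2 ||h_W(x) - y||^2
print(z, a, J)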


Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)

  • The gradient descent update is given by
W_{ij}^{(L-1)}=W_{ij}^{(L-1)}-\rho \frac{\partial}{\partial W_{ij}^{(L-1)}}J(W_{ij}^{(L-1)})
[Diagram: \(J\) is a function of \(a_i^{(L)}\), which is a function of \(z_i^{(L)}\), which is in turn a function of \(W_{i,*}^{(L-1)}\)]
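To make the update rule concrete, the sketch below runs a few gradient descent steps on a simple one-parameter cost that stands in for \(J(W)\); the cost function and learning rate are chosen only for illustration:

# gradient descent on J(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0      # initial weight
rho = 0.1    # learning rate (rho in the slides)

for step in range(50):
    grad = 2 * (w - 3)    # dJ/dw
    w = w - rho * grad    # w <- w - rho * dJ/dw

print(w)   # approaches the minimizer w = 3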

Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)


\frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}=\frac{\partial J(W_{ij}^{(L-1)})}{\partial a_{i}^{(L)}}\frac{\partial a_{i}^{(L)}}{\partial z_{i}^{(L)}}\frac{\partial z_{i}^{(L)}}{\partial W_{ij}^{(L-1)}}

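One way to sanity-check this decomposition is to compare the product of the three factors with a finite-difference estimate of \(\partial J/\partial W_{ij}^{(L-1)}\) for a single sigmoid neuron. The sketch below uses made-up values and is not part of the slides' code:

import numpy as np

f = lambda z: 1 / (1 + np.exp(-z))    # sigmoid

a_prev = np.array([0.2, 0.7, 0.1])    # a^{(L-1)}
W = np.array([0.5, -0.3, 0.8])        # weights into the neuron
y = 1.0                               # ground-truth output

def cost(W_row):
    a = f(W_row @ a_prev)
    return 0.5 * (a - y) ** 2

# analytic gradient via the chain rule: dJ/da * da/dz * dz/dW_j
z = W @ a_prev
a = f(z)
analytic = (a - y) * a * (1 - a) * a_prev

# central finite differences, one weight at a time
eps = 1e-6
numeric = np.array([(cost(W + eps * np.eye(3)[j]) - cost(W - eps * np.eye(3)[j])) / (2 * eps)
                    for j in range(3)])
print(analytic, numeric)   # the two estimates should agree closely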

Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)


\frac{\partial z_{i}^{(L)}}{\partial W_{ij}^{(L-1)}}= \frac{\partial }{\partial W_{ij}^{(L-1)}} (\sum_{k=1}^{r^{(L-1)}}(W_{ik}^{(L-1)}a_k^{(L-1)}))=\frac{\partial }{\partial W_{ij}^{(L-1)}} (W_{ij}^{(L-1)}a_j^{(L-1)})=a_j^{(L-1)}

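As a quick numeric check of this result (illustrative values only): perturbing \(W_{ij}^{(L-1)}\) by a small \(\epsilon\) changes \(z_i^{(L)}\) by approximately \(\epsilon\, a_j^{(L-1)}\).

import numpy as np

a_prev = np.array([0.2, 0.7, 0.1])    # a^{(L-1)}
W_row = np.array([0.5, -0.3, 0.8])    # the i-th row of W^{(L-1)}
eps, j = 1e-6, 1

z = W_row @ a_prev
W_pert = W_row.copy()
W_pert[j] += eps
print((W_pert @ a_prev - z) / eps, a_prev[j])   # both are approximately a_j^{(L-1)}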

Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)


\frac{\partial a_{i}^{(L)}}{\partial z_{i}^{(L)}}= \frac{\partial }{\partial z_{i}^{(L)}} f(z_{i}^{(L)})=f(z_{i}^{(L)})(1-f(z_{i}^{(L)}))

where \( f(z_{i}^{(L)})= \frac {1}{ 1+ e^{-z_i^{(L)}}} \) is the sigmoid activation function.
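The identity used above, \(f^{\prime}(z)=f(z)(1-f(z))\), follows directly from differentiating the sigmoid:

\frac{\partial }{\partial z}\frac{1}{1+e^{-z}}=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=f(z)(1-f(z))

since \(1-f(z)=\frac{e^{-z}}{1+e^{-z}}\).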

Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)

If \(L\) is the output layer:

\frac{\partial J}{\partial a_{i}^{(L)}}= \frac{\partial }{\partial a_{i}^{(L)}} \frac{1}{2}\parallel h_{W}(x) - y \parallel^2=(a_{i}^{(L)}-y)

Chain Model

If \(L\) is a hidden layer, \(a_i^{(L)}\) influences \(J\) only through the units \(a_1^{(L+1)}, a_2^{(L+1)}, \dots, a_{r^{(L+1)}}^{(L+1)}\) of the next layer, so

\frac{\partial J}{\partial a_{i}^{(L)}}= \sum_{k=1}^{r^{(L+1)}}\frac{\partial J}{\partial a_{k}^{(L+1)}}\frac{\partial a_{k}^{(L+1)}}{\partial a_{i}^{(L)}}

where, since \(a_{k}^{(L+1)}=f(z_{k}^{(L+1)})\),

\frac{\partial a_{k}^{(L+1)}}{\partial a_{i}^{(L)}}= f^{\prime}(z_{k}^{(L+1)})\frac{\partial }{\partial a_{i}^{(L)}} \sum_{t=1}^{r^{(L)}}(W_{kt}^{(L)}a_t^{(L)})=f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}

\frac{\partial J}{\partial a_{i}^{(L)}}= \sum_{k=1}^{r^{(L+1)}}\frac{\partial J}{\partial a_{k}^{(L+1)}}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}

Chain Models

If we denote \( \frac{\partial J}{\partial a_{i}^{(L)}}\) as \( \delta_i^{(L)}\), then

\frac{\partial J}{\partial a_{i}^{(L)}}= \sum_{k=1}^{r^{(L+1)}}\frac{\partial J}{\partial a_{k}^{(L+1)}}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}

\delta_i^{(L)}= \sum_{k=1}^{r^{(L+1)}}\delta_k^{(L+1)}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}
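In vectorized form this recursion reads \( \delta^{(L)}=(W^{(L)})^{T}\big(\delta^{(L+1)}\odot f^{\prime}(z^{(L+1)})\big) \), where \(\odot\) denotes element-wise multiplication. A minimal NumPy sketch with made-up values:

import numpy as np

# made-up quantities: layer L has 3 units, layer L+1 has 2 units
W_L = np.array([[0.2, -0.5,  0.1],
                [0.4,  0.3, -0.2]])   # W^{(L)}, entry (k, i) is W_ki^{(L)}
delta_next = np.array([0.6, -0.1])    # delta^{(L+1)}
a_next = np.array([0.7, 0.4])         # a^{(L+1)} = f(z^{(L+1)})

fprime_next = a_next * (1 - a_next)   # f'(z^{(L+1)}) for the sigmoid

# delta_i^{(L)} = sum_k delta_k^{(L+1)} * f'(z_k^{(L+1)}) * W_ki^{(L)}
delta_L = W_L.T @ (delta_next * fprime_next)
print(delta_L)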

Backpropagation formulas

If \(L\) is the output layer:

\frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}= a_j^{(L-1)}\, f^{\prime}(z_{i}^{(L)})\,(a_{i}^{(L)}-y)

If \(L\) is a hidden layer:

\frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}= a_j^{(L-1)}\, f^{\prime}(z_{i}^{(L)})\sum_{k=1}^{r^{(L+1)}}\delta_k^{(L+1)}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}

The Backpropagation Algorithm

  • Repeat
    • Perform a feedforward pass, computing the activations of all layers
    • For the output layer, set:
      \frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}= a_j^{(L-1)}\, f^{\prime}(z_{i}^{(L)})\,(a_{i}^{(L)}-y)
    • For the hidden layers, set:
      \frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}= a_j^{(L-1)}\, f^{\prime}(z_{i}^{(L)})\sum_{k=1}^{r^{(L+1)}}\delta_k^{(L+1)}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}
    • Update the weights:
      W_{ij}^{(L-1)}=W_{ij}^{(L-1)}-\rho \frac{\partial}{\partial W_{ij}^{(L-1)}}J(W_{ij}^{(L-1)})
    • If the target is achieved (minimum cost or maximum number of iterations), stop training

The Backpropagation Code

	
import numpy as np

# define the sigmoid function
def sigmoid(x, derivative=False):
    # note: when derivative=True, x is expected to already be the sigmoid output,
    # since sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    if derivative:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# choose a random seed for reproducible results
np.random.seed(1)

# learning rate
alpha = .1

# number of nodes in the hidden layer
num_hidden = 3

# inputs
X = np.array([  
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# outputs
# x.T is the transpose of x, making this a column vector
y = np.array([[0, 1, 0, 1, 1, 0]]).T

# initialize weights randomly with mean 0 and range [-1, 1]
# the +1 in the 1st dimension of the weight matrices is for the bias weight
hidden_weights = 2*np.random.random((X.shape[1] + 1, num_hidden)) - 1
output_weights = 2*np.random.random((num_hidden + 1, y.shape[1])) - 1

# number of iterations of gradient descent
num_iterations = 10000

# for each iteration of gradient descent
for i in range(num_iterations):

    # forward phase
    # np.hstack((np.ones(...), X)) prepends a column of ones for the bias weight
    input_layer_outputs = np.hstack((np.ones((X.shape[0], 1)), X))
    hidden_layer_outputs = np.hstack((np.ones((X.shape[0], 1)), sigmoid(np.dot(input_layer_outputs, hidden_weights))))
    output_layer_outputs = np.dot(hidden_layer_outputs, output_weights)

    # backward phase
    # output layer error term
    output_error = output_layer_outputs - y
    # hidden layer error term
    # [:, 1:] removes the bias term from the backpropagation
    hidden_error = hidden_layer_outputs[:, 1:] * (1 - hidden_layer_outputs[:, 1:]) * np.dot(output_error, output_weights.T[:, 1:])

    # partial derivatives
    hidden_pd = input_layer_outputs[:, :, np.newaxis] * hidden_error[: , np.newaxis, :]
    output_pd = hidden_layer_outputs[:, :, np.newaxis] * output_error[:, np.newaxis, :]

    # average for total gradients
    total_hidden_gradient = np.average(hidden_pd, axis=0)
    total_output_gradient = np.average(output_pd, axis=0)

    # update weights
    hidden_weights -= alpha * total_hidden_gradient
    output_weights -= alpha * total_output_gradient

# print the final outputs of the neural network on the inputs X
print("Output After Training: \n{}".format(output_layer_outputs))