Machine Learning in Intelligent Transportation

Session 4: Deep FeedForward Model & Backpropagation

Ahmad Haj Mosa

PwC Austria & Alpen Adria Universität Klagenfurt

Klagenfurt 2020

Deep Feedforward Networks

Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning (Adaptive Computation and Machine Learning series). The MIT Press, p. 163.


  • Also called feedforward neural networks, or multilayer perceptrons (MLPs)

  • The goal of feedforward networks is to approximate some function \( y = f^*(x)\)

  • Information flows through the function being evaluated from \(x\), through the intermediate computations used to define \(f\), and finally to the output \(y\) (see the sketch below).

 

  •   There are no feedback connections in which outputs of the model are fed back into itself.
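For intuition, a feedforward network is simply a composition of functions evaluated from the input towards the output. The following minimal NumPy sketch (with made-up weights and input, not taken from the slides) shows information flowing from \(x\) through one intermediate computation to the output \(y\):

import numpy as np

# made-up input and weights, purely for illustration
x = np.array([1.0, 2.0, 3.0])
W1 = np.array([[0.1, 0.2, 0.3],
               [0.0, 0.1, 0.2]])   # first transformation (hidden layer)
W2 = np.array([[0.5, -0.4]])       # second transformation (output layer)

h = 1 / (1 + np.exp(-W1 @ x))      # intermediate computation
y = W2 @ h                         # output; no feedback connections anywhere
print(y)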

 

[Figure: a feedforward network with an input layer, hidden layers, and an output layer]

Deep Feedforward Model

[Figure: a three-layer network. Layer 1 holds the inputs \(x_1, x_2, x_3\) and a bias unit \(+1\); Layer 2 holds the hidden units \(a_1^{(2)}, a_2^{(2)}, a_3^{(2)}\) and a bias unit \(+1\); Layer 3 holds the output unit \(a_1^{(3)}\)]
a_1^{(2)}=f(W_{11}^{(1)}x_1+W_{12}^{(1)}x_2+W_{13}^{(1)}x_3+b_1^{(1)})
a_2^{(2)}=f(W_{21}^{(1)}x_1+W_{22}^{(1)}x_2+W_{23}^{(1)}x_3+b_2^{(1)})
a_3^{(2)}=f(W_{31}^{(1)}x_1+W_{32}^{(1)}x_2+W_{33}^{(1)}x_3+b_3^{(1)})
a_1^{(3)}=f(W_{11}^{(2)}a_1^{(2)}+W_{12}^{(2)}a_2^{(2)}+W_{13}^{(2)}a_3^{(2)}+b_1^{(2)})
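As a quick numerical illustration of these four equations, the short NumPy sketch below evaluates them once; all weights, biases, and inputs are made-up example values, not values from the slides:

import numpy as np

def f(z):
    return 1 / (1 + np.exp(-z))   # sigmoid activation

x = np.array([0.5, -1.0, 2.0])                    # x_1, x_2, x_3
W1 = np.array([[ 0.1, -0.2,  0.3],
               [ 0.4,  0.5, -0.6],
               [-0.7,  0.8,  0.9]])               # W^{(1)}
b1 = np.array([0.1, 0.2, 0.3])                    # b^{(1)}
W2 = np.array([[0.3, -0.1, 0.2]])                 # W^{(2)}
b2 = np.array([0.05])                             # b^{(2)}

a2 = f(W1 @ x + b1)     # a_1^{(2)}, a_2^{(2)}, a_3^{(2)}
a3 = f(W2 @ a2 + b2)    # a_1^{(3)}, the output of Layer 3
print(a2, a3)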

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)

Deep Feedforward Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)


  • Let \(z_i^{(L)}=\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)}) \) be the weighted input (pre-activation) of the neuron


  • \( f(z_i^{(L)})= \frac {1}{ 1+ e^{-z_i^{(L)}}} \) is the sigmoid activation function


  • \( a_i^{(L)}=f(z_i^{(L)})=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \) is the output model


  • \( y\) is the ground-truth (target) output, i.e. the desired value of the network output (\( a_{1}^{(3)}\) in the example above)

  • \(h_{W}(x)\) is the final output of the network

  • \(J(W)=\frac{1}{2}\parallel h_{W}(x) - y \parallel^2 \) is the cost function
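Putting these definitions together for a single output neuron, the sketch below (illustrative values only) computes the pre-activation \(z\), the activation \(a = h_W(x)\), and the cost \(J\) against a ground-truth \(y\):

import numpy as np

a_prev = np.array([0.2, 0.7, 0.1])   # activations a_j^{(L-1)} of the previous layer
W = np.array([0.5, -0.3, 0.8])       # weights W_{1j}^{(L-1)} into the neuron
y = 1.0                              # ground-truth output

z = W @ a_prev                       # z_1^{(L)}
a = 1 / (1 + np.exp(-z))             # a_1^{(L)} = f(z_1^{(L)}) = h_W(x)
J = 0.5 * (a - y) ** 2               # J(W) = 1/2 ||h_W(x) - y||^2
print(z, a, J)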


Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)

  • The gradient descent update is given by
W_{ij}^{(L-1)}=W_{ij}^{(L-1)}-\rho \frac{\partial}{\partial W_{ij}^{(L-1)}}J(W_{ij}^{(L-1)})
[Diagram: \(J\) is a function of \(a_i^{(L)}\), which is a function of \(z_i^{(L)}\), which is in turn a function of \(W_{i,*}^{(L-1)}\)]
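To make the update rule concrete, the sketch below runs a few gradient descent steps on a simple one-parameter cost that stands in for \(J(W)\); the cost function and learning rate are chosen only for illustration:

# gradient descent on J(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0      # initial weight
rho = 0.1    # learning rate (rho in the slides)

for step in range(50):
    grad = 2 * (w - 3)    # dJ/dw
    w = w - rho * grad    # w <- w - rho * dJ/dw

print(w)   # approaches the minimizer w = 3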

Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)


\frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}=\frac{\partial J(W_{ij}^{(L-1)})}{\partial a_{i}^{(L)}}\frac{\partial a_{i}^{(L)}}{\partial z_{i}^{(L)}}\frac{\partial z_{i}^{(L)}}{\partial W_{ij}^{(L-1)}}

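One way to sanity-check this decomposition is to compare the product of the three factors with a finite-difference estimate of \(\partial J/\partial W_{ij}^{(L-1)}\) for a single sigmoid neuron. The sketch below uses made-up values and is not part of the slides' code:

import numpy as np

f = lambda z: 1 / (1 + np.exp(-z))    # sigmoid

a_prev = np.array([0.2, 0.7, 0.1])    # a^{(L-1)}
W = np.array([0.5, -0.3, 0.8])        # weights into the neuron
y = 1.0                               # ground-truth output

def cost(W_row):
    a = f(W_row @ a_prev)
    return 0.5 * (a - y) ** 2

# analytic gradient via the chain rule: dJ/da * da/dz * dz/dW_j
z = W @ a_prev
a = f(z)
analytic = (a - y) * a * (1 - a) * a_prev

# central finite differences, one weight at a time
eps = 1e-6
numeric = np.array([(cost(W + eps * np.eye(3)[j]) - cost(W - eps * np.eye(3)[j])) / (2 * eps)
                    for j in range(3)])
print(analytic, numeric)   # the two estimates should agree closely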

Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)


\frac{\partial z_{i}^{(L)}}{\partial W_{ij}^{(L-1)}}= \frac{\partial }{\partial W_{ij}^{(L-1)}} (\sum_{k=1}^{r^{(L-1)}}(W_{ik}^{(L-1)}a_k^{(L-1)}))=\frac{\partial }{\partial W_{ij}^{(L-1)}} (W_{ij}^{(L-1)}a_j^{(L-1)})=a_j^{(L-1)}

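As a quick numeric check of this result (illustrative values only): perturbing \(W_{ij}^{(L-1)}\) by a small \(\epsilon\) changes \(z_i^{(L)}\) by approximately \(\epsilon\, a_j^{(L-1)}\).

import numpy as np

a_prev = np.array([0.2, 0.7, 0.1])    # a^{(L-1)}
W_row = np.array([0.5, -0.3, 0.8])    # the i-th row of W^{(L-1)}
eps, j = 1e-6, 1

z = W_row @ a_prev
W_pert = W_row.copy()
W_pert[j] += eps
print((W_pert @ a_prev - z) / eps, a_prev[j])   # both are approximately a_j^{(L-1)}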

Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)


\frac{\partial a_{i}^{(L)}}{\partial z_{i}^{(L)}}= \frac{\partial }{\partial z_{i}^{(L)}} f(z_{i}^{(L)})=f(z_{i}^{(L)})(1-f(z_{i}^{(L)}))

where \( f(z_{i}^{(L)})= \frac {1}{ 1+ e^{-z_i^{(L)}}} \) is the sigmoid activation function.
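The identity used above, \(f^{\prime}(z)=f(z)(1-f(z))\), follows directly from differentiating the sigmoid:

\frac{\partial }{\partial z}\frac{1}{1+e^{-z}}=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=f(z)(1-f(z))

since \(1-f(z)=\frac{e^{-z}}{1+e^{-z}}\).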

Chain Model

The general neuron model is given by:

\(a_i^{(L)}=f(\sum_{j=1}^{r^{(L-1)}}(W_{ij}^{(L-1)}a_j^{(L-1)})) \)

If \(L\) is the output layer:

\frac{\partial J}{\partial a_{i}^{(L)}}= \frac{\partial }{\partial a_{i}^{(L)}} \frac{1}{2}\parallel h_{W}(x) - y \parallel^2=(a_{i}^{(L)}-y)

Chain Model

If \(L\) is a hidden layer, \(a_i^{(L)}\) influences \(J\) only through the units \(a_1^{(L+1)}, a_2^{(L+1)}, \dots, a_{r^{(L+1)}}^{(L+1)}\) of the next layer, so

\frac{\partial J}{\partial a_{i}^{(L)}}= \sum_{k=1}^{r^{(L+1)}}\frac{\partial J}{\partial a_{k}^{(L+1)}}\frac{\partial a_{k}^{(L+1)}}{\partial a_{i}^{(L)}}

where, since \(a_{k}^{(L+1)}=f(z_{k}^{(L+1)})\),

\frac{\partial a_{k}^{(L+1)}}{\partial a_{i}^{(L)}}= f^{\prime}(z_{k}^{(L+1)})\frac{\partial }{\partial a_{i}^{(L)}} \sum_{t=1}^{r^{(L)}}(W_{kt}^{(L)}a_t^{(L)})=f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}

\frac{\partial J}{\partial a_{i}^{(L)}}= \sum_{k=1}^{r^{(L+1)}}\frac{\partial J}{\partial a_{k}^{(L+1)}}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}

Chain Models

If we denote \( \frac{\partial J}{\partial a_{i}^{(L)}}\) as \( \delta_i^{(L)}\), then

\frac{\partial J}{\partial a_{i}^{(L)}}= \sum_{k=1}^{r^{(L+1)}}\frac{\partial J}{\partial a_{k}^{(L+1)}}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}

\delta_i^{(L)}= \sum_{k=1}^{r^{(L+1)}}\delta_k^{(L+1)}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}
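In vectorized form this recursion reads \( \delta^{(L)}=(W^{(L)})^{T}\big(\delta^{(L+1)}\odot f^{\prime}(z^{(L+1)})\big) \), where \(\odot\) denotes element-wise multiplication. A minimal NumPy sketch with made-up values:

import numpy as np

# made-up quantities: layer L has 3 units, layer L+1 has 2 units
W_L = np.array([[0.2, -0.5,  0.1],
                [0.4,  0.3, -0.2]])   # W^{(L)}, entry (k, i) is W_ki^{(L)}
delta_next = np.array([0.6, -0.1])    # delta^{(L+1)}
a_next = np.array([0.7, 0.4])         # a^{(L+1)} = f(z^{(L+1)})

fprime_next = a_next * (1 - a_next)   # f'(z^{(L+1)}) for the sigmoid

# delta_i^{(L)} = sum_k delta_k^{(L+1)} * f'(z_k^{(L+1)}) * W_ki^{(L)}
delta_L = W_L.T @ (delta_next * fprime_next)
print(delta_L)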

Backpropagation formulas

If \(L\) is the output layer:

\frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}= a_j^{(L-1)}\, f^{\prime}(z_{i}^{(L)})\,(a_{i}^{(L)}-y)

If \(L\) is a hidden layer:

\frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}= a_j^{(L-1)}\, f^{\prime}(z_{i}^{(L)})\sum_{k=1}^{r^{(L+1)}}\delta_k^{(L+1)}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}

The Backpropagation Algorithm

  • Repeat
    • Perform a feedforward pass, computing the activations of all layers
    • For the output layer, set:
      \frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}= a_j^{(L-1)}\, f^{\prime}(z_{i}^{(L)})\,(a_{i}^{(L)}-y)
    • For the hidden layers, set:
      \frac{\partial J(W_{ij}^{(L-1)})}{\partial W_{ij}^{(L-1)}}= a_j^{(L-1)}\, f^{\prime}(z_{i}^{(L)})\sum_{k=1}^{r^{(L+1)}}\delta_k^{(L+1)}f^{\prime}(z_{k}^{(L+1)})W_{ki}^{(L)}
    • Update the weights:
      W_{ij}^{(L-1)}=W_{ij}^{(L-1)}-\rho \frac{\partial}{\partial W_{ij}^{(L-1)}}J(W_{ij}^{(L-1)})
    • If the target is achieved (minimum cost or maximum number of iterations), stop training

The Backpropagation Code

	
import numpy as np

# define the sigmoid function
def sigmoid(x, derivative=False):
    # note: when derivative=True, x is expected to already be the sigmoid output,
    # since sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    if derivative:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# choose a random seed for reproducible results
np.random.seed(1)

# learning rate
alpha = .1

# number of nodes in the hidden layer
num_hidden = 3

# inputs
X = np.array([  
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# outputs
# x.T is the transpose of x, making this a column vector
y = np.array([[0, 1, 0, 1, 1, 0]]).T

# initialize weights randomly with mean 0 and range [-1, 1]
# the +1 in the 1st dimension of the weight matrices is for the bias weight
hidden_weights = 2*np.random.random((X.shape[1] + 1, num_hidden)) - 1
output_weights = 2*np.random.random((num_hidden + 1, y.shape[1])) - 1

# number of iterations of gradient descent
num_iterations = 10000

# for each iteration of gradient descent
for i in range(num_iterations):

    # forward phase
    # np.hstack((np.ones(...), X)) prepends a column of ones for the bias weight
    input_layer_outputs = np.hstack((np.ones((X.shape[0], 1)), X))
    hidden_layer_outputs = np.hstack((np.ones((X.shape[0], 1)), sigmoid(np.dot(input_layer_outputs, hidden_weights))))
    output_layer_outputs = np.dot(hidden_layer_outputs, output_weights)

    # backward phase
    # output layer error term
    output_error = output_layer_outputs - y
    # hidden layer error term
    # [:, 1:] removes the bias term from the backpropagation
    hidden_error = hidden_layer_outputs[:, 1:] * (1 - hidden_layer_outputs[:, 1:]) * np.dot(output_error, output_weights.T[:, 1:])

    # partial derivatives
    hidden_pd = input_layer_outputs[:, :, np.newaxis] * hidden_error[: , np.newaxis, :]
    output_pd = hidden_layer_outputs[:, :, np.newaxis] * output_error[:, np.newaxis, :]

    # average for total gradients
    total_hidden_gradient = np.average(hidden_pd, axis=0)
    total_output_gradient = np.average(output_pd, axis=0)

    # update weights
    hidden_weights -= alpha * total_hidden_gradient
    output_weights -= alpha * total_output_gradient

# print the final outputs of the neural network on the inputs X
print("Output After Training: \n{}".format(output_layer_outputs))