NEURAL NETWORKS

and beyond....

Aadharsh Aadhithya A

Amrita Vishwa Vidyapeetham

Center for Computational Engineering and Networking 

NEURAL NETWORKS

and beyond....

  • Introduction
  • Perceptron
  • Single-layered Neural Network
  • Multi-layered Neural Network
  • Implicit Layers
  • Implicit Function Theorem
  • Opt Net
  • Future Directions

Why NN?

  • What is AI?
  • Goals of AI?
  • Turing's Test of Intelligence
  • What exactly is Intelligence?
  • Why are NNs able to "mimic intelligence"?
  • Are NNs really intelligent? No. Why, then?

Introduction

Why Deep Learning? (Applications)

What Are Neural Networks?

Models that mimic human intelligence?

How to mimic?

  • Associations 
  • Connections

But where are associations stored?

  • By the mid-1800s, it was discovered that the brain is made up of connected cells called "neurons".
  • Neurons excite and stimulate each other.
  • Neurons connect to other neurons; the processing capacity of the brain is a function of these connections.

MP (McCulloch-Pitts) Neuron

End-to-End Learning

Learn the mapping directly from input to output, without the need for intermediate representations.

Overcome cumbersome pipelines such as hand-crafted feature extraction and multiple processing steps.

Perceptrons

[Diagram: inputs x_1, x_2, x_3, weighted by w_1, w_2, w_3, are aggregated by \Sigma and passed through a non-linear activation to produce the output \hat{y}]

\hat{y} = g \left ( \sum_{i = 1}^{m}x_i \cdot w_i \right )

\text{Output: } \hat{y} \qquad \text{Aggregation: } \sum_{i = 1}^{m}x_i \cdot w_i \qquad \text{Non-linear Activation Function: } g
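As a quick illustration (not from the slides), a minimal NumPy sketch of this forward pass, assuming a sigmoid activation for g:

import numpy as np

# Minimal sketch of the perceptron forward pass: aggregate the weighted
# inputs and apply a non-linear activation g (sigmoid assumed here).
def perceptron_forward(x, w, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
    z = np.dot(w, x)   # aggregation: sum_i x_i * w_i
    return g(z)        # non-linear activation -> y_hat

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.2, 0.4])
print(perceptron_forward(x, w))  # ~0.8909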

Single layer neural network

[Diagram: inputs x_1, x_2, x_3 with weights w_1, w_2, w_3 feed an activation g(z) that produces the output \hat{y}]

[Diagram: the same network with two output neurons; weights w_{11}, w_{12}, w_{13} produce g(z_1) = \hat{y}_1 and weights w_{21}, w_{22}, w_{23} produce g(z_2) = \hat{y}_2]
Single layer neural network

x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} , \quad y = 1 , \quad \text{random weights: } w = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.4 \end{bmatrix}

z = w^T \cdot x = 1 \cdot 0.5 + 2 \cdot 0.2 + 3 \cdot 0.4 = 2.1

Single layer neural network

\text{Activation Function}

Sigmoid Function

g(z) = \frac{1}{1 + e^{-z} }
g'(z) = g(z) \cdot (1 - g(z) )
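A small sketch (illustrative, not from the slides) of the sigmoid and its derivative, checked at the z = 2.1 of the running example:

import numpy as np

# The sigmoid activation and its derivative, as defined above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)      # g'(z) = g(z) * (1 - g(z))

print(sigmoid(2.1))           # ~0.8909, the y_hat in the running example
print(sigmoid_grad(2.1))      # ~0.0972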

Single layer neural network

g(z) = \frac{1}{1 + e^{-z} }

z = 2.1 , \quad \hat{y} = g(2.1) = 0.8909

Single layer neural network

Oops! 🤥 Wasn't the true output 1? (We predicted \hat{y} = 0.8909, but y = 1.)

We should punish the network, so it behaves properly.

Answer: LOSS FUNCTIONS

Single layer neural network

LOSS FUNCTIONS

Binary Cross Entropy Loss

L(y , \hat{y}) = - y \cdot log(\hat{y} ) - (1-y) \cdot log(1 - \hat{y})
L(y , \hat{y} ) = \begin{cases} - log(\hat{y}) && y == 1 \\ - log(1 - \hat{y}) && y == 0 \end{cases}

Judge?? Nah, classify! 😎

y = 1 \\ L(y , \hat{y} ) = - log(\hat{y})
y = 0 \\ L(y, \hat{y}) = -log(1-\hat{y})
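For concreteness, a tiny sketch of this loss (natural log assumed, matching the worked example where L(1, 0.8909) ≈ 0.1155):

import numpy as np

# Binary cross-entropy loss as defined above.
def bce_loss(y, y_hat):
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(bce_loss(1, 0.8909))    # ~0.1155
print(bce_loss(0, 0.8909))    # ~2.216: heavily penalised when the true label is 0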

The gradient of the loss function dictates whether to increase or decrease the weights and biases of a neural network.

The gradient points up the curve, in the increasing direction, so we need to move in the opposite direction:

w = w + (-1) \cdot \alpha \left ( \frac{\partial L(y,\hat{y} )}{\partial w} \right )

\alpha \rightarrow \text{scalar} \rightarrow \text{Learning Rate}

Single layer neural network

L(y , \hat{y}) = - y \cdot log(\hat{y} ) - (1-y) \cdot log(1 - \hat{y})

L(1 , 0.8909) = - 1 \cdot log(0.8909) - (1-1) \cdot log(1 - 0.8909) = 0.1155

Single layer neural network

\frac{\partial L}{\partial \hat{y}} = \frac{-y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} , \qquad \frac{ \partial L(1 , 0.8909) }{ \partial \hat{y}} = - \frac{1}{0.8909} = -1.123

\frac{ \partial L} {\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \left( -\frac{1} {\hat{y}} \right) \cdot \left( \hat{y} (1 - \hat{y}) \right) = -1.123 \cdot (0.8909(1-0.8909)) = -0.1092

Single layer neural network

\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w_i} , \qquad \frac{\partial z}{\partial w_i} = \frac{\partial \sum w_i x_i}{\partial w_i} = x_i

\frac{\partial L}{\partial w_1} = -0.1092 \cdot 1 , \quad \frac{\partial L}{\partial w_2} = -0.1092 \cdot 2 , \quad \frac{\partial L}{\partial w_3} = -0.1092 \cdot 3

Single layer neural network

\frac{\partial L}{\partial w_1} = -0.1092 , \quad \frac{\partial L}{\partial w_2} = -0.2184 , \quad \frac{\partial L}{\partial w_3} = -0.3276

w_i = w_i - \alpha \cdot \frac{\partial L}{\partial w_i} , \qquad \alpha \rightarrow \text{Learning rate} = 1

w_1 = 0.5 - 1 \cdot (-0.1092) = 0.6092 , \quad w_2 = 0.2 - 1 \cdot (-0.2184) = 0.4184 , \quad w_3 = 0.4 - 1 \cdot (-0.3276) = 0.7276

Single layer neural network

x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} , \quad y = 1 , \quad \text{updated weights: } w = \begin{bmatrix} 0.6092 \\ 0.4184 \\ 0.7276 \end{bmatrix}

z = w^T \cdot x = 1 \cdot 0.6092 + 2 \cdot 0.4184 + 3 \cdot 0.7276 = 3.6288

g(z) = g(3.6288) = 0.9739

L(1 , 0.9739) = 0.026 , \qquad \text{Old loss} = 0.1155
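The whole single-neuron update can be reproduced in a few lines of NumPy (a sketch of the worked example above, not code from the report):

import numpy as np

# Reproduce the worked example: forward pass, BCE loss, manual gradients,
# and one gradient-descent step with learning rate alpha = 1.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])
y = 1.0
w = np.array([0.5, 0.2, 0.4])
alpha = 1.0

z = w @ x                                                  # 2.1
y_hat = sigmoid(z)                                         # ~0.8909
loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)    # ~0.1155

dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)              # ~-1.123
dL_dz = dL_dyhat * y_hat * (1 - y_hat)                     # ~-0.1092
dL_dw = dL_dz * x                                          # [-0.1092, -0.2184, -0.3276]

w = w - alpha * dL_dw                                      # [0.6092, 0.4184, 0.7276]
new_loss = -np.log(sigmoid(w @ x))                         # ~0.026, down from ~0.1155
print(w, loss, new_loss)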

Multi Layer Network

[Diagram: inputs x_1, x_2, x_3 feed a hidden layer with activations g(z_1)^1, g(z_2)^1 through weights w^1_{jk}; the hidden layer feeds an output neuron g(z_1)^2 through weights w^2_{11}, w^2_{12}, producing \hat{y}]

W^i = \begin{bmatrix} w^i_{11} & w^i_{12} & w^i_{13} & \cdots&w^i_{1n} \\ w^i_{21} & w^i_{22} & w^i_{23} & \cdots&w^i_{2n} \\ \vdots & & \ddots & \vdots \\ w^i_{m1} & w^i_{m2} & w^i_{m3} & \cdots&w^i_{mn} \\ \end{bmatrix} , \qquad X^i = \begin{bmatrix} x_1 \\ x_2\\ \vdots \\ x_n \end{bmatrix} , \qquad b^i = \begin{bmatrix} b_1^i \\ b_2^i \\ \vdots \\ b_m^i \end{bmatrix}

m \rightarrow \text{no. of neurons in the } i^{th} \text{ layer} , \quad n \rightarrow \text{no. of neurons in the } (i-1)^{th} \text{ layer} , \quad X^i \rightarrow \text{output of the } (i-1)^{th} \text{ layer}

Multi Layer Network

Forward Pass

[Diagram: each layer aggregates its weighted inputs (\Sigma) and applies an activation; the final output \hat{y} feeds the loss L(y,\hat{y})]

z^i = W^i \cdot x^i + b^i \\ y^i = g(z^i)
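A short NumPy sketch of this forward step (shapes are illustrative: 3 inputs feeding a layer of 2 neurons):

import numpy as np

# One vectorized forward step: z^i = W^i x^i + b^i, y^i = g(z^i).
rng = np.random.default_rng(0)

x = np.array([[1.0], [2.0], [3.0]])   # output of the previous layer, shape (3, 1)
W = rng.standard_normal((2, 3))       # 2 neurons, each with 3 incoming weights
b = rng.standard_normal((2, 1))       # one bias per neuron

z = W @ x + b                         # pre-activations, shape (2, 1)
y = 1.0 / (1.0 + np.exp(-z))          # sigmoid activations, shape (2, 1)
print(y)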

Multi Layer Network

Forward Pass (layer by layer)

z^1 = W^1 \cdot x^1 + b^1 , \quad y^1 = g(z^1) , \quad \cdots , \quad z^i = W^i \cdot x^i + b^i , \quad y^i = g(z^i)

Multi Layer Network

BackPropagation

[Diagram: the same two-layer network, now traversed backwards from the loss L(y,\hat{y}) towards the weights]

Multi Layer Network

BackPropagation

w_{j k}^i \rightarrow i^{th} \text{ layer , } j^{th} \text{ neuron , mapping to the } k^{th} \text{ input}

Multi Layer Network

BackPropagation

\text{We want to compute } \frac{\partial L(y , \hat{y} )}{\partial w^i_{jk}}

Multi Layer Network

BackPropagation

\text{Changing } \hat{y} \text{ changes } L(y,\hat{y}) \text{ , so}

\frac{\partial L(y , \hat{y} )}{\partial w^i_{jk}} = \frac{\partial L(y , \hat{y} )}{\partial \hat{y} } \cdot \frac{\partial \hat{y} }{\partial w^i_{jk}}

\text{The first factor we can derive ; the second is still to be computed.}

Here, we can resort to using the chain rule.

How does g change with x?

Changing x changes h(x) : \frac{d h(x)}{dx}

Changing h changes g : \frac{d g}{dh}

\frac{dg} {dx} = \frac{d h(x)}{dx} \cdot \frac{d g}{dh}
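A tiny numeric check of the chain rule (arbitrary example functions, not from the slides):

# Chain rule check with h(x) = 3x + 1 and g(h) = h**2, so dg/dx = dh/dx * dg/dh.
x = 2.0
h = lambda x: 3 * x + 1
g = lambda h: h ** 2

analytic = 3 * (2 * h(x))                  # dh/dx * dg/dh = 3 * 2*h(x) = 42
eps = 1e-6
numeric = (g(h(x + eps)) - g(h(x - eps))) / (2 * eps)
print(analytic, numeric)                   # both ~42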

Multi Layer Network

BackPropagation

\text{How do we reach } w^i_{jk} \text{ from } L(y,\hat{y}) ??

Follow the RED path through the network, from the loss back to the weight!

Multi Layer Network

BackPropagation

Build the chain factor by factor, following the path from the loss back to the weight:

\frac{\partial L(y,\hat{y})}{\partial \hat{y}} : \text{how } L(y , \hat{y} ) \text{ changes with } \hat{y}

\frac{\partial \hat{y}}{\partial z_1^{i+1}} : \text{how } \hat{y} \text{ changes with } z_1^{i+1}

\frac{\partial z_1^{i+1} }{\partial g(z_1^i)} : \text{how } z_1^{i+1} \text{ changes with } g(z_1^i)

\frac{\partial g(z_1^i) }{\partial z_1^i} : \text{how } g(z_1^i) \text{ changes with } z_1^i

\frac{\partial z_1^i}{\partial w^i_{jk}} : \text{how } z_1^i \text{ changes with } w^i_{jk}

Putting it all together, how L(y,\hat{y}) changes with w^i_{jk}:

\frac{\partial L(y , \hat{y} )}{\partial w^i_{jk}} = \frac{\partial L(y,\hat{y})}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1^{i+1}} \cdot \frac{\partial z_1^{i+1} }{\partial g(z_1^i)} \cdot \frac{\partial g(z_1^i) }{\partial z_1^i} \cdot \frac{\partial z_1^i}{\partial w^i_{jk}}

Multi Layer Network

BackPropagation

This is for a single weight w_{jk}^i. For the full matrix W^i, a detailed derivation (for a Softmax output layer and a Sigmoid activation function) is given in the accompanying material (project report).

Multi Layer Network

BackPropagation

[Derivation figures omitted — Source: Project Report, Group01.pdf]

So now we have all the vectorized components to build our chain.

Full Story

Source: Project Report, Group01.pdf
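The report's figures are not reproduced here; as a hedged summary, the vectorized gradients implemented by the layer classes that follow (sigmoid activations, cross-entropy cost over m examples, writing A^i for the activations g(Z^i)) are:

dZ^{L} = \frac{1}{m} \left( \hat{Y} - Y \right)

dW^{i} = dZ^{i} \cdot (A^{i-1})^T , \qquad db^{i} = \textstyle\sum_{\text{columns}} dZ^{i} , \qquad dA^{i-1} = (W^{i})^T \cdot dZ^{i}

dZ^{i-1} = dA^{i-1} \odot g'(Z^{i-1}) = dA^{i-1} \odot A^{i-1} \odot (1 - A^{i-1})

These correspond directly to compute_stable_bce_cost (which returns dZ_last), LinearLayer.backward (dW, db, dA_prev) and SigmoidLayer.backward (dZ) in the code below.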

Multi Layer Network

import numpy as np  # import numpy library
from util.paramInitializer import initialize_parameters  # import function to initialize weights and biases


class LinearLayer:
    """
        This Class implements all functions to be executed by a linear layer
        in a computational graph

        Args:
            input_shape: input shape of Data/Activations
            n_out: number of neurons in layer
            ini_type: initialization type for weight parameters, default is "plain"
                      Options are: plain, xavier and he

        Methods:
            forward(A_prev)
            backward(upstream_grad)
            update_params(learning_rate)

    """

    def __init__(self, input_shape, n_out, ini_type="plain"):
        """
        The constructor of the LinearLayer takes the following parameters

        Args:
            input_shape: input shape of Data/Activations
            n_out: number of neurons in layer
            ini_type: initialization type for weight parameters, default is "plain"
        """

        self.m = input_shape[1]  # number of examples in training data
        # `params` store weights and bias in a python dictionary
        self.params = initialize_parameters(input_shape[0], n_out, ini_type)  # initialize weights and bias
        self.Z = np.zeros((self.params['W'].shape[0], input_shape[1]))  # create space for resultant Z output

    def forward(self, A_prev):
        """
        This function performs the forwards propagation using activations from previous layer

        Args:
            A_prev:  Activations/Input Data coming into the layer from previous layer
        """

        self.A_prev = A_prev  # store the Activations/Training Data coming in
        self.Z = np.dot(self.params['W'], self.A_prev) + self.params['b']  # compute the linear function

    def backward(self, upstream_grad):
        """
        This function performs the back propagation using upstream gradients

        Args:
            upstream_grad: gradient coming in from the upper layer to couple with local gradient
        """

        # derivative of Cost w.r.t W
        self.dW = np.dot(upstream_grad, self.A_prev.T)

        # derivative of Cost w.r.t b, sum across rows
        self.db = np.sum(upstream_grad, axis=1, keepdims=True)

        # derivative of Cost w.r.t A_prev
        self.dA_prev = np.dot(self.params['W'].T, upstream_grad)

    def update_params(self, learning_rate=0.1):
        """
        This function performs the gradient descent update

        Args:
            learning_rate: learning rate hyper-param for gradient descent, default 0.1
        """
        self.params['W'] = self.params['W'] - learning_rate * self.dW  # update weights
        self.params['b'] = self.params['b'] - learning_rate * self.db  # update bias(es)

Multi Layer Network

import numpy as np  # import numpy library


class SigmoidLayer:
    """
    This file implements activation layers
    inline with a computational graph model

    Args:
        shape: shape of input to the layer

    Methods:
        forward(Z)
        backward(upstream_grad)

    """

    def __init__(self, shape):
        """
        The constructor of the sigmoid/logistic activation layer takes in the following arguments

        Args:
            shape: shape of input to the layer
        """
        self.A = np.zeros(shape)  # create space for the resultant activations

    def forward(self, Z):
        """
        This function performs the forwards propagation step through the activation function

        Args:
            Z: input from previous (linear) layer
        """
        self.A = 1 / (1 + np.exp(-Z))  # compute activations

    def backward(self, upstream_grad):
        """
        This function performs the  back propagation step through the activation function
        Local gradient => derivative of sigmoid => A*(1-A)

        Args:
            upstream_grad: gradient coming into this layer from the layer above

        """
        # couple upstream gradient with local gradient, the result will be sent back to the Linear layer
        self.dZ = upstream_grad * self.A*(1-self.A)

Multi Layer Network

def compute_stable_bce_cost(Y, Z):
    """
    This function computes the "Stable" Binary Cross-Entropy(stable_bce) Cost and returns the Cost and its
    derivative w.r.t Z_last(the last linear node) .
    The Stable Binary Cross-Entropy Cost is defined as:
    => (1/m) * np.sum(max(Z,0) - ZY + log(1+exp(-|Z|)))
    Args:
        Y: labels of data
        Z: Values from the last linear node

    Returns:
        cost: The "Stable" Binary Cross-Entropy Cost result
        dZ_last: gradient of Cost w.r.t Z_last
    """
    m = Y.shape[1]

    cost = (1/m) * np.sum(np.maximum(Z, 0) - Z*Y + np.log(1+ np.exp(- np.abs(Z))))
    dZ_last = (1/m) * ((1/(1+np.exp(- Z))) - Y)  # from Z computes the Sigmoid so P_hat - Y, where P_hat = sigma(Z)

    return cost, dZ_last

Multi Layer Network

def data_set(n_points, n_classes):
    x = np.random.uniform(-1, 1, size=(n_points, n_classes))  # generate (x, y) points
    mask = np.logical_or(np.logical_and(x[:, 0] > 0.0, x[:, 1] > 0.0),
                         np.logical_and(x[:, 0] < 0.0, x[:, 1] < 0.0))  # True for 1st & 3rd quadrants
    y = 1 * mask
    return x, y


no_of_points = 10000
no_of_classes = 2
X_train, Y_train = data_set(no_of_points, no_of_classes)


for i in range(10):
    print(f'The point is {X_train[i,0]} , {X_train[i,1]} and the class is {Y_train[i]}')
    
    

Multi Layer Network

# define training constants
learning_rate = 0.6
number_of_epochs = 5000

np.random.seed(48) # set seed value so that the results are reproducible
                  # (weights will now be initialized to the same pseudo-random numbers, each time)


# Our network architecture has the shape: 
#                   (input)--> [Linear->Sigmoid] -> [Linear->Sigmoid] -->(output)  

#------ LAYER-1 ----- define hidden layer that takes in training data 
Z1 = LinearLayer(input_shape=X_train.shape, n_out=4, ini_type='xavier')
A1 = SigmoidLayer(Z1.Z.shape)

#------ LAYER-2 ----- define output layer that takes in values from hidden layer
Z2= LinearLayer(input_shape=A1.A.shape, n_out= 1, ini_type='xavier')
A2= SigmoidLayer(Z2.Z.shape)

Multi Layer Network

costs = [] # initially empty list, this will store all the costs after a certain number of epochs

# Start training
for epoch in range(number_of_epochs):
    
    # ------------------------- forward-prop -------------------------
    Z1.forward(X_train)
    A1.forward(Z1.Z)
    
    Z2.forward(A1.A)
    A2.forward(Z2.Z)
    
    # ---------------------- Compute Cost ----------------------------
    cost, dZ2 = compute_stable_bce_cost(Y_train, Z2.Z)
    
    # print and store Costs every 100 iterations and of the last iteration.
    if (epoch % 100) == 0:
        print("Cost at epoch#{}: {}".format(epoch, cost))
        costs.append(cost)
    
    # ------------------------- back-prop ----------------------------
    
    Z2.backward(dZ2)
    
    A1.backward(Z2.dA_prev)
    Z1.backward(A1.dZ)
    
    # ----------------------- Update weights and bias ----------------
    Z2.update_params(learning_rate=learning_rate)
    Z1.update_params(learning_rate=learning_rate)
    
#     if (epoch % 100) == 0:
#         plot_decision_boundary(lambda x: predict_dec(Zs=[Z1, Z2], As=[A1, A2], X=x.T, thresh=0.5),  X=X_train.T, Y=Y_train , save=True)

Multi Layer Network

def predict_loc(X, Zs, As, thresh=0.5):
    """
    helper function to predict on data using a neural net model layers

    Args:
        X: Data in shape (features x num_of_examples)
        Zs: All linear layers in form of a list e.g [Z1,Z2,...,Zn]
        As: All Activation layers in form of a list e.g [A1,A2,...,An]
        thresh: is the classification threshold. All values >= threshold belong to positive class(1)
                and the rest to the negative class(0).Default threshold value is 0.5
    Returns::
        p: predicted labels
        probas : raw probabilities
        accuracy: the number of correct predictions from total predictions
    """
    m = X.shape[1]
    n = len(Zs)  # number of layers in the neural network
    p = np.zeros((1, m))

    # Forward propagation
    Zs[0].forward(X)
    As[0].forward(Zs[0].Z)
    for i in range(1, n):
        Zs[i].forward(As[i-1].A)
        As[i].forward(Zs[i].Z)
    probas = As[n-1].A

    # convert probas to 0/1 predictions
    for i in range(0, probas.shape[1]):
        if probas[0, i] >= thresh:  # 0.5  the default threshold
            p[0, i] = 1
        else:
            p[0, i] = 0

    # print results
    print ("predictions: " + str(p))

Multi Layer Network

Implicit Layers

\text{Explicit layer: } \quad z_{i+1} = f(z_i)

\text{Implicit layer: } \quad f(z_i , z_{i+1}) = 0

 

Implicit Layers

But why?? 😕

"Instead of specifying how to compute the layer's output from the input, we specify the conditions that we want the layer's output to satisfy."

Implicit Layers

But why?? 😕

Recent years have seen several efficient ways to differentiate through constructs like argmin and argmax [1].

z_{i+1} = \underset{z}{argmin} \, \, f(z_i , z)

Then backpropagation requires gradients of the form

\frac{\partial}{\partial z_{i}} \, \, f(z_i , z_{i+1})

[1] Gould, S. et al. On Differentiating Parameterized Argmin and Argmax Problems with Application to Bi-level Optimization. arXiv:1607.05447 [cs, math] (2016).

Implicit Layers

But how?? 😕

Differentiating through implicit layers?? 🤔

The Implicit Function Theorem

Implicit Layers

The Implicit Function Theorem

f : \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^m

\text{For some } a_0 \in \mathbb{R}^n , \; z_0 \in \mathbb{R}^m : \; f(a_0 , z_0) = 0 , \text{ and } f \text{ is continuously differentiable with non-singular Jacobian } \frac{\partial f(a_0 , z_0)}{\partial z}

Then the implicit function theorem tells us:

\text{There exists a local function } z^* \text{ such that}

z_0 = z^*(a_0) , \qquad f(a , z^*(a)) = 0 \; \forall a \text{ in the neighbourhood of } a_0 , \qquad z^* \text{ is differentiable in that neighbourhood}

Implicit Layers

The Implicit Function Theorem

f : \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^m

Now, since we know that the local function z* exists,

f(a , z^*(a)) = 0 \quad \forall a \in S_{a_0}

\frac{\partial f(a , z^*(a)) }{\partial a} + \frac{\partial f(a , z^*(a)) }{\partial z^*(a)} \frac{\partial z^*(a)}{\partial a} = 0 \quad \forall a \in S_{a_0}

S_{a_0} \text{ is the set of } a \text{'s in the neighbourhood of } a_0 \text{ such that } f(a , z^*(a)) = 0

\frac{\partial z^*(a_0)}{\partial a} = - \left [ \frac{\partial f(a_0 , z^*(a_0)) }{\partial z^*(a)} \right ]^{-1} \frac{\partial f(a_0 , z^*(a_0)) }{\partial a}
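A minimal scalar sketch of this formula (illustrative choice of f, not from the slides): take f(a, z) = z^3 + z - a, solve f(a, z*(a)) = 0 with Newton's method, and compare the IFT gradient with a finite difference.

# Implicit function theorem in 1-D with f(a, z) = z**3 + z - a:
# dz*/da = -[df/dz]^{-1} df/da = 1 / (3 z*^2 + 1), since df/da = -1.
def solve(a, n_iter=50):
    z = 0.0
    for _ in range(n_iter):                      # Newton's method on f(a, z) = 0
        z = z - (z**3 + z - a) / (3 * z**2 + 1)
    return z

a0 = 2.0
z0 = solve(a0)                                   # z* = 1, since 1**3 + 1 = 2
ift_grad = 1.0 / (3 * z0**2 + 1)                 # IFT gradient, ~0.25
eps = 1e-6
fd_grad = (solve(a0 + eps) - solve(a0 - eps)) / (2 * eps)
print(ift_grad, fd_grad)                         # both ~0.25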

Implicit Layers

The Implicit Function Theorem

f : \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^m

\text{For fixed-point solution mappings } f(a , z^*(a) ) = z^*(a) , \text{ the same argument gives , }

\frac{\partial f(a , z^*(a)) }{\partial a} + \frac{\partial f(a , z^*(a)) }{\partial z^*(a)} \frac{\partial z^*(a)}{\partial a} = \frac{\partial z^*(a)}{\partial a} \quad \forall a \in S_{a_0}

\frac{\partial z^*(a_0)}{\partial a} = \left [ I - \frac{\partial f(a_0 , z^*(a_0)) }{\partial z^*(a)} \right ]^{-1} \frac{\partial f(a_0 , z^*(a_0)) }{\partial a}

Implicit Layers

The Implicit Function Theorem

\frac{\partial z^*(a_0)}{\partial a} = \left [ I - \frac{\partial f(a_0 , z^*(a_0)) }{\partial z^*(a)} \right ]^{-1} \frac{\partial f(a_0 , z^*(a_0)) }{\partial a}

Now this can be connected to standard autodiff tools.
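One way to make that connection concrete (a sketch under illustrative assumptions, reusing the fixed-point map z = tanh(Wz + a) from earlier): let PyTorch's autodiff supply the Jacobian terms and assemble dz*/da with the fixed-point formula above.

import torch

# Differentiate through the fixed point z = tanh(W z + a) via the IFT,
# with the Jacobian terms computed by torch autodiff.
torch.manual_seed(0)
n = 4
W = 0.1 * torch.randn(n, n)
a = torch.randn(n)

f = lambda a, z: torch.tanh(W @ z + a)

z = torch.zeros(n)
for _ in range(100):                  # forward pass: solve the fixed point by iteration
    z = f(a, z)

J_z = torch.autograd.functional.jacobian(lambda z: f(a, z), z)   # df/dz at the solution
J_a = torch.autograd.functional.jacobian(lambda a: f(a, z), a)   # df/da at the solution
dzda = torch.linalg.solve(torch.eye(n) - J_z, J_a)               # [I - df/dz]^{-1} df/da
print(dzda)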

Implicit Layers

Implicit layers bring structure to the layers:

  • Encode domain knowledge
  • Differentiable optimization

z^* = \underset{z \, \in \, \textit{ConstraintSet}(x)}{argmin} \; f(z,x)

Implicit Layers

OPTNET

Amos, Brandon, and J. Zico Kolter. 2019. OptNet: Differentiable Optimization as a Layer inNeural Networks.arXiv:1703.00443 [cs, math, stat](14 October 2019). arXiv: 1703.00443
z^* = argmin \, \, \frac{1}{2} z^T Q(x) z + p(x)^Tz \\ subject \, \, to \, \, \, \, A(x) z = b(x) \\ G(x)z \leq h(x)

Implicit Layers

OPTNET

z^* = argmin \, \, \frac{1}{2} z^T Q(x) z + p(x)^Tz \\ subject \, \, to \, \, \, \, A(x) z = b(x) \\ G(x)z \leq h(x)

KKT Conditions

"Necessary and Sufficient 

Conditions for optimality " 

( z^* , \nu^* , \lambda^*)

Implicit Form

Implicit Function Theorem 
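A hedged sketch of such a differentiable QP layer using cvxpylayers (the same library we used in the experiment below); the problem data here are arbitrary placeholders, and L L^T plays the role of Q(x).

import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

# OptNet-style differentiable QP layer:
#   z* = argmin_z 0.5 * ||L z||^2 + p^T z   subject to   G z <= h
n, m = 3, 5
z = cp.Variable(n)
L = cp.Parameter((n, n))
p = cp.Parameter(n)
G = cp.Parameter((m, n))
h = cp.Parameter(m)

objective = cp.Minimize(0.5 * cp.sum_squares(L @ z) + p @ z)
problem = cp.Problem(objective, [G @ z <= h])
layer = CvxpyLayer(problem, parameters=[L, p, G, h], variables=[z])

torch.manual_seed(0)
Lt = torch.randn(n, n, requires_grad=True)
pt = torch.randn(n, requires_grad=True)
Gt = torch.randn(m, n, requires_grad=True)
ht = torch.ones(m, requires_grad=True)

z_star, = layer(Lt, pt, Gt, ht)   # forward pass: solve the QP
z_star.sum().backward()           # backward pass: gradients via the KKT conditions / IFT
print(Lt.grad)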

Implicit Layers

Our Trials and Fails

Adrian-Vasile Duka, "Neural Network based Inverse Kinematics Solution for Trajectory Tracking of a Robotic Arm," Procedia Technology, Volume 12, 2014, Pages 20-27, ISSN 2212-0173, https://doi.org/10.1016/j.protcy.2013.12.451 (https://www.sciencedirect.com/science/article/pii/S2212017313006361)

 

Adrian-Vasile Duka showed that a feed-forward neural network with one hidden layer of 100 neurons can be trained to solve the inverse kinematics of a planar 3-link manipulator, taking the end-effector coordinates as input.

Implicit Layers

Our Trials and Fails

We tried to simulate an obstacle by removing a circular region from the randomly generated data.

Then we tried appending an OptNet layer to the previous architecture.

But we hit a roadblock...

import cvxpy as cp
import torch
import torch.nn as nn
from cvxpylayers.torch import CvxpyLayer


class Link3IK(nn.Module):

    def __init__(self, n, m, p):
        super().__init__()
        torch.manual_seed(0)
        # cvxpy problem data (parameters) and decision variable
        self.z = cp.Variable(n)
        self.P = cp.Parameter((n, n))
        self.q = cp.Parameter(n)
        self.G = cp.Parameter((m, n))
        self.h = cp.Parameter(m)
        self.A = cp.Parameter((p, n))
        self.b = cp.Parameter(p)
        self.nn_output = cp.Parameter(3)

        # learnable torch tensors that feed the cvxpy parameters
        scale_factor = 1e-4
        self.Ptch = torch.nn.Parameter(scale_factor * torch.randn(n, n))
        self.qtch = torch.nn.Parameter(scale_factor * torch.randn(n))
        self.Gtch = torch.nn.Parameter(scale_factor * torch.randn(m, n))
        self.htch = torch.nn.Parameter(scale_factor * torch.randn(m))
        self.Atch = torch.nn.Parameter(scale_factor * torch.randn(p, n))
        self.btch = torch.nn.Parameter(scale_factor * torch.randn(p))

        self.objective = cp.Minimize(0.5 * cp.sum_squares(self.P @ self.z) + self.q @ self.z)
        self.constraints = [self.G @ self.z - self.h <= 0, self.A @ self.z == self.b]
        raise NotImplementedError  # roadblock: include nn_output in the cvxpy problem
        self.problem = cp.Problem(self.objective, self.constraints)
        self.net = nn.Sequential(
            nn.Linear(2, 100),
            nn.Sigmoid(),
            nn.Linear(100, 3),
            nn.Softmax()
        )
        self.cvxpylayer = CvxpyLayer(self.problem, parameters=[self.P, self.q, self.G, self.h, self.A, self.b, self.nn_output], variables=[self.z])

    def forward(self, X):
        nn_output = self.net(X)
        output = self.cvxpylayer(self.Ptch, self.qtch, self.Gtch, self.htch, self.Atch, self.btch, nn_output)[0]
        return output

In order to include nn_output in the optimization problem, we needed more knowledge of convex optimization.. 😔

Future Potential Prospects

1.

Maric, F., Giamou, M., Khoubyarian, S., Petrovic, I. & Kelly, J. Inverse Kinematics for Serial Kinematic Chains via Sum of Squares Optimization. 2020 IEEE International Conference on Robotics and Automation (ICRA) 7101–7107 (2020) doi:10.1109/ICRA40945.2020.9196704.

Maric et al. cast inverse kinematics as a sum-of-squares QCQP problem and solved it using a custom solver.

We could use that formulation to solve these problems using cvxpylayers.

Based On our understanding....

Future Potential Prospects

1.

Maric, F., Giamou, M., Khoubyarian, S., Petrovic, I. & Kelly, J. Inverse Kinematics for Serial Kinematic Chains via Sum of Squares Optimization. 2020 IEEE International Conference on Robotics and Automation (ICRA) 7101–7107 (2020) doi:10.1109/ICRA40945.2020.9196704.

  • Reinforcement Learning

Learning Constraints in the state space 

  • Control  

Things like LQE, to start with

Based On our understanding....

Role of BIAS

Role of BIAS

  • Introduce an additional degree of freedom
  • Offset any systematic errors or inconsistencies in the data
  • Shift the decision boundary and better fit the training data
  • In case of class imbalance shift decision boundary towards the minority class

Bias Variance Tradeoff

Bias Variance Tradeoff

Minimize both bias and variance to achieve optimal predictive performance.

High Bias 

  • Underfits the data
  • Cannot capture complexity
  • Poor predictive performance

High Variance

  • Sensitive to data fluctuations
  • Fits the noise as well as the underlying patterns
  • Poor performance on unseen data

Bias Variance Tradeoff

Right balance for optimal performance

Regularization, cross-validation, and ensembling

Vanishing Gradient Problem

Vanishing Gradient Problem

The gradient of the loss function w.r.t. the weights becomes very small.

Prevalent in deeper networks: as the gradient is propagated backwards through many layers, it keeps shrinking.

Weights in earlier layers do not receive sufficient updates

Vanishing Gradient Problem

Activation functions like the sigmoid saturate, causing derivatives to be small for large inputs.

Initialising the weights with small values can make the gradients smaller still.

Vanishing Gradient Problem

Solution

ReLU, weight initialisation methods

(Xavier Initialisation)

 

Skip connections,
which allow the gradient to bypass some of the layers and propagate more directly to the earlier layers.
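A minimal PyTorch sketch (illustrative) of a skip connection: the identity path gives the gradient a direct route around the block's layers.

import torch
import torch.nn as nn

# Residual block: output is x + F(x), so gradients can flow through the
# identity path directly to earlier layers.
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),               # ReLU also mitigates vanishing gradients
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.body(x)      # skip connection

x = torch.randn(8, 16)
print(ResidualBlock(16)(x).shape)    # torch.Size([8, 16])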

Advantages of Deeper Networks

Advantages of Deeper Networks

  • Increased Representational Power

Learn Complex features

  • Improved Feature Extraction

Lower layers can learn basic features while higher layers learn abstract features

More discrimination of features

  • Better Generalization

Can filter out noise and focus on generalising patterns

Thank You! 

Deep Learning

By Incredible Us