foundations of data science for everyone

X: Neural Networks
Farid Qamar 

1

NN: Neural Networks

origins

1943

McCulloch & Pitts 1943

M-P Neuron


\text{if } x_1 + x_2 + x_3 \ge \theta \text{ then } y = 1
\text{else } y = 0

\text{if } \sum_{i=1}^3 x_i \ge \theta \text{ then } y = 1
\text{else } y = 0

Question:

If x_i is binary (1 or 0) or boolean (True/False),

what value of \theta corresponds to the logical operator AND?

If x_i is binary (1 or 0):

AND: \theta = 3 (fires only when all three inputs are 1)

OR: \theta = 1 (fires when any input is 1)
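As a quick illustration (the function name and code are ours, not from the slides), the M-P neuron fits in a few lines of Python:

```python
# A minimal sketch of the McCulloch-Pitts neuron: with binary inputs,
# the output fires (y = 1) only when the inputs sum to at least theta.

def mp_neuron(inputs, theta):
    """y = 1 if sum(x_i) >= theta, else y = 0."""
    return 1 if sum(inputs) >= theta else 0

# With three binary inputs, theta = 3 acts as AND and theta = 1 acts as OR.
for x in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]:
    print(x, "AND:", mp_neuron(x, 3), "OR:", mp_neuron(x, 1))
```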

1958

Frank Rosenblatt 1958

Perceptron

\text{if } \sum_{i=1}^N w_i x_i + b \ge \theta \text{ then } y = 1
\text{else } y = 0

[diagram: inputs x_1 \dots x_N, each multiplied by a weight w_i, summed together with a bias b to give the output]

w_i : weights

b : bias

Perceptrons are linear classifiers: they make predictions based on a linear predictor function

combining a set of weights (= parameters) with the feature vector:

y = \sum_i w_i x_i + b
Adding an activation function f:

y = f\left(\sum_i w_i x_i + b\right)

For the perceptron, f is a step function:

y = \begin{cases} 1 & \text{if } \sum_i (w_i x_i) + b \ge Z \\ 0 & \text{if } \sum_i (w_i x_i) + b < Z \end{cases}
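A sketch of the perceptron in Python (the example weights and threshold are hand-picked by us; with these values the unit reproduces logical AND on two binary inputs):

```python
import numpy as np

# Rosenblatt's perceptron: a weighted sum plus bias, passed through a step
# activation that fires when the sum reaches the threshold Z.

def perceptron(x, w, b, Z=0.0):
    """y = 1 if sum_i w_i x_i + b >= Z, else y = 0."""
    return 1 if np.dot(w, x) + b >= Z else 0

# Hand-picked weights so that the perceptron computes AND.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, b))
```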

f : activation function. Common choices:

Sigmoid: \sigma(x) = \frac{1}{1+e^{-x}}

tanh: \tanh(x)

ReLU: \max(0, x)

Leaky ReLU: \max(0.1x, x)

ELU: \begin{cases} x & x \ge 0 \\ \alpha(e^x-1) & x < 0 \end{cases}

Maxout: \max(w_1^Tx+b_1, w_2^Tx+b_2)
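These activations are short NumPy one-liners (a sketch; tanh is just np.tanh, and Maxout is omitted since it carries its own weights):

```python
import numpy as np

# Elementwise implementations of the activation functions listed above.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # squashes to (0, 1)

def relu(x):
    return np.maximum(0.0, x)                 # zero for negative inputs

def leaky_relu(x):
    return np.maximum(0.1 * x, x)             # small slope for negative inputs

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(relu(x), leaky_relu(x), elu(x))
```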


Frank Rosenblatt 1958

Perceptron

July 8, 1958

NEW NAVY DEVICE LEARNS BY DOING

Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

The embryo - the Weather Bureau's $2,000,000 "704" computer - learned to differentiate between left and right after 50 attempts in the Navy's demonstration

2

MLP: Multilayer Perceptron

Deep Learning

multilayer perceptron (MLP)

[diagram: input layer (x_1, x_2, x_3) feeding a hidden layer of four perceptrons (biases b_1 \dots b_4), which feeds the output layer]

1970: multilayer perceptron architecture

Fully connected: all nodes go to all nodes of the next layer


Each hidden neuron computes a weighted sum of all the inputs:

w_{11}x_1 + w_{21}x_2 + w_{31}x_3 + b_1

w_{12}x_1 + w_{22}x_2 + w_{32}x_3 + b_2

w_{13}x_1 + w_{23}x_2 + w_{33}x_3 + b_3

w_{14}x_1 + w_{24}x_2 + w_{34}x_3 + b_4

learned parameters

w_{ij} : weight (sets the sensitivity of a neuron)

b_i : bias (up-down weights a neuron)

Passing each weighted sum through the activation function f:

f(w_{11}x_1 + w_{21}x_2 + w_{31}x_3 + b_1)

f(w_{12}x_1 + w_{22}x_2 + w_{32}x_3 + b_2)

f(w_{13}x_1 + w_{23}x_2 + w_{33}x_3 + b_3)

f(w_{14}x_1 + w_{24}x_2 + w_{34}x_3 + b_4)

f : activation function (turns neurons on-off)
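In matrix form, the four expressions above collapse into one line, f(xW + b), with W holding the weights w_ij. A sketch with random placeholder weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])       # inputs x_1, x_2, x_3
W = rng.normal(size=(3, 4))          # weights w_ij (3 inputs x 4 neurons)
b = rng.normal(size=4)               # biases b_1 ... b_4

# Column j of W holds the weights of hidden neuron j, so x @ W + b stacks
# the four weighted sums w_1j*x_1 + w_2j*x_2 + w_3j*x_3 + b_j.
hidden = sigmoid(x @ W + b)
print(hidden.shape)                   # (4,)
```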

3

DNN: Deep Neural Networks

hyperparameters

EXERCISE

[diagram: input layer (3 inputs), one hidden layer (4 neurons), output layer (1 output)]

how many parameters?

21

EXERCISE

[diagram: input layer, hidden layer 1, hidden layer 2, output layer]

how many parameters?

35
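The counting generalizes: a fully connected layer with n_in inputs and n_out neurons has n_in*n_out weights plus n_out biases. A sketch (the [3, 4, 1] and [3, 4, 3, 1] shapes are our guesses at the exercise networks, chosen to match the answers 21 and 35):

```python
# Count learned parameters (weights + biases) of a fully connected network.

def count_parameters(layer_sizes):
    """layer_sizes like [3, 4, 1]: 3 inputs, a hidden layer of 4, 1 output."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_parameters([3, 4, 1]))      # 3*4 + 4 + 4*1 + 1 = 21
print(count_parameters([3, 4, 3, 1]))   # 12 + 4 + 12 + 3 + 3 + 1 = 35
```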

EXERCISE

how many hyperparameters?

  1. number of layers - 1
  2. number of neurons/layer - N_l
  3. activation function/layer - N_l
  4. layer connectivity - N_l^{??}
  5. optimization metric - 1
  6. optimization method - 1
  7. parameters in optimization - M

1-4: architecture hyperparameters

5-7: training hyperparameters

4

DNN: Deep Neural Networks

training DNN

deep neural networks

1986: Deep Neural Nets

Fully connected: all nodes go to all nodes of the next layer

w_{ij} : weight (sets the sensitivity of a neuron)

b_i : bias (up-down weights a neuron)

f : activation function (turns neurons on-off), e.g. the Sigmoid \sigma = \frac{1}{1+e^{-x}}

\vec{y} = f_N(\dots f_1(\vec{x}W_1 + b_1)\dots W_N + b_N)
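The nested formula is just repeated application of a single layer. A sketch with two sigmoid layers and random placeholder weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Apply y = f_N(...f_1(x W_1 + b_1)... W_N + b_N), one layer at a time."""
    a = x
    for W, b in layers:
        a = sigmoid(a @ W + b)
    return a

rng = np.random.default_rng(1)
layers = [(rng.normal(size=(3, 4)), rng.normal(size=4)),   # W_1, b_1 (hidden)
          (rng.normal(size=(4, 1)), rng.normal(size=1))]   # W_2, b_2 (output)

y = forward(np.array([1.0, 0.5, -0.5]), layers)
print(y.shape)   # (1,)
```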

back-propagation

A linear model:

\vec{y} = f(\vec{x}W+b)

y_{predicted} : prediction

y_{true} : target

Error (e.g.):

L_2 = \sum(y_{true} - y_{predicted})^2

Need to find the best parameters by finding the minimum of L_2

Stochastic Gradient Descent
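A minimal sketch of stochastic gradient descent on the linear model with the L2 error (the data, true parameters, and learning rate are made up for illustration):

```python
import numpy as np

# SGD on y = x.w + b with the L2 error. For one sample,
# d/dw (y_true - y_pred)^2 = -2 (y_true - y_pred) x, and
# d/db (y_true - y_pred)^2 = -2 (y_true - y_pred).

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 0.5       # targets from known weights and bias

w, b, lr = np.zeros(2), 0.0, 0.05
for epoch in range(50):
    for i in rng.permutation(len(X)):     # one random sample at a time
        err = y[i] - (X[i] @ w + b)       # prediction error on this sample
        w += lr * 2 * err * X[i]          # step down the gradient
        b += lr * 2 * err

print(w, b)   # should recover approximately w = [2, -3], b = 0.5
```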

back-propagation

How does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

[diagram: two inputs x_1, x_2; three hidden sigmoid neurons with weights w_1 \dots w_6 and biases b_1, b_2, b_3; one sigmoid output neuron with weights w_7, w_8, w_9 and bias b_4]

\text{output} = \frac{1}{1+\exp\left(-\frac{w_7}{1+e^{-(w_1x_1+w_4x_2+b_1)}}-\frac{w_8}{1+e^{-(w_2x_1+w_5x_2+b_2)}}-\frac{w_9}{1+e^{-(w_3x_1+w_6x_2+b_3)}}-b_4\right)}
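The unwieldy expression is nothing new: it is the hidden layer's three sigmoids substituted into the output sigmoid. Evaluating both forms with the same (randomly chosen, illustrative) weights gives identical results:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=10)    # w[1] ... w[9] as on the slide (w[0] unused)
b = rng.normal(size=5)     # b[1] ... b[4] (b[0] unused)
x1, x2 = 0.7, -1.2

# layer by layer
h1 = sigmoid(w[1]*x1 + w[4]*x2 + b[1])
h2 = sigmoid(w[2]*x1 + w[5]*x2 + b[2])
h3 = sigmoid(w[3]*x1 + w[6]*x2 + b[3])
layered = sigmoid(w[7]*h1 + w[8]*h2 + w[9]*h3 + b[4])

# the single written-out expression
nested = 1/(1 + np.exp(-w[7]/(1 + np.exp(-(w[1]*x1 + w[4]*x2 + b[1])))
                       - w[8]/(1 + np.exp(-(w[2]*x1 + w[5]*x2 + b[2])))
                       - w[9]/(1 + np.exp(-(w[3]*x1 + w[6]*x2 + b[3])))
                       - b[4]))

print(np.isclose(layered, nested))   # True
```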

back-propagation

Rumelhart et al., 1986

Define a cost function, e.g.

C = \frac{1}{2}|y-a^L|^2=\frac{1}{2}\sum_j(y_j-a^L_j)^2

Then: feed data forward through the network and calculate the cost metric; for each layer, calculate the effect of small changes on the next layer.

back-propagation algorithm:

  1. randomly assign weights and biases everywhere
  2. forward propagate through the network to calculate the output (predict the target)
  3. calculate the cost metric (the error in the prediction)
  4. backwards propagate through the network, updating weights and biases using stochastic gradient descent
  5. stop if error is less than a set amount, or after a set number of iterations...otherwise, return to step 2 and repeat

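The five steps can be sketched end to end for a tiny network, one hidden layer of three sigmoid neurons (the XOR task, layer sizes, and learning rate are ours, chosen for illustration; the gradients follow the chain rule):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)        # step 1: random init
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

lr = 1.0
for step in range(10000):
    # step 2: forward propagate through the network
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # step 3: cost metric, C = 1/2 sum (y - out)^2
    cost = 0.5 * np.sum((y - out) ** 2)
    # step 4: backward propagate (chain rule through each layer)
    d_out = (out - y) * out * (1 - out)              # dC/d(output pre-activation)
    d_h = (d_out @ W2.T) * h * (1 - h)               # dC/d(hidden pre-activation)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)
    # step 5: stop when the error is small enough
    if cost < 1e-3:
        break

print(round(float(cost), 4), out.round(2).ravel())
```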

Punch Line

Simply put: Deep Neural Networks are essentially linear models with a bunch of parameters

Because they have so many parameters, they are difficult to "interpret" (no easy feature extraction)

they are a Black Box

but that is ok, because they are prediction machines

resources

Neural Networks and Deep Learning

an excellent and free book on NN and DL

http://neuralnetworksanddeeplearning.com/index.html
