X: Neural Networks
Farid Qamar
this slide deck: https://slides.com/faridqamar/fdfse_10
1. NN: Neural Networks - origins
1943: McCulloch & Pitts propose the first mathematical model of a neuron: a unit that fires (outputs 1) when the sum of its binary inputs reaches a threshold θ.
Question:
If xᵢ is binary (1 or 0) or boolean (True/False),
what value of the threshold θ corresponds to the logical operator AND?
If xᵢ is binary (1 or 0):
AND: θ = n, the number of inputs (every input must fire)
OR: θ = 1 (a single firing input is enough)
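As an illustration, here is a minimal Python sketch of such a threshold unit; the function name and the two-input examples are assumptions for demonstration, not from the original slides:

```python
# A McCulloch-Pitts-style threshold neuron: it fires (outputs 1)
# when the sum of its binary inputs reaches the threshold theta.
def mp_neuron(inputs, theta):
    return 1 if sum(inputs) >= theta else 0

# AND over two inputs: both must fire, so theta = 2
print(mp_neuron([1, 1], theta=2))  # 1
print(mp_neuron([1, 0], theta=2))  # 0

# OR over two inputs: any one firing input is enough, so theta = 1
print(mp_neuron([0, 1], theta=1))  # 1
print(mp_neuron([0, 0], theta=1))  # 0
```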
1958: Frank Rosenblatt's perceptron.
[Diagram: the perceptron: inputs x₁ … xₙ, each multiplied by a weight wᵢ, summed together with a bias b to produce the output]
Perceptrons are linear classifiers: they make their predictions with a linear predictor function that combines a set of weights (the parameters) with the feature vector:

output = f(Σᵢ wᵢ xᵢ + b)

where the wᵢ are the weights, b is the bias, and the inner sum Σᵢ wᵢ xᵢ + b is just linear regression.
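To make the linear-predictor idea concrete, here is a sketch of a perceptron prediction in Python; the weights, bias, and feature values are made up for illustration:

```python
import numpy as np

# Perceptron prediction: a linear predictor (weights . features + bias)
# followed by a hard step activation.
def perceptron_predict(x, w, b):
    z = np.dot(w, x) + b       # the linear-regression part
    return 1 if z >= 0 else 0  # step activation

x = np.array([0.5, -1.2, 3.0])  # feature vector
w = np.array([0.4, 0.1, 0.7])   # weights (learned parameters)
b = -0.5                        # bias
print(perceptron_predict(x, w, b))  # 1, since z = 1.68 >= 0
```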
The activation function f turns the linear output into the prediction: the classic perceptron uses a hard step, while later networks use smoother choices such as the sigmoid.

Common activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU.
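A sketch of these activations in NumPy, using their standard textbook definitions (the exact variants shown on the original slide are assumed; Maxout is omitted because it takes two sets of weights rather than a single scalar input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                         # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                 # max(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)      # small slope for z < 0

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.linspace(-2.0, 2.0, 5)                 # [-2, -1, 0, 1, 2]
print(relu(z))                                # [0. 0. 0. 1. 2.]
```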
July 8, 1958, The New York Times:
NEW NAVY DEVICE LEARNS BY DOING
Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser
"The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. The embryo - the Weather Bureau's $2,000,000 '704' computer - learned to differentiate between left and right after 50 attempts in the Navy's demonstration."
2. MLP: Multilayer Perceptron - Deep Learning
[Diagram: a multilayer perceptron: an input layer, a hidden layer of perceptrons, and an output layer producing the output]
1970: multilayer perceptron architecture
Fully connected: every node connects to every node of the next layer.
Learned parameters:
w: weight - sets the sensitivity of a neuron
b: bias - weights a neuron up or down
f: activation function - turns neurons on or off (chosen in advance rather than learned)
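A minimal NumPy sketch of one fully connected layer, assuming 3 inputs and 4 neurons purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One fully connected layer: every input feeds every neuron.
# W[i, j] is the weight from input j to neuron i; b[i] is neuron i's bias.
def dense_forward(x, W, b, f):
    return f(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # 3 input features
W = rng.normal(size=(4, 3))   # 4 neurons, each connected to all 3 inputs
b = np.zeros(4)               # one bias per neuron
print(dense_forward(x, W, b, sigmoid))  # 4 activations, each in (0, 1)
```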
3. DNN: Deep Neural Networks - hyperparameters
[Diagram: network with an input layer, one hidden layer, and an output layer]
How many parameters? 21
(Every connection carries a weight and every neuron a bias; the original diagram is not reproduced here, but a 3-4-1 network, for example, gives 3×4 + 4 + 4×1 + 1 = 21.)
[Diagram: the same network with a second hidden layer]
How many parameters? 35
(Continuing the example, a 3-4-3-1 network gives 3×4 + 4 + 4×3 + 3 + 3×1 + 1 = 35.)
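The counting rule generalizes: each layer contributes (inputs × neurons) weights plus one bias per neuron. A short sketch, where the 3-4-1 and 3-4-3-1 layer sizes are assumptions chosen to reproduce the slide's counts:

```python
# Each fully connected layer with n_in inputs and n_out neurons
# contributes n_in * n_out weights plus n_out biases.
def count_parameters(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

print(count_parameters([3, 4, 1]))     # 21 (one hidden layer)
print(count_parameters([3, 4, 3, 1]))  # 35 (two hidden layers)
```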
[Same diagram, annotated]
How many hyperparameters?
Architecture hyperparameters (green in the original figure): e.g. the number of layers, the number of nodes per layer, and the choice of activation function.
Training hyperparameters (red): e.g. the learning rate.
4. DNN: Deep Neural Networks - training DNNs
1986: Deep Neural Nets
Fully connected: every node connects to every node of the next layer.
w: weight - sets the sensitivity of a neuron
b: bias - weights a neuron up or down
f: activation function (e.g. Sigmoid) - turns neurons on or off
A linear model: ŷ = f(Σᵢ wᵢ xᵢ + b)
Error (e.g.): E = Σⱼ (ŷⱼ - yⱼ)², where ŷ is the prediction and y is the target.
Need to find the best parameters (the weights and biases) by finding the minimum of E.
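A minimal sketch of gradient descent on a 1-D linear model with squared error; the data and learning rate are illustrative assumptions:

```python
import numpy as np

# Gradient descent on y_hat = w*x + b, minimizing E = sum((y_hat - y)^2).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0              # targets drawn from a known line

w, b, lr = 0.0, 0.0, 0.02      # start from zero; small learning rate
for _ in range(2000):
    y_hat = w * x + b
    dw = np.sum(2.0 * (y_hat - y) * x)  # dE/dw via the chain rule
    db = np.sum(2.0 * (y_hat - y))      # dE/db
    w -= lr * dw                        # step downhill on E
    b -= lr * db

print(round(w, 2), round(b, 2))  # approximately 2.0 and 1.0
```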
How does gradient descent look when you have a whole network, with hundreds of weights and biases to optimize?
Backpropagation (Rumelhart et al., 1986):
1. Define a cost function, e.g. E = Σ (ŷ - y)²
2. Forward propagation: feed the data forward through the network and calculate the cost metric
3. Error estimation: measure how far the output is from the target
4. Back propagation: for each layer, calculate the effect of small changes on the next layer, working backwards from the output to the input
5. Repeat!
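To tie the loop together, here is a toy sketch of forward propagation, error estimation, back propagation, and repeat on a tiny fully connected network; the XOR data, layer sizes, random seed, and learning rate are all assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # input -> hidden (4 neurons)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output
lr = 0.5

for _ in range(10000):
    # Forward propagation: feed the data through the network
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Error estimation: squared-error cost E = sum((y_hat - y)^2)
    # Back propagation: chain rule, layer by layer, from output to input
    d_out = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
    d_hid = (d_out @ W2.T) * h * (1.0 - h)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid)
    b1 -= lr * d_hid.sum(axis=0)

print(y_hat.round(2).ravel())  # typically approaches [0, 1, 1, 0]
```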
Simply put: Deep Neural Networks are essentially linear models with a bunch of parameters.
Because they have so many parameters, they are difficult to "interpret" (no easy feature extraction): they are a black box.
But that is OK, because they are prediction machines.
resources
Neural Networks and Deep Learning (http://neuralnetworksanddeeplearning.com): an excellent and free book on NN and DL
History of Neural Networks: https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history2.html