BMI 707 Lecture 2: Backprop, Perceptrons, and MLPs
Andrew Beam, PhD
Department of Biomedical Informatics
March 28th, 2018

twitter: @AndrewLBeam
WHAT IS A NEURAL NET?

A neural net is made up of 3 things
The network structure
The loss function
The optimizer

NEURAL NETWORK STRUCTURE


NEURAL NET STRUCTURE

A neural net is a modular way to build a classifier
[Figure: network diagram connecting Inputs to Output]
WHAT IS AN ARTIFICIAL NEURON?

The neuron is the basic functional unit of a neural network
A neuron does two things, and only two things
1) Weighted sum of inputs (each input has its own weight)
2) Nonlinear transformation
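WHAT IS AN ARTIFICIAL NEURON?

The nonlinear transformation is known as the activation function, and there are many choices (here $h$ denotes the weighted sum of inputs)
Sigmoid: $\sigma(h) = \frac{1}{1 + e^{-h}}$
Hyperbolic Tangent: $\tanh(h) = \frac{e^{h} - e^{-h}}{e^{h} + e^{-h}}$
Today
HW 1B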
WHAT IS AN ARTIFICIAL NEURON?

Summary: A neuron produces a single number that is a nonlinear transformation of its input connections
A neuron does two things, and only two things: a weighted sum of its inputs, then a nonlinear transformation
activation(weighted sum of inputs) = a number
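The two steps can be sketched in a few lines of Python. This is a minimal illustration, assuming a sigmoid activation; the function name `neuron` and the example inputs and weights are hypothetical and not taken from the lecture.

```python
import math

def neuron(inputs, weights, bias):
    """Return a single number: the sigmoid of the weighted sum of the inputs."""
    # 1) Weighted sum of inputs (plus a bias term)
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    # 2) Nonlinear transformation (sigmoid is an assumption; other activations work too)
    return 1.0 / (1.0 + math.exp(-h))

output = neuron(inputs=[1.0, 0.0], weights=[1.0, -1.0], bias=-0.5)
print(round(output, 2))  # 0.62
```

Swapping the last line of `neuron` for `math.tanh(h)` would give a hyperbolic tangent unit instead.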
NEURAL NETWORK STRUCTURE

Neural nets are organized into layers
[Figure: network diagram, left to right: Inputs, Input Layer, 1st Hidden Layer (one node marked "A single hidden unit"), 2nd Hidden Layer, Output Layer, Output]
LOSS FUNCTIONS

We need a way to measure how well the network is performing, e.g. is it making good predictions?
Loss function: A function that returns a single number which indicates how closely a prediction matches the ground truth label
small loss = good
big loss = bad
LOSS FUNCTIONS

A classic loss function for binary classification is binary cross-entropy
y | p | Loss |
---|---|---|
0 | 0.1 | 0.1 |
0 | 0.9 | 2.3 |
1 | 0.1 | 2.3 |
1 | 0.9 | 0.1 |
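The table values can be reproduced with the standard binary cross-entropy formula, loss = -[y log(p) + (1 - y) log(1 - p)], where p is the predicted probability that y = 1. A small sketch (the function name `binary_cross_entropy` is illustrative):

```python
import math

def binary_cross_entropy(y, p):
    """Loss for one example: y is the true label (0 or 1), p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Same (y, p) pairs as in the table above, rounded to one decimal place
for y, p in [(0, 0.1), (0, 0.9), (1, 0.1), (1, 0.9)]:
    print(y, p, round(binary_cross_entropy(y, p), 1))
```

Confident wrong predictions (y = 0 with p = 0.9, or y = 1 with p = 0.1) are penalized far more heavily than confident correct ones.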
OUTPUT LAYER & LOSS

Output Layer
The output layer needs to "match" the loss function
- Correct shape
- Correct scale
OUTPUT LAYER & LOSS

The output layer needs to "match" the loss function
For binary cross-entropy, the network needs to produce a single probability
One unit in the output layer to represent this probability
Activation function must "squash" output to be between 0 and 1

We can change the output layer & loss to model many different kinds of data
- Multiple classes
- Continuous response (i.e. regression)
- Survival data
- Combinations of the above

THE OPTIMIZER
Question:
Now that we have specified:
- A network
- Loss function
How do we find the values for the weights that give us the smallest possible value for the loss function?

How do we minimize the loss function?

Stochastic Gradient Descent
- Give weights random initial values
- Evaluate the partial derivative of the negative log-likelihood with respect to each weight, at the current weight values, on a mini-batch
- Take a step in the direction opposite to the gradient
- Rinse and repeat
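The steps above can be sketched with NumPy for a single sigmoid unit. Everything specific here is an assumption made for illustration: the synthetic data, the learning rate of 0.5, the batch size of 20, and the 500 updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def nll(w, b):
    """Mean negative log-likelihood (binary cross-entropy) over the full data set."""
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Give weights random initial values
w = rng.normal(scale=0.1, size=2)
b = 0.0
initial_loss = nll(w, b)

learning_rate, batch_size = 0.5, 20
for step in range(500):
    # Draw a mini-batch
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean NLL with respect to each weight on this mini-batch
    error = sigmoid(Xb @ w + b) - yb
    grad_w = Xb.T @ error / batch_size
    grad_b = error.mean()
    # Take a step in the direction opposite to the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = nll(w, b)
print(initial_loss, final_loss)
```

Because each update sees only a random subset of the data, individual steps are noisy, but the loss on the full data set falls over many updates.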
THE OPTIMIZER

Many variations on basic idea of SGD are available
PERCEPTRON BY HAND
PERCEPTRONS

Let's say we'd like to have a single neuron learn a simple function
Observations
X1 | X2 | y |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 1 |
PERCEPTRONS

How do we make a prediction for each observation?
Assume we have the following values
w1 | w2 | b |
---|---|---|
1 | -1 | -0.5 |
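Predictions

For the first observation:
Assume we have the following values
w1 | w2 | b |
---|---|---|
1 | -1 | -0.5 |
First compute the weighted sum: $h = w_1 x_1 + w_2 x_2 + b = (1)(0) + (-1)(0) - 0.5 = -0.5$
Transform to probability: $p = \sigma(h) = \frac{1}{1 + e^{0.5}} \approx 0.38$
Round to get prediction: $\hat{y} = \text{round}(p) = 0$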
Predictions

Putting it all together:
Assume we have the following values
w1 | w2 | b |
---|---|---|
1 | -1 | -0.5 |
X1 | X2 | y | h | p | prediction |
---|---|---|---|---|---|
0 | 0 | 0 | -0.5 | 0.38 | 0 |
0 | 1 | 1 | -1.5 | 0.18 | 0 |
1 | 0 | 1 | 0.5 | 0.62 | 1 |
1 | 1 | 1 | -0.5 | 0.38 | 0 |
Fill out this table
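The same table can be computed for all four observations at once. This is a small NumPy sketch using the weights w1 = 1, w2 = -1, b = -0.5 from above; the variable names are illustrative.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w = np.array([1.0, -1.0])
b = -0.5

h = X @ w + b                   # weighted sum for every observation
p = 1.0 / (1.0 + np.exp(-h))    # sigmoid turns h into a probability
y_hat = (p > 0.5).astype(int)   # round to a 0/1 prediction

for row in zip(X[:, 0], X[:, 1], y, h, np.round(p, 2), y_hat):
    print(*row)
```

Only the third observation is predicted as 1, so with these weights the perceptron gets two of the four observations right.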
Room for Improvement

Our neural net isn't so great... how do we make it better?
What do I even mean by better?
Room for Improvement

Let's define how we want to measure the network's performance.
There are many ways, but let's use squared-error: $(y - p)^2$
Now we need to find values for $w_1$, $w_2$, and $b$ that make this error as small as possible
ALL OF ML IN ONE SLIDE

Our task is learning values for the parameters such that the difference between the predicted and actual values is as small as possible.
Learning from Data

So, how do we find the "best" values for $w_1$, $w_2$, and $b$?
hint: calculus
Learning from Data

Recall (without PTSD) that the derivative of a function tells you how it is changing at any given location.
If the derivative is positive, it means it's going up.
If the derivative is negative, it means it's going down.
Learning from Data

Simple strategy:
- Start with initial values for $w_1$, $w_2$, and $b$
- Take partial derivatives of the loss function with respect to $w_1$, $w_2$, and $b$
- Subtract the derivative (also called the gradient) from each parameter
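To see the strategy in isolation, here is a one-parameter sketch on a made-up loss, (w - 3)^2, whose minimum is known to be at w = 3. The loss, the starting value, and the step size of 0.1 are all assumptions chosen for illustration; scaling the derivative by a small step size before subtracting keeps the updates stable.

```python
def loss(w):
    return (w - 3.0) ** 2

def dloss_dw(w):
    return 2.0 * (w - 3.0)

w = 0.0                # initial value
learning_rate = 0.1    # assumed step size
for _ in range(100):
    # Where the derivative is positive the loss is going up, so move down, and vice versa
    w -= learning_rate * dloss_dw(w)

print(round(w, 4))  # 3.0
```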
The Backpropagation Algorithm

Our perceptron performs the following computations: $h = w_1 x_1 + w_2 x_2 + b$, then $p = \sigma(h)$
And we want to minimize this quantity: the squared error $(y - p)^2$
We'll compute the gradients for each parameter by "back-propagating" errors through each component of the network
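The Backpropagation Algorithm

For $w_1$ we need to compute $\frac{\partial L}{\partial w_1}$, where $L$ is the loss, $h$ is the weighted sum, and $p$ is the predicted probability
To get there, we will use the chain rule: $\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial h} \cdot \frac{\partial h}{\partial w_1}$
This is "backprop"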
The Backpropagation Algorithm

Let's break it into pieces
Computations
Loss
Putting it all together
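The individual pieces were lost from this transcript, so the following is a reconstruction under stated assumptions rather than the slides' own derivation: a sigmoid activation, and a per-observation squared-error loss written as $L = (y - p)^2$ with no factor of 1/2. With a different scaling of the loss, only the constant in front changes.

```latex
\begin{aligned}
h &= w_1 x_1 + w_2 x_2 + b, \qquad p = \sigma(h) = \frac{1}{1 + e^{-h}}, \qquad L = (y - p)^2 \\
\frac{\partial L}{\partial p} &= -2\,(y - p) \\
\frac{\partial p}{\partial h} &= p\,(1 - p) \\
\frac{\partial h}{\partial w_1} &= x_1, \qquad \frac{\partial h}{\partial w_2} = x_2, \qquad \frac{\partial h}{\partial b} = 1 \\
\frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial p}\,\frac{\partial p}{\partial h}\,\frac{\partial h}{\partial w_1} = -2\,(y - p)\,p\,(1 - p)\,x_1 \\
\frac{\partial L}{\partial w_2} &= -2\,(y - p)\,p\,(1 - p)\,x_2, \qquad
\frac{\partial L}{\partial b} = -2\,(y - p)\,p\,(1 - p)
\end{aligned}
```

The first two factors are shared by all three parameters; only the last factor, the derivative of $h$ with respect to the parameter, differs.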
Gradient Descent with Backprop

For some number of iterations we will:
1) Compute the gradient for each parameter ($w_1$, $w_2$, $b$)
2) Update each parameter by subtracting its gradient multiplied by the learning rate
3) Repeat until "convergence"
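Learning Rules for each Parameter

Gradient for $w_1$: $\frac{\partial L}{\partial w_1}$
Gradient for $w_2$: $\frac{\partial L}{\partial w_2}$
Gradient for $b$: $\frac{\partial L}{\partial b}$
Update for $w_1$: $w_1 \leftarrow w_1 - \eta \frac{\partial L}{\partial w_1}$
Update for $w_2$: $w_2 \leftarrow w_2 - \eta \frac{\partial L}{\partial w_2}$
Update for $b$: $b \leftarrow b - \eta \frac{\partial L}{\partial b}$
$\eta$ is the learning rate
Fill in new table!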
IMPLEMENTATION IN PYTHON
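
The lecture's own Python implementation is not part of this transcript. What follows is a minimal NumPy sketch of gradient descent with backprop for the single perceptron, assuming a sigmoid activation and a summed squared-error loss $(y - p)^2$; the function name `train_perceptron`, the learning rate of 0.5, and the 5000 iterations are illustrative choices.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_perceptron(X, y, w, b, learning_rate=0.5, n_iter=5000):
    """Full-batch gradient descent on the summed squared error (y - p)^2."""
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)                  # forward pass
        delta = -2.0 * (y - p) * p * (1 - p)    # dL/dh for each observation (chain rule)
        w = w - learning_rate * (X.T @ delta)   # dL/dw_j = sum over observations of delta * x_j
        b = b - learning_rate * delta.sum()     # dL/db   = sum over observations of delta
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

# Start from the values used in the by-hand example
w, b = train_perceptron(X, y, w=np.array([1.0, -1.0]), b=-0.5)
p = sigmoid(X @ w + b)
print(np.round(p, 2), (p > 0.5).astype(int))
```

After training, the rounded predictions match y for all four observations of the first example.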

Another Example

X1 | X2 | y |
---|---|---|
0 | 0 | 0 |
0 | 1 | 0 |
1 | 0 | 0 |
1 | 1 | 1 |
Final Example

X1 | X2 | y |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
What Happened?

Why didn't this work?
Is this relationship "harder" in some sense?
Let's plot it and see.
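
One way to see the difficulty without a plot: a single neuron predicts 1 exactly when w1*x1 + w2*x2 + b > 0, so its decision boundary is a straight line, and no straight line separates the 1s from the 0s in the final example (XOR). The sketch below trains a single sigmoid neuron on that table; the training setup (squared error, learning rate, iteration count, random seed) is an assumption for illustration.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def fit_single_neuron(X, y, learning_rate=0.5, n_iter=5000, seed=0):
    """Train one sigmoid neuron with gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        delta = -2.0 * (y - p) * p * (1 - p)
        w -= learning_rate * (X.T @ delta)
        b -= learning_rate * delta.sum()
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)

w, b = fit_single_neuron(X, y_xor)
predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
accuracy = float((predictions == y_xor).mean())
print(predictions, accuracy)
```

Whatever weights training ends up with, at least one of the four observations is misclassified, because a linear boundary can get at most three of them right.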
MULTILAYER PERCEPTRONS

Perceptron -> MLP

With a small change, we can turn our perceptron model into a multilayer perceptron
- Instead of just one linear combination, we are going to take several, each with a different set of weights (called a hidden unit)
- Each linear combination will be followed by a nonlinear activation
- Each of these nonlinear features will be fed into the logistic regression classifier
- All of the weights are learned end-to-end via SGD
MLPs learn a set of nonlinear features directly from data
"Feature learning" is the hallmark of deep learning approaches
MULTILAYER PERCEPTRONS (MLPs)

Let's set up the following MLP with 1 hidden layer that has 3 hidden units:
Each neuron in the hidden layer is going to do exactly the same thing as before.
MULTILAYER PERCEPTRONS (MLPs)

Computations are:
Hidden layer weight derivatives
Output layer weight derivatives
(if we use a sigmoid activation function)
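The equations on these slides did not survive extraction, so here is a hedged NumPy sketch of what the forward computations and both sets of derivatives could look like for a 2-input, 3-hidden-unit, 1-output MLP. It assumes sigmoid activations throughout and a summed squared-error loss; all names (`forward`, `backward`, `W1`, `w2`, and so on) are illustrative. A finite-difference check confirms that the backprop derivative agrees with a numerical estimate.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def forward(params, X):
    W1, b1, w2, b2 = params
    a = sigmoid(X @ W1 + b1)   # hidden layer: 3 units, each a weighted sum + sigmoid
    p = sigmoid(a @ w2 + b2)   # output unit: probability
    return a, p

def loss(params, X, y):
    _, p = forward(params, X)
    return float(np.sum((y - p) ** 2))

def backward(params, X, y):
    W1, b1, w2, b2 = params
    a, p = forward(params, X)
    # Output layer weight derivatives
    delta_out = -2.0 * (y - p) * p * (1 - p)               # dL/dh_out, shape (n,)
    grad_w2 = a.T @ delta_out
    grad_b2 = delta_out.sum()
    # Hidden layer weight derivatives (error propagated back through w2)
    delta_hidden = np.outer(delta_out, w2) * a * (1 - a)   # shape (n, 3)
    grad_W1 = X.T @ delta_hidden
    grad_b1 = delta_hidden.sum(axis=0)
    return grad_W1, grad_b1, grad_w2, grad_b2

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=3), 0.0]

# Finite-difference check on one hidden-layer weight
eps = 1e-5
plus = [q.copy() if isinstance(q, np.ndarray) else q for q in params]
minus = [q.copy() if isinstance(q, np.ndarray) else q for q in params]
plus[0][0, 0] += eps
minus[0][0, 0] -= eps
numeric = (loss(plus, X, y) - loss(minus, X, y)) / (2 * eps)
analytic = backward(params, X, y)[0][0, 0]
print(numeric, analytic)
```

The factors `p * (1 - p)` and `a * (1 - a)` are the sigmoid derivative; with a different activation function those terms would change.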
MLP Terminology

Forward pass = computing probability from input
Backward pass = computing derivatives from output
Hidden layers are often called "dense" layers
MULTILAYER PERCEPTRONS (MLPs)

We can increase the flexibility by adding more layers

but we run the risk of overfitting...
REGULARIZATION

REGULARIZATION

One of the biggest problems with neural networks is overfitting.
Regularization schemes combat overfitting in a variety of different ways
REGULARIZATION

A perceptron represents the following optimization problem:
where
REGULARIZATION

A familiar way to regularize is to introduce penalties and change the objective from the loss alone to the loss plus a penalty term R(W)
where R(W) is often the L1 or L2 norm of W. These are the well-known LASSO (L1) and ridge (L2) penalties; the L2 penalty is referred to as weight decay by the neural net community
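The slide's formulas are missing from this transcript. One common way to write the change, given here as an assumption rather than the lecture's exact notation, uses a generic per-observation loss $\ell$, a network $f(x; W)$, and a penalty strength $\lambda \ge 0$:

```latex
\min_{W} \; \sum_{i=1}^{n} \ell\big(y_i, f(x_i; W)\big)
\quad\longrightarrow\quad
\min_{W} \; \sum_{i=1}^{n} \ell\big(y_i, f(x_i; W)\big) + \lambda\, R(W),
\qquad
R(W) = \lVert W \rVert_1 = \sum_j |w_j| \;\;\text{(L1)}
\quad\text{or}\quad
R(W) = \lVert W \rVert_2^2 = \sum_j w_j^2 \;\;\text{(L2)}
```

Note that the ridge penalty is conventionally the squared L2 norm. Setting $\lambda = 0$ recovers the unpenalized problem.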
L1/L2 REGULARIZATION

We can limit the size of the L2 norm of the weight vector:
where
We can do the same for the L1 norm. What do these penalties do?
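As a rough illustration of what the penalties do, the sketch below takes one gradient step on each penalty alone, with the data gradient taken as zero. The penalty strength `lam`, the learning rate, and the example weights are all made up for the demonstration.

```python
import numpy as np

def l2_penalty(w, lam):
    """Ridge-style penalty lam * sum(w^2) and its gradient."""
    return lam * np.sum(w ** 2), 2.0 * lam * w

def l1_penalty(w, lam):
    """LASSO-style penalty lam * sum(|w|) and its (sub)gradient."""
    return lam * np.sum(np.abs(w)), lam * np.sign(w)

w = np.array([2.0, -1.0, 0.5])
lam, learning_rate = 0.1, 0.5

# One gradient step on the penalty alone (data gradient taken as zero for illustration)
_, g2 = l2_penalty(w, lam)
_, g1 = l1_penalty(w, lam)
w_l2 = w - learning_rate * g2   # every weight shrinks by the same factor (here 0.9)
w_l1 = w - learning_rate * g1   # every weight moves toward 0 by the same amount (here 0.05)
print(w_l2, w_l1)
```

The L2 step scales large weights down more in absolute terms, while the L1 step pulls every nonzero weight toward zero by a constant amount, which is why L1 tends to push small weights all the way to zero.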
SHRINKAGE

L1 and L2 penalties shrink the weights towards 0


[Figure panels: L2 Penalty, L1 Penalty]
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Vol. 1. New York: Springer series in statistics, 2001.
Why is this a "good" idea?
STOCHASTIC REGULARIZATION

Often, we will inject noise into the neural network during training. By far the most popular way to do this is dropout
Given a hidden layer, we set each element of the hidden layer to 0 with probability p at each SGD update.
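A minimal NumPy sketch of that masking step (the function name `dropout` and the array sizes are illustrative):

```python
import numpy as np

def dropout(hidden, p, rng):
    """Set each element of `hidden` to 0 independently with probability p."""
    keep_mask = rng.random(hidden.shape) >= p
    return hidden * keep_mask

rng = np.random.default_rng(0)
hidden = np.ones((4, 5))   # a batch of 4 examples, 5 hidden units
# During training, a fresh mask would be drawn at every SGD update
dropped = dropout(hidden, p=0.5, rng=rng)
print(dropped)
```

Many implementations also rescale the surviving activations by 1 / (1 - p) during training so that their expected value is unchanged; that refinement is omitted here.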

STOCHASTIC REGULARIZATION

One way to think of this is that dropout trains many bagged versions of the network. Bagging reduces variance.
Others have argued this is an approximate Bayesian model

STOCHASTIC REGULARIZATION

Many have argued that SGD itself provides regularization

INITIALIZATION REGULARIZATION

The weights in a neural network are given random values initially
There is an entire literature on the best way to do this initialization
- Normal
- Truncated Normal
- Uniform
- Orthogonal
- Scaled by number of connections
- etc
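Several of the listed schemes can be sketched in NumPy for a single weight matrix. The layer sizes, the standard deviation of 0.05, the truncation at two standard deviations, and the Glorot-style limit used for "scaled by number of connections" are assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 64, 32   # illustrative layer sizes

# Normal
W_normal = rng.normal(loc=0.0, scale=0.05, size=(n_in, n_out))

# Truncated normal: redraw any value more than 2 standard deviations from 0
W_trunc = rng.normal(scale=0.05, size=(n_in, n_out))
outside = np.abs(W_trunc) > 0.1
while outside.any():
    W_trunc[outside] = rng.normal(scale=0.05, size=outside.sum())
    outside = np.abs(W_trunc) > 0.1

# Uniform
W_uniform = rng.uniform(-0.05, 0.05, size=(n_in, n_out))

# Orthogonal: the Q factor of a random Gaussian matrix has orthonormal columns
W_orth, _ = np.linalg.qr(rng.normal(size=(n_in, n_out)))

# Scaled by number of connections (Glorot/Xavier-style uniform limit)
limit = np.sqrt(6.0 / (n_in + n_out))
W_scaled = rng.uniform(-limit, limit, size=(n_in, n_out))
```

Deep learning libraries ship these as ready-made initializers, so in practice the choice is usually a one-line option on the layer.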
INITIALIZATION REGULARIZATION

Try to "bias" the model into initial configurations that are easier to train
A very popular way to do this is transfer learning
Train a model on an auxiliary task where lots of data is available

Use the final weight values from that task as initial values and "fine tune" on the primary task

STRUCTURAL REGULARIZATION

However, the key advantage of neural nets is the ability to easily include properties of the data directly into the model through the network's structure
Convolutional neural networks (CNNs) are a prime example of this (Kun will discuss CNNs)
Conclusions

Backprop, perceptrons, and MLPs are the "building blocks" of neural nets
You'll get a chance to demonstrate your mastery in HW 1A and 1B.
We will reuse these concepts for the rest of the semester.