Andrew Beam, PhD
Department of Biomedical Informatics
February 23rd, 2018
twitter: @AndrewLBeam
Deep learning is a specific kind of machine learning
- Machine learning automatically learns relationships using data
- Deep learning refers to large neural networks
- These neural networks have millions of parameters and hundreds of layers (i.e. they are structurally deep)
- Most important: Deep learning is not magic!
Say we want to build a model to predict the likelihood of a patient having a heart attack (MI) based on blood pressure (BP) and BMI
A neural net is a modular way to build a classifier
Inputs
Output
Probability of MI
The neuron is the basic functional unit of a neural network
A neuron does two things, and only two things
1) Weighted sum of inputs: $h = w_1 x_1 + w_2 x_2 + b$, where $w_1$ is the weight for input $x_1$, $w_2$ is the weight for input $x_2$, and $b$ is a bias term
2) Nonlinear transformation: $p = f(h)$
$f$ is known as the activation function
Sigmoid
Hyperbolic Tangent
Summary: A neuron produces a single number that is a nonlinear transformation of its input connections
$f(w_1 x_1 + w_2 x_2 + b)$ = a number, where $f$ is the activation function applied to the weighted sum of the inputs
This simple formula allows for an amazing amount of expressiveness
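A minimal sketch of a single neuron in R, assuming two inputs as in the BP/BMI example. The function name `neuron`, the weights, and the input values are made up for illustration; `plogis` is base R's sigmoid and `tanh` is the hyperbolic tangent.

```r
# A single neuron: weighted sum of inputs, then a nonlinear transformation.
# x: numeric vector of inputs, w: one weight per input, b: bias term.
neuron <- function(x, w, b, activation = plogis) {
  h <- sum(w * x) + b   # 1) weighted sum of inputs
  activation(h)         # 2) nonlinear transformation (activation function)
}

# Hypothetical standardized inputs and arbitrary weights, for illustration only
x <- c(bp = 0.8, bmi = -0.3)
w <- c(0.5, 1.2)
b <- -0.1

p_sigmoid <- neuron(x, w, b)                     # sigmoid output lies in (0, 1)
p_tanh    <- neuron(x, w, b, activation = tanh)  # tanh output lies in (-1, 1)
print(c(sigmoid = p_sigmoid, tanh = p_tanh))
```

Swapping the `activation` argument is the only change needed to move between the two activation functions named above.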
Let's say we'd like to have a single neuron learn a simple function
X1 | X2 | y |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 1 |
Observations
How do we make a prediction for each observation?
Assume we have the following values
w1 | w2 | b |
---|---|---|
1 | -1 | -0.5 |
For the first observation:
First compute the weighted sum: $h = w_1 x_1 + w_2 x_2 + b = (1)(0) + (-1)(0) - 0.5 = -0.5$
Transform to probability: $p = \frac{1}{1 + e^{-h}} = \frac{1}{1 + e^{0.5}} \approx 0.38$
Round to get prediction: round(0.38) = 0
Putting it all together:
Assume we have the following values
w1 | w2 | b |
---|---|---|
1 | -1 | -0.5 |
X1 | X2 | y | h | p | prediction |
---|---|---|---|---|---|
0 | 0 | 0 | -0.5 | 0.38 | 0 |
0 | 1 | 1 | | | |
1 | 0 | 1 | | | |
1 | 1 | 1 | | | |
Fill out this table
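One way to check a filled-in table is to compute every row at once. The sketch below assumes the sigmoid (`plogis`) as the activation function and uses the weights given above; the column name `prediction` is chosen here for the unlabeled last column.

```r
# The four observations and the assumed weights from the slides
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
y <- c(0, 1, 1, 1)
w <- c(1, -1)
b <- -0.5

h <- as.vector(X %*% w + b)   # weighted sum for every observation
p <- plogis(h)                # sigmoid turns h into a probability
pred <- round(p)              # round to get a 0/1 prediction

result <- data.frame(X1 = X[, 1], X2 = X[, 2], y = y,
                     h = h, p = round(p, 2), prediction = pred)
print(result)
```

With these weights only two of the four predictions match `y`.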
Our neural net isn't so great... how do we make it better?
What do I even mean by better?
Let's define how we want to measure the network's performance.
There are many ways, but let's use squared-error: $L = \frac{1}{2}\sum_i (y_i - p_i)^2$, where $y_i$ is the actual value and $p_i$ is the predicted probability for observation $i$
Now we need to find values for $w_1$, $w_2$, and $b$ that make this error as small as possible
Our task is learning values for $w_1$, $w_2$, and $b$ such that the difference between the predicted and actual values is as small as possible.
So, how do we find the "best" values for $w_1$, $w_2$, and $b$?
hint: calculus
Recall (without PTSD) that the derivative of a function tells you how it is changing at any given location.
If the derivative is positive, it means it's going up.
If the derivative is negative, it means it's going down.
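To make the sign of the derivative concrete, here is a small sketch that treats the squared-error loss as a function of the bias alone, with the two weights held at 1 and -1. The helper names `loss` and `num_deriv`, the factor 0.5, and the finite-difference step are assumptions for illustration, not part of the original slides.

```r
# Squared-error loss of the single neuron as a function of the bias b alone,
# holding w1 = 1 and w2 = -1 fixed (values taken from the slides).
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
y <- c(0, 1, 1, 1)
w <- c(1, -1)

loss <- function(b) {
  p <- plogis(as.vector(X %*% w) + b)
  0.5 * sum((y - p)^2)   # the 0.5 is an assumed convention; it only rescales
}

# Central finite-difference approximation of the derivative at a point
num_deriv <- function(f, at, eps = 1e-6) {
  (f(at + eps) - f(at - eps)) / (2 * eps)
}

d <- num_deriv(loss, at = -0.5)
print(d)                            # negative: the loss is going down as b increases
print(c(loss(-0.5), loss(-0.4)))    # so a slightly larger b gives a smaller loss
```

Because the derivative at b = -0.5 is negative, subtracting it moves b upward, which is the direction that lowers the loss.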
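Simple strategy:
- Start with initial values for $w_1$, $w_2$, and $b$
- Take partial derivatives of the loss function $L$ with respect to $w_1$, $w_2$, and $b$
- Subtract the derivative (also called the gradient) from each of $w_1$, $w_2$, and $b$
Here $p_i$ is the predicted probability, $y_i$ is the actual value, and $x_{i1}$, $x_{i2}$ are the inputs for observation $i$.
Gradient for $w_1$: $\frac{\partial L}{\partial w_1} = \sum_i (p_i - y_i)\, p_i (1 - p_i)\, x_{i1}$
Gradient for $w_2$: $\frac{\partial L}{\partial w_2} = \sum_i (p_i - y_i)\, p_i (1 - p_i)\, x_{i2}$
Gradient for $b$: $\frac{\partial L}{\partial b} = \sum_i (p_i - y_i)\, p_i (1 - p_i)$
Update for $w_1$: $w_1 \leftarrow w_1 - \frac{\partial L}{\partial w_1}$
Update for $w_2$: $w_2 \leftarrow w_2 - \frac{\partial L}{\partial w_2}$
Update for $b$: $b \leftarrow b - \frac{\partial L}{\partial b}$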
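The gradients follow from the chain rule. The derivation below is a sketch that assumes a sigmoid activation and a squared-error loss scaled by 1/2; the scaling is an assumption, chosen so that the result matches `grad_w1`, `grad_w2`, and `grad_b` in the `train` function. Here $h_i$ is the weighted sum, $p_i$ the predicted probability, and $x_{i1}$, $x_{i2}$ the inputs for observation $i$.

```latex
\begin{aligned}
L &= \tfrac{1}{2}\sum_i (y_i - p_i)^2, \qquad
p_i = \sigma(h_i) = \frac{1}{1 + e^{-h_i}}, \qquad
h_i = w_1 x_{i1} + w_2 x_{i2} + b \\
\frac{\partial L}{\partial w_1}
  &= \sum_i \frac{\partial L}{\partial p_i}\,
            \frac{\partial p_i}{\partial h_i}\,
            \frac{\partial h_i}{\partial w_1}
   = \sum_i (p_i - y_i)\; p_i (1 - p_i)\; x_{i1} \\
\frac{\partial L}{\partial w_2}
  &= \sum_i (p_i - y_i)\; p_i (1 - p_i)\; x_{i2} \\
\frac{\partial L}{\partial b}
  &= \sum_i (p_i - y_i)\; p_i (1 - p_i)
\end{aligned}
```

The middle factor uses the sigmoid identity $\sigma'(h) = \sigma(h)\,(1 - \sigma(h))$, and the last factor is 1 for the bias because $\partial h_i / \partial b = 1$.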
Fill in new table!
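train <- function(X, y, w, b, iter = 10, lr = 1) {
  # Train a single sigmoid neuron with gradient descent on squared error
  w_new <- w
  b_new <- b
  for (i in seq_len(iter)) {
    # Forward pass: predicted probability for every observation
    preds <- as.vector(plogis(X %*% w_new + b_new))
    # Per-observation gradients of the squared error (chain rule)
    grad_w1 <- (preds - y) * preds * (1 - preds) * X[, 1]
    grad_w2 <- (preds - y) * preds * (1 - preds) * X[, 2]
    grad_b  <- (preds - y) * preds * (1 - preds)
    # Gradient descent step: sum over observations, scale by the learning rate
    w_new[1] <- w_new[1] - lr * sum(grad_w1)
    w_new[2] <- w_new[2] - lr * sum(grad_w2)
    b_new <- b_new - lr * sum(grad_b)
    # Mean squared error of the predictions made before this update
    error <- (y - preds)^2
    print(paste0("Error at iteration ", i, ": ", mean(error)))
  }
  return(list(w = w_new, b = b_new))
}
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))  # the four observations (X1, X2)
y <- c(0, 1, 1, 1)                              # labels for each observation
w <- c(1, -1)                                   # starting weights
b <- -0.5                                       # starting bias
train(X, y, w, b)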
X1 | X2 | y |
---|---|---|
0 | 0 | 0 |
0 | 1 | 0 |
1 | 0 | 0 |
1 | 1 | 1 |
X1 | X2 | y |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
Why didn't this work?
Is this relationship "harder" in some sense?
Let's plot it and see.
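The difficulty can also be checked numerically. The sketch below assumes the table that "didn't work" is the second one (y = 1 only when exactly one input is 1) and retrains a single neuron on it; `train_neuron` is a compact, hypothetical restatement of the training loop so the block runs on its own, and the iteration count is arbitrary.

```r
# Same single-neuron training loop as above, restated so this block runs alone
train_neuron <- function(X, y, w, b, iter = 5000, lr = 1) {
  for (i in seq_len(iter)) {
    p <- as.vector(plogis(X %*% w + b))
    delta <- (p - y) * p * (1 - p)            # shared part of every gradient
    w <- w - lr * as.vector(t(X) %*% delta)   # update both weights
    b <- b - lr * sum(delta)                  # update the bias
  }
  list(w = w, b = b)
}

X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
y_xor <- c(0, 1, 1, 0)

fit <- train_neuron(X, y_xor, w = c(1, -1), b = -0.5)
pred <- round(as.vector(plogis(X %*% fit$w + fit$b)))
n_correct <- sum(pred == y_xor)
print(data.frame(X1 = X[, 1], X2 = X[, 2], y = y_xor, prediction = pred))
print(n_correct)   # never 4, no matter how long we train
```

A single neuron with rounding at 0.5 splits the X1-X2 plane with one straight line (where the weighted sum equals 0). This pattern needs (0,1) and (1,0) on one side and (0,0) and (1,1) on the other, which no single line can do, so at least one observation is always misclassified.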
Neural nets are organized into layers
[Figure: network diagram. Inputs enter at the input layer, pass through the 1st hidden layer (one of its nodes is labeled as a single hidden unit) and the 2nd hidden layer, and reach the output layer, whose output is the probability of MI]
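The layered structure can be sketched in R as a chain of matrix multiplications, each followed by the activation function. The helper `layer`, the choice of three units per hidden layer, the random untrained weights, and the input values are all assumptions for illustration.

```r
# One fully connected layer: every unit takes a weighted sum of all values
# from the previous layer and applies the activation function.
layer <- function(a, W, b, activation = plogis) {
  activation(as.vector(W %*% a + b))
}

set.seed(1)                      # arbitrary seed so the sketch is repeatable
x  <- c(bp = 0.8, bmi = -0.3)    # hypothetical standardized inputs

# Randomly initialized (untrained) weights: rows = units, columns = inputs
W1 <- matrix(rnorm(3 * 2), nrow = 3); b1 <- rep(0, 3)   # 1st hidden layer, 3 units
W2 <- matrix(rnorm(3 * 3), nrow = 3); b2 <- rep(0, 3)   # 2nd hidden layer, 3 units
W3 <- matrix(rnorm(1 * 3), nrow = 1); b3 <- 0           # output layer, 1 unit

a1 <- layer(x,  W1, b1)   # input layer -> 1st hidden layer
a2 <- layer(a1, W2, b2)   # 1st hidden layer -> 2nd hidden layer
p  <- layer(a2, W3, b3)   # 2nd hidden layer -> output, read as a probability
print(p)
```

Each row of a weight matrix is one hidden unit, so a layer is just many single neurons sharing the same inputs.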
Neural networks are one of the oldest ideas in machine learning and AI
- Date back to the 1940s
- Long history of "hype" cycles - boom and bust
- Were *not* the state-of-the-art machine learning technique for most of their existence
- Why are they popular now?
Several key advancements have enabled the modern deep learning revolution
GPUs enable training of huge neural nets on extremely large datasets
Transfer Learning
Train big model on large dataset
Refine model on smaller dataset
Methodological advancements have made deeper networks easier to train
Architecture
Optimizers
Activation Functions
Easy-to-use frameworks dramatically lower the barrier to entry
Automatic differentiation allows easy prototyping
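As a small stand-in for what deep learning frameworks do, base R's `deriv()` can produce gradient code from an expression, so the derivatives never have to be worked out by hand. This is only a sketch: `deriv()` differentiates simple expressions symbolically and is not the mechanism any particular framework uses, and the variable names and input values are assumptions.

```r
# Loss for one observation of a single sigmoid neuron, written as an expression
loss_expr <- expression(0.5 * (y - 1 / (1 + exp(-(w1 * x1 + w2 * x2 + b))))^2)

# deriv() builds a function that returns the loss with a "gradient" attribute
loss_grad <- deriv(loss_expr, c("w1", "w2", "b"),
                   function.arg = c("w1", "w2", "b", "x1", "x2", "y"))

out <- loss_grad(w1 = 1, w2 = -1, b = -0.5, x1 = 1, x2 = 0, y = 1)
auto_grad <- attr(out, "gradient")[1, ]

# Hand-derived gradient for comparison: (p - y) * p * (1 - p) * input
p <- plogis(1 * 1 + (-1) * 0 - 0.5)
manual_grad <- c(w1 = (p - 1) * p * (1 - p) * 1,
                 w2 = (p - 1) * p * (1 - p) * 0,
                 b  = (p - 1) * p * (1 - p))
print(rbind(auto_grad, manual_grad))
```

Changing the loss or the activation only requires editing `loss_expr`; the gradient code is regenerated automatically, which is what makes prototyping fast.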
Dates back to the late 1980s
ImageNet Database
Large Scale Visual Recognition Challenge (ILSVRC)
A pivotal event in 2012 laid the blueprint for successful deep learning models
In 2011, a misclassification rate of 25% was near state of the art on ILSVRC
In 2012, Geoff Hinton and two graduate students, Alex Krizhevsky and Ilya Sutskever, entered ILSVRC with one of the first deep neural networks trained on GPUs, now known as "AlexNet"
Result: An error rate of 16%, roughly 10 percentage points lower than the second-place entry was able to achieve.
The computer vision world immediately took notice
The AlexNet paper has ~16,000 citations since being published in 2012!
Most algorithms expect "tabular" data
y | X1 | X2 | X3 | X4 |
---|---|---|---|---|
0 | 7 | 52 | 17 | 654 |
0 | 23 | 2752 | 4 | 1 |
1 | 786 | 27 | 0 | 5 |
0 | 354 | 7527 | 89 | 68 |
The problem with tabular data
What is this a picture of?
Tabular data throws away too much information!
Images are just 2D arrays of numbers
Goal is to build f(image) = 1
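A tiny sketch of what "2D array of numbers" means in R, and of what happens when that array is flattened into one tabular row. The 5x5 size and the pixel values are made up for illustration.

```r
# A tiny 5x5 grayscale "image": 0 = black, 1 = white (made-up pixel values)
img <- matrix(0, nrow = 5, ncol = 5)
img[2:4, 3] <- 1   # a vertical bar in the middle column
print(img)

# Flattening turns the image into one row of a table with 25 columns.
# The numbers are all still there, but which pixels were neighbors is
# no longer part of the representation.
flat <- as.vector(img)
print(flat)
```

In the matrix the three bright pixels are visibly stacked on top of each other; in the flattened row nothing marks them as belonging to one shape.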
CNNs look at small connected groups of pixels using "filters"
Image credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Images have a local correlation structure
Nearby pixels are likely to be more similar than pixels that are far away
CNNs exploit this through convolutions of small image patches
CNNs exploit strong prior information about images
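A minimal sketch of applying one filter to small image patches. The function name `convolve2d`, the 6x6 image, and the hand-picked edge filter are assumptions for illustration; in a real CNN the filter values are learned from data rather than chosen by hand.

```r
# Slide a small filter over every patch of the image ("valid" positions only)
# and record the sum of element-wise products at each position. As in most
# deep learning libraries, the filter is not flipped (cross-correlation).
convolve2d <- function(img, filt) {
  k <- nrow(filt)                       # assumes a square k x k filter
  out <- matrix(0, nrow(img) - k + 1, ncol(img) - k + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- img[i:(i + k - 1), j:(j + k - 1)]
      out[i, j] <- sum(patch * filt)
    }
  }
  out
}

# Made-up 6x6 image: dark left half (0), bright right half (1)
img <- cbind(matrix(0, 6, 3), matrix(1, 6, 3))

# A 3x3 vertical-edge filter: responds where left and right columns differ
filt <- rbind(c(-1, 0, 1),
              c(-1, 0, 1),
              c(-1, 0, 1))

feature_map <- convolve2d(img, filt)
print(feature_map)   # every row is 0 3 3 0: large only around the edge
```

The same nine filter weights are reused at every position, which is how a CNN builds in the assumption that nearby pixels belong together.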
In 2016, Google built a deep learning model to automatically diagnose patients with diabetic retinopathy
Why did this work so well?
- Huge dataset of over 100,000 images
- High quality annotations - each image was rated by 3-7 ophthalmologists
- Transfer learning - neural network was originally trained on ImageNet!
- For the cost of a GPU (~$1,000) it's possible to read 240 million images/day at accuracy on par with the best ophthalmologists!
Implications
- Many subsequent studies have followed this formula
- Ingredients: Deep learning + high quality database of ~100,000 medical images + transfer learning
- Many medical imaging tasks in radiology, pathology, dermatology, and ophthalmology can be fully automated in a similar manner
- Similar results emerging from non-image data
How will this technology change medical practice, reimbursement, and other policies?
https://arxiv.org/pdf/1711.05225.pdf
http://beamandrew.github.io