Andrew Beam, PhD
Instructor
Department of Biomedical Informatics
Harvard School of Public Health
October 25th, 2017
twitter: @AndrewLBeam
We are offering BMI 707 in the spring second session. In this class you will:
If you like today's deep learning tapas, sign up for our course and have the full meal
http://beamandrew.github.io
Source: https://deepmind.com/blog/alphago-zero-learning-scratch/
Human data no longer needed
Medical imaging is already being changed
Inferring drug-resistance status in tuberculosis from sequence data using deep learning
Joint work with: Michael Chen, Maha Farat
What does this patient have?
A six-year-old boy has a high fever that has lasted for three days. He has extremely red eyes and a rash on the main part of his body in addition to a swollen and red strawberry tongue. Remaining symptoms include swollen lymph nodes in the neck and irritability.
Image credit: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
A neural net is made up of 3 things
- The network structure
- The loss function
- The optimizer
A neural net is a modular way to build a classifier
[Figure: network diagram, labeled "Inputs" and "Output"]
The neuron is the basic functional unit of a neural network
[Figure: a single neuron, with a weight labeled for each input]
A neuron does two things, and only two things
1) Weighted sum of inputs
2) Nonlinear transformation
The function applied in step 2 is known as the activation function, and there are many choices
- Sigmoid
- Hyperbolic Tangent
Summary: A neuron produces a single number that is a nonlinear transformation of its input connections
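The two steps a neuron performs can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the lecture exercises: the input values, the weights, and the optional `bias` term are assumptions added for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Compute one neuron's output from its input connections."""
    # 1) Weighted sum of inputs (the bias term is an assumption of this sketch)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # 2) Nonlinear transformation
    return activation(z)

# Hypothetical inputs and weights, chosen only for illustration
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]

output_sigmoid = neuron(x, w)                     # a single number in (0, 1)
output_tanh = neuron(x, w, activation=math.tanh)  # a single number in (-1, 1)
```

Swapping the activation function changes the range of the output but not the structure: the neuron still returns a single number.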
Neural nets are organized into layers
- Input Layer
- 1st Hidden Layer
- 2nd Hidden Layer
- Output Layer
[Figure: network diagram from inputs to output, with each layer and a single hidden unit labeled]
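Stacking layers means feeding the outputs of one layer in as the inputs of the next. The sketch below assumes a hypothetical network with 4 inputs, two hidden layers of 3 units each, and a single output unit; the layer sizes, the random initializer, and the choice of tanh for the hidden units are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases, activation):
    """Each row of `weights` holds the input weights of one unit in the layer."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

def random_layer(n_inputs, n_units, rng):
    """Hypothetical initializer: small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [0.0] * n_units
    return weights, biases

rng = random.Random(0)
w1, b1 = random_layer(4, 3, rng)  # input layer (4 inputs) -> 1st hidden layer
w2, b2 = random_layer(3, 3, rng)  # 1st hidden layer -> 2nd hidden layer
w3, b3 = random_layer(3, 1, rng)  # 2nd hidden layer -> output layer (1 unit)

x = [0.2, -1.0, 0.5, 1.5]                   # hypothetical input values
h1 = dense_layer(x, w1, b1, math.tanh)      # outputs of the 1st hidden layer
h2 = dense_layer(h1, w2, b2, math.tanh)     # outputs of the 2nd hidden layer
output = dense_layer(h2, w3, b3, sigmoid)   # the network's single output
```

Every hidden unit here is one neuron: a weighted sum of the previous layer's outputs followed by a nonlinear transformation.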
We need a way to measure how well the network is performing, e.g. is it making good predictions?
Loss function: A function that returns a single number which indicates how closely a prediction matches the ground truth label
small loss = good
big loss = bad
A classic loss function for binary classification is binary cross-entropy
y | p | Loss |
---|---|---|
0 | 0.1 | 0.1 |
0 | 0.9 | 2.3 |
1 | 0.1 | 2.3 |
1 | 0.9 | 0.1 |
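The values in the table can be reproduced with a short function. This sketch assumes the natural logarithm, which is what matches the numbers shown, and assumes `p` is the predicted probability that y = 1, strictly between 0 and 1.

```python
import math

def binary_cross_entropy(y, p):
    """Loss for one example: y is the true label (0 or 1), p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Recompute each row of the table, rounded to one decimal place
for y, p in [(0, 0.1), (0, 0.9), (1, 0.1), (1, 0.9)]:
    print(y, p, round(binary_cross_entropy(y, p), 1))
```

Confident predictions that agree with the label give a small loss, and confident predictions that disagree give a large one.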
The output layer needs to "match" the loss function
- Correct shape
- Correct scale
For binary cross-entropy, the network needs to produce a single probability
- One unit in the output layer to represent this probability
- Activation function must "squash" output to be between 0 and 1
We can change the output layer & loss to model many different kinds of data
- Multiple classes
- Continuous response (i.e. regression)
- Survival data
- Combinations of the above
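As a rough illustration of how the output layer and loss are swapped together, the sketch below pairs a softmax output with categorical cross-entropy for multiple classes, and an unsquashed output with squared error for a continuous response. These pairings are common choices rather than something the slides specify, and the raw scores are hypothetical.

```python
import math

def softmax(scores):
    """Turn one raw score per class into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(true_class, probs):
    """Multiple classes: negative log of the probability given to the true class."""
    return -math.log(probs[true_class])

def squared_error(y, prediction):
    """Continuous response: the output unit is used as-is, with no squashing."""
    return (y - prediction) ** 2

probs = softmax([2.0, 1.0, 0.1])  # hypothetical raw outputs of a 3-unit output layer
loss_if_class_0 = categorical_cross_entropy(0, probs)  # true class got the top score
loss_if_class_2 = categorical_cross_entropy(2, probs)  # true class got the lowest score
```

In both cases the shape and scale of the output layer are chosen to fit what the loss expects: one probability per class, or one unbounded number.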
Question:
Now that we have specified:
- A network
- Loss function
How do we find the values for the weights that give us the smallest possible value for the loss function?
How do we minimize the loss function?
Gradient Descent
Many variations on the basic idea of stochastic gradient descent (SGD) are available
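A minimal sketch of gradient descent, assuming a one-input sigmoid neuron trained with binary cross-entropy on a hypothetical toy dataset. The data, learning rate, and number of steps are illustrative. This version uses every example at each step; SGD instead estimates the gradient from a random subset of the data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, data):
    """Average binary cross-entropy of a one-input sigmoid neuron."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

def gradient(w, b, data):
    """Average gradient of the loss with respect to w and b."""
    grad_w = grad_b = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        grad_w += (p - y) * x  # d(loss)/dw for sigmoid + cross-entropy
        grad_b += p - y        # d(loss)/db
    return grad_w / len(data), grad_b / len(data)

# Hypothetical toy data: (input, label) pairs
data = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.0, 0), (2.0, 1), (3.0, 1)]

w, b = 0.0, 0.0
learning_rate = 0.5
initial_loss = mean_loss(w, b, data)
for step in range(200):
    grad_w, grad_b = gradient(w, b, data)
    w -= learning_rate * grad_w  # step downhill, against the gradient
    b -= learning_rate * grad_b
final_loss = mean_loss(w, b, data)
```

Each step nudges the weights in the direction that lowers the loss, so `final_loss` ends up below `initial_loss`.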
Several key advancements have enabled the modern deep learning revolution
The advent of massively parallel computing on GPUs enabled training of huge neural nets on extremely large datasets
Methodological advancements have made deeper networks easier to train
Architecture
Optimizers
Activation Functions
Regularization is key!
Dropout
L1 & L2
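Both kinds of regularization are small modifications to training. The sketch below assumes the "inverted" form of dropout, which rescales the surviving units, and uses hypothetical weights, activations, and penalty strength.

```python
import random

def l1_penalty(weights, strength):
    """L1: strength times the sum of absolute weights, added to the loss."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """L2: strength times the sum of squared weights, added to the loss."""
    return strength * sum(w * w for w in weights)

def dropout(activations, drop_prob, rng):
    """Training-time dropout: zero each unit with probability drop_prob.

    Survivors are scaled by 1 / (1 - drop_prob) so the expected activation
    is unchanged (an assumption of this sketch, known as inverted dropout).
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

weights = [0.5, -1.0, 2.0]     # hypothetical weights
hidden = [0.3, 0.8, 0.1, 0.9]  # hypothetical hidden-layer outputs
rng = random.Random(0)
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

The penalties discourage large weights, while dropout forces the network not to rely on any single hidden unit.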
Transfer Learning
Train big model on large dataset
Refine model on smaller dataset
Robust frameworks and abstractions make iteration faster and less error prone
Automatic differentiation allows easy prototyping
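To give a feel for what automatic differentiation does, here is a toy sketch using dual numbers, where every value carries its derivative along with it. This is forward-mode differentiation, chosen because it is short; deep learning frameworks typically rely on reverse-mode differentiation (backpropagation). The `Dual` class and the one-weight model are hypothetical.

```python
import math

class Dual:
    """A number that carries its derivative along with its value."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def sigmoid(z):
    """Sigmoid of a Dual, using d/dz sigmoid(z) = s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-z.value))
    return Dual(s, s * (1.0 - s) * z.deriv)

def derivative(f, x):
    """Evaluate df/dx at x by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).deriv

def model(w):
    """Hypothetical one-weight model: sigmoid(w * 2.0 + 0.5)."""
    return sigmoid(w * 2.0 + 0.5)

slope = derivative(model, 0.1)  # d(model)/dw at w = 0.1, with no hand-derived formula
```

The model is written as ordinary arithmetic, and the derivative comes along for free, which is what makes prototyping new architectures easy.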
Exercise 1
https://github.com/beamandrew/HSPH_lecture
Convolutional neural networks (CNNs) date back to the late 1980s
ImageNet Database
Large Scale Visual Recognition Challenge (ILSVRC)
Pivotal event occurred in an image recognition contest which brought together 3 critical ingredients for the first time:
In 2011, a misclassification rate of 25% was near state of the art on ILSVRC
In 2012, Geoff Hinton and two graduate students, Alex Krizhevsky and Ilya Sutskever, entered ILSVRC with one of the first deep neural networks trained on GPUs, now known as "AlexNet"
Result: An error rate of 16%, roughly 10 percentage points lower than the second-place entry was able to achieve.
The computer vision world immediately took notice
The AlexNet paper has received ~16,000 citations since being published in 2012!
Most algorithms expect "tabular" data
y | X1 | X2 | X3 | X4 |
---|---|---|---|---|
0 | 7 | 52 | 17 | 654 |
0 | 23 | 2752 | 4 | 1 |
1 | 786 | 27 | 0 | 5 |
0 | 354 | 7527 | 89 | 68 |
The problem with tabular data
What is this a picture of?
Tabular data throws away too much information!
Images are just 2D arrays of numbers
Goal is to build f(image) = 1
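A small sketch of what "throwing away information" means, using a hypothetical 4x4 grayscale image stored as a nested list. Flattening it into one tabular row keeps every pixel value but loses which pixels were neighbors.

```python
# A hypothetical 4x4 grayscale "image": each number is one pixel's intensity
image = [
    [0, 0, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 9, 0],
    [0, 0, 9, 0],
]
height, width = len(image), len(image[0])

# Flattening turns the image into one tabular row (X1, X2, ..., X16)
row = [pixel for image_row in image for pixel in image_row]

def flat_index(r, c):
    """Column of the tabular row that holds pixel (r, c)."""
    return r * width + c

# Pixels (0, 2) and (1, 2) touch in the image,
# but sit `width` columns apart in the flattened row
gap = flat_index(1, 2) - flat_index(0, 2)
```

A model that only sees `row` has no built-in way to know that columns 3 and 7 are vertical neighbors in the picture.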
CNNs look at small connected groups of pixels using "filters"
Image credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Images have a local correlation structure
Nearby pixels are likely to be more similar than pixels that are far away
CNNs exploit this through convolutions of small image patches
Example convolution
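One way to work through a convolution by hand is the sketch below, which slides a small filter over every patch of an image and sums the elementwise products. The image and the edge filter are hypothetical, and the sketch assumes no padding and a stride of 1.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    The filter is not flipped, so strictly speaking this is a
    cross-correlation, which is what CNN layers usually compute.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for r in range(out_h):
        output_row = []
        for c in range(out_w):
            patch_sum = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            output_row.append(patch_sum)
        output.append(output_row)
    return output

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 filter that responds to dark-to-bright vertical edges
edge_filter = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, edge_filter)
# feature_map == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The output is largest exactly where the patch contains the edge. In a CNN the filter values are not set by hand; they are weights learned during training.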
Pooling provides spatial invariance
Image credit: http://cs231n.github.io/convolutional-networks/
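A sketch of max pooling, assuming non-overlapping 2x2 blocks (a common choice, not one the slide states). The two feature maps are hypothetical: they contain the same responses shifted by one pixel.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        pooled_row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[r + i][c + j]
                     for i in range(size) for j in range(size)]
            pooled_row.append(max(block))
        pooled.append(pooled_row)
    return pooled

# Two hypothetical feature maps: the same responses, shifted by one pixel
map_a = [
    [0, 0, 0, 0],
    [0, 5, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
map_b = [
    [5, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
pooled_a = max_pool2d(map_a)
pooled_b = max_pool2d(map_b)
# Both pooled maps are [[5, 0], [0, 1]]
```

The invariance is only to small shifts: a response that moves across a block boundary would change the pooled output.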
Convolution + pooling + activation = CNN
Image credit: http://cs231n.github.io/convolutional-networks/
CNN formula is relatively simple
Image credit: http://cs231n.github.io/convolutional-networks/
Data augmentation mimics the image generative process
Image credit: http://slideplayer.com/slide/8370683/
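A sketch of data augmentation on a tiny hypothetical image. The particular transformations here, a horizontal flip and a one-pixel shift, are assumptions picked for illustration; the idea is that each transformed copy is still a plausible image with the same label.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]

def shift_right(image, pixels, fill=0):
    """Shift right by `pixels` columns (0 <= pixels <= width), padding with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

def augment(image, rng):
    """Return a randomly flipped and shifted copy; the label stays the same."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return shift_right(image, rng.randint(0, 1))

# Hypothetical 2x3 image
image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = horizontal_flip(image)   # [[3, 2, 1], [6, 5, 4]]
shifted = shift_right(image, 1)    # [[0, 1, 2], [0, 4, 5]]
augmented = augment(image, random.Random(0))
```

Calling `augment` on each training image at every pass gives the network many variations of the same picture without collecting new data.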
CNNs exploit strong prior information about images
https://simplystatistics.org/2017/05/31/deeplearning-vs-leekasso/
http://beamandrew.github.io/deeplearning/2017/06/04/deep_learning_works.html
Exercise 2
https://github.com/beamandrew/HSPH_lecture
Barrier to entry for deep learning is actually low
... but a few things might stand in your way:
SEE YOU IN THE SPRING!