Recap: A (fully-connected, feed-forward) neural network
[figure: network diagram with input, hidden, and output layers; each neuron computes a linear combination with learnable weights, followed by an activation]
Forward pass: evaluate, given the current parameters
linear combination
(nonlinear) activation
loss function
\(\nabla_{W^2} \mathcal{L}(g,y)\)
Backward pass: run SGD to learn all parameters
e.g. to update \(W^2\)
suppose we sampled a particular \((x,y)\); how to find \(\nabla_{W^2} \mathcal{L}(g,y)\)?
backpropagation: reuse of shared computation
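The reuse is easy to see in code. Below is a minimal numpy sketch (the sizes, seed, and squared loss are illustrative assumptions, not the lecture's exact setup): the upstream gradient `dg` is computed once, then shared by the gradients of both weight matrices.

```python
import numpy as np

# Tiny 2-layer net: x -> W1 -> ReLU -> W2 -> g, squared loss L = (g - y)^2.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = 1.0
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((1, 3))

# forward pass: cache the intermediates
z1 = W1 @ x
a1 = np.maximum(z1, 0.0)    # ReLU
g = W2 @ a1

# backward pass: dg is computed once and *shared* by both gradients
dg = 2.0 * (g - y)
grad_W2 = np.outer(dg, a1)  # reuses dg
da1 = W2.T @ dg             # reuses dg again -- the "reuse of computation"
dz1 = da1 * (z1 > 0)        # ReLU gate
grad_W1 = np.outer(dz1, x)

print(grad_W1.shape, grad_W2.shape)  # (3, 4) (1, 3)
```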
(The demo won't embed in PDF. But the direct link below works.)
[video edited from 3b1b]
Why do we need a specialized network (hypothesis class)?
For higher-resolution images, or more complex tasks, or larger networks, the number of parameters can grow very fast.
426-by-426 grayscale image
Using the same 2-hidden-layer network to predict which top-10 engineering school's seal this image shows, we would need to learn ~3M parameters.
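To see where a count like ~3M comes from, note that the first fully-connected layer alone dominates. The hidden widths below are hypothetical (the slide doesn't state them); even with just 16 units per hidden layer the count is already near 3M.

```python
# Hypothetical layer widths -- chosen only to illustrate the scale.
in_units = 426 * 426        # flattened 426-by-426 grayscale image
h1, h2, out = 16, 16, 10    # two hidden layers, 10-way output (assumed)
params = in_units * h1 + h1 * h2 + h2 * out  # weights only, biases ignored
print(params)  # 2904032, i.e. ~3M, dominated by the first layer
```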
Underfitting
Appropriate
Overfitting
Recall that models with needless parameters tend to overfit.
If we know the data is generated by the green curve, it's easy to choose the appropriate quadratic hypothesis class.
so... do we know anything about vision problems?
Why do we humans think
is a 9?
Why do we think any of
is a 9?
[video edited from 3b1b]
Layered structures are well-suited to model this hierarchical processing.
CNNs exploit this structure, via convolution, to handle images efficiently and effectively.
the same feedforward net as before
typical CNN architecture for image classification
A convolutional layer might sound foreign, but it's very similar to a fully-connected layer:
Layer | Forward pass, do | Backward pass, learn | Design choices
---|---|---|---
fully-connected | dot-product, activation | neuron weights | neuron count, etc.
convolutional | convolution, activation | filter weights | conv specs, etc.
example: 1-dimensional convolution
input: [0, 1, -1, 1, 1]; filter: [-1, 1]; convolved output: [1, -2, 2, 0]
(slide the filter along the input; each output entry is the dot product of the filter with the current input window)
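The sliding dot product is a few lines of numpy. A minimal sketch (the helper name `conv1d` is ours; note CNNs use cross-correlation, i.e. the filter is not flipped):

```python
import numpy as np

def conv1d(x, w):
    """Slide filter w along x; each output is a dot product with the
    current window (cross-correlation, as in CNNs -- no flipping)."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

x = np.array([0, 1, -1, 1, 1])
w = np.array([-1, 1])
print(conv1d(x, w))  # [ 1 -2  2  0]
```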
convolution interpretation 1: template matching
the filter \([-1, 1]\) acts as a template; an output entry is large where the local input window matches it (here the rise from \(-1\) to \(1\) in the input gives the largest response, \(2\))
convolution interpretation 2: "look" locally through the filter (sparse-connected layer)
each output unit connects only to a small window of the input (input [0, 1, -1, 1, 1], filter [-1, 1], output [1, -2, 2, 0])
convolution interpretation 3: sparse-connected layer with parameter sharing
convolving the input with the filter \([-1, 1]\) is the same as one matrix-vector product in which every row reuses the same two weights:
\[
\begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \end{bmatrix}
\begin{bmatrix} 0 \\ 1 \\ -1 \\ 1 \\ 1 \end{bmatrix}
=
\begin{bmatrix} 1 \\ -2 \\ 2 \\ 0 \end{bmatrix}
\]
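This banded-matrix view can be checked directly. A small sketch (helper name `conv_matrix` is ours): build the matrix whose rows all share the filter weights and verify it reproduces the convolution.

```python
import numpy as np

def conv_matrix(w, n):
    """Banded (Toeplitz-style) matrix: every row holds the same filter
    weights w, shifted one position -- sparse connections, shared parameters."""
    m = n - len(w) + 1
    W = np.zeros((m, n))
    for i in range(m):
        W[i, i:i + len(w)] = w
    return W

x = np.array([0, 1, -1, 1, 1])
W = conv_matrix(np.array([-1, 1]), len(x))
print(W @ x)  # [ 1. -2.  2.  0.]
```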
convolution interpretation 4: translational equivariance
if the input pattern shifts (e.g. [0, 1, 0, 1, 1] vs. a shifted copy), the convolved output shifts by the same amount: the filter detects the pattern wherever it appears
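Equivariance is easy to demonstrate numerically. A sketch under our own example values (the `conv1d` helper is ours): shifting the input one step shifts the responses one step.

```python
import numpy as np

def conv1d(x, w):
    """Cross-correlation-style 1d convolution, as used in CNNs."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

w = np.array([-1, 1])
x = np.array([0, 1, -1, 1, 1, 0])
x_shift = np.roll(x, 1)      # shift the whole input one step right

print(conv1d(x, w))          # [ 1 -2  2  0 -1]
print(conv1d(x_shift, w))    # [ 0  1 -2  2  0] -- same responses, shifted
```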
example: 2-dimensional convolution (input, filter, convolved output) [image edited from vdumoulin]
stride of 2 (input, filter, output) [image edited from vdumoulin]
stride of 2, with padding of size 1 (input, filter, output) [image edited from vdumoulin]
quick summary: hyperparameters for 1d convolution
(e.g. stride of 2)
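The hyperparameters (filter size, stride, padding) determine the output length by a standard formula; a small sketch (function name is ours):

```python
def conv_output_len(n, f, stride=1, padding=0):
    """Output length of a 1d convolution: floor((n + 2p - f) / stride) + 1."""
    return (n + 2 * padding - f) // stride + 1

print(conv_output_len(5, 2))                       # 4 (matches the example)
print(conv_output_len(5, 2, stride=2))             # 2
print(conv_output_len(5, 2, stride=2, padding=1))  # 3
```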
[figure: a binary example image and a 2d filter]
these filter weights are what the CNN eventually learns
quick summary: hyperparameters for 2d convolution
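In 2d, the same output-size formula applies to height and width independently. A sketch with illustrative sizes (not from the slides):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Per-dimension output size: floor((n + 2p - f) / stride) + 1."""
    return (n + 2 * padding - f) // stride + 1

H, W, fh, fw = 5, 5, 3, 3   # hypothetical input and filter sizes
print(conv_output_size(H, fh), conv_output_size(W, fw))  # 3 3
print(conv_output_size(H, fh, stride=2, padding=1),
      conv_output_size(W, fw, stride=2, padding=1))      # 3 3
```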
[video credit Lena Voita]
quick summary: convolution interpretation
[figure: input convolved with multiple filters (filter 1, filter 2), giving the conv'd outputs]
A gentle intro to tensors:
[image credit: tensorflow]
red
green
blue
color images and channels
each channel encodes a holistic but independent perspective of the same image, similar to:
so channels are often referred to as feature maps
3d tensors from color channels
[figure: image tensor with axes: channels, height, width]
3d tensors from multiple filters
[figure: filter 1 and filter 2 each produce one output channel]
Why 3d tensors: 1. color input; 2. using multiple filters
Why we don't typically do 3d convolution
[figure: 2d convolution vs. 3d convolution over a (channels, height, width) tensor]
We don't typically do 3-dimensional convolution. Instead:
[figure: the input tensor convolved with multiple filters yields multiple output matrices; with \(k\) filters, stacking them gives an output tensor with \(k\) channels]
Every convolutional layer works with 3d tensors (1. color input; 2. the use of multiple filters) while doing 2d convolution.
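Concretely: each filter spans all input channels but slides only along height and width, producing one 2d response map; the maps stack into the output tensor. A minimal numpy sketch (function name and sizes are ours, loops kept explicit for clarity):

```python
import numpy as np

def conv_layer(x, filters):
    """x: (C, H, W) input tensor; filters: (k, C, fh, fw).
    Each filter spans ALL C input channels but slides only along H and W
    (2d convolution); the k response maps stack into a (k, H', W') tensor."""
    k, C, fh, fw = filters.shape
    _, H, W = x.shape
    out = np.zeros((k, H - fh + 1, W - fw + 1))
    for f in range(k):
        for i in range(H - fh + 1):
            for j in range(W - fw + 1):
                out[f, i, j] = np.sum(x[:, i:i + fh, j:j + fw] * filters[f])
    return out

x = np.random.default_rng(0).standard_normal((3, 5, 5))      # e.g. an RGB patch
filters = np.random.default_rng(1).standard_normal((4, 3, 2, 2))
print(conv_layer(x, filters).shape)  # (4, 4, 4)
```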
cat moves, detection moves
convolution helps detect pattern, but ...
convolution | max pooling
---|---
slide w. stride | slide w. stride
learnable filter weights | no learnable parameter
ReLU | (no activation)
detects pattern | summarizes strongest response
1d max pooling
2d max pooling
[image credit Philip Isola]
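A minimal sketch of 2d max pooling (the function name and example array are ours; window size and stride are the hyperparameters, and there is nothing to learn):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Slide a size-by-size window with the given stride and keep the max
    response in each window. No learnable parameters."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [0, 1, 5, 2],
              [1, 0, 2, 2]])
print(max_pool2d(x))  # [[4. 2.]
                      #  [1. 5.]]
```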
large response regardless of exact position of edge
Pooling across spatial locations achieves invariance w.r.t. small translations:
pooling is applied independently across all channels, so the channel dimension remains unchanged
[figure: (channels, height, width) tensor before and after pooling; height and width shrink, channels stay the same]
[image credit Philip Isola]
CNN renaissance
filter weights
fully-connected neuron weights
label
image
[all max pooling is via a 3-by-3 filter, stride of 2]
[image credit Philip Isola]
AlexNet '12
VGG '14
"Very Deep Convolutional Networks for Large-Scale Image Recognition", Simonyan & Zisserman, ICLR 2015
[image credit Philip Isola and Kaiming He]
VGG '14
Main developments:
VGG '14
[He et al: Deep Residual Learning for Image Recognition, CVPR 2016]
[image credit Philip Isola and Kaiming He]
ResNet '16
Main developments:
We'd love to hear your thoughts.
2-dimensional max pooling (example)
[figure: input tensor convolved with one filter gives a 2d output]