a (fully-connected, feed-forward) neural network
[diagram: input layer → hidden layers → output layer; each neuron computes a linear combo of the previous layer's activations using learnable weights]
Recap
\(\dots\)
Forward pass: evaluate, given the current parameters, each layer's linear combination and (nonlinear) activation, then the loss function.
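As a concrete sketch of the forward pass (the layer sizes and the squared-error loss below are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

x = rng.standard_normal(4)          # input
W1 = rng.standard_normal((3, 4))    # first-layer learnable weights
b1 = np.zeros(3)
W2 = rng.standard_normal((2, 3))    # second-layer learnable weights
b2 = np.zeros(2)

z1 = W1 @ x + b1                    # linear combination
a1 = relu(z1)                       # (nonlinear) activation
g = W2 @ a1 + b2                    # network output

y = np.array([1.0, 0.0])
loss = 0.5 * np.sum((g - y) ** 2)   # loss function (squared error, for illustration)
```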
Recap
\(\dots\)
Backward pass: run SGD to update all parameters; e.g. to update \(W^2\), compute \(\nabla_{W^2} \mathcal{L}(g^{(i)},y^{(i)})\)
Recap
\(\dots\)
backpropagation: reuse of computation
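The reuse of computation can be made concrete: the backward pass recycles the intermediate values cached during the forward pass. A minimal sketch (layer sizes, loss, and learning rate are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)
y = np.array([1.0, 0.0])

# forward pass: cache intermediates (z1, a1) for reuse
z1 = W1 @ x + b1
a1 = relu(z1)
g = W2 @ a1 + b2

# backward pass: gradients reuse the cached forward values
dg = g - y                     # dL/dg for squared-error loss
grad_W2 = np.outer(dg, a1)     # dL/dW2 reuses a1
da1 = W2.T @ dg                # reused below, instead of recomputing
dz1 = da1 * (z1 > 0)           # ReLU gate reuses z1
grad_W1 = np.outer(dz1, x)

# one SGD step on W2 (learning rate is an arbitrary choice)
lr = 0.1
W2 -= lr * grad_W2
```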
Recap
[video edited from 3b1b]
Why do we need a specialized network (hypothesis class)?
For higher-resolution images, or more complex tasks, or larger networks, the number of parameters can grow very fast.
426-by-426 grayscale image
Using the same 2-hidden-layer network to predict which top-10 engineering school's seal this image shows, we would need to learn ~3M parameters.
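A rough check of the ~3M figure (the hidden-layer widths below are assumptions chosen for illustration, not values from the slide):

```python
# Back-of-the-envelope parameter count for a fully-connected net
# on a 426-by-426 grayscale image.
pixels = 426 * 426                  # 181,476 input values
h1, h2, classes = 16, 16, 10        # assumed hidden widths and class count

params = pixels * h1 + h1 * h2 + h2 * classes   # ignoring biases
print(pixels)   # 181476
print(params)   # 2904032, i.e. ~2.9M
```

Even a modest first hidden layer dominates the count, since every hidden neuron connects to every pixel.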
Underfitting
Appropriate
Overfitting
Recall: models with needlessly many parameters tend to overfit.
If we know the data is generated by the green curve, it's easy to choose the appropriate quadratic hypothesis class.
so... do we know anything about vision problems?
Why do we humans think
is a 9?
Why do we think any of
is a 9?
[video edited from 3b1b]
Layered structures are well-suited to model this hierarchical processing.
CNN exploits
to handle images efficiently and effectively.
via
CNN
the same feedforward net as before
typical CNN architecture for image classification
CNN
typical CNN structure for image classification
A convolutional layer might sound foreign, but it's very similar to a fully-connected layer.
Convolution result:
| Layer | Forward pass, do | Backward pass, learn | Design choices |
|---|---|---|---|
| fully-connected | dot-product, activation | neuron weights | neuron count, etc. |
| convolutional | convolution, activation | filter weights | conv specs, etc. |
example: 1-dimensional convolution
input: [0, 1, 0, 1, 1], filter: [-1, 1]
slide the filter across the input, taking a dot product at each position:
convolved output: [1, -1, 1, 0]
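The 1-d example can be verified in code (note that CNN "convolution" is cross-correlation: the filter is slid without flipping):

```python
import numpy as np

x = np.array([0, 1, 0, 1, 1])   # input
w = np.array([-1, 1])           # filter

# dot product of the filter with each length-2 window of the input
out = np.array([x[i:i + len(w)] @ w for i in range(len(x) - len(w) + 1)])
print(out)  # [ 1 -1  1  0]
```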
convolution interpretation 1: template matching
input: [0, 1, -1, 1, 1], filter: [-1, 1] → convolved output: [1, -2, 2, 0]
the output is largest where the local input patch matches the filter's template (here, at the patch [-1, 1])
convolution interpretation 2: "look" locally through the filter
input: [0, 1, -1, 1, 1], filter: [-1, 1] applied at each of the 4 positions → convolved output: [1, -2, 2, 0]
the local region each output "looks" at = its receptive field
convolution interpretation 3: sparse-connected layer with parameter sharing
convolving the input with the filter is the same as multiplying by a sparse weight matrix whose rows all share the same filter weights:
\[
\begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \end{bmatrix}
\begin{bmatrix} 0 \\ 1 \\ -1 \\ 1 \\ 1 \end{bmatrix}
=
\begin{bmatrix} 1 \\ -2 \\ 2 \\ 0 \end{bmatrix}
\]
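This interpretation can be checked directly: building the banded weight matrix for the filter [-1, 1] and multiplying reproduces the convolution of the input [0, 1, -1, 1, 1]:

```python
import numpy as np

x = np.array([0, 1, -1, 1, 1])
w = np.array([-1, 1])

# 4x5 sparse weight matrix: every row reuses the same two filter weights
W = np.zeros((4, 5))
for i in range(4):
    W[i, i:i + 2] = w

conv = np.array([x[i:i + 2] @ w for i in range(4)])
assert np.array_equal(W @ x, conv)   # matrix form == sliding-window form
print(W @ x)  # [ 1. -2.  2.  0.]
```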
convolution interpretation 4: translational equivariance
shifting the input pattern shifts the convolved output by the same amount: the filter detects the pattern wherever it occurs
[worked example on slide: the input [0, 1, 0, 1, 1] and a shifted copy, each convolved with the same filter]
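Equivariance is easy to verify numerically (the specific padded inputs below are illustrative assumptions):

```python
import numpy as np

def conv1d(x, w):
    return np.array([x[i:i + len(w)] @ w for i in range(len(x) - len(w) + 1)])

w = np.array([-1, 1])
x = np.array([0, 1, -1, 1, 1, 0])          # pattern near the left
x_shift = np.array([0, 0, 1, -1, 1, 1])    # same pattern, shifted right by 1

out = conv1d(x, w)
out_shift = conv1d(x_shift, w)
# the responses shift exactly with the input
assert np.array_equal(out[:-1], out_shift[1:])
```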
input
filter
[image edited from vdumoulin]
convolved output
example: 2-dimensional convolution
[image edited from vdumoulin]
stride of 2
input
filter
output
[image edited from vdumoulin]
stride of 2, with padding of size 1
input
filter
output
[image edited from vdumoulin]
quick summary: hyperparameters for 1d convolution
(e.g. stride of 2)
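These hyperparameters determine the output length via the standard formula \(\lfloor (n + 2p - f)/s \rfloor + 1\), where \(n\) is the input length, \(f\) the filter size, \(s\) the stride, and \(p\) the padding per side:

```python
# Output length of a 1d convolution as a function of its hyperparameters.
def conv_output_len(n, f, s=1, p=0):
    return (n + 2 * p - f) // s + 1

print(conv_output_len(5, 2))            # 4: the running example (stride 1, no padding)
print(conv_output_len(5, 2, s=2))       # 2: stride of 2
print(conv_output_len(5, 2, s=2, p=1))  # 3: stride of 2, padding of size 1
```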
[worked example on slide: a binary image (a 2d grid of 0s and 1s) convolved with a small filter with weights such as -1 and 1; these weights are what CNNs eventually learn]
quick summary: hyperparameters for 2d convolution
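The same hyperparameters appear in 2d; a minimal 2d convolution with stride (the 3×3 input and 2×2 filter are made-up values for illustration):

```python
import numpy as np

def conv2d(x, w, stride=1):
    """2d cross-correlation, the 'convolution' used in CNNs."""
    fh, fw = w.shape
    oh = (x.shape[0] - fh) // stride + 1
    ow = (x.shape[1] - fw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + fh, j * stride:j * stride + fw]
            out[i, j] = np.sum(patch * w)   # dot product with the local patch
    return out

x = np.array([[0, 1, 0],
              [1, 1, 0],
              [0, 1, 1]])
w = np.array([[-1, 1],
              [1, -1]])
print(conv2d(x, w))  # [[ 1.  0.] [-1. -1.]]
```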
[video credit Lena Voita]
quick summary: convolution interpretation
hand-designed filters (e.g. Sobel)
filter 1
filter 2
learned filters detect many patterns
input
filters
conv'd output
A gentle intro to tensors:
[image credit: tensorflow]
red
green
blue
color images and channels
each channel is a complete but independent view of the same scene
like when we think of weather:
so channels are often referred to as feature maps
3d tensors from color channels
[diagram: image channels × image height × image width]
3d tensors from multiple filters
[diagram: filter 1 and filter 2 each produce one channel of the output]
where do channels come from?
1. color input
2. multiple filters → multiple channels
[diagram: a channels × height × width tensor]
Why we don't typically do 3d convolution
slide along →
slide along ↓
slide along ↗
Convolution shares weights across shifted positions
shifting makes sense spatially (a cat can be anywhere)
but not across channels (red ≠ shifted green)
3D conv is used when the third axis is spatial/temporal (MRI, video)
full-depth 2D convolution
[diagram: a filter spanning all channels slides spatially (over height and width) across the input tensor, producing a single 2d output]
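A minimal sketch of full-depth 2d convolution (the tensor sizes are assumptions for illustration): the filter covers all input channels, slides only spatially, and yields one 2d output matrix.

```python
import numpy as np

def conv2d_full_depth(x, w):
    """x: (C, H, W) input tensor; w: (C, fh, fw) full-depth filter."""
    c, h, wd = x.shape
    _, fh, fw = w.shape
    out = np.zeros((h - fh + 1, wd - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the filter spans every channel of the local patch
            out[i, j] = np.sum(x[:, i:i + fh, j:j + fw] * w)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))    # 3 channels (e.g. RGB)
w = rng.standard_normal((3, 2, 2))    # one full-depth filter
print(conv2d_full_depth(x, w).shape)  # (4, 4): a single 2d output matrix
```

Stacking \(k\) such filters then gives an output tensor with \(k\) channels.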
[diagram: the input tensor convolved with multiple filters yields multiple output matrices]
[diagram: the input tensor convolved with \(k\) filters yields an output tensor with \(k\) channels]
Every convolutional layer works with 3d tensors in doing 2d convolution:
1. color input
2. the use of multiple filters
cat moves, detection moves
[diagram: as the cat shifts across positions in the input grid, the detection (✅) shifts correspondingly in the output grid]
convolution helps detect patterns, but ...
| | convolution | max pooling |
|---|---|---|
| role | detects pattern | summarizes strongest response |
| mechanics | slide w. stride | slide w. stride |
| parameters | learnable filter weights | no learnable parameter |

(a ReLU activation typically sits between the convolution and the pooling)
1d max pooling
2d max pooling
[image credit Philip Isola]
large response regardless of exact position of edge
Pooling across spatial locations achieves invariance w.r.t. small translations:
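A minimal 2d max pooling sketch (the 4×4 input and the 2×2 window with stride 2 are illustrative choices):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Slide a window with stride; keep the strongest response. No learnable parameters."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1, 0, 2, 1],
              [0, 3, 0, 0],
              [1, 1, 0, 4],
              [0, 0, 2, 1]])
print(max_pool2d(x))  # [[3. 2.] [1. 4.]]
```

Each pooled value reports "did a strong response occur in this window?", not exactly where, which is what gives the invariance to small translations.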
pooling is applied independently across all channels, so the channel dimension remains unchanged
[diagram: a channels × height × width input → pooled output with the same number of channels but smaller height and width]
[image credit Philip Isola]
CNN renaissance
filter weights
fully-connected neuron weights
label
image
[all max pooling are 3×3 filter, stride 2; pooled outputs not explicitly shown on diagram: 27×27×96, 13×13×256, 6×6×256]
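The quoted pooled sizes follow from the pooling output formula, assuming the standard AlexNet pre-pooling sizes (55×55, 27×27, 13×13):

```python
# All max-pooling layers in AlexNet use a 3x3 window with stride 2.
def pooled(n, size=3, stride=2):
    return (n - size) // stride + 1

print(pooled(55))  # 27  (55x55x96  -> 27x27x96)
print(pooled(27))  # 13  (27x27x256 -> 13x13x256)
print(pooled(13))  # 6   (13x13x256 -> 6x6x256)
```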
[image credit Philip Isola]
AlexNet '12
VGG '14
“Very Deep Convolutional Networks for Large-Scale Image Recognition”, Simonyan & Zisserman. ICLR 2015
[image credit Philip Isola and Kaiming He]
VGG '14
Main developments:
VGG '14
[He et al: Deep Residual Learning for Image Recognition, CVPR 2016]
[image credit Philip Isola and Kaiming He]
ResNet '16
Main developments:
Even though NNs are universal approximators, matching the architecture to problem structure — visual hierarchy, locality, translational invariance — improves generalization and efficiency.
Convolution slides a small learned filter across the input, detecting local patterns with shared weights — sparse and efficient.
Max pooling summarizes spatial information: "did a pattern occur?" rather than "where exactly?"
Filter weights are learned end-to-end; convolutional layers extract features, fully connected layers classify.
[video edited from 3b1b]
\(\dots\)
Previously, we found \(\dots\)
Now, suppose we sampled a particular \((x,y)\): how to find \(\dots\)?
2-dimensional max pooling (example)
[diagram: input tensor convolved with one filter → 2d output]