Shen Shen
October 25, 2024
Recap: a layer takes an input; each neuron computes a linear combo of its inputs with learnable weights, followed by activations; layers chain input → hidden → output.
[video edited from 3b1b]
Q: Why do we need a specialized network?
A: fully-connected nets don't scale well for vision tasks.
Even for a 426-by-426 grayscale image, using the same small 2-hidden-layer network, we'd need to learn ~3M parameters. For higher-resolution images (e.g. 1024-by-1024 already gives a 1-million-dimensional input), or more complex tasks, the number of parameters grows very fast.
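The ~3M figure can be reproduced with a quick parameter count, assuming (as in the 3b1b video this recap draws on) two hidden layers of 16 units and a 10-unit output; these layer sizes are an assumption, not stated on this slide:

```python
# Parameter count for a fully-connected net on a 426x426 grayscale image.
# Hidden/output sizes (16, 16, 10) are assumed from the 3b1b network.
sizes = [426 * 426, 16, 16, 10]
# each layer contributes (n_in * n_out) weights + n_out biases
params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(params)  # ~2.9 million
```

Almost all of the count comes from the first layer, which is exactly why fully-connected nets scale poorly with input resolution.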
Underfitting vs. appropriate model vs. overfitting.
Recall: more powerful models also tend to overfit.
[video edited from 3b1b]
CNNs cleverly exploit the fact that layering is compatible with the hierarchical structure of images, to handle images efficiently and sensibly.
Typical CNN structure for image classification:
A convolutional layer might sound foreign, but it's very similar to a fully-connected layer:

Layer | Forward pass (do) | Backward pass (learn)
---|---|---
fully-connected | dot-product | neuron weights
convolutional | convolution | filter (kernel) weights
example: 1-dimensional convolution
input: 0 | 1 | 0 | 1 | 1
filter: -1 | 1
convolved output: 1 | -1 | 1 | 0
(slide the filter along the input; each output entry is the dot product of the filter with the input window it covers)
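The sliding-window picture can be written out as a minimal numpy sketch (the helper name `conv1d` is ours, not from the slides); it reproduces the example's input, filter, and output:

```python
import numpy as np

def conv1d(x, w):
    """'Valid' 1-d convolution as used in CNNs (no filter flip, i.e.
    cross-correlation): slide the filter and take dot products."""
    n, k = len(x), len(w)
    return np.array([np.dot(x[i:i+k], w) for i in range(n - k + 1)])

x = np.array([0, 1, 0, 1, 1])   # input from the slide
w = np.array([-1, 1])           # filter
print(conv1d(x, w))             # [ 1 -1  1  0]
```

Note the output has length n - k + 1 = 4: one entry per position where the filter fully overlaps the input.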
example: 1-dimensional convolution
input: 0 | 1 | -1 | 1 | 1
filter: -1 | 1
convolved output: 1 | -2 | 2 | 0
convolution interpretation: template matching
(the output is largest, here 2, exactly where the input window matches the filter template -1, 1)
convolution interpretation: "look" locally
input: 0 | 1 | -1 | 1 | 1
filter: -1 | 1
convolved output: 1 | -2 | 2 | 0
(each output entry depends only on a small local window of the input, not on the whole input)
convolution interpretation: parameter sharing
convolving the input 0 | 1 | -1 | 1 | 1 with the filter -1 | 1 = dot product with the rows of the matrix

-1 | 1 | 0 | 0 | 0
0 | -1 | 1 | 0 | 0
0 | 0 | -1 | 1 | 0
0 | 0 | 0 | -1 | 1

giving the convolved output 1 | -2 | 2 | 0; the same two filter weights are shared across every row.
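The "dot product with a matrix" view can be sketched in numpy (the helper name `conv_matrix` is ours): build the banded matrix whose rows are shifted copies of the filter, and convolution becomes a matrix-vector product.

```python
import numpy as np

def conv_matrix(w, n):
    """Build the (n-k+1) x n matrix whose rows are shifted copies of the
    filter w, so that convolution equals a matrix-vector product."""
    k = len(w)
    M = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        M[i, i:i+k] = w   # same weights in every row: parameter sharing
    return M

x = np.array([0, 1, -1, 1, 1])
w = np.array([-1, 1])
M = conv_matrix(w, len(x))
print(M @ x)              # [ 1. -2.  2.  0.]
```

Compare with a fully-connected layer of the same shape, which would have 4 x 5 = 20 independent weights; here there are only 2.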
convolution interpretation: translational equivariance
input: 0 | 1 | 0 | 1 | 1
filter: -1 | 1
if the input pattern shifts, the convolved output shifts by the same amount (the same dot product is applied at every position).
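Translational equivariance can be checked numerically with a small sketch (the `conv1d` helper below is ours, and the input is padded with a zero so the shift stays in view): shifting the input by one position shifts the output by one position.

```python
import numpy as np

def conv1d(x, w):
    """'Valid' 1-d convolution (cross-correlation) by sliding dot products."""
    k = len(w)
    return np.array([np.dot(x[i:i+k], w) for i in range(len(x) - k + 1)])

w = np.array([-1, 1])
x = np.array([0, 1, -1, 1, 1, 0])
shifted = np.roll(x, 1)       # shift the input right by one position
y, y_shifted = conv1d(x, w), conv1d(shifted, w)
print(y)                      # [ 1 -2  2  0 -1]
print(y_shifted)              # [ 0  1 -2  2  0]  -- same values, shifted by one
```

Away from the boundary, `y_shifted[i] == y[i-1]`: the convolution commutes with translation.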
hyperparameters
(e.g. stride of 2)
[figure: a binary image, shown as a grid of 0s and 1s, convolved with a small filter with weights -1 and 1]
these weights are what the CNN eventually learns
2-dimensional convolution (input, filter, output) [image edited from vdumoulin]
stride of 2 (input, filter, output) [image edited from vdumoulin]
stride of 2, with padding of size 1 (input, filter, output) [image edited from vdumoulin]
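A minimal numpy sketch of 2-d convolution with stride and zero-padding (the helper name `conv2d` is ours); the output shapes match the vdumoulin-style pictures for a 5x5 input and 3x3 filter:

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """'Valid' 2-d convolution (cross-correlation) with stride and
    zero-padding: slide the filter spatially, summing elementwise products."""
    x = np.pad(x, pad)               # zero-pad the border by `pad` pixels
    H, W = x.shape
    k = w.shape[0]
    return np.array([[np.sum(x[i:i+k, j:j+k] * w)
                      for j in range(0, W - k + 1, stride)]
                     for i in range(0, H - k + 1, stride)])

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3))
print(conv2d(x, w).shape)                   # (3, 3)
print(conv2d(x, w, stride=2).shape)         # (2, 2)
print(conv2d(x, w, stride=2, pad=1).shape)  # (3, 3)
```

Stride shrinks the output; padding grows it back, which is why stride 2 with padding 1 recovers a 3x3 output here.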
convolution interpretation: [video credit Lena Voita]
A tender intro to tensors: a color image is a 3d tensor, with dimensions image height, image width, and image depth (channels: red, green, blue). [image credit: tensorflow]

We'd encounter 3d tensors due to:
1. color input (the channel dimension of the image)
2. the use of multiple filters (filter 1, filter 2, ... each producing its own output channel)
But we don't typically do 3-dimensional convolution. Instead, in doing 2-dimensional convolution:
1. one filter is itself a 3d tensor: it spans all channels of the input tensor, but slides only spatially (over height and width), producing a 2d output
2. multiple filters produce multiple output matrices: \(k\) filters stack into an output tensor with \(k\) channels
This handles both sources of 3d tensors (color input and the use of multiple filters).
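Putting the two points together, a conv layer on a multi-channel input can be sketched in numpy (the helper name `conv_layer` and the tensor shapes are ours): each filter spans all input channels but slides only spatially, and the k per-filter outputs stack into a k-channel output tensor.

```python
import numpy as np

def conv_layer(x, filters):
    """x: (C, H, W) input tensor; filters: (k, C, f, f).
    Each filter covers all C channels and slides only over height/width;
    returns a (k, H-f+1, W-f+1) output tensor."""
    k, C, f, _ = filters.shape
    _, H, W = x.shape
    out = np.zeros((k, H - f + 1, W - f + 1))
    for m in range(k):                      # one output channel per filter
        for i in range(H - f + 1):
            for j in range(W - f + 1):
                # sum over ALL input channels and the f x f spatial patch
                out[m, i, j] = np.sum(x[:, i:i+f, j:j+f] * filters[m])
    return out

x = np.random.randn(3, 8, 8)                # RGB input: 3 channels
filters = np.random.randn(5, 3, 3, 3)       # 5 filters, each 3x3 across 3 channels
print(conv_layer(x, filters).shape)         # (5, 6, 6)
```

So the "2-dimensional" in 2-d convolution refers to how the filter slides, not to the dimensionality of the tensors involved.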
Layer | how it moves | learnable parameters
---|---|---
convolution | slide w. stride | filter weights are the learnable parameters
max pooling | slide w. stride | no learnable parameter

ReLU (applied elementwise) likewise has no learnable parameters.
2-dimensional max pooling (example) [image edited from vdumoulin] [gif adapted from demo source]
Pooling across spatial locations achieves invariance w.r.t. small translations: a large response regardless of the exact position of the edge. [image credit Philip Isola]
Pooling acts over height and width within each channel, so the channel dimension remains unchanged after pooling.
Both the filter weights and the fully-connected neuron weights are learned by backpropagating from the loss \(\mathcal{L}\) on the prediction. [image credit Philip Isola]
Main developments:
AlexNet '12
VGG '14 ["Very Deep Convolutional Networks for Large-Scale Image Recognition", Simonyan & Zisserman, ICLR 2015]
ResNet '16 [He et al., "Deep Residual Learning for Image Recognition", CVPR 2016]
[image credit Philip Isola and Kaiming He]
We'd love to hear your thoughts.