
Lecture 7: Convolutional Neural Networks
Intro to Machine Learning

Recap: A (fully-connected, feed-forward) neural network
[diagram: input layer, hidden layers, output layer; each neuron computes a linear combination of the previous layer's activations, using learnable weights, followed by an activation]
Forward pass: evaluate, given the current parameters,
- the model output \(g^{(i)}\)
- the loss incurred on the current data \(\mathcal{L}(g^{(i)}, y^{(i)})\)
- the training error \(J = \frac{1}{n} \sum_{i=1}^{n}\mathcal{L}(g^{(i)}, y^{(i)})\)
[diagram labels: linear combination, (nonlinear) activation, loss function]
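A minimal numeric sketch of the forward pass on one data point (the toy sizes, weights, and squared-error loss are assumptions for illustration, not the lecture's actual network):

```python
import numpy as np

# Toy network: 2 inputs, 2 hidden units, 1 output (made-up weights).
relu = lambda z: np.maximum(0, z)         # (nonlinear) activation
x = np.array([1.0, 2.0])                  # one data point, with label y = 1
W1 = np.array([[0.1, -0.2], [0.4, 0.3]])
W2 = np.array([[0.5], [-0.5]])

h = relu(W1.T @ x)       # hidden layer: linear combination, then activation
g = W2.T @ h             # model output g
loss = (g - 1.0) ** 2    # loss L(g, y), here squared error
print(g, loss)           # [0.25] [0.5625]
```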
Backward pass: run SGD to learn all parameters. E.g., to update \(W^2\):
- Randomly pick a data point \((x, y)\)
- Evaluate the gradient \(\nabla_{W^2} \mathcal{L}(g,y)\)
- Update the weights \(W^2 \leftarrow W^2 - \eta \nabla_{W^2} \mathcal{L}(g,y)\)

Suppose we sampled a particular \((x,y)\); how do we find \(\nabla_{W^2} \mathcal{L}(g,y)\)?
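A worked instance of the update rule (the numbers are made-up placeholders, not from the lecture):

```python
import numpy as np

eta = 0.1                          # learning rate
W2 = np.array([[0.5], [-0.5]])     # current weights
grad = np.array([[0.2], [0.1]])    # stand-in for grad of L(g, y) wrt W2
W2 = W2 - eta * grad               # SGD update: W2 <- W2 - eta * grad
print(W2)                          # [[ 0.48] [-0.51]]
```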
Backpropagation: reuse of (shared) computation
Convolutional neural networks
- Why do we need a special network for images?
- Why is CNN (the) special network for images?

Outline
- Vision problem structure
- Convolution
- 1-dimensional and 2-dimensional convolution
- 3-dimensional tensors
- Max pooling
- (Case studies)
[video edited from 3b1b]
Why do we need a specialized network (hypothesis class)?

For higher-resolution images, or more complex tasks, or larger networks, the number of parameters can grow very fast.
- Partly, fully-connected nets don't scale well for vision tasks
- More importantly, a carefully chosen hypothesis class helps fight overfitting
426-by-426 grayscale image: using the same 2-hidden-layer network to predict which top-10 engineering school seal this image shows would require learning ~3M parameters.
[figure: underfitting vs. appropriate fit vs. overfitting]
Recall that models with needless parameters tend to overfit. If we know the data is generated by the green curve, it's easy to choose the appropriate quadratic hypothesis class.
so... do we know anything about vision problems?

Why do we humans think [this image] is a 9?
Why do we think any of [these images] is a 9?
[video edited from 3b1b]


- Visual hierarchy
Layered structures are well-suited to model this hierarchical processing.




- Visual hierarchy
- Spatial locality
- Translational invariance



CNN exploits
- visual hierarchy
- spatial locality
- translational invariance
to handle images efficiently and effectively, via
- layered structure
- convolution
- pooling
[diagram: typical CNN architecture for image classification (the same feedforward net as before)]

Outline
- Vision problem structure
- Convolution
- 1-dimensional and 2-dimensional convolution
- 3-dimensional tensors
- Max pooling
- (Case studies)
A convolutional layer might sound foreign, but it's very similar to a fully-connected layer.

| Layer | Forward pass, do | Backward pass, learn | Design choices |
|---|---|---|---|
| fully-connected | dot-product, activation | neuron weights | neuron count, etc. |
| convolutional | convolution, activation | filter weights | conv specs, etc. |
example: 1-dimensional convolution
[animation: the filter (-1, 1) slides along the input (0, 1, 0, 1, 1), producing the convolved output (1, -1, 1, 0) one entry at a time]

convolution interpretation 1: template matching
[figure: the input (0, 1, -1, 1, 1) convolved with the filter (-1, 1) gives output (1, -2, 2, 0); the largest response appears where the input best matches the filter's pattern]
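To make the sliding dot product concrete, here is a minimal numpy sketch that reproduces the slides' numbers (the helper `conv1d` is ours, not a library call):

```python
import numpy as np

def conv1d(x, w):
    # Slide filter w across input x (stride 1, no padding), taking a dot
    # product at each position -- the "convolution" used in CNNs
    # (technically cross-correlation, since w is not flipped).
    n_out = len(x) - len(w) + 1
    return np.array([np.dot(w, x[i:i + len(w)]) for i in range(n_out)])

print(conv1d(np.array([0, 1, 0, 1, 1]), np.array([-1, 1])))   # [ 1 -1  1  0]
print(conv1d(np.array([0, 1, -1, 1, 1]), np.array([-1, 1])))  # [ 1 -2  2  0]
```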
convolution interpretation 2: "look" locally through the filter (sparse-connected layer)
[figure: input (0, 1, -1, 1, 1), filter (-1, 1), convolved output (1, -2, 2, 0); each output entry depends only on a small local window of the input]
convolution interpretation 3: sparse-connected layer with parameter sharing
[figure: convolving the input (0, 1, -1, 1, 1) with the filter (-1, 1) equals a dot product with a sparse weight matrix whose rows are shifted copies of the filter]
\[
\begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \end{bmatrix}
\begin{bmatrix} 0 \\ 1 \\ -1 \\ 1 \\ 1 \end{bmatrix}
=
\begin{bmatrix} 1 \\ -2 \\ 2 \\ 0 \end{bmatrix}
\]
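A quick check of this matrix view, under the same assumptions as the earlier sketch (the construction below is ours, for illustration):

```python
import numpy as np

x = np.array([0, 1, -1, 1, 1])
w = np.array([-1, 1])

# Build the sparse weight matrix whose rows are shifted copies of w;
# every row shares the same two parameters.
M = np.zeros((len(x) - len(w) + 1, len(x)))
for i in range(M.shape[0]):
    M[i, i:i + len(w)] = w

print(M @ x)  # [ 1. -2.  2.  0.] -- identical to sliding the filter
```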

convolution interpretation 4: translational equivariance
[figure: convolving the input (0, 1, 0, 1, 1) and a shifted copy of it with the same filter produces correspondingly shifted outputs: shift the input, and the convolved output shifts with it]
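A small demonstration of equivariance (reusing the `conv1d` sketch above; the extra trailing zero in the input is an assumption so the shift stays in-bounds):

```python
import numpy as np

def conv1d(x, w):
    return np.array([np.dot(w, x[i:i + len(w)])
                     for i in range(len(x) - len(w) + 1)])

x = np.array([0, 1, 0, 1, 1, 0])
w = np.array([-1, 1])
print(conv1d(x, w))              # [ 1 -1  1  0 -1]
print(conv1d(np.roll(x, 1), w))  # [ 0  1 -1  1  0] -- same pattern, shifted
```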
example: 2-dimensional convolution
[animation: a 2-d filter slides over the 2-d input to produce the convolved output; image edited from vdumoulin]
[animation: 2-d convolution with a stride of 2; image edited from vdumoulin]
[animation: 2-d convolution with a stride of 2 and zero-padding of size 1; image edited from vdumoulin]
quick summary: hyperparameters for 1d convolution
- Filter size (e.g. we saw these two in 1-d)
- Stride (e.g. stride of 2)
- Zero-padding
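These three choices determine the output length; for input length \(n\), filter size \(f\), padding \(p\), and stride \(s\), the standard relationship (not stated explicitly on the slide) is
\[
\text{output length} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1.
\]
For example, the unpadded stride-1 case above gives \(5 - 2 + 1 = 4\) outputs.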
[figure: a binary image, shown as a grid of 0s and 1s, convolved with a small filter such as (-1, 1); these filter weights are what the CNN eventually learns]
quick summary: hyperparameters for 2d convolution
[video credit Lena Voita]
quick summary: convolution interpretations
- Template matching
- Look locally (sparse connections)
- Parameter sharing
- Translational equivariance
[figure: applying two filters (filter 1, filter 2) to the same input yields two convolved outputs]

Outline
- Vision problem structure
- Convolution
- 1-dimensional and 2-dimensional convolution
- 3-dimensional tensors
- Max pooling
- (Case studies)

A gentle intro to tensors:



[image credit: TensorFlow]





color images and channels
[figure: a color image separated into its red, green, and blue channels]
Each channel encodes a holistic but independent perspective of the same image, so channels are often referred to as feature maps.




3d tensors from color channels
[figure: a color image as a 3d tensor with dimensions image height × image width × image channels]
3d tensors from multiple filters
[figure: each filter (filter 1, filter 2) produces one output channel; stacking these outputs gives a 3d tensor]
Why 3d tensors:
1. color input (height × width × channels)
2. using multiple filters (each filter contributes one output channel)




Why we don't typically do 3d convolution
[figure: 2d convolution vs. 3d convolution on a height × width × channels tensor]

We don't typically do 3-dimensional convolution. Instead:
- 3d tensor input, with \(d\) channels
- 3d tensor filter, with the same \(d\) channels
- the filter spans all channels but slides only spatially, so each filter does a 2d convolution and yields a 2d output
[figure: an input tensor convolved with multiple filters yields multiple output matrices; stacking the \(k\) filters' outputs gives the output tensor]

Every convolutional layer works with 3d tensors (while doing 2d convolution):
1. color input
2. the use of multiple filters
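A shape check of this convention (a sketch assuming PyTorch; the sizes are arbitrary): a layer with \(d = 3\) input channels and \(k = 5\) filters maps a 3d tensor to a 3d tensor, each filter contributing one 2d output map.

```python
import torch
import torch.nn as nn

# Each of the 5 filters spans all 3 input channels but slides only
# spatially, so each yields one 2d map; the 5 maps stack into the output.
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3)

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
print(conv(x).shape)            # torch.Size([1, 5, 30, 30])
```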

Outline
- Vision problem structure
- Convolution
- 1-dimensional and 2-dimensional convolution
- 3-dimensional tensors
- Max pooling
- (Case studies)


convolution helps detect patterns, but ...
[figure: when the cat moves within the image, the detection moves with it ("cat moves, detection moves")]

| | convolution | max pooling |
|---|---|---|
| forward pass | slide w. stride | slide w. stride |
| parameters | learnable filter weights | no learnable parameter |
| role | detects pattern (followed by ReLU) | summarizes strongest response |

[figures: 1d max pooling and 2d max pooling]
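A minimal 1d max pooling sketch (our helper, mirroring the `conv1d` sketch above; window size and stride are example choices):

```python
import numpy as np

def maxpool1d(x, size=2, stride=2):
    # Slide a window over x, keeping the largest value in each window;
    # unlike convolution, there are no learnable parameters.
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

print(maxpool1d(np.array([1, 3, 2, 0, 4, 4])))  # [3 2 4]
```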








Pooling across spatial locations achieves invariance w.r.t. small translations:
[figure: large response regardless of exact position of the edge; image credit Philip Isola]

Pooling is applied independently across all channels, so the channel dimension remains unchanged.
[figure: pooling shrinks height and width but keeps the number of channels]
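A shape check (again assuming PyTorch; sizes are arbitrary):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 8, 32, 32)   # (batch, channels, height, width)
print(pool(x).shape)            # torch.Size([1, 8, 16, 16]); channels unchanged
```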
Outline
- Vision problem structure
- Convolution
- 1-dimensional and 2-dimensional convolution
- 3-dimensional tensors
- Max pooling
- (Case studies)
[image credit Philip Isola]

CNN renaissance

AlexNet '12
[architecture diagram: image in, label out; convolutional layers carry filter weights, fully-connected layers carry neuron weights; all max pooling is via a 3-by-3 filter with a stride of 2; image credit Philip Isola]
VGG '14
"Very Deep Convolutional Networks for Large-Scale Image Recognition", Simonyan & Zisserman. ICLR 2015
[image credit Philip Isola and Kaiming He]

VGG '14
Main developments:
- small convolutional filters: only 3x3
- increased depth: about 16 or 19 layers
- stack the same modules
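A sketch of the "stack the same modules" idea (assuming PyTorch; this simplified block is ours, not the paper's exact configuration):

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs=2):
    # A few 3x3 convs (padding 1 preserves spatial size), then halve
    # the spatial dimensions with max pooling.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Stacking identical modules yields a VGG-style deep network.
net = nn.Sequential(vgg_block(3, 64), vgg_block(64, 128))
```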
[He et al: Deep Residual Learning for Image Recognition, CVPR 2016]
[image credit Philip Isola and Kaiming He]



ResNet '16
Main developments:
- residual block: gradients can propagate faster (via the identity mapping)
- increased depth: > 100 layers
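A minimal sketch of the residual idea (assuming PyTorch; this shows the shape of the trick, not the paper's exact block): the block computes \(F(x) + x\), so the identity path gives gradients a direct route backward.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(  # F(x): two convs that preserve the shape
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)  # add the identity shortcut

x = torch.randn(1, 8, 16, 16)
print(ResidualBlock(8)(x).shape)  # torch.Size([1, 8, 16, 16])
```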
Summary
- Though NNs are technically "universal approximators", designing the NN structure so that it matches what we know about the underlying structure of the problem can substantially improve generalization ability and computational efficiency.
- Images are a very important input type, and they have important properties that we can take advantage of: visual hierarchy, translational invariance, spatial locality.
- Convolution is an important image-processing technique that builds on these ideas. It can be interpreted as a locally connected network with weight sharing.
- Pooling layers help aggregate local information effectively, achieving a bigger receptive field.
- We can train the parameters of a convolutional filtering function using backprop, and combine convolutional filtering operations with other neural-network layers.
Thanks!
We'd love to hear your thoughts.
2-dimensional max pooling (example)
- can choose filter size
- typically choose to have no padding
- typically a stride > 1
- reduces spatial dimension
IntroML (Fall25), Lecture 7: Convolutional Neural Networks
By Shen Shen