Shen Shen
March 31, 2025
2:30pm, Room 32-144
Shen Shen
Wed 4-5pm & Fri 2-3pm
24-328
Amit Schechter
Thursdays 1-3pm
45-324
Derek Lim
Tuesdays 3-5pm
45-324
Semester at a glance
Generative AI
Reinforcement Learning
Integration
layer
linear combo
activations
Recap:
layer
input
neuron
learnable weights
hidden
output
\(-3(\sigma_1 +\sigma_2)\)
recall this example
\(f =\sigma(\cdot)\)
\(f(\cdot) \) identity function
\(-3(\sigma_1 +\sigma_2)\)
compositions of ReLU(s) can be quite expressive
in fact, asymptotically, they can approximate any continuous function!
(image credit: Phillip Isola)
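A minimal numpy sketch of the expressivity claim: three ReLUs combine into a triangular "bump," and sums of shifted, scaled bumps can approximate any continuous function on an interval. The function names and coefficients here are illustrative, not from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# A triangular "bump" built from three ReLUs: zero outside [0, 2],
# rising linearly to 1 at x = 1, then falling back to 0. Summing
# shifted, scaled copies of such bumps is the intuition behind the
# universal-approximation claim on the slide.
def bump(x):
    return relu(x) - 2 * relu(x - 1) + relu(x - 2)
```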
\(\dots\)
Forward pass: evaluate the network, given the current parameters
linear combination
loss function
(nonlinear) activation
\(\dots\)
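The three ingredients on the slide (linear combination, nonlinear activation, loss) can be sketched for one hidden layer and a single example; sizes, weights, and the target below are made-up stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass through a tiny 2-layer net on one example
# (shapes and values are illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden-layer weights
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output-layer weights
y = 1.0                                         # target label

z1 = W1 @ x + b1          # linear combination
a1 = sigmoid(z1)          # (nonlinear) activation
z2 = W2 @ a1 + b2         # linear combination
g = sigmoid(z2)[0]        # prediction in (0, 1)
loss = -(y * np.log(g) + (1 - y) * np.log(1 - g))   # log loss
```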
Backpropagation: reuse of computation
\(\dots\)
Backpropagation: reuse of computation
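The "reuse of computation" point can be made concrete: the gradient signal computed for the output layer is reused, unchanged, when propagating back to the first layer. A self-contained numpy sketch (random weights, sigmoid units, log loss; all values illustrative), checked against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
y = 1.0

# forward pass
z1 = W1 @ x; a1 = sigmoid(z1)
z2 = W2 @ a1; g = sigmoid(z2)[0]

# backward pass: dL/dz2 is computed once...
dL_dz2 = g - y
grad_W2 = dL_dz2 * a1[None, :]      # ...used for W2's gradient...
dL_da1 = dL_dz2 * W2[0]             # ...and reused going further back
dL_dz1 = dL_da1 * a1 * (1 - a1)     # sigmoid derivative
grad_W1 = np.outer(dL_dz1, x)

# sanity check: finite-difference estimate of dL/dW1[0,0]
eps = 1e-6
def loss_with(w00):
    W1p = W1.copy(); W1p[0, 0] = w00
    a1p = sigmoid(W1p @ x); gp = sigmoid(W2 @ a1p)[0]
    return -(y * np.log(gp) + (1 - y) * np.log(1 - gp))
fd = (loss_with(W1[0, 0] + eps) - loss_with(W1[0, 0] - eps)) / (2 * eps)
```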
Two different ways to visualize a function
Representation transformations for a variety of neural net operations
and stack of neural net operations
wiring graph
equation
mapping 1D
mapping 2D
Training data
maps from complex data space to simple embedding space
Neural networks are representation learners
Deep nets transform datapoints, layer by layer
Each layer gives a different representation (aka embedding) of the data
\(f: X \rightarrow Y\)
"Good"
Representation
🧠
humans also learn representations
"I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have '327'? No. I have sky, house, and trees."
— Max Wertheimer, 1923
Good representations are:
[See “Representation Learning”, Bengio 2013, for more commentary]
[Bartlett, 1932]
[Intraub & Richardson, 1989]
[https://www.behance.net/gallery/35437979/Velocipedia]
compact representation/embedding
Auto-encoder
"What I cannot create, I do not understand." — Richard Feynman
Auto-encoder
encoder
decoder
bottleneck
Auto-encoder
Training Data
loss/objective
hypothesis class
A model
\(f\)
\(m<d\)
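A minimal sketch of the auto-encoder setup: an encoder mapping \(\mathbb{R}^d \rightarrow \mathbb{R}^m\) with \(m < d\) (the bottleneck), a decoder mapping back, trained to minimize squared reconstruction error. This uses linear maps and plain gradient descent for brevity; real auto-encoders use nonlinear layers, and all sizes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 8, 2, 200
# Data that lies in a 2-D subspace, so a width-2 bottleneck can suffice.
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, d))

E = rng.normal(size=(d, m)) * 0.1   # encoder weights
D = rng.normal(size=(m, d)) * 0.1   # decoder weights

def recon_loss(E, D):
    Xhat = (X @ E) @ D              # encode, then decode
    return np.mean((X - Xhat) ** 2)

loss_before = recon_loss(E, D)
lr = 0.01
for _ in range(500):
    Z = X @ E                       # compact embedding, shape (n, m)
    R = Z @ D - X                   # reconstruction residual
    gD = Z.T @ R / n                # gradient w.r.t. decoder
    gE = X.T @ (R @ D.T) / n        # gradient w.r.t. encoder
    D -= lr * gD
    E -= lr * gE
loss_after = recon_loss(E, D)       # reconstruction improves
```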
[See “Representation Learning”, Bengio 2013, for more commentary]
Auto-encoders try to achieve these
these may just emerge as well
Good representations are:
https://www.tensorflow.org/text/tutorials/word2vec
Word2Vec
verb tense
gender
vector("Paris") − vector("France") + vector("Italy") \(\approx\) vector("Rome")
“Meaning is use” — Wittgenstein
[Mikolov et al., 2013]
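The analogy arithmetic can be shown with toy hand-made 2-D "embeddings" (axis 0 ≈ capital-ness, axis 1 ≈ country group; entirely made up for illustration, whereas real word2vec vectors are learned and typically hundreds of dimensions):

```python
import numpy as np

vec = {
    "Paris":  np.array([1.0, 0.0]),
    "France": np.array([0.0, 0.0]),
    "Italy":  np.array([0.0, 1.0]),
    "Rome":   np.array([1.0, 1.0]),
}
x = vec["Paris"] - vec["France"] + vec["Italy"]

def nearest(x, vocab):
    # cosine similarity against every word in the toy vocabulary
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(vocab, key=lambda w: cos(x, vocab[w]))

answer = nearest(x, vec)
```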
Can help downstream tasks:
Word2Vec
[video edited from 3b1b]
embedding
a
robot
must
obey
distribution over the vocabulary
Transformer
"A robot must obey the orders given it by human beings ..."
push for Prob("robot") to be high
push for Prob("must") to be high
push for Prob("obey") to be high
push for Prob("the") to be high
\(\dots\)
\(\dots\)
\(\dots\)
\(\dots\)
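The "distribution over the vocabulary" is produced by mapping the output embedding at each position through a final linear (unembedding) layer and a softmax. A sketch with a made-up vocabulary and random stand-in weights:

```python
import numpy as np

vocab = ["a", "robot", "must", "obey", "the"]
rng = np.random.default_rng(0)
d = 4
U = rng.normal(size=(len(vocab), d))   # unembedding matrix (illustrative)
h = rng.normal(size=d)                 # output embedding at one position

logits = U @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax: distribution over vocab
```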
a
robot
must
obey
input embedding
output embedding
\(\dots\)
\(\dots\)
\(\dots\)
\(\dots\)
transformer block
transformer block
transformer block
\(L\) blocks
\(\dots\)
a
robot
must
obey
input embedding
output embedding
\(\dots\)
\(\dots\)
\(\dots\)
\(\dots\)
\(\dots\)
transformer block
transformer block
transformer block
A sequence of \(n\) tokens, each token in \(\mathbb{R}^{d}\)
a
robot
must
obey
input embedding
\(\dots\)
transformer block
transformer block
transformer block
output embedding
\(\dots\)
\(\dots\)
\(\dots\)
\(\dots\)
a
robot
must
obey
input embedding
output embedding
transformer block
\(\dots\)
\(\dots\)
\(\dots\)
attention layer
fully-connected network
\(\dots\)
[video edited from 3b1b]
[video edited from 3b1b]
a
robot
must
obey
input embedding
output embedding
transformer block
\(\dots\)
\(\dots\)
\(\dots\)
\(\dots\)
attention layer
fully-connected network
the usual weights
attention mechanism
[image edited from 3b1b]
\(n\)
\(d\)
input embedding (e.g. via a fixed encoder)
[video edited from 3b1b]
[video edited from 3b1b]
[image edited from 3b1b]
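The attention mechanism over a sequence of \(n\) tokens in \(\mathbb{R}^d\) can be sketched as single-head, unmasked scaled dot-product attention; the query/key/value weight matrices below are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))                 # token embeddings, one per token

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values

scores = Q @ K.T / np.sqrt(d)               # (n, n) similarity scores
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)           # each row: attention weights
out = A @ V                                 # (n, d) attention output
```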
Cross-entropy loss encourages the internal weights to update so as to make this probability higher
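Concretely, the cross-entropy on one next-token prediction is just \(-\log\) of the probability assigned to the observed token, so lowering the loss is exactly raising that probability. The distributions below are made up:

```python
import numpy as np

probs = np.array([0.1, 0.6, 0.2, 0.1])    # model's distribution over 4 tokens
target = 1                                 # index of the observed next token
loss = -np.log(probs[target])

better = np.array([0.05, 0.8, 0.1, 0.05])  # same target given higher prob
loss_better = -np.log(better[target])      # higher Prob(target) -> lower loss
```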
Foundation Models
Often, what we will be “tested” on is not what we were trained on.
Final-layer adaptation: freeze \(f\), train a new final layer on new target data
Finetuning: initialize \(f'\) as \(f\), then continue training \(f'\) on new target data
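A sketch of final-layer adaptation: the pretrained feature map \(f\) is frozen (here a fixed random ReLU network standing in for a real pretrained model) and only a new linear head is fit to the target data, by least squares for simplicity. All sizes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat, n = 10, 16, 100
W_frozen = rng.normal(size=(d_in, d_feat))

def f(X):
    # frozen pretrained feature extractor (stand-in)
    return np.maximum(0.0, X @ W_frozen)

X = rng.normal(size=(n, d_in))     # new target-task inputs (made up)
y = rng.normal(size=n)             # new target-task labels (made up)

Phi = f(X)                         # features never change during adaptation
head, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # train only the new layer
preds = Phi @ head
# (Finetuning would instead also continue gradient updates on W_frozen.)
```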
[Zeiler et al. 2013]
E.g., features from a model pre-trained on ImageNet can be reused for medical images
Label prediction (supervised learning)
Features
Label
Feature reconstruction (unsupervised learning)
Features
Reconstructed Features
Partial
features
Other partial
features
Feature reconstruction (self-supervised learning)
Self-supervised learning
Common trick:
Masked Auto-encoder
[He, Chen, Xie, et al. 2021]
Masked Auto-encoder
[Devlin, Chang, Lee, et al. 2019]
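The common trick behind masked auto-encoders (and masked language models like BERT) is: hide most of the input, and score the model only on reconstructing the hidden part. A toy numpy sketch, where the "model" is a hypothetical placeholder that fills in the mean of the visible entries:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)            # one signal (stand-in for patches/tokens)

mask = np.zeros(16, dtype=bool)    # mask 12 of 16 positions (~75%, as in MAE)
mask[:12] = True
rng.shuffle(mask)

x_visible = np.where(mask, 0.0, x) # the model only sees unmasked entries

def toy_model(v):
    # hypothetical predictor: fill every position with the visible mean
    return np.full_like(v, v[~mask].mean())

x_hat = toy_model(x_visible)
loss = np.mean((x_hat[mask] - x[mask]) ** 2)  # loss only on masked positions
```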
[Zhang, Isola, Efros, ECCV 2016]
predict color from gray-scale
[Zhang, Isola, Efros, ECCV 2016]
The allegory of the cave
Contrastive learning
Contrastive learning
[Chen, Kornblith, Norouzi, Hinton, ICML 2020]
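A minimal sketch of the SimCLR-style contrastive objective: two augmented "views" of each example should embed nearby (positive pairs, on the diagonal of the similarity matrix), while other batch items serve as negatives. The embeddings below are random stand-ins for encoder outputs, and the temperature value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 8, 16
z1 = rng.normal(size=(B, d))
z2 = z1 + 0.05 * rng.normal(size=(B, d))       # slightly perturbed second view

def normalize(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def info_nce(z1, z2, temp=0.1):
    z1, z2 = normalize(z1), normalize(z2)
    sim = z1 @ z2.T / temp                      # (B, B) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))       # positives on the diagonal

loss_aligned = info_nce(z1, z2)
loss_shuffled = info_nce(z1, z2[::-1].copy())   # mismatched pairs score worse
```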
[Slide credit: Andrew Owens]
[Owens et al, Ambient Sound Provides Supervision for Visual Learning, ECCV 2016]
[Slide credit: Andrew Owens]
[Owens et al, Ambient Sound Provides Supervision for Visual Learning, ECCV 2016]
What did the model learn?
[Slide credit: Andrew Owens]
[Owens et al, Ambient Sound Provides Supervision for Visual Learning, ECCV 2016]
[https://arxiv.org/pdf/2204.06125.pdf]
DALL·E
A few other examples
[Slide Credit: Yann LeCun]
We'd love to hear your thoughts.
(auto encoder slides adapted from Phillip Isola)
GPT-4o: multi-modal image generation
prompt: "make an image of Mona Lisa holding this cat"
[this cat is button]