Lecture 8: Representation Learning

 

Shen Shen

April 4, 2025

11am, Room 10-250

Intro to Machine Learning

Outline

  • Neural networks are representation learners
  • Auto-encoders
  • Unsupervised and self-supervised learning
  • Word embeddings
  • (Some recent representation learning ideas)

Recap:

[Figure: a fully-connected neural network. Inputs \(x_1, x_2, \dots, x_d\) feed into layers of neurons; each neuron computes a linear combination \(\Sigma\) of its inputs (with learnable weights) followed by an activation \(f(\cdot)\); hidden layers feed into the output layer.]

compositions of ReLUs can be quite expressive

in fact, asymptotically, they can approximate any continuous function!

[image credit: Phillip Isola]
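As a small illustration of that expressiveness (ours, not from the lecture): a one-hidden-layer net whose ReLU weights we set by hand can already track \(\sin(x)\) closely.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# approximate sin(x) on [0, 2*pi] with a sum of shifted ReLUs,
# i.e. a one-hidden-layer net with hand-picked weights
knots = np.linspace(0.0, 2 * np.pi, 20)      # the hidden units' "bend" points

def net(x):
    # piecewise-linear interpolation of sin at the knots, built from ReLUs:
    # each hidden unit changes the slope by (new slope - old slope)
    slopes = np.diff(np.sin(knots)) / np.diff(knots)
    y = np.full_like(x, np.sin(knots[0]))
    prev = 0.0
    for k, s in zip(knots[:-1], slopes):
        y = y + (s - prev) * relu(x - k)
        prev = s
    return y

xs = np.linspace(0.0, 2 * np.pi, 200)
print(np.max(np.abs(net(xs) - np.sin(xs))))  # max error: about 0.01
# adding more hidden units (more knots) drives the error toward zero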

Two different ways to visualize a function

[images credit: visionbook.mit.edu]


Representation transformations for a variety of neural net operations

[images credit: visionbook.mit.edu]

and stack of neural net operations


[images credit: visionbook.mit.edu]

[Figure: the same network shown four ways: wiring graph, equation, mapping (1D), mapping (2D)]

[images credit: visionbook.mit.edu]


Training data: \(x \in \mathbb{R}^2\)

\(z_1 = \text{linear}(x)\)
\(a_1 = \text{ReLU}(z_1)\)
\(z_2 = \text{linear}(a_1)\)
\(g = \text{softmax}(z_2)\)

maps from complex data space to simple embedding space

[images credit: visionbook.mit.edu]


Neural networks are representation learners 

 Deep nets transform datapoints, layer by layer
 Each layer gives a different representation (aka embedding) of the data

[images credit: visionbook.mit.edu]
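As a toy illustration of "a representation per layer" (ours; a random untrained MLP, just to show the mechanics), every layer's activations are an embedding of the same datapoints:

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# a small random (untrained) MLP: 2 -> 16 -> 16 -> 2
sizes = [2, 16, 16, 2]
weights = [rng.normal(size=(m, k)) / np.sqrt(m)   # rough Xavier-style scaling
           for m, k in zip(sizes[:-1], sizes[1:])]

def embeddings(X):
    """Return the representation of X at every layer."""
    reps = [X]                        # layer 0: the raw data
    for W in weights:
        X = relu(X @ W)               # linear combination, then activation
        reps.append(X)                # a new embedding of the same points
    return reps

X = rng.normal(size=(100, 2))         # 100 datapoints in R^2
for i, R in enumerate(embeddings(X)):
    print(f"layer {i}: embedding of shape {R.shape}")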

Outline

  • Neural networks are representation learners
  • Auto-encoders
  • Unsupervised and self-supervised learning
  • Word embeddings
  • (Some recent representation learning ideas)

[Bartlett, 1932]
[Intraub & Richardson, 1989]

[https://www.behance.net/gallery/35437979/Velocipedia]

"I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have "327"? No. I have sky, house, and trees.”

— Max Wertheimer, 1923

🧠

humans also learn representations

[images credit: visionbook.mit.edu]

Good representations are:

  • Compact (minimal)
  • Explanatory (roughly sufficient)

[See “Representation Learning”, Bengio 2013, for more commentary]

Auto-encoder

"What I cannot create, I do not understand." Feynman

[images credit: visionbook.mit.edu]

compact representation/embedding

Auto-encoder

[Figure: an encoder compresses the input down to a bottleneck; a decoder expands it back out]

Auto-encoder

input \(x \in \mathbb{R}^d\)

output \(\tilde{x} = \text{NN}(x; W) \in \mathbb{R}^d\)

training objective: \(\min_{W} \|x - \tilde{x}\|^2\)

bottleneck: typically has lower dimension than \(d\)

Auto-encoder

Training data: \(\left\{x^{(i)}\right\}_{i=1}^n\)

Hypothesis class: a model \(F = h \circ g: \mathbb{R}^d \rightarrow \mathbb{R}^m \rightarrow \mathbb{R}^d\), with encoder \(g\), decoder \(h\), and bottleneck dimension \(m < d\)

Loss/objective: \(\mathcal{L}(F(\mathbf{x}), \mathbf{x})=\|F(\mathbf{x})-\mathbf{x}\|^2\)

[images credit: visionbook.mit.edu]

[Figure: the encoder \(g\) maps the data space to the representation space; the decoder \(h\) maps back to the data space]

[Often, encoders can be kept to get "good representations", whereas decoders can serve as "generative models"]
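A minimal sketch of this setup (ours; a linear encoder and decoder trained by gradient descent on the reconstruction objective, on synthetic data):

import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 3, 500                       # data dim, bottleneck dim (m < d), num points

# synthetic training data that truly lie in an m-dimensional subspace
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, d))

# F = h . g with a linear encoder g and linear decoder h
Wg = 0.1 * rng.normal(size=(d, m))         # encoder weights: R^d -> R^m
Wh = 0.1 * rng.normal(size=(m, d))         # decoder weights: R^m -> R^d

lr = 0.01
for step in range(2000):
    Z = X @ Wg                             # embeddings g(x)
    err = Z @ Wh - X                       # F(x) - x
    grad_Wh = Z.T @ err / n                # gradients of the reconstruction loss
    grad_Wg = X.T @ (err @ Wh.T) / n       # (constant factors folded into lr)
    Wh -= lr * grad_Wh
    Wg -= lr * grad_Wg

print("avg squared norm of x:   ", np.mean(np.sum(X**2, axis=1)))
print("avg reconstruction error:", np.mean(np.sum((X @ Wg @ Wh - X)**2, axis=1)))
# the error should end up far below the data's scale, since m principal
# directions suffice to describe this data exactly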

Outline

  • Neural networks are representation learners
  • Auto-encoders
  • Unsupervised and self-supervised learning
  • Word embeddings
  • (Some recent representation learning ideas)

Supervised Learning

Training data: \(\{x^{(1)}, y^{(1)}\}, \{x^{(2)}, y^{(2)}\}, \{x^{(3)}, y^{(3)}\}, \ldots\)

[Figure: a Learner consumes the labeled data and produces a predictor \(f(x) \to y\), learning a "good" representation along the way]

Unsupervised Learning

Training data: \(\{x^{(1)}\}, \{x^{(2)}\}, \{x^{(3)}\}, \ldots\)

[Figure: a Learner consumes unlabeled data]

Label prediction (supervised learning): features \(x\) \(\to\) label \(y\)

Feature reconstruction (unsupervised learning): features \(x\) \(\to\) reconstructed features \(\hat{x}\)

Self-supervised learning: partial features \(\to\) other partial features

Masked Auto-encoder

[He, Chen, Xie, et al. MAE, 2021]
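A sketch of the data side of this pretext task (the helper below is ours and hypothetical; the default 75% mask ratio follows the MAE paper): mask out most patches, keep the rest as input, and train the model to reconstruct the full image.

import numpy as np

rng = np.random.default_rng(0)

def make_mae_pair(patches, mask_ratio=0.75):
    """Split an image's patches into a visible subset (the model's input)
    and the full set (the reconstruction target)."""
    n = len(patches)
    num_visible = max(1, round(n * (1 - mask_ratio)))
    visible_idx = rng.choice(n, size=num_visible, replace=False)
    return patches[visible_idx], visible_idx, patches

patches = rng.normal(size=(16, 8))     # toy "image": 16 patches, dim 8 each
visible, idx, target = make_mae_pair(patches)
print(visible.shape)                   # (4, 8): only 25% of patches are seen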


predict color from gray-scale

[Zhang, Isola, Efros, ECCV 2016]

Masked Auto-encoder

[Devlin, Chang, Lee, et al. 2019]

Self-supervised learning

Common trick: 

  • Convert “unsupervised” problem into “supervised” setup
  • Do so by cooking up “labels” (prediction targets) from the raw data itself — called pretext task

Outline

  • Neural networks are representation learners
  • Auto-encoders
  • Unsupervised and self-supervised learning
  • Word embeddings
  • (Some recent representation learning ideas)

Large Language Models (LLMs) are trained in a self-supervised way

  • Scrape the internet for unlabeled plain texts.
  • Cook up “labels” (prediction targets) from the unlabeled texts.
  • Convert “unsupervised” problem into “supervised” setup.

"To date, the cleverest thinker of all time was Issac. "

feature → label

"To date, the" → "cleverest"
"To date, the cleverest" → "thinker"
\(\dots\)
"To date, the cleverest thinker of all time" → "was"
"To date, the cleverest thinker of all time was" → "Isaac"
[video edited from 3b1b]

[image edited from 3b1b]

[Figure: the \(n\) input tokens form an \(n \times d\) matrix of input embeddings (e.g. via a fixed encoder), one \(d\)-dimensional embedding per token]

[video edited from 3b1b]


"A robot must obey the orders given it by human beings ..."

[Figure: a Transformer reads the tokens "a", "robot", "must", "obey". Each token's input embedding passes through \(L\) stacked transformer blocks; each resulting output embedding gives a distribution over the vocabulary. Training pushes Prob("robot") to be high at the first position, Prob("must") at the second, Prob("obey") at the third, Prob("the") at the fourth, and so on.]

[video edited from 3b1b]

Word embedding

dot-product similarity

[video edited from 3b1b]

For now, let's think about dictionary look-up:

Key : Value
apple : pomme
banana : banane
lemon : citron

dict_en2fr = { 
  "apple" : "pomme",
  "banana": "banane", 
  "lemon" : "citron"}

good embedding representation => sensible dot-product similarity

=> enables effective attention in transformers (next week's topic)
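For instance, with made-up 2-D embeddings (invented purely for illustration), dot products can already encode that "orange" is most like "lemon":

import numpy as np

# hypothetical 2-D word embeddings, invented purely for illustration
emb = {
    "apple":  np.array([1.0, 0.1]),
    "banana": np.array([0.9, 0.3]),
    "lemon":  np.array([0.2, 1.0]),
    "orange": np.array([0.3, 0.9]),
}

for key in ["apple", "banana", "lemon"]:
    print(key, float(np.dot(emb["orange"], emb[key])))
# apple ≈ 0.39, banana ≈ 0.54, lemon ≈ 0.96: "lemon" is the most similar key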

dict_en2fr = { 
  "apple" : "pomme",
  "banana": "banane", 
  "lemon" : "citron"}

query = "lemon" 
output = dict_en2fr[query]

Key : Value
apple : pomme
banana : banane
lemon : citron

Query: lemon → Output: citron

What if we run

dict_en2fr = { 
  "apple" : "pomme",
  "banana": "banane", 
  "lemon" : "citron"}

query = "orange" 
output = dict_en2fr[query]

Python would complain: "orange" is not a key, so the look-up raises a KeyError. 🤯

Key : Value
apple : pomme
banana : banane
lemon : citron

Query: orange → Output: ???

But we can probably see the rationale behind this: "orange" is not a key, yet it is similar to the keys we do have, most of all to "lemon". So a sensible output would be a weighted average of the values:

output = 0.1 · pomme + 0.1 · banane + 0.8 · citron

We implicitly assumed the (query, key, value) are represented in 'good' embeddings.

If we are to generalize this idea, we need to:

  • get this sort of percentage (the 0.1, 0.1, 0.8) from the query and the keys themselves
  • output the weighted average over the values, e.g. 0.1 · pomme + 0.1 · banane + 0.8 · citron

very roughly, the attention mechanism does exactly this kind of "soft" look-up:

  • embed the query ("orange") and the keys ("apple", "banana", "lemon")
  • compute the dot-product similarity between the query embedding and each key embedding
  • softmax the similarities to get the percentages: softmax([q · k_apple, q · k_banana, q · k_lemon]) = [0.1, 0.1, 0.8]
  • output the weighted average over the values: 0.1 · pomme + 0.1 · banane + 0.8 · citron
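Here is that soft look-up as a minimal sketch (ours); the 2-D key/query embeddings are invented for illustration, and the values are one-hot stand-ins for pomme/banane/citron:

import numpy as np

def softmax(z):
    z = z - np.max(z)                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# hypothetical 2-D embeddings for the keys and the query
keys  = np.array([[1.0, 0.1],             # apple
                  [0.9, 0.3],             # banana
                  [0.2, 1.0]])            # lemon
query = np.array([0.3, 0.9])              # orange
values = np.eye(3)                        # one-hot stand-ins for pomme/banane/citron

scores = keys @ query                     # dot-product similarity to each key
weights = softmax(5.0 * scores)           # the 5.0 just sharpens this toy softmax
output = weights @ values                 # weighted average over the values
print(weights.round(2))                   # ~[0.05 0.1 0.85]: mostly "citron"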

Outline

  • Neural networks are representation learners
  • Auto-encoders
  • Unsupervised and self-supervised learning
  • Word embeddings
  • (Some recent representation learning ideas)

 

Good representations are:

  • Compact (minimal)
  • Explanatory (roughly sufficient)
  • Disentangled (independent factors)
  • Interpretable
  • Make subsequent problem solving easy

[See “Representation Learning”, Bengio 2013, for more commentary]

Auto-encoders try to achieve the first two; the rest may just emerge as well.

Some recent representation learning ideas:

  • pre-training
  • contrastive
  • multi-modality

Often, what we will be “tested” on is not what we were trained on.

[images credit: visionbook.mit.edu]

Final-layer adaptation: freeze \(f\), train a new final layer on new target data

[images credit: visionbook.mit.edu]

Finetuning: initialize \(f'\) as \(f\), then continue training all of \(f'\) on new target data

[images credit: visionbook.mit.edu]
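A sketch of both recipes in PyTorch (assuming a torchvision pretrained backbone; the 10 target classes and the learning rates are made up):

import torch
import torch.nn as nn
from torchvision import models

# final-layer adaptation: freeze the pretrained backbone f ...
model = models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False

# ... and train only a brand-new final layer on the target data
num_target_classes = 10                           # assumption: 10 new classes
model.fc = nn.Linear(model.fc.in_features, num_target_classes)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# finetuning instead: start from f, unfreeze everything,
# and continue training the whole network at a smaller learning rate
# for p in model.parameters():
#     p.requires_grad = True
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)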


The allegory of the cave

[images credit: visionbook.mit.edu]

Contrastive learning 

[images credit: visionbook.mit.edu]


[Chen, Kornblith, Norouzi, Hinton, ICML 2020]

[SimCLR animation]
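The heart of the contrastive objective, as a minimal NumPy sketch (ours; a simplified InfoNCE-style loss, not the full SimCLR recipe): embeddings of two views of the same image are pulled together, while all other pairings are pushed apart.

import numpy as np

rng = np.random.default_rng(0)

def normalize(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def contrastive_loss(Za, Zb, temperature=0.1):
    """Simplified InfoNCE: Za[i] and Zb[i] are embeddings of two views of
    the same image; every other pairing in the batch is a negative."""
    Za, Zb = normalize(Za), normalize(Zb)
    logits = Za @ Zb.T / temperature              # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    # cross-entropy with the "matching view" as the correct class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

Za = rng.normal(size=(8, 32))                     # batch of 8 embeddings, dim 32
loss_random  = contrastive_loss(Za, rng.normal(size=(8, 32)))
loss_aligned = contrastive_loss(Za, Za + 0.01 * rng.normal(size=(8, 32)))
print(loss_random, loss_aligned)   # aligned views should give a much lower loss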

Multi-modality

[images credit: visionbook.mit.edu]

[Slide credit: Andrew Owens]

[Owens et al, Ambient Sound Provides Supervision for Visual Learning, ECCV 2016]

Video - audio

[Slide credit: Andrew Owens]

[Owens et al, Ambient Sound Provides Supervision for Visual Learning, ECCV 2016]

What did the model learn?

[Slide credit: Andrew Owens]

[Owens et al, Ambient Sound Provides Supervision for Visual Learning, ECCV 2016]

Image classification (done in the contrastive way)


[https://arxiv.org/pdf/2204.06125.pdf]

Dall-E: text-image generation

[Slide Credit: Yann LeCun]

Summary

  • We have looked at the mechanics of neural nets. Today we saw that deep nets learn representations, just like our brains do.
  • This is useful because representations transfer: they act as prior knowledge that enables quick learning on new tasks.
  • Representations can also be learned without labels, e.g. in unsupervised or self-supervised learning. This is great since labels are expensive and limiting.
  • Without labels there are many ways to learn representations. We saw today:
    • representations as compressed codes, auto-encoder with bottleneck
    • (representations that are predictive of their context)
    • (representations that are shared across sensory modalities)

Thanks!

We'd love to hear your thoughts.
