
Lecture 9: Transformers
Shen Shen
April 11, 2025
11am, Room 10-250
(interactive slides support animated walk-throughs of transformers and attention mechanisms.)
Intro to Machine Learning

Outline
- Recap, embedding and representation
- Transformers high-level intuition
- Transformers architecture overview
- (query, key, value) and self-attention
- matrix form
- Multi-head attention
- (Applications)
[video edited from 3b1b]
embedding

dict_en2fr = {
    "apple" : "pomme",
    "banana": "banane",
    "lemon" : "citron"}

Good embeddings enable vector arithmetic.

[figure: the dictionary keys (apple, banana, lemon) paired with their values (pomme, banane, citron)]
A query comes:

dict_en2fr = {
    "apple" : "pomme",
    "banana": "banane",
    "lemon" : "citron"}
query = "lemon"
output = dict_en2fr[query]

[figure: the query "lemon" matches the key "lemon" exactly, and the lookup returns the value "citron"]
What if the query is not one of the keys?

dict_en2fr = {
    "apple" : "pomme",
    "banana": "banane",
    "lemon" : "citron"}
query = "orange"
output = dict_en2fr[query]

Python would complain. 🤯

[figure: the query "orange" matches none of the keys, so the lookup has no output]
But we may agree with this intuition:

[figure: the query "orange" is compared against every key (apple, banana, lemon); the output is a weighted combination of the values: 0.1 · pomme + 0.1 · banane + 0.8 · citron]
Now, if we are to formalize this idea, we need to:
1. learn to get these "good" (query, key, value) embeddings;
2. calculate these sorts of percentages.
Very roughly, with good embeddings, getting the percentages can be easy:
- query compared with keys → dot-product similarity
- what about turning similarities into percentages? → softmax
Putting it together:
- query compared with keys, then softmax → percentages
- combine values using these percentages as output

(very roughly, the attention mechanism does just this "reasonable merging")

[figure: the query "orange" is compared with the keys apple, banana, lemon; the softmaxed percentages (0.1, 0.1, 0.8) weight the values, and the combination 0.1 · pomme + 0.1 · banane + 0.8 · citron is returned as the output]
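To make this concrete, here is a minimal numpy sketch of the "soft" dictionary lookup above. The 2-d embedding vectors are made up purely for illustration (chosen so that "orange" lands closest to "lemon"); learning embeddings with this kind of geometry is exactly what comes next.

import numpy as np

# Toy, made-up 2-d embeddings: "orange" is placed closest to "lemon",
# so the soft lookup should lean heavily towards "citron".
key_embed = {"apple":  np.array([1.0, 0.0]),
             "banana": np.array([0.0, 1.0]),
             "lemon":  np.array([2.0, 2.0])}
val_embed = {"pomme":  np.array([1.0, 0.1]),
             "banane": np.array([0.1, 1.0]),
             "citron": np.array([2.1, 1.9])}
query = np.array([1.8, 2.1])                      # made-up embedding of "orange"

K = np.stack(list(key_embed.values()))            # one key embedding per row
V = np.stack(list(val_embed.values()))            # one value embedding per row

scores = K @ query                                # dot-product similarity with each key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> "percentages"
output = weights @ V                              # weighted combination of the values

print(dict(zip(key_embed, weights.round(3))))     # most weight on "lemon"
print(output)                                     # close to the "citron" vector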
Outline
- Recap, embedding and representation
- Transformers high-level intuition
- Transformers architecture overview
- (query, key, value) and self-attention
- matrix form
- Multi-head Attention
- (Applications)

Large Language Models (LLMs) are trained in a self-supervised way

- Scrape the internet for unlabeled plain texts.
- Cook up “labels” (prediction targets) from the unlabeled texts.
- Convert the “unsupervised” problem into a “supervised” setup.
"To date, the cleverest thinker of all time was Issac. "
feature
label
To date, the
cleverest
To date, the cleverest
thinker
To date, the cleverest thinker
was
To date, the cleverest thinker of all time was
Issac
e.g., train to predict the next-word
Auto-regressive

How to train? The same recipe:
- the model has some learnable weights
- next-word prediction is just multi-class classification (over the vocabulary)
[video and image edited from 3b1b]

[figure: the input is a sequence of \(n\) tokens, each mapped to a \(d\)-dimensional input embedding (e.g. via a fixed encoder)]

Cross-entropy loss encourages the internal weights to update so as to make this probability higher.
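As a concrete miniature of that loss: suppose the model's predicted next-word distribution over a tiny, hypothetical vocabulary looks like the one below. The cross-entropy loss is the negative log-probability assigned to the correct next word, so pushing that probability up pushes the loss down.

import numpy as np

# Hypothetical tiny vocabulary and a made-up predicted distribution
# for the next word after "... of all time was".
vocab = ["Isaac", "the", "thinker", "was"]
probs = np.array([0.6, 0.2, 0.1, 0.1])   # model's predicted next-word probabilities

target = vocab.index("Isaac")            # the correct next word
loss = -np.log(probs[target])            # cross-entropy with a one-hot target
print(loss)                              # ~0.51; shrinks as probs[target] -> 1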
image credit: Nicholas Pfaff


Generative Boba by Boyuan Chen in Bldg 45
😉
[video edited from 3b1b]
Outline
- Recap, embedding and representation
- Transformers high-level intuition
- Transformers architecture overview
- (query, key, value) and self-attention
- matrix form
- Multi-head Attention
- (Applications)
"A robot must obey the orders given it by human beings ..."

[figure: the input tokens (a, robot, must, obey) go through the Transformer; at each position, the model outputs a distribution over the vocabulary, and training pushes Prob("robot"), Prob("must"), Prob("obey"), Prob("the") to be high at the corresponding positions]
[figure: the input tokens (a, robot, must, obey) pass through an input embedding, then through \(L\) stacked transformer blocks, then through an output embedding]

A sequence of \(n\) tokens, each token in \(\mathbb{R}^{d}\), flows through the stack.
[figure: inside each transformer block: a self-attention layer, followed by a fully-connected network that learns the usual weights]
[figure: inside the attention layer: a \((q,k,v)\) embedding step applied to the input embeddings, followed by the attention mechanism]

The \((q,k,v)\) embedding step:
- sequence of \(d\)-dimensional input tokens \(x\)
- learnable weights \(W_q, W_v, W_k\), all in \(\mathbb{R}^{d \times d_k}\)
- map the input sequence into a \(d_k\)-dimensional \((q,k,v)\) sequence, e.g., \(q_1 = W_q^Tx_1\)
- the weights are shared across the sequence of tokens -- parallel processing (see the sketch below)
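A minimal numpy sketch of this \((q,k,v)\) embedding step; the sizes \(n=4\), \(d=6\), \(d_k=3\) and the random weights are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 4, 6, 3                     # e.g. the 4 tokens "a robot must obey"
X = rng.normal(size=(n, d))             # one d-dimensional input token per row

Wq = rng.normal(size=(d, d_k))          # learnable; shared across all tokens
Wk = rng.normal(size=(d, d_k))
Wv = rng.normal(size=(d, d_k))

# q_i = Wq^T x_i for each token i; stacking the tokens as rows of X lets a
# single matrix multiply process the whole sequence in parallel:
Q, K, V = X @ Wq, X @ Wk, X @ Wv
print(Q.shape, K.shape, V.shape)        # (4, 3) (4, 3) (4, 3)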
Outline
- Recap, embedding and representation
- Transformers high-level intuition
- Transformers architecture overview
- (query, key, value) and self-attention
- matrix form
- Multi-head Attention
- (Applications)
[animation: for each token (a, robot, must, obey), the attention mechanism compares that token's query against every token's key; a softmax turns the scores into weights, and the weighted combination of the value tokens becomes that token's attention output]
Outline
- Recap, embedding and representation
- Transformers high-level intuition
- Transformers architecture overview
- (query, key, value) and self-attention
- matrix form
- Multi-head Attention
- (Applications)
[figure: in matrix form, the row-wise softmax produces the \(n \times n\) attention matrix; each row sums up to 1]
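Written out for one head (a sketch using the notation of the shape table later in the slides; in practice the scores are usually also scaled by \(1/\sqrt{d_k}\) before the softmax, which is omitted here):

$$Q = XW_q, \qquad K = XW_k, \qquad V = XW_v$$
$$A = \operatorname{softmax}_{\text{row}}\!\left(QK^{\top}\right) \in \mathbb{R}^{n \times n}, \qquad \text{output} = AV \in \mathbb{R}^{n \times d_k}$$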
Outline
- Recap, embedding and representation
- Transformers high-level intuition
- Transformers architecture overview
- (query, key, value) and self-attention
- matrix form
- Multi-head Attention
- (Applications)
[animation: one attention head runs the attention mechanism over the tokens (a, robot, must, obey); additional heads are then added, each running its own attention mechanism in parallel over the same sequence]
Each attention head
- can be processed independently and in parallel with all other heads,
- learns its own set of \(W_q, W_k, W_v\),
- creates its own projected \((q,k,v)\) tokens,
- computes its own attention outputs independently,
- processes the sequence of \(n\) tokens simultaneously and in parallel
independent, parallel, and structurally identical processing across all heads and tokens.
[figure: the heads' outputs are combined into the multi-head attention output; the resulting output tokens are again all in \(\mathbb{R}^{d}\)]
Shape Example

quantity      | symbol    | example
num tokens    | \(n\)     | 2
token dim     | \(d\)     | 4
\((qkv)\) dim | \(d_k\)   | 3
num heads     | \(H\)     | 5

              | symbol     | shape             | example
learned:      |            |                   |
query proj    | \(W_q^h\)  | \(d \times d_k\)  | \(4 \times 3\)
key proj      | \(W_k^h\)  | \(d \times d_k\)  | \(4 \times 3\)
value proj    | \(W_v^h\)  | \(d \times d_k\)  | \(4 \times 3\)
output proj   | \(W^o\)    | \(d \times Hd_k\) | \(4 \times 15\)
              |            |                   |
input         | -          | \(n \times d\)    | \(2 \times 4\)
query         | \(Q^h\)    | \(n \times d_k\)  | \(2 \times 3\)
key           | \(K^h\)    | \(n \times d_k\)  | \(2 \times 3\)
value         | \(V^h\)    | \(n \times d_k\)  | \(2 \times 3\)
attn matrix   | \(A^h\)    | \(n \times n\)    | \(2 \times 2\)
head out.     | \(Z^h\)    | \(n \times d_k\)  | \(2 \times 3\)
output        | -          | \(n \times d\)    | \(2 \times 4\)
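A numpy sketch that checks these shapes end to end, using the example numbers from the table; the random weights are placeholders, and the \(1/\sqrt{d_k}\) scaling used in practice is again left out to keep the sketch minimal.

import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)        # for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

n, d, d_k, H = 2, 4, 3, 5                        # example numbers from the table
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                      # input:            n x d

head_outputs = []
for h in range(H):
    Wq = rng.normal(size=(d, d_k))               # query proj W_q^h: d x d_k
    Wk = rng.normal(size=(d, d_k))               # key proj   W_k^h: d x d_k
    Wv = rng.normal(size=(d, d_k))               # value proj W_v^h: d x d_k
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # each:             n x d_k
    A = softmax_rows(Q @ K.T)                    # attn matrix A^h:  n x n
    head_outputs.append(A @ V)                   # head out Z^h:     n x d_k

Z = np.concatenate(head_outputs, axis=1)         # all heads:        n x (H * d_k)
Wo = rng.normal(size=(d, H * d_k))               # output proj W^o:  d x (H * d_k)
output = Z @ Wo.T                                # output:           n x d
print(output.shape)                              # (2, 4)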
Some practical techniques commonly needed when training auto-regressive transformers:
- masking (a causal-masking sketch follows below)
- layer normalization
- residual connections
- positional encoding
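Here is a minimal numpy sketch of causal masking, assuming each token should attend only to itself and to earlier tokens; the entries above the diagonal of the score matrix are blocked before the softmax.

import numpy as np

n = 4                                                    # e.g. "a robot must obey"
scores = np.random.default_rng(0).normal(size=(n, n))    # raw query-key scores

# Mask out the "future": entries above the diagonal become -inf, so the
# row-wise softmax assigns them exactly zero weight.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)           # each row still sums to 1
print(weights.round(2))                                  # upper triangle is all zeros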
applications/comments

We can tokenize anything.
General strategy: chop the input up into chunks, project each chunk to an embedding


This projection can be fixed (from a pre-trained model) or trained jointly with the downstream task.
[images credit: visionbook.mit.edu]

[figure: the input is chopped into a sequence of \(n\) tokens; a projection (e.g. a fixed or a learned linear transformation) maps each token to an embedding in \(\mathbb{R}^{d}\)]
[images credit: visionbook.mit.edu]

e.g., a 100-by-100 image chopped into 20-by-20 patches gives a sequence of \(n=25\) tokens; suppose we just flatten each patch, then each token is in \(\mathbb{R}^{400}\).
[images credit: visionbook.mit.edu]
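A minimal numpy sketch of exactly this patch tokenization; the image content is random, only the shapes matter.

import numpy as np

image = np.random.default_rng(0).random((100, 100))   # a 100-by-100 "image"
patch = 20

# Chop into 20-by-20 chunks and flatten each chunk into a 400-dim token.
tokens = np.stack([image[i:i + patch, j:j + patch].reshape(-1)
                   for i in range(0, 100, patch)
                   for j in range(0, 100, patch)])
print(tokens.shape)                                   # (25, 400): n = 25 tokens in R^400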

Multi-modality (text + image)
- (query, key, value) come from different input modalities
- cross-attention (a sketch follows)
[images credit: visionbook.mit.edu]
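A minimal numpy sketch of cross-attention under this setup, assuming (hypothetically) that text tokens supply the queries while image-patch tokens supply the keys and values; all sizes and weights are made up.

import numpy as np

rng = np.random.default_rng(0)
text_tokens  = rng.normal(size=(3, 8))     # 3 text tokens,         d = 8
image_tokens = rng.normal(size=(25, 8))    # 25 image-patch tokens, d = 8
d_k = 4
Wq, Wk, Wv = (rng.normal(size=(8, d_k)) for _ in range(3))

Q = text_tokens @ Wq                       # queries from one modality
K = image_tokens @ Wk                      # keys and values from the other
V = image_tokens @ Wv

scores = Q @ K.T                           # 3 x 25 similarity scores
scores -= scores.max(axis=-1, keepdims=True)
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)         # row-wise softmax
print((A @ V).shape)                       # (3, 4): one output per text token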




Image/video credit: RFDiffusion https://www.bakerlab.org
[“DINO”, Caron et al., 2021]
Success mode:
[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Xu et al. CVPR (2016)]

Failure mode:
[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Xu et al. CVPR (2016)]
Summary
- Transformers combine many of the best ideas from earlier architectures (convolutional patch-wise processing, ReLU nonlinearities, residual connections) with several new innovations, in particular, embedding and attention layers.
- Transformers start with some generic hard-coded embeddings and, layer by layer, create better and better embeddings.
- Everything in attention is processed in parallel: each head is processed in parallel, and within each head, the \(q,k,v\) token sequence is created in parallel, the attention scores are computed in parallel, and the attention outputs are computed in parallel.
Thanks!
for your attention!
We'd love to hear your thoughts.