Lecture 9: Transformers

Shen Shen

April 11, 2025

11am, Room 10-250

(interactive slides support animated walk-throughs of transformers and attention mechanisms.)

Intro to Machine Learning

Outline

  • Recap, embedding and representation 
  • Transformers high-level intuition 
  • Transformers architecture overview
  • (query, key, value) and self-attention
    • matrix form
  • Multi-head attention
  • (Applications)

[video edited from 3b1b]

embedding

dict_en2fr = { 
  "apple" : "pomme",
  "banana": "banane", 
  "lemon" : "citron"}

Good embeddings enable vector arithmetic.

Key : Value

apple : pomme
banana : banane
lemon : citron

dict_en2fr = { 
  "apple" : "pomme",
  "banana": "banane", 
  "lemon" : "citron"}

query = "lemon" 
output = dict_en2fr[query]

Query: "lemon" → Output: "citron"

A query comes that is not one of the keys:

Query: "orange" → Output: ???

Python would complain. 🤯

dict_en2fr = { 
  "apple" : "pomme",
  "banana": "banane", 
  "lemon" : "citron"}

query = "orange" 
output = dict_en2fr[query]

What if, instead of erroring out, we could output a reasonable blend?

We may agree with this intuition: compare the query "orange" against every key, and combine the values accordingly:

Output = 0.1 · pomme + 0.1 · banane + 0.8 · citron

Now, if we are to formalize this idea, we need to:

1. learn to get to these "good" (query, key, value) embeddings.
2. calculate this sort of percentages.


very roughly, with good embeddings, getting the percentages can be easy:


query compared with keys → dot-product similarity

what about percentages? apply a softmax to the similarity scores.

  • query compared with keys → percentages



(very roughly, the attention mechanism does just this "reasonable merging")

  • query compared with keys → percentages
  • combine values using these percentages as output

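This "reasonable merging" can be sketched in a few lines. The 2-D word embeddings below are made up, chosen only so that "orange" lands closest to "lemon"; they are not from any trained model:

```python
import numpy as np

def softmax(s):
    # subtract the max for numerical stability
    e = np.exp(s - s.max())
    return e / e.sum()

# made-up 2-D embeddings for the keys, and for the query "orange"
keys = {"apple":  np.array([1.0, 0.1]),
        "banana": np.array([0.9, 0.2]),
        "lemon":  np.array([0.2, 1.0])}
values = {"apple": "pomme", "banana": "banane", "lemon": "citron"}
query = np.array([0.3, 1.1])          # "orange"

# query compared with keys -> dot-product similarity -> softmax percentages
names = list(keys)
scores = np.array([query @ keys[k] for k in names])
weights = softmax(scores)

for name, w in zip(names, weights):
    print(f"{w:.2f} * {values[name]}")
```

"citron" receives the largest weight, since "orange" is most similar to "lemon"; the output is a weighted blend of all three values rather than a hard lookup.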

Outline

  • Recap, embedding and representation
  • Transformers high-level intuition
  • Transformers architecture overview
  • (query, key, value) and self-attention
    • matrix form
  • Multi-head Attention
  • (Applications)

Large Language Models (LLMs) are trained in a self-supervised way

  • Scrape the internet for unlabeled plain texts.
  • Cook up “labels” (prediction targets) from the unlabeled texts.
  • Convert “unsupervised” problem into “supervised” setup.

"To date, the cleverest thinker of all time was Isaac."

feature → label

"To date, the" → "cleverest"
"To date, the cleverest" → "thinker"
"To date, the cleverest thinker" → "was"
\(\dots\)
"To date, the cleverest thinker of all time was" → "Isaac"
e.g., train to predict the next-word

Auto-regressive

How to train? The same recipe:

  • model has some learnable weights
  • multi-class classification
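The label-cooking recipe can be sketched directly. For simplicity this uses word-level tokens (real LLMs use subword tokenizers):

```python
text = "To date, the cleverest thinker of all time was Isaac."
tokens = text.split()

# every prefix becomes a feature; the next word is its label
pairs = [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

for feature, label in pairs[:3]:
    print(f"{feature!r} -> {label!r}")
# 'To' -> 'date,'
# 'To date,' -> 'the'
# 'To date, the' -> 'cleverest'
```

One sentence of unlabeled text yields many supervised (feature, label) training pairs for free.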

The input is a sequence of \(n\) tokens, each embedded in \(\mathbb{R}^{d}\) (input embedding, e.g. via a fixed encoder).

Cross-entropy loss encourages the internal weights to update so as to make this probability higher.

[Generative Boba by Boyuan Chen in Bldg 45; image credit: Nicholas Pfaff]

Outline

  • Recap, embedding and representation
  • Transformers high-level intuition 
  • Transformers architecture overview
  • (query, key, value) and self-attention
    • matrix form
  • Multi-head Attention
  • (Applications)

Feeding the tokens (a, robot, must, obey) into the Transformer, trained on:

"A robot must obey the orders given it by human beings ..."

push for Prob("robot") to be high

push for Prob("must") to be high

push for Prob("obey") to be high

push for Prob("the") to be high

Each position outputs a distribution over the vocabulary.

The stack: input tokens (a, robot, must, obey) → input embedding \(x_1, x_2, x_3, x_4\) → \(L\) transformer blocks → output embedding.
A sequence of \(n\) tokens, each token in \(\mathbb{R}^{d}\)

Zooming into one transformer block: a self-attention layer followed by a fully-connected network.

Inside the self-attention layer: learnable weights \(W_q, W_k, W_v\) (and \(W^o\)), trained via \(\nabla \mathcal{L}\) like the usual weights, map each input embedding \(x_i\) (for a, robot, must, obey) to a \((q_i, k_i, v_i)\) embedding in \(\mathbb{R}^{d_k}\); the attention mechanism then turns these into outputs \(z_1, z_2, z_3, z_4\).
  • sequence of \(d\)-dimensional input tokens \(x\)
  • learnable weights \(W_q, W_k, W_v\), all in \(\mathbb{R}^{d \times d_k}\)
  • map the input sequence into a \(d_k\)-dimensional \((q, k, v)\) sequence, e.g., \(q_1 = W_q^T x_1\)
  • the weights are shared across the sequence of tokens, enabling parallel processing
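This projection step can be sketched with made-up sizes (\(n = 4\), \(d = 6\), \(d_k = 3\)) and random weights standing in for learned ones; the convention \(q_i = W_q^T x_i\) matches the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 4, 6, 3                 # sequence length, token dim, (qkv) dim

X = rng.normal(size=(n, d))         # the n input tokens, one per row
W_q = rng.normal(size=(d, d_k))     # learnable, shared across all tokens
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_k))

# parallel processing: one matrix multiply maps the whole sequence at once;
# row i gives q_i = W_q^T x_i (and likewise for k_i, v_i)
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)    # (4, 3) (4, 3) (4, 3)
```

Because the weights are shared, the per-token loop disappears into a single matrix multiplication.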

Outline

  • Recap, embedding and representation
  • Transformers high-level intuition 
  • Transformers architecture overview
  • (query, key, value) and self-attention
    • matrix form
  • Multi-head Attention
  • (Applications)
For each query \(q_i\) (shown for the 4-token sequence a, robot, must, obey), compare it against all the keys, scale, and softmax:

\[ [a_{i1},\; a_{i2},\; a_{i3},\; a_{i4}] = \operatorname{softmax}\big([q_i^T k_1,\; q_i^T k_2,\; q_i^T k_3,\; q_i^T k_4] / \sqrt{d_k}\big) \]

then combine the values using these weights:

\[ z_i = a_{i1} v_1 + a_{i2} v_2 + a_{i3} v_3 + a_{i4} v_4 \;\in\; \mathbb{R}^{d_k} \]

Repeating this for \(i = 1, 2, 3, 4\) gives the attention outputs \(z_1, z_2, z_3, z_4\).
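The per-query computation above, as a sketch (continuing the same notation, with Q, K, V holding the stacked \(q_i, k_i, v_i\) rows, here filled with random stand-in values):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d_k = 4, 3
Q = rng.normal(size=(n, d_k))       # rows q_1..q_4
K = rng.normal(size=(n, d_k))       # rows k_1..k_4
V = rng.normal(size=(n, d_k))       # rows v_1..v_4

Z = np.zeros((n, d_k))
for i in range(n):
    # compare q_i with every key, scale by sqrt(d_k), softmax -> a_{i1}..a_{i4}
    a = softmax(Q[i] @ K.T / np.sqrt(d_k))
    # combine the values using these percentages -> z_i
    Z[i] = a @ V
```

Each iteration is independent of the others, which is what makes the matrix form on the next slides possible.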

Outline

  • Recap, embedding and representation
  • Transformers high-level intuition 
  • Transformers architecture overview
  • (query, key, value) and self-attention
    • matrix form
  • Multi-head Attention
  • (Applications)
Matrix form: stack the queries into \(Q \in \mathbb{R}^{n \times d_k}\) and the keys into \(K \in \mathbb{R}^{n \times d_k}\). All the dot products \(q_i^T k_j\) are computed at once, giving the attention matrix

\[ A = \operatorname{softmax}\big(Q K^T / \sqrt{d_k}\big) \in \mathbb{R}^{n \times n}, \]

with the softmax applied row-wise, so each row sums up to 1.

Stacking the values into \(V \in \mathbb{R}^{n \times d_k}\), the attention mechanism outputs

\[ Z = A V \in \mathbb{R}^{n \times d_k}, \]

whose \(i\)-th row is \(z_i = \sum_j a_{ij} v_j \in \mathbb{R}^{d_k}\).
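In matrix form the per-query loop disappears entirely; a sketch, where the row-wise softmax is the only subtlety:

```python
import numpy as np

def attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(d_k)) V, with softmax applied row-wise."""
    d_k = Q.shape[1]
    S = Q @ K.T / np.sqrt(d_k)               # n x n scaled scores
    S = S - S.max(axis=1, keepdims=True)     # numerical stability
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)     # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
n, d_k = 4, 3
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
Z, A = attention(Q, K, V)
print(A.shape, Z.shape)                      # (4, 4) (4, 3)
```

Three matrix multiplications and one row-wise softmax replace the whole per-token walkthrough.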

Outline

  • Recap, embedding and representation
  • Transformers high-level intuition 
  • Transformers architecture overview
  • (query, key, value) and self-attention
    • matrix form
  • Multi-head Attention
  • (Applications)

One attention head: shared weights \(W_q, W_k, W_v\) map the inputs \(x_1, \dots, x_4\) (a, robot, must, obey) to \((q_i, k_i, v_i)\) triples, and the attention mechanism produces \(z_1, \dots, z_4\).

Multi-head attention: head \(h\) (for \(h = 1, \dots, H\)) has its own weights \(W_q^h, W_k^h, W_v^h\), its own projected \((q_i^h, k_i^h, v_i^h)\) tokens, and its own attention outputs \(z_1^h, \dots, z_4^h\).

Each attention head

  • can be processed independently and in parallel with all other heads,
  • learns its own set of \(W_q, W_k, W_v\),
  • creates its own projected \((q,k,v)\) tokens,
  • computes its own attention outputs independently,
  • processes the sequence of \(n\) tokens simultaneously and in parallel.

independent, parallel, and structurally identical processing across all heads and tokens.
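A sketch of multi-head attention with made-up sizes (\(n=4\), \(d=6\), \(d_k=3\), \(H=2\)); each head reuses the same single-head attention formula with its own random stand-in weights, and the heads' outputs are concatenated (the \(W^o\) projection comes on the next slides):

```python
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def head(X, W_q, W_k, W_v):
    # each head: its own projections, then the standard attention mechanism
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax_rows(Q @ K.T / np.sqrt(W_q.shape[1]))
    return A @ V

rng = np.random.default_rng(0)
n, d, d_k, H = 4, 6, 3, 2
X = rng.normal(size=(n, d))

# every head learns its own W_q, W_k, W_v and runs independently
heads = [head(X, *[rng.normal(size=(d, d_k)) for _ in range(3)])
         for _ in range(H)]
Z = np.concatenate(heads, axis=1)   # n x (H * d_k)
print(Z.shape)                      # (4, 6)
```

Nothing in one head depends on another, so in practice all \(H\) heads run as one batched computation.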


The \(H\) head outputs for each token are concatenated and projected by a learned output matrix \(W^o\), so the final per-token outputs are all in \(\mathbb{R}^{d}\).

Shape Example

  \(n\) (num tokens) = 2, \(d\) (token dim) = 4, \(d_k\) ((qkv) dim) = 3, \(H\) (num heads) = 5

  learned:
    query proj   \(W_q^h\): \(d \times d_k\) = 4 × 3
    key proj     \(W_k^h\): \(d \times d_k\) = 4 × 3
    value proj   \(W_v^h\): \(d \times d_k\) = 4 × 3
    output proj  \(W^o\):   \(d \times H d_k\) = 4 × 15

  computed:
    input:        \(n \times d\) = 2 × 4
    query        \(Q^h\): \(n \times d_k\) = 2 × 3
    key          \(K^h\): \(n \times d_k\) = 2 × 3
    value        \(V^h\): \(n \times d_k\) = 2 × 3
    attn matrix  \(A^h\): \(n \times n\) = 2 × 2
    head out.    \(Z^h\): \(n \times d_k\) = 2 × 3
    output:       \(n \times d\) = 2 × 4

Some practical techniques commonly needed when training auto-regressive transformers:

  • Masking
  • Layer normalization
  • Residual connections
  • Positional encoding

applications/comments

We can tokenize anything.
General strategy: chop the input up into chunks, and project each chunk to an embedding.

This projection can be fixed (e.g. from a pre-trained model) or trained jointly with the downstream task.

[images credit: visionbook.mit.edu]

Example: chop a 100-by-100 image into 20-by-20 patches, giving a sequence of \(n = 25\) tokens. Suppose we just flatten each patch: each token is in \(\mathbb{R}^{400}\). A projection (e.g. via a fixed or learned linear transformation) then maps each token to an embedding in \(\mathbb{R}^{d}\).
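The chop-and-flatten strategy for the 100-by-100 image, as a sketch (the projection here is random, standing in for a fixed or learned one, with an assumed \(d = 64\)):

```python
import numpy as np

image = np.arange(100 * 100, dtype=float).reshape(100, 100)  # stand-in image
P = 20                                   # patch size

# chop into a 5x5 grid of 20x20 patches, then flatten each patch
patches = (image.reshape(5, P, 5, P)
                .transpose(0, 2, 1, 3)
                .reshape(25, P * P))
print(patches.shape)                     # (25, 400): n = 25 tokens in R^400

# a linear projection maps each token to an embedding in R^d
d = 64
W = np.random.default_rng(0).normal(size=(P * P, d))
tokens = patches @ W                     # (25, 64)
```

From here on, the image patches are just another token sequence, and the transformer machinery applies unchanged.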

Multi-modality (text + image)

  • the query and the (key, value) pairs come from different input modalities
  • this is called cross-attention


Image/video credit: RFDiffusion https://www.bakerlab.org

[“DINO”, Caron et al. 2021]

Success modes and failure modes of attention-based captioning:

[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Xu et al. CVPR (2016)]

Summary

  • Transformers combine many of the best ideas from earlier architectures (convolutional patch-wise processing, ReLU nonlinearities, residual connections) with several new innovations, in particular embedding and attention layers.
  • Transformers start with some generic hard-coded embeddings and, layer by layer, create better and better embeddings.
  • Everything in attention is processed in parallel: each head is processed in parallel, and within each head, the \(q, k, v\) token sequences are created in parallel, the attention scores are computed in parallel, and the attention outputs are computed in parallel.

Thanks!

for your attention!

We'd love to hear your thoughts.

6.390 IntroML (Spring25) - Lecture 9 Transformers

By Shen Shen
