Lecture 9: Transformers

 

Shen Shen

Oct 30, 2025

11am, Room 10-250

Interactive Slides and Lecture Recording

Intro to Machine Learning

Outline

  • Transformers high-level intuition and architecture 
  • Attention mechanism
  • Multi-head attention
  • (Applications)

[video edited from 3b1b]

Recap: Word embedding

this enables "soft" dictionary look-up:

dict_en2fr = { 
  "apple" : "pomme",
  "banana": "banane", 
  "lemon" : "citron"}

A good word-embedding space is equipped with semantically meaningful vector arithmetic

[diagram: the dictionary viewed as a Key : Value table (apple : pomme, banana : banane, lemon : citron)]

dict_en2fr = { 
  "apple" : "pomme",
  "banana": "banane", 
  "lemon" : "citron"}

query = "orange" 
output = dict_en2fr[query]

Python would complain. 🤯

[diagram: Query "orange" against Keys apple, banana, lemon; Output = ???]

But we can probably see the rationale behind something like this:

[diagram: Query "orange" compared with Keys apple, banana, lemon; Output = 0.1 pomme + 0.1 banane + 0.8 citron]

We put (query, key, value) into "good" embeddings in our human brain, such that mixing the values via these mixing percentages [0.1, 0.1, 0.8] made sense.


very roughly, the attention mechanism in transformers automates this process. 

a. compare query and key for merging percentages:

[diagram: softmax( dot-product similarities of "orange" with apple, banana, lemon ) = [0.1, 0.1, 0.8] ]

b. then output mixed values:

[diagram: output = 0.1 pomme + 0.1 banane + 0.8 citron]


Let's see how this intuition becomes a trainable mechanism.
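As a warm-up, here is a minimal Python sketch of this two-step soft lookup, with made-up toy embeddings (the vectors and resulting percentages are purely illustrative; the real mechanism will learn its embeddings and projections):

import numpy as np

# toy, hand-picked 2-d embeddings (an assumption for illustration only)
emb = {
    "apple":  np.array([1.0, 0.1]),
    "banana": np.array([0.2, 1.0]),
    "lemon":  np.array([0.9, 0.9]),
    "orange": np.array([1.0, 0.8]),   # the query word
}
dict_en2fr = {"apple": "pomme", "banana": "banane", "lemon": "citron"}

keys = list(dict_en2fr.keys())
q = emb["orange"]

# a. compare query and keys: dot-product similarity, then softmax -> mixing percentages
scores = np.array([q @ emb[k] for k in keys])
weights = np.exp(scores) / np.exp(scores).sum()

# b. output the mixed values (here, just report the mixture over the French words)
for k, w in zip(keys, weights):
    print(f"{w:.2f} * {dict_en2fr[k]}")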


Outline

  • Transformers high-level intuition and architecture 
  • Attention mechanism
  • Multi-head attention
  • (Applications)

Large Language Models (LLMs) are trained in this self-supervised way

  • Scrape the internet for plain texts.
  • Cook up “labels” (prediction targets) from these texts.
  • Convert “unsupervised” problem into “supervised” setup.

"To date, the cleverest thinker of all time was Issac. "

feature → label

To date, the → cleverest
To date, the cleverest → thinker
To date, the cleverest thinker → was
\(\dots\)
To date, the cleverest thinker of all time was → Isaac

auto-regressive prediction
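A minimal sketch of how such (feature, label) pairs can be cooked up from plain text, assuming word-level tokenization purely for illustration (real systems use subword tokenizers):

text = "To date, the cleverest thinker of all time was Isaac."
tokens = text.split()          # word-level "tokenization", just for illustration

# each prefix becomes a feature; the token that follows it becomes the label
pairs = [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

for feature, label in pairs:
    print(f"{feature!r}  ->  {label!r}")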

[video edited from 3b1b]

[diagram: input embedding: the sequence becomes \(n\) vectors, each of dimension \(d\)]

[video edited from 3b1b]

Cross-entropy loss encourages the internal weights to update so as to push this probability higher.

[video edited from 3b1b]

Transformer

"To date, the cleverest [thinker] of all time was Issac.

push for Prob("date") to be high

push for Prob("the") to be high

push for Prob("cleverest") to be high

push for Prob("thinker") to be high


distribution over the vocabulary

[diagram: tokens ("To", "date", "the", "cleverest", \(\dots\)) pass through an input embedding, then \(L\) transformer blocks, then an output embedding]

A sequence of \(n\) tokens, each token in \(\mathbb{R}^{d}\)


Each of the \(n\) tokens is transformed, block by block, within a shared \(d\)-dimensional word-embedding space.

[diagram: each transformer block contains an attention layer followed by an MLP; all weights (\(W_q, W_k, W_v, W^o\), neuron weights, \(\dots\)) are updated with the loss gradient \(\nabla\mathcal{L}\)]

[diagram: inside the attention layer: a \((qkv)\) projection (\(W_q, W_k, W_v\)), the attention mechanism, and an output projection (\(W^o\))]

Most important bits in an attention layer:

  1. (query, key, value) projection 
  2. attention mechanism

Why learn these projections?

  • \(W_q\) learns how to ask
  • \(W_k\) learns how to listen
  • \(W_v\) learns how to speak

With learned projections, we frame \(x\) into:

  • a query, to ask the questions
  • a key, to be compared against
  • a value, to contribute
1. (query, key, value) projection

[diagram: each \(x\) is projected by \(W_q, W_k, W_v\) into \(q, k, v\)]

  • \(W_q, W_k, W_v\), all in \(\mathbb{R}^{d \times d_k}\)
  • project the \(d\)-dimensional word-embedding space to \(d_k\)-dimensional (\(qkv\)) space (typically \(d_k < d\))
  • the query \(q_i = W_q^Tx_i, \forall i;\) similar weight sharing for keys and values
  • parallel and structurally identical processing
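A minimal numpy sketch of this projection step, with random stand-ins for the learned weights (the shapes are the only point here):

import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3                       # 4 tokens, embedding dim 8, (qkv) dim 3

x = rng.standard_normal((n, d))           # one row per token x_i
W_q, W_k, W_v = (rng.standard_normal((d, d_k)) for _ in range(3))

# the same weights are applied to every token: q_i = W_q^T x_i, etc.
q = x @ W_q                               # (n, d_k)
k = x @ W_k                               # (n, d_k)
v = x @ W_v                               # (n, d_k)
print(q.shape, k.shape, v.shape)          # (4, 3) (4, 3) (4, 3)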

2. Attention mechanism

[diagram: the attention mechanism takes each token's \((q_i, k_i, v_i)\) and produces \(z_1, \dots, z_4\)]
  • Attention mechanism turns the projected \((q,k,v)\) into \(z\)
  • Each \(z\) is context-aware: a mixture of everyone's values, weighted by relevance

Outline

  • Transformers high-level intuition and architecture 
  • Attention mechanism
  • Multi-head attention
  • (Applications)
attention mechanism

[diagram: from the projected \((q_i, k_i, v_i)\) of the tokens "To date the cleverest", the attention mechanism produces \(z_1, z_2, z_3, z_4\)]

For each query \(i\), compare it with every key to get the mixing weights:

\([a_{i1}, a_{i2}, a_{i3}, a_{i4}] = \text{softmax}\big([q_i^T k_1, q_i^T k_2, q_i^T k_3, q_i^T k_4] / \sqrt{d_k}\big)\)

then output the attention-weighted mixture of values:

\(z_i = a_{i1} v_1 + a_{i2} v_2 + a_{i3} v_3 + a_{i4} v_4 \in \mathbb{R}^{d_k}\)

parallel and structurally identical processing

can calculate \(z_4\) without \(z_3\)


Attention head

maps sequence of \(x\) to sequence of \(z\):

1. (query, key, value) projection

2. attention mechanism

parallel and structurally identical processing

Attention head - compact matrix form

By stacking each individual vector in the sequence as a row of \(X\):

\(Q = XW_q\)
\(K = XW_k\)
\(V = XW_v\)
\(A = \text{softmax}_{\text{row}}(QK^T/\sqrt{d_k})\)
\(Z = AV\)

Stack each token as a row in the input: a sequence of \(n\) tokens, each token in \(\mathbb{R}^{d}\), gives \(X \in \mathbb{R}^{n \times d}\).

1. (query, key, value) projection

with \(W_q, W_k, W_v \in \mathbb{R}^{d \times d_k}\):

\(Q = XW_q \in \mathbb{R}^{n \times d_k}\)
\(K = XW_k \in \mathbb{R}^{n \times d_k}\)
\(V = XW_v \in \mathbb{R}^{n \times d_k}\)

2a. dot-product similarity

compare \(q_i\) and \(k_j\) (e.g. \(q_1^{T}k_3\), \(q_3^{T}k_4\)); assemble the \(n \times n\) similarities so rows correspond to queries:

\(QK^T \in \mathbb{R}^{n \times n}\)

2b. attention matrix

\(A = \text{softmax}_{\text{row}}(QK^T/\sqrt{d_k}) \in \mathbb{R}^{n \times n}\)

each row sums up to 1

2c. attention-weighted values \(Z\)

\(Z = AV \in \mathbb{R}^{n \times d_k}\)

each row of \(A\) mixes the values: \(z_i = \sum_j a_{ij} v_j \in \mathbb{R}^{d_k}\)
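Putting the pieces together, a minimal numpy sketch of a single attention head in this compact matrix form (random stand-ins for the learned weights; small assumed sizes):

import numpy as np

def softmax_row(s):
    # row-wise softmax: each row of the result sums to 1
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    d_k = W_q.shape[1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v         # 1. (qkv) projection, each (n, d_k)
    A = softmax_row(Q @ K.T / np.sqrt(d_k))     # 2a + 2b: attention matrix, (n, n)
    return A @ V                                # 2c: Z = AV, (n, d_k)

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d_k)) for _ in range(3))
Z = attention_head(X, W_q, W_k, W_v)
print(Z.shape)                                  # (4, 3)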


Outline

  • Transformers high-level intuition and architecture 
  • Attention mechanism
  • Multi-head attention
  • (Applications)
Multi-head Attention

[diagram: \(H\) attention heads run in parallel on the same input sequence \(x_1, \dots, x_4\); head \(h\) has its own \(W^h_q, W^h_k, W^h_v\) and produces its own \(z^h_1, \dots, z^h_4\) (superscript: index along heads; subscript: index along sequence)]

In particular, each head:

  • learns its own set of \(W_q, W_k, W_v\)
  • creates its own projected sequence of \((q,k,v)\)
  • computes its own sequence of \(z\)
  • processes each token in the sequence in a structurally identical way

Parallel, and structurally identical processing across all heads and tokens.
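A minimal numpy sketch of a full multi-head attention layer along these lines (random stand-ins for the learned weights; written to be self-contained rather than efficient):

import numpy as np

def softmax_row(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq_heads, Wk_heads, Wv_heads, W_o):
    # Wq_heads, Wk_heads, Wv_heads: lists of per-head (d, d_k) matrices; W_o: (H*d_k, d)
    Z_heads = []
    for W_q, W_k, W_v in zip(Wq_heads, Wk_heads, Wv_heads):   # each head: same structure
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax_row(Q @ K.T / np.sqrt(W_q.shape[1]))
        Z_heads.append(A @ V)                                 # Z^h: (n, d_k)
    Z_cat = np.concatenate(Z_heads, axis=-1)                  # concatenated per token: (n, H*d_k)
    return Z_cat @ W_o                                        # attention layer output: (n, d)

rng = np.random.default_rng(0)
n, d, d_k, H = 4, 8, 3, 2
X = rng.standard_normal((n, d))
Wq_heads = [rng.standard_normal((d, d_k)) for _ in range(H)]
Wk_heads = [rng.standard_normal((d, d_k)) for _ in range(H)]
Wv_heads = [rng.standard_normal((d, d_k)) for _ in range(H)]
W_o = rng.standard_normal((H * d_k, d))
print(multi_head_attention(X, Wq_heads, Wk_heads, Wv_heads, W_o).shape)   # (4, 8)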


[diagram: the \(H\) heads' outputs are concatenated token by token: concatenated \(z_1\), concatenated \(z_2\), concatenated \(z_3\), concatenated \(z_4\); each concatenated \(z_i \in \mathbb{R}^{Hd_k}\)]

attention output projection

[diagram: \(W^o \in \mathbb{R}^{Hd_k\times d}\) maps each concatenated \(z_i\) to \(z_{\text{out}_i}\), all in \(\mathbb{R}^{d}\); this completes the attention layer]

Shape Example

num tokens \(n\) = 2, input token dim \(d\) = 4, \((qkv)\) embedding dim \(d_k\) = 3, num heads \(H\) = 5

for a single attention head:

  input                  \(X\)       \(n \times d\)      \(2 \times 4\)
  query proj (learned)   \(W_q^h\)   \(d \times d_k\)    \(4 \times 3\)
  key proj (learned)     \(W_k^h\)   \(d \times d_k\)    \(4 \times 3\)
  value proj (learned)   \(W_v^h\)   \(d \times d_k\)    \(4 \times 3\)
  query                  \(Q^h\)     \(n \times d_k\)    \(2 \times 3\)
  key                    \(K^h\)     \(n \times d_k\)    \(2 \times 3\)
  value                  \(V^h\)     \(n \times d_k\)    \(2 \times 3\)
  attn matrix            \(A^h\)     \(n \times n\)      \(2 \times 2\)
  attn head out          \(Z^h\)     \(n \times d_k\)    \(2 \times 3\)

combining the heads:

  multi-head out         \(\text{concat}(Z^1 \dots Z^H)\)   \(n \times Hd_k\)   \(2 \times 15\)
  output proj (learned)  \(W^o\)              \(Hd_k \times d\)   \(15 \times 4\)
  attn layer out         \(Z_{\text{out}}\)   \(n \times d\)      \(2 \times 4\)

Some practical techniques commonly needed when training auto-regressive transformers:

  • masking
  • layer normalization
  • residual connections
  • positional encoding
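For instance, a minimal sketch of the masking step for auto-regressive training (an assumed but common convention: add \(-\infty\) to the disallowed scores before the softmax, so each token only attends to itself and earlier tokens):

import numpy as np

def softmax_row(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.random.default_rng(0).standard_normal((n, n))   # stand-in for QK^T / sqrt(d_k)

# causal mask: -inf strictly above the diagonal, so token i cannot attend to tokens j > i
mask = np.triu(np.full((n, n), -np.inf), k=1)
A = softmax_row(scores + mask)
print(np.round(A, 2))    # upper triangle is 0; each row still sums to 1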

Outline

  • Transformers high-level intuition and architecture 
  • Attention mechanism 
  • Multi-head attention
  • (Applications)

image credit: Nicholas Pfaff 

Generative Boba by Boyuan Chen in Bldg 45

😉


Transformers in Action: Performance across domains


We can tokenize anything.
General strategy: chop the input up into chunks and project each chunk to an embedding.

[images credit: visionbook.mit.edu]
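A minimal sketch of that strategy for an image (ViT-style patching; the sizes and the random projection are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
h, w, c, p, d = 32, 32, 3, 8, 16          # a 32x32x3 image, 8x8 patches, embedding dim 16

img = rng.standard_normal((h, w, c))

# chop the image into non-overlapping p x p patches and flatten each one
patches = np.stack([img[i:i+p, j:j+p].reshape(-1)     # each patch: p*p*c numbers
                    for i in range(0, h, p)
                    for j in range(0, w, p)])          # (num_patches, p*p*c) = (16, 192)

# project each patch ("token") into the d-dimensional embedding space
W_embed = rng.standard_normal((p * p * c, d))
tokens = patches @ W_embed                             # (16, 16): a sequence of 16 tokens
print(tokens.shape)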

Multi-modality (image Q&A)

  • (query, key, value) come from different input modalities
  • cross-attention

[images credit: visionbook.mit.edu]

image classification (done in the contrastive way)

[Radford et al., Learning Transferable Visual Models From Natural Language Supervision, ICML 2021]

[“DINO”, Caron et al., 2021]

Success mode:


[Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015]

Failure mode:

[Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015]

Success or failure mode?

Summary

  • Transformers combine many of the best ideas from earlier architectures (patch-wise parallel processing as in convolution, ReLU nonlinearities, residual connections) with several new innovations, in particular embedding and attention layers.
  • Transformers start with generic, hard-coded embeddings and, block by block, create better and better embeddings.
  • Everything in attention is processed in parallel:
    • each head is processed in parallel
    • within each head, the \(q,k,v\) token sequence is created in parallel
    • the attention scores are computed in parallel
    • the attention output is computed in parallel.

Thanks!

for your attention!