[video edited from 3b1b]
Recap: Word embedding
A good word-embedding space is equipped with semantically meaningful vector arithmetic.

This enables "soft" dictionary look-up:

dict_en2fr = {
    "apple" : "pomme",
    "banana": "banane",
    "lemon" : "citron"}
| Key | Value |
| --- | --- |
| apple | pomme |
| banana | banane |
| lemon | citron |
dict_en2fr = {
    "apple" : "pomme",
    "banana": "banane",
    "lemon" : "citron"}
query = "orange"
output = dict_en2fr[query]  # raises KeyError: 'orange'

Python would complain. 🤯
[diagram: the query "orange" is looked up against the Key → Value table (apple → pomme, banana → banane, lemon → citron); Output: ???]
But we can probably see the rationale behind something like this:
[diagram: soft look-up for the query "orange": Output = 0.1 · pomme + 0.1 · banane + 0.8 · citron]

We put (query, key, value) in "good" embeddings in our human brain, such that mixing the values via these mixing percentages [0.1, 0.1, 0.8] made sense.
Very roughly, the attention mechanism in transformers automates this process:
a. Compare query and key to get the merging percentages: dot-product similarity, followed by a softmax, giving [0.1, 0.1, 0.8].
b. Then output the mixed values: 0.1 · pomme + 0.1 · banane + 0.8 · citron.
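Here is a minimal numpy sketch of this two-step recipe. The embedding vectors below are made up purely for illustration (real word embeddings are learned); they are only chosen so that "orange" lands closest to "lemon".

```python
import numpy as np

# Toy embeddings, invented for illustration: "orange" points mostly in the "lemon" direction.
emb = {
    "apple":  np.array([1.0, 0.2, 0.0]),
    "banana": np.array([0.9, 0.5, 0.1]),
    "lemon":  np.array([0.1, 0.0, 1.5]),
    "orange": np.array([0.2, 0.0, 1.5]),
    "pomme":  np.array([1.0, 0.0, 0.0]),
    "banane": np.array([0.9, 0.2, 0.0]),
    "citron": np.array([0.0, 0.0, 1.5]),
}

keys   = ["apple", "banana", "lemon"]
values = ["pomme", "banane", "citron"]

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def soft_lookup(query_word):
    q = emb[query_word]
    K = np.stack([emb[k] for k in keys])     # key vectors, one per row
    V = np.stack([emb[v] for v in values])   # value vectors, one per row
    # a. compare query and key for the merging percentages
    weights = softmax(K @ q)                 # dot-product similarity -> softmax
    # b. then output the mixed values
    return weights, weights @ V              # weighted sum of the value vectors

w, out = soft_lookup("orange")
print(dict(zip(keys, np.round(w, 2))))   # roughly {apple: 0.1, banana: 0.1, lemon: 0.8}
print(out)                               # a blend dominated by the "citron" vector
```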
Let's see how this intuition becomes a trainable mechanism.
Large Language Models (LLMs) are trained in this self-supervised way:
"To date, the cleverest thinker of all time was Isaac."
| feature | label |
| --- | --- |
| To date, the | cleverest |
| To date, the cleverest | thinker |
| To date, the cleverest thinker of all time | was |
| To date, the cleverest thinker of all time was | Isaac |
auto-regressive prediction
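A small Python sketch of how such (feature, label) pairs can be generated from raw text; splitting on whitespace is a stand-in for a real subword tokenizer.

```python
# Build auto-regressive (feature, label) pairs from a sentence.
# Whitespace tokenization is a simplification; real LLMs use subword tokenizers.
sentence = "To date, the cleverest thinker of all time was Isaac."
tokens = sentence.split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for feature, label in pairs[:4]:
    print(" ".join(feature), "->", label)
# To -> date,
# To date, -> the
# To date, the -> cleverest
# To date, the cleverest -> thinker
```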
[video edited from 3b1b]

Cross-entropy loss encourages the internal weights to update in a way that pushes this probability higher.
Transformer
"To date, the cleverest [thinker] of all time was Issac.
push for Prob("date") to be high
push for Prob("the") to be high
push for Prob("cleverest") to be high
push for Prob("thinker") to be high
distribution over the vocabulary
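A numpy sketch of that training signal at a single position, with a made-up five-word vocabulary and made-up logits: the cross-entropy loss is low exactly when the probability assigned to the correct next token is high.

```python
import numpy as np

# Toy vocabulary and invented logits for the position after "To date, the cleverest";
# the correct next token is "thinker".
vocab  = ["the", "thinker", "was", "cleverest", "Isaac"]
logits = np.array([1.0, 2.5, 0.3, -1.0, 0.7])     # the model's raw scores (illustrative)

probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: distribution over the vocabulary

target = vocab.index("thinker")
loss = -np.log(probs[target])                     # cross-entropy at this position

print(np.round(probs, 3), round(loss, 3))
# Lowering the loss means raising probs[target]: training pushes Prob("thinker") higher.
```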
[diagram: the tokens "To", "date", "the", "cleverest", … pass through the input embedding, then \(L\) transformer blocks, then the output embedding]

A sequence of \(n\) tokens, each token in \(\mathbb{R}^{d}\).
Each of the \(n\) tokens is transformed, block by block, within a shared \(d\)-dimensional word-embedding space.
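A schematic numpy version of this pipeline, with made-up sizes and random weights: token ids → input embedding → \(L\) transformer blocks → a readout to vocabulary logits. The `transformer_block` function below is a do-nothing placeholder (its real contents, an attention layer plus an MLP, are described next), and the final unembedding matrix is a standard readout assumption rather than something specified above.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d, L, n = 1000, 16, 3, 4          # small, invented sizes
W_embed   = rng.normal(size=(vocab_size, d))  # input embedding table
W_unembed = rng.normal(size=(d, vocab_size))  # maps final embeddings to vocab logits

def transformer_block(X):
    # Placeholder: a real block applies an attention layer and an MLP to X.
    return X

token_ids = rng.integers(0, vocab_size, size=n)   # e.g. "To date the cleverest" -> 4 ids
X = W_embed[token_ids]                            # (n, d): n tokens, each in R^d

for _ in range(L):                                # L blocks, applied in sequence
    X = transformer_block(X)

logits = X @ W_unembed                            # (n, vocab_size)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)        # per-token distribution over the vocabulary
print(X.shape, probs.shape)                       # (4, 16) (4, 1000)
```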
[diagram: a transformer block = attention layer + MLP]

[diagram: an attention layer = \((qkv)\) projection → attention mechanism → output projection]
Most important bits in an attention layer:
1. (query, key, value) projection
2. Attention mechanism

Why learn these projections? With learned projections, we frame \(x\) into its (query, key, value) roles: \(q = x W_q\), \(k = x W_k\), \(v = x W_v\).
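A tiny numpy sketch of step 1 for a single token \(x\), treating \(x\) as a row vector so that \(q = x W_q\); the projection matrices are random stand-ins for the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 4, 3                       # token dim and per-head embedding dim

x = rng.normal(size=d)              # one token embedding, x in R^d

# Learned in practice; random here just to show the shapes.
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_k))

q, k, v = x @ W_q, x @ W_k, x @ W_v   # each in R^{d_k}
print(q.shape, k.shape, v.shape)      # (3,) (3,) (3,)
```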
[diagram: the attention mechanism applied to the tokens "To", "date", "the", "cleverest": each token's query is compared with all the keys, a softmax turns the similarities into mixing weights, and the values are mixed. Parallel and structurally identical processing: \(z_4\) can be calculated without \(z_3\).]
An attention head maps a sequence of \(x\) to a sequence of \(z\):
1. (query, key, value) projection
2. attention mechanism
Parallel and structurally identical processing across tokens.

Attention head - compact matrix form: stack each individual vector in the sequence as a row.
A sequence of \(n\) tokens, each token in \(\mathbb{R}^{d}\): stack each token as a row of the input matrix \(X \in \mathbb{R}^{n \times d}\).
1. (query, key, value) projection: \(Q = XW_q\), \(K = XW_k\), \(V = XW_v\).
2a. Dot-product similarity: compare \(q_i\) and \(k_j\), and assemble the \(n \times n\) similarities so that rows correspond to queries.
2b. Attention matrix \(A\): apply a softmax to each row, so that each row sums up to 1.
2c. Attention-weighted values: \(Z = AV\).
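Putting steps 1 and 2a-2c together for a single head, as a numpy sketch. The division by \(\sqrt{d_k}\) inside the softmax is the standard scaled dot-product convention rather than something derived above; the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3                    # sequence length, token dim, head dim

X = rng.normal(size=(n, d))            # each token stacked as a row

# 1. (query, key, value) projection (learned in practice; random here)
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v    # each (n, d_k)

# 2a. dot-product similarity: rows correspond to queries, columns to keys
S = Q @ K.T / np.sqrt(d_k)             # (n, n)

# 2b. attention matrix A: softmax over each row, so each row sums to 1
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

# 2c. attention-weighted values
Z = A @ V                              # (n, d_k)

print(A.sum(axis=-1))                  # all ones
print(Z.shape)                         # (4, 3)
```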
In particular, each head runs the same computation with its own \((qkv)\) projections: parallel and structurally identical processing across all heads and tokens.
Multi-head Attention
Notation: the superscript indexes along heads; the subscript indexes along the sequence.

[diagram: \(H\) attention heads run in parallel on the same input sequence; their per-token outputs are concatenated, each concatenated \(z_i \in \mathbb{R}^{Hd_k}\)]

The attention output projection then maps each concatenated \(z_i\) back to \(\mathbb{R}^{d}\), so the attention layer's outputs are all in \(\mathbb{R}^{d}\).
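A numpy sketch of the whole attention layer under the same conventions: \(H\) heads run in parallel, their per-token outputs are concatenated, and the output projection maps back to \(\mathbb{R}^{d}\). All weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k, H = 4, 8, 3, 5                 # tokens, token dim, head dim, num heads

X = rng.normal(size=(n, d))

def softmax_rows(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))
    return A @ V                           # (n, d_k)

# Per-head projections W_q^h, W_k^h, W_v^h, plus the output projection W^o.
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(H)]
W_o = rng.normal(size=(H * d_k, d))

Z_heads = [attention_head(X, *W) for W in heads]   # H arrays of shape (n, d_k)
Z_cat = np.concatenate(Z_heads, axis=-1)           # (n, H*d_k): concatenated z_i
Z_out = Z_cat @ W_o                                # (n, d): back in R^d

print(Z_cat.shape, Z_out.shape)                    # (4, 15) (4, 8)
```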
Shape Example

| hyperparameter | symbol | example value |
| --- | --- | --- |
| num tokens | $$n$$ | 2 |
| input token dim | $$d$$ | 4 |
| embedding dim (per head) | $$d_k$$ | 3 |
| num heads | $$H$$ | 5 |

| quantity | symbol | shape | example |
| --- | --- | --- | --- |
| query proj (learned, per head) | $$W_q^h$$ | $$d \times d_k$$ | $$4 \times 3$$ |
| key proj (learned, per head) | $$W_k^h$$ | $$d \times d_k$$ | $$4 \times 3$$ |
| value proj (learned, per head) | $$W_v^h$$ | $$d \times d_k$$ | $$4 \times 3$$ |
| output proj (learned) | $$W^o$$ | $$Hd_k \times d$$ | $$15 \times 4$$ |
| input | $$X$$ | $$n \times d$$ | $$2 \times 4$$ |
| query (single head) | $$Q^h$$ | $$n \times d_k$$ | $$2 \times 3$$ |
| key (single head) | $$K^h$$ | $$n \times d_k$$ | $$2 \times 3$$ |
| value (single head) | $$V^h$$ | $$n \times d_k$$ | $$2 \times 3$$ |
| attn matrix (single head) | $$A^h$$ | $$n \times n$$ | $$2 \times 2$$ |
| attn head out (single head) | $$Z^h$$ | $$n \times d_k$$ | $$2 \times 3$$ |
| multi-head out | $$\text{concat}(Z^1 \dots Z^H)$$ | $$n \times Hd_k$$ | $$2 \times 15$$ |
| attn layer out | $$Z_{\text{out}}$$ | $$n \times d$$ | $$2 \times 4$$ |
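A quick numpy check of these example numbers (\(n=2\), \(d=4\), \(d_k=3\), \(H=5\)). For brevity, a single head's output is tiled \(H\) times to stand in for \(H\) distinct heads, and the softmax scaling is omitted; the assertions mirror the shape column.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k, H = 2, 4, 3, 5

X = rng.normal(size=(n, d))                         # input: 2 x 4
W_q = rng.normal(size=(d, d_k))                     # per-head projection: 4 x 3
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_k))
W_o = rng.normal(size=(H * d_k, d))                 # output projection: 15 x 4

Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # each 2 x 3
A = np.exp(Q @ K.T)                                 # attention matrix: 2 x 2
A /= A.sum(axis=-1, keepdims=True)
Z = A @ V                                           # head out: 2 x 3

Z_cat = np.concatenate([Z] * H, axis=-1)            # stand-in for H distinct heads: 2 x 15
Z_out = Z_cat @ W_o                                 # attn layer out: 2 x 4

assert Q.shape == (2, 3) and A.shape == (2, 2) and Z.shape == (2, 3)
assert Z_cat.shape == (2, 15) and Z_out.shape == (2, 4)
print("all shapes match the table")
```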
Some practical techniques commonly needed when training auto-regressive transformers:
- Masking (see the causal-mask sketch below)
- Layer normalization
- Residual connection
- Positional encoding
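Among these, masking is the one that acts directly on the attention matrix: during auto-regressive training, each position may attend only to itself and earlier positions. A minimal numpy sketch of a causal mask (the specific masking style is an assumption based on standard practice):

```python
import numpy as np

n = 4
S = np.random.default_rng(0).normal(size=(n, n))    # raw query-key similarities

# Causal mask: position i may not attend to positions j > i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
S_masked = np.where(mask, -np.inf, S)                # -inf becomes zero weight after softmax

A = np.exp(S_masked - S_masked.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

print(np.round(A, 2))   # upper triangle is all zeros; each row still sums to 1
```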
image credit: Nicholas Pfaff
Generative Boba by Boyuan Chen in Bldg 45
😉
Transformers in Action: Performance across domains
We can tokenize anything.
General strategy: chop the input up into chunks, and project each chunk to an embedding.
[images credit: visionbook.mit.edu]
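A numpy sketch of that strategy for images, in the spirit of ViT-style patch embedding: chop the image into non-overlapping patches and linearly project each flattened patch to a \(d\)-dimensional token. The image size, patch size, and random projection are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

H_img, W_img, C = 32, 32, 3           # a tiny made-up RGB image
P, d = 8, 16                          # patch size and embedding dim

image = rng.normal(size=(H_img, W_img, C))

# Chop the image into non-overlapping P x P patches and flatten each one.
patches = (image.reshape(H_img // P, P, W_img // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, P * P * C))          # (num_patches, P*P*C)

W_embed = rng.normal(size=(P * P * C, d))         # learned in practice; random here
tokens = patches @ W_embed                        # (num_patches, d): a sequence of tokens

print(patches.shape, tokens.shape)                # (16, 192) (16, 16)
```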
Multi-modality (image Q&A)
[images credit: visionbook.mit.edu]
Image classification (done in a contrastive way)
[Radford et al., Learning Transferable Visual Models From Natural Language Supervision, ICML, 2021]
[“DINO”, Caron et al., 2021]
Success mode:
[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Xu et al. ICML (2015)]
Failure mode:
[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Xu et al. ICML (2015)]
Success or failure mode?
Thank you for your attention!