Shen Shen
April 5, 2024
(many slides adapted from Phillip Isola and Kaiming He)
[Figure: an input image as a tensor with image height, image width, and image channels (red, green, blue)]
[Figure: one filter convolved with the input tensor produces one output]
[Figure: multiple filters produce multiple outputs, which stack into the output tensor]
[image credit: medium]
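A minimal sketch of the convolution these figures depict, with made-up sizes (a 3-channel input and a handful of filters), plain NumPy loops, and "valid" convolution:

import numpy as np

# Illustrative sizes (not from the slides): a 3-channel input image and 4 filters.
H, W, C = 32, 32, 3            # image height, image width, image channels (red, green, blue)
k, num_filters = 5, 4          # filter size and number of filters

input_tensor = np.random.randn(H, W, C)
filters = np.random.randn(num_filters, k, k, C)

# Each filter spans the full depth of the input and yields one 2-D output;
# stacking the outputs of all filters gives the output tensor.
out_H, out_W = H - k + 1, W - k + 1            # "valid" convolution, stride 1, no padding
output_tensor = np.zeros((out_H, out_W, num_filters))
for f in range(num_filters):
    for i in range(out_H):
        for j in range(out_W):
            patch = input_tensor[i:i + k, j:j + k, :]
            output_tensor[i, j, f] = np.sum(patch * filters[f])

print(output_tensor.shape)      # (28, 28, 4): one output channel per filter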
Lessons from CNNs
Enduring principles:
Transformers follow the same principles:
1. via tokenization
2. via attention mechanism
(conceptually: transformers are like CNNs where the filter weights -- here, the attention weights -- change dynamically depending on the patch)
- \(d\) is the size of each token (\(x^{(i)} \in \mathbb{R}^{d}\))
- \(n\) is the number of tokens
Let's start by thinking about dictionary look-up:
dict_fr2en = {
"pomme": "apple",
"banane": "banana",
"citron": "lemon"
}
query = "citron"
output = dict_fr2en[query]
What if we'd like to run:
query = "orange"
output = dict_fr2en[query]
Python would complain (KeyError: 'orange').
But you might see the rationale of a "soft" look-up:
output = 0.8 * "lemon" + 0.1 * "apple" + 0.1 * "banana"
This is actually one way of understanding "attention".
With a sensible "abstraction/embedding" of the words, this weighted combination is meaningful (though Python would still complain).
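A minimal sketch of this "soft" look-up, assuming made-up 2-D embeddings for the keys and the query, dot-product similarity, and a softmax to turn scores into weights (all numbers are illustrative, not from the slides):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Hypothetical 2-D embeddings for the keys; in practice these would be learned.
key_embeddings = {
    "pomme":  np.array([1.0, 0.1]),
    "banane": np.array([0.9, -0.2]),
    "citron": np.array([-0.8, 1.0]),
}
values = ["apple", "banana", "lemon"]

# "orange" is not a key, but suppose its embedding lies closest to "citron".
query_embedding = np.array([-0.6, 0.9])

keys = np.stack(list(key_embeddings.values()))      # shape (3, 2)
scores = keys @ query_embedding                      # similarity with each key
weights = softmax(scores)                            # attention weights, sum to 1

# The "soft" output is a weighted combination; since we can't literally average
# strings (Python would still complain), we just report the weights.
for v, w in zip(values, weights):
    print(f"{w:.2f} * {v!r}")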
Single-query example:
1. Similarity score w/ key \(j\):
2. Attention weights (softmax'd scores):
3. Output: attention-weighted sum:
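A small numeric sketch of these three steps for a single query, assuming dot-product similarity; the query, keys, and values below are made up for illustration:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

q = np.array([1.0, 0.0])                         # the single query
K = np.array([[1.0, 0.0],                        # key 1
              [0.0, 1.0],                        # key 2
              [0.7, 0.7]])                       # key 3
V = np.array([[10.0, 0.0],                       # value 1
              [0.0, 10.0],                       # value 2
              [5.0, 5.0]])                       # value 3

s = K @ q                 # step 1: similarity score with each key j
a = softmax(s)            # step 2: attention weights
y = a @ V                 # step 3: attention-weighted sum of the values

print(s, a, y)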
1. Similarity score of query \(i\) with key \(j\):
2. Attention weights (softmax'd scores):
3. Output: attention-weighted sum:
Multi-query example:
For each query \(i\), \(a_i = \text{softmax}([s_{i1}, s_{i2}, s_{i3}, \ldots, s_{i n_k}])\)
Stack all such \(a_i\) vertically
For each query \(i\), \(y_i = \sum\nolimits_j a_{ij} v_{j}\)
Stack all such \(y_i\) vertically
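The same three steps for all queries at once, in matrix form; a sketch assuming plain (unscaled) dot-product scores, with queries, keys, and values stacked as rows:

import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n_q, n_k, d = 4, 3, 2                      # illustrative sizes
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_q, d))              # one query per row
K = rng.normal(size=(n_k, d))              # one key per row
V = rng.normal(size=(n_k, d))              # one value per row

S = Q @ K.T                                # S[i, j] = similarity of query i and key j
A = softmax_rows(S)                        # row i is a_i = softmax([s_i1, ..., s_in_k])
Y = A @ V                                  # row i is y_i = sum_j a_ij v_j

print(Y.shape)                             # (4, 2): all outputs y_i stacked vertically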
Rather than having just one way of attending, why not have multiple?
Repeat in parallel
One head
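A sketch of "repeat in parallel": each head applies its own (here random, in practice learned) projections to the same input, attends in its own way, and the heads' outputs are concatenated; the sizes are illustrative assumptions:

import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def one_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # per-head queries, keys, values
    A = softmax_rows(Q @ K.T)              # this head's own way of attending
    return A @ V

n, d, d_head, num_heads = 5, 8, 4, 2       # illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                # input token sequence

heads = []
for _ in range(num_heads):                 # each head has its own projection weights
    Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
    heads.append(one_head(X, Wq, Wk, Wv))

Y = np.concatenate(heads, axis=1)          # concatenate the heads' outputs
print(Y.shape)                             # (5, 8)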
[Figure: the example sentence 命運我操縱 ("my fate, I control") is tokenized; each input token is mapped, via a learned projection, to query, key, and value token sequences that feed one attention head]
Take the 3rd input token as an example: how do we get the 3rd output token?
[Figure: the 3rd input token's query attends to all keys; a softmax over the similarity scores gives attention weights, and the weighted sum of the values yields the 3rd output token; the queries, keys, and values are all obtained via learned projection weights]
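A sketch of this pipeline for the 5-token example: tokenize, project each input token into a query, key, and value with learned weights (random stand-ins here), then get the 3rd output token by letting the 3rd query attend over all keys and values:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

tokens = ["命", "運", "我", "操", "縱"]    # tokenization of the example sentence
n, d = len(tokens), 8                      # illustrative token dimension
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))                # input token sequence (stand-in embeddings)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # learned projection weights

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # query, key, value token sequences

# 3rd output token: the 3rd query attends to all keys, then takes a
# softmax-weighted sum of all values.
i = 2                                      # index of the 3rd token
a_i = softmax(Q[i] @ K.T)                  # attention weights over the 5 tokens
y_i = a_i @ V                              # the 3rd output token
print(y_i.shape)                           # (8,)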
Some other ideas commonly used in practice:
Causal self-attention (via masking; see the sketch after this list)
All learnable parameters are in the projection weights
Multi-modality (text + image)
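A sketch of causal self-attention via masking: before the softmax, scores for positions \(j > i\) are set to \(-\infty\), so token \(i\) only attends to tokens at or before it (toy sizes and random tensors, for illustration only):

import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n, d = 5, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

S = Q @ K.T                                        # similarity scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
S = np.where(mask, -np.inf, S)                     # token i cannot see tokens j > i

A = softmax_rows(S)                                # rows still sum to 1
Y = A @ V
print(np.round(A, 2))                              # lower-triangular attention weights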
Success mode:
[“DINO”, Caron et al., 2021]
Failure mode:
We'd love for you to share some lecture feedback.