Transformers
Marcos V. Treviso
Instituto de Telecomunicações
December 19, 2019
Refs
- Attention is all you need: https://arxiv.org/abs/1706.03762
- The illustrated transformer: http://jalammar.github.io/illustrated-transformer/
- The annotated transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
- Łukasz Kaiser's presentation: https://www.youtube.com/watch?v=rBCqOTEfxvg
A high-level look

Encoder-Decoder

[figure: a BiLSTM encoder reads x_1 x_2 … x_n into a context vector; an LSTM decoder generates y_1 y_2 … y_m from it]
Attention

query: q ∈ R^{d_q}    keys: K ∈ R^{n×d_k}    values: V ∈ R^{n×d_v}

1. Compute a score between q and each k_j
   s = score(q, K) ∈ R^n
   - dot-product: k_j^⊤ q   (requires d_q = d_k)
   - bilinear: k_j^⊤ W q,   W ∈ R^{d_k×d_q}
   - additive: v^⊤ tanh(W_1 k_j + W_2 q)
   - neural net: MLP(q, k_j); CNN(q, K); ...

2. Map scores to probabilities
   p = π(s) ∈ △^n
   - softmax: p_j = exp(s_j) / Σ_k exp(s_k)
   - sparsemax: argmin_{p ∈ △^n} ||p − s||_2^2

3. Combine values via a weighted sum
   z = Σ_{i=1}^{n} p_i V_i ∈ R^{d_v}
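A minimal PyTorch sketch of these three steps, using the dot-product scorer and softmax (the function and tensor names are just illustrative):

```python
import torch
import torch.nn.functional as F

def attention(q, K, V):
    """q: (d_k,) query, K: (n, d_k) keys, V: (n, d_v) values -> z: (d_v,)."""
    s = K @ q                   # 1. dot-product score for each key, (n,)
    p = F.softmax(s, dim=-1)    # 2. scores -> probabilities on the simplex, (n,)
    z = p @ V                   # 3. weighted sum of the values, (d_v,)
    return z

q = torch.randn(64)             # query
K = torch.randn(10, 64)         # 10 keys
V = torch.randn(10, 128)        # 10 values
z = attention(q, K, V)          # (128,)
```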
Drawbacks of RNNs
- The sequential mechanism prevents parallelization over time steps
- Long-range dependencies remain hard to capture, despite gating
[figure: sequential processing of x_1 x_2 … x_n; the path between distant positions grows with the sequence length]
Transformer

encode: x_1 x_2 … x_n  →  r_1 r_2 … r_n
decode: r_1 r_2 … r_n  →  y_1 y_2 … y_m
Transformer blocks

The encoder

Self-attention

"The animal didn't cross the street because it was too tired"
Q_j = K_j = V_j ∈ R^d  ⟺  dot-product scorer!
Transformer self-attention

Matrix calculation

Q, K, V ∈ R^{n×d}

S = score(Q, K) ∈ R^{n×n}
P = π(S) ∈ △^{n×n}
Z = P V ∈ R^{n×d}

With the scaled dot-product scorer and softmax:
Z = softmax(QK^⊤ / √d_k) V
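A shape-for-shape sketch of the matrix form above (single sequence, no masking; names are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n, d) -> Z: (n, d)."""
    d_k = Q.size(-1)
    S = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # scores, (n, n)
    P = F.softmax(S, dim=-1)                       # each row lives on the simplex
    Z = P @ V                                      # weighted sums of the values, (n, d)
    return Z

X = torch.randn(5, 512)                     # n = 5 word representations
Z = scaled_dot_product_attention(X, X, X)   # self-attention: Q = K = V = X
```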
Problem of self-attention

- Convolution: a different linear transformation for each relative position
> Allows you to distinguish what information came from where
- Self-attention: a weighted average :(
Fix: multi-head attention
- Multiple attention layers (heads) in parallel
- Each head uses different linear transformations
- Attention layer with multiple “representation subspaces”
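A minimal sketch of this idea (the head split and the single output projection follow the original paper; the class and variable names are just illustrative):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # separate learned projections for queries, keys, values, and the output
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        n_q, n_k = query.size(0), key.size(0)
        # project, then split the model dimension into (n_heads, d_head)
        Q = self.w_q(query).view(n_q, self.n_heads, self.d_head).transpose(0, 1)
        K = self.w_k(key).view(n_k, self.n_heads, self.d_head).transpose(0, 1)
        V = self.w_v(value).view(n_k, self.n_heads, self.d_head).transpose(0, 1)
        # scaled dot-product attention in every head, in parallel
        S = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)   # (heads, n_q, n_k)
        P = S.softmax(dim=-1)
        Z = P @ V                                               # (heads, n_q, d_head)
        # concatenate the heads and mix them with the output projection
        Z = Z.transpose(0, 1).contiguous().view(n_q, -1)        # (n_q, d_model)
        return self.w_o(Z)

x = torch.randn(5, 512)           # a toy sequence of 5 word representations
mha = MultiHeadAttention()
out = mha(x, x, x)                # self-attention with 8 heads, (5, 512)
```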

Multi-head attention

[figure: attention visualizations with 2 heads vs. all heads (8)]
Positional encoding
- A way to account for the order of the words in the sequence

[figure: positional encodings added to the input embeddings]
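The original paper uses fixed sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), added to the input embeddings. A small sketch (the function name is illustrative):

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings, shape (max_len, d_model)."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                    # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

embeddings = torch.randn(100, 512)              # toy input embeddings
x = embeddings + positional_encoding(100, 512)  # added, not concatenated
```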
Residuals & LayerNorm

[figure: each sub-layer is wrapped with a residual connection followed by layer normalization]
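In the paper, every sub-layer (self-attention or feed-forward) is wrapped as LayerNorm(x + Sublayer(x)), with dropout applied to the sub-layer output before the residual sum. A minimal sketch of that wrapper (class name is illustrative; see the note on the "good" order of LayerNorm and residuals in the coding tips later):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Post-norm residual wrapper: LayerNorm(x + dropout(sublayer(x)))."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))

x = torch.randn(5, 512)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
wrap = SublayerConnection()
out = wrap(x, ffn)   # (5, 512)
```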
The decoder

[figure: the full architecture, with the encoder self-attn stack feeding the decoder]

decoder self-attn (masked)
- Mask subsequent positions (before softmax)
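A sketch of the masking step: scores for positions j > i are set to −∞ before the softmax, so position i can only attend to itself and earlier positions (names are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(Q, K, V):
    """Decoder self-attention over (m, d) inputs with a causal mask."""
    m, d = Q.shape
    S = Q @ K.transpose(-2, -1) / math.sqrt(d)              # (m, m) scores
    # -inf above the diagonal: position i cannot see positions j > i
    mask = torch.triu(torch.ones(m, m), diagonal=1).bool()
    S = S.masked_fill(mask, float('-inf'))
    P = F.softmax(S, dim=-1)                                # masked entries get weight 0
    return P @ V                                            # (m, d)

Y = torch.randn(7, 512)               # target-side representations
Z = masked_self_attention(Y, Y, Y)    # (7, 512)
```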
context attention
- Use the encoder output as keys and values

R_enc = Encoder(x) ∈ R^{n×d}
S = score(Q, R_enc) ∈ R^{m×n}
P = π(S) ∈ △^{m×n}
Z = P R_enc ∈ R^{m×d}
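The same computation as a sketch, with queries coming from the decoder and both keys and values taken from the encoder output (assuming a plain dot-product scorer and equal dimensions):

```python
import math
import torch
import torch.nn.functional as F

def context_attention(Q_dec, R_enc):
    """Q_dec: (m, d) decoder states; R_enc: (n, d) encoder output -> (m, d)."""
    d = Q_dec.size(-1)
    S = Q_dec @ R_enc.transpose(-2, -1) / math.sqrt(d)   # (m, n)
    P = F.softmax(S, dim=-1)                             # each target position attends over the source
    Z = P @ R_enc                                        # keys and values are the encoder output
    return Z

R_enc = torch.randn(9, 512)           # n = 9 source positions
Q_dec = torch.randn(7, 512)           # m = 7 target positions
Z = context_attention(Q_dec, R_enc)   # (7, 512)
```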
Computational cost

n = seq. length, d = hidden dim, k = kernel size
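For reference, the per-layer costs reported in Table 1 of the original paper (presumably the table shown on this slide):

Layer type       Complexity per layer   Sequential ops   Max. path length
Self-attention   O(n² · d)              O(1)             O(1)
Recurrent        O(n · d²)              O(n)             O(n)
Convolutional    O(k · n · d²)          O(1)             O(log_k n)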
Other tricks
- Training Transformers is like black magic. The original paper employs several additional tricks:
- Label smoothing
- Dropout at every layer before residuals
- Beam search with length penalties
- Subword units - BPEs
- Adam optimizer with learning-rate decay
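As a concrete example of the last item, the paper's schedule increases the learning rate linearly for warmup_steps and then decays it with the inverse square root of the step number; a small sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), as in the paper."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# rises linearly until step 4000, then decays
print(transformer_lr(1000), transformer_lr(4000), transformer_lr(100000))
```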

Coding & training tips
- Sasha Rush's post is a really good starting point:
http://nlp.seas.harvard.edu/2018/04/03/attention.html
- OpenNMT-py implementation:
encoder part | decoder part
on the "good" order of LayerNorm and Residuals
- PyTorch has a built-in implementation since August 2019 (v1.2):
torch.nn.Transformer (see the usage sketch below)
- Training Tips for the Transformer Model
https://arxiv.org/pdf/1804.00247
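A minimal usage sketch of the built-in module (it does not include embeddings, positional encoding, or the final projection to the vocabulary, which have to be added around it; the hyperparameters shown are just its defaults):

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)   # (source length, batch, d_model)
tgt = torch.rand(20, 32, 512)   # (target length, batch, d_model)

# mask subsequent positions on the target side
tgt_mask = model.generate_square_subsequent_mask(tgt.size(0))

out = model(src, tgt, tgt_mask=tgt_mask)   # (target length, batch, d_model)
```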
What else?
- BERT uses only the encoder side; GPT-2 uses only the decoder side
- Absolute vs relative positional encoding
https://www.aclweb.org/anthology/N18-2074.pdf
- Transformer-XL: keep a memory of previous encoded states
http://arxiv.org/abs/1901.02860
- Sparse transformers
https://www.aclweb.org/anthology/D19-1223.pdf
Thank you for your attention!