Transformers
A Brief Introduction
What is a Transformer?
Origin
- Attention Is All You Need
Why
- RNN (seq2seq): tokens in a sequence depend on each other sequentially
- During translation, the decoder needs the previous token's output
- Unfriendly to parallelization
Proposed Solution
- Transformer encoder/decoder
- Self Attention
Transformer Architecture
- Self Attention
- Positionwise Feed Forward
- Position encoding
Architecture (Cont.)
Attention
x_i \ \forall i \in [1, length] \newline
h_i = f(x_i) \newline
\alpha_{i} = score(y, h_i) \newline
z = \sum_{i}^{length} \alpha_{i} h_i
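A minimal NumPy sketch of this generic attention step. The slides leave `f` and `score` abstract; here the score is assumed to be a dot product followed by a softmax, which is one common choice, not the only one.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend(y, H):
    """y: (d,) query, H: (length, d) encoded inputs h_i -> context z: (d,)."""
    scores = H @ y              # alpha_i = score(y, h_i), assumed dot product
    alpha = softmax(scores)     # normalize the scores over i
    return alpha @ H            # z = sum_i alpha_i * h_i

H = np.random.randn(5, 8)       # length = 5, hidden size = 8
y = np.random.randn(8)
z = attend(y, H)                # context vector, shape (8,)
```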
Simple Self Attention
x_i \ \forall i \in [1, length] \newline
h_i = f(x_i) \newline
\alpha_{ij} = score(h_i, h_j) \newline
y_i = \sum_{j}^{length} \alpha_{ij} h_j
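The same idea in matrix form: every position i attends over all positions j. Again the score is assumed to be a dot product with a row-wise softmax; this is a sketch, not the slides' exact definition.

```python
import numpy as np

def simple_self_attention(H):
    """H: (length, d) -> Y: (length, d), where y_i = sum_j alpha_ij * h_j."""
    scores = H @ H.T                              # score(h_i, h_j), assumed dot product
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # row-wise softmax over j
    return alpha @ H

Y = simple_self_attention(np.random.randn(5, 8))  # (5, 8)
```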
Self Attention
(SA)
x_i \ \forall i \in [1, length] \newline
q_i = f_1(x_i) \newline
k_i = f_2(x_i) \newline
v_i = f_3(x_i) \newline
\alpha_{ij} = score(q_i, k_j) \newline
y_i = \sum_{j}^{length} \alpha_{ij} v_j
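A sketch of the full q/k/v form, assuming f_1, f_2, f_3 are linear projections W_q, W_k, W_v and the score is the scaled dot product from "Attention Is All You Need" (single head, no masking, for brevity).

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (length, d_model) -> (length, d_v)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # score(q_i, k_j), scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # softmax over j
    return alpha @ V                              # y_i = sum_j alpha_ij * v_j

d_model, d_k = 8, 8
X = np.random.randn(5, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
Y = self_attention(X, W_q, W_k, W_v)              # (5, 8)
```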
Self Attention (Cont.)
Positionwise Feed Forward
(PWFFN)
x_i \ \forall i \in [1, length] \newline
y_i = f(x_i) \newline
\text{where } f \text{ is a two-layer MLP}
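A sketch of the position-wise feed-forward layer: the same two-layer MLP f is applied to every position independently. A ReLU activation is assumed, as in the original paper; the dimensions are illustrative.

```python
import numpy as np

def pwffn(X, W1, b1, W2, b2):
    """X: (length, d_model) -> (length, d_model); f(x) = W2^T relu(W1^T x + b1) + b2."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2   # applied row-wise, i.e. per position

d_model, d_ff = 8, 32
X = np.random.randn(5, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
Y = pwffn(X, W1, b1, W2, b2)                        # (5, 8)
```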
Position Encoding
- SA & PWFFN 並沒有順序(位置)的概念
- 透過而外得向量提供位置資訊
- 可用 trained weight 也可用定值
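A sketch of the fixed-value variant, the sinusoidal position encoding from "Attention Is All You Need"; the trained-weight variant would simply be a learnable (length, d_model) table.

```python
import numpy as np

def sinusoidal_position_encoding(length, d_model):
    pos = np.arange(length)[:, None]                  # (length, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

PE = sinusoidal_position_encoding(5, 8)
# X = X + PE  # added to the token representations before the first layer
```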
Transformer en/decoder
- Replaces the RNN architecture in seq2seq with the Transformer (TRF)
- The TRF encoder turns the source sentence into a sequence of latent variables (memory)
- The TRF decoder accesses the memory through the target sentence (see the sketch below)
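How the two halves fit together, sketched with PyTorch's built-in nn.Transformer. The sequence lengths and batch size are arbitrary; shapes follow its default (seq_len, batch, d_model) convention.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 1, 512)   # source sentence -> encoder
tgt = torch.randn(12, 1, 512)   # target sentence -> decoder
# The encoder turns src into a sequence of latent variables ("memory");
# the decoder attends over that memory while processing tgt.
out = model(src, tgt)           # (12, 1, 512)
```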
Transformer en/decoder (Cont.)
Benefit
- The i-th output h_i of each layer does not need to wait for h_{i-1}; it is computed directly from attention over X, so the computation can be parallelized for speedup
Variants
Non-sequence data
- Source: Image Transformer
- Applies the TRF to images, replacing PixelCNN
Decoder-only TRF
- Source: GENERATING WIKIPEDIA BY SUMMARIZING LONG SEQUENCES
- The TRF encoder and decoder already exchange information through attention, so there is no need to split them into an encoder and a decoder
- Also known as: TRF encoder
Pre-LN TRF
- Source: On Layer Normalization in the Transformer Architecture
- Moves the LN that originally follows SA and PWFFN to before each layer (compare the sketch below)
- Shown to make the gradients inside the TRF more stable, which helps convergence
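A sketch of the two sublayer orderings, with a toy LayerNorm. `sublayer` stands for either SA or the PWFFN; the only difference is where the normalization sits relative to the residual connection.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    return layer_norm(x + sublayer(x))    # original (Post-LN) Transformer

def pre_ln_block(x, sublayer):
    return x + sublayer(layer_norm(x))    # Pre-LN variant

W = np.random.randn(8, 8)
h = pre_ln_block(np.random.randn(5, 8), lambda x: x @ W)
```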
Pre-LN TRF (Cont.)
TRF-XL
- Source: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Addresses the TRF's fixed-length limitation
- Wraps a recurrent mechanism around the TRF
- Uses relative position encoding
TRF-XL (Cont.)
More variants
- Attention scheme
- Position embedding design
- ...
Pretrained TRF
Origin
- Improving Language Understanding by Generative Pre-Training (GPT1)
- The first paper to use a pretrained TRF
GPT1
- Pre-training task: Language Modeling
- Autoregressive model (see the causal-mask sketch below)
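Autoregressive (left-to-right) language modeling is enforced with a causal mask: position i may only attend to positions j <= i. A sketch of the mask applied to self-attention scores; the -inf convention is the usual implementation trick, not quoted from the GPT1 paper.

```python
import numpy as np

def causal_mask(length):
    # -inf strictly above the diagonal, 0 on and below it
    return np.triu(np.full((length, length), -np.inf), k=1)

scores = np.random.randn(4, 4)        # score(q_i, k_j) for a length-4 input
masked = scores + causal_mask(4)      # future positions become -inf
# a row-wise softmax now assigns zero weight to every j > i
```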
GPT1 (Cont.)
BERT
- Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- To address GPT1's unidirectional input, replaces Language Modeling with Masked Language Modeling (MLM); see the masking sketch below
- Proposes the next sentence prediction task
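A sketch of MLM input corruption: randomly replace a fraction of the input tokens with a [MASK] token and train the model to recover the originals. The 15% rate follows the BERT paper; the token ids and mask_id are made up, and the -100 ignore-index is borrowed from common loss implementations.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    picked = rng.random(token_ids.shape) < mask_prob   # positions to mask
    inputs = np.where(picked, mask_id, token_ids)      # corrupted input
    labels = np.where(picked, token_ids, -100)         # only masked positions are scored
    return inputs, labels

inputs, labels = mask_tokens([12, 54, 7, 99, 31, 8], mask_id=103)
```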
BERT (Cont.)
GPT2
- Source: Language Models are Unsupervised Multitask Learners
- Switches GPT1 from a Post-LN TRF to a Pre-LN TRF and trains with more data and a larger model
XLNet
- Source: XLNet: Generalized Autoregressive Pretraining for Language Understanding
- Built on top of TRF-XL
- Addresses GPT being unidirectional and BERT having mask tokens in its input
- Also uses MLM, but only a subsequence of the input is used
XLNet (Cont.)
ALBERT
- Source: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- Addresses BERT's large model size with extensive weight sharing
- Proposes the sentence order prediction task: predicting which of two sentences comes first
More pretrain arch.
- Source: PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation
Q&A
TRF
By Peter Cheng