Transformers
A Brief Introduction
What is a Transformer
Origin
Attention is all you need
Why
RNN (seq2seq) has sequential dependencies within a sentence
When translating, the decoder needs the previous token's output
Unfriendly to parallelization
Proposed Solution
Transformer encoder/decoder
Self Attention
Transformer Architecture
Self Attention
Positionwise Feed Forward
Position Encoding
Architecture (Cont.)
Attention
x_i \ \forall i \in [1, length] \newline h_i = f(x_i) \newline \alpha_{i} = score(y, h_i) \newline z = \sum_{i=1}^{length} \alpha_{i} h_i
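A minimal numpy sketch of the formula above, assuming a dot-product score and treating f as identity (both are left abstract on the slide); the softmax normalization of the weights follows the standard formulation.

```python
import numpy as np

def attention(y, h):
    """Attend over encoded inputs h with a single query vector y.

    y : (d,)        query (e.g. the current decoder state)
    h : (length, d) encoded inputs h_i = f(x_i)
    Returns the context vector z = sum_i alpha_i * h_i.
    """
    scores = h @ y                       # score(y, h_i) as a dot product (assumption)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()          # softmax over positions
    z = alpha @ h                        # weighted sum of the h_i
    return z, alpha

h = np.random.randn(5, 8)   # length=5, d=8
y = np.random.randn(8)
z, alpha = attention(y, h)
print(z.shape, alpha.sum())  # (8,) 1.0
```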
Simple Self Attention
x_i \ \forall i \in [1, length] \newline h_i = f(x_i) \newline \alpha_{ij} = score(h_i, h_j) \newline y_i = \sum_{j=1}^{length} \alpha_{ij} h_j
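Same idea, but every position attends over every other position. Again a sketch assuming dot-product scores and f = identity.

```python
import numpy as np

def simple_self_attention(h):
    """h : (length, d) hidden states h_i = f(x_i).
    Returns y where y_i = sum_j alpha_ij * h_j."""
    scores = h @ h.T                                    # score(h_i, h_j) as dot product (assumption)
    scores = scores - scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)   # softmax over j for each i
    return alpha @ h

y = simple_self_attention(np.random.randn(5, 8))
print(y.shape)  # (5, 8)
```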
Self Attention
(SA)
x_i \ \forall i \in [1, length] \newline q_i = f_1(x_i) \newline k_i = f_2(x_i) \newline v_i = f_3(x_i) \newline \alpha_{ij} = score(q_i, k_j) \newline y_i = \sum_{j=1}^{length} \alpha_{ij} v_j
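A sketch of the full self-attention, with linear projections standing in for f_1, f_2, f_3; the 1/sqrt(d_k) scaled dot-product score follows "Attention is all you need", shapes are illustrative.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x : (length, d_model); Wq/Wk/Wv : (d_model, d_k) projections
    implementing f_1, f_2, f_3. Returns y_i = sum_j alpha_ij * v_j."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    alpha = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # scaled dot-product score
    return alpha @ v

d_model, d_k, length = 16, 8, 5
x = np.random.randn(length, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (5, 8)
```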
Self Attention (Cont.)
Positionwise Feed Forward
(PWFFN)
x_i \ \forall i \in [1, length] \newline y_i = f(x_i) \newline \text{where } f \text{ is a two-layer MLP}
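Sketch of PWFFN: one two-layer MLP applied to every position independently. The ReLU and the 4x inner width follow the original paper; everything else is a placeholder.

```python
import numpy as np

def pwffn(x, W1, b1, W2, b2):
    """x : (length, d_model). The same MLP f is applied to every position i."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU(x W1 + b1) W2 + b2

d_model, d_ff, length = 16, 64, 5                 # d_ff = 4 * d_model as in the paper
x = np.random.randn(length, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(pwffn(x, W1, b1, W2, b2).shape)             # (5, 16)
```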
Position Encoding
SA & PWFFN have no notion of order (position)
Position information is provided through additional vectors
These can be trained weights or fixed values
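A sketch of the fixed-value variant, the sinusoidal encoding from "Attention is all you need"; the trained-weight variant would simply be a learned embedding table indexed by position.

```python
import numpy as np

def sinusoidal_position_encoding(length, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(length)[:, None]           # (length, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

x = np.random.randn(5, 16)                     # token representations
x = x + sinusoidal_position_encoding(5, 16)    # inject position information
```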
Transformer en/decoder
Replace the RNN architecture in seq2seq with a Transformer (TRF)
The TRF encoder converts the source sentence into a sequence of latent variables (memory)
The TRF decoder accesses the memory through the target sentence
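A minimal PyTorch sketch of this setup using the built-in nn.Transformer (hyperparameters and shapes are illustrative): the encoder produces the memory, and the decoder attends to it while processing the target sentence.

```python
import torch
import torch.nn as nn

d_model = 512
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 2, d_model)  # (source_len, batch, d_model), already embedded
tgt = torch.randn(7, 2, d_model)   # (target_len, batch, d_model)

# Encoder turns the source sentence into a sequence of latent variables ("memory");
# the decoder attends to that memory while processing the target sentence.
memory = model.encoder(src)
out = model.decoder(tgt, memory)
print(out.shape)                   # torch.Size([7, 2, 512])
```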
Transformer en/decoder (Cont.)
Benefit
The i-th output of each layer
h_i
does not need to wait for h_{i-1} to be computed; it is determined directly by attention over X, which allows parallel speedup
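The same point in code: the RNN must loop over positions because h_i depends on h_{i-1}, while self-attention produces every y_i from the same matrix products in one shot (toy numpy sketch, dot-product scores assumed).

```python
import numpy as np

x = np.random.randn(5, 8)                      # a sentence: (length, d)
W = np.random.randn(8, 8)

# RNN: h_i depends on h_{i-1}, so the loop over positions is unavoidable.
h, rnn_out = np.zeros(8), []
for x_i in x:
    h = np.tanh(x_i @ W + h)
    rnn_out.append(h)

# Self-attention: every y_i comes from the same matrix products,
# so all positions are computed at once (and can run in parallel on a GPU).
scores = x @ x.T
scores -= scores.max(axis=-1, keepdims=True)
alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
y = alpha @ x                                  # all y_i in one shot, no loop
```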
Variants
Non-sequence data
Source: Image Transformer
Applies TRF to images, replacing PixelCNN
Decoder-only TRF
Source: GENERATING WIKIPEDIA BY SUMMARIZING LONG SEQUENCES
The TRF encoder and decoder also exchange information through attention, so there is no need to separate them into an encoder and a decoder
Also known as: TRF encoder
Pre-LN TRF
Source: On Layer Normalization in the Transformer Architecture
Moves the LN that originally follows SA and PWFFN to before each layer
This has been shown to make the gradients inside the TRF more stable, helping convergence
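A sketch of the two placements, with a plain Linear standing in for the SA/PWFFN sublayer: Post-LN computes LN(x + sublayer(x)), Pre-LN computes x + sublayer(LN(x)).

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One sublayer (SA or PWFFN) with its residual connection and LayerNorm."""
    def __init__(self, d_model, sublayer, pre_ln=True):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.pre_ln = pre_ln

    def forward(self, x):
        if self.pre_ln:                          # Pre-LN: normalize before the sublayer
            return x + self.sublayer(self.norm(x))
        return self.norm(x + self.sublayer(x))   # Post-LN: normalize after the residual

block = Block(16, nn.Linear(16, 16), pre_ln=True)
print(block(torch.randn(5, 16)).shape)  # torch.Size([5, 16])
```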
Pre-LN TRF (Cont.)
TRF-XL
Source: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Addresses the TRF's fixed-length limitation
Wraps a recurrence mechanism around the TRF
Uses relative position encoding
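A heavily simplified sketch of the segment-level recurrence only: the previous segment's hidden states are cached and concatenated to the current segment as extra keys/values. The relative position encoding and the stop-gradient on the cache are omitted here.

```python
import numpy as np

def segment_recurrent_attention(x, memory, Wq, Wk, Wv):
    """x      : (seg_len, d)  hidden states of the current segment
    memory : (mem_len, d)  cached states of the previous segment
    Queries come from x only; keys/values come from [memory; x]."""
    context = np.concatenate([memory, x], axis=0)
    q, k, v = x @ Wq, context @ Wk, context @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return alpha @ v                      # the current segment attends into the past

d = 8
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
seg1 = np.random.randn(4, d)              # previous segment (would be cached, no grad)
seg2 = np.random.randn(4, d)              # current segment
print(segment_recurrent_attention(seg2, seg1, Wq, Wk, Wv).shape)  # (4, 8)
```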
TRF-XL (Cont.)
More variants
Attention scheme
Position embedding design
...
Pretrained TRF
Origin
Improving Language Understanding by Generative Pre-Training (GPT1)
The first paper to use a pretrained TRF
GPT1
Pre-training with Language Modeling
autoregressive model
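A sketch of the autoregressive LM objective: position t predicts the token at position t+1, trained with cross-entropy on shifted labels. The backbone here is a placeholder Linear, not a causally masked TRF.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model = 100, 32
embed = nn.Embedding(vocab, d_model)
backbone = nn.Linear(d_model, d_model)    # placeholder for a (causally masked) TRF
head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (2, 16))  # (batch, length)
h = head(backbone(embed(tokens)))          # (batch, length, vocab)

# Autoregressive LM: position t predicts the token at position t+1.
logits = h[:, :-1]                         # predictions for positions 1..L-1
labels = tokens[:, 1:]                     # the "next token" targets
loss = F.cross_entropy(logits.reshape(-1, vocab), labels.reshape(-1))
print(loss.item())
```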
GPT1 (Cont.)
Bert
Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
To address GPT1's unidirectional input, replaces Language Modeling with Masked Language Modeling (MLM)
Proposes the next sentence prediction task
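A sketch of the MLM data preparation only: roughly 15% of positions are replaced by a [MASK] id and the loss is computed only at those positions (the paper's 80/10/10 replacement rule is omitted); the ids and the model output below are stand-ins.

```python
import torch
import torch.nn.functional as F

vocab, mask_id, mask_prob = 100, 99, 0.15

tokens = torch.randint(0, vocab - 1, (2, 16))      # (batch, length)
is_masked = torch.rand(tokens.shape) < mask_prob   # pick ~15% of positions

inputs = tokens.clone()
inputs[is_masked] = mask_id                        # replace with the [MASK] id

labels = tokens.clone()
labels[~is_masked] = -100                          # ignore unmasked positions in the loss

logits = torch.randn(2, 16, vocab)                 # stand-in for TRF encoder output
loss = F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)
```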
Bert (Cont.)
GPT2
Source: Language Models are Unsupervised Multitask Learners
Switches GPT1 from a Post-LN TRF to a Pre-LN TRF, with more data and a larger model
XLNet
Source: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Built on top of TRF-XL
Addresses GPT being unidirectional and Bert's input containing mask tokens
Also uses MLM, but the input uses only a subsequence
XLNet (Cont.)
ALBert
Source: ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS
Addresses Bert's model-size problem by sharing weights extensively
Proposes the sentence order prediction task: predicting which of two sentences comes first
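A sketch of the cross-layer weight sharing: one encoder layer's parameters are reused for every layer, so depth grows without growing the parameter count (the layer module is illustrative).

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)   # one set of parameters

def albert_style_encoder(x, num_layers=12):
    """Apply the *same* layer num_layers times (cross-layer weight sharing),
    so the parameter count stays that of a single layer."""
    for _ in range(num_layers):
        x = layer(x)
    return x

x = torch.randn(10, 2, 64)            # (length, batch, d_model)
print(albert_style_encoder(x).shape)  # torch.Size([10, 2, 64])
```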
More pretraining architectures
Source: PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation
Q&A