Transformers
A Brief Introduction
What is a Transformer?
Origin
- Attention Is All You Need
Why
- RNN (seq2seq): tokens in a sequence depend on each other sequentially
- During translation, the decoder needs the previous token's output
- Unfriendly to parallelization
Proposed Solution
- Transformer encoder/decoder
- Self Attention
Transformer Architecture
- Self Attention
- Positionwise Feed Forward
- Position encoding
Architecture (Cont.)
Attention
x_i \ \forall i \in [1, length] \newline
h_i = f(x_i) \newline
\alpha_{i} = score(y, h_i) \newline
z = \sum_{i}^{length} \alpha_{i} h_i
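A minimal NumPy sketch of this generic attention step. The slides leave `f` and `score` abstract; here the score is assumed to be a dot product followed by a softmax, which is one common choice, not the only one.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend(y, H):
    """y: (d,) query, H: (length, d) encoded inputs h_i -> context z: (d,)."""
    scores = H @ y              # alpha_i = score(y, h_i), assumed dot product
    alpha = softmax(scores)     # normalize the scores over i
    return alpha @ H            # z = sum_i alpha_i * h_i

H = np.random.randn(5, 8)       # length = 5, hidden size = 8
y = np.random.randn(8)
z = attend(y, H)                # context vector, shape (8,)
```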
Simple Self Attention
x_i \ \forall i \in [1, length] \newline
h_i = f(x_i) \newline
\alpha_{ij} = score(h_i, h_j) \newline
y_i = \sum_{j}^{length} \alpha_{ij} h_j
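The same idea in matrix form: every position i attends over all positions j. Again the score is assumed to be a dot product with a row-wise softmax; this is a sketch, not the slides' exact definition.

```python
import numpy as np

def simple_self_attention(H):
    """H: (length, d) -> Y: (length, d), where y_i = sum_j alpha_ij * h_j."""
    scores = H @ H.T                              # score(h_i, h_j), assumed dot product
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # row-wise softmax over j
    return alpha @ H

Y = simple_self_attention(np.random.randn(5, 8))  # (5, 8)
```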
Self Attention
(SA)
x_i \ \forall i \in [1, length] \newline
q_i = f_1(x_i) \newline
k_i = f_2(x_i) \newline
v_i = f_3(x_i) \newline
\alpha_{ij} = score(q_i, k_j) \newline
y_i = \sum_{j}^{length} \alpha_{ij} v_j
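A sketch of the full q/k/v form, assuming f_1, f_2, f_3 are linear projections W_q, W_k, W_v and the score is the scaled dot product from "Attention Is All You Need" (single head, no masking, for brevity).

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (length, d_model) -> (length, d_v)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # score(q_i, k_j), scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # softmax over j
    return alpha @ V                              # y_i = sum_j alpha_ij * v_j

d_model, d_k = 8, 8
X = np.random.randn(5, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
Y = self_attention(X, W_q, W_k, W_v)              # (5, 8)
```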
Self Attention (Cont.)
Positionwise Feed Forward
(PWFFN)
x_i \ \forall i \in [1, length] \newline
y_i = f(x_i) \newline
\text{where } f \text{ is a two-layer MLP}
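A sketch of the position-wise feed-forward layer: the same two-layer MLP f is applied to every position independently. A ReLU activation is assumed, as in the original paper; the dimensions are illustrative.

```python
import numpy as np

def pwffn(X, W1, b1, W2, b2):
    """X: (length, d_model) -> (length, d_model); f(x) = W2^T relu(W1^T x + b1) + b2."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2   # applied row-wise, i.e. per position

d_model, d_ff = 8, 32
X = np.random.randn(5, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
Y = pwffn(X, W1, b1, W2, b2)                        # (5, 8)
```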
Position Encoding
- SA & PWFFN 並沒有順序(位置)的概念
- 透過而外得向量提供位置資訊
- 可用 trained weight 也可用定值
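A sketch of the fixed-value variant, the sinusoidal position encoding from "Attention Is All You Need"; the trained-weight variant would simply be a learnable (length, d_model) table.

```python
import numpy as np

def sinusoidal_position_encoding(length, d_model):
    pos = np.arange(length)[:, None]                  # (length, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

PE = sinusoidal_position_encoding(5, 8)
# X = X + PE  # added to the token representations before the first layer
```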
Transformer en/decoder
- Replaces the RNN architecture in seq2seq with the Transformer (TRF)
- The TRF encoder turns the source sentence into a sequence of latent variables (memory)
- The TRF decoder accesses the memory through the target sentence (see the sketch below)
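How the two halves fit together, sketched with PyTorch's built-in nn.Transformer. The sequence lengths and batch size are arbitrary; shapes follow its default (seq_len, batch, d_model) convention.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 1, 512)   # source sentence -> encoder
tgt = torch.randn(12, 1, 512)   # target sentence -> decoder
# The encoder turns src into a sequence of latent variables ("memory");
# the decoder attends over that memory while processing tgt.
out = model(src, tgt)           # (12, 1, 512)
```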
Transformer en/decoder (Cont.)
Benefit
- The i-th output h_i of each layer does not need to wait for h_{i-1}; it is computed directly from attention over X, so the computation can be parallelized for speedup
Variants
Non-sequence data
- Source: Image Transformer
- Applies the TRF to images, replacing PixelCNN
Decoder-only TRF
- Source: GENERATING WIKIPEDIA BY SUMMARIZING LONG SEQUENCES
- The TRF encoder and decoder already exchange information through attention, so there is no need to split them into an encoder and a decoder
- Also known as: TRF encoder
Pre-LN TRF
- Source: On Layer Normalization in the Transformer Architecture
- Moves the LN that originally follows SA and PWFFN to before each layer (compare the sketch below)
- Shown to make the gradients inside the TRF more stable, which helps convergence
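A sketch of the two sublayer orderings, with a toy LayerNorm. `sublayer` stands for either SA or the PWFFN; the only difference is where the normalization sits relative to the residual connection.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    return layer_norm(x + sublayer(x))    # original (Post-LN) Transformer

def pre_ln_block(x, sublayer):
    return x + sublayer(layer_norm(x))    # Pre-LN variant

W = np.random.randn(8, 8)
h = pre_ln_block(np.random.randn(5, 8), lambda x: x @ W)
```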
Pre-LN TRF (Cont.)
TRF-XL
- Source: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Addresses the TRF's fixed-length limitation
- Wraps a recurrent mechanism around the TRF
- Uses relative position encoding
TRF-XL (Cont.)
More variants
- Attention scheme
- Position embedding design
- ...
Pretrained TRF
Origin
- Improving Language Understanding by Generative Pre-Training (GPT1)
- The first paper to use a pretrained TRF
GPT1
- Pre-training task: Language Modeling
- Autoregressive model (see the causal-mask sketch below)
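Autoregressive (left-to-right) language modeling is enforced with a causal mask: position i may only attend to positions j <= i. A sketch of the mask applied to self-attention scores; the -inf convention is the usual implementation trick, not quoted from the GPT1 paper.

```python
import numpy as np

def causal_mask(length):
    # -inf strictly above the diagonal, 0 on and below it
    return np.triu(np.full((length, length), -np.inf), k=1)

scores = np.random.randn(4, 4)        # score(q_i, k_j) for a length-4 input
masked = scores + causal_mask(4)      # future positions become -inf
# a row-wise softmax now assigns zero weight to every j > i
```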
GPT1 (Cont.)
BERT
- Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- To address GPT1's unidirectional input, replaces Language Modeling with Masked Language Modeling (MLM); see the masking sketch below
- Proposes the next sentence prediction task
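A sketch of MLM input corruption: randomly replace a fraction of the input tokens with a [MASK] token and train the model to recover the originals. The 15% rate follows the BERT paper; the token ids and mask_id are made up, and the -100 ignore-index is borrowed from common loss implementations.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    picked = rng.random(token_ids.shape) < mask_prob   # positions to mask
    inputs = np.where(picked, mask_id, token_ids)      # corrupted input
    labels = np.where(picked, token_ids, -100)         # only masked positions are scored
    return inputs, labels

inputs, labels = mask_tokens([12, 54, 7, 99, 31, 8], mask_id=103)
```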
BERT (Cont.)
GPT2
- Source: Language Models are Unsupervised Multitask Learners
- Switches GPT1 from a Post-LN TRF to a Pre-LN TRF and trains with more data and a larger model
XLNet
- Source: XLNet: Generalized Autoregressive Pretraining for Language Understanding
- Built on top of TRF-XL
- Addresses GPT being unidirectional and BERT having mask tokens in its input
- Also uses MLM, but only a subsequence of the input is used
XLNet (Cont.)
ALBERT
- Source: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- Addresses BERT's large model size with extensive weight sharing
- Proposes the sentence order prediction task: predicting which of two sentences comes first
More pretrain arch.
- Source: PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation
Q&A
TRF
By Peter Cheng