Research Guild Meeting

language model

cows drink
P(milk  |c)  = 0.7
P(water |c)  = 0.8
P(wine  |c)  = 0.0001
P(bricks|c)  = 0.000000001
...

context  c

target x

RNN Language model

cows
drink
???

RNN Language model

cows
drink
???

RNN Language model

cows
drink
h_{1,1}
h_{1,2}
???

RNN Language model

cows
drink
h_{1,1}
h_{1,2}
h_{2,1}
h_{2,2}
???

RNN Language model

cows
drink
h_{1,1}
h_{1,2}
P("water") = 0.007
P("beer") = 0.0001
...
h_{2,1}
h_{2,2}
???

context

P(target | context)

Softmax bottleneck

h
P("water") = 0.0024
P("milk") = 0.0015

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

h = \text{RNN}(\text{"cows drink"})
\text{P}(x|c) = \text{softmax}(h \mathbf{W_h})
P("wine") = 0.0005

Softmax bottleneck

h

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

P("a") = 0.002
P("a")=0.002P("a") = 0.002
P("as") = 0.00004
P("as")=0.00004P("as") = 0.00004
P("apple") = 0.0001
P("apple")=0.0001P("apple") = 0.0001

...

P("milk") = 0.0045
P("milk")=0.0045P("milk") = 0.0045
P("zoo") = 0.000021
P("zoo")=0.000021P("zoo") = 0.000021

...

Limited expressivity!

h_{c_1}
h_{c_2}
h_{c_3}
h_{c_4}
h_{c_5}
h_{c_6}
h_{c_7}
h_{c_8}
\begin{bmatrix} {P^\ast(\text{"water"}|c_1)} & {P^\ast(\text{"milk"}|c_1)} & {\cdots} & {P^\ast(\text{"zoo"}|c_1)} \end{bmatrix}

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

Contexts

h_{c_1}
h_{c_2}
h_{c_3}
h_{c_4}
h_{c_5}
h_{c_6}
h_{c_7}
h_{c_8}
\begin{bmatrix} {P^\ast(x_1|c_1)} & {P^\ast(x_2|c_1)} & {\cdots} & {P^\ast(x_M|c_1)} \end{bmatrix}

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

h_{c_1}
h_{c_2}
h_{c_3}
h_{c_4}
h_{c_5}
h_{c_6}
h_{c_7}
h_{c_8}
\begin{bmatrix} {\log P^\ast(x_1|c_1)} & {\log P^\ast(x_2|c_1)} & {\cdots} & {\log P^\ast(x_M|c_1)} \end{bmatrix}
\begin{bmatrix} {\log P^\ast(x_1|c_2)} & {\log P^\ast(x_2|c_2)} & {\cdots} & {\log P^\ast(x_M|c_2)} \end{bmatrix}
\begin{bmatrix} {\log P^\ast(x_1|c_3)} & {\log P^\ast(x_2|c_3)} & {\cdots} & {\log P^\ast(x_M|c_3)} \end{bmatrix}
\vdots
\begin{bmatrix} {\log P^\ast(x_1|c_N)} & {\log P^\ast(x_2|c_N)} & {\cdots} & {\log P^\ast(x_M|c_N)} \end{bmatrix}

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

h_{c_1}
h_{c_2}
h_{c_3}
h_{c_4}
h_{c_5}
h_{c_6}
h_{c_7}
h_{c_8}
\begin{bmatrix} {\log P^\ast(x_1|c_1)} & {\log P^\ast(x_2|c_1)} & {\cdots} & {\log P^\ast(x_M|c_1)} \\ {\log P^\ast(x_1|c_2)} & {\log P^\ast(x_2|c_2)} & {\cdots} & {\log P^\ast(x_M|c_2)} \\ \vdots & \vdots & \ddots & \vdots \\ {\log P^\ast(x_1|c_N)} & {\log P^\ast(x_2|c_N)} & {\cdots} & {\log P^\ast(x_M|c_N)} \end{bmatrix}

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

\begin{bmatrix} {h_{c_1}} \\ {h_{c_2}} \\ \vdots \\ {h_{c_N}} \end{bmatrix}
\mathbf{H}_\theta
\begin{bmatrix} {w_{x_1}} \\ {w_{x_2}} \\ \vdots \\ {w_{x_M}} \end{bmatrix}
\mathbf{W}_\theta
\begin{bmatrix} {\log P^\ast(x_1|c_1)} & {\log P^\ast(x_2|c_1)} & {\cdots} & {\log P^\ast(x_M|c_1)} \\ {\log P^\ast(x_1|c_2)} & {\log P^\ast(x_2|c_2)} & {\cdots} & {\log P^\ast(x_M|c_2)} \\ \vdots & \vdots & \ddots & \vdots \\ {\log P^\ast(x_1|c_N)} & {\log P^\ast(x_2|c_N)} & {\cdots} & {\log P^\ast(x_M|c_N)} \end{bmatrix}
\mathbf{A}

this is matrix decomposition!

\mathbf{H}_\theta \mathbf{W}_\theta^\top = \mathbf{A}^\prime

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

\mathbf{H}_\theta \in \mathbb{R}^{N \times d}
\mathbf{W}_\theta \in \mathbb{R}^{M \times d}
\mathbf{A} \in \mathbb{R}^{N \times M}

rank(A) is limited to d

rank?

>>> import numpy as np
>>> x = np.array([[ 0,  0,  0,  0,  0,  0],
                  [ 1,  2,  3,  4,  5,  6],
                  [ 2,  4,  6,  8, 10, 12],
                  [ 3,  6,  9, 12, 15, 18],
                  [ 4,  8, 12, 16, 20, 24],
                  [ 5, 10, 15, 20, 25, 30],
                  [ 6, 12, 18, 24, 30, 36],
                  [ 7, 14, 21, 28, 35, 42],
                  [ 8, 16, 24, 32, 40, 48],
                  [ 9, 18, 27, 36, 45, 54]])

>>> np.linalg.matrix_rank(x)  # every row is a multiple of [1, 2, 3, 4, 5, 6]
1

large matrix, low diversity
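
The same bound applies to the logit matrix H_theta W_theta^T: an N x d matrix times a d x M matrix can never exceed rank d, no matter how large N and M are. A quick check with random matrices (hypothetical sizes, mirroring the d=32 setup used later):

>>> N, M, d = 2048, 1000, 32
>>> H = np.random.randn(N, d)   # context vectors
>>> W = np.random.randn(M, d)   # word embeddings
>>> np.linalg.matrix_rank(H @ W.T)
32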

The Problem

If A for a natural language is a high-rank matrix,

then no matter how expressive the neural network is,

the softmax layer will be a limiting factor

Solutions

Increase rank of A

Increase rank of A

Make word embedding size d larger?

Introduces too many parameters

Increase rank of A

Mixture of Softmaxes

Compute many independent softmaxes and mix them

h \in \mathbb{R}^{d}

model's hidden vector

h
h_1
h_2
h_3
W_k \in \mathbb{R}^{K \cdot d \times d}

make K hidden vectors

h_k = \tanh(W_k h)

added parameters
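
A minimal sketch of this step in numpy (K, d, and the per-component shapes here are illustrative assumptions):

import numpy as np

K, d = 3, 64                  # number of softmax components, hidden size
W = np.random.randn(K, d, d)  # one d x d projection per component
h = np.random.randn(d)        # the model's hidden vector

h_k = np.tanh(W @ h)          # K hidden vectors h_k = tanh(W_k h), shape (K, d)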

h
P("mooing") = 0.002
P("drink") = 0.005
h_1
h_2
h_3
P("mooing") = 0.099
P("drink") = 0.0002
P("mooing") = 0.003
P("drink") = 0.001

compute K softmaxes

h
P("mooing") = 0.002
P("drink") = 0.005
h_1
h_2
h_3
P("mooing") = 0.099
P("drink") = 0.0002
P("mooing") = 0.003
P("drink") = 0.001
P("mooing") = 0.003
P("drink") = 0.001

mix K softmaxes

How to mix?

P_{\theta}(x|c) = \sum_{k=1}^K \pi_{c,k} \frac{\exp(h_{c,k}^\top w_x)}{\sum_{x'}\exp(h_{c,k}^\top w_{x'})}
\sum_{k=1}^K \pi_{c,k} = 1

learned parameter

weighted average

Weighted sum with learned coefficients
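
Putting the pieces together, a minimal mixture-of-softmaxes sketch in numpy (all shapes are toy values, and the projection producing the mixture weights is an assumption, not the paper's exact parameterization):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

K, d, M = 3, 64, 10000             # components, hidden size, vocabulary size
W_k  = np.random.randn(K, d, d)    # per-component projections
W_pi = np.random.randn(K, d)       # hypothetical projection for pi_{c,k}
W_x  = np.random.randn(M, d)       # word embeddings w_x

h  = np.random.randn(d)            # context vector from the RNN
hk = np.tanh(W_k @ h)              # K hidden vectors, shape (K, d)
pi = softmax(W_pi @ h)             # mixture weights, sum to 1 over k

# K independent softmaxes over the vocabulary, then a weighted average
P = pi @ softmax(hk @ W_x.T)       # shape (M,); still a valid distribution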

Does it help?

Matrix rank

Softmax       k=1   k=2   k=3   k=4   k=5
Traditional    34    34    34    34    34
Mixture        34   629   979   995   997

http://smerity.com/articles/2017/mixture_of_softmaxes.html

d=32, V=1000, N=2048
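
A sketch of how such a rank comparison can be run (random weights stand in for trained ones; exact numbers will vary, but the gap is the point):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, V, d, K = 2048, 1000, 32, 3
H  = np.random.randn(N, d)                       # N context vectors
Wx = np.random.randn(V, d)                       # V word embeddings
b  = np.random.randn(V)                          # output bias

# traditional softmax: log-probabilities stay near rank d
logA = np.log(softmax(H @ Wx.T + b))
print(np.linalg.matrix_rank(logA))               # about d + 2 = 34

# mixture of K softmaxes: the log of a sum is no longer low-rank
Wk = np.random.randn(K, d, d)
pi = softmax(H @ np.random.randn(d, K))          # (N, K) mixture weights
Hk = np.tanh(np.einsum('kij,nj->nki', Wk, H))    # (N, K, d)
Pk = softmax(np.einsum('nki,vi->nkv', Hk, Wx))   # (N, K, V)
logA_mos = np.log(np.einsum('nk,nkv->nv', pi, Pk))
print(np.linalg.matrix_rank(logA_mos))           # close to full rank (cf. table)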

Language modelling

Model                     Params   PTB     WikiText-2
AWD-LSTM                  24M      57.7    65.8
 + mixture of softmaxes   22M      54.44   61.45

(test perplexity; lower is better)

State-of-the-art

Neural Machine Translation

Questions?

Softmax Bottleneck

By Oleksiy Syvokon

Softmax Bottleneck

Research Guild Meeting: "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model" by Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen.
