Research Guild Meeting

language model

cows drink
P(milk  |c)  = 0.7
P(water |c)  = 0.8
P(wine  |c)  = 0.0001
P(bricks|c)  = 0.000000001
...

context  c

target x

RNN Language model

cows
drink
???

RNN Language model

cows
drink
???

RNN Language model

cows
drink
h_{1,1}
h_{1,2}
???

RNN Language model

cows
drink
h_{1,1}
h_{1,2}
h_{2,1}
h_{2,2}
???

RNN Language model

cows
drink
h_{1,1}
h_{1,2}
P("water") = 0.007
P("beer") = 0.0001
...
h_{2,1}
h_{2,2}
???

context

P(target | context)

Softmax bottleneck

h
P("water") = 0.0024
P("milk") = 0.0015

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

h = \text{RNN}(\text{"cows drink"})
\text{P}(x|c) = \text{softmax}(h \mathbf{W_h})
P("wine") = 0.0005

Softmax bottleneck

h

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

P("a") = 0.002
P("a")=0.002P("a") = 0.002
P("as") = 0.00004
P("as")=0.00004P("as") = 0.00004
P("apple") = 0.0001
P("apple")=0.0001P("apple") = 0.0001

...

P("milk") = 0.0045
P("milk")=0.0045P("milk") = 0.0045
P("zoo") = 0.000021
P("zoo")=0.000021P("zoo") = 0.000021

...

Limited expressivity!

h_{c_1}
h_{c_2}
h_{c_3}
h_{c_4}
h_{c_5}
h_{c_6}
h_{c_7}
h_{c_8}
\begin{bmatrix} {P^\ast(\text{"water"}|c_1)} & {P^\ast(\text{"milk"}|c_1)} & {\cdots} & {P^\ast(\text{"zoo"}|c_1)} \end{bmatrix}

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

Contexts

h_{c_1}
h_{c_2}
h_{c_3}
h_{c_4}
h_{c_5}
h_{c_6}
h_{c_7}
h_{c_8}
\begin{bmatrix} {P^\ast(x_1|c_1)} & {P^\ast(x_2|c_1)} & {\cdots} & {P^\ast(x_M|c_1)} \end{bmatrix}

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

h_{c_1}
h_{c_2}
h_{c_3}
h_{c_4}
h_{c_5}
h_{c_6}
h_{c_7}
h_{c_8}
\begin{bmatrix} {\log P^\ast(x_1|c_1)} & {\log P^\ast(x_2|c_1)} & {\cdots} & {\log P^\ast(x_M|c_1)} \end{bmatrix}
\begin{bmatrix} {\log P^\ast(x_1|c_2)} & {\log P^\ast(x_2|c_2)} & {\cdots} & {\log P^\ast(x_M|c_2)} \end{bmatrix}
\begin{bmatrix} {\log P^\ast(x_1|c_3)} & {\log P^\ast(x_2|c_3)} & {\cdots} & {\log P^\ast(x_M|c_3)} \end{bmatrix}
\vdots
\begin{bmatrix} {\log P^\ast(x_1|c_N)} & {\log P^\ast(x_2|c_N)} & {\cdots} & {\log P^\ast(x_M|c_N)} \end{bmatrix}

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

h_{c_1}
h_{c_2}
h_{c_3}
h_{c_4}
h_{c_5}
h_{c_6}
h_{c_7}
h_{c_8}
\begin{bmatrix} {\log P^\ast(x_1|c_1)} & {\log P^\ast(x_2|c_1)} & {\cdots} & {\log P^\ast(x_M|c_1)} \\ {\log P^\ast(x_1|c_2)} & {\log P^\ast(x_2|c_2)} & {\cdots} & {\log P^\ast(x_M|c_2)} \\ \vdots & \vdots & \ddots & \vdots \\ {\log P^\ast(x_1|c_N)} & {\log P^\ast(x_2|c_N)} & {\cdots} & {\log P^\ast(x_M|c_N)} \end{bmatrix}

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

cats drink

cats eat

cats say

dogs drink

dogs eat

dogs say

humans drink

humans eat

\begin{bmatrix} {h_{c_1}} \\ {h_{c_2}} \\ \vdots \\ {h_{c_N}} \end{bmatrix}
\mathbf{H}_\theta
\begin{bmatrix} {w_{x_1}} \\ {w_{x_2}} \\ \vdots \\ {w_{x_M}} \end{bmatrix}
\mathbf{W}_\theta
\begin{bmatrix} {\log P^\ast(x_1|c_1)} & {\log P^\ast(x_2|c_1)} & {\cdots} & {\log P^\ast(x_M|c_1)} \\ {\log P^\ast(x_1|c_2)} & {\log P^\ast(x_2|c_2)} & {\cdots} & {\log P^\ast(x_M|c_2)} \\ \vdots & \vdots & \ddots & \vdots \\ {\log P^\ast(x_1|c_N)} & {\log P^\ast(x_2|c_N)} & {\cdots} & {\log P^\ast(x_M|c_N)} \end{bmatrix}
\mathbf{A}

this is matrix decomposition!

\mathbf{H}_\theta \mathbf{W}_\theta^\top = \mathbf{A}^\prime

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

\mathbf{H}_\theta \in \mathbb{R}^{N \times d}
\mathbf{W}_\theta \in \mathbb{R}^{M \times d}
\mathbf{A} \in \mathbb{R}^{N \times M}

rank(A) is limited to d

rank?

>>> import numpy as np
>>> x = np.array([[ 0,  0,  0,  0,  0,  0],
                  [ 1,  2,  3,  4,  5,  6],
                  [ 2,  4,  6,  8, 10, 12],
                  [ 3,  6,  9, 12, 15, 18],
                  [ 4,  8, 12, 16, 20, 24],
                  [ 5, 10, 15, 20, 25, 30],
                  [ 6, 12, 18, 24, 30, 36],
                  [ 7, 14, 21, 28, 35, 42],
                  [ 8, 16, 24, 32, 40, 48],
                  [ 9, 18, 27, 36, 45, 54]])

>>> np.linalg.matrix_rank(x)  # every row is a multiple of [1, 2, 3, 4, 5, 6]
1

large matrix, low diversity
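
The same bound applies to the logit matrix H_theta W_theta^T: an N x d matrix times a d x M matrix can never exceed rank d, no matter how large N and M are. A quick check with random matrices (hypothetical sizes, mirroring the d=32 setup used later):

>>> N, M, d = 2048, 1000, 32
>>> H = np.random.randn(N, d)   # context vectors
>>> W = np.random.randn(M, d)   # word embeddings
>>> np.linalg.matrix_rank(H @ W.T)
32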

The Problem

If A for a natural language is a high-rank matrix,

then no matter how expressive the neural network is,

the softmax layer will be a limiting factor

Solutions

Increase rank of A

Increase rank of A

Make word embedding size d larger?

Introduces too many parameters

Increase rank of A

Mixture of Softmaxes

Compute many independent softmaxes and mix them

h \in \mathbb{R}^{d}

model's hidden vector

h
h_1
h_2
h_3
W_k \in \mathbb{R}^{K \cdot d \times d}

make K hidden vectors

h_k = \tanh(W_k h)

added parameters
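
A minimal sketch of this step in numpy (K, d, and the per-component shapes here are illustrative assumptions):

import numpy as np

K, d = 3, 64                  # number of softmax components, hidden size
W = np.random.randn(K, d, d)  # one d x d projection per component
h = np.random.randn(d)        # the model's hidden vector

h_k = np.tanh(W @ h)          # K hidden vectors h_k = tanh(W_k h), shape (K, d)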

h
P("mooing") = 0.002
P("drink") = 0.005
h_1
h_2
h_3
P("mooing") = 0.099
P("drink") = 0.0002
P("mooing") = 0.003
P("drink") = 0.001

compute K softmaxes

h
P("mooing") = 0.002
P("drink") = 0.005
h_1
h_2
h_3
P("mooing") = 0.099
P("drink") = 0.0002
P("mooing") = 0.003
P("drink") = 0.001
P("mooing") = 0.003
P("drink") = 0.001

mix K softmaxes

How to mix?

P_{\theta}(x|c) = \sum_{k=1}^K \pi_{c,k} \frac{\exp(h_{c,k}^\top w_x)}{\sum_{x'}\exp(h_{c,k}^\top w_{x'})}
\sum_{k=1}^K \pi_{c,k} = 1

learned parameter

weighted average

Weighted sum with learned coefficients
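
Putting the pieces together, a minimal mixture-of-softmaxes sketch in numpy (all shapes are toy values, and the projection producing the mixture weights is an assumption, not the paper's exact parameterization):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

K, d, M = 3, 64, 10000             # components, hidden size, vocabulary size
W_k  = np.random.randn(K, d, d)    # per-component projections
W_pi = np.random.randn(K, d)       # hypothetical projection for pi_{c,k}
W_x  = np.random.randn(M, d)       # word embeddings w_x

h  = np.random.randn(d)            # context vector from the RNN
hk = np.tanh(W_k @ h)              # K hidden vectors, shape (K, d)
pi = softmax(W_pi @ h)             # mixture weights, sum to 1 over k

# K independent softmaxes over the vocabulary, then a weighted average
P = pi @ softmax(hk @ W_x.T)       # shape (M,); still a valid distribution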

Does it help?

Matrix rank

Softmax       k=1   k=2   k=3   k=4   k=5
Traditional    34    34    34    34    34
Mixture        34   629   979   995   997

http://smerity.com/articles/2017/mixture_of_softmaxes.html

d=32, V=1000, N=2048
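
A sketch of how such a rank comparison can be run (random weights stand in for trained ones; exact numbers will vary, but the gap is the point):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, V, d, K = 2048, 1000, 32, 3
H  = np.random.randn(N, d)                       # N context vectors
Wx = np.random.randn(V, d)                       # V word embeddings
b  = np.random.randn(V)                          # output bias

# traditional softmax: log-probabilities stay near rank d
logA = np.log(softmax(H @ Wx.T + b))
print(np.linalg.matrix_rank(logA))               # about d + 2 = 34

# mixture of K softmaxes: the log of a sum is no longer low-rank
Wk = np.random.randn(K, d, d)
pi = softmax(H @ np.random.randn(d, K))          # (N, K) mixture weights
Hk = np.tanh(np.einsum('kij,nj->nki', Wk, H))    # (N, K, d)
Pk = softmax(np.einsum('nki,vi->nkv', Hk, Wx))   # (N, K, V)
logA_mos = np.log(np.einsum('nk,nkv->nv', pi, Pk))
print(np.linalg.matrix_rank(logA_mos))           # close to full rank (cf. table)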

Language modelling

Model                     Params   PTB     WikiText-2
AWD-LSTM                  24M      57.7    65.8
 + mixture of softmaxes   22M      54.44   61.45

(test perplexity; lower is better)

State-of-the-art

Neural Machine Translation

Questions?

Softmax Bottleneck

By Oleksiy Syvokon

Softmax Bottleneck

Research Guild Meeting: "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model" by Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen.
