Research Guild Meeting
language model
cats drink
P(milk  |c) = 0.7
P(water |c) = 0.2
P(wine  |c) = 0.0001
P(bricks|c) = 0.000000001
...
context c
target x
RNN Language model
cats
drink
???
context → P(target | context)
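A minimal numpy sketch of what this output layer computes (sizes and names are illustrative, not the paper's): the RNN turns the context into a hidden vector h, and the next-word distribution is a softmax over the vocabulary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, V = 32, 1000                  # embedding size and vocabulary size (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))      # output word embeddings
b = rng.normal(size=V)           # output bias

h = rng.normal(size=d)           # stand-in for the hidden vector the RNN built for "cats drink"
p = softmax(h @ W.T + b)         # P(target | context): one probability per vocabulary word
print(p.argmax(), p.sum())       # index of the most likely next word, and 1.0
```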
Softmax bottleneck
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
Limited expressivity!
Contexts
cats drink
cats eat
cats say
dogs drink
dogs eat
dogs say
humans drink
humans eat
this is matrix decomposition!
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
rank(A) is limited to d (the word embedding size)
rank?
>>> import numpy as np
>>> x = np.array([[ 0, 0, 0, 0, 0, 0],
[ 1, 2, 3, 4, 5, 6],
[ 2, 4, 6, 8, 10, 12],
[ 3, 6, 9, 12, 15, 18],
[ 4, 8, 12, 16, 20, 24],
[ 5, 10, 15, 20, 25, 30],
[ 6, 12, 18, 24, 30, 36],
[ 7, 14, 21, 28, 35, 42],
[ 8, 16, 24, 32, 40, 48],
[ 9, 18, 27, 36, 45, 54]])
>>> np.linalg.matrix_rank(x)
1
large matrix, low diversity: every row is a multiple of [1, 2, 3, 4, 5, 6], so there is only one linearly independent row
The Problem
If A for a natural language is a high-rank matrix,
then no matter how expressive the neural network is,
the softmax layer will be a limiting factor
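A quick numpy illustration of the claim, using arbitrary random parameters: whatever the true N×V log-probability matrix looks like, a single softmax layer can only express a row-normalized H Wᵀ + b, so its rank is capped near d (the bias and the per-row normalization add at most one each).

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

N, V, d = 2048, 1000, 32             # contexts, vocabulary, embedding size (illustrative)
rng = np.random.default_rng(0)
H = rng.normal(size=(N, d))          # hidden vectors for N contexts
W = rng.normal(size=(V, d))          # output word embeddings
b = rng.normal(size=V)               # output bias

A = log_softmax(H @ W.T + b)         # the N x V log-probability matrix the model expresses
print(np.linalg.matrix_rank(A))      # about d + 2 (bias and normalization add one each),
                                     # nowhere near the min(N, V) a high-rank language needs
```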
Solutions
Increase rank of A
– Make word embedding size d larger? Introduces too many parameters
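Back-of-the-envelope check, assuming a PTB-sized vocabulary of 10k words for illustration: pushing d toward the vocabulary size (so that a full-rank A becomes possible) blows up the output embedding alone.

```python
V = 10_000                 # PTB-sized vocabulary (assumed for illustration)
for d in (400, 10_000):    # a typical embedding size vs. one large enough for full rank
    print(f"d={d}: output embedding alone has {V * d:,} parameters")
# d=400: 4,000,000 parameters; d=10,000: 100,000,000 parameters
```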
Increase rank of A
Mixture of Softmaxes
– Compute many independent softmaxes and mix them
model's hidden vector
make K hidden vectors (added parameters)
compute K softmaxes
mix K softmaxes
How to mix? Weighted average with learned coefficients
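A minimal numpy sketch of this mixture-of-softmaxes output layer (layer sizes and parameter names are illustrative, not the paper's): project h into K hidden vectors, take K softmaxes against the shared word embeddings, and average them with context-dependent learned weights.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_emb, V, K = 64, 32, 1000, 3          # illustrative sizes
rng = np.random.default_rng(0)

W_emb  = rng.normal(size=(V, d_emb))            # shared output word embeddings
b      = rng.normal(size=V)
W_proj = rng.normal(size=(K, d_emb, d_model))   # added parameters: K projections of h
w_pi   = rng.normal(size=(K, d_model))          # produces the K mixture weights

def mos_probs(h):
    """P(target | context) for one hidden vector h, as a mixture of K softmaxes."""
    hk = np.tanh(W_proj @ h)                    # K hidden vectors, shape (K, d_emb)
    pi = softmax(w_pi @ h)                      # K learned, context-dependent weights
    per_mix = softmax(hk @ W_emb.T + b)         # K softmaxes over the vocabulary, (K, V)
    return pi @ per_mix                         # weighted average, shape (V,), sums to 1

h = rng.normal(size=d_model)                    # stand-in for the RNN's hidden state
print(mos_probs(h).sum())                       # 1.0
```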
Does it help?
Matrix rank
Softmax | k=1 | k=2 | k=3 | k=4 | k=5 |
---|---|---|---|---|---|
Traditional | 34 | 34 | 34 | 34 | 34 |
Mixture | 34 | 629 | 979 | 995 | 997 |
http://smerity.com/articles/2017/mixture_of_softmaxes.html
d=32, V=1000, N=2048
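A sketch of the kind of experiment behind this table, following the linked post in spirit (random parameters rather than a trained model): build the N×V log-probability matrix with a single softmax and with a mixture of K softmaxes, then measure the numerical rank.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d, V, N = 32, 1000, 2048                     # the sizes quoted above
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))                  # shared output word embeddings
b = rng.normal(size=V)

def log_prob_matrix(K):
    """N x V log-probabilities from a mixture of K softmaxes with random parameters."""
    H  = rng.normal(size=(K, N, d))          # K hidden vectors per context
    pi = softmax(rng.normal(size=(N, K)))    # per-context mixture weights
    per_mix = softmax(H @ W.T + b)           # shape (K, N, V)
    return np.log(np.einsum('nk,knv->nv', pi, per_mix))

for K in (1, 2, 3):
    print(K, np.linalg.matrix_rank(log_prob_matrix(K)))
# K=1 stays around d + 2; K >= 2 jumps toward full rank, matching the trend in the table
```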
Language modelling
Model | Params | PTB (ppl) | WikiText-2 (ppl) |
---|---|---|---|
AWD-LSTM | 24M | 57.7 | 65.8 |
+ mixture of softmaxes | 22M | 54.44 | 61.45 |
State-of-the-art
Neural Machine Translation
Questions?
Softmax Bottleneck
By Oleksiy Syvokon
Research Guild Meeting: "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model" by Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen.