research guild meeting
cows drink ???
P(milk | c) = 0.7
P(water | c) = 0.8
P(wine | c) = 0.0001
P(bricks | c) = 0.000000001
...
context c: "cows drink"
target x: "???"
P(target | context)
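A minimal sketch of how such a conditional distribution is usually produced, assuming the standard softmax output layer analyzed in the cited paper: the context is encoded as a hidden vector, every word has an output embedding, and P(target | context) is the softmax of their dot products. The toy vocabulary, the dimension `d`, and the random values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

vocab = ["milk", "water", "wine", "bricks"]  # toy vocabulary (assumption)
d = 8                                        # hidden / embedding size (assumption)

h_c = rng.normal(size=d)                # hidden vector encoding the context c
W = rng.normal(size=(len(vocab), d))    # one output embedding per word

probs = softmax(W @ h_c)                # P(x | c) for every word x

for word, p in zip(vocab, probs):
    print(f"P({word} | c) = {p:.4f}")
print("sum =", probs.sum())
```

Because the values come out of a softmax, they are all positive and sum to 1 over the vocabulary.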
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
Limited expressivity!
Contexts:
cats drink
cats eat
cats say
dogs drink
dogs eat
dogs say
humans drink
humans eat
this is matrix decomposition!
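The decomposition can be written out as follows. This is a sketch in notation that is assumed here rather than shown on the slide: N contexts c_i with hidden vectors h_i stacked into H, V words x_j with output embeddings w_j stacked into W, and A holding the true log-probabilities.

```latex
\begin{align}
A_{ij} &= \log P^{*}(x_j \mid c_i), & A &\in \mathbb{R}^{N \times V} \\
\log P_\theta(x_j \mid c_i) &= h_i^\top w_j - \log \sum_{j'=1}^{V} \exp\!\left(h_i^\top w_{j'}\right), & H &\in \mathbb{R}^{N \times d},\; W \in \mathbb{R}^{V \times d} \\
P_\theta = P^{*} \;&\Longleftrightarrow\; H W^\top = A + b\,\mathbf{1}^\top \ \text{for some } b \in \mathbb{R}^{N} \\
\operatorname{rank}\!\left(H W^\top\right) &\le d
\end{align}
```

The softmax only adds a per-row constant (the log-normalizer), so fitting the language exactly amounts to factorizing A, up to that row shift, into two matrices whose inner dimension is d.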
rank(A) is limited to d
>>> import numpy as np
>>> # Every row is a multiple of [1, 2, 3, 4, 5, 6], so the rank is 1.
>>> x = np.array([[ 0,  0,  0,  0,  0,  0],
...               [ 1,  2,  3,  4,  5,  6],
...               [ 2,  4,  6,  8, 10, 12],
...               [ 3,  6,  9, 12, 15, 18],
...               [ 4,  8, 12, 16, 20, 24],
...               [ 5, 10, 15, 20, 25, 30],
...               [ 6, 12, 18, 24, 30, 36],
...               [ 7, 14, 21, 28, 35, 42],
...               [ 8, 16, 24, 32, 40, 48],
...               [ 9, 18, 27, 36, 45, 54]])
>>> np.linalg.matrix_rank(x)
1
large matrix, low diversity
If the matrix A for a natural language is high-rank,
then no matter how expressive the neural network is,
the softmax layer will be the limiting factor
Increase rank of A
– Make word embedding size d larger?
Introduces too many parameters
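A rough back-of-the-envelope sketch of why this gets expensive: the output embedding matrix alone holds V × d weights, and the rank can only grow as fast as d does. The vocabulary size and the values of `d` below are illustrative assumptions, not figures from the slides.

```python
def output_layer_params(vocab_size, d):
    """Number of weights in a vocab_size x d output embedding matrix (bias ignored)."""
    return vocab_size * d

V = 10_000  # illustrative vocabulary size (assumption)

for d in (400, 800, 1600, 3200):
    n = output_layer_params(V, d)
    print(f"d={d:5d}: rank bound {d:5d}, {n:,} output-embedding weights")
```

The recurrent layers that produce the d-dimensional hidden vector also grow with d, so the real cost is higher than this count suggests.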
Mixture of Softmaxes
– Compute many independent softmaxes and mix them
model's hidden vector
make K hidden vectors
added parameters
compute K softmaxes
mix K softmaxes
How to mix?
learned parameter
weighted average
Weighted sum with learned coefficients
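The steps above can be sketched in NumPy as follows. The slides only name the steps, so the details are assumptions borrowed from the formulation in Yang et al. (2017): each of the K hidden vectors is a tanh of a linear projection of the model's hidden vector, and the mixture weights are a softmax over another linear projection. The names `P_k`, `W_pi`, and `W_out` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_softmaxes(h, P_k, W_pi, W_out):
    """
    h     : (d,)      model's hidden vector
    P_k   : (K, d, d) projections that make K hidden vectors (the added parameters)
    W_pi  : (K, d)    projection that produces the K mixture weights
    W_out : (V, d)    output word embeddings, shared by all K softmaxes
    """
    h_k = np.tanh(P_k @ h)             # (K, d)  make K hidden vectors
    pi = softmax(W_pi @ h)             # (K,)    learned mixing coefficients
    components = softmax(h_k @ W_out.T)  # (K, V)  compute K softmaxes
    return pi @ components             # (V,)    weighted sum of the K softmaxes

rng = np.random.default_rng(0)
d, V, K = 16, 50, 3                    # illustrative sizes (assumption)
h = rng.normal(size=d)
P_k = rng.normal(size=(K, d, d)) / np.sqrt(d)
W_pi = rng.normal(size=(K, d))
W_out = rng.normal(size=(V, d))

p = mixture_of_softmaxes(h, P_k, W_pi, W_out)
print(p.shape, p.sum())
```

Since the mixing coefficients sum to 1 and each component is a valid distribution, the mixture is a valid distribution too. The mixing happens in probability space, not in logit space, which is what lets the log-probability matrix escape the rank limit.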
Softmax (matrix rank) | K=1 | K=2 | K=3 | K=4 | K=5 |
---|---|---|---|---|---|
Traditional | 34 | 34 | 34 | 34 | 34 |
Mixture | 34 | 629 | 979 | 995 | 997 |
http://smerity.com/articles/2017/mixture_of_softmaxes.html
d=32, V=1000, N=2048
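A sketch of how a rank comparison of this kind could be run with the sizes listed above. It uses randomly initialized parameters rather than a trained model, and the mixture follows the same assumed formulation as Yang et al. (2017) with hypothetical names `P_k` and `W_pi`. The exact ranks depend on initialization and floating-point precision, so they need not match the table.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
d, V, N, K = 32, 1000, 2048, 3

H = rng.normal(size=(N, d))   # one hidden vector per context
W = rng.normal(size=(V, d))   # one output embedding per word

# Traditional softmax: log-probabilities are H W^T minus a per-row constant.
log_p_single = log_softmax(H @ W.T)                    # (N, V)

# Mixture of K softmaxes (random, untrained parameters).
P_k = rng.normal(size=(K, d, d)) / np.sqrt(d)
W_pi = rng.normal(size=(K, d))
H_k = np.tanh(np.einsum("kij,nj->kni", P_k, H))        # (K, N, d)
log_pi = log_softmax(H @ W_pi.T)                       # (N, K)
log_comp = log_softmax(H_k @ W.T)                      # (K, N, V)

# log sum_k pi_k * softmax_k, computed with a stable log-sum-exp over k.
joint = log_comp + log_pi.T[:, :, None]                # (K, N, V)
m = joint.max(axis=0)
log_p_mix = m + np.log(np.exp(joint - m).sum(axis=0))  # (N, V)

rank_single = np.linalg.matrix_rank(log_p_single)
rank_mix = np.linalg.matrix_rank(log_p_mix)
print("traditional softmax rank:", rank_single)
print(f"mixture of {K} softmaxes rank:", rank_mix)
```

For the single softmax the rank cannot exceed d + 1 (rank d from H W^T plus one for the row-wise normalizer), while the logarithm of a mixture is no longer a low-rank product.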
Model | Params | PTB (perplexity) | WikiText-2 (perplexity) |
---|---|---|---|
AWD-LSTM | 24M | 57.7 | 65.8 |
+ mixture of softmaxes | 22M | 54.44 | 61.45 |
State-of-the-art