Oleksiy Syvokon
research engineer
cats drink ...
P(milk | cats drink) = 0.7
P(water | cats drink) = 0.2
P(wine | cats drink) = 0.0001
P(bricks | cats drink) = 0.000000001
Autocompletion
Speech recognition
Just FYI
Just F why I?
Just FBI
Machine translation
This is good =>
| Це є добре ("This is good", with an awkward explicit copula)
| Це є благо ("This is a blessing")
| Це добре ("This is good", the natural phrasing)
Text generation:
chatbots,
text summarization,
question answering,
image captioning,
...
Transfer learning
Improvements to language models lead to improvements on virtually all NLP tasks
1. Direct metric (WER, BLEU...)
2. Perplexity
weighted average branching factor
two and two make ...
guys were drinking ...
cows were drinking ...
$$ PP(\{0,1,2,3,4,5,6,7,8,9\}) = 10 $$
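For a test sequence $w_1 \dots w_N$, perplexity is the inverse probability of the text, normalized per word:

$$ PP(w_1 \dots w_N) = P(w_1 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}} $$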
Perplexity: lower is better.
n-gram | count |
---|---|
cats | 11,913,675 |
cats drink | 1,986 |
cats drink milk | 92 |
drink milk | 95,387 |
drink | 28,677,196 |
milk | 23,639,284 |
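From counts like these, an n-gram model estimates next-word probabilities as simple count ratios (a sketch using the numbers above; real models such as the Kneser–Ney 5-gram below add smoothing for unseen n-grams):

```python
# Maximum-likelihood trigram estimate from raw counts (no smoothing).
counts = {
    "cats": 11_913_675,
    "cats drink": 1_986,
    "cats drink milk": 92,
}

# P(milk | cats drink) = count("cats drink milk") / count("cats drink")
p_milk_given_cats_drink = counts["cats drink milk"] / counts["cats drink"]
print(f"P(milk | cats drink) ≈ {p_milk_given_cats_drink:.4f}")  # ≈ 0.0463
```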
Model | PTB test PPL |
---|---|
Kneser–Ney 5-gram | 141.2 |
(figure: a language model reads "cows drink" and predicts the next word: ???)
Model | PTB test PPL |
---|---|
Kneser–Ney 5-gram | 141.2 |
Plain LSTM | 121.1 |
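The "Plain LSTM" row corresponds to a model along these lines (a minimal sketch; layer sizes are illustrative, not the exact configuration behind the number above):

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Minimal word-level LSTM LM: embed -> LSTM -> project to vocabulary."""
    def __init__(self, vocab_size, d_embed=400, d_hidden=400, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.lstm = nn.LSTM(d_embed, d_hidden, num_layers, batch_first=True)
        self.decoder = nn.Linear(d_hidden, vocab_size)

    def forward(self, tokens, state=None):      # tokens: (batch, seq_len)
        x = self.embed(tokens)
        output, state = self.lstm(x, state)
        logits = self.decoder(output)            # (batch, seq_len, vocab)
        return logits, state
```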
LSTM: Long Short-Term Memory
GRU: Gated Recurrent Unit
RHN: Recurrent Highway Network
NAS: Neural Architecture Search with Reinforcement Learning
...
Dropout
Batch normalization
Recurrent matrix regularization
Reducing the number of trainable parameters
...
Dropout: a simple way to prevent neural networks from overfitting (Srivastava et al., 2014)
A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 20M | 121.1 |
+ embed dropout | 20M | 86.5 |
Dropout: a simple way to prevent neural networks from overfitting (Srivastava et al., 2014)
bad for RNNs!
A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)
same mask for all timesteps
(but different for each sample in a mini-batch)
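A minimal sketch of this "locked" dropout, assuming inputs shaped (batch, time, features); the function name is illustrative:

```python
import torch

def variational_dropout(x, p=0.5, training=True):
    """Apply the same dropout mask at every timestep of a sequence.
    x: (batch, seq_len, features); each sample in the batch gets its own mask."""
    if not training or p == 0:
        return x
    # one mask per sample, broadcast over the time dimension
    mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
    return x * mask
```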
A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 20M | 121.1 |
+ embed dropout | 20M | 86.5 |
+ variational dropout | 20M | 78.6 |
A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 66M | 127.4 |
+ embed dropout | 66M | 86.0 |
+ variational dropout | 66M | 73.4 |
Regularizing and Optimizing LSTM Language Models (Merity et al., 2017)
Variational dropout: alters LSTM internals; good results.
Weight-dropped LSTM: drop LSTM weights, then run the LSTM as usual; good results, with no changes to LSTM internals.
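A sketch of the weight-drop idea, shown on a plain tanh RNN step instead of an LSTM to keep it short; all names are illustrative and this is not the AWD-LSTM implementation:

```python
import torch
import torch.nn.functional as F

def dropconnect_rnn_step(x_t, h_prev, W_ih, W_hh, b, p=0.5, training=True):
    """One step of a vanilla RNN with DropConnect on the recurrent matrix:
    drop entries of W_hh, then compute the step exactly as usual.
    x_t: (batch, d_in), h_prev: (batch, d_hidden),
    W_ih: (d_hidden, d_in), W_hh: (d_hidden, d_hidden), b: (d_hidden,)."""
    W_hh_dropped = F.dropout(W_hh, p=p, training=training)  # mask the weights
    return torch.tanh(x_t @ W_ih.T + h_prev @ W_hh_dropped.T + b)
```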
Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)
Input embeddings map words to vectors; output embeddings map hidden vectors back to scores over words. They are two separate matrices of the same shape.
Use a single embedding matrix for both input and output: make W = V!
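A minimal sketch of weight tying in this spirit (sizes illustrative; it assumes the embedding size equals the LSTM hidden size so the two matrices match):

```python
import torch.nn as nn

class TiedLSTMLanguageModel(nn.Module):
    """Weight tying: the output projection shares its matrix with the input embedding."""
    def __init__(self, vocab_size, d_model=400, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, num_layers, batch_first=True)
        self.decoder = nn.Linear(d_model, vocab_size)
        self.decoder.weight = self.embed.weight      # tie: W = V

    def forward(self, tokens, state=None):
        output, state = self.lstm(self.embed(tokens), state)
        return self.decoder(output), state
```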
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 66M | 127.4 |
+ weights tying | 51M | 74.3 |
+ variational dropout | 51M | 73.2 |
On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 66M | 127.4 |
+ embed dropout | 66M | 86.0 |
+ variational dropout | 66M | 73.4 |
+ weights tying + all dropouts | 24M (4-layer LSTM) | 58.3 |
+ weights tying + all dropouts | 10M (1-layer LSTM) | 59.6 |
Limited expressivity!
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
rank(A) is limited to d
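Why the rank is limited (a reconstruction; the exact definition of A on the slide is assumed): stacking the hidden states of N contexts into H and using output embeddings W, the logit matrix is

$$ A = H W^{\top}, \qquad H \in \mathbb{R}^{N \times d},\; W \in \mathbb{R}^{|V| \times d} \;\Rightarrow\; \operatorname{rank}(A) \le d \ll |V| $$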
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
How to increase the rank of A?
– Compute many softmaxes and mix them!
But... how do we get many softmaxes?
– Make projections! Added parameters project the model's hidden vector into K hidden vectors, one per softmax.
How to mix? Take a weighted average, with mixture weights given by a learned parameter.
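A sketch of a mixture-of-softmaxes output layer along these lines (class and parameter names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Mix K softmaxes computed from K projections of the hidden vector."""
    def __init__(self, d_hidden, vocab_size, k=5):
        super().__init__()
        self.k = k
        self.prior = nn.Linear(d_hidden, k)                   # mixture weights pi_k
        self.projection = nn.Linear(d_hidden, k * d_hidden)   # K hidden vectors
        self.decoder = nn.Linear(d_hidden, vocab_size)        # shared output embeddings

    def forward(self, h):                                     # h: (batch, d_hidden)
        batch, d = h.shape
        pi = F.softmax(self.prior(h), dim=-1)                 # (batch, K)
        h_k = torch.tanh(self.projection(h)).view(batch, self.k, d)
        probs_k = F.softmax(self.decoder(h_k), dim=-1)        # (batch, K, |V|)
        # weighted average of K softmaxes -> higher-rank log-probability matrix
        return (pi.unsqueeze(-1) * probs_k).sum(dim=1)        # (batch, |V|)
```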
Model | Parameters | PTB test PPL |
---|---|---|
AWD-LSTM | 24M | 57.7 |
+ mixture of softmaxes | 22M | 54.44 |
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
State-of-the-art as of 2017-11-26
Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)
Adapt model parameters to parts of the sequence during evaluation.
Example: the sequence "Thousands of far-right nationalists / gathered in Poland's capital / Warsaw for 'Independence March'" is split into segments $s_1, s_2, s_3$.
Each segment $s_i$ is scored as $P(s_i, \theta_i)$; after scoring, a gradient step on $\nabla L(s_i)$ updates $\theta_i$ into $\theta_{i+1}$ for the next segment.
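A sketch of this loop, assuming `model(inputs)` returns logits of shape (batch, seq_len, vocab); the paper's actual update rule is an RMS-style one, plain SGD here is a simplification:

```python
import math
import torch

def dynamic_eval(model, segments, lr=1e-4):
    """Score each segment, then take a gradient step on that segment's loss
    so the next segment is scored with adapted parameters."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss(reduction="sum")
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in segments:                 # s_1, s_2, s_3, ...
        logits = model(inputs)
        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
        total_loss += loss.item()
        total_tokens += targets.numel()
        optimizer.zero_grad()
        loss.backward()                              # gradient of L(s_i)
        optimizer.step()                             # theta_{i+1}
    return math.exp(total_loss / total_tokens)       # test perplexity
```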
Model | Parameters | PTB test PPL |
---|---|---|
AWD-LSTM | 24M | 57.7 |
+ dynamic eval | 24M | 51.1 |
+ mixture of softmaxes | 24M | 47.69 |
Improving Neural Language Models with a Continuous Cache (Grave et al., 2016)
Store hidden vectors with the corresponding next words
Make a prediction based on the similarity of the current hidden vector to the cached hidden states
Final prediction is a linear combination of cache prediction and "normal" model output.
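A sketch of the cache distribution and the interpolation; the hyperparameter names (theta for the similarity scale, lam for the interpolation weight) are assumptions:

```python
import torch
import torch.nn.functional as F

def cache_probs(h_t, cached_h, cached_words, vocab_size, model_probs,
                theta=0.3, lam=0.1):
    """Continuous-cache-style prediction.
    h_t: (d,) current hidden state; cached_h: (T, d) stored hidden states;
    cached_words: (T,) the words that followed them; model_probs: (|V|,)."""
    # similarity of the current hidden state to each cached state
    scores = theta * (cached_h @ h_t)                              # (T,)
    weights = F.softmax(scores, dim=0)
    # scatter the similarity mass onto the corresponding next words
    p_cache = torch.zeros(vocab_size).scatter_add_(0, cached_words, weights)
    # linear combination of cache prediction and "normal" model output
    return (1 - lam) * model_probs + lam * p_cache
```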
* Combine n-gram and neural LMs
* Large vocabulary problem:
- efficient softmax approximations
- subword models (characters, BPE, syllables)
* Model compression
- weight pruning
- word embedding compression
* More adaptive models
We are hiring!