Language modelling:
recent advances
Oleksiy Syvokon
research engineer
What is a language model?
cats drink ...
P(milk) = 0.7
P(water) = 0.25
P(wine) = 0.0001
P(bricks) = 0.000000001
Why should I care?
Why do we need an LM?
Autocompletion
Why do we need an LM?
Speech recognition
Just FYI
Just F why I?
Just FBI
Why do we need an LM?
Machine translation
This is good =>
| Це є добре
| Це є благо
| Це добре
Why do we need an LM?
Text generation:
chatbots,
text summarization,
question answering,
image captioning,
...
Why do we need an LM?
Transfer learning
- word embeddings
- pretraining decoder
- secondary objective task
Why do we need an LM?
Improvements to language models lead to improvements on virtually all NLP tasks
Evaluation
1. Direct metric (WER, BLEU...)
2. Perplexity
weighted average branching factor
two and two make ...
guys were drinking ...
cows were drinking ...
$$ PP(\{0,1,2,3,4,5,6,7,8,9\}) = 10 $$
Evaluation
1. Direct metric (WER, BLEU...)
2. Perplexity
lower is better
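For reference, the definition behind these slides (a standard formulation, not quoted from the deck): perplexity is the exponentiated average negative log-likelihood per token, i.e. the effective number of equally likely next words the model is choosing between.

$$ \mathrm{PP}(w_1 \dots w_N) = P(w_1 \dots w_N)^{-1/N} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right) $$

A model that spreads probability uniformly over the ten digits assigns $P = 1/10$ to every next token, hence $\mathrm{PP} = 10$ in the example above; a model that is nearly certain after "two and two make ..." has perplexity close to 1.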
Count-based
language models
n-gram models
n-gram | count |
---|---|
cats | 11,913,675 |
drink | 28,677,196 |
milk | 23,639,284 |
cats drink | 1,986 |
drink milk | 95,387 |
cats drink milk | 92 |
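To make the link from counts to probabilities explicit (an illustration using the counts above; this is the maximum-likelihood estimate, before smoothing such as Kneser–Ney):

$$ P(\text{milk} \mid \text{cats drink}) = \frac{\mathrm{count}(\text{cats drink milk})}{\mathrm{count}(\text{cats drink})} = \frac{92}{1986} \approx 0.046 $$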
n-gram models
Model | PTB test PPL |
---|---|
Kneser–Ney 5-gram | 141.2 |
neural language models
RNN Language model
cows
drink
???
RNN Language model
Model | PTB test PPL |
---|---|
Kneser–Ney 5-gram | 141.2 |
Plain LSTM | 121.1 |
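To make the picture concrete, here is a minimal PyTorch sketch of an RNN language model of this kind (my own illustration, not code from the talk; the class name and layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal LSTM language model: embed -> LSTM -> project to vocabulary."""
    def __init__(self, vocab_size, emb_dim=400, hidden_dim=400, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) word ids
        emb = self.embed(tokens)              # (batch, seq_len, emb_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.decoder(out)            # (batch, seq_len, vocab_size)
        return logits, hidden

# P(next word | "cows drink"): feed the prefix, softmax over the last step.
# model = RNNLM(vocab_size=10000)
# logits, _ = model(torch.tensor([[cows_id, drink_id]]))
# probs = logits[0, -1].softmax(dim=-1)
```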
advances
rnn architecture
LSTM Long short-term memory
GRU Gated Recurrent Unit
RHN Recurrent Highway Network
NAS Neural Architecture Search with Reinforcement Learning
. . .
Regularization
Dropout
Batch normalization
Recurrent matrix regularization
Trainable parameters reduction
...
Dropout
Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al., 2014)
Embed (input) dropout
Regularization
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks (Gal & Ghahramani, 2016)
cows
drink
Regularization
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks (Gal & Ghahramani, 2016)
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 20M | 121.1 |
+ embed dropout | 20M | 86.5 |
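A sketch of what embed (input) dropout does in practice, assuming a PyTorch `nn.Embedding` layer (the helper below is my illustration, not the paper's code): entire word types are dropped, so every occurrence of a dropped word is zeroed identically.

```python
import torch
import torch.nn as nn

def embedding_dropout(embed: nn.Embedding, tokens, p=0.1, training=True):
    """Drop entire rows of the embedding matrix (whole word types)."""
    if not training or p == 0:
        return embed(tokens)
    vocab_size = embed.weight.size(0)
    # One Bernoulli draw per word type, rescaled to keep the expected value.
    mask = embed.weight.new_empty((vocab_size, 1)).bernoulli_(1 - p) / (1 - p)
    return nn.functional.embedding(tokens, embed.weight * mask,
                                   padding_idx=embed.padding_idx)
```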
Embed (input) dropout
Standard dropout
Regularization
Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al., 2014)
bad for RNNs!
variational dropout
Regularization
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks (Gal & Ghahramani, 2016)
same mask for all timesteps
(but different for each sample in a mini-batch)
variational dropout
Regularization
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks (Gal & Ghahramani, 2016)
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 20M | 121.1 |
+ embed dropout | 20M | 86.5 |
+ variational dropout | 20M | 78.6 |
variational dropout
Regularization
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks (Gal & Ghahramani, 2016)
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 66M | 127.4 |
+ embed dropout | 66M | 86.0 |
+ variational dropout | 66M | 73.4 |
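A sketch of the variational ("locked") dropout mask, assuming activations shaped `(batch, seq_len, hidden_dim)`; the class name follows common open-source usage rather than the slides:

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Variational dropout for RNN activations: sample ONE mask per sequence
    and reuse it at every timestep (masks still differ across the batch)."""
    def forward(self, x, p=0.5):
        # x: (batch, seq_len, hidden_dim)
        if not self.training or p == 0:
            return x
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
        return x * mask  # broadcast the same mask over the time dimension
```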
alters LSTM internals, good results
Weight-dropped LSTM
Regularization
Regularizing and Optimizing LSTM Language Models (Merity et al., 2017)
drop LSTM weights,
then run as usual
good results
no LSTM changes
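A sketch of the idea with an explicit recurrence so the DropConnect step is visible (my illustration; Merity et al.'s implementation patches the recurrent weights of a stock `nn.LSTM` in place rather than re-implementing the cell):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLSTM(nn.Module):
    """DropConnect on the hidden-to-hidden weights: drop whole recurrent
    connections once per sequence, then run the LSTM recurrence as usual."""
    def __init__(self, input_dim, hidden_dim, weight_p=0.5):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.weight_p = weight_p
        # Standard LSTM parameters (gates stacked as input, forget, cell, output).
        self.weight_ih = nn.Parameter(torch.randn(4 * hidden_dim, input_dim) * 0.1)
        self.weight_hh = nn.Parameter(torch.randn(4 * hidden_dim, hidden_dim) * 0.1)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_dim))

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.hidden_dim)
        c = x.new_zeros(batch, self.hidden_dim)
        # The dropped recurrent matrix is sampled once and reused at every step.
        w_hh = F.dropout(self.weight_hh, p=self.weight_p, training=self.training)
        outputs = []
        for t in range(seq_len):
            gates = F.linear(x[:, t], self.weight_ih, self.bias) + F.linear(h, w_hh)
            i, f, g, o = gates.chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs, dim=1), (h, c)
```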
weight tying
Regularization
Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)
cows
input embeddings
output embeddings
Use a single embedding matrix for both input and output!
Make W = V!
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 66M | 127.4 |
+ weight tying | 51M | 74.3 |
+ variational dropout | 51M | 73.2 |
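A sketch of weight tying in PyTorch (illustrative, not the paper's code; it assumes the embedding and LSTM hidden sizes match, otherwise a projection is needed in between):

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """The output projection reuses the input embedding matrix (W = V),
    saving roughly vocab_size * emb_dim parameters."""
    def __init__(self, vocab_size, emb_dim=400, hidden_dim=400):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(emb_dim, vocab_size, bias=False)
        self.decoder.weight = self.embed.weight  # tie: one matrix for both ends

    def forward(self, tokens, hidden=None):
        out, hidden = self.lstm(self.embed(tokens), hidden)
        return self.decoder(out), hidden
```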
output dropout
Regularization
On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)
intra-layer dropout
Regularization
On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)
Everything combined
Regularization
On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)
Model | Parameters | PTB test PPL |
---|---|---|
Non-regularized LSTM | 66M | 127.4 |
+ embed dropout | 66M | 86.0 |
+ variational dropout | 66M | 73.4 |
+ weight tying + all dropouts | 24M (4-layer LSTM) | 58.3 |
+ weight tying + all dropouts | 10M (1-layer LSTM) | 59.6 |
Softmax
Softmax bottleneck
Limited expressivity!
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
rank(A) is limited to d
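The argument in one line (a standard restatement, not copied from the slides): the matrix $A$ of next-word logits over all contexts factorizes through the $d$-dimensional hidden states, so a single softmax layer can only express distributions of rank at most $d$, which is far smaller than the vocabulary size:

$$ A = H W^\top, \qquad H \in \mathbb{R}^{N \times d},\; W \in \mathbb{R}^{|V| \times d} \;\Rightarrow\; \mathrm{rank}(A) \le d \ll |V| $$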
Softmax Bottleneck
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
How to increase rank of A?
– Compute many softmaxes and mix them!
Softmax Bottleneck
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
But... how do we get many softmaxes?
– Make projections!
$$ h_k = \tanh(W_k h), \qquad k = 1, \dots, K $$
($h$: the model's hidden vector; $W_k$: added parameters; this makes $K$ hidden vectors)
Softmax Bottleneck
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
How to mix?
$$ P(w \mid h) = \sum_{k=1}^{K} \pi_k \,\mathrm{softmax}(W h_k)_w $$
($\pi_k$: learned mixture weights; the result is a weighted average of the $K$ softmaxes)
Mixture of softmaxes
Model | Parameters | PTB test PPL |
---|---|---|
AWD-LSTM | 24M | 57.7 |
+ mixture of softmax | 22M | 54.44 |
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)
State-of-the-art as of 2017-11-26
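A sketch of a mixture-of-softmaxes output layer (shapes, names and defaults are mine; the actual AWD-LSTM-MoS code also ties the decoder with the input embeddings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Project the hidden state into K vectors, take K softmaxes over the
    vocabulary, and mix them with learned weights; the mixture is no longer
    limited to rank d."""
    def __init__(self, hidden_dim, emb_dim, vocab_size, k=5):
        super().__init__()
        self.k = k
        self.prior = nn.Linear(hidden_dim, k)             # mixture weights pi_k
        self.latent = nn.Linear(hidden_dim, k * emb_dim)  # K projected vectors h_k
        self.decoder = nn.Linear(emb_dim, vocab_size)     # output embeddings W

    def forward(self, h):
        # h: (batch, hidden_dim) -> (batch, vocab_size) probabilities
        pi = F.softmax(self.prior(h), dim=-1)                       # (batch, K)
        hk = torch.tanh(self.latent(h)).view(-1, self.k, self.decoder.in_features)
        probs = F.softmax(self.decoder(hk), dim=-1)                 # (batch, K, V)
        return (pi.unsqueeze(-1) * probs).sum(dim=1)  # weighted average; .log() for NLL
```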
Can we beat sota?
adaptive models
Dynamic evaluation
Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)
Adapt model parameters to parts of the sequence during evaluation.
[Diagram: the test text "Thousands of far-right nationalists / gathered in Poland's capital / Warsaw for 'Independence March'" is split into segments $s_1, s_2, s_3$; each segment is scored with the current parameters, $\text{model}(s_i, \theta_i) \rightarrow P(s_i, \theta_i)$, and the gradient $\nabla L(s_i)$ is then used to update $\theta_i \rightarrow \theta_{i+1}$ before the next segment.]
Model | Parameters | PTB test PPL |
---|---|---|
AWD-LSTM | 24M | 57.7 |
+ dynamic eval | 24M | 51.1 |
+ mixture of softmax | 24M | 47.69 |
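A simplified sketch of dynamic evaluation, assuming a model with the `(logits, hidden)` interface of the earlier RNN sketch; the paper uses an RMS-style update with decay back towards the original parameters, which is replaced here by plain SGD:

```python
import torch
import torch.nn.functional as F

def dynamic_evaluation(model, segments, lr=1e-3):
    """Score each test segment, then take a gradient step on it so the
    parameters adapt to the recently seen text."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, total_tokens = 0.0, 0
    hidden = None
    for seg in segments:                         # seg: (1, seg_len) word ids
        inputs, targets = seg[:, :-1], seg[:, 1:]
        logits, hidden = model(inputs, hidden)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()
        optimizer.zero_grad()
        loss.backward()                          # adapt theta_i -> theta_{i+1}
        optimizer.step()
        hidden = tuple(h.detach() for h in hidden)
    return torch.exp(torch.tensor(total_loss / total_tokens))  # test perplexity
```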
neural cache
Improving Neural Language Models with a Continuous Cache (Grave et al., 2016)
Store hidden vectors with the corresponding next words
Make a prediction based on the current hidden vector's similarity to the cached hidden states
Final prediction is a linear combination of cache prediction and "normal" model output.
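A sketch of the cache prediction (function and variable names are mine; `theta` controls the flatness of the cache distribution and `lam` the interpolation weight, as in the paper):

```python
import torch
import torch.nn.functional as F

def cache_lm_probs(model_probs, h_t, cache_h, cache_words, vocab_size,
                   theta=0.3, lam=0.1):
    """Interpolate the base LM distribution with a distribution built from
    the similarity of the current hidden state to cached hidden states."""
    # model_probs: (V,) softmax output of the base LM
    # h_t: (d,) current hidden state
    # cache_h: (T, d) stored hidden states; cache_words: (T,) long tensor of
    # the words that followed each stored state
    scores = cache_h @ h_t                      # similarity to each cached state
    weights = F.softmax(theta * scores, dim=0)  # (T,)
    cache_probs = torch.zeros(vocab_size)
    cache_probs.index_add_(0, cache_words, weights)  # sum weights per word
    return (1 - lam) * model_probs + lam * cache_probs
```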
out of scope:
* Combine n-gram and neural LMs
* Large vocabulary problem:
- efficient softmax approximations
- subword models (characters, BPE, syllables)
* Model compression
- weight pruning
- word embedding compression
* More adaptive models
Questions?
We are hiring!