Language modelling:

recent advances

Oleksiy Syvokon

research engineer

What is a language model?

P(\text{cats drink milk})=0.0009
P(\text{cats drinks milk})=0.0002
P(\text{cats window red})=0.0000001

What is a language model?

cats drink ...
P(milk)     = 0.7
P(water)    = 0.2
P(wine)     = 0.0001
P(bricks)   = 0.000000001

Why should I care?

do we need LMs?

Autocompletion

do we need LMs?

Speech recognition

Just FYI
Just F why I?
Just FBI

do we need LMs?

Machine translation

                | Це є добре
This is good => | Це є благо
                | Це добре

do we need LMs?

Text generation:

chatbots,

text summarization,

question answering,

image captioning,

...

do we need LMs?

Transfer learning

  • word embeddings
  • pretraining decoder
  • secondary objective task

do we need LMs?

Improvements to language models lead to improvements on virtually all NLP tasks

Evaluation

1. Direct metric (WER, BLEU...)

2. Perplexity

PP(W) = P(w_1 w_2 w_3 ... w_N)^{-\frac{1}{N}}

weighted average branching factor

two and two make ...
guys were drinking ...
cows were drinking ...

$$ PP(\{0,1,2,3,4,5,6,7,8,9\}) = 10 $$
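
A minimal sketch of computing perplexity from per-token probabilities (not from the slides; the function and variable names are illustrative). Working in log-space avoids underflow on long sequences:

import math

def perplexity(token_probs):
    # token_probs[i] = P(w_i | w_1 ... w_{i-1}) as assigned by some language model.
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# A uniform model over the ten digits assigns P = 0.1 to every token,
# so its perplexity is exactly 10, the branching factor from the slide.
print(perplexity([0.1] * 20))   # ~10.0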

Evaluation

1. Direct metric (WER, BLEU...)

2. Perplexity

lower is better

Count-based
language models

n-gram models

cats              11,913,675
cats drink        1,986
cats drink milk   92
drink milk        95,387
drink             28,677,196
milk              23,639,284

n-gram models

Model               PTB test PPL
Kneser–Ney 5-gram   141.2
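
With counts like the ones above, the maximum-likelihood estimate is P(milk | cats drink) = count("cats drink milk") / count("cats drink") = 92 / 1,986 ≈ 0.046. A toy bigram version of that idea (illustrative only; a real Kneser–Ney 5-gram adds smoothing and back-off on top of the raw counts):

from collections import Counter

def train_bigram_lm(tokens):
    # Raw MLE bigram model: P(w | prev) = count(prev, w) / count(prev).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

corpus = "cats drink milk cats drink water cows drink water".split()
p = train_bigram_lm(corpus)
print(p("drink", "water"))   # 2/3
print(p("drink", "milk"))    # 1/3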

neural language models

RNN Language model

cows drink ???

[Diagram: an unrolled RNN reads "cows", then "drink"; the hidden states h_{1,1}, h_{1,2} and h_{2,1}, h_{2,2} each feed a softmax over the vocabulary, e.g. P("mooing") = 0.002, P("drink") = 0.005, ... after "cows", and P("water") = 0.007, P("beer") = 0.0001, ... after "cows drink".]

RNN Language model

Model               PTB test PPL
Kneser–Ney 5-gram   141.2
Plain LSTM          121.1
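
A minimal PyTorch sketch of the plain LSTM language model pictured above (not the exact model behind the 121.1 number; the sizes and names are illustrative assumptions):

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    # Embed the words, run an LSTM, project each hidden state to a
    # distribution over the next word.
    def __init__(self, vocab_size=30000, emb_dim=250, hidden_dim=650, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) word ids, e.g. "cows drink ..."
        x = self.embed(tokens)
        h, state = self.lstm(x, state)   # h: (batch, seq_len, hidden_dim)
        return self.decoder(h), state    # next-word logits at every position

model = RNNLanguageModel()
prefix = torch.randint(0, 30000, (1, 2))                 # a two-word prefix
logits, _ = model(prefix)
next_word_probs = torch.softmax(logits[0, -1], dim=-1)   # P(w | prefix)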

advances

rnn architecture

LSTM   Long short-term memory

GRU      Gated Recurrent Unit

RHN     Recurrent Highway Network

NAS      Neural Architecture Search with Reinforcement Learning

   . . .




Regularization

Dropout

Batch normalization

Recurrent matrix regularization

Trainable parameter reduction

     . . . . . . . . .

 

Dropout

Dropout: a simple way to prevent neural networks from overfitting (Srivastava et al., 2014)

Embed (input) dropout

Regularization

A theoretically grounded application of dropout in recurrent neural network (Gal et al., 2016)

[Diagram: the embeddings of the input words "cows" and "drink" are dropped before they enter the hidden states h_{1,1} and h_{2,1}.]

Regularization

A theoretically grounded application of dropout in recurrent neural network (Gal et al., 2016)

Model                  Parameters   PTB test PPL
Non-regularized LSTM   20M          121.1
   + embed dropout     20M          86.5

Embed (input) dropout
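
A sketch of how embedding (input) dropout can be implemented, assuming PyTorch: whole rows of the embedding matrix are dropped, so every occurrence of a dropped word type is zeroed in the same way (the function name and arguments are illustrative):

import torch
import torch.nn.functional as F

def embedding_dropout(embed, tokens, p=0.1, training=True):
    # embed: nn.Embedding, tokens: LongTensor of word ids.
    if not training or p == 0:
        return embed(tokens)
    vocab_size = embed.weight.size(0)
    # One Bernoulli draw per word *type*, rescaled as in standard dropout.
    mask = embed.weight.new_empty((vocab_size, 1)).bernoulli_(1 - p) / (1 - p)
    return F.embedding(tokens, embed.weight * mask)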

Standard dropout

Regularization

Dropout: a simple way to prevent neural networks from overfitting (Srivastava et al., 2014)

[Diagram: standard dropout samples a new mask at every timestep across the hidden states h_{1,1} ... h_{3,2}.]

bad for RNNs!

variational dropout

Regularization

A theoretically grounded application of dropout in recurrent neural network (Gal et al., 2016)

[Diagram: the same dropout mask is reused at every timestep across the hidden states h_{1,1} ... h_{3,2}.]

same mask for all timesteps

(but different for each sample in a mini-batch)

variational dropout
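
A sketch of variational ("locked") dropout for a sequence tensor, assuming PyTorch: one mask per sequence, reused at every timestep. Gal et al. also apply the same trick to the recurrent connections inside the cell, which is why the slides note that it alters the LSTM internals; this simplified version only covers a layer's inputs or outputs:

import torch

def variational_dropout(x, p=0.5, training=True):
    # x: (batch, seq_len, dim)
    if not training or p == 0:
        return x
    # One mask per sample in the mini-batch, broadcast over all timesteps.
    mask = x.new_empty((x.size(0), 1, x.size(2))).bernoulli_(1 - p) / (1 - p)
    return x * mask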

Regularization

A theoretically grounded application of dropout in recurrent neural network (Gal et al., 2016)

Model                      Parameters   PTB test PPL
Non-regularized LSTM       20M          121.1
   + embed dropout         20M          86.5
   + variational dropout   20M          78.6

variational dropout

Regularization

A theoretically grounded application of dropout in recurrent neural network (Gal et al., 2016)

Model                      Parameters   PTB test PPL
Non-regularized LSTM       66M          127.4
   + embed dropout         66M          86.0
   + variational dropout   66M          73.4

alters LSTM internals

good results

Weight-dropped LSTM

Regularization

Regularizing and Optimizing LSTM Language Models (Merity et al., 2017)

[Diagram: dropout (DropConnect) is applied to the hidden-to-hidden weight matrices, not to the activations h_{1,1} ... h_{3,2}.]

drop LSTM weights,

then run as usual

good results

no LSTM changes
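
A sketch of the weight-drop (DropConnect) idea on a plain tanh recurrent cell, assuming PyTorch. Merity et al. apply it to the LSTM's hidden-to-hidden matrices, which is what lets them keep the standard (e.g. cuDNN) LSTM implementation untouched; the cell below is simplified for illustration and all names are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size, weight_p=0.5):
        super().__init__()
        self.w_ih = nn.Linear(input_size, hidden_size)
        self.w_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.weight_p = weight_p

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        # Drop individual entries of the recurrent matrix once per forward pass,
        # then run the recurrence exactly as usual.
        w_hh = F.dropout(self.w_hh, p=self.weight_p, training=self.training)
        h = x.new_zeros(x.size(0), self.w_hh.size(0))
        outputs = []
        for t in range(x.size(1)):
            h = torch.tanh(self.w_ih(x[:, t]) + h @ w_hh)
            outputs.append(h)
        return torch.stack(outputs, dim=1)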

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

[Diagram: "cows" -> h_{1,1} -> h_{1,2} -> P("mooing") = 0.002, P("drink") = 0.005, ...]

input embeddings:  W \in \mathbb{R}^{30000 \times 250}

output embeddings: V \in \mathbb{R}^{30000 \times 250}

Use a single embedding matrix for both input and output!

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

"cows":  c = [0 \ 0 \ 0 \ ... \ 0 \ 1 \ 0 \ 0], \quad c \in \mathbb{R}^N

W \in \mathbb{R}^{N \times d}, \ \text{e.g.} \ W \in \mathbb{R}^{30000 \times 250}

x_i = W^\top c, \quad x_i \in \mathbb{R}^{d}

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

h_i \in \mathbb{R}^{d}

V \in \mathbb{R}^{N \times d}, \ \text{e.g.} \ V \in \mathbb{R}^{30000 \times 250}

y_i^{\prime} = V h_i, \quad y^{\prime} \in \mathbb{R}^N

p_i = \text{softmax}(y_i^{\prime}), \quad p_i \in \mathbb{R}^N

P("mooing") = 0.002, P("drink") = 0.005, ...

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

V \in \mathbb{R}^{N \times d}

W \in \mathbb{R}^{N \times d}

Make W = V!
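
In PyTorch, tying comes down to sharing one parameter between the input embedding and the output projection (a minimal sketch; sizes follow the 30000 × 250 example above, and the hidden size must match the embedding size for the shapes to line up):

import torch.nn as nn

class TiedLanguageModel(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=250, hidden_dim=250):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)                 # W
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size, bias=False)   # V
        self.decoder.weight = self.embed.weight                        # make W = V

    def forward(self, tokens, state=None):
        h, state = self.lstm(self.embed(tokens), state)
        return self.decoder(h), state

Sharing one matrix is also why the parameter count drops from 66M to 51M in the tables that follow.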

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

Model                  Parameters   PTB test PPL
Non-regularized LSTM   66M          127.4

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

Model                  Parameters   PTB test PPL
Non-regularized LSTM   66M          127.4
   + weight tying      51M          74.3

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

Model                      Parameters   PTB test PPL
Non-regularized LSTM       66M          127.4
   + weight tying          51M          74.3
   + variational dropout   51M          73.2

output dropout

Regularization

On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)


[Diagram: dropout applied to the LSTM outputs h_{1,2} and h_{2,2}.]

intra-layer dropout

Regularization

[Diagram: dropout applied between LSTM layers, on the connections from h_{1,1}, h_{2,1} to h_{1,2}, h_{2,2}.]

On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)


Everything combined

Regularization

On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)


Model                      Parameters   PTB test PPL
Non-regularized LSTM       66M          127.4
   + embed dropout         66M          86.0
   + variational dropout   66M          73.4

Everything combined

Regularization

On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)


Model                              Parameters           PTB test PPL
Non-regularized LSTM               66M                  127.4
   + embed dropout                 66M                  86.0
   + variational dropout           66M                  73.4
   + weight tying + all dropouts   24M (4-layer LSTM)   58.3

Everything combined

Regularization

On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)


Model                              Parameters           PTB test PPL
Non-regularized LSTM               66M                  127.4
   + embed dropout                 66M                  86.0
   + variational dropout           66M                  73.4
   + weight tying + all dropouts   24M (4-layer LSTM)   58.3
   + weight tying + all dropouts   10M (1-layer LSTM)   59.6

Softmax

Softmax bottleneck

[Diagram: one hidden vector h feeds a single softmax: P("mooing") = 0.002, P("drink") = 0.005, ...]

Limited expressivity!

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

Softmax bottleneck

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

[Diagram: a single softmax from one hidden vector h, versus a mixture of three softmaxes from h_1, h_2, h_3, each giving a different distribution, e.g. P("mooing") = 0.099, P("drink") = 0.0002 / P("mooing") = 0.003, P("drink") = 0.001 / P("mooing") = 0.003, P("drink") = 0.001.]

Softmax Bottleneck

P_{\theta}(x|c) = \frac{\exp(h_c^\top w_x)}{\sum_{x'}\exp(h_c^\top w_{x'})}

H_\theta \in \mathbb{R}^{N \times d}, \quad W_\theta \in \mathbb{R}^{M \times d}, \quad H_\theta W_\theta^\top = A^\prime \in \mathbb{R}^{N \times M}

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

rank(A^\prime) is limited to d
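
A quick numpy check of that claim (the sizes are arbitrary illustrations): no matter how many contexts N and words M there are, the logit matrix H_\theta W_\theta^\top can never have rank above d.

import numpy as np

N, M, d = 500, 3000, 50             # contexts, vocabulary size, hidden size
rng = np.random.default_rng(0)
H = rng.standard_normal((N, d))     # one hidden vector per context
W = rng.standard_normal((M, d))     # output word embeddings

A_prime = H @ W.T                   # logits produced by a single softmax layer
print(np.linalg.matrix_rank(A_prime))   # 50, i.e. bounded by d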

Softmax Bottleneck

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

How to increase rank of A?

Compute many softmaxes and mix them!

Softmax Bottleneck

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

But... how do we get many softmaxes?

Make projections!

W_k \in \mathbb{R}^{K d \times d}   (added parameters)

\tanh(W_k h)   (makes K hidden vectors)

h \in \mathbb{R}^{d}   (the model's hidden vector)

Softmax Bottleneck

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

How to mix?

P_{\theta}(x|c) = \sum_{k=1}^K \pi_{c,k} \frac{\exp(h_{c,k}^\top w_x)}{\sum_{x'}\exp(h_{c,k}^\top w_{x'})}

\sum_{k=1}^K \pi_{c,k} = 1

learned parameters

weighted average
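
A sketch of a mixture-of-softmaxes output layer in PyTorch (the layer names and the choice K = 15 are illustrative assumptions; the real model also ties the output embeddings and applies the usual dropouts):

import torch
import torch.nn as nn

class MixtureOfSoftmaxes(nn.Module):
    def __init__(self, hidden_dim, vocab_size, k=15):
        super().__init__()
        self.k = k
        self.prior = nn.Linear(hidden_dim, k)                     # produces pi_{c,k}
        self.projection = nn.Linear(hidden_dim, k * hidden_dim)   # W_k: makes K hidden vectors
        self.decoder = nn.Linear(hidden_dim, vocab_size)          # output embeddings

    def forward(self, h):
        # h: (batch, hidden_dim), the RNN state for the current context c
        pi = torch.softmax(self.prior(h), dim=-1)                 # (batch, K), sums to 1
        h_k = torch.tanh(self.projection(h)).view(-1, self.k, h.size(-1))
        probs = torch.softmax(self.decoder(h_k), dim=-1)          # K softmaxes over the vocab
        return (pi.unsqueeze(-1) * probs).sum(dim=1)              # weighted average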

Mixture of softmaxes

Model                       Parameters   PTB test PPL
AWD-LSTM                    24M          57.7
   + mixture of softmaxes   22M          54.44

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

State-of-the-art as of 2017-11-26

Can we beat SOTA?

adaptive models

Dynamic evaluation

Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)

Adapt model parameters to parts of sequence during evaluation.

[Diagram: the test stream is split into segments s_1 = "Thousands of far-right nationalists", s_2 = "gathered in Poland's capital", s_3 = "Warsaw for 'Independence March'". Segment s_1 is scored with \text{model}(s_1, \theta_1); the gradient \nabla L(s_1) updates the parameters to \theta_2 before s_2 is scored, and \nabla L(s_2) gives \theta_3 for s_3.]
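
A simplified sketch of that loop, assuming the PyTorch LM interface from the earlier sketch (the model returns logits and its recurrent state). The paper's actual update rule is an RMSprop-style step with decay back toward the original parameters; plain SGD is used here only to show the idea:

import torch
import torch.nn.functional as F

def dynamic_evaluation(model, segments, lr=1e-4):
    # segments: iterable of (inputs, targets) pairs s_1, s_2, s_3, ...
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, state = 0.0, None
    for inputs, targets in segments:
        logits, state = model(inputs, state)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        total_loss += loss.item() * targets.numel()   # score s_i with theta_i ...
        optimizer.zero_grad()
        loss.backward()                               # ... then take grad L(s_i)
        optimizer.step()                              # gives theta_{i+1} for s_{i+1}
        state = tuple(s.detach() for s in state)      # carry the hidden state only
    return total_loss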

Dynamic evaluation

Adapt model parameters to parts of sequence during evaluation.

Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)

Model               Parameters   PTB test PPL
AWD-LSTM            24M          57.7
   + dynamic eval   24M          51.1

Dynamic evaluation

Adapt model parameters to parts of sequence during evaluation.

Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)

Model                       Parameters   PTB test PPL
AWD-LSTM                    24M          57.7
   + dynamic eval           24M          51.1
   + mixture of softmaxes   24M          47.69

neural cache

Improving Neural Language Models with a Continuous Cache (Grave et al., 2016)

Store hidden vectors with the corresponding next words

Make a prediction based on the similarity of the current hidden vector to the cached hidden states

Final prediction is a linear combination of cache prediction and "normal" model output.
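
A rough sketch of that interpolation, assuming PyTorch; theta (flatness of the cache distribution) and lam (interpolation weight) play the role of the paper's θ and λ, and all other names are illustrative:

import torch

def cache_next_word_probs(h, cache_states, cache_words, model_probs,
                          theta=0.3, lam=0.1):
    # h: (d,) current hidden state
    # cache_states: (T, d) stored hidden states; cache_words: (T,) words that followed them
    # model_probs: (vocab,) the model's ordinary softmax output for this step
    weights = torch.softmax(theta * (cache_states @ h), dim=0)   # similarity to the cache
    cache_probs = torch.zeros_like(model_probs)
    cache_probs.index_add_(0, cache_words, weights)   # mass goes to the cached next words
    return (1 - lam) * model_probs + lam * cache_probs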

out of scope:

* Combining n-gram and neural LMs

* Large vocabulary problem:

   - efficient softmax approximations

   - subword models (characters, BPE, syllables)

* Model compression

    - weight pruning

    - word embedding compression

* More adaptive models

Questions?

Recent Advances in Language Modelling

By Oleksiy Syvokon
