Language modelling:

recent advances

Oleksiy Syvokon

research engineer

What is a language model?

P(\text{cats drink milk}) = 0.0009
P(\text{cats drinks milk}) = 0.0002
P(\text{cats window red}) = 0.0000001

What is a language model?

cats drink ...
P(milk)     = 0.7
P(water)    = 0.2
P(wine)     = 0.0001
P(bricks)   = 0.000000001
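
Both slides describe the same object: by the chain rule, the sentence probability factors into exactly these next-word probabilities, e.g.

P(\text{cats drink milk}) = P(\text{cats}) \cdot P(\text{drink} \mid \text{cats}) \cdot P(\text{milk} \mid \text{cats drink})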

Why should I care?

do we need a LM?

Autocompletion

do we need a LM?

Speech recognition

Just FYI
Just F why I?
Just FBI

do we need a LM?

Machine translation

                | Це є добре ("it is good")
This is good => | Це є благо ("it is a blessing")
                | Це добре ("it's good")

do we need a LM?

Text generation:

chatbots,

text summarization,

question answering,

image captioning,

...

do we need a LM?

Transfer learning

  • word embeddings
  • pretraining decoder
  • secondary objective task

do we need a LM?

Improvements to language models lead to improvements on virtually all NLP tasks

Evaluation

1. Direct metric (WER, BLEU...)

2. Perplexity

PP(W) = P(w_1 w_2 w_3 ... w_N)^{-\frac{1}{N}}

weighted average branching factor

two and two make ...
guys were drinking ...
cows were drinking ...

$$ PP(\{0,1,2,3,4,5,6,7,8,9\}) = 10 $$
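
Worked in code, this is just the N-th root of the inverse sequence probability. A minimal sketch (plain Python, assuming we already have the model's probability for each token); the uniform-over-ten-digits case above indeed comes out as 10:

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the model's probability for each token."""
    n = len(token_probs)
    # Sum in log space to avoid underflow on long sequences.
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# A model that assigns 1/10 to every next digit:
print(perplexity([0.1] * 20))  # -> 10.0
```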

Evaluation

1. Direct metric (WER, BLEU...)

2. Perplexity

lower is better

Count-based
language models

n-gram models

cats              11,913,675
cats drink        1,986
cats drink milk   92
drink milk        95,387
drink             28,677,196
milk              23,639,284
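
From counts like these, a count-based model estimates the next-word probability by maximum likelihood; a tiny worked example with the numbers above (no smoothing, just the raw ratio):

```python
# Maximum-likelihood trigram estimate from the raw counts on the slide.
count = {
    "cats drink": 1_986,
    "cats drink milk": 92,
}

# P(milk | cats drink) = count(cats drink milk) / count(cats drink)
p_milk = count["cats drink milk"] / count["cats drink"]
print(f"P(milk | cats drink) = {p_milk:.3f}")  # ~0.046
```

Unseen n-grams make this estimate zero, which is why smoothed variants such as Kneser–Ney (next slide) are used in practice.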

n-gram models

Model               PTB test PPL
Kneser–Ney 5-gram   141.2

neural language models

RNN Language model

[Diagram: an RNN unrolled over the input "cows drink ???". Each word is embedded and fed through the recurrent hidden states (h_{1,1}, h_{1,2} at the first step, h_{2,1}, h_{2,2} at the second); at every step a softmax over the vocabulary gives the next-word distribution, e.g. P("mooing") = 0.002, P("drink") = 0.005, ... after "cows", and P("water") = 0.007, P("beer") = 0.0001, ... after "cows drink".]

RNN Language model

Model               PTB test PPL
Kneser–Ney 5-gram   141.2
Plain LSTM          121.1
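
For reference, a minimal sketch of the kind of model behind the "Plain LSTM" row, assuming PyTorch; the layer sizes here are illustrative, not taken from any paper:

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=400, hidden_dim=400, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) word ids
        x = self.embed(tokens)
        output, hidden = self.lstm(x, hidden)
        logits = self.decoder(output)   # (batch, seq_len, vocab_size)
        return logits, hidden

# Training minimizes next-word cross-entropy, i.e. the log of the perplexity:
# loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), targets.flatten())
```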

advances

rnn architecture

LSTM   Long short-term memory

GRU      Gated Recurrent Unit

RHN     Recurrent Highway Network

NAS      Neural Architecture Search with Reinforcement Learning

   . . .




Regularization

Dropout

Batch normalization

Recurrent matrix regularization

Trainable parameters reduction

     . . .

Dropout

Dropout: a simple way to prevent neural networks from overfitting (Srivastava et al., 2014)

Embed (input) dropout

Regularization

A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)

[Diagram: dropout applied to the input word embeddings ("cows", "drink") before the first-layer hidden states h_{1,1}, h_{2,1}]

Regularization

A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)

Model                   Parameters   PTB test PPL
Non-regularized LSTM    20M          121.1
   + embed dropout      20M          86.5

Embed (input) dropout
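
One common way to implement embedding dropout (a sketch, not the authors' code): drop whole word types, i.e. zero entire rows of the embedding matrix, rather than individual units:

```python
import torch
import torch.nn as nn

def embedding_dropout(embed: nn.Embedding, tokens, p=0.1, training=True):
    """Zero out whole rows (word types) of the embedding matrix for this batch."""
    if not training or p == 0:
        return embed(tokens)
    # One keep/drop decision per vocabulary word, shared by every occurrence.
    mask = embed.weight.new_empty((embed.num_embeddings, 1)).bernoulli_(1 - p) / (1 - p)
    return nn.functional.embedding(tokens, embed.weight * mask,
                                   padding_idx=embed.padding_idx)
```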

Standard dropout

Regularization

Dropout: a simple way to prevent neural networks from overfitting (Srivastava et al., 2014)

[Diagram: standard dropout samples a new mask for the hidden states h_{1,1} ... h_{3,2} at every timestep]

bad for RNNs!

variational dropout

Regularization

A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)

[Diagram: variational dropout reuses a single mask for the hidden states h_{1,1} ... h_{3,2} across all timesteps]

same mask for all timesteps

(but different for each sample in a mini-batch)
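
This "one mask, reused at every timestep" idea is often implemented as a locked dropout layer; a sketch, assuming activations shaped (batch, seq_len, features):

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Variational dropout: one mask per sample, reused at every timestep."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (batch, seq_len, features)
        if not self.training or self.p == 0:
            return x
        # Sample the mask once per sequence (no time dimension), then broadcast.
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)
```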

variational dropout

Regularization

A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)

Model                      Parameters   PTB test PPL
Non-regularized LSTM       20M          121.1
   + embed dropout         20M          86.5
   + variational dropout   20M          78.6

variational dropout

Regularization

A theoretically grounded application of dropout in recurrent neural networks (Gal et al., 2016)

Model                      Parameters   PTB test PPL
Non-regularized LSTM       66M          127.4
   + embed dropout         66M          86.0
   + variational dropout   66M          73.4

alters LSTM internals

good results

Weight-dropped LSTM

Regularization

Regularizing and Optimizing LSTM Language Models (Merity et al., 2017)

[Diagram: dropout is applied to the LSTM recurrent weight matrices themselves; the hidden states h_{1,1} ... h_{3,2} are then computed as usual]

drop LSTM weights, then run as usual

good results

no LSTM changes
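
A toy illustration of the idea (not the AWD-LSTM implementation): sample one dropout mask for the recurrent weight matrix, then run the ordinary LSTM recurrence with it:

```python
import torch
import torch.nn.functional as F

def weight_drop_lstm(x, w_ih, w_hh, b, p=0.5, training=True):
    """x: (seq_len, batch, input); w_ih: (4h, input); w_hh: (4h, h); b: (4h,)."""
    # Drop elements of the recurrent matrix once, reuse it for every timestep.
    w_hh = F.dropout(w_hh, p=p, training=training)
    h = x.new_zeros(x.size(1), w_hh.size(1))
    c = torch.zeros_like(h)
    outputs = []
    for x_t in x:  # then run as usual
        gates = x_t @ w_ih.t() + h @ w_hh.t() + b
        i, f, g, o = gates.chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        outputs.append(h)
    return torch.stack(outputs), (h, c)
```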

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

[Diagram: the input word ("cows") enters through the input embedding matrix W \in \mathbb{R}^{30000 \times 250}; the hidden states h_{1,1}, h_{1,2} are projected back onto the vocabulary through the output embedding matrix V \in \mathbb{R}^{30000 \times 250}, giving P("mooing") = 0.002, P("drink") = 0.005, ...]

Use single embedding matrix
for both input and output!

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

One-hot vector for the input word ("cows"):
c = [0 \ 0 \ 0 \ ... \ 0 \ 1 \ 0 \ 0], \quad c \in \mathbb{R}^N

Input embedding matrix: W \in \mathbb{R}^{N \times d} \quad (\text{here } W \in \mathbb{R}^{30000 \times 250})

x_i = W^\top c, \quad x_i \in \mathbb{R}^{d}

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

Output embedding matrix: V \in \mathbb{R}^{N \times d} \quad (\text{here } V \in \mathbb{R}^{30000 \times 250})

y_i^{\prime} = V h_i, \quad h_i \in \mathbb{R}^{d}, \quad y_i^{\prime} \in \mathbb{R}^N

p_i = \text{softmax}(y_i^{\prime}), \quad p_i \in \mathbb{R}^N

(e.g. P("mooing") = 0.002, P("drink") = 0.005, ...)

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

W \in \mathbb{R}^{N \times d}, \quad V \in \mathbb{R}^{N \times d}

Make W = V!
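
In code, tying is just sharing one parameter for W and V; a sketch (requires the embedding size to equal the LSTM output size, as in the 30000 × 250 example above):

```python
import torch.nn as nn

class TiedLSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=30_000, dim=250, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)              # W: N x d
        self.lstm = nn.LSTM(dim, dim, num_layers, batch_first=True)
        self.decoder = nn.Linear(dim, vocab_size, bias=False)   # V: N x d
        self.decoder.weight = self.embed.weight                 # make W = V

    def forward(self, tokens, hidden=None):
        out, hidden = self.lstm(self.embed(tokens), hidden)
        return self.decoder(out), hidden
```

Tying removes one N × d matrix, which is why the parameter count drops (66M → 51M on the next slide).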

Using the Output Embedding to Improve Language Models (Press and Wolf, 2016)

weight tying

Regularization

Model                      Parameters   PTB test PPL
Non-regularized LSTM       66M          127.4
   + weight tying          51M          74.3
   + variational dropout   51M          73.2

output dropout

Regularization

On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)

[Diagram: dropout applied to the layer outputs h_{1,2} and h_{2,2}]

intra-layer dropout

Regularization

[Diagram: dropout applied between layers, on the connections from h_{1,1}, h_{2,1} into h_{1,2}, h_{2,2}]

On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)

Everything combined

Regularization

On the State of the Art of Evaluation in Neural Language Models (Melis et al., 2017)

Model                              Parameters           PTB test PPL
Non-regularized LSTM               66M                  127.4
   + embed dropout                 66M                  86.0
   + variational dropout           66M                  73.4
   + weight tying + all dropouts   24M (4-layer LSTM)   58.3
   + weight tying + all dropouts   10M (1-layer LSTM)   59.6

Softmax

Softmax bottleneck

[Diagram: a single hidden vector h is mapped to one softmax over the vocabulary: P("mooing") = 0.002, P("drink") = 0.005, ...]

Limited expressivity!

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

Softmax bottleneck

[Diagram: one hidden vector h, one softmax: P("mooing") = 0.002, P("drink") = 0.005, ...]

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

[Diagram: several hidden vectors h_1, h_2, h_3, each with its own softmax, e.g. P("mooing") = 0.099, P("drink") = 0.0002 under h_1 and P("mooing") = 0.003, P("drink") = 0.001 under h_2 and h_3]

Softmax Bottleneck

P_{\theta}(x|c) = \frac{\exp(h_c^\top w_x)}{\sum_{x'}\exp(h_c^\top w_{x'})}

H_\theta \in \mathbb{R}^{N \times d}, \quad W_\theta \in \mathbb{R}^{M \times d}, \quad A \in \mathbb{R}^{N \times M}

H_\theta W_\theta^\top = A^\prime

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

rank(A^\prime) = rank(H_\theta W_\theta^\top) is limited to d
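
A quick numerical check of that claim (numpy, toy sizes): however H and W are chosen, the rank of their product cannot exceed d.

```python
import numpy as np

N, M, d = 1000, 500, 32          # contexts, vocabulary size, hidden size (toy numbers)
H = np.random.randn(N, d)        # context vectors
W = np.random.randn(M, d)        # output word embeddings
A_prime = H @ W.T                # the logit matrix the model can express

print(np.linalg.matrix_rank(A_prime))   # <= 32, far below min(N, M)
```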

Softmax Bottleneck

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

How to increase rank of A?

Compute many softmaxes and mix them!

Softmax Bottleneck

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

But... how do we get many softmaxes?

Make projections!

h \in \mathbb{R}^{d} — the model's hidden vector

W_k \in \mathbb{R}^{Kd \times d} — added parameters

\tanh(W_k h) — makes K hidden vectors

Softmax Bottleneck

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

How to mix?

P_{\theta}(x|c) = \sum_{k=1}^K \pi_{c,k} \frac{\exp(h_{c,k}^\top w_x)}{\sum_{x'}\exp(h_{c,k}^\top w_{x'})}

\sum_{k=1}^K \pi_{c,k} = 1

\pi_{c,k} — learned mixture weights (a weighted average of K softmaxes)
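
A sketch of the resulting mixture-of-softmaxes head (my own minimal version, not the authors' code): project h into K vectors, compute K softmaxes over the vocabulary, and mix them with weights pi that are themselves predicted from h:

```python
import torch
import torch.nn as nn

class MixtureOfSoftmaxes(nn.Module):
    def __init__(self, d, vocab_size, K=5):
        super().__init__()
        self.K = K
        self.projection = nn.Linear(d, K * d)     # W_k: makes K hidden vectors
        self.prior = nn.Linear(d, K)              # produces the mixture weights pi
        self.decoder = nn.Linear(d, vocab_size)   # shared output embeddings w_x

    def forward(self, h):                         # h: (batch, d)
        batch, d = h.shape
        h_k = torch.tanh(self.projection(h)).view(batch, self.K, d)
        softmaxes = torch.softmax(self.decoder(h_k), dim=-1)     # (batch, K, vocab)
        pi = torch.softmax(self.prior(h), dim=-1).unsqueeze(-1)  # (batch, K, 1)
        return (pi * softmaxes).sum(dim=1)        # mixed probabilities, (batch, vocab)
```

Because the mixing happens after the K softmaxes, the resulting log-probability matrix is no longer constrained to rank d.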

Mixture of softmaxes

Model                     Parameters   PTB test PPL
AWD-LSTM                  24M          57.7
   + mixture of softmax   22M          54.44

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017)

State-of-the-art as of 2017-11-26

Can we beat SOTA?

adaptive models

Dynamic evaluation

Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)

Adapt model parameters to parts of the sequence during evaluation.

[Diagram: the evaluation text is split into segments s_1 = "Thousands of far-right nationalists", s_2 = "gathered in Poland's capital", s_3 = "Warsaw for "Independence March"". Segment s_1 is scored with parameters \theta_1; a gradient step \nabla L(s_1) updates them to \theta_2, which scores s_2; \nabla L(s_2) gives \theta_3 for s_3, and so on.]
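
A hedged sketch of the procedure, assuming a model with the (logits, hidden) interface of the earlier LSTM sketch; here a plain SGD step per segment, whereas the paper's actual update rule is more elaborate:

```python
import torch

def dynamic_eval(model, segments, loss_fn, lr=1e-4):
    """Score each segment, then take a gradient step before scoring the next one."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in segments:              # s_1, s_2, s_3, ...
        logits, _ = model(inputs)
        loss = loss_fn(logits.flatten(0, 1), targets.flatten())
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()
        optimizer.zero_grad()
        loss.backward()                           # adapt theta_t -> theta_{t+1}
        optimizer.step()
    return torch.exp(torch.tensor(total_loss / total_tokens))   # test perplexity
```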

Dynamic evaluation

Adapt model parameters to parts of the sequence during evaluation.

Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)

Model                     Parameters   PTB test PPL
AWD-LSTM                  24M          57.7
   + dynamic eval         24M          51.1
   + mixture of softmax   24M          47.69

neural cache

Improving Neural Language Models with a Continuous Cache (Grave et al., 2016)

Store hidden vectors together with the words that followed them.

Predict from the cache by comparing the current hidden vector to the cached hidden states.

The final prediction is a linear combination of the cache prediction and the "normal" model output.
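
A sketch of that read-out (simplified: dot-product similarity with temperature theta and a fixed interpolation weight lambda; both names are illustrative):

```python
import torch

def cache_predict(h, cached_h, cached_next_ids, p_model, vocab_size,
                  theta=0.3, lam=0.1):
    """Blend the model's softmax with a cache built from past hidden states.

    h:               (d,)   current hidden vector
    cached_h:        (T, d) stored hidden vectors
    cached_next_ids: (T,)   the word that actually followed each stored state
    p_model:         (vocab_size,) the "normal" model output
    """
    # Similarity of the current state to every cached state.
    sims = torch.softmax(theta * (cached_h @ h), dim=0)             # (T,)
    # Spread that mass onto the words that followed the cached states.
    p_cache = torch.zeros(vocab_size).index_add_(0, cached_next_ids, sims)
    # Final prediction: linear combination of cache and model distributions.
    return lam * p_cache + (1 - lam) * p_model
```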

out of scope:

* Combine n-gram and neural LMs

* Large vocabulary problem:

   - efficient softmax approximations

   - subword models (characters, BPE, syllables)

* Model compression

    - weight pruning

    - word embedding compression

* More adaptive models

Questions?

Recent Advances in Language Modelling

By Oleksiy Syvokon
