Regularizing RNN LMs

Zoneout (Krueger et al. 2016) stochastically forces some of the recurrent units in h to maintain their previous values. Imagine a faulty update mechanism, where δ is the update and m the dropout mask:

$h_t = h_{t-1} + m \odot \delta_t$
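A minimal NumPy sketch of this update (the mask draw, the zoneout probability, and the toy dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def zoneout_step(h_prev, h_candidate, zoneout_prob=0.15):
    """One zoneout update: each unit keeps its previous value with
    probability zoneout_prob, otherwise takes the full update."""
    delta = h_candidate - h_prev                       # the proposed update δ_t
    m = (rng.random(h_prev.shape) > zoneout_prob).astype(h_prev.dtype)
    return h_prev + m * delta                          # h_t = h_{t-1} + m ⊙ δ_t

h_prev = np.zeros(4)
h_cand = np.ones(4)
h_next = zoneout_step(h_prev, h_cand)
# each unit is now either 0.0 (zoned out) or 1.0 (fully updated)
```

At test time, as with dropout, the mask is replaced by its expected value so the update is deterministic.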

Progression of language models

$h_t, y_t = \text{RNN}(x_t, h_{t-1})$
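A sketch of one such RNN LM step in NumPy (the tanh cell and the toy dimensions are illustrative; any recurrent cell fits this interface):

```python
import numpy as np

def rnn_lm_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    """One vanilla-RNN LM step: h_t, y_t = RNN(x_t, h_{t-1}).
    y_t is a softmax distribution over the next word."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)       # new hidden state
    logits = Why @ h_t + by
    y_t = np.exp(logits - logits.max())                # stable softmax
    return h_t, y_t / y_t.sum()

# toy dimensions (hypothetical): 5-dim input, 8 hidden units, 12-word vocab
rng = np.random.default_rng(1)
h_t, y_t = rnn_lm_step(rng.standard_normal(5), np.zeros(8),
                       rng.standard_normal((8, 5)),
                       rng.standard_normal((8, 8)),
                       rng.standard_normal((12, 8)),
                       np.zeros(8), np.zeros(12))
```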

Natural sequence ⇒ probability

$p(S) = p(w_1, w_2, w_3, \ldots, w_n)$

Break this down into the probability of the next word via the chain rule of probability:

$p(w_n|w_1, w_2, w_3, \ldots, w_{n-1})$
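A toy sketch of this factorization, accumulated in log space for numerical stability (the `next_word_prob` interface and the uniform model are hypothetical):

```python
import math

def sequence_log_prob(sentence, next_word_prob):
    """Chain rule: log p(S) = sum_n log p(w_n | w_1, ..., w_{n-1}).
    next_word_prob(context, word) is any model's conditional probability."""
    logp = 0.0
    for n, w in enumerate(sentence):
        logp += math.log(next_word_prob(sentence[:n], w))
    return logp

# toy model: uniform over a 10-word vocabulary, so log p = 3 * log(0.1)
uniform = lambda context, word: 0.1
lp = sequence_log_prob(["the", "cat", "sat"], uniform)
```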

Issues with RNNs for LM

Pointer Networks (Vinyals et al. 2015)

Convex Hull

Delaunay Triangulation

we train them with BPTT for only 35 timesteps

Issues in standard BPTT

BPTT for 0 timesteps

In the following work, Pointer Sentinel's BPTT uses the 100 past timesteps
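A sketch of how a long token stream gets chopped into truncated-BPTT windows (function name and trimming logic are illustrative; in a real framework the hidden state is carried forward across windows while gradients are detached at each boundary):

```python
def tbptt_windows(tokens, bptt_len=35):
    """Yield (inputs, targets) windows for truncated BPTT.
    Gradients only flow within each bptt_len-step window; the hidden
    state is carried across windows at the forward pass."""
    for start in range(0, len(tokens) - 1, bptt_len):
        seq_len = min(bptt_len, len(tokens) - 1 - start)
        inputs = tokens[start:start + seq_len]
        targets = tokens[start + 1:start + 1 + seq_len]   # next-word targets
        yield inputs, targets

windows = list(tbptt_windows(list(range(100)), bptt_len=35))
# 100 tokens -> windows of length 35, 35, 29
```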

[Chart: a 10+% perplexity drop, shown separately for frequent and rare words]

Progress continues!

Independently, Improving Neural Language Models with a Continuous Cache (Grave, Joulin, Usunier) applies a similar mechanism to Pointer Sentinel's, operating on RNN outputs
(they report results on PTB, WikiText-2, and WikiText-103!)

Tying word vectors helps with rare words and avoids wastefully learning a one-to-one mapping
(major perplexity improvement for essentially all models)
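A sketch of weight tying with toy sizes: the same embedding matrix serves as both the input lookup and (transposed) the output softmax projection, so each word has one parameter vector instead of two:

```python
import numpy as np

# Hypothetical sizes: 1000-word vocab, 64-dim embeddings/hidden state.
vocab, dim = 1000, 64
E = np.random.default_rng(0).standard_normal((vocab, dim)) * 0.1

def embed(word_id):
    return E[word_id]                  # input: row lookup into E

def output_logits(h_t):
    return E @ h_t                     # output: the SAME matrix, reused as E^T

logits = output_logits(embed(3))       # one logit per vocabulary word
```

Note this requires the hidden (or projected) state to share the embedding dimensionality, which is why tied models often add a projection layer.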

Recurrent Highway Networks (Zilly, Srivastava, Koutník, Schmidhuber) and Neural Architecture Search (Zoph, Le) are improving basic RNN cells with new SotA on PTB

By smerity
