Convex Hull
Delaunay Triangulation
BPTT for 0 timesteps
In the following work, the pointer sentinel uses BPTT over the past 100 timesteps
10+% drop
Frequent
Rare
Independently, Improving Neural Language Models with a Continuous Cache (Grave, Joulin, Usunier) applies a mechanism similar to the Pointer Sentinel to RNN outputs
(they report results on PTB, WikiText-2, and WikiText-103!)
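A minimal sketch of how a cache over past RNN outputs could work, assuming dot-product scoring between the current hidden state and stored hidden states, and a fixed linear interpolation with the regular softmax. The names `cache_distribution`, `interpolate`, `theta`, and `lam` are hypothetical and chosen for illustration; this is not the authors' code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cache_distribution(hidden, history, vocab_size, theta=1.0):
    """Distribution over words that appeared in the recent history.

    history is a list of (past_hidden, next_word_id) pairs: the hidden state
    at some past step and the word that followed it. theta (assumed) controls
    how peaked the distribution is.
    """
    if not history:
        raise ValueError("history must contain at least one entry")
    scores = [theta * dot(hidden, past_hidden) for past_hidden, _ in history]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [0.0] * vocab_size
    # A word seen several times accumulates mass from each occurrence.
    for weight, (_, word_id) in zip(weights, history):
        probs[word_id] += weight / total
    return probs


def interpolate(p_vocab, p_cache, lam=0.2):
    """Mix the softmax and cache distributions with a fixed weight lam (assumed)."""
    return [(1.0 - lam) * pv + lam * pc for pv, pc in zip(p_vocab, p_cache)]


# Toy usage: word 2 followed hidden states similar to the current one.
history = [([1.0, 0.0], 2), ([0.0, 1.0], 0), ([1.0, 0.0], 2)]
p_cache = cache_distribution([1.0, 0.0], history, vocab_size=3)
p_mixed = interpolate([1 / 3, 1 / 3, 1 / 3], p_cache, lam=0.5)
```

The main difference from a pointer sentinel in this sketch is that the mixing weight is a fixed hyperparameter rather than a gate predicted by the model, so the cache can be added to an already trained RNN.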
Tying word vectors helps with rare words and avoids wastefully learning a one-to-one mapping
(major perplexity improvement for essentially all models)
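To illustrate tying, here is a toy sketch in which one matrix serves as both the input embedding and the output softmax weights. It assumes the hidden size equals the embedding size; the class `TiedSoftmax` and the helper `untied_parameters` are hypothetical names for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TiedSoftmax:
    """Toy layer that reuses the input embedding matrix as output weights."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # One (vocab_size x dim) matrix shared by input and output.
        self.embedding = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)
        ]

    def embed(self, word_id):
        """Input side: look up the vector for a word."""
        return self.embedding[word_id]

    def predict(self, hidden):
        """Output side: score each word by its embedding's dot product with hidden."""
        logits = [sum(h * e for h, e in zip(hidden, row)) for row in self.embedding]
        return softmax(logits)

    def num_parameters(self):
        return len(self.embedding) * len(self.embedding[0])


def untied_parameters(vocab_size, dim):
    """Parameter count when input and output matrices are learned separately."""
    return 2 * vocab_size * dim


model = TiedSoftmax(vocab_size=10, dim=4)
probs = model.predict(model.embed(3))
```

Because the matrix is shared, a rare word's vector is updated whenever that word is scored at the output, not only when it appears as input, and the vocabulary-sized parameter count is halved in this toy setup.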
Recurrent Highway Network (Zilly, Srivastava, Koutník, Schmidhuber) & Neural Architecture Search with RL (Zoph, Le) are improving basic RNN cells with new SotA on PTB