Revisiting Activation Regularization for Language RNNs

Stephen Merity, Bryan McCann, Richard Socher

1st Workshop on Learning to Generate Natural Language

Motivation: Better baselines

Fast and open baselines are the
building blocks of future work
... yet rarely get any continued love.

Given the speed of progress within our field, and the many places progress arises, revisiting these baselines is important

Baseline numbers are frequently copy + pasted from other papers, rarely being re-run or improved

Revisiting RNN regularization

Standard activation regularization is easy,
even part of the Keras API ("activity_regularizer")

Activation regularization has nearly no impact on train time and is compatible with fast black box RNN implementations (i.e. NVIDIA cuDNN)

... yet they haven't been used recently?

Activation Regularization (AR)

Encourage small activations, penalizing any activations far from zero

For RNNs, simply add an additional loss,

where m is dropout mask and α is a scaler.

We found it's more effective when applied to the dropped output of the final RNN layer

\alpha L_2(m \cdot h_t)

\alpha L_2(m \cdot h_t)

Temporal Activation Regularization (TAR)

Loss penalizes changes between hidden states,

where h is the RNN output and β is a scaler.

Our work only penalizes the hidden output h of the RNN, leaving internal memory c unchanged.

\beta L_2(h_t - h_{t-1})

\beta L_2(h_t - h_{t-1})

Penn Treebank Results

Note: these LSTMs have no recurrent regularization,
all other results here used variational recurrent dropout

WikiText-2 Results

Note: these LSTMs have no recurrent regularization,
all other results here used variational recurrent dropout

Related work

Independently LSTM LM baselines are revisited in Melis et al. (2017) and Merity et al. (2017)

Both, with negligible modifications to the LSTM, achieve strong gains over far more complex models on both Penn Treebank and WikiText-2

Melis et al. 2017 also investigate "hyper-parameter noise" / experimental setup of recent SotA models

Summary

Baselines should not be set in stone
As we improve the underlying fundamentals of our field, we should revisit them and improve them

Fast, open, and easy to extend baselines should be first class citizens for our field

We'll be releasing our PyTorch codebase for this and follow-up SotA (PTB/WT-2) numbers soon

Interested in more details?

Contact me online at @smerity / smerity.com

Check out
Revisiting Activation Regularization for Language RNNs

and our follow up work
Regularizing and Optimizing LSTM Language Models

Revisiting Activation Regularization for Language RNNs

Stephen Merity, Bryan McCann, Richard Socher

1st Workshop on Learning to Generate Natural Language

Motivation: Better baselines

Fast and open baselines are the building blocks of future work ​... yet rarely get any continued love.

Given the speed of progress within our field, and the many places progress arises, revisiting these baselines is important

Baseline numbers are frequently copy + pasted from other papers, rarely being re-run or improved

Revisiting RNN regularization

Standard activation regularization is easy, even part of the Keras API ("activity_regularizer")

Activation regularization has nearly no impact on train time and is compatible with fast black box RNN implementations (i.e. NVIDIA cuDNN)

... yet they haven't been used recently?

Activation Regularization (AR)

Encourage small activations, penalizing any activations far from zero

For RNNs, simply add an additional loss,

where m is dropout mask and α is a scaler.

We found it's more effective when applied to the dropped output of the final RNN layer

Temporal Activation Regularization (TAR)

Loss penalizes changes between hidden states, where h is the RNN output and β is a scaler.

Our work only penalizes the hidden output h of the RNN, leaving internal memory c unchanged.

Penn Treebank Results

Note: these LSTMs have no recurrent regularization, all other results here used variational recurrent dropout

WikiText-2 Results

Note: these LSTMs have no recurrent regularization, all other results here used variational recurrent dropout

Related work

Independently LSTM LM baselines are revisited in Melis et al. (2017) and Merity et al. (2017)

Both, with negligible modifications to the LSTM, achieve strong gains over far more complex models on both Penn Treebank and WikiText-2

Melis et al. 2017 also investigate "hyper-parameter noise" / experimental setup of recent SotA models

Summary

Baselines should not be set in stone As we improve the underlying fundamentals of our field, we should revisit them and improve them

Fast, open, and easy to extend baselines should be first class citizens for our field We'll be releasing our PyTorch codebase for this and follow-up SotA (PTB/WT-2) numbers soon

Interested in more details?

Contact me online at @smerity / smerity.com

Check out Revisiting Activation Regularization for Language RNNs

and our follow up work Regularizing and Optimizing LSTM Language Models

Fast and open baselines are the
building blocks of future work
... yet rarely get any continued love.

Standard activation regularization is easy,
even part of the Keras API ("activity_regularizer")

Loss penalizes changes between hidden states,

where h is the RNN output and β is a scaler.

Note: these LSTMs have no recurrent regularization,
all other results here used variational recurrent dropout

Note: these LSTMs have no recurrent regularization,
all other results here used variational recurrent dropout

Baselines should not be set in stone
As we improve the underlying fundamentals of our field, we should revisit them and improve them

Fast, open, and easy to extend baselines should be first class citizens for our field

We'll be releasing our PyTorch codebase for this and follow-up SotA (PTB/WT-2) numbers soon

Check out
Revisiting Activation Regularization for Language RNNs

and our follow up work
Regularizing and Optimizing LSTM Language Models