Using Transformers to teach Transformers how to train Transformers

Piotr Kozakowski

joint work with Łukasz Kaiser and Afroz Mohiuddin

at Google Brain

Hyperparameter tuning

Hyperparameters in deep learning:

  • learning rate
  • dropout rate
  • moment decay in adaptive optimizers
  • ...

Tuning is important, but hard.

Done manually, it takes a lot of time.

Needs to be re-done for every new architecture and task.

Hyperparameter tuning

Some require scheduling, which takes even more work.

[Figure: staircase learning rate schedule, \eta = 0.1 \rightarrow 0.02 \rightarrow 0.004 \rightarrow 0.0008]

Zagoruyko et al. - Wide Residual Networks, 2016

Automatic methods

Existing methods:

  • grid/random search
  • Bayesian optimization
  • evolutionary algorithms

Problems:

  • not learnable
  • hyperparameters typically fixed during training
  • typically not adaptive

Solution: reinforcement learning

Learn a policy for controlling hyperparameters based on the observed validation metrics.

Can train on a wide range of architectures and tasks.

Zero-shot or few-shot transfer to new architectures and tasks.

Long-term goal: a general system that can automatically tune any model.

Open-source the policies, so all ML practitioners can use them.

Tuning as an RL problem

[Diagram: the agent-environment interaction loop]

  • observations: validation metrics
  • actions: hyperparameter changes (discrete)
  • rewards: changes in a chosen metric
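
A toy, runnable sketch of this interaction loop (not the actual setup from the talk): the "model" is a single scalar fit by gradient descent on a quadratic, and the only tuned hyperparameter is the learning rate, adjusted by discrete multiplicative actions.

import numpy as np

# Toy sketch of hyperparameter tuning as an RL environment (gym-style interface).
# The "model" is a scalar fit to a quadratic loss by gradient descent; the only
# hyperparameter is the learning rate, changed by discrete actions. Illustrative only.
class ToyTuningEnv:
    ACTIONS = (0.5, 1.0, 2.0)                     # multiply the learning rate by one of these

    def reset(self):
        self.w, self.lr, self.t = 5.0, 0.1, 0
        self.prev_metric = self._metric()
        return self._observe()

    def _metric(self):
        return -self.w ** 2                       # "validation metric": higher is better

    def _observe(self):
        return np.array([self._metric(), self.lr])

    def step(self, action):
        self.lr *= self.ACTIONS[action]           # action: discrete hyperparameter change
        for _ in range(10):                       # one training interval
            self.w -= self.lr * 2 * self.w        # gradient step on loss = w**2
        metric = self._metric()
        reward = metric - self.prev_metric        # reward: change in the chosen metric
        self.prev_metric = metric
        self.t += 1
        return self._observe(), reward, self.t >= 20, {}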

Tuning as an RL problem

Partially observable: observing all parameter values is intractable.

 

Nondeterministic: random weight initialization and dataset permutation.

Tasks

Language modeling:

  • Transformer on LM1B
  • Transformer on Penn Treebank 

Translation:

  • Transformer on WMT EN -> DE

Image classification:

  • Wide ResNet on CIFAR-10

[Figure: example inputs and outputs for the three tasks: next-word prediction "I'm going to" -> eat / school / France; translation "it's windy today" -> "heute ist es windig"; an image classified as "frog"]

Transformer language model

Vaswani et al. - Attention Is All You Need, 2017

Tuned hyperparameters

For Transformers:

  • learning rate
  • weight decay rate
  • dropouts, separately for each layer                              

For Wide ResNet:

  • learning rate
  • weight decay rate
  • momentum mass in SGD                                               

Model-free approach: PPO

PPO: Proximal Policy Optimization

  • more sample-efficient than REINFORCE
  • stable
  • widely used

Use the Transformer language model without input embedding as a policy.
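
For reference, PPO maximizes the standard clipped surrogate objective from the Schulman et al. paper (nothing specific to this setup), with \hat{A}_t an advantage estimate:

L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]
r_t(\theta) = \pi_\theta(a_t | o_t) / \pi_{\theta_{\mathrm{old}}}(a_t | o_t)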

Schulman et al. - Proximal Policy Optimization Algorithms, 2017

Model-free approach: PPO

Experiment setup:

  • optimizing for accuracy on a holdout set
  • 20 PPO epochs
  • 128 parallel model trainings in each epoch
  • ~3h per model training, ~60h for the entire experiment

Model-free approach: PPO

Model-based approach: SimPLe

SimPLe: Simulated Policy Learning

Elements:

  • policy: \pi : O^* \rightarrow P(A)
  • world model: \epsilon : O^* \times A \rightarrow P(O)

Train the world model on data collected in the environment.

Train the policy using PPO in the environment simulated by the world model.

Much more sample-efficient than model-free PPO.
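
A minimal sketch of this alternation; all names below are placeholders (not the authors' code), the heavy steps are stubbed out, and the defaults mirror the experiment setup shown later (10 SimPLe epochs, 50 simulated PPO epochs, 128 real trainings per gathering phase).

def collect_real_trajectories(policy, n_parallel=128):
    """Run n_parallel real model trainings controlled by the policy; return trajectories."""
    return []                                                # stub

def fit_world_model(world_model, data):
    """Supervised training: predict the next observation from past observations and actions."""
    return world_model                                       # stub

def train_policy_in_simulation(policy, world_model, ppo_epochs=50):
    """Run PPO against the learned world model instead of real model trainings."""
    return policy                                            # stub

def simple_loop(policy, world_model, n_epochs=10):
    data = []
    for _ in range(n_epochs):
        data += collect_real_trajectories(policy)            # data gathering in the real environment
        world_model = fit_world_model(world_model, data)     # world model training
        policy = train_policy_in_simulation(policy, world_model)  # policy training in simulation
    return policy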

Kaiser et al. - Model-Based Reinforcement Learning for Atari, 2019

Model-based approach: SimPLe

Time series forecasting

The metric curves are stochastic.

Predict the next point in the sequence:

 

 

Common approach: use a parametric distribution,

e.g. Gaussian:

P(x_1, \dots, x_n) = \prod_{i=1}^n P(x_i | x_1, \dots, x_{i-1})
P(x_i | x_1, \dots, x_{i-1}) = \mathcal{N}(f(x_1, \dots, x_{i-1}), \sigma^2)

Time series forecasting

Our approach: discretize to a fixed-point representation

and predict consecutive digits (symbols).

This way we can model any distribution within a set precision.
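
A minimal sketch of the encoding, assuming metrics scaled to [0, 1) and a precision of three decimal digits (both choices are illustrative):

# Encode a metric value as a short sequence of base-10 digit symbols and decode
# digit sequences back into numbers. Precision and value range are assumptions.
def to_digits(x, n_digits=3, low=0.0, high=1.0):
    """Encode x in [low, high) as n_digits fixed-point decimal digits."""
    frac = min(max((x - low) / (high - low), 0.0), 1.0 - 1e-9)
    digits = []
    for _ in range(n_digits):
        frac *= 10
        d = int(frac)
        digits.append(d)
        frac -= d
    return digits

def from_digits(digits, low=0.0, high=1.0):
    """Decode digit symbols back into a number, up to the set precision."""
    frac = sum(d * 10.0 ** -(i + 1) for i, d in enumerate(digits))
    return low + frac * (high - low)

print(to_digits(0.734))        # [7, 3, 4]
print(from_digits([7, 3, 4]))  # approximately 0.734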

Experiment on synthetic data

[Figure: predictions on a synthetic dataset, comparing the discretization approach with a Gaussian output distribution]

Transformer as a world model

Modeled sequence: o_1 a_1 o_2 a_2 \dots o_n

Input: both observations and actions.

Predict only observations.

Calculate rewards based on the last two observations.
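
A small sketch of how one trajectory could be flattened into that sequence, assuming observations and actions are already encoded as token IDs (the IDs below are made up):

import numpy as np

def interleave(observations, actions):
    """Build the token sequence o_1 a_1 o_2 a_2 ... o_n plus a loss mask over observations."""
    tokens, is_observation = [], []
    for i, obs in enumerate(observations):
        tokens.append(obs)
        is_observation.append(True)               # observation tokens are predicted and scored
        if i < len(actions):
            tokens.append(actions[i])
            is_observation.append(False)          # action tokens are inputs only, never predicted
    return np.array(tokens), np.array(is_observation)

tokens, mask = interleave(observations=[11, 12, 13], actions=[0, 1])
# tokens = [11, 0, 12, 1, 13]; the training loss is applied only where mask is True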

Transformer as a world model

Transformer as a world model

Inference speed: < 1 minute to sample 128 episodes.

In comparison, > 1 hour to train one real architecture.

The world model is at least 128 * 60 = 7680 times faster!

Transformer as a policy

Share the architecture with the world model.

Input same as for the world model.

Output: action distribution, value estimate.

 

Initialize the policy from the parameters of a trained world model.

This leads to much faster learning.
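
A rough sketch of the two extra heads on top of the shared trunk; the names, weight shapes, and dimensions are illustrative stand-ins, not the actual implementation:

import numpy as np

def policy_outputs(hidden, w_action, w_value):
    """Read out an action distribution and a value estimate from the trunk's last activation."""
    logits = hidden @ w_action                    # one logit per discrete action
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # action distribution (softmax)
    value = float(hidden @ w_value)               # scalar value estimate
    return probs, value

rng = np.random.default_rng(0)
hidden = rng.normal(size=16)                      # stand-in for the trunk output, d_model = 16
probs, value = policy_outputs(hidden, rng.normal(size=(16, 4)), rng.normal(size=16))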

SimPLe results

Experiment setup:

  • optimizing for accuracy on a holdout set
  • starting from a dataset of  4 * 20 * 128 = 10240 trajectories collected by PPO
  • 10 SimPLe epochs, 50 simulated PPO epochs each
  • 128 parallel model trainings in each data gathering phase
  • ~3h for data gathering, ~1h for world model training, ~2h for policy training, ~60h for the entire experiment

SimPLe results

SimPLe vs PPO vs human

Final test accuracies:

task            SimPLe   PPO      human
LM1B            35.9%    30.2%    35%
WMT EN -> DE    59.9%    49.5%    60%
Penn Treebank   23.4%    19.2%    23.2%
CIFAR-10        91.6%    91.2%    90%

Learned schedules - LM1B

Summary

Using world models allows faster training of better policies.

 

One of the first successful practical applications of model-based RL.

 

Amount of data needed currently: ~11K model trainings.

Comparable to the first work on Neural Architecture Search.

Zoph et al. - Neural Architecture Search with Reinforcement Learning, 2016

Future work

Transfer: training general policies or world models to enable wide use.

 

Planning using the model (model predictive control).

 

Evaluation in settings that are notoriously unstable (unsupervised/reinforcement learning); adaptive tuning should help.

Speaker: Piotr Kozakowski (p.kozakowski@mimuw.edu.pl)

Paper under review for ICLR 2020:

Forecasting Deep Learning Dynamics with Applications to Hyperparameter Tuning

References:

Zagoruyko et al. - Wide Residual Networks, 2016

Vaswani et al. - Attention Is All You Need, 2017

Schulman et al. - Proximal Policy Optimization Algorithms, 2017

Kaiser et al. - Model-Based Reinforcement Learning for Atari, 2019

Zoph et al. - Neural Architecture Search with Reinforcement Learning, 2016
