Using Transformers to teach Transformers how to train Transformers

Piotr Kozakowski

joint work with Łukasz Kaiser and Afroz Mohiuddin

at Google Brain

Hyperparameter tuning

Hyperparameters in deep learning:

learning rate
dropout rate
moment decay in adaptive optimizers
...

Tuning is important, but hard.

Done manually, takes a lot of time.

Needs to be re-done for every new architecture and task.

Hyperparameter tuning

Some require scheduling, which takes even more work.

\eta = 0.1

\eta = 0.02

\eta = 0.004

\eta = 0.0008

Zagoruyko et al. - Wide Residual Networks, 2016

Automatic methods

Existing methods:

grid/random search
Bayesian optimization
evolutionary algorithms

Problem: non-parametric - can't transfer knowledge to new architectures/tasks.

Problem: hyperparameters typically fixed throughout training, or scheduled using parametric curves.

Possible benefit from adapting based on current metrics.

Solution: RL

Learn a policy for controlling hyperparameters based on the observed validation metrics.

Can train on a wide range of architectures and tasks.

Zero-shot or few-shot transfer to new architectures and tasks.

Long-term goal: a general system that can automatically tune anything.

Open-source the policies, so all ML practitioners can use them.

Adaptive tuning as a POMDP

Transition: a fixed number of training steps, followed by evaluation.

Observation :

current metrics, e.g. training/validation accuracy/loss
current hyperparameter values

Action : discrete; for every hyperparameter - increase/decrease by a fixed %, or keep the same.

Reward : the increase of a chosen metric since the last environment step.

Adaptive tuning as a POMDP

Partially observable: observing all parameter values is intractable.

Nondeterministic: random weight initialization and dataset permutation.

Tasks

Transformer LM on LM1B language modelling dataset (1B words)
Transformer LM on Penn Treebank language modelling dataset (3M words)
Transformer LM on WMT EN -> DE translation dataset, framed as primed language modeling
Wide ResNet on CIFAR-10 image classification dataset

Tuned hyperparameters

For Transformers:

learning rate
weight decay rate
dropouts, separately for each layer

For Wide ResNet:

learning rate
weight decay rate
momentum mass in SGD

Model-free approach: PPO

PPO: Proximal Policy Optimization

stable
more sample-efficient than vanilla policy gradient
widely used

Use the Transformer language model without input embedding as a policy.

Schulman et al. - Proximal Policy Optimization Algorithms, 2017

Model-free approach: PPO

Experiment setup:

optimizing for accuracy on a holdout set
20 PPO epochs
128 parallel model trainings in each epoch
4 repetitions of each experiment
~3h per model training, ~60h for the entire experiment

Model-free approach: PPO

some nice plots here

Model-based approach: SimPLe

SimPLe: Simulated Policy Learning

Elements:

policy
"world model"

Train the world model on data collected in the environment.

Train the policy using PPO in the environment simulated by the world model.

Much more sample-efficient than model-free PPO.

\pi : O^* \rightarrow P(A)

\epsilon : O^* \times A \rightarrow P(O)

Kaiser et al. - Model-Based Reinforcement Learning for Atari, 2019

Model-based approach: SimPLe

Transformer language model

Vaswani et al. - Attention Is All You Need, 2017

Time series forecasting

The metric curves are stochastic.

Autoregressive factorization:

Typically, need to assume a distribution for

(e.g. Gaussian, mixture of Gaussians)

P(x_1, ..., x_n) = {\prod_{i=1}^n} P(x_i | x_1, ..., x_{i - 1})

P(x_i | x_1, ..., x_{i - 1})

Time series forecasting

Using fixed-precision encoding, we can model any distribution within a set precision, using symbols per number, with symbols in the alphabet.

Loss: cross-entropy weighted by symbol significance .

x = \sum_{i = 1}^n a_i N^{i - 1}, a_i \in \{0, ..., N - 1\}

L(t, p) = \sum_{i = 1}^n \beta^{i - 1} H(t_i, p_i), \beta \in (0, 1)

Time series forecasting

Example: 2 numbers, base-8 encoding using 2 symbols.

Representable range: .

Precision: .

[0, 2)

2 \cdot 8^{-2} = \frac{1}{32}

Time series forecasting

Experiment on synthetic data.

Data designed to mimic accuracy curves, converging to 1 at varying rates .

Parameter estimated back from generated curves:

x_i = 1 - \frac{1}{1 + \frac{i}{\alpha}} + \mathcal{N}(0, \sigma^2), \alpha \sim \mathcal{U}(0.5, 5)

\hat{\alpha} = \frac{1}{N} \sum_{i = 1}^N i(1 - \frac{1}{x_i})

\alpha

Time series forecasting

Transformer as a world model

Modelled sequence:

Input: both observations and actions.

Predict only observations.

Rewards calculated based on the two last observations.

o_1 a_1 o_2 a_2 \dots o_n

Transformer as a world model

Transformer as a policy

Share the architecture with the world model (also input embedding).

Input same as for the world model.

Output: action distribution, value estimate.

Action distribution independent with respect to each hyperparameter.

Transformer as a policy

Preinitialize from world model parameters.

This empirically works much better.

Intuition:

symbol embeddings lose the inductive bias of continuousness
attention masks need a lot of signal to converge
reward signal noisy because of credit assignment

SimPLe results

Experiment setup:

optimizing for accuracy on a holdout set
starting from a dataset of 4 * 20 * 128 = 10240 trajectories collected in PPO experiments
10 SimPLe epochs
50 simulated PPO epochs in each SimPLe epoch
128 parallel model trainings in each data gathering phase
4 repetitions of each experiment
~3h for data gathering, ~1h for world model training, ~2h for policy training, ~60h for the entire experiment

SimPLe results

some nice plots here

SimPLe vs PPO vs human

task	SimPLe	PPO	human
LM1B			0.35
WMT EN -> DE			0.595
Penn Treebank			0.168
CIFAR-10			0.933

Final accuracies:

Summary and future work

Amount of data needed for now: ~11K model trainings.

Comparable to the first work in Neural Architecture Search.

Not practically applicable yet.

Future work:

Policy or world model transfer across tasks to enable practical application.
Evaluation in settings that are notoriously unstable (unsupervised/reinforcement learning); adaptive tuning should help.

Zoph et al. - Neural Architecture Search with Reinforcement Learning, 2016

Speaker: Piotr Kozakowski (p.kozakowski@mimuw.edu.pl)

Paper under review for ICLR 2020:

Forecasting Deep Learning Dynamics with Applications to Hyperparameter Tuning

References:

Zagoruyko et al. - Wide Residual Networks, 2016

Schulman et al. - Proximal Policy Optimization Algorithms, 2017

Kaiser et al. - Model-Based Reinforcement Learning for Atari, 2019

Vaswani et al. - Attention Is All You Need, 2017

Zoph et al. - Neural Architecture Search with Reinforcement Learning, 2016

Forecasting Deep Learning Dynamics for Hyperparameter Tuning

By Piotr Kozakowski

Forecasting Deep Learning Dynamics for Hyperparameter Tuning

Using Transformers to teach Transformers how to train Transformers

Hyperparameter tuning

Hyperparameter tuning

Automatic methods

Solution: RL

Adaptive tuning as a POMDP

Adaptive tuning as a POMDP

Tasks

Tuned hyperparameters

Model-free approach: PPO

Model-free approach: PPO

Model-free approach: PPO

Model-based approach: SimPLe

Model-based approach: SimPLe

Transformer language model

Time series forecasting

Time series forecasting

Time series forecasting

Time series forecasting

Time series forecasting

Transformer as a world model

Transformer as a world model

Transformer as a policy

Transformer as a policy

SimPLe results

SimPLe results

SimPLe vs PPO vs human

Summary and future work

Forecasting Deep Learning Dynamics for Hyperparameter Tuning

More from Piotr Kozakowski