Piotr Kozakowski
joint work with Łukasz Kaiser and Afroz Mohiuddin
at Google Brain
Hyperparameters in deep learning:
Tuning is important, but hard.
Done manually, takes a lot of time.
Needs to be re-done for every new architecture and task.
Some require scheduling, which takes even more work.
Zagoruyko et al. - Wide Residual Networks, 2016
Existing methods:
Problem: non-parametric - can't transfer knowledge to new architectures/tasks.
Problem: hyperparameters typically fixed throughout training, or scheduled using parametric curves.
Possible benefit from adapting them based on the current metrics.
Learn a policy for controlling hyperparameters based on the observed validation metrics.
Can train on a wide range of architectures and tasks.
Zero-shot or few-shot transfer to new architectures and tasks.
Long-term goal: a general system that can automatically tune anything.
Open-source the policies, so all ML practitioners can use them.
Transition: a fixed number of training steps, followed by evaluation.
Observation: the current values of the validation metrics.
Action: discrete; for every hyperparameter - increase/decrease by a fixed %, or keep the same.
Reward: the increase of a chosen metric since the last environment step.
Partially observable: observing all of the model's parameter values is intractable.
Nondeterministic: random weight initialization and dataset permutation.
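A minimal sketch of such an environment is shown below; the `trainer` interface and the specific ±20% multiplicative factors are illustrative assumptions, not the exact setup from the work.

```python
import numpy as np

class HyperparameterTuningEnv:
    """Sketch of the tuning environment described above.

    One environment step = train for a fixed number of steps, then evaluate.
    """

    # Per hyperparameter: multiply by a fixed factor, keep, or divide (assumed factor).
    ACTION_FACTORS = [1.2, 1.0, 1.0 / 1.2]

    def __init__(self, trainer, steps_per_action=100):
        self.trainer = trainer              # wraps the model being tuned (assumed interface)
        self.steps_per_action = steps_per_action
        self.last_metric = None

    def reset(self):
        # Nondeterminism: random weight initialization and data shuffling.
        self.trainer.reinitialize()
        self.last_metric = self.trainer.evaluate()["accuracy"]
        return self._observation()

    def step(self, action):
        # `action` holds one discrete choice per controlled hyperparameter.
        for name, choice in action.items():
            self.trainer.hparams[name] *= self.ACTION_FACTORS[choice]
        self.trainer.train(self.steps_per_action)
        metric = self.trainer.evaluate()["accuracy"]
        # Reward: increase of the chosen metric since the last environment step.
        reward = metric - self.last_metric
        self.last_metric = metric
        return self._observation(), reward, self.trainer.done(), {}

    def _observation(self):
        # Partial observability: only validation metrics are exposed,
        # not the model's parameters.
        return np.array([self.last_metric], dtype=np.float32)
```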
For Transformers:
For Wide ResNet:
PPO: Proximal Policy Optimization
Use a Transformer language model (without the input embedding) as the policy.
Schulman et al. - Proximal Policy Optimization Algorithms, 2017
Experiment setup:
SimPLe: Simulated Policy Learning
Elements:
Train the world model on data collected in the environment.
Train the policy using PPO in the environment simulated by the world model.
Much more sample-efficient than model-free PPO.
Kaiser et al. - Model-Based Reinforcement Learning for Atari, 2019
Vaswani et al. - Attention Is All You Need, 2017
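A rough sketch of the SimPLe loop described above; all helper functions and class names here are placeholders for illustration, not the actual implementation.

```python
def simple_loop(real_env, world_model, policy, num_iterations=5):
    """Sketch of Simulated Policy Learning (SimPLe): alternate between
    fitting a world model on real data and training the policy inside it."""
    replay_buffer = []
    for _ in range(num_iterations):
        # 1. Collect data in the real environment with the current policy.
        replay_buffer += collect_trajectories(real_env, policy)
        # 2. Train the world model on the collected (observation, action) sequences.
        train_world_model(world_model, replay_buffer)
        # 3. Train the policy with PPO inside the learned world model,
        #    which is far cheaper than real model trainings.
        simulated_env = SimulatedEnv(world_model)
        train_ppo(policy, simulated_env)
    return policy
```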
The metric curves are stochastic.
Autoregressive factorization: $p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$.
Typically, one needs to assume a distribution for $p(x_t \mid x_{<t})$ (e.g. Gaussian, mixture of Gaussians).
Using fixed-precision encoding, we can model any distribution within a set precision, using $d$ symbols per number, with $b$ symbols in the alphabet.
Loss: cross-entropy weighted by symbol significance.
Example: 2 numbers, base-8 encoding using 2 symbols.
Representable range: $[0, 1)$.
Precision: $8^{-2} = 1/64$.
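As an illustration, a minimal encoder/decoder for this fixed-precision scheme; the exact form of the significance weighting is an assumption here.

```python
def encode(x, base=8, digits=2):
    """Encode a number in [0, 1) as `digits` base-`base` symbols (most significant first)."""
    symbols = []
    for _ in range(digits):
        x *= base
        d = int(x)
        symbols.append(d)
        x -= d
    return symbols

def decode(symbols, base=8):
    """Inverse of `encode`, up to the encoding precision."""
    return sum(d * base ** -(i + 1) for i, d in enumerate(symbols))

# Example from the slide: base-8 encoding with 2 symbols per number.
# Precision is 8**-2 = 1/64.
print(encode(0.35))          # [2, 6]  ->  2/8 + 6/64
print(decode(encode(0.35)))  # 0.34375

# Loss weighting sketch (assumption): weight each symbol's cross-entropy
# term by its significance, e.g. proportional to base**(-position).
weights = [8.0 ** -i for i in range(2)]  # [1.0, 0.125]
```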
Experiment on synthetic data.
Data designed to mimic accuracy curves, converging to 1 at varying rates.
The rate parameter can be estimated back from the curves generated by the trained model.
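A toy generator in this spirit; the exponential form and noise level are illustrative assumptions, not necessarily the exact curves used in the experiment.

```python
import numpy as np

def synthetic_accuracy_curve(rate, num_steps=100, noise=0.01, seed=0):
    """Toy accuracy curve converging to 1 at a given rate, with small noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(num_steps)
    curve = 1.0 - np.exp(-rate * t)
    return np.clip(curve + rng.normal(0.0, noise, size=num_steps), 0.0, 1.0)
```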
Modelled sequence: interleaved observations and actions $(o_1, a_1, o_2, a_2, \dots)$.
Input: both observations and actions.
Predict only observations.
Rewards calculated based on the two last observations.
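A rough sketch of how the modelled sequence could be assembled and the reward computed; the exact token layout is an assumption.

```python
def to_token_sequence(observations, actions, encode):
    """Interleave fixed-precision-encoded observations with discrete action symbols."""
    tokens = []
    for obs, act in zip(observations, actions):
        for metric in obs:
            tokens += encode(metric)   # fixed-precision symbols per metric
        tokens += list(act)            # one discrete symbol per hyperparameter
    return tokens

def reward_from_observations(prev_obs, obs, metric_index=0):
    """Reward from the two last observations: the increase of the chosen metric."""
    return obs[metric_index] - prev_obs[metric_index]
```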
The policy shares its architecture with the world model (including the input embedding).
Input same as for the world model.
Output: action distribution, value estimate.
The action distribution is independent across hyperparameters.
Preinitialize from world model parameters.
This empirically works much better.
Intuition: the world model has already learned to predict training dynamics, so the policy starts from useful representations.
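An illustrative sketch of this shared architecture; PyTorch is used only for illustration, and `world_model.embed` / `world_model.body` are assumed attributes.

```python
import torch
import torch.nn as nn

class PolicyFromWorldModel(nn.Module):
    """Policy that reuses the world model's input embedding and Transformer body,
    adding new heads for the action distribution and the value estimate."""

    def __init__(self, world_model, num_hyperparams, num_choices=3, d_model=512):
        super().__init__()
        # Preinitialize from the world model: reuse its embedding and body.
        self.embed = world_model.embed
        self.body = world_model.body
        # New heads: independent action distribution per hyperparameter, plus a value head.
        self.action_head = nn.Linear(d_model, num_hyperparams * num_choices)
        self.value_head = nn.Linear(d_model, 1)
        self.num_hyperparams = num_hyperparams
        self.num_choices = num_choices

    def forward(self, tokens):
        # Assumes the body returns a (batch, time, d_model) tensor;
        # the last position summarizes the observed history.
        h = self.body(self.embed(tokens))[:, -1]
        logits = self.action_head(h).view(-1, self.num_hyperparams, self.num_choices)
        return torch.distributions.Categorical(logits=logits), self.value_head(h)
```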
Experiment setup:
Final accuracies:

| task | SimPLe | PPO | human |
|---|---|---|---|
| LM1B | 0.35 | | |
| WMT EN -> DE | 0.595 | | |
| Penn Treebank | 0.168 | | |
| CIFAR-10 | 0.933 | | |
Amount of data needed for now: ~11K model trainings.
Comparable to the first work in Neural Architecture Search.
Not practically applicable yet.
Future work:
Zoph et al. - Neural Architecture Search with Reinforcement Learning, 2016
Speaker: Piotr Kozakowski (p.kozakowski@mimuw.edu.pl)
Paper under review for ICLR 2020:
Forecasting Deep Learning Dynamics with Applications to Hyperparameter Tuning
References:
Zagoruyko et al. - Wide Residual Networks, 2016
Schulman et al. - Proximal Policy Optimization Algorithms, 2017
Kaiser et al. - Model-Based Reinforcement Learning for Atari, 2019
Vaswani et al. - Attention Is All You Need, 2017
Zoph et al. - Neural Architecture Search with Reinforcement Learning, 2016