Dataefficient Optimization with Bayesian Optimization
Roberto Calandra
Facebook AI Research
Mediterranean Machine Learning Summer School  13 January 2021
Slides online at: https://slides.com/rcalandra/m2l2020/
Goals of the Talk
 Understand how BO works
 Recognize in which problems it is appropriate to use Bayesian optimization (BO)
 Knowing the strengths and limitations of BO
 Being familiar with several variants of BO
 Being aware of several realworld applications which use BO
 Being sufficiently familiar with the topic to be able to read technical papers on the topic.
 Knowing the relevant BO notation
 Recognize the similarities to the Markov Decision Process formulation and Reinforcement Learning
Why Bayesian Optimization?
Example in Robotics
 Bipedal Robot with an existing finitestate machine controller (8 parameters)
 We want to maximize the distance walked within a given amount of time.
 Repeating an experiment might not yield the exact same outcome.
 Average motor life is ~200 trials.
[Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P. Bayesian Optimization for Learning Gaits under Uncertainty Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 523]
How do we tune the parameters of the controller?
Traditional Optimization Approaches
 Manual tuning (requires expert knowledge)
 Grid Search (does not scale to large parameter spaces)
 Random Search (better than grid search, but still too many evaluations)
 Gradient descent (we can not compute the gradient)
 Evolutionary strategies (requires thousands of evaluations)
 ...
All of these methods were impractical for this problem.
What else can we use?
Bayesian optimization!
Blackbox Optimization
Optimized parameters
Objective function
Parameters to optimize
A Taxonomy of Objective Functions
Single minimum
(e.g., convex functions)
Multiple minimum
(a.k.a., global optimization)
Firstorder
(we can measure gradients)
Zeroorder
(no gradients available)
Noiseless
(repeating the evaluation yield the same result)
Stochastic
(repeating the evaluation yield different results)
Nice and easy to solve
(e.g., with gradient descent)
Cheap Evaluation
(virtually infinite number of evaluations allowed)
Difficult to optimize
Expensive Evaluation
(limited to tens or hundreds of evaluations)
Here we want to use BO!
Examples of Applications
Oil drilling
Design and manufacturing
Drug design
Robotics
Hyperparameters optimization
Hyperparameters Optimization
 Machine learning models are growing more and more complex
 Modern deep learning models have dozens of hyperparameters that need to be tuned (e.g., learning rates, number of layers, batch size...)
 To achieve stateoftheart results, finding good hyperparameters is crucial
 Even for experts finding good hyperparameters can be difficult and time consuming
 How can we automatically optimize them?
Vibrant community dedicated to automated machine learning (AutoML)
How does Bayesian
Optimization works?
Intuition Behind Bayesian Optimization
 Many optimizers capture only local information about the objective function
 Can we instead use all information (i.e., the evaluations) collected so far to make a more informed decision, hence improving dataefficiency?
 How to do this in practice?
e.g.,
We can create a surrogate model
Gradient descent
Bayesian Optimization
 Learn response surface
 Based on the response surface, select next parameters to evaluate
 Evaluate
on the objective function  Repeat until stop criteria
[credit: Marc Deisenroth]
Bayesian Optimization
 Learn response surface
 Based on the response surface, select next parameters to evaluate
 Evaluate
on the objective function  Repeat until stop criteria
[credit: Marc Deisenroth]
Response Surface
Large variety of models used throughout the literature:
 Polynomial functions
 Random forests
 Bayesian neural networks
 Gaussian processes
 ...
By far the most commonly used (at the moment)
Surrogate model (a.k.a. response surface) need to accurately approximate (and generalize) the underlying function based on the available data
Gaussian Processes
Additional reading:
Rasmussen, C. E. & Williams, C. K. I.
Gaussian Processes for Machine Learning
The MIT Press, 2006
 Distribution over functions
 Probabilistic Model
Mean of a GP = Kernel ridge regression
 The posterior predictive distribution for an arbitrary input is computed as:
 Flexible Bayesian regression method
Intuition of Gaussian Processes
Covariance Functions and GP Training
Square exponential
parameters of the GP
(often referred to as hyperparameters)
Multiple ways to optimize the hyperparameters
 MAP estimate (by optimizing marginal likelihood)
 Numerical integration (proper Bayesian way, but often more complicated)
Additional reading:
Rasmussen, C. E. & Williams, C. K. I.
Gaussian Processes for Machine Learning
The MIT Press, 2006
Why Gaussian Processes?
Pro:
 Mathematically wellunderstood
 Calibrated uncertainties
 Possibility of specifying priors (e.g., of the underlying function)
 Easy to enforce Lipschitzian smoothness (by choosing appropriate kernel)
 Good modeling capabilities in lowdata regime
Cons:
 Difficult to scale to highdimensional input space
 Computationally expensive
 Quality of the model dependent from use of appropriate kernel
Bayesian Optimization
 Learn response surface
 Based on the response surface, select next parameters to evaluate
 Evaluate
on the objective function  Repeat until stop criteria
[credit: Marc Deisenroth]
Acquisition Function
 How do we select the next parameters to evaluate?

Intuition: A good acquisition function need to strike a smart balance between exploration and exploitation
 Too much exploration, and we will keep trying parameters that are unlikely to perform well
 Too much exploitation, and we might get stuck in a local minima
 In either extremes, performance will suffer
 The tradeoff between exploration and exploitation traditionally* happens through the mean and variance of the response surface
* But not always
Acquisition Functions

Many acquisition functions in the literature:
 Probability of improvement [Kushner 1964]
 Expected improvement [Mockus 1978]
 Upper confidence bound
 Entropy search
 Predictive entropy search
 ...
 Ensembles of aquisition functions
 No Golden bullet
Optimizing the Acquisition Function
 Optimizing the acquisition function is by itself a challenging optimization problem
 What have we gained by converting the original optimization to this?
 No longer stochastic
 No longer zeroorder
(We can usually compute gradients and Hessian of the acquisition function)  Not expensive to compute
(Does not require realworld evaluations. Although it might potentially be computationally intensive)
 In theory, any global optimizer can be used to optimize the acquisition function
 In practice, often used a global optimizer (e.g., CMAES or DIRECT) followed by a firstorder optimizer (e.g., gradient descent)
Recap
Going back to our robot...
Learning to Walk in 80 Trials
[Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P. Bayesian Optimization for Learning Gaits under Uncertainty Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 523]
Learning Curve  Fox
[Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P. Bayesian Optimization for Learning Gaits under Uncertainty Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 523]
Learned model
Not Symmetrical (about 5° difference). Why?
Because it is walking in a circle!
[Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P. Bayesian Optimization for Learning Gaits under Uncertainty Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 523]
Extentions of Bayesian Optimization
Extensions of Standard BO
Numberless extensions in the literature:
 Constrained optimization (linear and nonlinear constraints)
 Multitask optimization (we can exploit the statistical correlation to previously optimized tasks)
 Robust optimization
 Safe optimization
 Batch optimization (several parameters are evaluated at once)
 Contextual optimization (more about this soon)
 Multiobjective optimization (more about this soon)
 Greybox optimization
 ...
Example of Contextual BO
 In previous example the robot learned to walk on a flat surface
 What if we want to learn to walk over surfaces with different (potentially unseen) inclines?
Contextual Bayesian Optimization
Optimized parameters
Objective function
Parameters to optimize
Context
Contextual Bayesian Optimization
 Very common case
 Striking similarities with RL  context is akin to the state in an MDP
 Minimal changes to the basic BO algorithm:
 Response surface is now
 The optimization of the acquisition function becomes constrained
Example of Contextual BO
[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911]
Simulated hexapod:
 12 Degrees of Freedom (2 per legs)
 Used Central Pattern Generators (CPG) as controllers
 Learned using three different inclines, and then tested on unseen configurations.
Example of MultiObjective BO
 In previous example, the objective optimized was walking speed
 What if we care not only about speed, but also about energy efficiency?
Multiobjective Optimization
 Most engineering problems are truly multiobjective
Pareto Front
 Not all objective functions can be optimized at once
 Solving this optimization means finding the
Multiobjective Bayesian Optimization
 Multiobjective BO aims at finding the Pareto Front in a dataefficient manner
 Common way to solve is to scalarize the multiple objectives into a single one
[Knowles, J. ParEGO: A hybrid algorithm with online landscape approximation for expensive multiobjective optimization problems IEEE Transactions on Evolutionary Computation, 2006, 10, 5066]
 For example, by using a Tchebycheff scalarization
 Or through Hypervolume
Hardcoded CPG Gaits
[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911]
Multiobjective
[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911]
Comparison Gaits
[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911]
Discovering New Gaits
[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911]
To Conclude
Brief History of BO & Related Fields
 Independently "reinvented" over and over throughout the last 50 years in different communities.
 [Krige 1951] Widely adopted in geostatistics under the name of "Kriging"
 [Kiefer 1959] as optimum experimental designs
 [Kushner 1964]
 [Mockus 1978]
 [Jones 1998 ] as Efficient Global Optimization (EGO)
 BO can be considered a continuous version of the "multiarmed bandit" problem
 BO can also be considered a policy search algorithm
(family of reinforcement learning algorithms)  Main difference with reinforcement learning is that standard BO formulation does not have a state that changes over time
Limitations of Bayesian Optimization
 Surprisingly efficient in many applications
 Often easy to incorporate more structure/priors/past evaluations for a given problem
 Interpretable models can provide insight
 It works in the realworld !
 Does not scale to highdimensional parameter space
(~30D for Gaussian processes, and a few hundred with more structured models)  No guarantees of convergence*
 If the underlying function is difficult to model, it often results in poor performance
(e.g., discontinuous functions, lot of local minima, nonGaussian noise, etc)
Research in BO
 BO is a very active field of research with plenty of realworld applications
 Some of the current challenges:
 Scaling BO to hundreds and thousands of parameters
(while maintaining its distinctive data efficiency)  Design and understand acquisition functions
 Incorporate more structure/knowledge into the models
 Scaling BO to hundreds and thousands of parameters
Software
 BoTorch (in Python/Torch) https://botorch.org/
 BayesOpt (in C++) https://github.com/rmcantin/bayesopt
 GPyOpt (in Python) https://github.com/SheffieldML/GPyOpt
 RoBO (in Python) https://github.com/automl/RoBO
 Spearmint (in Python) https://github.com/HIPS/Spearmint
 Opto https://github.com/robertocalandra/opto
 ...
Easy to use, for educational purposes
Well mantained, my choice atm
No active development
No active development
No active development
No active development
Summary
 Bayesian optimization is a popular tool for global optimization of expensive blackbox functions
(Many engineering problems can be reliably solved by BO)  Introduced standard BO notation
 Presented BO algorithms and its components
 Briefly discussed special instances of BO (contextual and multiobjective)
 Described strengths and limitations of BO
 Examples from realworld applications of BO
Please give Feedback at: https://tinyurl.com/m2lbayesianoptimization
Questions ?
References
 Brochu, E.; Cora, V. M. & De Freitas, N.
A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
arXiv preprint arXiv:1012.2599, 2010 
Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P. & de Freitas, N.
Taking the human out of the loop: A review of Bayesian optimization
Proceedings of the IEEE, IEEE, 2016, 104, 148175 
Rasmussen, C. E. & Williams, C. K. I.
Gaussian Processes for Machine Learning
The MIT Press, 2006  Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P.
Bayesian Optimization for Learning Gaits under Uncertainty
Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 523  Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K.
Learning Flexible and Reusable Locomotion Primitives for a Microrobot
IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911  Kushner, H. J.
A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise
Journal of Basic Engineering, 1964, 86, 97106  Mockus, J.; Tiesis, V. & Zilinskas, A.
The application of Bayesian methods for seeking the extremum
Towards Global Optimization, Amsterdam: Elsevier, 1978, 2, 117129  Jones, D. R.; Schonlau, M. & Welch, W. J.
Efficient global optimization of expensive blackbox functions
Journal of Global Optimization, Springer, 1998, 13, 455492 
Knowles, J.
ParEGO: A hybrid algorithm with online landscape approximation for expensive multiobjective optimization problems
IEEE Transactions on Evolutionary Computation, 2006, 10, 5066 
Hutter, F.; Hoos, H. H. & LeytonBrown, K.
Sequential modelbased optimization for general algorithm configuration
Learning and Intelligent Optimization (LION), Springer, 2011, 507523 
Snoek, J.; Larochelle, H. & Adams, R. P.
Practical Bayesian Optimization of Machine Learning Algorithms
arXiv preprint arXiv:1206.2944, 2012
Credits
 Page 6: image oil drill  CGP Grey (https://flic.kr/p/8s4LXw)  CC BY 2.0
 Page 6: image drug design  Jamie (https://flic.kr/p/9KVkMX)  CC BY 2.0
 Page 6: image Design and manufacturing  Michele Mischitelli (https://flic.kr/p/2hv3RWE)  CC BYND 2.0
 Page 6: image hyperparameters optimization  Samuel Humeau, Kurt Shuster, MarieAnne Lachaux, Jason Weston  Polyencoders: Transformer Architectures and Pretraining Strategies for Fast and Accurate Multisentence Scoring
 Page 6: image robot  Facebook
 Page 8: images BO  Marc Deisenroth
Additional Slides
Example of Multiobjective BO
Simulated hexapod:
 12 Degrees of Freedom (2 per legs)
 No good physics models at that scale
 we use Central Pattern Generators (CPG) as controllers
[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911]
Singleobjective
[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911]
Resulting Dual Tripod Gait
[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot IEEE Robotics and Automation Letters (RAL), 2018, 3, 19041911]
Dataefficient Optimization with Bayesian Optimization
By Roberto Calandra
Dataefficient Optimization with Bayesian Optimization
[M2L  13 January 2021]
 418