Data-efficient Optimization with Bayesian Optimization

Roberto Calandra

Facebook AI Research

  Mediterranean Machine Learning Summer School - 13 January 2021

Goals of the Talk

  • Understand how BO works
  • Recognize in which problems it is appropriate to use Bayesian optimization (BO)
    • Knowing the strengths and limitations of BO
    • Being familiar with several variants of BO
    • Being aware of several real-world applications which use BO
  • Being sufficiently familiar with the topic to be able to read technical papers on the topic.
    • Knowing the relevant BO notation
    • Recognize the similarities to the Markov Decision Process formulation and Reinforcement Learning

Why Bayesian Optimization?

Example in Robotics

  • Bipedal Robot with an existing finite-state machine controller (8 parameters)
  • We want to maximize the distance walked within a given amount of time.
  • Repeating an experiment might not yield the exact same outcome.
  • Average motor life is ~200 trials.

[Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P. Bayesian Optimization for Learning Gaits under Uncertainty Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 5-23]

How do we tune the parameters of the controller?

Traditional Optimization Approaches

  • Manual tuning (requires expert knowledge)
  • Grid Search (does not scale to large parameter spaces)
  • Random Search (better than grid search, but still too many evaluations)
  • Gradient descent (we can not compute the gradient)
  • Evolutionary strategies (requires thousands of evaluations)
  • ...

All of these methods were impractical for this problem.
What else can we use?

Bayesian optimization!

Black-box Optimization

x^* = \text{arg min} \quad f(x)

Optimized parameters

Objective function

Parameters to optimize

{x\in R^d}

A Taxonomy of Objective Functions

Single minimum
(e.g., convex functions)

Multiple minimum
(a.k.a., global optimization)

First-order

(we can measure gradients)

Zero-order

(no gradients available)

Noise-less

(repeating the evaluation yield the same result)

Stochastic

(repeating the evaluation yield different results)

Nice and easy to solve
(e.g., with gradient descent)

Cheap Evaluation

(virtually infinite number of evaluations allowed)

Difficult to optimize

Expensive Evaluation

(limited to tens or hundreds of evaluations)

Here we want to use BO!

Examples of Applications

Oil drilling

Design and manufacturing

Drug design

Robotics

Hyperparameters optimization

Hyperparameters Optimization

  • Machine learning models are growing more and more complex
  • Modern deep learning models have dozens of hyperparameters that need to be tuned (e.g., learning rates, number of layers, batch size...)
  • To achieve state-of-the-art results, finding good hyperparameters is crucial
  • Even for experts finding good hyperparameters can be difficult and time consuming
  • How can we automatically optimize them?

Vibrant community dedicated to automated machine learning (AutoML)

How does Bayesian
Optimization works?

Intuition Behind Bayesian Optimization

  • Many optimizers capture only local information about the objective function
  • Can we instead use all information (i.e., the evaluations) collected so far to make a more informed decision, hence improving data-efficiency?
  • How to do this in practice?
x_{t+1} = g(x_t, f(x_t))
x_{t+1} = x_t + \gamma g(\nabla f(x_t))

e.g.,

D =\{x_i, f(x_i)\}\,, i=1\ldots t
x_{t+1} = g(D)
\tilde{f}(x)|_{D} \sim f(x)

We can create a surrogate model

x^* = \text{arg min} \quad \tilde{f}(x)

Gradient descent

Bayesian Optimization

  • Learn response surface
  • Based on the response surface, select next parameters to evaluate
  • Evaluate
     on the objective function
  • Repeat until stop criteria
\tilde{f}(x)
x_{t+1}
x_{t+1}

[credit: Marc Deisenroth]

Bayesian Optimization

  • Learn response surface
  • Based on the response surface, select next parameters to evaluate
  • Evaluate
     on the objective function
  • Repeat until stop criteria
\tilde{f}(x)
x_{t+1}
x_{t+1}

[credit: Marc Deisenroth]

Response Surface

Large variety of models used throughout the literature:

  • Polynomial functions
  • Random forests
  • Bayesian neural networks
  • Gaussian processes
  • ...

By far the most commonly used (at the moment)

Surrogate model (a.k.a. response surface) need to accurately approximate (and generalize) the underlying function based on the available data

 
D =\{X, y\}\,, X=x_i\,, y=f(x_i)
y = \tilde{f}(X)\, \qquad \longrightarrow f(x) = \tilde{f}(x)

Gaussian Processes

Additional reading:

Rasmussen, C. E. & Williams, C. K. I.
Gaussian Processes for Machine Learning
The MIT Press, 2006

  • Distribution over functions
\tilde{f} \sim \text{GP}(m_f, k_f)
  • Probabilistic Model
p(\tilde{f}(x^*)|D, x^*)\sim N(\mu, \sigma^2)\,,
\mu|_{x^*} = k(X,x^*)^T k(X,X)^{-1} y\,,

Mean of a GP = Kernel ridge regression

\sigma^2|_{x^*} = k(x^*, x^*) - k(X, x^*)^T k(X, X)^{-1} k(X, x^*)
y = \tilde{f}(X) + \epsilon\,, \quad \epsilon\sim N(0,\sigma^2)
  • The posterior predictive distribution for an arbitrary input          is computed as:
x^*
  • Flexible Bayesian regression method

Intuition of Gaussian Processes

Covariance Functions and GP Training

k(x_i, x_j) = \sigma^2_f \text{exp} \large( - \frac{(x_i - x_j)^2}{2l^2}\large) + \delta_{ij} \sigma_n^2

Square exponential

parameters of the GP
(often referred to as hyperparameters)

Multiple ways to optimize the hyperparameters

  • MAP estimate (by optimizing marginal likelihood)
  • Numerical integration (proper Bayesian way, but often more complicated)

Additional reading:

Rasmussen, C. E. & Williams, C. K. I.
Gaussian Processes for Machine Learning
The MIT Press, 2006

Why Gaussian Processes?

Pro:

  • Mathematically well-understood
  • Calibrated uncertainties
  • Possibility of specifying priors (e.g., of the underlying function)
  • Easy to enforce Lipschitzian smoothness (by choosing appropriate kernel)
  • Good modeling capabilities in low-data regime

Cons:

  • Difficult to scale to high-dimensional input space
  • Computationally expensive
  • Quality of the model dependent from use of appropriate kernel
N(O^3)

Bayesian Optimization

  • Learn response surface
  • Based on the response surface, select next parameters to evaluate
  • Evaluate
     on the objective function
  • Repeat until stop criteria
\tilde{f}(x)
x_{t+1}
x_{t+1}

[credit: Marc Deisenroth]

Acquisition Function

  • How do we select the next parameters to evaluate?

 

 

  • Intuition: A good acquisition function               need to strike a smart balance between exploration and exploitation
    • Too much exploration, and we will keep trying parameters that are unlikely to perform well
    • Too much exploitation, and we might get stuck in a local minima
    • In either extremes, performance will suffer
  • The tradeoff between exploration and exploitation traditionally* happens through the mean and variance of the response surface

 

 

x^* = \text{arg min} \quad \alpha(\tilde{f}(x), D)
\alpha(\cdot)
y = \alpha(\mu(x^*), \sigma(x^*))

* But not always

Acquisition Functions

y = \mu(x^*) -\beta \sigma(x^*))
  • Many acquisition functions in the literature:
    • Probability of improvement [Kushner 1964]
    • Expected improvement [Mockus 1978]
    • Upper confidence bound
    • Entropy search
    • Predictive entropy search
    • ...
    • Ensembles of aquisition functions
  • No Golden bullet

Optimizing the Acquisition Function

x^* = \text{arg min} \quad \alpha(\tilde{f}(x), D)
{x\in R^d}
  • Optimizing the acquisition function is by itself a challenging optimization problem
  • What have we gained by converting the original optimization to this?
    • No longer stochastic
    • No longer zero-order
      (We can usually compute gradients and Hessian of the acquisition function)
    • Not expensive to compute
      (Does not require real-world evaluations. Although it might potentially be computationally intensive)
  • In theory, any global optimizer can be used to optimize the acquisition function
  • In practice, often used a global optimizer (e.g., CMA-ES or DIRECT) followed by a first-order optimizer (e.g., gradient descent)

Recap

Going back to our robot...

Learning to Walk in 80 Trials

[Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P. Bayesian Optimization for Learning Gaits under Uncertainty Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 5-23]

Learning Curve - Fox

[Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P. Bayesian Optimization for Learning Gaits under Uncertainty Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 5-23]

Learned model

Not Symmetrical (about 5° difference). Why?

Because it is walking in a circle!

[Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P. Bayesian Optimization for Learning Gaits under Uncertainty Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 5-23]

Extentions of Bayesian Optimization

Extensions of Standard BO

Numberless extensions in the literature:

  • Constrained optimization (linear and non-linear constraints)
  • Multi-task optimization (we can exploit the statistical correlation to previously optimized tasks)
  • Robust optimization
  • Safe optimization
  • Batch optimization (several parameters are evaluated at once)
  • Contextual optimization (more about this soon)
  • Multi-objective optimization (more about this soon)
  • Grey-box optimization
  • ...

Example of Contextual BO

  • In previous example the robot learned to walk on a flat surface
  • What if we want to learn to walk over surfaces with different (potentially unseen) inclines?

Contextual Bayesian Optimization

x^* = \text{arg min} \quad f(x)

Optimized parameters

Objective function

Parameters to optimize

{x\in R^d}
x^* = \text{arg min} \quad f(x, c )

Context

Contextual Bayesian Optimization

  • Very common case

 

  • Striking similarities with RL -- context is akin to the state in an MDP

 

  • Minimal changes to the basic BO algorithm:
    • Response surface is now
    • The optimization of the acquisition function becomes constrained
y = \tilde{f}(X,c)\quad \longrightarrow f(x) = \tilde{f}(x,c)
x^* = \text{arg min} \quad \alpha(\tilde{f}(x, c), D)
{x\in R^d}
s.t.\quad c = c*

Example of Contextual BO

[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot  IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911]

Simulated hexapod:

  • 12 Degrees of Freedom (2 per legs)
  • Used Central Pattern Generators (CPG) as controllers
  • Learned using three different inclines, and then tested on unseen configurations.

Example of Multi-Objective BO

  • In previous example, the objective optimized was walking speed
  • What if we care not only about speed, but also about energy efficiency?

Multi-objective Optimization

  • Most engineering problems are truly multi-objective
{x\in R^d}
x^* = \text{arg min} \quad \{f_1(x),\ldots,f_n(x) \}

Pareto Front

  • Not all objective functions can be optimized at once
  • Solving this optimization means finding the

Multi-objective Bayesian Optimization

  • Multi-objective BO aims at finding the Pareto Front in a data-efficient manner
  • Common way to solve is to scalarize the multiple objectives into a single one
x^* = \text{arg min} \quad g(f_1(x), \ldots, f_n(x))

[Knowles, J. ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems IEEE Transactions on Evolutionary Computation, 2006, 10, 50-66]

  • For example, by using a Tchebycheff scalarization

 

 

 

  • Or through Hypervolume
g(x) = \text{max}_{i=1}^n \left( \lambda_i\, f_i \left( x \right) \right) + \rho \sum_{i=1}^n \lambda_i\, f_i(x)

Hard-coded CPG Gaits

[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot  IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911]

Multi-objective

[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot  IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911]

Comparison Gaits

[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot  IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911]

Discovering New Gaits

[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot  IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911]

To Conclude

Brief History of BO & Related Fields

  • Independently "re-invented" over and over throughout the last 50 years in different communities.
  • [Krige 1951] Widely adopted in geostatistics under the name of "Kriging"
  • [Kiefer 1959] as optimum experimental designs
  • [Kushner 1964]
  • [Mockus 1978]
  • [Jones 1998 ] as Efficient Global Optimization (EGO)
  • BO can be considered a continuous version of the "multi-armed bandit" problem
  • BO can also be considered a policy search algorithm
    (family of reinforcement learning algorithms)
  • Main difference with reinforcement learning is that standard BO formulation does not have a state that changes over time

Limitations of Bayesian Optimization

  • Surprisingly efficient in many applications
  • Often easy to incorporate more structure/priors/past evaluations for a given problem
  • Interpretable models can provide insight
  • It works in the real-world !
 
  • Does not scale to high-dimensional parameter space
    (~30-D for Gaussian processes, and a few hundred with more structured models)
  • No guarantees of convergence*
  • If the underlying function is difficult to model, it often results in poor performance
    (e.g., discontinuous functions, lot of local minima, non-Gaussian noise, etc)
 

Research in BO

  • BO is a very active field of research with plenty of real-world applications
  • Some of the current challenges:
    • Scaling BO to hundreds and thousands of parameters
      (while maintaining its distinctive data efficiency)
    • Design and understand acquisition functions
    • Incorporate more structure/knowledge into the models

Software

Easy to use, for educational purposes

Well mantained, my choice atm

No active development

No active development

No active development

No active development

Summary

  • Bayesian optimization is a popular tool for global optimization of expensive black-box functions
    (Many engineering problems can be reliably solved by BO)
  • Introduced standard BO notation
  • Presented BO algorithms and its components
  • Briefly discussed special instances of BO (contextual and multi-objective)
  • Described strengths and limitations of BO
  • Examples from real-world applications of BO

Questions ?

References

  • Brochu, E.; Cora, V. M. & De Freitas, N.
    A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
    arXiv preprint arXiv:1012.2599, 2010
  • Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P. & de Freitas, N.
    Taking the human out of the loop: A review of Bayesian optimization
    Proceedings of the IEEE, IEEE, 2016, 104, 148-175

  • Rasmussen, C. E. & Williams, C. K. I.
    Gaussian Processes for Machine Learning
    The MIT Press, 2006

  • Calandra, R.; Seyfarth, A.; Peters, J. & Deisenroth, M. P.
    Bayesian Optimization for Learning Gaits under Uncertainty
    Annals of Mathematics and Artificial Intelligence (AMAI), 2015, 76, 5-23
  • Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K.
    Learning Flexible and Reusable Locomotion Primitives for a Microrobot
    IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911
  • Kushner, H. J.
    A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise
    Journal of Basic Engineering, 1964, 86, 97-106
  • Mockus, J.; Tiesis, V. & Zilinskas, A.
    The application of Bayesian methods for seeking the extremum
    Towards Global Optimization, Amsterdam: Elsevier, 1978, 2, 117-129
  • Jones, D. R.; Schonlau, M. & Welch, W. J.
    Efficient global optimization of expensive black-box functions
    Journal of Global Optimization, Springer, 1998, 13, 455-492
  • Knowles, J.
    ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems
    IEEE Transactions on Evolutionary Computation, 2006, 10, 50-66

  • Hutter, F.; Hoos, H. H. & Leyton-Brown, K.
    Sequential model-based optimization for general algorithm configuration
    Learning and Intelligent Optimization (LION), Springer, 2011, 507-523

  • Snoek, J.; Larochelle, H. & Adams, R. P.
    Practical Bayesian Optimization of Machine Learning Algorithms
    arXiv preprint arXiv:1206.2944, 2012

Credits

  • Page 6: image oil drill - CGP Grey (https://flic.kr/p/8s4LXw) - CC BY 2.0
  • Page 6: image drug design - Jamie (https://flic.kr/p/9KVkMX) - CC BY 2.0
  • Page 6: image Design and manufacturing - Michele Mischitelli (https://flic.kr/p/2hv3RWE) - CC BY-ND 2.0
  • Page 6: image hyperparameters optimization - Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, Jason Weston - Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
  • Page 6: image robot - Facebook
  • Page 8: images BO - Marc Deisenroth

Additional Slides

Example of Multi-objective BO

Simulated hexapod:

  • 12 Degrees of Freedom (2 per legs)
  • No good physics models at that scale
  • we use Central Pattern Generators (CPG) as controllers

[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot  IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911]

Single-objective

[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot  IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911]

Resulting Dual Tripod Gait

[Yang, B.; Wang, G.; Calandra, R.; Contreras, D.; Levine, S. & Pister, K. Learning Flexible and Reusable Locomotion Primitives for a Microrobot  IEEE Robotics and Automation Letters (RA-L), 2018, 3, 1904-1911]

Data-efficient Optimization with Bayesian Optimization

By Roberto Calandra

Data-efficient Optimization with Bayesian Optimization

[M2L - 13 January 2021]

  • 2,049