How I learned to stop worrying and love the Gaussian Process

Jeriek Van den Abeele

Higgs&DM Meeting

Oslo – March 20, 2020

[Figure: Gaussian process prediction with uncertainty; the correlation length-scale is estimated from the data via the kernel]

Global fits and the need for speed

Idea: consistent comparison of theories to all available data

  • High-dimensional parameter space with varying phenomenology (e.g. MSSM-24)
  • Quick prediction of next-to-leading order cross sections is crucial!
  • Existing tools have drawbacks (Prospino: slow, NLL-fast: limited validity)

\(\mathcal{L} = \mathcal{L}_\mathsf{collider} \times  \mathcal{L}_\mathsf{Higgs} \times  \mathcal{L}_\mathsf{DM} \times  \mathcal{L}_\mathsf{EWPO} \times  \mathcal{L}_\mathsf{flavour} \times \ldots\)

[GAMBIT, 1705.07919]

Goal

Fast estimate of SUSY (strong) production cross sections at NLO, and uncertainties from

  • regression itself
  • renormalisation scale
  • PDF variation
  • \(\alpha_s\) variation

Processes

$$ pp\to\tilde g \tilde g,\ \tilde g \tilde q_i,\ \tilde q_i \tilde q_j,\ \tilde q_i \tilde q_j^{*},\ \tilde b_i \tilde b_i^{*},\ \tilde t_i \tilde t_i^{*} $$

at \(\mathsf{\sqrt{s}=7/8/13/14}\) TeV

Method

Pre-trained, distributed Gaussian processes

Interface

Stand-alone Python code, also implemented in GAMBIT

Soon public on GitHub!

A quick example: gluino pair production at 13 TeV

Gaussian Processes 101

\(f(\vec x) \sim \mathcal{GP}\left[m(\vec x), k(\vec x, \vec x')\right]\)

\(\mathrm{E}\left[f(\vec x)\right] = m(\vec x), \qquad \mathrm{Cov}\left[f(\vec x), f(\vec x')\right] = k(\vec x, \vec x')\)

[Figure: Radial Basis Function kernel. Prior distribution over all functions with the estimated smoothness; posterior distribution over functions with updated \(m(\vec x)\) after conditioning on the data]

Gaussian Processes 101

[Figure: Matérn kernel. Prior distribution over all functions with the estimated smoothness; posterior distribution over functions with updated \(m(\vec x)\) after conditioning on the data]
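To make the kernel comparison concrete, here is a minimal sketch (my own illustration, not the talk's code) that draws samples from a zero-mean GP prior with an RBF and a Matérn (\(\nu = 3/2\)) kernel, using scikit-learn's kernel classes; the Matérn samples come out visibly less smooth.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, Matern

# Test inputs on a 1-D grid
X = np.linspace(0.0, 10.0, 200).reshape(-1, 1)

rng = np.random.default_rng(42)
for name, kernel in [("RBF", RBF(length_scale=1.0)),
                     ("Matern 3/2", Matern(length_scale=1.0, nu=1.5))]:
    K = kernel(X)                    # prior covariance matrix k(X, X)
    K += 1e-10 * np.eye(len(X))      # tiny jitter for numerical stability
    # Three samples from the zero-mean GP prior f ~ N(0, K)
    samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
    print(name, "prior sample shape:", samples.shape)
```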

All the scary math

Regression problem, with 'measurement' noise:

\(y=f(\vec x) + \varepsilon, \ \varepsilon\sim \mathcal{N}(0,\sigma_\varepsilon^2) \quad \rightarrow \quad \) infer \(f\), given data  \(\mathcal{D} = \{\vec X, \vec y\}\)

Assume covariance structure expressed by a kernel function, like

$$ k(\vec x, \vec x') = \underbrace{\sigma_f^2\, \exp\!\left(-\frac{|\vec x' - \vec x|^2}{2\ell^2}\right)}_{\text{signal kernel}} + \underbrace{\sigma_n^2\, \delta^{(D)}(\vec x' - \vec x)}_{\text{white-noise kernel}}. $$

Consider the data \(\vec X = [\vec x_1, \vec x_2, \ldots]\), \(\vec y = [y_1, y_2, \ldots]\) as a sample from a multivariate Gaussian distribution.

Training: optimise kernel hyperparameters \(\vec\theta\) by maximising the marginal likelihood

$$ p\left( \vec y\,|\, \vec X, \vec \theta \right) = \mathcal{N}\left( \vec y \,|\, m(\vec X), K_{\vec\theta} \right) \propto |K_{\vec\theta}|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \left( \vec y - m(\vec X) \right)^T K_{\vec\theta}^{-1} \left(\vec y - m(\vec X) \right) \right\} $$

Posterior predictive distribution at a new point \(\vec x_*\) (implicitly integrating over points not in \(\vec X\)):

$$ y_*\,|\, \mathcal D, \vec x_* \sim \mathcal{N}\left(\mu_*, \sigma_*^2\right) $$

with

$$ \mu_* = m(\vec x_*) + K_* K^{-1}\left(\vec{y}-m(\vec X)\right), \qquad \sigma_*^2 = K_{**}-K_*K^{-1}K_*^T, $$

where \(K = k(\vec X, \vec X)\), \(K_* = k(\vec x_*, \vec X)\), \(K_{**} = k(\vec x_*, \vec x_*)\).
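As a sanity check of these formulas, the following minimal numpy sketch (an illustration on toy data, not the released code) builds \(K\) from a squared-exponential kernel plus white noise and evaluates \(\mu_*\) and \(\sigma_*^2\) with a Cholesky factorisation instead of an explicit inverse, assuming a zero mean function \(m(\vec x) = 0\).

```python
import numpy as np

def kernel(A, B, sigma_f=1.0, ell=1.0):
    """Squared-exponential (signal) kernel between two sets of points."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / ell**2)

# Toy training data: y = sin(x) + noise, with zero prior mean m(x) = 0
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
sigma_n = 0.1

# K includes the white-noise term on the diagonal
K = kernel(X, X) + sigma_n**2 * np.eye(len(X))
L = np.linalg.cholesky(K)                            # O(n^3) training cost
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} (y - m(X))

# Predictions at new points x_*
X_star = np.linspace(0, 10, 5).reshape(-1, 1)
K_star = kernel(X_star, X)                           # k(x_*, X)
K_ss = kernel(X_star, X_star)                        # k(x_*, x_*)

mu_star = K_star @ alpha                             # mu_*
v = np.linalg.solve(L, K_star.T)
var_star = np.diag(K_ss) - np.sum(v**2, axis=0)      # sigma_*^2

print(mu_star, var_star)
```

Solving via the Cholesky factor is the standard way to avoid forming \(K^{-1}\) explicitly, which is both slower and numerically less stable.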

My kingdom for a good kernel ...

GPs allow us to use probabilistic inference to learn a function from data, in an interpretable, analytical, yet non-parametric Bayesian framework.

 

A GP model is fully specified once the mean function, kernel and its hyperparameters are chosen.

 

The probabilistic interpretation only holds under the assumption that the chosen kernel accurately describes the true correlation structure, e.g. the correlation length scales.

My kingdom for a good kernel ...

The choice of kernel allows for great flexibility. But once chosen, it fixes the type of functions likely under the GP prior and determines the kind of structure captured by the model, e.g., periodicity and differentiability.

My kingdom for a good kernel ...

For our multi-dimensional case of cross-section regression, we get good results by multiplying Matérn (\(\nu = 3/2\)) kernels over the different mass dimensions:

$$ k_s(\vec x, \vec x') = \prod_{d=1}^D \sigma_d^2 \left( 1 + \sqrt{3}\, \frac{|x'_d - x_d|}{l_d} \right) \exp\left(-\sqrt{3}\,\frac{|x'_d - x_d|}{l_d}\right) $$

The different lengthscale parameters  \(l_d\)  lead to automatic relevance determination for each feature: short-range correlations for important features over which the latent function varies strongly.

This is an anisotropic, stationary kernel. It allows for functions that are less smooth than with the standard squared-exponential kernel.
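A direct numpy transcription of this product kernel might look like the sketch below (illustrative function and parameter names of my own, not the released code), with one amplitude \(\sigma_d\) and lengthscale \(l_d\) per mass dimension:

```python
import numpy as np

def matern32_product_kernel(A, B, sigma, lengthscale):
    """Product over dimensions of 1-D Matern(nu=3/2) kernels.

    A: (n, D) and B: (m, D) arrays of input points,
    sigma, lengthscale: length-D arrays of per-dimension hyperparameters.
    Returns the (n, m) covariance matrix k_s(A, B).
    """
    sigma = np.asarray(sigma)
    lengthscale = np.asarray(lengthscale)
    # Per-dimension scaled distances |x_d - x'_d| / l_d, shape (n, m, D)
    r = np.abs(A[:, None, :] - B[None, :, :]) / lengthscale
    factors = (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)
    # Product over dimensions, with overall amplitude prod_d sigma_d^2
    return np.prod(sigma**2) * np.prod(factors, axis=-1)

# Example: two points in a toy 3-dimensional mass space (GeV)
X = np.array([[1200.0, 800.0, 2000.0],
              [1250.0, 820.0, 2100.0]])
K = matern32_product_kernel(X, X, sigma=[1.0, 1.0, 1.0],
                            lengthscale=[300.0, 200.0, 500.0])
print(K)
```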

... with good hyperparameters

[Figure: 1-D GP fits with different hyperparameters. Short lengthscale, small noise: overfitting of fluctuations, can lead to large uncertainties. Long lengthscale, large noise: underfitting, almost linear]

Typically, kernel hyperparameters are estimated by maximising the (log) marginal likelihood \(p( \vec y\ |\ \vec X, \vec \theta) \), aka the empirical Bayes method.

Alternative: MCMC integration over a range of \(\vec \theta\).

Gradient-based optimisation is widely used, but can get stuck in local optima and plateaus. Multiple initialisations can help, or global optimisation methods like differential evolution.

The global optimum typically lies somewhere in between those two extremes.
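As an illustration of the optimisation strategies above (multiple restarts and differential evolution), the sketch below is a toy setup, not the talk's actual training code: it fits a GP with several random restarts of the gradient-based optimiser in scikit-learn, then re-maximises the same log marginal likelihood globally with scipy's differential evolution.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# Toy data standing in for (log) cross sections over masses
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

kernel = Matern(length_scale=1.0, nu=1.5) + WhiteKernel(noise_level=0.1)

# Gradient-based optimisation with multiple random initialisations
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X, y)
print("restarts:", gp.kernel_, gp.log_marginal_likelihood_value_)

# Global optimisation of the same objective with differential evolution,
# over the log-transformed hyperparameters theta
bounds = [(np.log(1e-2), np.log(1e2))] * len(gp.kernel_.theta)
result = differential_evolution(
    lambda theta: -gp.log_marginal_likelihood(theta), bounds, seed=1)
print("differential evolution:", np.exp(result.x), -result.fun)
```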

The standard approach systematically underestimates prediction errors.

After accounting for the additional uncertainty from learning the hyperparameters, the prediction error increases far from the training points.

[Wågberg+, 1606.03865]

... with good hyperparameters

Other tricks to improve the numerical stability of training:

  • Normalising features and target values (so that a zero prior mean is reasonable)
  • Factoring out known behaviour from the target values
  • Training on log-transformed targets, for extreme values and/or a large range, or to ensure positive predictions (see the sketch below)
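For instance, a minimal preprocessing sketch along these lines (with hypothetical helper names, not the released code) could standardise the mass features and train on the log of the cross section, transforming back at prediction time so that predictions stay positive:

```python
import numpy as np

def preprocess(X, y):
    """Standardise features and log-transform strictly positive targets."""
    X_mean, X_std = X.mean(axis=0), X.std(axis=0)
    X_norm = (X - X_mean) / X_std      # zero-mean, unit-variance features
    log_y = np.log(y)                  # tame the large dynamic range
    log_y_mean = log_y.mean()
    return X_norm, log_y - log_y_mean, (X_mean, X_std, log_y_mean)

def postprocess(log_y_pred, log_y_mean):
    """Undo the target transformation; exp(.) keeps predictions positive."""
    return np.exp(log_y_pred + log_y_mean)

# Toy cross-section-like target spanning several orders of magnitude
X = np.random.default_rng(2).uniform(500, 3000, size=(50, 2))  # masses in GeV
y = 1e3 * np.exp(-X.sum(axis=1) / 1000.0)                      # fb, toy model
X_norm, y_train, stats = preprocess(X, y)
```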

Sometimes, a curious problem arises: negative predictive variances!

It is due to numerical errors when computing the inverse of the covariance matrix \(K\). When \(K\) is built from many training points, there is a good chance that some of them are similar:

 

$$ \vec x_1 \sim \vec x_2 \quad \implies \quad k(\vec x_1, \vec X) \simeq k(\vec x_2, \vec X). $$

Nearly equal columns make \(K\) ill-conditioned. One or more eigenvalues \(\lambda_i\) are close to zero and \(K\) can no longer be inverted reliably. The number of significant digits lost is roughly the \(\log_{10}\) of the condition number

$$ \kappa = \frac{\lambda_\mathsf{max}}{\lambda_\mathsf{min}}. $$

This becomes problematic when \(\kappa \gtrsim 10^8 \). In the worst-case scenario,

$$ \kappa = N \left(\frac{\sigma_f}{\sigma_n}\right)^2 + 1, $$

where \(N\) is the number of training points and \(\sigma_f/\sigma_n\) is the signal-to-noise ratio.
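A quick numerical check of this effect (a toy example of my own): two nearly identical training points make an RBF kernel matrix nearly singular, and a small white-noise term on the diagonal brings the condition number back down.

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, ell=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq / ell**2)

# Two of the training points are almost identical
X = np.array([[0.0], [1.0], [2.0], [2.0 + 1e-8]])

K = rbf_kernel(X, X)
print("condition number, no noise:", np.linalg.cond(K))        # very large

sigma_n = 1e-3
K_reg = K + sigma_n**2 * np.eye(len(X))                        # white-noise term
print("condition number, with noise:", np.linalg.cond(K_reg))  # orders of magnitude smaller
```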

GP Regularisation

  • Use the Moore-Penrose matrix pseudo-inverse
    • Disregard redundant points,  averaging with zero uncertainty at repeated points
  • Add a nugget/jitter term to the kernel
    • White noise, can compute minimal value that reduces \(\kappa\) below \(\kappa_\mathsf{max}\) (see the sketch below)
    • Can model discrepancy between GP model and latent function, e.g., regarding assumptions of stationarity and differentiability
  • Avoid a large S/N ratio by adding a likelihood penalty term guiding the optimisation
    • Nearly noiseless data is problematic ... Artificially increasing the noise level may be necessary, degrading the model a bit

[Mohammadi+, 1602.00853]
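For the nugget option, the minimal jitter follows directly from the eigenvalues: requiring \((\lambda_\mathsf{max}+\tau)/(\lambda_\mathsf{min}+\tau) \le \kappa_\mathsf{max}\) gives \(\tau \ge (\lambda_\mathsf{max} - \kappa_\mathsf{max}\lambda_\mathsf{min})/(\kappa_\mathsf{max}-1)\). A small sketch of this calculation (illustrative, with a hypothetical function name):

```python
import numpy as np

def minimal_jitter(K, kappa_max=1e8):
    """Smallest tau such that cond(K + tau*I) <= kappa_max, for symmetric PSD K.

    Adding tau to the diagonal shifts every eigenvalue by tau, so the
    requirement is (lam_max + tau) / (lam_min + tau) <= kappa_max.
    """
    eigvals = np.linalg.eigvalsh(K)        # ascending eigenvalues
    lam_min, lam_max = eigvals[0], eigvals[-1]
    tau = (lam_max - kappa_max * lam_min) / (kappa_max - 1.0)
    return max(tau, 0.0)

# Usage on an ill-conditioned kernel matrix K:
# K_reg = K + minimal_jitter(K) * np.eye(len(K))
```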

Workflow

[Workflow diagram: Generating data (random sampling → SUSY spectrum → cross sections) → Training GPs (optimise kernel hyperparameters) → GP predictions (input parameters → compute covariances → linear algebra → cross-section estimates)]

Training scales as \(\mathcal{O}(n^3)\), prediction as \(\mathcal{O}(n^2)\)

A balancing act

Balancing evaluation speed against sample coverage: a large parameter space needs to be covered, using a mix of random samples with different priors in mass space.

Distributed Gaussian processes

[Liu+, 1806.00720]

Generalized Robust Bayesian Committee Machines

Aggregation model for dealing with large datasets

  1. Split the data into one global communication set \(\mathcal D_c\) and any number of local expert sets \(\mathcal D_i\)
  2. Train a separate GP on each data set
  3. Augment each \(\mathcal D_i\) with \(\mathcal D_c\), without re-training
  4. Combine predictions from the separate GPs, giving larger weight to experts with small uncertainty (see the sketch below)
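Step 4 can be sketched as follows; this is a simplified robust-BCM-style combination with differential-entropy weights \(\beta_i = \tfrac{1}{2}\left(\log\sigma_c^2 - \log\sigma_i^2\right)\) taken relative to the communication expert, loosely following [Liu+, 1806.00720] rather than reproducing its exact GRBCM expressions.

```python
import numpy as np

def aggregate_experts(mu_experts, var_experts, mu_comm, var_comm):
    """Combine per-expert GP predictions at one test point.

    mu_experts, var_experts: means/variances from the local experts
    (each trained on D_i augmented with the communication set D_c);
    mu_comm, var_comm: prediction of the expert trained on D_c alone.
    Experts with smaller variance receive larger weight.
    """
    mu_experts = np.asarray(mu_experts)
    var_experts = np.asarray(var_experts)

    # Differential-entropy weights: how much each expert improves on D_c alone
    beta = 0.5 * (np.log(var_comm) - np.log(var_experts))

    # Precision-weighted combination; the communication expert absorbs the
    # leftover weight so the prediction stays calibrated far from the data
    precision = np.sum(beta / var_experts) + (1.0 - np.sum(beta)) / var_comm
    var_agg = 1.0 / precision
    mu_agg = var_agg * (np.sum(beta * mu_experts / var_experts)
                        + (1.0 - np.sum(beta)) * mu_comm / var_comm)
    return mu_agg, var_agg

# Example: three local experts plus the communication expert at one new point
mu, var = aggregate_experts([2.1, 1.9, 2.3], [0.04, 0.09, 0.05],
                            mu_comm=2.0, var_comm=0.25)
```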


Thank You!
