On Gaussian Processes

Vidhi Lalchand

 

MIT Kavli Institute for Astrophysics and Space Research

22-03-2023 

 

Research Seminar

"Functions describe the world"

- Thomas Garrity

Outline

Gaussian processes as a "function" learning paradigm.

Regression with GPs: Both inputs \((X)\) and outputs \( (Y)\) are observed.

Latent Variable Modelling with GP: Only outputs \( (Y)\) are observed.

Without loss of generality, we are going to assume \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times D}\) and \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times 1}\)

Hot take: Almost all machine learning comes down to modelling functions.


Model selection is a hard problem!

What if we were not forced to decide the complexity of \( f\) at the outset? What if \(f\) could calibrate its complexity on the fly as it sees the data? This is precisely what is called non-parametric learning.

Gaussian Processes

Gaussian processes are a powerful non-parametric paradigm for performing state-of-the-art regression.

  • They are probabilistic \( \rightarrow\) user has a sense of prediction uncertainty.
  • They don't have standard parameters \( \rightarrow\) they model the mapping \( f \) directly by placing a prior in the space of functions!

We need to understand the notion of distribution over functions.

  • When we think of a function in a mathematical sense we immediately try to think of a parametric form. For example, \( 5x -2, x^2, 3x^3 - x, e^{x} \).
  • But in GP world there is a fundamental shift in thinking about functions. We completely abandon the parametric form viewpoint.
  • A continuous function \( f \) on the real domain \( \mathbb{R}^{d} \) can be thought of as an infinitely long vector evaluated at some index set \( [x_{1}, x_{2}, \ldots]\).

  • \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots]\)

  • Gaussian processes are probability distributions over functions!

Interpretation of functions

Sticking point: We cannot represent infinite dimensional vectors on a computer... true, but bear with me.

$$ f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime})) $$

\(m(x)\) is a mean function.

\(k(x, x')\) is a covariance function.

What is a GP?

The most intuitive way of understanding GPs is understanding the correspondence between Gaussian distributions and Gaussian processes.

What is a GP?

A sample from a \(k\)-dimensional Gaussian \( \mathbf{x} \sim \mathcal{N}(\mu, \Sigma) \) is a vector of size \(k\). $$ \mathbf{x} = [x_{1}, \ldots, x_{k}] $$

The mathematical crux of a GP is that \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots, f(x_{N})]\) is just an \(N\)-dimensional multivariate Gaussian \( \mathcal{N}(\mu, K) \).

$$ \begin{bmatrix} f_{1} \\ \vdots\\ f_{499} \\ f_{500} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_{1} \\ \vdots\\ \mu_{499} \\ \mu_{500} \end{bmatrix}, \begin{bmatrix} k(x_{1}, x_{1}) & \ldots & k(x_{1}, x_{500}) \\ \vdots & \ddots & \vdots \\ k(x_{500}, x_{1}) & \ldots & k(x_{500}, x_{500}) \end{bmatrix} \right) $$

A GP is an infinite dimensional analogue of a Gaussian distribution \( \rightarrow \) so is a sample from it a vector of infinite length?

$$ f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime})) $$

But in practice we only ever need to represent our function \( f(x) \) at a finite index set \( \mathcal{X} = [x_{1},\ldots, x_{500}]\). So we are interested in our long function vector \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots, f(x_{500})]\).

Function samples from a GP

$$ f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime})) $$

The kernel function \( k(x,x')\) is the heart of a GP: it controls the inductive biases of our function space, such as shape, periodicity, and smoothness.

Prior over functions: sample draws from a zero-mean GP prior under different kernel functions.

In reality, they are just draws from a multivariate Gaussian \( \mathcal{N}(0, K)\) where the covariance matrix has been evaluated by applying the kernel function to all pairs of data points.

Infinite dimensional prior:

\(f(x) \sim \mathcal{GP}(m(x), k_{\theta}(x,x^{\prime})) \)

For a finite set of points \( X \):

\(f(X) \sim \mathcal{N}(m(X), K_{X})\)

\( k_{\theta}(x,x^{\prime})\) encodes the support and inductive biases in function space.
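To make the finite index set view concrete, here is a minimal sketch (not code from these slides; assuming NumPy, a squared-exponential kernel, and arbitrarily chosen hyperparameters) that draws function samples from a zero-mean GP prior by sampling from \( \mathcal{N}(0, K_{X}) \):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel evaluated on all pairs of 1-d inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Finite index set: 500 points standing in for the "infinitely long" function vector.
X = np.linspace(-5, 5, 500)
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))  # small jitter for numerical stability

# A function sample is just a draw from the 500-dimensional Gaussian N(0, K).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(X)), cov=K, size=3)
print(samples.shape)  # (3, 500): three function draws evaluated at the index set
```

Plotting each row of `samples` against `X` reproduces the kind of prior draws shown above.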

Gaussian Process Regression

How do we fit functions to noisy data with GPs?

1. Given some noisy data \( \bm{y} = \lbrace{y_{i}}\rbrace_{i=1}^{N} \) observed at input locations \( X = \{ x_{i}\}_{i=1}^N\).

2. You believe your data comes from a function \( f\) corrupted by Gaussian noise.  

$$ \bm{y} = f(X) + \epsilon, \hspace{10pt} \epsilon \sim \mathcal{N}(0, \sigma^{2})$$

Data Likelihood: \( \hspace{10pt}  y|f \sim \mathcal{N}(f(X), \sigma^{2}\mathbb{I}) \)

Prior over functions: \( f|\theta \sim \mathcal{GP}(0, k_{\theta}) \)

(The choice of kernel function \( k_{\theta}\) controls how your function space looks.)

For the observed inputs \( X \), this prior reduces to:

$$ \begin{aligned} \bm{f} &\sim \mathcal{N}(0, K_{\theta}) \\ K_{i,j} &= k_{\theta}(x_{i}, x_{j}) \hspace{2mm} \forall i, j \end{aligned} $$

...but we still need to fit the kernel hyperparameters \( \theta \).

Learning Step:

 

Data likelihood: \( p(\bm{y}|\bm{f}) = \mathcal{N}(\bm{y}; \bm{f}, \sigma_{n}^{2}\mathbb{I}) \). Prior: \( p(\bm{f}|\bm{\theta}) = \mathcal{N}(\bm{f}; \bm{0}, K_{\theta}) \).

$$ \begin{aligned} p(\bm{y}|\bm{\theta}) &= \int p(\bm{y}|\bm{f})\,p(\bm{f}|\bm{\theta})\,d\bm{f}\\ &= \int \mathcal{N}(\bm{y}; \bm{f}, \sigma_{n}^{2}\mathbb{I})\,\mathcal{N}(\bm{f}; \bm{0}, K_{\theta})\,d\bm{f} \\ &= \mathcal{N}(\bm{0}, K_{\theta} + \sigma^{2}_{n}\mathbb{I}) \end{aligned} $$

$$ \bm{\theta_{\star}} = \argmax_{\bm{\theta}}\log p(\bm{y}|\bm{\theta}) $$

Learning in Gaussian process models occurs through the maximisation of the marginal likelihood w.r.t. the kernel hyperparameters. The marginal likelihood \( p(\bm{y}|\bm{\theta}) \) is exactly the denominator of Bayes' rule.

Learning in a GP

$$ p(f|y, \theta) = \dfrac{\overbrace{p(y|f)}^{\textrm{likelihood}}\,\overbrace{p(f|\theta)}^{\textrm{prior}}}{p(y|\theta)} $$

$$ \begin{aligned} \bm{\theta_{*}} &= \argmax_{\bm{\theta}}\mathcal{L}(\bm{\theta}) \\ p(y|\theta) &= \int p(y|f)\,p(f|\theta)\,df = \mathcal{N}(0, K_{\theta} + \sigma^{2}_{n}\mathbb{I})\\ \textrm{where } \mathcal{L}(\bm{\theta}) = \log p(y|\theta) &= -\overbrace{\frac{1}{2}y^{T}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}y}^{\textrm{model fit}} -\overbrace{\frac{1}{2}\log|K_{\theta} + \sigma^{2}_{n}\mathbb{I}|}^{\textrm{complexity penalty}} - \dfrac{N}{2}\log 2\pi \end{aligned} $$
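A minimal sketch of this learning step, assuming NumPy/SciPy, an RBF kernel, log-parameterised hyperparameters \( (\ell, \sigma_{f}^{2}, \sigma_{n}^{2}) \), and toy data (all names and values here are illustrative, not the implementation behind these slides):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def rbf_kernel(x1, x2, lengthscale, variance):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

def neg_log_marginal_likelihood(log_theta, X, y):
    """-log p(y | theta) for theta = (lengthscale, signal variance, noise variance)."""
    lengthscale, variance, noise = np.exp(log_theta)   # log-params stay positive
    K = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(len(X))
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), y)
    model_fit = 0.5 * y @ alpha                  # 1/2 y^T (K + sigma^2 I)^{-1} y
    complexity = np.sum(np.log(np.diag(L)))      # 1/2 log|K + sigma^2 I|
    constant = 0.5 * len(X) * np.log(2.0 * np.pi)
    return model_fit + complexity + constant

# Toy data: noisy sine; maximise the marginal likelihood w.r.t. theta.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(X) + 0.1 * rng.standard_normal(40)
result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
print(np.exp(result.x))  # learned (lengthscale, signal variance, noise variance)
```

The Cholesky factorisation gives both the quadratic "model fit" term and the log-determinant "complexity penalty" without forming an explicit matrix inverse.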

Popular Kernels

Usually, the user picks one on the basis of prior knowledge.

Each kernel depends on some hyperparameters \( \theta\), which are tuned in the training step.
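As a minimal sketch of a few commonly used kernels (squared-exponential, periodic, Matérn-3/2), assuming NumPy and 1-d inputs; every hyperparameter below is one of the \( \theta \) tuned in the training step:

```python
import numpy as np

def squared_exponential(x1, x2, lengthscale=1.0, variance=1.0):
    """Smooth, infinitely differentiable functions."""
    r2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * r2 / lengthscale ** 2)

def periodic(x1, x2, lengthscale=1.0, period=1.0, variance=1.0):
    """Functions that repeat with the given period."""
    d = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

def matern32(x1, x2, lengthscale=1.0, variance=1.0):
    """Rougher, once-differentiable functions."""
    r = np.abs(x1[:, None] - x2[None, :])
    scaled = np.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + scaled) * np.exp(-scaled)
```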

Predictions in a GP

We want to infer latent function values \(f_{*}\) at arbitrary input locations \( X_{*} \), so in a distributional sense we want:

$$ p(f_{*}|X_{*}, y, \theta) $$

Posterior Predictive Distribution

The predictive posterior is closed form (because we are operating in a world of Gaussians):

  • Write down the joint.
  • Derive the conditional from the joint (Gaussian identity).
$$ \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_{1} \\ \mu_{2} \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^{T} & \Sigma_{22} \end{bmatrix} \right) \hspace{10pt} \textrm{(Joint)} $$

$$ \begin{aligned} x_{1}|x_{2} &\sim \mathcal{N}(\mu_{3}, \Sigma_{3}) \hspace{10pt} \textrm{(Conditional)}\\ \mu_{3} &= \mu_{1} + \Sigma_{12}\Sigma_{22}^{-1}(x_{2} - \mu_{2}) \\ \Sigma_{3} &= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{aligned} $$

The other conditional, \( x_{2}|x_{1} \), can be derived using symmetry arguments.
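A minimal numerical check of this conditioning identity, assuming NumPy and an arbitrary illustrative 2-d Gaussian:

```python
import numpy as np

# Joint Gaussian over (x1, x2) with arbitrary illustrative parameters.
mu = np.array([1.0, -2.0])                # [mu_1, mu_2]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])            # [[S11, S12], [S12^T, S22]]

# Condition x1 on an observed value of x2 using the identities above.
x2_obs = 0.5
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_obs - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(mu_cond, var_cond)   # mean and variance of x1 | x2 = 0.5
```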

Predictions in a GP

Joint:

$$ \begin{aligned} p(f, f_{*}|X, X_{*}, \theta) &= \mathcal{N} \left( 0, \begin{bmatrix} k(X,X) & k(X, X_{*}) \\ k(X_{*}, X) & k(X_{*}, X_{*}) \end{bmatrix} \right) = \mathcal{N}\left(0, \begin{bmatrix} K & K_{*}^T \\ K_{*} & K_{**} \end{bmatrix}\right) \\ p(y, f_{*}|X, X_{*}, \theta) &= \mathcal{N}\left(0, \begin{bmatrix} K + \sigma^{2}_{n}\mathbb{I} & K_{*}^T \\ K_{*} & K_{**} \end{bmatrix}\right) \end{aligned} $$

Conditional:

$$ \begin{aligned} p(f_{*}|y, X_{*}, X, \theta_{*}) &= \mathcal{N}( \mu_{*}, \hspace{5pt} \Sigma_{*}) \\ \mu_{*} &= K_{*}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}y \\ \Sigma_{*} &= K_{**} - K_{*}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}K_{*}^T \end{aligned} $$
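Putting the two pieces together, a minimal sketch of GP prediction (assuming NumPy/SciPy, the RBF kernel from the earlier sketch, and hyperparameters already fitted in the learning step; not the speaker's implementation):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

def gp_predict(X, y, X_star, lengthscale, variance, noise):
    """Posterior predictive mean and covariance of f_* at the test inputs X_star."""
    K = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(len(X))
    K_star = rbf_kernel(X_star, X, lengthscale, variance)             # K_*
    K_star_star = rbf_kernel(X_star, X_star, lengthscale, variance)   # K_**
    L, lower = cho_factor(K, lower=True)
    mu_star = K_star @ cho_solve((L, lower), y)                       # K_* (K + s^2 I)^{-1} y
    cov_star = K_star_star - K_star @ cho_solve((L, lower), K_star.T)
    return mu_star, cov_star

# Toy usage; the hyperparameter values here are assumed, not learned.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(X) + 0.1 * rng.standard_normal(40)
mu_star, cov_star = gp_predict(X, y, np.linspace(-4, 4, 100), 1.0, 1.0, 0.01)
```

The diagonal of `cov_star` gives the pointwise predictive uncertainty that is usually plotted as a shaded band around the mean.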

Examples of GP Regression

Examples of GP Regression

Figure: ground truth vs. GP reconstruction.

Examples of GP Regression

Gaussian processes can also be used in contexts where the observations are a gigantic data matrix \( Y \equiv \{ y_{n}\}_{n=1}^{N}, y_{n} \in \mathbb{R}^{D}\). \(D\) can be pretty big, on the order of thousands.

Imagine a stack of images, where each image has been flattened into a vector of pixels and the vectors are stacked row-wise in a matrix.

Figure: a stack of \(N\) images, each a \(28 \times 28\) grid flattened into a row of \(D = 784\) pixels, giving an \(N \times D\) data matrix.

The Gaussian process bridge

Schematic of a Gaussian process Latent Variable Model: a GP bridges a 2d latent space and the high-dimensional (\(N \times D\)) data space.

$$ F = \begin{bmatrix} \vdots & \vdots & \ldots & \ldots & \vdots \\ f_{1} & f_{2} & \ldots & \ldots & f_{D} \\ \vdots & \vdots & \ldots & \ldots & \vdots \end{bmatrix}_{N \times D} $$

Structure / clustering in latent space can reveal insights into the high-dimensional data - for instance, which points are similar.

Each cluster in the latent space is a digit (coloured by labels).

\( Z \in \mathbb{R}^{N \times Q}\) (latent inputs), \( F \in \mathbb{R}^{N \times D}\) (latent function values), \( Y \in \mathbb{R}^{N \times D}\) (observations, \( Y = F + \textrm{noise} \)).

Mathematical set-up

Data Likelihood:

$$ p(Y|f, Z) = \displaystyle \prod_{n=1}^{N}\prod_{d=1}^{D}\mathcal{N}(y_{n,d};\, f_{d}(\bm{z}_{n}), \sigma^{2}_{n}) $$

Prior structure:

$$ p(f|Z) = \displaystyle \prod_{d=1}^{D}\mathcal{N}(f_{d}; 0, K_{d}) $$

$$ F = \begin{bmatrix} \vdots & \vdots & \ldots & \ldots & \vdots \\ f_{1} & f_{2} & \ldots & \ldots & f_{D} \\ \vdots & \vdots & \ldots & \ldots & \vdots \end{bmatrix}_{N \times D} $$

The data are stacked row-wise but modelled column-wise, each column with a GP.

Illustration of the \(N \times D\) data matrix: each row \( \bm{y}_{n} \) is paired with a latent input \( \bm{z}_{n} \in Z \), and each column \( y_{d} \) is modelled with its own GP.


Optimisation objective: 

$$ \int \ldots \int p(Y|f, Z)\,p(f|Z)\,df_{1}\ldots df_{D} = \prod_{d=1}^{D}p(y_{d}|Z) = \prod_{d=1}^{D}\mathcal{N}(y_{d};\, 0, K_{d}+\sigma^{2}_{n}\mathbb{I}) $$

$$ Z_{*}, \bm{\theta}_{*} = \argmax_{Z, \bm{\theta}} \prod_{d=1}^{D}\mathcal{N}(y_{d};\, 0, K_{d}+\sigma^{2}_{n}\mathbb{I}) $$
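A minimal sketch of this objective (assuming NumPy/SciPy, one shared RBF kernel across all output dimensions, and a tiny randomly generated data matrix; all names are illustrative). A practical GPLVM would optimise \(Z\) and \(\theta\) jointly, e.g. with gradient-based methods, rather than evaluate the objective once:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(Z, lengthscale=1.0, variance=1.0):
    """RBF kernel on latent inputs Z of shape (N, Q)."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def gplvm_log_marginal(Z, Y, lengthscale=1.0, variance=1.0, noise=0.1):
    """sum_d log N(y_d; 0, K + sigma^2 I): the GPLVM objective for latent inputs Z."""
    N, D = Y.shape
    K = rbf_kernel(Z, lengthscale, variance) + noise * np.eye(N)
    L, lower = cho_factor(K, lower=True)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))        # log|K + sigma^2 I|
    quad = np.sum(Y * cho_solve((L, lower), Y))      # sum_d y_d^T (K + sigma^2 I)^{-1} y_d
    return -0.5 * (quad + D * logdet + N * D * np.log(2.0 * np.pi))

# Purely illustrative: random latents Z and data Y; the optimisation over Z and
# theta (e.g. via autodiff or scipy.optimize on the flattened Z) is omitted here.
rng = np.random.default_rng(3)
Z = rng.standard_normal((50, 2))      # N = 50 latents, Q = 2
Y = rng.standard_normal((50, 784))    # N x D data matrix
print(gplvm_log_marginal(Z, Y))
```

Because all \(D\) output dimensions share the same kernel matrix \(K\), the objective factorises into one Cholesky solve reused across columns.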

Application: disentanglement of cell cycle and treatment effects (latent dimensions labelled treatment, latent batch effect, and cell cycle phase).

Robust to Missing Data: MNIST Reconstruction

Reconstructions at 30% and 60% missing data.

Robust to Missing Data: Motion Capture 

Thank you! 

 

vr308@cam.ac.uk

@VRLalchand
