On Gaussian Processes

Vidhi Lalchand

 

MIT Kavli Institute for Astrophysics and Space Research

22-03-2023 

 

Research Seminar

"Functions describe the world"

- Thomas Garrity

Outline

Gaussian processes as a "function" learning paradigm.

Regression with GPs: Both inputs \((X)\) and outputs \( (Y)\) are observed.

Latent Variable Modelling with GP: Only outputs \( (Y)\) are observed.

Without loss of generality, we are going to assume \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times D}\) and \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times 1}\)

Hot take: Almost all machine learning comes down to modelling functions.


Model selection is a hard problem!

What if we were not forced to decide the complexity of \( f\) at the outset? What if \(f\) could calibrate its complexity on the fly as it sees the data? This is precisely what is called non-parametric learning.

Gaussian Processes

Gaussian processes are a powerful non-parametric paradigm for performing state-of-the-art regression.

  • They are probabilistic \( \rightarrow\) user has a sense of prediction uncertainty.
  • They don't have standard parameters \( \rightarrow\) they model the mapping \( f \) directly by placing a prior in the space of functions!

We need to understand the notion of distribution over functions.

  • When we think of a function in a mathematical sense we immediately try to think of a parametric form. For example, \( 5x -2, x^2, 3x^3 - x, e^{x} \).
  • But in GP world there is a fundamental shift in thinking about functions. We completely abandon the parametric form viewpoint.
  • A continuous function \( f \) on the real domain \( \mathbb{R}^{d} \) can be thought of as an infinitely long vector evaluated at some index set \( [x_{1}, x_{2}, \ldots]\).

  • \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots]\)

  • Gaussian processes are probability distributions over functions!

Interpretation of functions

Sticking point: We cannot represent infinite dimensional vectors on a computer... true, but bear with me.

$$ f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime})) $$

\(m(x)\) is a mean function.

\(k(x, x')\) is a covariance function.

What is a GP?

The most intuitive way of understanding GPs is understanding the correspondence between Gaussian distributions and Gaussian processes.

What is a GP?

A sample from a \(k\)-dimensional Gaussian \( \mathbf{x} \sim \mathcal{N}(\mu, \Sigma) \) is a vector of size \(k\). $$ \mathbf{x} = [x_{1}, \ldots, x_{k}] $$

The mathematical crux of a GP is that \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots, f(x_{N})]\) is just an \(N\)-dimensional multivariate Gaussian \( \mathcal{N}(\mu, K) \).

$$ \begin{bmatrix} f_{1} \\ \vdots\\ f_{499} \\ f_{500} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_{1} \\ \vdots\\ \mu_{499} \\ \mu_{500} \end{bmatrix}, \begin{bmatrix} k(x_{1}, x_{1}) & \ldots & k(x_{1}, x_{500}) \\ \vdots & \ddots & \vdots \\ k(x_{500}, x_{1}) & \ldots & k(x_{500}, x_{500}) \end{bmatrix} \right) $$

A GP is an infinite dimensional analogue of a Gaussian distribution \( \rightarrow \) so is a sample from it a vector of infinite length?

$$ f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime})) $$

But in practice we only ever need to represent our function \( f(x) \) at a finite index set \( \mathcal{X} = [x_{1},\ldots, x_{500}]\). So we are interested in our long function vector \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots, f(x_{500})]\).

Function samples from a GP

$$ f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime})) $$

The kernel function \( k(x,x')\) is the heart of a GP: it controls the inductive biases of our function space, such as shape, periodicity, and smoothness.

Prior over functions: sample draws from a zero-mean GP prior under different kernel functions.

In reality, they are just draws from a multivariate Gaussian \( \mathcal{N}(0, K)\) where the covariance matrix has been evaluated by applying the kernel function to all pairs of data points.

Infinite dimensional prior:

\(f(x) \sim \mathcal{GP}(m(x), k_{\theta}(x,x^{\prime})) \)

For a finite set of points \( X \):

\(f(X) \sim \mathcal{N}(m(X), K_{X})\)

\( k_{\theta}(x,x^{\prime})\) encodes the support and inductive biases in function space.
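To make the finite index set view concrete, here is a minimal sketch (not code from these slides; assuming NumPy, a squared-exponential kernel, and arbitrarily chosen hyperparameters) that draws function samples from a zero-mean GP prior by sampling from \( \mathcal{N}(0, K_{X}) \):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel evaluated on all pairs of 1-d inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Finite index set: 500 points standing in for the "infinitely long" function vector.
X = np.linspace(-5, 5, 500)
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))  # small jitter for numerical stability

# A function sample is just a draw from the 500-dimensional Gaussian N(0, K).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(X)), cov=K, size=3)
print(samples.shape)  # (3, 500): three function draws evaluated at the index set
```

Plotting each row of `samples` against `X` reproduces the kind of prior draws shown above.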

Gaussian Process Regression

How do we fit functions to noisy data with GPs?

1. Given some noisy data \( \bm{y} = \lbrace{y_{i}}\rbrace_{i=1}^{N} \) observed at input locations \( X = \{ x_{i}\}_{i=1}^N\).

2. You believe your data comes from a function \( f\) corrupted by Gaussian noise.  

$$ \bm{y} = f(X) + \epsilon, \hspace{10pt} \epsilon \sim \mathcal{N}(0, \sigma^{2})$$

Data Likelihood: \( \hspace{10pt}  y|f \sim \mathcal{N}(f(X), \sigma^{2}\mathbb{I}) \)

Prior over functions: \( f|\theta \sim \mathcal{GP}(0, k_{\theta}) \)

(The choice of kernel function \( k_{\theta}\) controls how your function space looks.)

For the observed inputs \( X \), this prior reduces to:

$$ \begin{aligned} \bm{f} &\sim \mathcal{N}(0, K_{\theta}) \\ K_{i,j} &= k_{\theta}(x_{i}, x_{j}) \hspace{2mm} \forall i, j \end{aligned} $$

...but we still need to fit the kernel hyperparameters \( \theta \).

Learning Step:

 

Data likelihood: \( p(\bm{y}|\bm{f}) = \mathcal{N}(\bm{y}; \bm{f}, \sigma_{n}^{2}\mathbb{I}) \). Prior: \( p(\bm{f}|\bm{\theta}) = \mathcal{N}(\bm{f}; \bm{0}, K_{\theta}) \).

$$ \begin{aligned} p(\bm{y}|\bm{\theta}) &= \int p(\bm{y}|\bm{f})\,p(\bm{f}|\bm{\theta})\,d\bm{f}\\ &= \int \mathcal{N}(\bm{y}; \bm{f}, \sigma_{n}^{2}\mathbb{I})\,\mathcal{N}(\bm{f}; \bm{0}, K_{\theta})\,d\bm{f} \\ &= \mathcal{N}(\bm{0}, K_{\theta} + \sigma^{2}_{n}\mathbb{I}) \end{aligned} $$

$$ \bm{\theta_{\star}} = \argmax_{\bm{\theta}}\log p(\bm{y}|\bm{\theta}) $$

Learning in Gaussian process models occurs through the maximisation of the marginal likelihood w.r.t. the kernel hyperparameters. The marginal likelihood \( p(\bm{y}|\bm{\theta}) \) is exactly the denominator of Bayes' rule.

Learning in a GP

$$ p(f|y, \theta) = \dfrac{\overbrace{p(y|f)}^{\textrm{likelihood}}\,\overbrace{p(f|\theta)}^{\textrm{prior}}}{p(y|\theta)} $$

$$ \begin{aligned} \bm{\theta_{*}} &= \argmax_{\bm{\theta}}\mathcal{L}(\bm{\theta}) \\ p(y|\theta) &= \int p(y|f)\,p(f|\theta)\,df = \mathcal{N}(0, K_{\theta} + \sigma^{2}_{n}\mathbb{I})\\ \textrm{where } \mathcal{L}(\bm{\theta}) = \log p(y|\theta) &= -\overbrace{\frac{1}{2}y^{T}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}y}^{\textrm{model fit}} -\overbrace{\frac{1}{2}\log|K_{\theta} + \sigma^{2}_{n}\mathbb{I}|}^{\textrm{complexity penalty}} - \dfrac{N}{2}\log 2\pi \end{aligned} $$
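A minimal sketch of this learning step, assuming NumPy/SciPy, an RBF kernel, log-parameterised hyperparameters \( (\ell, \sigma_{f}^{2}, \sigma_{n}^{2}) \), and toy data (all names and values here are illustrative, not the implementation behind these slides):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def rbf_kernel(x1, x2, lengthscale, variance):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

def neg_log_marginal_likelihood(log_theta, X, y):
    """-log p(y | theta) for theta = (lengthscale, signal variance, noise variance)."""
    lengthscale, variance, noise = np.exp(log_theta)   # log-params stay positive
    K = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(len(X))
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), y)
    model_fit = 0.5 * y @ alpha                  # 1/2 y^T (K + sigma^2 I)^{-1} y
    complexity = np.sum(np.log(np.diag(L)))      # 1/2 log|K + sigma^2 I|
    constant = 0.5 * len(X) * np.log(2.0 * np.pi)
    return model_fit + complexity + constant

# Toy data: noisy sine; maximise the marginal likelihood w.r.t. theta.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(X) + 0.1 * rng.standard_normal(40)
result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
print(np.exp(result.x))  # learned (lengthscale, signal variance, noise variance)
```

The Cholesky factorisation gives both the quadratic "model fit" term and the log-determinant "complexity penalty" without forming an explicit matrix inverse.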

Popular Kernels

Usually, the user picks one on the basis of prior knowledge.

Each kernel depends on some hyperparameters \( \theta\), which are tuned in the training step.
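As a minimal sketch of a few commonly used kernels (squared-exponential, periodic, Matérn-3/2), assuming NumPy and 1-d inputs; every hyperparameter below is one of the \( \theta \) tuned in the training step:

```python
import numpy as np

def squared_exponential(x1, x2, lengthscale=1.0, variance=1.0):
    """Smooth, infinitely differentiable functions."""
    r2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * r2 / lengthscale ** 2)

def periodic(x1, x2, lengthscale=1.0, period=1.0, variance=1.0):
    """Functions that repeat with the given period."""
    d = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

def matern32(x1, x2, lengthscale=1.0, variance=1.0):
    """Rougher, once-differentiable functions."""
    r = np.abs(x1[:, None] - x2[None, :])
    scaled = np.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + scaled) * np.exp(-scaled)
```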

Predictions in a GP

We want to infer latent function values \(f_{*}\) at arbitrary input locations \( X_{*} \), so in a distributional sense we want:

$$ p(f_{*}|X_{*}, y, \theta) $$

Posterior Predictive Distribution

The predictive posterior is closed form (because we are operating in a world of Gaussians):

  • Write down the joint.
  • Derive the conditional from the joint (Gaussian identity).
$$ \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_{1} \\ \mu_{2} \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^{T} & \Sigma_{22} \end{bmatrix} \right) \hspace{10pt} \textrm{(Joint)} $$

$$ \begin{aligned} x_{1}|x_{2} &\sim \mathcal{N}(\mu_{3}, \Sigma_{3}) \hspace{10pt} \textrm{(Conditional)}\\ \mu_{3} &= \mu_{1} + \Sigma_{12}\Sigma_{22}^{-1}(x_{2} - \mu_{2}) \\ \Sigma_{3} &= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{aligned} $$

The other conditional, \( x_{2}|x_{1} \), can be derived using symmetry arguments.
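A minimal numerical check of this conditioning identity, assuming NumPy and an arbitrary illustrative 2-d Gaussian:

```python
import numpy as np

# Joint Gaussian over (x1, x2) with arbitrary illustrative parameters.
mu = np.array([1.0, -2.0])                # [mu_1, mu_2]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])            # [[S11, S12], [S12^T, S22]]

# Condition x1 on an observed value of x2 using the identities above.
x2_obs = 0.5
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_obs - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(mu_cond, var_cond)   # mean and variance of x1 | x2 = 0.5
```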

Predictions in a GP

Joint:

$$ \begin{aligned} p(f, f_{*}|X, X_{*}, \theta) &= \mathcal{N} \left( 0, \begin{bmatrix} k(X,X) & k(X, X_{*}) \\ k(X_{*}, X) & k(X_{*}, X_{*}) \end{bmatrix} \right) = \mathcal{N}\left(0, \begin{bmatrix} K & K_{*}^T \\ K_{*} & K_{**} \end{bmatrix}\right) \\ p(y, f_{*}|X, X_{*}, \theta) &= \mathcal{N}\left(0, \begin{bmatrix} K + \sigma^{2}_{n}\mathbb{I} & K_{*}^T \\ K_{*} & K_{**} \end{bmatrix}\right) \end{aligned} $$

Conditional:

$$ \begin{aligned} p(f_{*}|y, X_{*}, X, \theta_{*}) &= \mathcal{N}( \mu_{*}, \hspace{5pt} \Sigma_{*}) \\ \mu_{*} &= K_{*}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}y \\ \Sigma_{*} &= K_{**} - K_{*}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}K_{*}^T \end{aligned} $$
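Putting the two pieces together, a minimal sketch of GP prediction (assuming NumPy/SciPy, the RBF kernel from the earlier sketch, and hyperparameters already fitted in the learning step; not the speaker's implementation):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

def gp_predict(X, y, X_star, lengthscale, variance, noise):
    """Posterior predictive mean and covariance of f_* at the test inputs X_star."""
    K = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(len(X))
    K_star = rbf_kernel(X_star, X, lengthscale, variance)             # K_*
    K_star_star = rbf_kernel(X_star, X_star, lengthscale, variance)   # K_**
    L, lower = cho_factor(K, lower=True)
    mu_star = K_star @ cho_solve((L, lower), y)                       # K_* (K + s^2 I)^{-1} y
    cov_star = K_star_star - K_star @ cho_solve((L, lower), K_star.T)
    return mu_star, cov_star

# Toy usage; the hyperparameter values here are assumed, not learned.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(X) + 0.1 * rng.standard_normal(40)
mu_star, cov_star = gp_predict(X, y, np.linspace(-4, 4, 100), 1.0, 1.0, 0.01)
```

The diagonal of `cov_star` gives the pointwise predictive uncertainty that is usually plotted as a shaded band around the mean.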

Examples of GP Regression

Examples of GP Regression

Figure: ground truth vs. GP reconstruction.

Examples of GP Regression

Gaussian processes can also be used in contexts where the observations are a gigantic data matrix \( Y \equiv \{ y_{n}\}_{n=1}^{N}, y_{n} \in \mathbb{R}^{D}\). \(D\) can be pretty big, on the order of thousands.

Imagine a stack of images, where each image has been flattened into a vector of pixels and the vectors are stacked row-wise in a matrix.

Figure: a stack of \(N\) images, each a \(28 \times 28\) grid flattened into a row of \(D = 784\) pixels, giving an \(N \times D\) data matrix.

The Gaussian process bridge

Schematic of a Gaussian process Latent Variable Model: a GP bridges a 2d latent space and the high-dimensional (\(N \times D\)) data space.

$$ F = \begin{bmatrix} \vdots & \vdots & \ldots & \ldots & \vdots \\ f_{1} & f_{2} & \ldots & \ldots & f_{D} \\ \vdots & \vdots & \ldots & \ldots & \vdots \end{bmatrix}_{N \times D} $$

Structure / clustering in latent space can reveal insights into the high-dimensional data - for instance, which points are similar.

Each cluster in the latent space is a digit (coloured by labels).

\( Z \in \mathbb{R}^{N \times Q}\) (latent inputs), \( F \in \mathbb{R}^{N \times D}\) (latent function values), \( Y \in \mathbb{R}^{N \times D}\) (observations, \( Y = F + \textrm{noise} \)).

Mathematical set-up

Data Likelihood:

$$ p(Y|f, Z) = \displaystyle \prod_{n=1}^{N}\prod_{d=1}^{D}\mathcal{N}(y_{n,d};\, f_{d}(\bm{z}_{n}), \sigma^{2}_{n}) $$

Prior structure:

$$ p(f|Z) = \displaystyle \prod_{d=1}^{D}\mathcal{N}(f_{d}; 0, K_{d}) $$

$$ F = \begin{bmatrix} \vdots & \vdots & \ldots & \ldots & \vdots \\ f_{1} & f_{2} & \ldots & \ldots & f_{D} \\ \vdots & \vdots & \ldots & \ldots & \vdots \end{bmatrix}_{N \times D} $$

The data are stacked row-wise but modelled column-wise, each column with a GP.

Illustration of the \(N \times D\) data matrix: each row \( \bm{y}_{n} \) is paired with a latent input \( \bm{z}_{n} \in Z \), and each column \( y_{d} \) is modelled with its own GP.


Optimisation objective: 

$$ \int \ldots \int p(Y|f, Z)\,p(f|Z)\,df_{1}\ldots df_{D} = \prod_{d=1}^{D}p(y_{d}|Z) = \prod_{d=1}^{D}\mathcal{N}(y_{d};\, 0, K_{d}+\sigma^{2}_{n}\mathbb{I}) $$

$$ Z_{*}, \bm{\theta}_{*} = \argmax_{Z, \bm{\theta}} \prod_{d=1}^{D}\mathcal{N}(y_{d};\, 0, K_{d}+\sigma^{2}_{n}\mathbb{I}) $$
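A minimal sketch of this objective (assuming NumPy/SciPy, one shared RBF kernel across all output dimensions, and a tiny randomly generated data matrix; all names are illustrative). A practical GPLVM would optimise \(Z\) and \(\theta\) jointly, e.g. with gradient-based methods, rather than evaluate the objective once:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(Z, lengthscale=1.0, variance=1.0):
    """RBF kernel on latent inputs Z of shape (N, Q)."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def gplvm_log_marginal(Z, Y, lengthscale=1.0, variance=1.0, noise=0.1):
    """sum_d log N(y_d; 0, K + sigma^2 I): the GPLVM objective for latent inputs Z."""
    N, D = Y.shape
    K = rbf_kernel(Z, lengthscale, variance) + noise * np.eye(N)
    L, lower = cho_factor(K, lower=True)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))        # log|K + sigma^2 I|
    quad = np.sum(Y * cho_solve((L, lower), Y))      # sum_d y_d^T (K + sigma^2 I)^{-1} y_d
    return -0.5 * (quad + D * logdet + N * D * np.log(2.0 * np.pi))

# Purely illustrative: random latents Z and data Y; the optimisation over Z and
# theta (e.g. via autodiff or scipy.optimize on the flattened Z) is omitted here.
rng = np.random.default_rng(3)
Z = rng.standard_normal((50, 2))      # N = 50 latents, Q = 2
Y = rng.standard_normal((50, 784))    # N x D data matrix
print(gplvm_log_marginal(Z, Y))
```

Because all \(D\) output dimensions share the same kernel matrix \(K\), the objective factorises into one Cholesky solve reused across columns.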

Application: disentanglement of cell cycle and treatment effects (latent dimensions labelled treatment, latent batch effect, and cell cycle phase).

Robust to Missing Data: MNIST Reconstruction

Reconstructions at 30% and 60% missing data.

Robust to Missing Data: Motion Capture 

Thank you! 

 

vr308@cam.ac.uk

@VRLalchand
