Machine Learning for

Time Series Analysis VIII

Gaussian Processes

Fall 2022 - UDel PHYS 667
dr. federica bianco 

 

@fedhere

MLTSA:

 

missing data

1

What to do if you have missing data?

 

- most models will fail

 

- aggregate statistics will be biased

 

MLTSA:

missing data

MLTSA:

reminder: Stochastic process

A random variable indexed by time. 

For any subset of points in time, the dependent variable follows a probability distribution,

 e.g. 

p(x_{t_1}, \ldots, x_{t_n}) \sim N(\mu, \sigma)
import numpy as np
import matplotlib.pylab as pl

pl.figure(figsize=(20, 5))
N = 200
np.random.seed(100)
y = np.random.randn(N)                    # white Gaussian noise
t = np.linspace(0, N, N, endpoint=False)  # integer time stamps
pl.plot(t, y, lw=2)
pl.xlabel("time")
pl.ylabel("y");

Discrete time stochastic process

# any two segments of the series follow the same distribution
pl.hist(y[20:70])
pl.hist(y[100:150])

randomly distributed missing observations

MLTSA:

kinds of missing data

e.g. on/off times of sensors

aggregate statistics should be preserved if the process is stochastic
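A minimal numpy sketch of this claim (the missing fraction and seed are arbitrary choices): knock out observations at random and check that the aggregate statistics survive.

import numpy as np

np.random.seed(100)
N = 200
y = np.random.randn(N)

# remove ~20% of the observations at random
missing = np.random.rand(N) < 0.2
y_obs = y.copy()
y_obs[missing] = np.nan

# aggregate statistics are approximately preserved
print(np.nanmean(y_obs), y.mean())   # both ~0
print(np.nanstd(y_obs), y.std())     # both ~1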

missing data due to sensors' sensitivity

MLTSA:

kinds of missing data

e.g. CCD cameras: need minimum light to generate signal

censored data (upper or lower limits)

aggregate statistics will be biased - specifically, the variance will always be suppressed

data aggregated above or below a threshold

MLTSA:

kinds of missing data

e.g. medical records of people older than 90 are often aggregated as >90.

aggregate statistics will be biased - specifically, the variance will always be suppressed

censored data (upper or lower limits)

pandas imputation methods


MLTSA:

the missing data problem

MLTSA:

kinds of missing data

most models do not work with missing data

MLTSA:

 

data imputation

2

Impute data with mean:

 

If your goal was to estimate the mean, you may be OK; if your goal was to estimate the variance, you have effectively suppressed it.

MLTSA:

data imputation

Impute data with mean:

 

If your goal was to estimate the mean, you may be OK; if your goal was to estimate the variance, you have effectively suppressed it. In the time domain it may also cause significant jumps.

A local mean would be better, though.
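A minimal pandas sketch contrasting the two (the synthetic series, missing fraction, and window size are arbitrary choices):

import numpy as np
import pandas as pd

np.random.seed(100)
s = pd.Series(np.sin(np.arange(200) / 10) + 0.3 * np.random.randn(200))
s[s.sample(frac=0.2, random_state=0).index] = np.nan   # 20% missing

# global mean: preserves the mean, suppresses the variance, causes jumps
imputed_global = s.fillna(s.mean())

# local mean: a centered rolling mean respects the local trend
imputed_local = s.fillna(s.rolling(20, center=True, min_periods=1).mean())

print(s.var(), imputed_global.var(), imputed_local.var())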

MLTSA:

data imputation

Backward and forward filling:

 

particularly dangerous in time series analysis: you are implying stability in the system at scales where there is no stability!
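A minimal pandas sketch of both fills (toy values):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# forward fill: propagate the last valid observation forward in time
print(s.ffill().tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0, 6.0]

# backward fill: propagate the next valid observation backward in time
print(s.bfill().tolist())   # [1.0, 4.0, 4.0, 4.0, 6.0, 6.0]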

MLTSA:

data imputation

Linear, quadratic, cubic interpolation: it is model dependent. Unless you have a reason to prefer the model, you are constraining the data and will likely not get a good fit. Increased flexibility in the model allows a better fit but exposes you to the risk of overfitting.
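A minimal pandas sketch (arbitrary toy series; the quadratic and cubic methods require scipy). Each method imposes a different model on the gaps:

import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 2.0, np.nan, np.nan, 5.0, 4.0, np.nan, 2.0])

# compare the models each interpolation method imposes
for method in ("linear", "quadratic", "cubic"):
    print(method, s.interpolate(method=method).round(2).tolist())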

MLTSA:

data imputation

linear interpolation

quadratic interpolation

The Nearest Neighbors algorithm can be used to impute. For time series in particular, this can be done along the time axis or along the feature axis.
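A minimal scikit-learn sketch of imputation along the time axis (toy series; the number of neighbours is an arbitrary choice). Stacking time as a feature makes the nearest neighbours of a missing value the points closest in time:

import numpy as np
from sklearn.impute import KNNImputer

np.random.seed(0)
t = np.arange(50, dtype=float)
y = np.sin(t / 5) + 0.1 * np.random.randn(50)
y[np.random.choice(50, 10, replace=False)] = np.nan

# with (t, y) as features, distances for rows with missing y
# are computed from t alone, so neighbours are close in time
X = np.column_stack([t, y])
y_imputed = KNNImputer(n_neighbors=3).fit_transform(X)[:, 1]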

MLTSA:

data imputation - kNN

MLTSA:

 

Gaussian Processes

3

MLTSA:

Gaussian Processes:

 

Gaussian Processes are a probabilistic framework for missing data imputation

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Advantages

 

Provides a framework to estimate values of missing data and their uncertainty

 

 

 

 

the uncertainty will take into account how isolated the point is and how "predictable" (i.e. stationary) the process is

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Advantages

 

Provides a framework to estimate values of missing data and their uncertainty

 

This is a Bayesian framework. 

It is however NOT entirely a non-parametric method: you need to define a model (embedded in the kernel function, as we will see in a few slides)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Advantages

 

It's computationally "tractable"

The framework defines an infinite set of "functions" that can be evaluated at each point: a single calculation provides the answer to the interpolation at any N points.

 

Each pair of function evaluations follows a bivariate normal distribution 

(y_1, y_2) \sim N(\overrightarrow{\mu}, \Sigma)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Applications


Prediction:

Predicting data based on the training set (this can naturally be done in a regression but also in a classification context - Gaussian Process classifiers)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Applications


Data imputation:

Filling in missing data and generating data at regular intervals of the exogenous variable (really important, since most time-domain models require data to be evenly sampled)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Applications


Survey design:

Identify where the uncertainty is largest to decide where to make observations (e.g. geospatial applications: where do I put my sensors? Time domain: when do I "mine" data from an existing but hard-to-access dataset?)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

 

Parametric (simple) approach - define a functional form (e.g. linear).

  • Problem: if the functional form is not the same as the generative process, the data will be fit poorly.
  • Not a solution: increasing the number of parameters in the function may improve the fit but induces overfitting and loss of generalization

Data imputation requires definition of a function:

linear fit

10th degree polynomial fit

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

 

Probabilistic approach - consider all possible functions but assign a prior to the ones that are more "likely"

  • Problem: naively, there are infinity^infinity functions (an infinity of functions in an infinite parameter space)
  • Not a problem!: it turns out this infinite set of functions can be described by its aggregate statistics (like an infinite set of numbers can be described by its distribution)

Data imputation requires definition of a function:

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

A process is a function, i.e. something that can be evaluated at any value of a (1- or N-dimensional) variable.

 

A process is a collection of random variables

MLTSA:

 

GP definitions

3.1

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

A process is a collection of random variables

 

A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution.

MLTSA:

Gaussian Processes:

math: Multivariate Gaussian distributions

multivariate Gaussian distribution

μ: vector of means (expectation values)

σ: vector of standard deviations

Σ: covariance matrix

\mu = [\mu_1 \ldots \mu_N] \\ \sigma = [\sigma_1 \ldots \sigma_N] \\ \Sigma_{ij} = \mathrm{cov}(x_i, x_j), \quad \Sigma_{ii} = \sigma_i^2
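A minimal numpy sketch with an arbitrary 3-dimensional example: μ and Σ fully specify the distribution, and the diagonal of Σ carries the variances σ_i²:

import numpy as np

mu = np.array([0.0, 1.0, -1.0])        # vector of means
Sigma = np.array([[1.0, 0.5, 0.0],     # covariance matrix:
                  [0.5, 2.0, 0.3],     # Sigma_ij = cov(x_i, x_j)
                  [0.0, 0.3, 0.5]])    # Sigma_ii = sigma_i^2

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, Sigma, size=100000)
print(x.mean(axis=0))    # ~ mu
print(x.var(axis=0))     # ~ [1.0, 2.0, 0.5], the diagonal of Sigma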


MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

A Gaussian Process is entirely specified by its mean and covariance function.

The mean and covariance functions can be static or time-evolving

m(\mathbf{x}) = E\left[f(\mathbf{x})\right] \\ k(\mathbf{x}, \mathbf{x'}) = E\left[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x'}) - m(\mathbf{x'}))\right] \\ f(\mathbf{x}) \sim GP(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x'}))
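A minimal numpy sketch of this specification (zero mean function, squared exponential covariance, arbitrary length scale): each draw from the multivariate normal is one "function" from the GP prior.

import numpy as np

def m(x):
    # mean function: zero everywhere
    return np.zeros_like(x)

def k(x, xp, ell=1.0):
    # squared exponential covariance function
    return np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / ell**2)

t = np.linspace(0, 10, 100)
K = k(t, t) + 1e-10 * np.eye(len(t))   # jitter for numerical stability

rng = np.random.default_rng(0)
prior_draws = rng.multivariate_normal(m(t), K, size=5)   # 5 "functions"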

MLTSA:

 

GP kernels

4

MLTSA:

Gaussian Processes:

  

We predict each y(t) point based on every other point in the time series and on our prior belief about the points and their locations:

Define:

  • how a point at t depends on the points at all other times t'
  • how it depends on the uncertainty of those points
  • how closely we want the function to go through each known point

 

MLTSA:

Gaussian Processes:

Kernels

 

MLTSA:

Gaussian Processes:

Kernels

 

This Kernel defines the mean. 

With no observed data you can define a family of functions all pointwise distributed as a Gaussian around that mean

MLTSA:

Gaussian Processes:

Kernels

 

Each "function" is the P(t,y) joint probability distribution of t and y

P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}

P(Y|X) is the probability of a predicted Ys realization given observed Xs. At t where x(t) you can set

P(Y) = 1 for y(t) = x(t),

P(Y) = 0 at y(t) = x(t)

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

cov(f(x_p), f(x_q)) = k(x_p, x_q) = \exp\left(-\frac{1}{2} |x_p - x_q|^2\right)

e.g. the "squared exponential kernel"

This says that we "believe" that the expected values \hat{y} = E[f(x)] of nearby points are strongly correlated: the covariance decays with the distance between x_p and x_q.

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

k(x_p, x_q) = \exp\left(-\frac{L^2}{2\ell^2}\right)

e.g.

L: lag, x_p - x_q

\ell: characteristic length

the "squared exponential kernel"

MLTSA:

Gaussian Processes:

Kernels

 

This Kernel defines the mean. 

With no observed data you can define a family of functions all pointwise distributed as a Gaussian around that mean

Each "function" is the P(y|t) posterior probability. The prior is the kernel, the likelihood is P(observed | y) where observed are the available data 

 

MLTSA:

Gaussian Processes:

Kernels

 

This Kernel defines the mean. 

With no observed data you can define a family of functions all pointwise distributed as a Gaussian around that mean

 

At each t there is a Gaussian-distributed family of y values

P(X'): probability of the model (prior)

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel also defines the posterior as a "similarity" function, or as the memory of a time-evolving process: how similar are points to each other, now including the observed data?

P(X | X'): probability of the data given the model

MLTSA:

Gaussian Processes:

Kernels

 

This Kernel defines the mean. 

As we include observations (the "training" data) the family becomes limited to the functions that pass through (or near) the data

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

the kernel defines the consistency relation:

  • how similar are points to each other given a lag l
  • how fast does the function evolve
  • can encode periodicity, etc. (see the kernel-composition sketch after this list)
  • can be time-evolving
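One way to sketch these properties is with scikit-learn's kernel objects (the length scales and period below are arbitrary starting values; inside a GaussianProcessRegressor they would be learned from the data):

import numpy as np
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared

smooth = RBF(length_scale=2.0)                                # similarity given a lag
periodic = ExpSineSquared(length_scale=1.0, periodicity=5.0)  # encodes periodicity
quasiperiodic = smooth * periodic                             # kernels compose

t = np.linspace(0, 10, 5).reshape(-1, 1)
print(quasiperiodic(t).round(3))   # the covariance matrix K(t, t)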

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

kernels are functions (hence the method is parametric) and they have parameters that are learned

 

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

 (see the kernel trick: the kernel allows the solution to be found in a linear, and therefore analytically solvable, framework)

kernels are functions (hence the method is parametric) and they have parameters that are learned

 

MLTSA:

 

MLTSA:

Gaussian Processes:

Kernels

 

the kernel defines the consistency relation of the Gaussian process:

  • how similar are points to each other given a lag l
  • how fast does the function evolve
  • can encode periodicity, etc.
  • can be time-evolving

MLTSA:

 

GP matrix formulation

5

MLTSA:

Gaussian Processes:

  

Joint distribution of training and test data

MLTSA:

Gaussian Processes:

Kernels

 

\begin{bmatrix} f(t)\\f(t_*) \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(t, t) & K(t,t_*)\\K(t_*, t) & K(t_*,t_*) \end{bmatrix} \right)\\

MLTSA:

Gaussian Processes:

  

Joint distribution of training and test data

MLTSA:

Gaussian Processes:

Kernels

 

\begin{bmatrix} f(t)\\f(t_*) \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(t, t) & K(t,t_*)\\K(t_*, t) & K(t_*,t_*) \end{bmatrix} \right)\\

Assume the process has mean 0. This is not necessary but it simplifies the math; data can be preprocessed to have mean 0

MLTSA:

Gaussian Processes:

  

Joint distribution of training and test data

MLTSA:

Gaussian Processes:

Kernels

 

\begin{bmatrix} f(t)\\f(t_*) \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(t, t) & K(t,t_*)\\K(t_*, t) & K(t_*,t_*) \end{bmatrix} \right)\\
f(\mathbf{t_*}) | \mathbf{t_*}, \mathbf{t}, f(\mathbf{t}) \sim N(K(\mathbf{t_*}, \mathbf{t})K(\mathbf{t}, \mathbf{t})^{-1} f(\mathbf{t}),\\ K(\mathbf{t_*}, \mathbf{t_*}) - K(\mathbf{t_*}, \mathbf{t})K(\mathbf{t}, \mathbf{t})^{-1} K(\mathbf{t}, \mathbf{t_*}))\\

requires a matrix inversion
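A minimal numpy sketch of this conditional distribution (squared exponential kernel, arbitrary training points; linear solves stand in for the explicit inverse):

import numpy as np

def sqexp(a, b, ell=1.0):
    # squared exponential kernel matrix K(a, b)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

t = np.array([1.0, 3.0, 5.0, 6.0])    # training times
f = np.sin(t)                         # training values f(t)
ts = np.linspace(0, 8, 100)           # test times t_*

K = sqexp(t, t) + 1e-8 * np.eye(len(t))   # jitter for stability
Ks = sqexp(ts, t)

# posterior mean and covariance from the slide, via linear solves
mean = Ks @ np.linalg.solve(K, f)
cov = sqexp(ts, ts) - Ks @ np.linalg.solve(K, Ks.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))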

MLTSA:

Gaussian Processes:

  

Joint distribution of training and test data, including uncertainties

MLTSA:

Gaussian Processes:

Kernels

 

\begin{bmatrix} f(t)\\f(t_*) \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(t, t) + \sigma_n^2\mathbf{I} & K(t,t_*)\\K(t_*, t) & K(t_*,t_*) \end{bmatrix} \right)\\
cov(y_p, y_q) = k(t_p, t_q) + \sigma_n^2\delta_{pq}\\
cov(\mathbf{y}) = K(\mathbf{t}, \mathbf{t}) + \sigma_n^2\mathbf{I}

with independent observations and uncertainties with variance σ_n²
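Continuing the sketch above with noisy observations (arbitrary noise level): the only change is the σ_n² I term on the training-covariance diagonal, so the posterior mean passes near, rather than through, the data.

import numpy as np

def sqexp(a, b, ell=1.0):
    # squared exponential kernel matrix K(a, b)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(1)
t = np.array([1.0, 3.0, 5.0, 6.0])
sigma_n = 0.1
y = np.sin(t) + sigma_n * rng.normal(size=t.size)   # noisy observations
ts = np.linspace(0, 8, 100)

# cov(y) = K(t, t) + sigma_n^2 I
Ky = sqexp(t, t) + sigma_n**2 * np.eye(len(t))
mean = sqexp(ts, t) @ np.linalg.solve(Ky, y)   # no longer interpolates exactly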

MLTSA:

 

GP exercises

6

references

A Visual Exploration of Gaussian Processes

Jochen Görtler, Rebecca Kehlbeck, and Oliver Deussen

https://distill.pub/2019/visual-exploration-gaussian-processes/

 

 

Gaussian Processes for Machine Learning, the MIT Press

Rasmussen & Williams 2006

http://www.gaussianprocess.org/gpml/chapters/RW.pdf

it is considered one of the best ML texts ever written

references

A gentle introduction to Gaussian Process Regression

Dan Foreman-Mackey - george user manual

https://george.readthedocs.io/en/latest/tutorials/first/#

HW

2D GP in time and wavelength
