Machine Learning for

Time Series Analysis VIII

Gaussian Processes

Fall 2022 - UDel PHYS 667
dr. federica bianco 

 

@fedhere

MLTSA:

 

missing data

1

What to do if you have missing data?

 

- most models will fail

 

- aggregate statistics will be biased

 

MLTSA:

missing data

MLTSA:

reminder: Stochastic process

A random variable indexed by time. 

For any subset of points in time, the dependent variable follows a probability distribution,

 e.g. 

p(x_{t_1}, \ldots, x_{t_n}) \sim N(\mu, \sigma)
import numpy as np
import matplotlib.pylab as pl

pl.figure(figsize=(20, 5))
N = 200
np.random.seed(100)
y = np.random.randn(N)                    # white Gaussian noise
t = np.linspace(0, N, N, endpoint=False)  # integer time stamps
pl.plot(t, y, lw=2)
pl.xlabel("time")
pl.ylabel("y");

Discrete time stochastic process

# any two segments of the series follow the same distribution
pl.hist(y[20:70])
pl.hist(y[100:150])

randomly distributed missing observations

MLTSA:

kinds of missing data

e.g. on/off times of sensors

aggregate statistics should be preserved if the process is stochastic
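A minimal numpy sketch of this claim (the missing fraction and seed are arbitrary choices): knock out observations at random and check that the aggregate statistics survive.

import numpy as np

np.random.seed(100)
N = 200
y = np.random.randn(N)

# remove ~20% of the observations at random
missing = np.random.rand(N) < 0.2
y_obs = y.copy()
y_obs[missing] = np.nan

# aggregate statistics are approximately preserved
print(np.nanmean(y_obs), y.mean())   # both ~0
print(np.nanstd(y_obs), y.std())     # both ~1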

missing data due to sensors' sensitivity

MLTSA:

kinds of missing data

e.g. CCD cameras: need minimum light to generate signal

censored data (upper or lower limits)

aggregate statistics will be biased - specifically, the variance will always be suppressed

data aggregated above or below a threshold

MLTSA:

kinds of missing data

e.g. medical records of people older than 90 are often aggregated as >90.

aggregate statistics will be biased - specifically, the variance will always be suppressed

censored data (upper or lower limits)

pandas imputation methods


MLTSA:

the missing data problem

MLTSA:

kinds of missing data

most models do not work with missing data

MLTSA:

 

data imputation

2

Impute data with mean:

 

If your goal was to estimate the mean, you may be OK; if your goal was to estimate the variance, you have effectively suppressed it.

MLTSA:

data imputation

Impute data with mean:

 

If your goal was to estimate the mean, you may be OK; if your goal was to estimate the variance, you have effectively suppressed it. In the time domain it may also cause significant jumps.

A local mean would be better, though.
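A minimal pandas sketch contrasting the two (the synthetic series, missing fraction, and window size are arbitrary choices):

import numpy as np
import pandas as pd

np.random.seed(100)
s = pd.Series(np.sin(np.arange(200) / 10) + 0.3 * np.random.randn(200))
s[s.sample(frac=0.2, random_state=0).index] = np.nan   # 20% missing

# global mean: preserves the mean, suppresses the variance, causes jumps
imputed_global = s.fillna(s.mean())

# local mean: a centered rolling mean respects the local trend
imputed_local = s.fillna(s.rolling(20, center=True, min_periods=1).mean())

print(s.var(), imputed_global.var(), imputed_local.var())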

MLTSA:

data imputation

Backward and forward filling:

 

particularly dangerous in time series analysis: you are implying stability in the system at scales where there is no stability!
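A minimal pandas sketch of both fills (toy values):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# forward fill: propagate the last valid observation forward in time
print(s.ffill().tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0, 6.0]

# backward fill: propagate the next valid observation backward in time
print(s.bfill().tolist())   # [1.0, 4.0, 4.0, 4.0, 6.0, 6.0]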

MLTSA:

data imputation

Linear, quadratic, cubic interpolation: it is model dependent. Unless you have a reason to prefer the model, you are constraining the data and will likely not get a good fit. Increased flexibility in the model allows a better fit but exposes you to the risk of overfitting.
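A minimal pandas sketch (arbitrary toy series; the quadratic and cubic methods require scipy). Each method imposes a different model on the gaps:

import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 2.0, np.nan, np.nan, 5.0, 4.0, np.nan, 2.0])

# compare the models each interpolation method imposes
for method in ("linear", "quadratic", "cubic"):
    print(method, s.interpolate(method=method).round(2).tolist())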

MLTSA:

data imputation

linear interpolation

quadratic interpolation

The Nearest Neighbors algorithm can be used to impute. For time series in particular, this can be done along the time axis or along the feature axis.
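A minimal scikit-learn sketch of imputation along the time axis (toy series; the number of neighbours is an arbitrary choice). Stacking time as a feature makes the nearest neighbours of a missing value the points closest in time:

import numpy as np
from sklearn.impute import KNNImputer

np.random.seed(0)
t = np.arange(50, dtype=float)
y = np.sin(t / 5) + 0.1 * np.random.randn(50)
y[np.random.choice(50, 10, replace=False)] = np.nan

# with (t, y) as features, distances for rows with missing y
# are computed from t alone, so neighbours are close in time
X = np.column_stack([t, y])
y_imputed = KNNImputer(n_neighbors=3).fit_transform(X)[:, 1]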

MLTSA:

data imputation - kNN

MLTSA:

 

Gaussian Processes

3

MLTSA:

Gaussian Processes:

 

Gaussian Processes are a probabilistic framework for missing data imputation

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Advantages

 

Provides a framework to estimate values of missing data and their uncertainty

 

 

 

 

the uncertainty will take into account how isolated the point is and how "predictable" (i.e. stationary) the process is

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Advantages

 

Provides a framework to estimate values of missing data and their uncertainty

 

This is a Bayesian framework. 

It is however NOT entirely a non-parametric method: you need to define a model (embedded in the kernel function, as we will see in a few slides)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Advantages

 

It's computationally "tractable"

The framework defines an infinite set of "functions" that can be evaluated at each point: a single calculation provides the answer to the interpolation at any N points.

 

Each pair of function evaluations follows a bivariate normal distribution 

(y_1, y_2) \sim N(\overrightarrow{\mu}, \Sigma)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Applications


Prediction:

Predicting data based on the training set (this can naturally be done in a regression but also in a classification context - Gaussian Process classifiers)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Applications


Data imputation:

Filling in missing data and generating data at regular intervals of the exogenous variable (really important, since most time-domain models require data to be evenly sampled)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Applications


Survey design:

Identify where the uncertainty is largest to decide where to make observations (e.g. geospatial applications: where do I put my sensors? Time domain: when do I "mine" data from an existing but hard-to-access dataset?)

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

 

Parametric (simple) approach - define a functional form (e.g. linear).

  • Problem: if the functional form is not the same as the generative process, the data will be fit poorly.
  • Not a solution: increasing the number of parameters in the function may improve the fit but induces overfitting and loss of generalization

Data imputation requires definition of a function:

linear fit

10th degree polynomial fit

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

 

Probabilistic approach - consider all possible functions but assign a prior to the ones that are more "likely"

  • Problem: naively, there are infinity^infinity functions (an infinity of functions in an infinite parameter space)
  • Not a problem!: it turns out this infinite set of functions can be described by its aggregate statistics (like an infinite set of numbers can be described by its distribution)

Data imputation requires definition of a function:

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

A process is a function, i.e. something that can be evaluated at any value of a (1- or N-dimensional) variable.

 

A process is a collection of random variables

MLTSA:

 

GP definitions

3.1

MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

A process is a collection of random variables

 

A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution.

MLTSA:

Gaussian Processes:

math: Multivariate Gaussian distributions

multivariate Gaussian distribution

μ: vector of means (expectation values)

σ: vector of standard deviations

Σ: covariance matrix

\mu = [\mu_1 \ldots \mu_N] \\ \sigma = [\sigma_1 \ldots \sigma_N] \\ \Sigma_{ij} = \mathrm{cov}(x_i, x_j), \quad \Sigma_{ii} = \sigma_i^2
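A minimal numpy sketch with an arbitrary 3-dimensional example: μ and Σ fully specify the distribution, and the diagonal of Σ carries the variances σ_i²:

import numpy as np

mu = np.array([0.0, 1.0, -1.0])        # vector of means
Sigma = np.array([[1.0, 0.5, 0.0],     # covariance matrix:
                  [0.5, 2.0, 0.3],     # Sigma_ij = cov(x_i, x_j)
                  [0.0, 0.3, 0.5]])    # Sigma_ii = sigma_i^2

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, Sigma, size=100000)
print(x.mean(axis=0))    # ~ mu
print(x.var(axis=0))     # ~ [1.0, 2.0, 0.5], the diagonal of Sigma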


MLTSA:

Gaussian Processes:

a probabilistic imputation method - Bayesian Framework

 

A Gaussian Process is entirely specified by its mean and covariance function.

The mean and covariance functions can be static or time-evolving

m(\mathbf{x}) = E\left[f(\mathbf{x})\right] \\ k(\mathbf{x}, \mathbf{x'}) = E\left[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x'}) - m(\mathbf{x'}))\right] \\ f(\mathbf{x}) \sim GP(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x'}))
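A minimal numpy sketch of this specification (zero mean function, squared exponential covariance, arbitrary length scale): each draw from the multivariate normal is one "function" from the GP prior.

import numpy as np

def m(x):
    # mean function: zero everywhere
    return np.zeros_like(x)

def k(x, xp, ell=1.0):
    # squared exponential covariance function
    return np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / ell**2)

t = np.linspace(0, 10, 100)
K = k(t, t) + 1e-10 * np.eye(len(t))   # jitter for numerical stability

rng = np.random.default_rng(0)
prior_draws = rng.multivariate_normal(m(t), K, size=5)   # 5 "functions"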

MLTSA:

 

GP kernels

4

MLTSA:

Gaussian Processes:

  

We predict each y(t) point based on every other point in the time series and on our prior belief about the points and their locations:

Define:

  • how a point at t depends on the points at all other times t'
  • how it depends on the uncertainty of those points
  • how closely we want the function to go through each known point

 

MLTSA:

Gaussian Processes:

Kernels

 

MLTSA:

Gaussian Processes:

Kernels

 

This Kernel defines the mean. 

With no observed data you can define a family of functions all pointwise distributed as a Gaussian around that mean

MLTSA:

Gaussian Processes:

Kernels

 

Each "function" is the P(t,y) joint probability distribution of t and y

P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}

P(Y|X) is the probability of a predicted Ys realization given observed Xs. At t where x(t) you can set

P(Y) = 1 for y(t) = x(t),

P(Y) = 0 at y(t) = x(t)

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

cov(f(x_p), f(x_q)) = k(x_p, x_q) = \exp\left(-\frac{1}{2} |x_p - x_q|^2\right)

e.g. the "squared exponential kernel"

This says that we "believe" that the expected values \hat{y} = E[f(x)] of nearby points are strongly correlated: the covariance decays with the distance between x_p and x_q.

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

k(x_p, x_q) = \exp\left(-\frac{L^2}{2\ell^2}\right)

e.g.

L: lag, x_p - x_q

\ell: characteristic length

the "squared exponential kernel"

MLTSA:

Gaussian Processes:

Kernels

 

This Kernel defines the mean. 

With no observed data you can define a family of functions all pointwise distributed as a Gaussian around that mean

Each "function" is the P(y|t) posterior probability. The prior is the kernel, the likelihood is P(observed | y) where observed are the available data 

 

MLTSA:

Gaussian Processes:

Kernels

 

This Kernel defines the mean. 

With no observed data you can define a family of functions all pointwise distributed as a Gaussian around that mean

 

At each t there is a Gaussian-distributed family of y values

P(X'): probability of the model (prior)

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel also defines the posterior as a "similarity" function, or as the memory of a time-evolving process: how similar are points to each other, now including the observed data?

P(X | X'): probability of the data given the model

MLTSA:

Gaussian Processes:

Kernels

 

This Kernel defines the mean. 

As we include observations (the "training" data) the family becomes limited to the functions that pass through (or near) the data

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

the kernel defines the consistency relation:

  • how similar are points to each other given a lag l
  • how fast does the function evolve
  • can encode periodicity, etc. (see the kernel-composition sketch after this list)
  • can be time-evolving
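One way to sketch these properties is with scikit-learn's kernel objects (the length scales and period below are arbitrary starting values; inside a GaussianProcessRegressor they would be learned from the data):

import numpy as np
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared

smooth = RBF(length_scale=2.0)                                # similarity given a lag
periodic = ExpSineSquared(length_scale=1.0, periodicity=5.0)  # encodes periodicity
quasiperiodic = smooth * periodic                             # kernels compose

t = np.linspace(0, 10, 5).reshape(-1, 1)
print(quasiperiodic(t).round(3))   # the covariance matrix K(t, t)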

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

kernels are functions (hence the method is parametric) and they have parameters that are learned

 

MLTSA:

Gaussian Processes:

Kernels

 

The Kernel is the prior distribution:

 

 (see the kernel trick: the kernel allows the solution to be found in a linear, and therefore analytically solvable, framework)

kernels are functions (hence the method is parametric) and they have parameters that are learned

 

MLTSA:

 

MLTSA:

Gaussian Processes:

Kernels

 

the kernel defines the consistency relation of the Gaussian process:

  • how similar are points to each other given a lag l
  • how fast does the function evolve
  • can encode periodicity, etc.
  • can be time-evolving

MLTSA:

 

GP matrix formulation

5

MLTSA:

Gaussian Processes:

  

Joint distribution of training and test data

MLTSA:

Gaussian Processes:

Kernels

 

\begin{bmatrix} f(t)\\f(t_*) \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(t, t) & K(t,t_*)\\K(t_*, t) & K(t_*,t_*) \end{bmatrix} \right)\\

MLTSA:

Gaussian Processes:

  

Joint distribution of training and test data

MLTSA:

Gaussian Processes:

Kernels

 

\begin{bmatrix} f(t)\\f(t_*) \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(t, t) & K(t,t_*)\\K(t_*, t) & K(t_*,t_*) \end{bmatrix} \right)\\

Assume the process has mean 0. This is not necessary but it simplifies the math; data can be preprocessed to have mean 0

MLTSA:

Gaussian Processes:

  

Joint distribution of training and test data

MLTSA:

Gaussian Processes:

Kernels

 

\begin{bmatrix} f(t)\\f(t_*) \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(t, t) & K(t,t_*)\\K(t_*, t) & K(t_*,t_*) \end{bmatrix} \right)\\
f(\mathbf{t_*}) | \mathbf{t_*}, \mathbf{t}, f(\mathbf{t}) \sim N(K(\mathbf{t_*}, \mathbf{t})K(\mathbf{t}, \mathbf{t})^{-1} f(\mathbf{t}),\\ K(\mathbf{t_*}, \mathbf{t_*}) - K(\mathbf{t_*}, \mathbf{t})K(\mathbf{t}, \mathbf{t})^{-1} K(\mathbf{t}, \mathbf{t_*}))\\

requires a matrix inversion
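A minimal numpy sketch of this conditional distribution (squared exponential kernel, arbitrary training points; linear solves stand in for the explicit inverse):

import numpy as np

def sqexp(a, b, ell=1.0):
    # squared exponential kernel matrix K(a, b)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

t = np.array([1.0, 3.0, 5.0, 6.0])    # training times
f = np.sin(t)                         # training values f(t)
ts = np.linspace(0, 8, 100)           # test times t_*

K = sqexp(t, t) + 1e-8 * np.eye(len(t))   # jitter for stability
Ks = sqexp(ts, t)

# posterior mean and covariance from the slide, via linear solves
mean = Ks @ np.linalg.solve(K, f)
cov = sqexp(ts, ts) - Ks @ np.linalg.solve(K, Ks.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))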

MLTSA:

Gaussian Processes:

  

Joint distribution of training and test data, including uncertainties

MLTSA:

Gaussian Processes:

Kernels

 

\begin{bmatrix} f(t)\\f(t_*) \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(t, t) + \sigma_n^2\mathbf{I} & K(t,t_*)\\K(t_*, t) & K(t_*,t_*) \end{bmatrix} \right)\\
cov(y_p, y_q) = k(t_p, t_q) + \sigma_n^2\delta_{pq}\\
cov(\mathbf{y}) = K(\mathbf{t}, \mathbf{t}) + \sigma_n^2\mathbf{I}

with independent observations and uncertainties with variance σ_n²
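Continuing the sketch above with noisy observations (arbitrary noise level): the only change is the σ_n² I term on the training-covariance diagonal, so the posterior mean passes near, rather than through, the data.

import numpy as np

def sqexp(a, b, ell=1.0):
    # squared exponential kernel matrix K(a, b)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(1)
t = np.array([1.0, 3.0, 5.0, 6.0])
sigma_n = 0.1
y = np.sin(t) + sigma_n * rng.normal(size=t.size)   # noisy observations
ts = np.linspace(0, 8, 100)

# cov(y) = K(t, t) + sigma_n^2 I
Ky = sqexp(t, t) + sigma_n**2 * np.eye(len(t))
mean = sqexp(ts, t) @ np.linalg.solve(Ky, y)   # no longer interpolates exactly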

MLTSA:

 

GP exercises

6

references

A Visual Exploration of Gaussian Processes

Jochen Görtler, Rebecca Kehlbeck, and Oliver Deussen

https://distill.pub/2019/visual-exploration-gaussian-processes/

 

 

Gaussian Processes for Machine Learning, the MIT Press

Rasmussen & Williams 2006

http://www.gaussianprocess.org/gpml/chapters/RW.pdf

it is considered one of the best ML texts ever written

references

A gentle introduction to Gaussian Process Regression

Dan Foreman-Mackey - george user manual

https://george.readthedocs.io/en/latest/tutorials/first/#

HW

2D GP in time and wavelength
