Machine Learning for Time Series Analysis IV

State Space Models and MCMC

Fall 2022 - UDel PHYS 667
dr. federica bianco 

 

@fedhere

1  time series decomposition

2  model fit: gradient descent

optimization

gradient descent

stochastic gradient descent

3  Bayesian statistics

combinatorial statistics

Bayes' theorem

4  optimization with MCMC

MCMC

state space models for time series analysis

BSTS fitting

 

- Nowcasting: short-term forecasting

includes descriptions of the local linear trend and seasonal state models, as well as spike-and-slab priors for regressions with large numbers of predictors.

- Long-term forecasting

describes a situation where the local level and local linear trend models would be inappropriate. It offers a semilocal linear trend model as an alternative.

- Recession modeling

describes a model where the response variable is non-Gaussian. The goal in this example is not to predict the future, but to control for serial dependence in an explanatory model that seeks to identify relevant predictor variables.

 

Weekly initial claims for unemployment in the US.
Scott and Varian (2014, 2015)

MLTSA:

Model fit:

Optimization and Likelihood

1

MLTSA:

Optimization & Likelihood

change a, b to minimize χ2

Target Function

 

\chi^2 = \sum_{i=1}^N \frac{(y_i - (a~x_i + b))^2}{\sigma^2}
y_i = a~x_i + b

Linear regression

MLTSA:

Optimization & Likelihood

Target Function

 

\chi^2 = \sum_{i=1}^N \frac{(y_i - (a~x_i + b))^2}{\sigma^2}
y_i = a~x_i + b

Univariate Linear regression

(figure: χ² surface over parameters a and b)

N(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)

Gaussian

Probability of data given a model, e.g. Gaussian

 

Maximizing the likelihood, we seek the parameters that maximize the probability of the observed data under the chosen model

Likelihood: the probability of the data given the model, viewed as a function of the model parameters. Same Gaussian assumptions

 

MLTSA:

Likelihood

unknown variables

unknown variables

L(\mu, \sigma | \vec{x}) \sim \frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-(\vec{x}-\mu)^2}{2\sigma^2}\right)
P(\vec{x} | \mu, \sigma) \sim \frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-(\vec{x}-\mu)^2}{2\sigma^2}\right)
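Below, a minimal sketch of what "maximizing the likelihood" means in practice (the data, parameter grid, and function names are illustrative, not from the slides): evaluate the Gaussian log-likelihood of a sample for trial values of μ and keep the best one.

import numpy as np

# Gaussian log-likelihood of a sample x for trial parameters (mu, sigma);
# logs turn the product over points into a numerically stable sum
def log_likelihood(mu, sigma, x):
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2
                  - np.log(np.sqrt(2 * np.pi) * sigma))

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=0.5, size=1000)  # illustrative data

# brute-force grid maximization over mu (sigma held fixed), for illustration
mus = np.linspace(1.5, 2.5, 201)
best_mu = mus[np.argmax([log_likelihood(m, 0.5, x) for m in mus])]
print(best_mu)  # ~2.0: the maximum-likelihood estimate tracks the sample mean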

MLTSA:

Optimization

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

Gradient descent is an iterative algorithm that starts from a random point on a function and travels down its slope in steps until it reaches the lowest point of that function.

(Towards Data Science)

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

Gradient Descent

p_\mathrm{new} := p_\mathrm{old} - \eta \nabla Q(p_\mathrm{old})

MLTSA:

Optimization

  1. Choose a target function Q(p) of the parameters p

  2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))

  3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

    Repeat steps 4, 5, 6 until "convergence":

  4. Calculate the gradient Q' of the target function for the current parameter values, calculating it over all observations in the training set

  5. Calculate the next step size for each feature:
    stepsize = Q'(p_now) * η

  6. Calculate the new parameters p_new as:
    p_new = p_now - stepsize
\nabla Q(p_\mathrm{final}) \in [-\epsilon, \epsilon]

gradient descent algorithm

"convergence" is reached when the gradient is ~0: with ε  tollrance

 

p_\mathrm{new} := p_\mathrm{old} - \eta \nabla Q(p_\mathrm{old})
Q: ~\mathrm{target~function}\\ p: ~\mathrm{parameters}\\ \eta : ~\mathrm{learning~rate}\\ \epsilon : ~\mathrm{tolerance}
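A minimal sketch of the recipe above for the univariate linear regression target (illustrative data and hyperparameters; σi = 1, so Q is the unweighted χ²):

import numpy as np

# batch gradient descent for Q(a, b) = chi^2 = sum_i (y_i - (a x_i + b))^2
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, x.size)  # truth: a=2.5, b=1.0

a, b = rng.normal(size=2)  # step 2: random initial parameters
eta = 1e-4                 # step 3: learning rate
for _ in range(20000):
    r = y - (a * x + b)            # residuals over ALL observations
    grad_a = -2 * np.sum(r * x)    # step 4: dQ/da
    grad_b = -2 * np.sum(r)        #         dQ/db
    a -= eta * grad_a              # steps 5-6: p_new = p_old - eta * grad
    b -= eta * grad_b
    if max(abs(grad_a), abs(grad_b)) < 1e-6:  # convergence: |grad| ~ 0
        break
print(a, b)  # approaches (2.5, 1.0)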

MLTSA:

Optimization

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

MLTSA:

Optimization

p_{\mathrm{new},j} = p_{\mathrm{old},j} - \eta_j \sum_{i=1}^N \frac{\partial Q_i(p_\mathrm{old})}{\partial p_j}

Gradient Descent

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

MLTSA:

Optimization

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

Add stochasticity to avoid getting stuck in a local minimum

MLTSA:

Optimization

\nabla Q(p_\mathrm{final}) \in [-\epsilon, \epsilon]

stochastic gradient descent algorithm

"convergence" is reached when the gradient is ~0: with ε  tollrance

 

p_\mathrm{new} := p_\mathrm{old} - \eta \nabla Q(p_\mathrm{old})
Q: ~\mathrm{target~function}\\ p: ~\mathrm{parameters}\\ \eta : ~\mathrm{learning~rate}\\ \epsilon : ~\mathrm{tolerance}

MLTSA:

Optimization

  1. Choose a target function Q(p) of the parameters p

  2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))

  3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

    Repeat steps 4, 5, 6 until "convergence":

  4. Calculate the gradient Q' of the target function for the current parameter values on a subset of the observations (extreme: size = 1)

  5. Calculate the next step size for each feature:
    stepsize = Q'(p_now) * η

  6. Calculate the new parameters p_new as:
    p_new = p_now - stepsize
\nabla Q(p_\mathrm{final}) \in [-\epsilon, \epsilon]

stochastic gradient descent algorithm

"convergence" is reached when the gradient is ~0: with ε  tollrance

 

p_\mathrm{new} := p_\mathrm{old} - \eta \nabla Q(p_\mathrm{old})
Q: ~\mathrm{target~function}\\ p: ~\mathrm{parameters}\\ \eta : ~\mathrm{learning~rate}\\ \epsilon : ~\mathrm{tolerance}

MLTSA:

Optimization

  1. Choose a target function Q(p) of the parameters p

  2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))

  3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

    Repeat steps 4, 5, 6 until "convergence":

  4. Calculate the gradient Q' of the target function for the current parameter values on a randomly selected subset of the observations (extreme: size = 1)

  5. Calculate the next step size for each feature:
    stepsize = Q'(p_now) * η

  6. Calculate the new parameters p_new as:
    p_new = p_now - stepsize

stochastic gradient descent algorithm
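A minimal sketch of the stochastic variant (illustrative data; a mini-batch of 10 points per step, with size = 1 as the extreme case):

import numpy as np

# stochastic gradient descent for the same chi^2 target: each step
# estimates the gradient on a random mini-batch instead of all N points
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, x.size)

a, b = rng.normal(size=2)
eta, batch = 1e-3, 10          # illustrative learning rate and batch size
for _ in range(20000):
    i = rng.choice(x.size, size=batch, replace=False)  # random subset
    r = y[i] - (a * x[i] + b)
    a -= eta * (-2 * np.sum(r * x[i]))  # noisy estimate of dQ/da
    b -= eta * (-2 * np.sum(r))         # noisy estimate of dQ/db
print(a, b)  # jitters around (2.5, 1.0); the noise helps escape local minima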

MLTSA:

Optimization

  1. Choose a target function Q(p) of the parameters p

  2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))

  3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

    Repeat steps 4, 5, 6 until "convergence":

  4. Calculate the gradient Q' of the target function for the current parameter values, calculating it over all observations in the training set

  5. Calculate the next step size for each feature:
    stepsize = Q'(p_now) * η

  6. Calculate the new parameters p_new as:
    p_new = p_now - stepsize

gradient descent algorithm

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

MLTSA:

Optimization

p_{\mathrm{new},j} = p_{\mathrm{old},j} - \eta_j \sum_{i\in S} \frac{\partial Q_i(p_\mathrm{old})}{\partial p_j}

Stochastic Gradient Descent

where the sum runs over a randomly chosen subset S of the full set of N observations

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

idea 2: start several optimizations in parallel

(we will see this in the next class)

MLTSA:

Optimization

MLTSA:

Bayes Theorem &

Bayesian statistics

2

MLTSA:

combinatorial statistics

P(A|B)P(B) = P(B|A)P(A)

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

P(A|B)

probability of A conditioned on B

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

P(A|B)

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

P(A|B)
P(A|B) = P(A)

 A and B are independent

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

P(A|B)

A=red; B=blue

P(A)= ?

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

A=red; B=blue

P(A)= ?
P(A|B) = \frac{P(A\cap B)}{P(B)}
P(A|B)

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

new sample space after B happened

P(A|B) = \frac{P(A\cap B)}{P(B)}

(Venn diagram: overlapping events A and B)
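A tiny numerical sketch of P(A|B) = P(A∩B)/P(B) by direct counting (the die-roll events here are illustrative, not from the slides):

import numpy as np

# P(A|B) by counting over the equally likely outcomes of one die roll:
# A = "roll is even", B = "roll > 3"
rolls = np.arange(1, 7)
A = rolls % 2 == 0
B = rolls > 3
p_B = B.mean()               # 3/6
p_A_and_B = (A & B).mean()   # {4, 6} -> 2/6
print(p_A_and_B / p_B)       # P(A|B) = 2/3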

MLTSA:

combinatorial statistics

P(A|B)P(B) = P(B|A)P(A)
P(A|B) = \frac{P(B|A)P(A)}{P(B)}

MLTSA:

Bayes theorem

P(\mathrm{model}|\mathrm{data})P(\mathrm{data}) = P(\mathrm{data}|\mathrm{model})P(\mathrm{model})
P(\mathrm{model}|\mathrm{data}) = \frac{P(\mathrm{data}|\mathrm{model})P(\mathrm{model})}{P(\mathrm{data})}

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

posterior

likelihood

prior

evidence

model parameters

data

\theta

D

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

constraints on the model

e.g. flux from a star is never negative

P(f<0) = 0; P(f≥0) = 1

prior:

model parameters

data

\theta

D

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

D

prior:

constraints on the model

people's weight <1000lb

& people's weight >0lb

P(w) ~ N(105lb, 90lb)

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

D

prior:

constraints on the model

people's weight <1000lb

& people's weight >0lb

P(w) ~ N(105lb, 90lb)

the prior should not be 0 anywhere the probability might be nonzero

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

D

prior:

"uninformative prior"

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

D

prior:

"uninformative prior"

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

likelihood:

this is our model

model parameters

data

\theta

D

y = a~x + b + \epsilon, ~~\epsilon \sim N(0, \sigma^2) ~~\Rightarrow~~ P(D|\theta) = \prod_{i=1}^N N\big(y_i - (a~x_i + b);~0, \sigma^2\big)

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

evidence

????

model parameters

data

\theta

D

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

evidence

????

model parameters

data

\theta

D

it does not matter if I want to use this for model comparison

MLTSA:

Bayes theorem

P(\theta_1|\mathrm{D}) = \frac{P(\mathrm{D}|\theta_1)P(\theta_1)}{P(\mathrm{D})}

model parameters

data

\theta

D

P(\theta_2|\mathrm{D}) = \frac{P(\mathrm{D}|\theta_2)P(\theta_2)}{P(\mathrm{D})}

it does not matter if I want to use this for model comparison

which has the highest posterior probability?

P(\theta|\mathrm{D}) \propto{P(\mathrm{D}|\theta)P(\theta)}
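Concretely, the evidence P(D) cancels when comparing two parameter sets on the same data, so only likelihood × prior matters:

\frac{P(\theta_1|\mathrm{D})}{P(\theta_2|\mathrm{D})} = \frac{P(\mathrm{D}|\theta_1)P(\theta_1)}{P(\mathrm{D}|\theta_2)P(\theta_2)}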

MLTSA:

Bayes theorem

posterior: joint probability distribution of a parameter set (θ, e.g. (m, b)) conditioned upon some data D and a model hypothesis f

evidence: marginal likelihood of data under the model

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

prior: "intellectual" knowledge about the model parameters, conditioned on a model hypothesis f. This should come from domain knowledge or from data other than the dataset under examination

MLTSA:

Bayes theorem

posterior: joint probability distribution of a parameter set (θ, e.g. (m, b)) conditioned upon some data D and a model hypothesis f

evidence: marginal likelihood of data under the model

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

prior: "intellectual" knowledge about the model parameters, conditioned on a model hypothesis f. This should come from domain knowledge or from data other than the dataset under examination

P(D|f) = \int_{-\infty}^\infty P(D|\theta,f)P(\theta|f)d\theta

in reality, all of these quantities are conditioned on the form of the model f: this is a model-fitting methodology, not a model-selection one

MLTSA:

 

State Space Model

4

MLTSA:

state space model

motion equations for the position or "state" of a spacecraft

x(t): location

 y(t): information that can be observed from a tracking device such as velocity and azimuth

MLTSA:

state space model

  • The state process is assumed to be a Markov process; this means that the future x_{t>\mathrm{now}} and the past x_{t<\mathrm{now}} are independent conditional on the present x_t.
  • The second condition is that the observations y_t are independent given the states x_t.

MLTSA:

state space model

Y_t=x_t\beta+\epsilon_t;~~\epsilon_t∼N(0,\sigma^2_\epsilon)
x_{t+1} =x_{t} + \nu_t;~~\nu_t∼N(0,\sigma^2_\nu)

unobserved state

Underlying state x is a time varying Markovian process (the position of the spacecraft)

 

The observed variable depends at least on the state and on noise.

 

Other elements (e.g. seasonality) can be included in the model too
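A minimal numpy sketch of this model with β = 1 (a pure local level; all values illustrative): the latent state is a random walk and we only ever observe a noisy version of it.

import numpy as np

# simulate x_{t+1} = x_t + nu_t (latent Markovian state) and
# Y_t = x_t + eps_t (the observed series); beta = 1, values illustrative
rng = np.random.default_rng(1)
T, sigma_nu, sigma_eps = 200, 0.1, 0.5

x = np.cumsum(rng.normal(0, sigma_nu, T))  # unobserved random-walk state
Y = x + rng.normal(0, sigma_eps, T)        # what the tracking device sees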

MLTSA:

BSTS

we can write a Bayesian structural model like this:

Y_t=\mu_t+x_t\beta+S_t+\epsilon_t;~~\epsilon_t∼N(0,\sigma^2_\epsilon)\\ \mu_t= \mu_{t-1}+\nu_t;~~\nu∼N(0,\sigma^2_\nu)

local level

state

seasonal trend

MLTSA:

BSTS

we can write a Bayesian structural model like this:

Y_t=\mu_t+x_t\beta+S_t\alpha+\epsilon_t;~~\epsilon_t∼N(0,\sigma^2_\epsilon)
\mu_{t+1} =\mu_{t} + \nu_t;~~\nu_t∼N(0,\sigma^2_\nu)

seasonal variations

unobserved trend

It's a Markovian process:

stochastic process with 1-step memory

there is a hidden or latent process x_t called the state process (the position of the spacecraft)
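As a hedged sketch of fitting such a structural model in Python: statsmodels' UnobservedComponents fits the local level model by maximum likelihood (a frequentist cousin of BSTS; a fully Bayesian fit would put priors on the variances and sample them with MCMC, as the bsts R package does). The data here are simulated.

import numpy as np
import statsmodels.api as sm

# fit a local level structural model to a simulated series
rng = np.random.default_rng(1)
level = np.cumsum(rng.normal(0, 0.1, 200))  # latent mu_t
Y = level + rng.normal(0, 0.5, 200)         # observed Y_t

res = sm.tsa.UnobservedComponents(Y, level='local level').fit(disp=False)
print(res.summary())            # estimates of sigma^2_eps and sigma^2_nu
mu_hat = res.smoothed_state[0]  # smoothed estimate of the latent level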

MLTSA:

BSTS

a bit more complex

MLTSA:

BSTS

A model that accepts regression on exogenous variables allows feature extraction: which features are important in the prediction? 

e.g. CO2 or Irradiance in climate change (HW)

MLTSA:

 

Model Optimization

 with MCMC

5

MLTSA:

Monte Carlo Markov Chain

stochastic 

"markovian" 

sequence

 

posterior: joint probability distribution of a parameter set (m, b) conditioned upon some data D and a model hypothesis f

MLTSA:

MCMC

Goal: sample the posterior distribution

(figure: posterior samples over slope and intercept)

Goal: sample the posterior distribution

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

stochasticity allows us to explore the whole surface but spend more time in interesting spots

Goal: sample the posterior distribution
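A minimal sketch of this sampler for the straight-line model (flat priors assumed, so the posterior is proportional to the Gaussian likelihood; the accept/reject test is done in log space to avoid numerical underflow; data and proposal scale are illustrative):

import numpy as np

# Metropolis sampling of the posterior of (m, b) for y = m x + b
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, x.size)  # truth: m=2, b=1

def log_post(theta):          # log posterior, flat priors, sigma_i = 1
    m, b = theta
    return -0.5 * np.sum((y - (m * x + b)) ** 2)

current = np.array([0.0, 0.0])                 # theta_0
chain = []
for _ in range(20000):
    new = current + rng.normal(0, 0.05, 2)     # symmetric Gaussian proposal
    # accept with probability min(1, p_new/p_curr), i.e. always if better;
    # in log space: accept if log r < log p_new - log p_curr
    if np.log(rng.uniform()) < log_post(new) - log_post(current):
        current = new
    chain.append(current)
chain = np.array(chain)   # samples of (m, b); discard an initial burn-in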

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                 

Questions:

1. how do I choose the next point?

Any Markovian ergodic process

 

 

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

A process is Markovian if the next state of the system is determined stochastically as a perturbation of the current state of the system, and only the current state of the system, i.e. the system has no memory of earlier states (a memory-less process).

 

A Markovian Process

Definition

Ergodic Process

(given enough time) the entire parameter space would be sampled.

 
 

At equilibrium, each elementary process should be equilibrated by its reverse process.

reversible Markov process

 

Detailed balance is a sufficient condition for the chain to have π as its stationary (equilibrium) distribution

\pi (x_1)P(x_2 | x_1)=\pi (x_2)P(x_1 | x_2)

Metropolis Rosenbluth Rosenbluth Teller 1953 - Hastings 1970
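A sketch of why the Metropolis rule satisfies this condition (assuming a symmetric proposal q(x2|x1) = q(x1|x2), so that the transition probability is P(x2|x1) = q(x2|x1) min(1, π(x2)/π(x1))):

\pi(x_1)P(x_2|x_1) = q(x_2|x_1)\min\big(\pi(x_1), \pi(x_2)\big) = \pi(x_2)P(x_1|x_2)

the right-hand side follows because both q and the min are symmetric under swapping x1 and x2, so the acceptance rule satisfies detailed balance with respect to the target π.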

Definition

it can be shown that

If the chains are a Markovian Ergodic process

the algorithm is guaranteed to explore the entire likelihood surface given infinite time

 
 

This is in contrast to gradient descent, which can get stuck in local minima or in local saddle points. 

how to choose the next point

how you make this decision names the algorithm

simulated annealing (good for multimodal)

parallel tempering (good for multimodal)
differential evolution (good for covariant spaces)

Gibbs sampling (move along one variable at a time)

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

The chains: the algorithm creates a "chain" (a random walk) that "explores" the likelihood surface.

It is more efficient to run many chains in parallel, each exploring the surface: an "ensemble"

The path of the chains can be shown along each feature

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

(trace plot: feature value vs step)

Examples of how to choose the next point

affine invariant: the emcee package
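A hedged sketch of emcee's ensemble sampler on the same straight-line posterior (API calls as in the emcee documentation; data and walker setup are illustrative):

import numpy as np
import emcee

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, x.size)

def log_prob(theta):                 # log posterior, flat priors
    m, b = theta
    return -0.5 * np.sum((y - (m * x + b)) ** 2)

nwalkers, ndim = 32, 2
p0 = rng.normal(0, 1, (nwalkers, ndim))       # one start per walker
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 5000)
samples = sampler.get_chain(discard=1000, flat=True)  # drop burn-in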


MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

(trace plot: feature value vs step)

MCMC convergence

  1. check autocorrelation within a chain (Raftery)
  2. check that all chains converged to the same region (a stationary distribution, Gelman-Rubin)
  3. mean at beginning = mean at end (Geweke)
  4. check that the entire chain reached the stationary distribution (or a final fraction of the chain, Heidelberg-Welch, using the Cramér-von Mises statistic)
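A minimal sketch of check 2, the Gelman-Rubin diagnostic (one common variant of the formula; the chains here are synthetic): R-hat compares between-chain and within-chain variance, and values near 1 indicate the chains reached the same stationary distribution.

import numpy as np

def gelman_rubin(chains):
    """chains: (n_chains, n_steps) samples of one parameter."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)                # R-hat; ~1 means converged

rng = np.random.default_rng(0)
good = rng.normal(0, 1, (4, 2000))             # 4 chains, same target
bad = good + np.arange(4)[:, None]             # 4 chains stuck in different places
print(gelman_rubin(good), gelman_rubin(bad))   # ~1.0 vs clearly > 1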

VISUALIZATION 

OF THE WEEK

when should you use circular plots?

 

cyclic visualizations

(annual behavior)

 

when you do not want to emphasize the hierarchical time order

references

Elements of Statistical Learning Chapter 7

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

additionally: MCMC

emcee: The MCMC Hammer

https://arxiv.org/abs/1202.3665

 

Forecasting at scale

(the Facebook Prophet paper)

https://peerj.com/preprints/3190

READING

HW

Structural Time Series modeling with Prophet

 
