Machine Learning for Time Series Analysis IV

State Space Models and MCMC

Fall 2022 - UDel PHYS 667
dr. federica bianco 

 

@fedhere

1  time series decomposition

2  model fit: gradient descent

optimization

gradient descent

stochastic gradient descent

3  Bayesian statistics

combinatorial statistics

Bayes' theorem

4  optimization with MCMC

MCMC

state space models for time series analysis

BSTS fitting

 

- Nowcasting: short-term forecasting

includes descriptions of the local linear trend and seasonal state models, as well as spike-and-slab priors for regressions with large numbers of predictors.

- Long-term forecasting

describes a situation where the local level and local linear trend models would be inappropriate. It offers a semilocal linear trend model as an alternative.

- Recession modeling

describes a model where the response variable is non-Gaussian. The goal in this example is not to predict the future, but to control for serial dependence in an explanatory model that seeks to identify relevant predictor variables.

 

Weekly initial claims for unemployment in the US.
Scott and Varian (2014, 2015)

MLTSA:

Model fit:

Optimization and Likelihood

1

MLTSA:

Optimization & Likelihood

change a, b to minimize χ2

Target Function

 

\chi^2 = \sum_{i=1}^N \frac{(y_i - (a~x_i + b))^2}{\sigma^2}
y_i = a~x_i + b

Linear regression

MLTSA:

Optimization & Likelihood

Target Function

 

\chi^2 = \sum_{i=1}^N \frac{(y_i - (a~x_i + b))^2}{\sigma^2}
y_i = a~x_i + b

Univariate Linear regression

(figure: χ² surface over parameters a and b)

N(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)

Gaussian

Probability of data given a model, e.g. Gaussian

 

Maximizing the likelihood, we seek the parameters that maximize the probability of the observed data under the chosen model

Likelihood: the probability of the data given the model, viewed as a function of the model parameters. Same Gaussian assumptions

 

MLTSA:

Likelihood

unknown variables

unknown variables

L(\mu, \sigma | \vec{x}) \sim \frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-(\vec{x}-\mu)^2}{2\sigma^2}\right)
P(\vec{x} | \mu, \sigma) \sim \frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-(\vec{x}-\mu)^2}{2\sigma^2}\right)
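Below, a minimal sketch of what "maximizing the likelihood" means in practice (the data, parameter grid, and function names are illustrative, not from the slides): evaluate the Gaussian log-likelihood of a sample for trial values of μ and keep the best one.

import numpy as np

# Gaussian log-likelihood of a sample x for trial parameters (mu, sigma);
# logs turn the product over points into a numerically stable sum
def log_likelihood(mu, sigma, x):
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2
                  - np.log(np.sqrt(2 * np.pi) * sigma))

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=0.5, size=1000)  # illustrative data

# brute-force grid maximization over mu (sigma held fixed), for illustration
mus = np.linspace(1.5, 2.5, 201)
best_mu = mus[np.argmax([log_likelihood(m, 0.5, x) for m in mus])]
print(best_mu)  # ~2.0: the maximum-likelihood estimate tracks the sample mean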

MLTSA:

Optimization

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

Gradient descent is an iterative algorithm that starts from a random point on a function and travels down its slope in steps until it reaches the lowest point of that function.

(Towards Data Science)

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

Gradient Descent

p_\mathrm{new} := p_\mathrm{old} - \eta \nabla Q(p_\mathrm{old})

MLTSA:

Optimization

  1. Choose a target function Q(p) of the parameters p

  2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))

  3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

    Repeat steps 4, 5, 6 until "convergence":

  4. Calculate the gradient Q' of the target function for the current parameter values, calculating it over all observations in the training set

  5. Calculate the next step size for each feature:
    stepsize = Q'(p_now) * η

  6. Calculate the new parameters p_new as:
    p_new = p_now - stepsize
\nabla Q(p_\mathrm{final}) \in [-\epsilon, \epsilon]

gradient descent algorithm

"convergence" is reached when the gradient is ~0: with ε  tollrance

 

p_\mathrm{new} := p_\mathrm{old} - \eta \nabla Q(p_\mathrm{old})
Q: ~\mathrm{target~function}\\ p: ~\mathrm{parameters}\\ \eta : ~\mathrm{learning~rate}\\ \epsilon : ~\mathrm{tolerance}
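A minimal sketch of the recipe above for the univariate linear regression target (illustrative data and hyperparameters; σi = 1, so Q is the unweighted χ²):

import numpy as np

# batch gradient descent for Q(a, b) = chi^2 = sum_i (y_i - (a x_i + b))^2
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, x.size)  # truth: a=2.5, b=1.0

a, b = rng.normal(size=2)  # step 2: random initial parameters
eta = 1e-4                 # step 3: learning rate
for _ in range(20000):
    r = y - (a * x + b)            # residuals over ALL observations
    grad_a = -2 * np.sum(r * x)    # step 4: dQ/da
    grad_b = -2 * np.sum(r)        #         dQ/db
    a -= eta * grad_a              # steps 5-6: p_new = p_old - eta * grad
    b -= eta * grad_b
    if max(abs(grad_a), abs(grad_b)) < 1e-6:  # convergence: |grad| ~ 0
        break
print(a, b)  # approaches (2.5, 1.0)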

MLTSA:

Optimization

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

MLTSA:

Optimization

p_{\mathrm{new},j} = p_{\mathrm{old},j} - \eta_j \sum_{i=1}^N \frac{\partial Q_i(p_\mathrm{old})}{\partial p_j}

Gradient Descent

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

MLTSA:

Optimization

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

Add stochasticity to avoid getting stuck in a local minimum

MLTSA:

Optimization

\nabla Q(p_\mathrm{final}) \in [-\epsilon, \epsilon]

stochastic gradient descent algorithm

"convergence" is reached when the gradient is ~0: with ε  tollrance

 

p_\mathrm{new} := p_\mathrm{old} - \eta \nabla Q(p_\mathrm{old})
Q: ~\mathrm{target~function}\\ p: ~\mathrm{parameters}\\ \eta : ~\mathrm{learning~rate}\\ \epsilon : ~\mathrm{tolerance}

MLTSA:

Optimization

  1. Choose a target function Q(p) of the parameters p

  2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))

  3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

    Repeat steps 4, 5, 6 until "convergence":

  4. Calculate the gradient Q' of the target function for the current parameter values on a subset of the observations (extreme: size = 1)

  5. Calculate the next step size for each feature:
    stepsize = Q'(p_now) * η

  6. Calculate the new parameters p_new as:
    p_new = p_now - stepsize
\nabla Q(p_\mathrm{final}) \in [-\epsilon, \epsilon]

stochastic gradient descent algorithm

"convergence" is reached when the gradient is ~0: with ε  tollrance

 

p_\mathrm{new} := p_\mathrm{old} - \eta \nabla Q(p_\mathrm{old})
Q: ~\mathrm{target~function}\\ p: ~\mathrm{parameters}\\ \eta : ~\mathrm{learning~rate}\\ \epsilon : ~\mathrm{tolerance}

MLTSA:

Optimization

  1. Choose a target function Q(p) of the parameters p

  2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))

  3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

    Repeat steps 4, 5, 6 until "convergence":

  4. Calculate the gradient Q' of the target function for the current parameter values on a randomly selected subset of the observations (extreme: size = 1)

  5. Calculate the next step size for each feature:
    stepsize = Q'(p_now) * η

  6. Calculate the new parameters p_new as:
    p_new = p_now - stepsize

stochastic gradient descent algorithm
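A minimal sketch of the stochastic variant (illustrative data; a mini-batch of 10 points per step, with size = 1 as the extreme case):

import numpy as np

# stochastic gradient descent for the same chi^2 target: each step
# estimates the gradient on a random mini-batch instead of all N points
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, x.size)

a, b = rng.normal(size=2)
eta, batch = 1e-3, 10          # illustrative learning rate and batch size
for _ in range(20000):
    i = rng.choice(x.size, size=batch, replace=False)  # random subset
    r = y[i] - (a * x[i] + b)
    a -= eta * (-2 * np.sum(r * x[i]))  # noisy estimate of dQ/da
    b -= eta * (-2 * np.sum(r))         # noisy estimate of dQ/db
print(a, b)  # jitters around (2.5, 1.0); the noise helps escape local minima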

MLTSA:

Optimization

  1. Choose a target function Q(p) of the parameters p

  2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))

  3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

    Repeat steps 4, 5, 6 until "convergence":

  4. Calculate the gradient Q' of the target function for the current parameter values, calculating it over all observations in the training set

  5. Calculate the next step size for each feature:
    stepsize = Q'(p_now) * η

  6. Calculate the new parameters p_new as:
    p_new = p_now - stepsize

gradient descent algorithm

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

MLTSA:

Optimization

p_{\mathrm{new},j} = p_{\mathrm{old},j} - \eta_j \sum_{i\in S} \frac{\partial Q_i(p_\mathrm{old})}{\partial p_j}

Stochastic Gradient Descent

where the sum runs over a randomly chosen subset S of the full set of N observations

Target Function

 

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})
y_i = (a~x_i + b)

Univariate Linear regression

(figure: χ² surface over parameters a and b)

idea 2: start several optimizations in parallel

(we will see this in the next class)

MLTSA:

Optimization

MLTSA:

Bayes Theorem &

Bayesian statistics

2

MLTSA:

combinatorial statistics

P(A|B)P(B) = P(B|A)P(A)

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

P(A|B)

probability of A conditioned on B

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

P(A|B)

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

P(A|B)
P(A|B) = P(A)

 A and B are independent

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

P(A|B)

A=red; B=blue

P(A)= ?

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

A=red; B=blue

P(A)= ?
P(A|B) = \frac{P(A\cap B)}{P(B)}
P(A|B)

MLTSA:

combinatorial statistics

P(A|B)

probability of A conditioned on B

new sample space after B happened

P(A|B) = \frac{P(A\cap B)}{P(B)}

(Venn diagram: overlapping events A and B)
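A tiny numerical sketch of P(A|B) = P(A∩B)/P(B) by direct counting (the die-roll events here are illustrative, not from the slides):

import numpy as np

# P(A|B) by counting over the equally likely outcomes of one die roll:
# A = "roll is even", B = "roll > 3"
rolls = np.arange(1, 7)
A = rolls % 2 == 0
B = rolls > 3
p_B = B.mean()               # 3/6
p_A_and_B = (A & B).mean()   # {4, 6} -> 2/6
print(p_A_and_B / p_B)       # P(A|B) = 2/3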

MLTSA:

combinatorial statistics

P(A|B)P(B) = P(B|A)P(A)
P(A|B) = \frac{P(B|A)P(A)}{P(B)}

MLTSA:

Bayes theorem

P(\mathrm{model}|\mathrm{data})P(\mathrm{data}) = P(\mathrm{data}|\mathrm{model})P(\mathrm{model})
P(\mathrm{model}|\mathrm{data}) = \frac{P(\mathrm{data}|\mathrm{model})P(\mathrm{model})}{P(\mathrm{data})}

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

posterior

likelihood

prior

evidence

model parameters

data

\theta

D

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

constraints on the model

e.g. flux from a star is never negative

P(f<0) = 0; P(f≥0) = 1

prior:

model parameters

data

\theta

D

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

D

prior:

constraints on the model

people's weight <1000lb

& people's weight >0lb

P(w) ~ N(105lb, 90lb)

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

D

prior:

constraints on the model

people's weight <1000lb

& people's weight >0lb

P(w) ~ N(105lb, 90lb)

the prior should not be 0 anywhere the probability might be nonzero

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

D

prior:

"uninformative prior"

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

D

prior:

"uninformative prior"

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

likelihood:

this is our model

model parameters

data

\theta

D

y = a~x + b + \epsilon, ~~\epsilon \sim N(0, \sigma^2) ~~\Rightarrow~~ P(D|\theta) = \prod_{i=1}^N N\big(y_i - (a~x_i + b);~0, \sigma^2\big)

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

evidence

????

model parameters

data

\theta

D

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

evidence

????

model parameters

data

\theta

D

it does not matter if I want to use this for model comparison

MLTSA:

Bayes theorem

P(\theta_1|\mathrm{D}) = \frac{P(\mathrm{D}|\theta_1)P(\theta_1)}{P(\mathrm{D})}

model parameters

data

\theta

D

P(\theta_2|\mathrm{D}) = \frac{P(\mathrm{D}|\theta_2)P(\theta_2)}{P(\mathrm{D})}

it does not matter if I want to use this for model comparison

which has the highest posterior probability?

P(\theta|\mathrm{D}) \propto{P(\mathrm{D}|\theta)P(\theta)}
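Concretely, the evidence P(D) cancels when comparing two parameter sets on the same data, so only likelihood × prior matters:

\frac{P(\theta_1|\mathrm{D})}{P(\theta_2|\mathrm{D})} = \frac{P(\mathrm{D}|\theta_1)P(\theta_1)}{P(\mathrm{D}|\theta_2)P(\theta_2)}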

MLTSA:

Bayes theorem

posterior: joint probability distribution of a parameter set (θ, e.g. (m, b)) conditioned upon some data D and a model hypothesis f

evidence: marginal likelihood of data under the model

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

prior: "intellectual" knowledge about the model parameters, conditioned on a model hypothesis f. This should come from domain knowledge or from data other than the dataset under examination

MLTSA:

Bayes theorem

posterior: joint probability distribution of a parameter set (θ, e.g. (m, b)) conditioned upon some data D and a model hypothesis f

evidence: marginal likelihood of data under the model

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

prior: "intellectual" knowledge about the model parameters, conditioned on a model hypothesis f. This should come from domain knowledge or from data other than the dataset under examination

P(D|f) = \int_{-\infty}^\infty P(D|\theta,f)P(\theta|f)d\theta

in reality, all of these quantities are conditioned on the form of the model f: this is a model-fitting methodology, not a model-selection one

MLTSA:

 

State Space Model

4

MLTSA:

state space model

motion equations for the position or "state" of a spacecraft

x(t): location

 y(t): information that can be observed from a tracking device such as velocity and azimuth

MLTSA:

state space model

  • The state process is assumed to be a Markov process; this means that the future x_{t>\mathrm{now}} and the past x_{t<\mathrm{now}} are independent conditional on the present x_t.
  • The second condition is that the observations y_t are independent given the states x_t.

MLTSA:

state space model

Y_t=x_t\beta+\epsilon_t;~~\epsilon_t∼N(0,\sigma^2_\epsilon)
x_{t+1} =x_{t} + \nu_t;~~\nu_t∼N(0,\sigma^2_\nu)

unobserved state

Underlying state x is a time varying Markovian process (the position of the spacecraft)

 

The observed variable depends at least on the state and on noise.

 

Other elements (e.g. seasonality) can be included in the model too
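A minimal numpy sketch of this model with β = 1 (a pure local level; all values illustrative): the latent state is a random walk and we only ever observe a noisy version of it.

import numpy as np

# simulate x_{t+1} = x_t + nu_t (latent Markovian state) and
# Y_t = x_t + eps_t (the observed series); beta = 1, values illustrative
rng = np.random.default_rng(1)
T, sigma_nu, sigma_eps = 200, 0.1, 0.5

x = np.cumsum(rng.normal(0, sigma_nu, T))  # unobserved random-walk state
Y = x + rng.normal(0, sigma_eps, T)        # what the tracking device sees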

MLTSA:

BSTS

we can write a Bayesian structural model like this:

Y_t=\mu_t+x_t\beta+S_t+\epsilon_t;~~\epsilon_t∼N(0,\sigma^2_\epsilon)\\ \mu_t= \mu_{t-1}+\nu_t;~~\nu∼N(0,\sigma^2_\nu)

local level

state

seasonal trend

MLTSA:

BSTS

we can write a Bayesian structural model like this:

Y_t=\mu_t+x_t\beta+S_t\alpha+\epsilon_t;~~\epsilon_t∼N(0,\sigma^2_\epsilon)
\mu_{t+1} =\mu_{t} + \nu_t;~~\nu_t∼N(0,\sigma^2_\nu)

seasonal variations

unobserved trend

It's a Markovian process:

stochastic process with 1-step memory

there is a hidden or latent process x_t called the state process (the position of the spacecraft)
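As a hedged sketch of fitting such a structural model in Python: statsmodels' UnobservedComponents fits the local level model by maximum likelihood (a frequentist cousin of BSTS; a fully Bayesian fit would put priors on the variances and sample them with MCMC, as the bsts R package does). The data here are simulated.

import numpy as np
import statsmodels.api as sm

# fit a local level structural model to a simulated series
rng = np.random.default_rng(1)
level = np.cumsum(rng.normal(0, 0.1, 200))  # latent mu_t
Y = level + rng.normal(0, 0.5, 200)         # observed Y_t

res = sm.tsa.UnobservedComponents(Y, level='local level').fit(disp=False)
print(res.summary())            # estimates of sigma^2_eps and sigma^2_nu
mu_hat = res.smoothed_state[0]  # smoothed estimate of the latent level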

MLTSA:

BSTS

a bit more complex

MLTSA:

BSTS

A model that accepts regression on exogenous variables allows feature extraction: which features are important in the prediction? 

e.g. CO2 or Irradiance in climate change (HW)

MLTSA:

 

Model Optimization

 with MCMC

5

MLTSA:

Monte Carlo Markov Chain

stochastic 

"markovian" 

sequence

 

posterior: joint probability distribution of a parameter set (m, b) conditioned upon some data D and a model hypothesis f

MLTSA:

MCMC

Goal: sample the posterior distribution

(figure: posterior samples over slope and intercept)

Goal: sample the posterior distribution

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

stochasticity allows us to explore the whole surface but spend more time in interesting spots

Goal: sample the posterior distribution
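A minimal sketch of this sampler for the straight-line model (flat priors assumed, so the posterior is proportional to the Gaussian likelihood; the accept/reject test is done in log space to avoid numerical underflow; data and proposal scale are illustrative):

import numpy as np

# Metropolis sampling of the posterior of (m, b) for y = m x + b
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, x.size)  # truth: m=2, b=1

def log_post(theta):          # log posterior, flat priors, sigma_i = 1
    m, b = theta
    return -0.5 * np.sum((y - (m * x + b)) ** 2)

current = np.array([0.0, 0.0])                 # theta_0
chain = []
for _ in range(20000):
    new = current + rng.normal(0, 0.05, 2)     # symmetric Gaussian proposal
    # accept with probability min(1, p_new/p_curr), i.e. always if better;
    # in log space: accept if log r < log p_new - log p_curr
    if np.log(rng.uniform()) < log_post(new) - log_post(current):
        current = new
    chain.append(current)
chain = np.array(chain)   # samples of (m, b); discard an initial burn-in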

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                 

Questions:

1. how do I choose the next point?

Any Markovian ergodic process

 

 

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

A process is Markovian if the next state of the system is determined stochastically as a perturbation of the current state of the system, and only the current state of the system, i.e. the system has no memory of earlier states (a memory-less process).

 

A Markovian Process

Definition

Ergodic Process

(given enough time) the entire parameter space would be sampled.

 
 

At equilibrium, each elementary process should be equilibrated by its reverse process.

reversible Markov process

 

Detailed balance is a sufficient condition for the chain to have π as its stationary (equilibrium) distribution

\pi (x_1)P(x_2 | x_1)=\pi (x_2)P(x_1 | x_2)

Metropolis Rosenbluth Rosenbluth Teller 1953 - Hastings 1970
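A sketch of why the Metropolis rule satisfies this condition (assuming a symmetric proposal q(x2|x1) = q(x1|x2), so that the transition probability is P(x2|x1) = q(x2|x1) min(1, π(x2)/π(x1))):

\pi(x_1)P(x_2|x_1) = q(x_2|x_1)\min\big(\pi(x_1), \pi(x_2)\big) = \pi(x_2)P(x_1|x_2)

the right-hand side follows because both q and the min are symmetric under swapping x1 and x2, so the acceptance rule satisfies detailed balance with respect to the target π.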

Definition

it can be shown that

If the chains are a Markovian Ergodic process

the algorithm is guaranteed to explore the entire likelihood surface given infinite time

 
 

This is in contrast to gradient descent, which can get stuck in local minima or in local saddle points. 

how to choose the next point

how you make this decision names the algorithm

simulated annealing (good for multimodal)

parallel tempering (good for multimodal)
differential evolution (good for covariant spaces)

Gibbs sampling (move along one variable at a time)

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

The chains: the algorithm creates a "chain" (a random walk) that "explores" the likelihood surface.

It is more efficient to run many chains in parallel, each exploring the surface: an "ensemble"

The path of the chains can be shown along each feature

MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

(trace plot: feature value vs step)

Examples of how to choose the next point

affine invariant: the emcee package
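A hedged sketch of emcee's ensemble sampler on the same straight-line posterior (API calls as in the emcee documentation; data and walker setup are illustrative):

import numpy as np
import emcee

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, x.size)

def log_prob(theta):                 # log posterior, flat priors
    m, b = theta
    return -0.5 * np.sum((y - (m * x + b)) ** 2)

nwalkers, ndim = 32, 2
p0 = rng.normal(0, 1, (nwalkers, ndim))       # one start per walker
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 5000)
samples = sampler.get_chain(discard=1000, flat=True)  # drop burn-in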


MLTSA:

MCMC

 

 choose a starting point in the parameter space: current = θ0 = (m0,b0)

 WHILE convergence criterion is NOT met:

       calculate the current posterior pcurr ∝ P(D|θcurr,f) P(θcurr)

       //proposal

       choose a new set of parameters new = θnew = (mnew,bnew)

       calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew)

       IF pnew/pcurr > 1:

                current = new

       ELSE:

                //probabilistic step: accept with probability pnew/pcurr

               draw a random number r ∈ U[0,1]

               IF r < pnew/pcurr:
                          current = new

               ELSE:

                          pass // do nothing

 

(trace plot: feature value vs step)

MCMC convergence

  1. check autocorrelation within a chain (Raftery)
  2. check that all chains converged to the same region (a stationary distribution, Gelman-Rubin)
  3. mean at beginning = mean at end (Geweke)
  4. check that the entire chain reached the stationary distribution (or a final fraction of the chain, Heidelberg-Welch, using the Cramér-von Mises statistic)
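A minimal sketch of check 2, the Gelman-Rubin diagnostic (one common variant of the formula; the chains here are synthetic): R-hat compares between-chain and within-chain variance, and values near 1 indicate the chains reached the same stationary distribution.

import numpy as np

def gelman_rubin(chains):
    """chains: (n_chains, n_steps) samples of one parameter."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)                # R-hat; ~1 means converged

rng = np.random.default_rng(0)
good = rng.normal(0, 1, (4, 2000))             # 4 chains, same target
bad = good + np.arange(4)[:, None]             # 4 chains stuck in different places
print(gelman_rubin(good), gelman_rubin(bad))   # ~1.0 vs clearly > 1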

VISUALIZATION 

OF THE WEEK

when should you use circular plots?

 

cyclic visualizations

(annual behavior)

 

when you do not want to emphasize the hierarchical time order

references

Elements of Statistical Learning Chapter 7

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

additionally: MCMC

emcee: The MCMC Hammer

https://arxiv.org/abs/1202.3665

 

Forecasting at scale

(the Facebook Prophet paper)

https://peerj.com/preprints/3190

READING

HW

Structural Time Series modeling with Prophet

 
