Machine Learning for Time Series Analysis III

State Space Models and Bayesian Analysis

Fall 2025 - UDel PHYS 641
dr. federica bianco

@fedhere

fbianco@udel.edu

3 Bayesian statistics

combinatorial statistics

Bayes' theorem

1 model fit: gradient descent

optimization

gradient descent

stochastic gradient descent

2 model selection: principle of parsimony

Ockham's razor

Information theory

AIC BIC MDL

4 optimization with MCMC

MCMC

state space models for time series analysis

BSTS fitting

this slide deck:

https://slides.com/federicabianco/mltsa25_04

MLTSA:

Cross Validation

machine learning standard practices

Cross validation

test train validation

train parameters on training set

run only once on the test set to assess the model performance

Cross validation

test + train + validation

train parameters on training set

adjust parameters on validation set

run only once on the test set to assess the model performance

Cross validation

k-fold cross validation

Cross validation

https://scikit-learn.org/stable/modules/cross_validation.html
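A minimal sketch of how k-fold cross validation splits the data, using plain NumPy. The helper name `k_fold_indices` and the toy sizes are made up for this illustration; scikit-learn (linked above) provides ready-made splitters. Note that shuffling ignores temporal order, so for time series an ordered split may be more appropriate.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index arrays for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        # fold i is held out for validation, the other k-1 folds are used for training
        validation = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, validation

splits = list(k_fold_indices(10, k=5))
for train, validation in splits:
    print("train:", sorted(train.tolist()), "validation:", sorted(validation.tolist()))
```

Each observation ends up in the validation fold exactly once; the test set is kept out of this loop entirely and is used only once at the end.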

MLTSA RECAP:

gradient descent

math and vis for your review

1.1

Target Function

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})

y_i = (a~x_i + b)

Univariate Linear regression

MLTSA:

Optimization

Target Function

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})

y_i = (a~x_i + b)

Univariate Linear regression

Add stochasticity to avoid getting stuck in a local minimum

MLTSA:

Optimization

\Delta Q(p_\mathrm{final}) \in [-\epsilon, \epsilon]

stochastic gradient descent algorithm

"convergence" is reached when the gradient is ~0, within a tolerance ε

{\displaystyle p_\mathrm{new}:=p_\mathrm{old}-\eta \nabla Q(p)}

Q: ~\mathrm{target ~function}\\ p: ~\mathrm{parameters}\\ \eta : ~\mathrm{learning rate}\\ \epsilon : ~\mathrm{tolerance}

MLTSA:

Optimization

https://datascience.stackexchange.com/questions/52884/possible-for-batch-size-of-neural-network-to-be-too-small

1. Choose a target function Q(p) of the parameters p
2. Choose a (random) initial value for the parameters (e.g. p0 = (a0, b0))
3. Choose a learning rate η (this could be a multidimensional vector ηi setting a different learning rate for different features)

Repeat steps 4, 5, 6 until "convergence":
4. Calculate the gradient Q' of the target function for the current parameter values on a subset of the observations (extreme: size=1)
5. Calculate the next step sizes for each feature:
   stepsize = Q'(p_now) * η
6. Calculate the new parameters p_new as:
   p_new = p_now - stepsize

\Delta Q(p_\mathrm{final}) \in [-\epsilon, \epsilon]

stochastic gradient descent algorithm

"convergence" is reached when the gradient is ~0, within a tolerance ε

{\displaystyle p_\mathrm{new}:=p_\mathrm{old}-\eta \nabla Q(p)}

Q: ~\mathrm{target ~function}\\ p: ~\mathrm{parameters}\\ \eta : ~\mathrm{learning rate}\\ \epsilon : ~\mathrm{tolerance}
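The stochastic gradient descent steps can be sketched for the line fit with the χ² target function. This is an illustration under assumed values: the synthetic data, the learning rate, the batch size, and the fixed number of iterations (used here in place of a tolerance check) are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data from a line y = a x + b with known uncertainty sigma (assumed values)
a_true, b_true, sigma = 2.0, 1.0, 0.5
x = np.linspace(0, 1, 200)
y = a_true * x + b_true + rng.normal(0, sigma, x.size)

def grad_chi2(p, xb, yb, sig):
    """Gradient of chi^2 with respect to p = (a, b), evaluated on a batch."""
    a, b = p
    r = (yb - (a * xb + b)) / sig**2
    return np.array([-2 * np.sum(r * xb), -2 * np.sum(r)])

p = rng.normal(size=2)    # random initial parameters (a0, b0)
eta = 1e-3                # learning rate
batch = 10                # size of the random subset of observations

for step in range(5000):
    idx = rng.choice(x.size, size=batch, replace=False)
    stepsize = eta * grad_chi2(p, x[idx], y[idx], sigma)  # gradient on the subset times eta
    p = p - stepsize                                      # p_new = p_now - stepsize

print("SGD estimate (a, b):", p)
print("least squares (a, b):", np.polyfit(x, y, 1))
```

Because each step only sees a random subset, the parameters keep jittering around the minimum rather than settling exactly on it; a smaller learning rate reduces the jitter at the cost of slower progress.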

MLTSA:

Optimization

p_{j,\mathrm{new}} = p_{j,\mathrm{old}} - \eta_j \left.\frac{\partial Q(p)}{\partial p_j}\right|_{i \in S}

Stochastic Gradient Descent

where the gradient is evaluated only on the observations i in S, a subset of the full set of N observations

Target Function

\chi^2 = \sum_{i=1}^N(\frac{(y_i - (a~x_i + b))^2}{\sigma_i^2})

y_i = (a~x_i + b)

Univariate Linear regression

idea 2. start a bunch of parallel optimizations

(we will see it in next class)

MLTSA:

Optimization

MLTSA recap:

Correlation

1.2

MLTSA:

formal definition of correlation function

probability and statistics

2

Crash Course in Statistics

free statistics book: http://onlinestatbook.com/

Introduction to Statistics: An Interactive e-Book

David M. Lane

statistics

takes us from observing a limited number of samples to inferring properties of the population

3

descriptive statistics:

we summarize the properties of a distribution

descriptive statistics

Basic Probability Frequentist interpretation

<=>

fraction of times something happens

probability of it happening

P(obs|model) = P(model|obs)

Basic Probability Frequentist interpretation

<=>

A DISTRIBUTION

The relative frequency of occurrence of values:

if I had infinite coin tosses, what fraction of times would I get heads vs tails?

Coin toss => Bernoulli distribution

p(tail) = 0.5

fraction of times something happens

probability of it happening

P(obs|model) = P(model|obs)

Basic Probability Frequentist interpretation

Basic Probability Bayesian interpretation

represents a level of certainty relating to a potential outcome or idea:

if I believe the coin is unfair (tricked) then even if I get a head and a tail I will still believe I am more likely to get heads than tails

<=>

fraction of times something happens

probability of it happening

P(obs|model) = P(model|obs)

Basic Probability Frequentist interpretation

fraction of times something happens

probability of it happening

<=>

P(obs|model) P(model) = P(model|obs) P(obs)

Coin toss => Bernoulli distribution

p(tail) = 0.3

p(tail) = 0.5

P(obs|model) = P(model|obs)

Basic Probability Bayesian interpretation

TAXONOMY

central tendency: mean, median, mode

spread: variance, interquartile range

Descriptive Statistics deals with the characterization of distributions

descriptive statistics:

we summarize the properties of a distribution

{\displaystyle \mu _{n}=\int _{-\infty }^{\infty }(x-c)^{n}\,f(x)\,\mathrm {d} x.}

mean: n=1

\mu= \frac{1}{N}\sum_1^N x_i

other measures of central tendency:

median: 50% of the distribution is to the left,

50% to the right

mode: most popular value in the distribution

The moments of a distribution

central tendency (n=1)

descriptive statistics:

we summarize the properties of a distribution

{\displaystyle \mu _{n}=\int _{-\infty }^{\infty }(x-c)^{n}\,f(x)\,\mathrm {d} x.}

variance: n=2

\operatorname {Var} (X)=\operatorname {E} \left[(X-\mu )^{2}\right]

\operatorname {\sigma} (X)=\sqrt{\operatorname {E} \left[(X-\mu )^{2}\right]}

standard deviation

68%

The moments of a distribution

spread (n=2)

descriptive statistics:

we summarize the properties of a distribution

{\displaystyle \mu _{n}=\int _{-\infty }^{\infty }(x-c)^{n}\,f(x)\,\mathrm {d} x.}

variance: n=2

\operatorname {Var} (X)=\operatorname {E} \left[(X-\mu )^{2}\right]

\operatorname {\sigma} (X)=\sqrt{\operatorname {E} \left[(X-\mu )^{2}\right]}

standard deviation

In a Gaussian distribution:

1σ contains 68% of the distribution

68%

descriptive statistics:

we summarize the properties of a distribution

{\displaystyle \mu _{n}=\int _{-\infty }^{\infty }(x-c)^{n}\,f(x)\,\mathrm {d} x.}

variance: n=2

\operatorname {Var} (X)=\operatorname {E} \left[(X-\mu )^{2}\right]

\operatorname {\sigma} (X)=\sqrt{\operatorname {E} \left[(X-\mu )^{2}\right]}

standard deviation

In a Gaussian distribution:

2σ contains 95% of the distribution

95%

descriptive statistics:

we summarize the properties of a distribution

{\displaystyle \mu _{n}=\int _{-\infty }^{\infty }(x-c)^{n}\,f(x)\,\mathrm {d} x.}

variance: n=2

\operatorname {Var} (X)=\operatorname {E} \left[(X-\mu )^{2}\right]

\operatorname {\sigma} (X)=\sqrt{\operatorname {E} \left[(X-\mu )^{2}\right]}

standard deviation

In a Gaussian distribution:

3σ contains 99.7% of the distribution

99.7%

Memorize the following:

1σ = 68%

2σ = 95%

3σ = 99.7%

5σ = 99.999971428%

= 1 in 3.5 million

p-value statistics

When we set a confidence value or interval on inferred quantities we imply that we had 1 in x chances of getting that result (technically "a result as extreme as that"; we will see this in more detail)
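The percentages to memorize can be recovered from the Gaussian error function. The sketch below uses only the Python standard library; the helper name `fraction_within` is made up for this example. The slide values for 1σ, 2σ and 3σ are rounded two-sided fractions, while the "1 in 3.5 million" figure for 5σ corresponds to a one-sided tail.

```python
import math

def fraction_within(k):
    """Fraction of a Gaussian distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"{k} sigma: {100 * fraction_within(k):.2f}%")

# one-sided tail probability beyond 5 sigma, expressed as "1 in x"
tail_5sigma = 0.5 * math.erfc(5 / math.sqrt(2))
one_in = 1 / tail_5sigma
print(f"5 sigma (one-sided): 1 in {one_in:.3g}")
```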

MLTSA:

formal definition of correlation function

CF = \frac{\lim_{T\to\infty}\int_T f(t)\cdot g(t)\,dt}{\sigma_f\sigma_g}

continuous

MLTSA:

the meaning of the correlation function

The correlation function defines the degree to which a function / dataset is representative of another function / dataset

high correlation

low correlation

MLTSA:

formal definition of correlation function

continuous case

CF = \frac{\lim_{T\to\infty}\int_T f(t)\cdot g(t)\,dt}{\sigma_f\sigma_g}

CF(j) = \frac{\sum_{i=1}^{N-j}(x_i-\mu_x)(y_{i+j}-\mu_y)}{\sigma_x\sigma_y}

continuous

discrete

MLTSA:

formal definition of correlation function

continuous case

CF(t) = \frac{\lim_{T\to\infty}\int_T f(t)\cdot g(t)\,dt}{\sigma_f\sigma_g}

dot product

What does it mean?

- the integral under a curve is the area under the curve

MLTSA:

formal definition of autocorrelation function

continuous case

dot product

What does it mean?

- the dot product is larger if the points are similar

CF(t) = \frac{\lim_{T\to\infty}\int_T f(t)\cdot g(t)\,dt}{\sigma_f\sigma_g}

https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs/data

IDENTICALLY
CORRELATED

Stock Market data from Kaggle (prob HW3)

CORRELATED

NOT

CORRELATED

PEARSON'S CORRELATION COEFFICIENT

(use this discussion and slides for the homework!)

If you have many time series, you can show a "correlation matrix" that indicates the amount of correlation between each pair of time series

MLTSA:

autocorrelation

MLTSA:

the meaning of the autocorrelation function

The autocorrelation function defines the degree to which a process has memory of itself

MLTSA:

formal definition of autocorrelation function

continuous case

ACF(\Delta t) = \frac{\lim_{T\to\infty}\int_T y(t)\cdot y(t+\Delta t)\,dt}{\sigma(y)^2}

\Delta t

is the "lag" operator

dot product

MLTSA:

formal definition of autocorrelation function

discrete case

ACF(j) = \frac{\sum_{i=1}^{N-j}(y_i-E(y))(y_{i+j}-E(y))}{\sum_{i=1}^N(y_i-E(y))^2}

defined for a lag j (of j steps)

ACF(j) = \frac{\sum_{i=1}^{N-j}(y_i-\mu)(y_{i+j}-\mu )}{\sigma^2}

MLTSA:

formal definition of autocorrelation function

discrete case for lag j

Definition

ACF(j) = \frac{\sum_{i=1}^{N-j}(y_i-\mu)(y_{i+j}-\mu )}{\sigma^2}
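A minimal NumPy sketch of the discrete autocorrelation at lag j, normalized by the sum of squared deviations so that ACF(0) = 1. The function name `acf` and the two test series (white noise and a sine wave with a period of 20 steps) are choices made for this illustration.

```python
import numpy as np

def acf(y, j):
    """Autocorrelation of the series y at a lag of j steps."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()                 # deviations from the mean
    n = d.size
    # products of points separated by j steps, over the total sum of squares
    return np.sum(d[: n - j] * d[j:]) / np.sum(d ** 2)

rng = np.random.default_rng(0)
white = rng.normal(size=500)                  # no memory of itself
t = np.arange(200)
periodic = np.sin(2 * np.pi * t / 20)         # repeats every 20 steps

print("white noise, lag 1:", acf(white, 1))
print("periodic, lag 10 (half period):", acf(periodic, 10))
print("periodic, lag 20 (full period):", acf(periodic, 20))
```

The periodic series is strongly anti-correlated with itself at half a period and strongly correlated at a full period, while white noise shows no significant correlation at any non-zero lag.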

MLTSA:

formal definition of partial autocorrelation function

discrete case for lag 2

Definition

PACF(x)_2= \dfrac{\text{cov}(x_t, x_{t-2}| x_{t-1})}{\sigma(x_t|x_{t-1})\sigma(x_{t-2}|x_{t-1})}

MLTSA:

autocorrelation: self-similarity of the time series at a lag j

ACF(j) = \frac{\sum_{i=1}^{N-j}(y_i-\mu)(y_{i+j}-\mu )}{\sigma^2}

MLTSA:

partial autocorrelation function

like ACF but controls for lag interdependence

y(t) \sim y(t-1) + y(t-2)

MLTSA:

Autocorrelation and

Partial-autocorrelation

uncertainty region

(2-sigma)

any lag whose value falls within this region is not significant

MLTSA:

Autocorrelation and

Partial-autocorrelation

MLTSA:

formal definition of partial autocorrelation function

discrete case for lag 3

Definition

PACF(x)_3= \dfrac{\text{cov}(x_t, x_{t-3}| x_{t-1}, x_{t-2})}{\sigma(x_t|x_{t-1},x_{t-2})\sigma(x_{t-3}|x_{t-1},x_{t-2})}
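One way to read the conditional covariance in the PACF definition is as the correlation between what is left of x_t and of x_{t-k} after the intermediate lags have been regressed out. The sketch below implements that reading with NumPy least squares; the function name `pacf_lag` and the simulated AR(1) series are assumptions made for this example, not the estimator used by any particular library.

```python
import numpy as np

def pacf_lag(x, k):
    """Partial autocorrelation at lag k via residuals of regressions on lags 1..k-1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    now, past = x[k:], x[: n - k]              # x_t and x_{t-k}
    if k == 1:
        return np.corrcoef(now, past)[0, 1]    # nothing in between to control for
    # intermediate lags x_{t-1} ... x_{t-k+1}, aligned with x_t
    Z = np.column_stack([x[k - m: n - m] for m in range(1, k)])
    res_now = now - Z @ np.linalg.lstsq(Z, now, rcond=None)[0]
    res_past = past - Z @ np.linalg.lstsq(Z, past, rcond=None)[0]
    return np.corrcoef(res_now, res_past)[0, 1]

# simulated AR(1) process: only lag 1 should carry a direct dependence
rng = np.random.default_rng(1)
x = np.zeros(2000)
for t in range(1, x.size):
    x[t] = 0.7 * x[t - 1] + rng.normal()

pacf1, pacf2, pacf3 = (pacf_lag(x, k) for k in (1, 2, 3))
print(pacf1, pacf2, pacf3)
```

For this series the ACF decays slowly over many lags, but the PACF is close to zero beyond lag 1, which is the hint used later to guess p in AR(p).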

MODEL

Model: a model is a mathematical formula that describes a process

it should predict some quantity (endogenous variable) from input observations (data)

it represents the way in which the endogenous variable is generated by the data

y = \sum_{n=0}^N a_n x^n

MODEL

Model: a model is a mathematical formula that describes a process

it should predict some quantity (endogenous variable) from input observations (data)

it represents the way in which the endogenous variable is generated by the data

a simplification of

y = \sum_{n=0}^N a_n x^n

MODEL

Model: a model is a mathematical formula that describes a process

it should predict some quantity (endogenous variable) from input observations (data)

it represents the way in which the endogenous variable is generated by the data

a simplification of

Parameter: the parameters of the model are the elements of the formula which I learn from the data

y = \sum_{n=0}^N a_n x^n

MODEL

Hyperparameters: what the model designer chooses before optimization

eg: the degree N of a polynomial fit (line fit N=1)

descriptive statistics:

we summarize the properties of a distribution

mean:

\mu= \frac{1}{N}\sum_{i=1}^N x_i = E(x)

variance:

\operatorname {Var} (X)=\operatorname {E} \left[(X-\mu )^{2}\right]

standard deviation:

\operatorname {\sigma} (X)=\sqrt{\operatorname {E} \left[(X-\mu )^{2}\right]} = \sqrt{\operatorname {Var} (X)}

covariance:

\operatorname {Cov} (X,Y)=\frac{1}{N}\sum_{i=1}^N{(x_i-E(x))(y_i-E(y))}\\ \operatorname {Cov} (X,Y)=\operatorname{Corr}(X,Y)\, \sigma(x) \sigma(y)

MLTSA:

stochastic and stationary time

processes

MLTSA:

Stochastic process

A random variable indexed by time.

For any subset of points in time the dependent variable follows a probability distribution

e.g.

p(x_{t1}…x_{tn}) \sim N(\mu, \sigma)

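import numpy as np
import pylab as pl

# 200 draws of Gaussian white noise: the same distribution at every time
pl.figure(figsize=(20,5))
N = 200
np.random.seed(100)
y = np.random.randn(N)
t = np.linspace(0, N, N, endpoint=False)
pl.plot(t, y, lw=2)
pl.xlabel("time")
pl.ylabel("y");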

Discrete time stochastic process

pl.hist(y[20:70])

pl.hist(y[100:150])

MLTSA:

Stochastic process

A random variable indexed by time.

For any subset of points in time the dependent variable follows a probability distribution

e.g.

p(x_{t1}…x_{tn}) \sim N(\mu, \sigma)

pl.figure(figsize=(20,5))
N = 200
np.random.seed(100)
t = np.linspace(0, N, N, endpoint=False)
# noise whose amplitude grows with time: the distribution changes along the series
y = np.random.randn(N) * t
pl.plot(t, y, lw=2)
pl.xlabel("time")
pl.ylabel("y");

Discrete time stochastic process

pl.hist(y[20:70])

pl.hist(y[100:150])
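The two histograms can also be compared numerically. The sketch below, which assumes the same seed and segment boundaries as the slides, contrasts the standard deviation of an early and a late segment for plain Gaussian noise and for noise multiplied by time; the variable names are made up for this example.

```python
import numpy as np

np.random.seed(100)
N = 200
t = np.linspace(0, N, N, endpoint=False)
noise = np.random.randn(N)      # same distribution at all times
growing = noise * t             # amplitude grows with time

ratios = {}
for name, series in [("noise", noise), ("noise * t", growing)]:
    early, late = series[20:70], series[100:150]
    # ratio of the spread in the late segment to the spread in the early segment
    ratios[name] = late.std() / early.std()
    print(f"{name}: std early = {early.std():.2f}, std late = {late.std():.2f}")
```

A ratio near 1 is what one expects when the two segments share the same distribution; a ratio far from 1 signals that the spread changes with time.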

MLTSA:

Stochastic process

A random variable indexed by time.

Definition

Note that for the process to be stochastic the variability has to be intrinsic, not just due to noise.

MLTSA:

strictly stationary process

p (t_i....t_{n+i}) \sim p(t_i + \Delta t....t_{n+i}+ \Delta t)

https://github.com/fedhere/MLTSA22_FBianco/blob/main/Lab2Distributions/StationaryTSAnimation.ipynb

A time series is strictly stationary if for any i and Δt

MLTSA:

strictly stationary process

Definition

A time series is strictly stationary if for any i and Δt

p (t_i....t_{n+i}) \sim p(t_i + \Delta t....t_{n+i}+ \Delta t)

which implies

\mu(t_i...t_{i+N}) = \mu(t_i+\Delta t...t_{i+N}+\Delta t)

\operatorname{Var}(t_i...t_{i+N}) = \operatorname{Var}(t_i+\Delta t...t_{i+N}+\Delta t)

MLTSA:

covariance stationary process

A time series is covariance stationary if for any i and Δt

\mu(t_i...t_{i+N}) = \mu(t_i+\Delta t...t_{i+N}+\Delta t)

\operatorname{Var}(t_i...t_{i+N}) = \operatorname{Var}(t_i+\Delta t...t_{i+N}+\Delta t)

\operatorname{Cov}(t_i...t_{i+N}, t_{i}+\tau...t_{i+N}+\tau) = \operatorname{Cov}(t_i+\Delta t...t_{i+N}+\Delta t, t_{i}+\tau+\Delta t...t_{i+N}+\tau+\Delta t)

Any two segments of a time series

have same mean, same variance,

same covariance

MLTSA:

covariance stationary process

Definition

A time series is covariance stationary if for any i and Δt

\mu(t_i...t_{i+N}) = \mu(t_i+\Delta t...t_{i+N}+\Delta t)

\operatorname{Var}(t_i...t_{i+N}) = \operatorname{Var}(t_i+\Delta t...t_{i+N}+\Delta t)

\operatorname{Cov}(t_i...t_{i+N}, t_{i}+\tau...t_{i+N}+\tau) = \operatorname{Cov}(t_i+\Delta t...t_{i+N}+\Delta t, t_{i}+\tau+\Delta t...t_{i+N}+\tau+\Delta t)

MLTSA:

The ARMA family of models

MLTSA:

the AR in ARMA: AutoRegressive

y(t) = a~y(t-1) + \epsilon

The behavior at time (t) depends linearly on the behavior at time (t-1)

MLTSA:

the AR in ARMA: AutoRegressive

y(t) = a~y(t-1) + \epsilon

The behavior at time (t) depends linearly on the behavior at time (t-1)

of course at time t I do not know the error of my prediction (ε) so the prediction is

\hat{y}(t) = a~y(t-1)

MLTSA:

the AR in ARMA: AutoRegressive

y(t) = \sum_{i=1}^P a_i~y(t-i) + \epsilon_t

The behavior at time (t) depends linearly on the behavior at times (t-1)...(t-P)

coefficients

This is y = ax:

a line with slope=a and intercept = 0
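Since AR(1) is a line through the origin in the plane (y(t-1), y(t)), its coefficient can be estimated with a one-parameter least squares fit. The sketch below simulates an AR(1) series with an assumed coefficient of 0.8 and recovers it; all names and values are choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
a_true = 0.8                       # assumed AR(1) coefficient
n = 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = a_true * y[t - 1] + rng.normal()

# least squares slope of y(t) against y(t-1), with the intercept fixed at 0
a_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)

# one-step-ahead prediction: the unknown error term is left out
y_next_pred = a_hat * y[-1]

print("estimated a:", a_hat)
print("prediction for the next step:", y_next_pred)
```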

MLTSA:

the MA in ARMA:

Moving Average

y(t)=\mu+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2 \epsilon_{t−2}+...+\theta_q \epsilon_{t-q}

A moving average process MA(q) is a process whose current value y(t) is stationary on average and depends linearly on the q past noise terms ε.

MA models noise around the mean.

MLTSA:

the MA in ARMA:

Moving Average

y(t)=\mu+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2 \epsilon_{t−2}+...+\theta_q \epsilon_{t-q}

A moving average process MA(q) is a process whose current value y(t) is stationary on average and depends linearly on the q past noise terms ε.

MA models noise around the mean.

At time t the data value, y(t), consists of a constant, μ, plus a fraction θ (the moving-average coefficient) of the previous random noise, plus the current random noise ε_t

\epsilon_t \sim N(0,s)

MLTSA:

the MA in ARMA:

Moving Average

y(t)=\mu+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2 \epsilon_{t−2}+...+\theta_q \epsilon_{t-q}

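import pylab as pl
import pandas as pd

# tss: the stock price time series loaded earlier, indexed by ticker
window = 30
tss["aa.us"].plot()
# centered rolling mean over `window` samples
tss["aa.us"].rolling(window, center=True).mean().plot(
        label="window:%d" % window, lw=3)

pl.legend();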

the average of the process

some random noise cause all processes are stochastic

MLTSA:

the MA in ARMA:

Moving Average

y(t)=\mu+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2 \epsilon_{t−2}+...+\theta_q \epsilon_{t-q}

import pylab as pl
import pandas as pd

tss["aa.us"].plot()
tss["aa.us"].rolling(30, center=False).mean().plot(
        label="window:%d"%30, lw=3)
    
pl.legend();

the average of the process

some random noise cause all processes are stochastic

MLTSA:

the MA in ARMA:

Moving Average

y(t)=\mu+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2 \epsilon_{t−2}+...+\theta_q \epsilon_{t-q}

expect to sell on average 30 cupcakes

every day adjust based on how many cupcakes you are off:

- if you have 6 unsold, make 27 (coefficient=0.5),

- if you ran out and 2 more customers came in for a cupcake, make 31
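The cupcake rule is an MA(1) adjustment: start from the average μ = 30 and correct by a fraction θ = 0.5 of yesterday's error. The sketch below encodes the two cases from the slide; the function name `cupcakes_to_make` is made up for this example, and a negative error means unsold cupcakes while a positive error means unmet demand.

```python
mu = 30        # average number of cupcakes sold per day
theta = 0.5    # moving-average coefficient

def cupcakes_to_make(yesterday_error):
    """Planned production: the mean plus a fraction theta of yesterday's error."""
    return mu + theta * yesterday_error

print(cupcakes_to_make(-6))   # 6 unsold yesterday
print(cupcakes_to_make(+2))   # 2 customers left without a cupcake yesterday
```

The actual sales y(t) then add today's own random error ε_t on top of this planned value.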

MLTSA:

the MA in ARMA:

Moving Average

y(t)=\mu+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2 \epsilon_{t−2}+...+\theta_q \epsilon_{t-q}

import pylab as pl
import pandas as pd

tss["aa.us"].plot()
tss["aa.us"].rolling(30, center=False).mean().plot(
        label="window:%d"%30, lw=3)
    
pl.legend();

coefficients

MLTSA:

the MA in ARMA:

Moving Average

Note that an AR(1) model is equivalent to an MA($\infty$) model.

Using repeated substitution, we can demonstrate:

AR(1) : y_t = a_1 y_{t-1}+\epsilon_t = a_1(a_1 y_{t-2}+\epsilon_{t-1})+\epsilon_t =\\ a_1^2 y_{t-2}+a_1 \epsilon_{t-1}+\epsilon_t = \\ a^3_1 y_{t-3}+a^2_1\epsilon_{t-2}+a_1 \epsilon_{t-1}+\epsilon_t = ... = MA(\infty)
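The repeated substitution can be checked numerically: an AR(1) series built recursively should match a weighted sum of past noise terms with weights a_1^k. The sketch assumes |a_1| < 1, so that the weights decay and the sum can be truncated; the coefficient 0.6 and the truncation at K = 50 terms are choices for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
a1 = 0.6                 # assumed AR(1) coefficient, |a1| < 1
n = 300
eps = rng.normal(size=n)

# AR(1) built recursively: y_t = a1 * y_{t-1} + eps_t
y_ar = np.zeros(n)
y_ar[0] = eps[0]
for t in range(1, n):
    y_ar[t] = a1 * y_ar[t - 1] + eps[t]

# MA form: y_t = sum_k a1^k * eps_{t-k}, truncated after K past terms
K = 50
weights = a1 ** np.arange(K + 1)
y_ma = np.array([
    np.sum(weights[: t + 1] * eps[t::-1][: K + 1])   # eps_t, eps_{t-1}, ... back to at most K steps
    for t in range(n)
])

max_diff = np.max(np.abs(y_ar - y_ma))
print("largest difference between the two forms:", max_diff)
```

The difference is set by the first dropped weight, a_1^(K+1), which is why a finite MA(q) can only approximate an AR(1) process.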

MLTSA:

are ARMA models good?

It turns out there is a theorem that ensures that they are! BUT:

- this theorem means that if you put infinite terms the model is exact... but we put a finite number of terms in of course

- this only holds for stationary processes

MLTSA:

the I in ARIMA:

Integrated

y(t)=\mu+\epsilon_t+\sum_{i=1}^P a_i~y(t-i) + \sum_{i=1}^Q\theta_i\epsilon_{t-i}

I: integrated. Differencing removes trends.

{\displaystyle y^{'}(t)=y(t) - y(t-1)+\varepsilon _{t}}

Turns out this model would not converge if the time series is not stationary.

MLTSA:

ARIMA

y(t)=\mu+\epsilon_t+\sum_{i=1}^P a_i~y(t-i) + \sum_{i=1}^Q\theta_i\epsilon_{t-i} + y^{'}(t)

MLTSA:

the I in ARIMA:

Autoregressive Integrated Moving Average model

Autoregression (AR): a model in which a changing variable regresses on its own lagged values.
Integrated (I): represents the differencing of raw observations to allow the time series to become stationary. Values are replaced by the difference between each value and the previous (lagged) value.
Moving average (MA): incorporates the dependency between an observation and the residual errors from a moving average model applied to lagged observations.

y(t)=\mu+\epsilon_t+\sum_{p=1}^P a_p~y(t-p) + \sum_{q=1}^Q\theta_q\epsilon_{t-q} + \sum_d^D{dX_{t-1}-X_{t-d}}

MLTSA:

the I in ARIMA:

Autoregressive Integrated Moving Average model

p: lag order: number of lag observations in the AR model; AR(p).

q: order of the moving average: the size of the moving average window; MA(q)

d: degree of differencing: number of times that the raw observations are differenced; I(d)

What are the parameters of the model? Choose the orders (p, d, q), then fit for the coefficients

y(t)=\mu+\epsilon_t+\sum_{p=1}^P a_p~y(t-p) + \sum_{q=1}^Q\theta_q\epsilon_{t-q} + \sum_d^D{dX_{t-1}-X_{t-d}}

p ~ ?... q ~ ?

Consider the hints from the ACF and PACF as a starting point for your model design

Parameters vs hyperparameters

Parameters: what is being optimized in model fitting -

eg: y = ax+b slope and intercept

ARMA coefficients

Hyperparameters: what the model designer chooses before optimization

eg: the degree N of a polynomial fit (line fit N=1)

y = \sum_{n=0}^N a_n x^n

y(t)=\mu+\epsilon_t+\sum_{i=1}^P a_i~y(t-i) + \sum_{i=1}^Q\theta_i\epsilon_{t-i} + y^{'}(t)

MLTSA:

special cases of ARIMA models

An ARIMA(0, 0, 0) model is a white noise model.

{\displaystyle X_{t}=\varepsilon _{t}}

ARIMA(0, 1, 0) (or I(1) model) is a random walk.

{\displaystyle X_{t}=X_{t-1}+\varepsilon _{t}}

ARIMA(0, 1, 0) with a constant is a random walk with drift.

{\displaystyle X_{t}=c+X_{t-1}+\varepsilon _{t}}

ARIMA(0, 1, 1) is a basic exponential smoothing model.

{\displaystyle X_{t}=X_{t-1}+\theta_1\varepsilon_{t-1}+\varepsilon _{t}}

ARIMA(0, 1, 2) is a Damped Holt's model.

{\displaystyle X_{t}=X_{t-1}+\theta_1\varepsilon_{t-1}+\theta_2 \varepsilon_{t-2}+\varepsilon _{t}}

An ARIMA(0, 2, 2) is equivalent to Holt's linear method with additive errors, or double exponential smoothing.

{\displaystyle X_{t}=2X_{t-1}-X_{t-2}+(\alpha +\beta -2)\varepsilon _{t-1}+(1-\alpha )\varepsilon _{t-2}+\varepsilon _{t}}

MLTSA:

ARMA workflow

MLTSA:

Model Selection

Assess if the time series is stationary

Augmented Dickey-Fuller (ADF) test:

tests the presence of a unit root

no unit root = stationary

NH (null hypothesis): there is a unit root

AH (alternative hypothesis): there is no unit root

sm.tsa.stattools.adfuller(tss["aa.us"])

is stationary?

yes

(+)ARMA

(+)ARIMA

a p-value above the threshold means we cannot reject the NH, i.e. there could be a unit root, i.e. the series may not be stationary

MLTSA:

Model Selection

Assess if the time series is stationary

Augmented Dickey-Fuller (ADF) test:

tests the presence of a unit root

no unit root = stationary

NH: there is a unit root

AH: there is no unit root

sm.tsa.stattools.adfuller(tss["aa.us"])

is stationary?

yes

(+)ARMA

(+)ARIMA

a p-value above the threshold means we cannot reject the NH, i.e. there could be a unit root, i.e. the series may not be stationary

The second returned value is the p-value of the test.

Low p-values mean NH is unlikely

SET A THRESHOLD BEFORE PERFORMING THE TEST

MLTSA:

Model Selection

Guess the "parameter" p in AR(p)- really they are hyperparameters

here 2 is a good guess

here maybe 1 maybe 6?

MLTSA:

2 is a good guess

maybe 1 maybe 6?

MLTSA:

Model Selection

Guess the "parameter" p in AR(p)- really they are hyperparameters

MLTSA:

Fit models with different parameters

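import numpy as np
import statsmodels.api as sm

# df["y"]: the time series to model, loaded earlier
# grid of AIC values: rows index p (AR order), columns index q (MA order)
aics = np.full((5, 5), np.nan)
for p in range(5):
    for q in range(5):
        try:
            # d=1: the series is differenced once
            mod = sm.tsa.ARIMA(df["y"], order=(p, 1, q)).fit()
            aics[p, q] = mod.aic
        except Exception:
            # leave NaN where the fit fails
            pass

p, q = np.where(aics == np.nanmin(aics))
print("best parameters: p: {:d} q: {:d}".format(p[0], q[0]))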

Fit a model for each combination of parameters (making sure the range includes at least your best guess for p) and calculate the AIC: Akaike Information Criterion.

Choose the model that minimizes AIC

MLTSA:

inference and forecast at last!

Use the model to predict or infer

FYI: Other ARMA models

SAR(I)MA : seasonal AR(I)MA version

CAR(I)MA: works on unevenly sampled time series

VAR(I)MA: multivariate AR(I)MA

MLTSA:

Model Selection

what model should I choose?

No matter what anyone tells you an answer to this question cannot be given in the abstract case: it is a domain specific question!

except:

the principle of parsimony

MLTSA:

principle of parsimony

or Ockham's razor

Pluralitas non est ponenda sine necessitate

William of Ockham (logician and Franciscan friar) 1300ca

but probably to be attributed to John Duns Scotus (1265–1308)

“Complexity need not be postulated without a need for it”

principle of parsimony

Peter Apian, Cosmographia, Antwerp, 1524 from Edward Grant,

"Celestial Orbs in the Latin Middle Ages", Isis, Vol. 78, No. 2. (Jun., 1987).

Geocentric models are intuitive:

from our perspective we see the Sun moving, while we stay still

the earth is round,

and it orbits around the sun

principle of parsimony

As observations improve

this model can no longer fit the data!

not easily anyways...

the earth is round,

and it orbits around the sun

Encyclopaedia Britannica 1st Edition

Dr Long's copy of Cassini, 1777

principle of parsimony

A new model that is much simpler fits the data just as well

(perhaps though only until better data comes...)

the earth is round,

~~and it orbits around the sun~~

Heliocentric model from Nicolaus Copernicus' De revolutionibus orbium coelestium.

Author Dr Long's copy of Cassini, 1777

Peter Apian, Cosmographia, Antwerp, 1524

Ockham's razor

Heliocentric model from Nicolaus Copernicus'

"De revolutionibus orbium coelestium".

Author Dr Long's copy of Cassini, 1777

Ockham's razor

Two theories may explain a phenomenon just as well as each other. In that case you should prefer the simpler one

principle of parsimony

or Ockham's razor

Pluralitas non est ponenda sine necessitate

William of Ockham (logician and Franciscan friar) 1300ca

but probably to be attributed to John Duns Scotus (1265–1308)

“Complexity need not be postulated without a need for it”

“Between 2 theories that perform similarly choose the simpler one”

the principle of parsimony

or Ockham's razor

Between 2 theories that perform similarly choose the simpler one

In the context of model selection simpler means "with fewer parameters"

Key Concept

principle of parsimony

Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Ockham he should seek an economical description of natural phenomena

Since all models are wrong the scientist must be alert to what is importantly wrong.

Science and Statistics George E. P. Box (1976)

Journal of the American Statistical Association, Vol. 71, No. 356, pp. 791-799.

what model should I choose?

No matter what anyone says an answer to this question cannot be given in the abstract case: it is a domain specific question!

except:

the principle of parsimony

MLTSA:

the case of ARIMA models

Ockham's razor

Ockham’s razor: Pluralitas non est ponenda sine necessitate

or “the law of parsimony”

William of Ockham (logician and Franciscan friar) 1300ca

but probably to be attributed to John Duns Scotus (1265–1308)

“Complexity need not be postulated without a need for it”

“Between 2 theories choose the simpler one”

“Between 2 theories choose the one with fewer parameters"

Ockham's razor

data

model fit to data

Ockham's razor

model fit to data

y = ax^2 + bx + c

y = ax + b

Ockham's razor

model fit to data

1 variable: x

y = ax^2 + bx + c

y = ax + b

Ockham's razor

model fit to data

parameters

the complexity of a model can be measured by the number of variables and the numbers of parameters

y = ax^2 + bx + c

y = ax + b

Ockham's razor

the complexity of a model can be measured by the number of variables and the numbers of parameters

mathematically: given N data points there exists an N-parameter model that goes exactly through each data point, but is it useful?

Overfitting: fitting data with a model that is too complex and that does not extend to new data (low predictive power on test data)

https://colab.research.google.com/gist/fedhere/6a320a37476cbcfcbcb4de8aec0cdb1d/overfit_animation.ipynb#scrollTo=lSdDyWFAZrbx
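A small sketch of the overfitting argument, separate from the linked animation: with 8 data points, a polynomial with 8 coefficients (degree 7) passes through every point, yet the data were generated by a line. The data, noise level, and helper `rms` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# 8 training points generated by a line plus Gaussian noise (assumed values)
x = np.linspace(-1, 1, 8)
y = 2 * x + 1 + rng.normal(0, 0.3, x.size)

line = np.polynomial.Polynomial.fit(x, y, deg=1)              # 2 parameters
wiggly = np.polynomial.Polynomial.fit(x, y, deg=x.size - 1)   # 8 parameters: one per data point

# new data from the same process, never seen during the fit
x_new = np.linspace(-0.95, 0.95, 50)
y_new = 2 * x_new + 1 + rng.normal(0, 0.3, x_new.size)

def rms(model, xs, ys):
    """Root-mean-square residual of a fitted model on a dataset."""
    return np.sqrt(np.mean((ys - model(xs)) ** 2))

print("training rms: line", rms(line, x, y), "degree 7", rms(wiggly, x, y))
print("new-data rms: line", rms(line, x_new, y_new), "degree 7", rms(wiggly, x_new, y_new))
```

The degree-7 model has essentially zero training residual because it also fits the noise; on new data it typically does worse than the line, which is the low predictive power that defines overfitting.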

Ockham's razor

Model Diagnostics


model = sm.tsa.ARIMA(endog = train_set, order=(3, 
                                  iorder, 3)).fit()

In practice: for ARMA selection use AIC

Akaike information criterion (AIC) .

L is the likelihood of the data, p is the order of the autoregressive part and q is the order of the moving average part. k indicates whether the ARIMA model includes an intercept (k = 1) or not (k = 0).

AIC can only be used to compare ARIMA models with the same orders of differencing. For ARIMAs with different orders of differencing, RMSE can be used for model comparison.

{\displaystyle {\text{AIC}}=-2\log(L)+2(p+q+k)}

Choose model (hyper)-parameters that minimize AIC

In practice: for ARMA selection use AIC

Akaike information criterion (AIC) .

The preferred ARIMA model among a family of models with the same order of differencing is the one that minimizes the Akaike Information Criterion (AIC).
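To see the AIC penalty at work without a fitting library, the sketch below fits pure AR(p) models by least squares to a simulated AR(2) series and evaluates AIC = -2 log(L) + 2(p + q + k) with q = 0 and k = 0. The simulated coefficients, the Gaussian likelihood evaluated at the best-fit noise variance, and the helper `ar_aic` are assumptions for this illustration; this is not how statsmodels computes its AIC.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
y = np.zeros(n)
for t in range(2, n):
    # simulated AR(2) process (coefficients chosen for this example)
    y[t] = 0.6 * y[t - 1] - 0.5 * y[t - 2] + rng.normal()

def ar_aic(y, p, p_max, q=0, k=0):
    """AIC = -2 log(L) + 2 (p + q + k) for an AR(p) model fit by least squares."""
    target = y[p_max:]                       # same targets for every p, so AICs are comparable
    X = np.column_stack([y[p_max - i: len(y) - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    m = target.size
    # Gaussian log-likelihood evaluated at the best-fit noise variance rss / m
    log_l = -0.5 * m * (np.log(2 * np.pi * rss / m) + 1)
    return -2 * log_l + 2 * (p + q + k)

p_max = 5
aics = {p: ar_aic(y, p, p_max) for p in range(1, p_max + 1)}
best_p = min(aics, key=aics.get)
for p, value in aics.items():
    print(p, round(value, 1))
print("p that minimizes AIC:", best_p)
```

Going from p = 1 to p = 2 lowers the AIC by a large amount because the likelihood improves a lot; beyond that the likelihood gain is small and the 2(p + q + k) term decides.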

Key Concept

MLTSA:

Bayes Theorem &

Bayesian statistics

MLTSA:

combinatorial statistics

P(A|B)P(B) = P(B|A)P(A)

MLTSA:

combinatorial statistics

P(A|B)P(B) = P(B|A)P(A)

P(A|B) = \frac{P(B|A)P(A)}{P(B)}
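A worked example of the theorem using the two coins from the earlier slides: a fair coin with p(tail) = 0.5 and a tricked coin with p(tail) = 0.3. The equal priors and the observed data (2 tails in 10 tosses) are hypothetical values chosen for this illustration.

```python
from math import comb

p_tail = {"fair": 0.5, "tricked": 0.3}   # the two models, P(tail) under each
prior = {"fair": 0.5, "tricked": 0.5}    # assumed prior: no preference between the models
n_tosses, n_tails = 10, 2                # hypothetical observations

def likelihood(p):
    """Binomial probability of the observed number of tails given P(tail) = p."""
    return comb(n_tosses, n_tails) * p**n_tails * (1 - p) ** (n_tosses - n_tails)

# P(B) = sum over models of P(B|A) P(A)
evidence = sum(likelihood(p_tail[m]) * prior[m] for m in prior)

# P(A|B) = P(B|A) P(A) / P(B)
posterior = {m: likelihood(p_tail[m]) * prior[m] / evidence for m in prior}
print(posterior)
```

With these numbers the tricked coin ends up with a posterior probability of about 0.84; a different prior or a different set of tosses would shift that value.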

MLTSA:

Bayes theorem

P(\mathrm{model}|\mathrm{data})P(\mathrm{data}) = P(\mathrm{data}|\mathrm{model})P(\mathrm{model})

P(\mathrm{model}|\mathrm{data}) = \frac{P(\mathrm{data}|\mathrm{model})P(\mathrm{model})}{P(\mathrm{data})}

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

posterior

likelihood

prior

evidence

model parameters

data

\theta

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

constraints on the model

e.g. flux is never negative

P(f<0) = 0, P(f>=0) = 1

prior:

model parameters

data

\theta

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

prior:

constraints on the model

people's weight <1000lb

& people's weight >0lb

P(w) ~ N(105lb, 90lb)

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

model parameters

data

\theta

prior:

constraints on the model

people's weight <1000lb

& people's weight >0lb

P(w) ~ N(105lb, 90lb)

the prior should not be 0 anywhere the parameter could plausibly take a value

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

\theta: model parameters; \mathrm{D}: data

prior: "uninformative prior"

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

likelihood: this is our model

\theta: model parameters; \mathrm{D}: data

y = ax + b + \epsilon;~~\epsilon \sim N(0, \sigma) \Rightarrow P(D|\theta) = \prod_i N(y_i\,|\,a x_i + b, \sigma)
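
A minimal sketch of the likelihood for the line model with Gaussian noise, reading the slide as y = ax + b + ε with zero-mean noise of known width σ. The data values and the function name are made up for illustration.

```python
import math

def log_likelihood(theta, x, y, sigma):
    """Gaussian log-likelihood of the data for the line y = a*x + b."""
    a, b = theta
    ll = 0.0
    for xi, yi in zip(x, y):
        resid = yi - (a * xi + b)
        # log of a normal density with mean a*xi + b and standard deviation sigma
        ll += -0.5 * (resid / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return ll

# Made-up data that roughly follow y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]

ll_good = log_likelihood((2.0, 1.0), x, y, sigma=0.5)
ll_bad = log_likelihood((0.0, 1.0), x, y, sigma=0.5)
print(ll_good, ll_bad)
```

Up to an additive constant the log-likelihood is -χ²/2, so maximizing it is the same as minimizing the χ² target function used for gradient descent.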

MLTSA:

Bayes theorem

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

evidence

????

model parameters

data

\theta

it does not matter if I want to use this to compare models (parameter sets) on the same data: P(D) is the same for all of them and cancels out

MLTSA:

Bayes theorem

P(\theta_1|\mathrm{D}) = \frac{P(\mathrm{D}|\theta_1)P(\theta_1)}{P(\mathrm{D})}

P(\theta_2|\mathrm{D}) = \frac{P(\mathrm{D}|\theta_2)P(\theta_2)}{P(\mathrm{D})}

\theta: model parameters; \mathrm{D}: data

it does not matter if I want to use this for model comparison: P(D) is common to both and cancels in the ratio

which has the highest posterior probability?

P(\theta|\mathrm{D}) \propto P(\mathrm{D}|\theta)P(\theta)
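
A small sketch of comparing two parameter sets through the unnormalized posterior only. The data, the noise width, and the prior (flat, with the slope constrained to be non-negative) are assumptions made for this illustration.

```python
import math

# Made-up data and known noise width
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]
sigma = 0.5

def log_unnormalized_posterior(a, b):
    """log[P(D|theta) P(theta)] up to a constant; P(D) is never computed."""
    if a < 0:
        return -math.inf  # assumed prior: negative slopes are excluded
    chi2 = sum(((yi - (a * xi + b)) / sigma) ** 2 for xi, yi in zip(x, y))
    return -0.5 * chi2

theta_1 = (2.0, 1.0)
theta_2 = (1.5, 1.5)

# The evidence cancels in the ratio, so a difference of logs is enough
log_ratio = log_unnormalized_posterior(*theta_1) - log_unnormalized_posterior(*theta_2)
print("log posterior ratio theta_1 / theta_2:", log_ratio)
```

A positive log ratio means theta_1 has the higher posterior probability of the two.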

MLTSA:

Bayes theorem

posterior: joint probability distribution of a parameter set (θ, e.g. (m, b)) conditioned upon some data D and a model hypothesis f

evidence: marginal likelihood of the data under the model

P(\theta|\mathrm{D}) = \frac{P(\mathrm{D}|\theta)P(\theta)}{P(\mathrm{D})}

prior: “intellectual” knowledge about the model parameters, conditioned on a model hypothesis f. This should come from domain knowledge or from knowledge of data other than the dataset under examination

P(D|f) = \int_{-\infty}^\infty P(D|\theta,f)P(\theta|f)d\theta
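
A minimal sketch of the evidence integral for a one-parameter model, evaluated on a grid with the trapezoid rule. The model (y = a x with known Gaussian noise), the data, and the flat prior on a in [0, 4] are assumptions chosen so the integral is easy to approximate; outside that range the prior is zero, so the infinite limits reduce to the prior range.

```python
import math

# Made-up data for the one-parameter model y = a * x
x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.2]
sigma = 0.5

def likelihood(a):
    """P(D | a, f): product of Gaussian densities around a * x_i."""
    norm = (sigma * math.sqrt(2 * math.pi)) ** len(x)
    chi2 = sum(((yi - a * xi) / sigma) ** 2 for xi, yi in zip(x, y))
    return math.exp(-0.5 * chi2) / norm

# Assumed flat prior P(a | f) on [a_lo, a_hi]
a_lo, a_hi, n = 0.0, 4.0, 4001
grid = [a_lo + i * (a_hi - a_lo) / (n - 1) for i in range(n)]
prior = 1.0 / (a_hi - a_lo)
da = grid[1] - grid[0]

def trapezoid(values):
    return da * (sum(values) - 0.5 * (values[0] + values[-1]))

integrand = [likelihood(a) * prior for a in grid]
evidence = trapezoid(integrand)            # P(D | f)

posterior = [v / evidence for v in integrand]
post_area = trapezoid(posterior)           # integrates to 1 by construction
a_map = grid[max(range(n), key=lambda i: posterior[i])]

print("evidence:", evidence)
print("posterior area:", post_area, " MAP slope:", a_map)
```

A grid works for one or two parameters; the cost grows exponentially with the number of parameters, which is one motivation for sampling methods such as MCMC.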

in reality all of these quantities are conditioned on the shape of the model: this is a model fitting, not a model selection methodology

MLTSA:

model selection methodology

AIC BIC MDL

MLTSA:

model selection

Shannon 1948: A Mathematical Theory of Communication

a theory to find fundamental limits on signal processing and communication operations such as data compression

model selection is also based on the minimization of a quantity. Several quantities are suitable:

MDL: information theory (Shannon)

BIC: Bayes theorem

AIC: optimism and likelihood maximization on the training set

MLTSA:

AIC, BIC, & MDL

Akaike information criterion (AIC)

\text{AIC} = -\frac{2}{N}\log(L) + \frac{2}{N}k

L is the maximized likelihood of the data, k is the number of parameters, N the number of observations.

Likelihood: model performance.

Number of parameters: model complexity.

Based on

\lim_{N\to\infty} (-2 E(\log Pr_{\hat{\theta}}(Y)) ) = -\frac{2}{N} E ~\log(L) + d\frac{2}{N}

where Pr_{\theta}(Y) is a family of functions (= densities) containing the correct (= true) function, \hat{\theta} is the set of parameters that maximizes the likelihood L, and d = k is the number of parameters.

"-" sign in front of the log-likelihood: AIC shrinks for better models,

AIC ~ k => AIC is linearly proportional to (grows with) the number of parameters k

MLTSA:

AIC, BIC, & MDL

Bayesian information criterion (BIC)

\text{BIC} = -2\log(L) + \log(N)k

L is the likelihood of the data, k is the number of parameters, N the number of observations.

Likelihood: model performance.

Number of parameters: model complexity.

Stronger penalization of complexity than AIC, compared on the same scale (N \cdot \text{AIC} = -2\log(L) + 2k), as long as N > e^2 \approx 7.4 so that \log(N) > 2

The derivation is very different:

\frac{P(M_m | D)}{P(M_l | D)} = \frac{P(M_m)}{P(M_l)}\cdot\frac{P(D|M_m)}{P(D|M_l)}

Bayes Factor
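
A sketch of how the Bayes factor leads to the BIC, under the usual large-N (Laplace) approximation of the evidence and with priors that are smooth near the maximum-likelihood parameters. The subscripted symbols k_m and \hat{\theta}_m are notation introduced here for model M_m.

```latex
\log P(D|M_m) \approx \log L(\hat{\theta}_m) - \frac{k_m}{2}\log N
\quad\Rightarrow\quad
-2\log P(D|M_m) \approx \text{BIC}_m

\frac{P(M_m | D)}{P(M_l | D)} \approx \frac{P(M_m)}{P(M_l)}\cdot
\exp\left(-\frac{1}{2}\left(\text{BIC}_m - \text{BIC}_l\right)\right)
```

With equal prior probabilities for the models, choosing the smallest BIC is therefore approximately choosing the model with the largest posterior probability.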

MLTSA:

AIC, BIC, & MDL

Minimum Description Length (MDL)

\text{MDL} = -\log(L(\theta)) - \log(L(y | X, \theta))

The sum of the negative log-likelihood of the model parameters (θ) and the negative log-likelihood of the target values (y) given the input values (X) and the model parameters (θ).

Equivalently:

-log(L(θ)): number of bits required to represent the model,

-log(L(y|X,θ)): number of bits required to represent the predictions on observations

minimize the encoding of the model and its predictions

derived from Shannon's information theory
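
A toy two-part description length, sketched for coin flips rather than regression. The data (80 heads in 100 flips) are made up, and the cost assigned to transmitting a fitted parameter, (k/2) log2(N) bits, is the standard asymptotic choice and is an assumption here, not something stated on the slide.

```python
import math

n_flips, n_heads = 100, 80   # made-up data

def data_bits(p):
    """Bits needed to encode the flips with a code built for P(heads) = p."""
    return -(n_heads * math.log2(p) + (n_flips - n_heads) * math.log2(1 - p))

# Model A: fair coin, no free parameter to transmit
bits_fair = 0.0 + data_bits(0.5)

# Model B: biased coin; the fitted p must be transmitted as well.
# Assumed parameter cost: (k/2) * log2(N) bits with k = 1
p_hat = n_heads / n_flips
bits_biased = 0.5 * math.log2(n_flips) + data_bits(p_hat)

print(f"fair coin:   {bits_fair:.1f} bits")
print(f"biased coin: {bits_biased:.1f} bits")
```

The biased-coin model pays a few bits for its parameter but saves many more on the data, so it has the shorter total description.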

MLTSA:

AIC, BIC, & MDL

\text{MDL} = -\log(L(\theta)) - \log(L(y | X, \theta))

\text{BIC} = -2\log(L) + \log(N)k

\text{AIC} = -\frac{2}{N}\log(L) + \frac{2}{N}k

Mathematically similar, though derived from different approaches. All are used the same way: the preferred model is the one that minimizes the criterion
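
A minimal sketch that evaluates AIC and BIC, in the forms written above, for two nested models: a constant and a straight line. The data are made up, the fits use closed-form least squares, and the noise variance is set to its maximum-likelihood value (so it counts as one of the k parameters); all of these are illustration choices.

```python
import math

# Made-up data: a line plus a fixed set of small perturbations
x = [float(i) for i in range(10)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1]
y = [1.0 + 2.0 * xi + ei for xi, ei in zip(x, noise)]
N = len(y)

def gaussian_max_loglike(residuals):
    """Maximized Gaussian log-likelihood with sigma^2 set to its MLE, RSS / N."""
    rss = sum(r * r for r in residuals)
    return -0.5 * N * (math.log(2 * math.pi * rss / N) + 1)

# Model 1: constant, y = c
mean_x, mean_y = sum(x) / N, sum(y) / N
res_const = [yi - mean_y for yi in y]

# Model 2: line, y = a x + b (ordinary least squares)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
res_line = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# k counts the fitted noise variance too
models = {"constant": (res_const, 2), "linear": (res_line, 3)}
scores = {}
for name, (res, k) in models.items():
    logL = gaussian_max_loglike(res)
    aic = -2.0 / N * logL + 2.0 * k / N
    bic = -2.0 * logL + math.log(N) * k
    scores[name] = (aic, bic)
    print(f"{name:9s} AIC = {aic:8.3f}  BIC = {bic:8.3f}")
```

Both criteria are lower for the linear model here: the gain in likelihood outweighs the penalty for one extra parameter.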

MLTSA:

Model Optimization

with MCMC

MLTSA:

Markov Chain Monte Carlo (MCMC)

Monte Carlo: stochastic

Markov: "markovian"

Chain: sequence

posterior: joint probability distribution of a parameter set (m, b) conditioned upon some data D and a model hypothesis f

MLTSA:

MCMC

Goal: sample the posterior distribution

[Figure: posterior distribution over slope and intercept]

MLTSA:

MCMC

choose a starting point in the parameter space: current = θ0 = (m0,b0)

WHILE convergence criterion is NOT met:

    calculate the current posterior pcurr ∝ P(D|θcurrent,f) P(θcurrent|f)

    //proposal

    choose a new set of parameters new = θnew = (mnew,bnew)

    calculate the proposed posterior pnew ∝ P(D|θnew,f) P(θnew|f)

    IF pnew/pcurr > 1:

        current = new

    ELSE:

        //probabilistic step: accept with probability pnew/pcurr

        draw a random number r ~ U[0,1]

        IF r < pnew/pcurr:

            current = new

        ELSE:

            pass // do nothing
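
The accept/reject loop can be sketched as a small Metropolis sampler for the slope and intercept of a line. This is an illustration under stated assumptions, not a production sampler: the data are simulated, the noise width is taken as known, the prior is flat inside a wide box, the proposal is a symmetric Gaussian step, and a fixed number of steps stands in for a convergence criterion. Acceptance happens with probability min(1, pnew/pcurr), evaluated in log space.

```python
import math
import random

random.seed(0)

# Simulated data from y = 2x + 1 with Gaussian noise of known width
x = [i / 2 for i in range(20)]
sigma = 0.5
y = [2.0 * xi + 1.0 + random.gauss(0.0, sigma) for xi in x]

def log_posterior(theta):
    """log[P(D|theta,f) P(theta|f)] up to a constant."""
    m, b = theta
    if not (-10 < m < 10 and -10 < b < 10):
        return -math.inf  # assumed flat prior inside a box, zero outside
    chi2 = sum(((yi - (m * xi + b)) / sigma) ** 2 for xi, yi in zip(x, y))
    return -0.5 * chi2

def metropolis(log_post, start, n_steps, step_size):
    current = list(start)
    logp_curr = log_post(current)
    chain = []
    for _ in range(n_steps):
        # proposal: symmetric Gaussian perturbation of the current state
        proposal = [c + random.gauss(0.0, step_size) for c in current]
        logp_new = log_post(proposal)
        # accept if log(r) < log(pnew/pcurr), r ~ U(0,1]; uphill moves always pass
        if math.log(1.0 - random.random()) < logp_new - logp_curr:
            current, logp_curr = proposal, logp_new
        chain.append(tuple(current))
    return chain

chain = metropolis(log_posterior, start=(0.0, 0.0), n_steps=20000, step_size=0.1)
burned = chain[5000:]  # discard the early, start-dependent part of the chain
m_mean = sum(s[0] for s in burned) / len(burned)
b_mean = sum(s[1] for s in burned) / len(burned)
print(f"slope ~ {m_mean:.2f}, intercept ~ {b_mean:.2f}")
```

A rejected proposal still appends the current state to the chain: repeated points are how the sampler spends more time in high-probability regions.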

stochasticity allows us to explore the whole surface but spend more time in interesting spots

Goal: sample the posterior distribution

MLTSA:

MCMC

Questions:

1. how do I choose the next point?

Any Markovian ergodic process

MLTSA:

MCMC

Definition: a Markovian process

A process is Markovian if the next state of the system is determined stochastically as a perturbation of the current state of the system, and only the current state of the system, i.e. the system has no memory of earlier states (a memory-less process).

Definition: an ergodic process

(given enough time) the entire parameter space would be sampled.

Detailed balance (reversible Markov process)

At equilibrium, each elementary process should be equilibrated by its reverse process.

\pi (x_1)P(x_2 | x_1)=\pi (x_2)P(x_1 | x_2)

Detailed balance is a sufficient condition for \pi to be the stationary distribution of the chain; combined with ergodicity, it ensures that the chain converges to \pi.

Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953 - Hastings 1970
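
The detailed balance condition can be checked numerically on a toy problem. The sketch below builds the Metropolis transition matrix for a made-up three-state target distribution π, with a uniform (symmetric) proposal over the other states, and verifies both detailed balance and that π is left unchanged by one step of the chain.

```python
# Made-up discrete target distribution over three states
pi = [0.5, 0.3, 0.2]
n = len(pi)

# P[i][j] = P(x_j | x_i): propose j != i uniformly, accept with min(1, pi_j / pi_i)
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
    P[i][i] = 1.0 - sum(P[i])  # rejected proposals keep the chain where it is

# Detailed balance: pi(x_i) P(x_j | x_i) == pi(x_j) P(x_i | x_j)
max_violation = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
                    for i in range(n) for j in range(n))

# Stationarity: one step of the chain leaves pi unchanged
pi_next = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

print("largest detailed-balance violation:", max_violation)
print("pi after one step:", pi_next)
```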

If the chains are a Markovian ergodic process, it can be shown that the algorithm is guaranteed to explore the entire likelihood surface given infinite time.

This is in contrast to gradient descent, which can get stuck in local minima or at saddle points.

how to choose the next point

how you make this decision names the algorithm:

simulated annealing (good for multimodal distributions)
parallel tempering (good for multimodal distributions)
differential evolution (good for covariant spaces)
Gibbs sampling (move along one variable at a time)
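
As an illustration of the last option, here is a minimal Gibbs sampler for a made-up target: a standard bivariate normal with correlation rho, for which each conditional distribution is itself a normal that can be drawn from directly. The target and all names are assumptions chosen for the example.

```python
import math
import random

random.seed(3)

rho = 0.8                          # assumed correlation of the target
cond_sd = math.sqrt(1 - rho ** 2)  # width of each conditional distribution

def gibbs_bivariate_normal(n_samples, start=(0.0, 0.0)):
    x, y = start
    samples = []
    for _ in range(n_samples):
        x = random.gauss(rho * y, cond_sd)  # move along x only: draw x | y
        y = random.gauss(rho * x, cond_sd)  # move along y only: draw y | x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(20000)

# Sample correlation of the chain, to compare with rho
n = len(samples)
mx = sum(s[0] for s in samples) / n
my = sum(s[1] for s in samples) / n
cov = sum((sx - mx) * (sy - my) for sx, sy in samples) / n
vx = sum((sx - mx) ** 2 for sx, _ in samples) / n
vy = sum((sy - my) ** 2 for _, sy in samples) / n
corr = cov / math.sqrt(vx * vy)
print(f"sample correlation: {corr:.3f} (target {rho})")
```

Every draw is accepted, but moves are axis-aligned, so strongly correlated parameters make the chain mix slowly.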

MLTSA:

MCMC

The chains: the algorithm creates a "chain" (a random walk) that "explores" the likelihood surface.

It is more efficient to run many chains in parallel, each exploring the surface: an "ensemble"

The path of the chains can be shown along each feature

MLTSA:

MCMC

[Figure: chain traces, feature value vs. step]

Examples of how to choose the next point

affine invariant ensemble sampler: emcee package

MLTSA:

MCMC

MCMC convergence

check autocorrelation within a chain (Raftery-Lewis)
check that all chains converged to the same region (a stationary distribution, Gelman-Rubin)
mean at beginning = mean at end (Geweke)
check that the entire chain reached a stationary distribution (or a final fraction of the chain, Heidelberger-Welch using the Cramér-von Mises statistic)
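
A minimal sketch of the Gelman-Rubin check for a single parameter, using the basic form of the statistic (between-chain versus within-chain variance). The "chains" are independent Gaussian draws made up for the demonstration, standing in for real MCMC output; values close to 1 indicate that the chains agree.

```python
import math
import random

random.seed(4)

def gelman_rubin(chains):
    """Basic R-hat for one parameter, given m chains of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # B: between-chain variance, W: mean within-chain variance
    B = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    W = sum(sum((v - mu) ** 2 for v in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n
    return math.sqrt(var_hat / W)

# Four stand-in chains that sample the same distribution
converged = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]

# Four stand-in chains stuck in different regions
stuck = [[random.gauss(2.0 * k, 1.0) for _ in range(2000)] for k in range(4)]

r_good = gelman_rubin(converged)
r_bad = gelman_rubin(stuck)
print(f"R-hat, agreeing chains: {r_good:.3f}")
print(f"R-hat, stuck chains:    {r_bad:.3f}")
```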

MLTSA:

State Space Model

4.1

MLTSA:

state space model

motion equations for the position or "state" of a spacecraft

x(t): location

y(t): information that can be observed from a tracking device such as velocity and azimuth

https://www.stat.pitt.edu/stoffer/tsa4/Chapter6.pdf

MLTSA:

state space model

Y_t=x_t\beta+\epsilon_t;~~\epsilon_t \sim N(0,\sigma^2_\epsilon)

x_{t+1} =x_{t} + \nu_t;~~\nu_t \sim N(0,\sigma^2_\nu)

x_t: unobserved state

The underlying state x is a time-varying Markovian process (the position of the spacecraft)

The observed variable depends at least on the state and on noise.

Other elements (e.g. seasonality) can be included in the model too
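
A minimal sketch of the two equations above: simulate the hidden random-walk state and its noisy observations, then recover the state with a scalar Kalman filter. The assumptions are β = 1, known noise variances, and made-up parameter values; the filter is the standard one for this local-level model, not a method prescribed by the slides.

```python
import math
import random

random.seed(1)

sigma_eps, sigma_nu = 1.0, 0.3   # assumed observation and state noise widths
T = 200

# State equation: x_{t+1} = x_t + nu_t
state = [0.0]
for _ in range(T - 1):
    state.append(state[-1] + random.gauss(0.0, sigma_nu))

# Observation equation with beta = 1: Y_t = x_t + eps_t
obs = [s + random.gauss(0.0, sigma_eps) for s in state]

def kalman_filter_local_level(obs, var_eps, var_nu, x0=0.0, p0=10.0):
    """Filtered estimates of the hidden state for the random-walk-plus-noise model."""
    x_hat, p = x0, p0
    filtered = []
    for y in obs:
        # update: blend the prediction with the new observation
        gain = p / (p + var_eps)
        x_hat = x_hat + gain * (y - x_hat)
        p = (1.0 - gain) * p
        filtered.append(x_hat)
        # predict: random walk keeps the mean, uncertainty grows by var_nu
        p = p + var_nu
    return filtered

filtered = kalman_filter_local_level(obs, sigma_eps ** 2, sigma_nu ** 2)

def rmse(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a))

rmse_obs = rmse(obs, state)
rmse_filtered = rmse(filtered, state)
print(f"RMSE of raw observations: {rmse_obs:.2f}")
print(f"RMSE of filtered state:   {rmse_filtered:.2f}")
```

In a Bayesian fit the noise variances are not known in advance; they are parameters to be sampled, for instance with MCMC.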

MLTSA:

BSTS

we can write a Bayesian structural model like this:

Y_t=\mu_t+x_t\beta+S_t+\epsilon_t;~~\epsilon_t \sim N(0,\sigma^2_\epsilon)\\ \mu_t= \mu_{t-1}+\nu_t;~~\nu_t \sim N(0,\sigma^2_\nu)

\mu_t: local level

x_t: state

S_t: seasonal trend

MLTSA:

BSTS

we can write a Bayesian structural model like this:

Y_t=\mu_t+x_t\beta+S_t+\epsilon_t;~~\epsilon_t \sim N(0,\sigma^2_\epsilon)

\mu_{t+1} =\mu_{t} + \nu_t;~~\nu_t \sim N(0,\sigma^2_\nu)

S_t: seasonal variations

\mu_t: unobserved trend

It is a Markovian process:

stochastic process with 1-step memory

there is a hidden or latent process x_t called the state process (the position of the spacecraft)
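
A generative sketch of the structural model above: each component is simulated separately and the observed series is their sum. Every numerical value, the ramp used as the regressor x_t, and the fixed sinusoidal weekly pattern used for S_t are assumptions for illustration; fitting the model means inferring β, the variances, and the latent components from Y alone.

```python
import math
import random

random.seed(2)

T, period = 56, 7              # eight "weeks" of daily data
beta = 0.5                     # assumed regression coefficient
sigma_eps, sigma_nu = 0.2, 0.1

# Exogenous regressor x_t (made up: a slow ramp)
x = [t / T for t in range(T)]

# Local level mu_t: random walk, mu_{t+1} = mu_t + nu_t
mu = [10.0]
for _ in range(T - 1):
    mu.append(mu[-1] + random.gauss(0.0, sigma_nu))

# Seasonal term S_t: a fixed weekly pattern that sums to zero over one period
pattern = [math.sin(2 * math.pi * d / period) for d in range(period)]
S = [pattern[t % period] for t in range(T)]

# Observation noise and the observed series Y_t = mu_t + x_t beta + S_t + eps_t
eps = [random.gauss(0.0, sigma_eps) for _ in range(T)]
Y = [mu[t] + x[t] * beta + S[t] + eps[t] for t in range(T)]

for t in range(3):
    print(f"t={t}: level={mu[t]:.2f} regression={x[t] * beta:.2f} "
          f"seasonal={S[t]:.2f} Y={Y[t]:.2f}")
```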

MLTSA:

BSTS

a bit more complex

MLTSA:

BSTS

https://github.com/wwrechard/pydlm

A model that accepts regression on exogenous variables allows feature extraction: which features are important in the prediction?

e.g. CO2 or Irradiance in climate change (HW)

https://kw-engineering.com/energy-savings-calendar-heat-map/

The essence of the calendar heat map is viewing data over time at a glance. Imagine a calendar, tipped on its side so the days of the week run vertically:

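import seaborn as sns

# `minutely` is assumed to be a 2-D pandas DataFrame holding the series pivoted into a time-of-day by day grid
sns.heatmap(minutely, robust=True, cmap='YlGnBu',
            yticklabels=False, xticklabels=5, cbar=False)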

Plotting a time series heat map with Pandas

https://jonisalonen.com/2019/plotting-a-time-series-heat-map-with-pandas/

Reading

Sorry ARIMA, but I’m Going Bayesian

https://multithreaded.stitchfix.com/blog/2016/04/21/forget-arima/

HW3

Sorry ARIMA, but I’m Going Bayesian

https://multithreaded.stitchfix.com/blog/2016/04/21/forget-arima/

references

Elements of Statistical Learning Chapter 7

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

Data Analysis: A Bayesian Tutorial

https://www.amazon.com/Data-Analysis-Bayesian-Devinderjit-Sivia-ebook/dp/B01BHLXKEI

READING

Implementing Facebook Prophet efficiently

https://towardsdatascience.com/implementing-facebook-prophet-efficiently-c241305405a3

HW

Structural Time Series modeling with Prophet

MLTSA_04 2025

By federica bianco

Autoregressive models, model selection, Bayes theorem
