Linear Autoregressive Models

ML in Feedback Sys #7

Fall 2025, Prof Sarah Dean

Linear auto-regression

"What we do"

  • Given: data \(\{y_k\}_{k=1}^{n+1}\) with \(y_k\in\mathbb R^{d_y}\) and window length \(L\)
  • The linear auto-regressive least squares problem:
    • Split data into predictors $$\bar y_{k:k-L+1} = \begin{bmatrix} y_k^\top & ... & y_{k-L+1}^\top\end{bmatrix}^\top \in\mathbb R^{Ld_y}$$ and targets \(y_{k+1}\) for \(k=L,...,n\)
    • Solve linear least squares optimization $$\min_{\Theta\in \mathbb R^{Ld_y\times d_y}} \sum_{k=L}^n\| \Theta^\top \bar y_{k:k-L+1} - y_{k+1}\|_2^2$$
  • Make predictions with $$\hat y_{t+1} = \hat\Theta^\top \hat{\bar y}_{t:t-L+1} = \sum_{\ell=0}^{L-1} \hat \Theta_{\ell+1}^\top \hat y_{t-\ell} $$
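A minimal NumPy sketch of the fit and one-step prediction above (the helper names `fit_ar` and `predict_next` and the Fibonacci-style test data are illustrative, not from the lecture):

```python
import numpy as np

def fit_ar(Y, L):
    """Least-squares fit of a linear AR model with window length L.
    Y has shape (T, d_y) with rows y_0, ..., y_{T-1}; returns Theta of shape (L*d_y, d_y)."""
    T, d_y = Y.shape
    # Predictor rows are ybar_{t:t-L+1} = [y_t; y_{t-1}; ...; y_{t-L+1}] for t = L-1, ..., T-2
    X = np.stack([np.concatenate([Y[t - l] for l in range(L)]) for t in range(L - 1, T - 1)])
    Z = Y[L:]                                          # targets y_{t+1}
    Theta, *_ = np.linalg.lstsq(X, Z, rcond=None)      # minimizes sum_t ||Theta^T ybar_t - y_{t+1}||^2
    return Theta

def predict_next(Theta, recent):
    """One-step prediction from the L most recent outputs, newest first."""
    return Theta.T @ np.concatenate(list(recent))

# Example: a noiseless order-2 scalar sequence (Fibonacci), d_y = 1
y = [1.0, 1.0]
for _ in range(12):
    y.append(y[-1] + y[-2])
Y = np.array(y).reshape(-1, 1)
Theta_hat = fit_ar(Y, L=2)                             # approximately [[1.], [1.]]
print(predict_next(Theta_hat, [Y[-1], Y[-2]]))         # approximately y[-1] + y[-2]
```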

Linear auto-regression

"Why we do it"

  • Fact 1: Auto-regressive models with length \(L\) are equivalent to partially observed linear dynamical systems (PO-LDS) which satisfy:
    • The PO-LDS is observable
    • Sufficient state dimension \(d_s\)
  • Fact 2: from an output perspective, PO-LDS do not have a unique state space representation, but they do define a unique subspace which can be learned from data

Partially observed dynamics 

  • A partially observed dynamical system is defined by a difference equation and an observation (or measurement) equation $$s_{t+1} = F(s_t),\quad y_t = H(s_t)$$
  • Linear dynamics (PO-LDS) are defined by a dynamics matrix and an observation matrix $$s_{t+1} = Fs_t,\quad y_t = Hs_t$$
  • Fact 1a: given a linear autoregressive model, we can always construct a PO-LDS with \(d_s= Ld_y\) which has identical outputs

Partially observed dynamics 

  • Linear dynamics (PO-LDS) are defined by a dynamics matrix and an observation matrix $$s_{t+1} = Fs_t,\quad y_t = Hs_t, \quad s_0$$
  • Fact 1a: given a linear autoregressive model, we can always construct a PO-LDS with \(d_s= Ld_y\) which has identical outputs
  • Example: Fibonacci sequence: $$1,1, 2, 3, 5, 8, 13, 21, ...$$

    • Auto-regressive model:
      • \(y_{t+1} = y_t + y_{t-1}\)
    • State space model:
      • \(s_{t+1} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} s_t,~~ y_t = \begin{bmatrix} 1 &0\end{bmatrix}s_t,~~s_0=\begin{bmatrix} 1 \\ 0\end{bmatrix}\)

Partially observed dynamics 

  • Linear dynamics (PO-LDS) are defined by a dynamics matrix and an observation matrix $$s_{t+1} = Fs_t,\quad y_t = Hs_t,\quad s_0$$
  • Fact 1a: given a linear autoregressive model \( y_{t+1} = \Theta^\top {\bar y}_{t:t-L+1}\) we can always construct a PO-LDS with \(d_s= Ld_y\) which has identical outputs
  • Proof by construction

    • let \(s_t = \begin{bmatrix}y_t^\top & ... & y_{t-L+1}^\top \end{bmatrix}^\top \in\mathbb R^{Ld_y}\)

    • $$ \text{let}~~ F = \begin{bmatrix}\Theta_1^\top & \cdots & \Theta_{L-1}^\top & \Theta_L^\top \\ I & & & 0\\ & \ddots & & \vdots \\ & & I & 0\end{bmatrix},\quad H =\begin{bmatrix} I & 0 & \cdots & 0 \end{bmatrix}$$ where the first block row of \(F\) is \(\Theta^\top = \begin{bmatrix}\Theta_1^\top & \cdots & \Theta_L^\top\end{bmatrix}\)

    • \(y_k\) for \(k\leq 0\) are free variables
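A sketch of this construction (the helper `ar_to_polds` is an illustrative name), checked on the Fibonacci model from the previous slide:

```python
import numpy as np

def ar_to_polds(Theta, d_y):
    """Companion-form PO-LDS (F, H) matching the AR model y_{t+1} = Theta^T ybar_{t:t-L+1}."""
    Ld_y = Theta.shape[0]
    F = np.zeros((Ld_y, Ld_y))
    F[:d_y, :] = Theta.T                       # first block row: Theta^T
    F[d_y:, :-d_y] = np.eye(Ld_y - d_y)        # remaining block rows shift y_t, ..., y_{t-L+2} down
    H = np.zeros((d_y, Ld_y))
    H[:, :d_y] = np.eye(d_y)                   # H = [I 0 ... 0] reads off y_t
    return F, H

# Fibonacci: y_{t+1} = y_t + y_{t-1}, so Theta = [[1], [1]] and d_y = 1
F, H = ar_to_polds(np.array([[1.0], [1.0]]), d_y=1)
s = np.array([1.0, 0.0])                       # s_0 = [y_0; y_{-1}], with free variable y_{-1} = 0
outputs = []
for _ in range(8):
    outputs.append(float(H @ s))
    s = F @ s
print(outputs)                                 # [1, 1, 2, 3, 5, 8, 13, 21]
```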

Limits of partial observation

Example: The state \(s=[\theta,\omega]\), output \(y=\theta\), and

$$\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$

  • What can we predict about \(y_t\) and \(s_t\) as \(t\to\infty\)?
  • Suppose \(y_0 = 1\). Can we predict \(y_1\) or estimate \(s_0\)?
  • What if we also know that \(y_1 = 1\)?

Example: The state \(s=[\theta,\omega]\), output \(y=\theta\), and

$$\theta_{t+1} = 0.9\theta_t ,\quad \omega_{t+1} = 0.9 \omega_t$$

Definition: A PO system is observable if outputs \(y_{0:t}\) uniquely determine the state trajectory \(s_{0:t}\) for some finite \(t\).

  • i.e., the map \(\Phi_{0\to t}:\mathcal S\to \mathcal Y^{t+1}\) is injective

Observability

\(s_{t+1} = F(s_t)\)

\(y_t = H(s_t)\)

\((y_0,...,y_t) = \Phi_{0\to t}(s_0)\)

Theorem: A linear system is observable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}H\\HF\\\vdots \\ HF^{d_s-1}\end{bmatrix}}_{\mathcal O}\Big) = d_s$$

  • \(y_0 = H(s_0)\)
  • \(y_1 = H(F(s_0))\)
  • \(y_2 = H(F(F(s_0)))\)
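A quick numerical check of the rank condition on the two \((\theta,\omega)\) examples from the previous slides (a sketch; `observability_matrix` is an illustrative helper):

```python
import numpy as np

def observability_matrix(F, H):
    """Stack H, HF, ..., HF^{d_s - 1}."""
    d_s = F.shape[0]
    blocks, row = [], H.copy()
    for _ in range(d_s):
        blocks.append(row)
        row = row @ F
    return np.vstack(blocks)

H = np.array([[1.0, 0.0]])                          # output y = theta
F_coupled   = np.array([[0.9, 0.1], [0.0, 0.9]])    # theta_{t+1} = 0.9 theta_t + 0.1 omega_t
F_decoupled = np.array([[0.9, 0.0], [0.0, 0.9]])    # omega_t never influences the output

print(np.linalg.matrix_rank(observability_matrix(F_coupled, H)))    # 2 = d_s: observable
print(np.linalg.matrix_rank(observability_matrix(F_decoupled, H)))  # 1 < d_s: not observable
```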

Observability

\(s_{t+1} = F s_t\)

\(y_t = H s_t\)

Theorem: A linear system is observable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}H\\HF\\\vdots \\ HF^{d_s-1}\end{bmatrix}}_{\mathcal O}\Big) = d_s$$

\(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\begin{bmatrix} Hs_0\\HFs_0\\\vdots\\HF^ts_0\end{bmatrix}\)

Proof: The system response \(\Phi_{0\to t}\) is defined in terms of \(F\in\mathbb R^{d_s\times d_s}\) and \(H\in\mathbb R^{d_y\times d_s}\)

1) \(\mathrm{rank}(\mathcal O) = d_s \implies\) observable

$$\begin{bmatrix} y_0 \\ \vdots \\ y_{d_s-1}\end{bmatrix} = \begin{bmatrix}H\\HF\\\vdots \\ HF^{d_s-1}\end{bmatrix} s_0 $$

2) \(\mathrm{rank}(\mathcal O) = d_s \impliedby\) observable

  • There is a unique solution \(s_0\) when \(\mathcal O\) is rank \(d_s\)
    • \(s_0 = (\mathcal O^\top \mathcal O)^{-1} \mathcal O^\top y_{0:d_s-1}\)
  • Thus, \(s_{0:t}\) is uniquely determined (for \(t\geq d_s\))
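A sketch of this recovery on the observable \((\theta,\omega)\) example (the initial state value is an arbitrary illustration):

```python
import numpy as np

F = np.array([[0.9, 0.1], [0.0, 0.9]])     # observable (theta, omega) example
H = np.array([[1.0, 0.0]])
d_s = 2

s0_true = np.array([1.0, 3.0])             # "unknown" initial state to recover
ys, s = [], s0_true.copy()
for _ in range(d_s):                       # collect y_0, ..., y_{d_s - 1}
    ys.append(H @ s)
    s = F @ s
y_stack = np.concatenate(ys)

O = np.vstack([H, H @ F])                  # observability matrix, rank d_s
s0_hat = np.linalg.solve(O.T @ O, O.T @ y_stack)   # (O^T O)^{-1} O^T y_{0:d_s-1}
print(s0_hat)                              # approximately [1.0, 3.0]
```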

Proof:

2) \(\mathrm{rank}(\mathcal O) < d_s \implies\) not observable

  • Claim: \(\mathrm{rank}(\mathcal O_t) \leq \mathrm{rank}(\mathcal O)\) for all \(t\).
    Thus \(\mathrm{rank}(\mathcal O_t) <d_s\) so \(s_0\) is not uniquely determined
  • Need to justify claim for \(t\geq d_s\), when \(\mathcal O_t\) has more rows than \(\mathcal O\)
  • Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree \(d_s\)).
  • Therefore, \(F^k\) for \(k\geq d_s\) is a linear combo of \(I, F,\dots ,F^{d_s-1}\)
  • Thus, \(\mathrm{rank}(\mathcal O_t) \leq \mathrm{rank}(\mathcal O)\) as claimed.

\(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\underbrace{\begin{bmatrix} H\\HF\\\vdots\\HF^t\end{bmatrix}}_{\mathcal O_t} s_0\)

Proof:

1) \(\mathrm{rank}(\mathcal O) = d_s \implies\) observable

$$\begin{bmatrix} y_0 \\ \vdots \\ y_{d_s-1}\end{bmatrix} = \begin{bmatrix}H\\HF\\\vdots \\ HF^{d_s-1}\end{bmatrix} s_0 $$

  • There is a unique solution \(s_0\) when \(\mathcal O\) is rank \(d_s\)
    • \(s_0 = (\mathcal O^\top \mathcal O)^{-1} \mathcal O^\top y_{0:d_s-1}\)
  • Thus, \(s_{0:t}\) is uniquely determined (for \(t\geq d_s\))

Equivalence

  • Fact 1: Auto-regressive models with length \(L\) are equivalent to partially observed linear dynamical systems (PO-LDS) which are observable and have sufficient dimension
    • We proved 1a: AR (\(L\)) \(\implies\) PO-LDS (\(d_s=Ld_y\))
  • Now, we prove 1b: PO-LDS \(\implies\) AR
    • High-level idea: \(d_s\) outputs are sufficient to back out the state, and the state is sufficient to predict the future
    • Given \(F\) and \(H\), construct $$\begin{bmatrix} y_t\\y_{t-1}\\\vdots\\y_{t-d_s+1}\end{bmatrix}= \begin{bmatrix}HF^{d_s-1} \\ \vdots \\ HF\\H \end{bmatrix}s_{t-d_s+1} =\tilde{\mathcal O} s_{t-d_s+1}$$
    • Then, for \(L=d_s\), $$y_{t+1} = Hs_{t+1} = HF^{d_s}  s_{t-d_s+1} = HF^{d_s} (\tilde{\mathcal O}^\top \tilde{\mathcal O})^{-1} \tilde{\mathcal O}^\top \bar y_{t:t-L+1}$$
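A numerical sketch of this 1b construction on the observable \((\theta,\omega)\) example, forming \(\Theta^\top = HF^{d_s}(\tilde{\mathcal O}^\top\tilde{\mathcal O})^{-1}\tilde{\mathcal O}^\top\) and checking the one-step predictions against a simulated trajectory:

```python
import numpy as np

F = np.array([[0.9, 0.1], [0.0, 0.9]])     # observable example, d_s = 2
H = np.array([[1.0, 0.0]])
d_s = 2                                    # take window length L = d_s

# Reversed observability matrix: block rows HF^{d_s-1}, ..., HF, H
O_tilde = np.vstack([H @ np.linalg.matrix_power(F, d_s - 1 - i) for i in range(d_s)])
Theta_T = H @ np.linalg.matrix_power(F, d_s) @ np.linalg.pinv(O_tilde)   # shape (d_y, L*d_y)

# Check against a simulated trajectory
s, ys = np.array([1.0, 3.0]), []
for _ in range(7):
    ys.append(float(H @ s))
    s = F @ s
for t in range(1, 6):
    ybar = np.array([ys[t], ys[t - 1]])    # ybar_{t:t-L+1}
    print(ys[t + 1], float(Theta_T @ ybar))   # the two values should agree
```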


Non-unique state space

There is inherent ambiguity in state space representations
for partially observed dynamics

Example: Consider the sequence: $$1,1, 2, 3, 5, 8, 13, 21, ...$$

  • \(s_{t+1} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} s_t\), \(s_0 = \begin{bmatrix} 1 \\ 0\end{bmatrix}\)
  • \(y_t = \begin{bmatrix} 1 & 0\end{bmatrix}s_t\)
  • \(s_{t+1} = \begin{bmatrix} 1 & 4 \\ \frac{1}{4} & 0 \end{bmatrix} s_t\), \(s_0 = \begin{bmatrix} 2 \\ 0\end{bmatrix}\)
  • \(y_t = \begin{bmatrix} \frac{1}{2} & 0\end{bmatrix}s_t\)

Non-unique state space

There is inherent ambiguity in state space representations
for partially observed dynamics

  • \(y_0 =\hat H\hat s_0\)
  • \(\hat s_1 = \hat F\hat s_0\)
  • ...
  • \(\hat s_t = \hat F\hat s_{t-1}\)
  • \(y_t = \hat H\hat s_{t} \)

Suppose \(\hat F, \hat H, \hat s_0\) satisfy the equations

  • \(y_0 =\hat H\textcolor{cyan}{MM^{-1}}\hat s_0\)
  • \(\textcolor{cyan}{M^{-1}}\hat s_1 = \textcolor{cyan}{M^{-1}}\hat F\textcolor{cyan}{MM^{-1}}\hat s_0\)
  • ...
  • \(\textcolor{cyan}{M^{-1}}\hat s_t = \textcolor{cyan}{M^{-1}}\hat F\textcolor{cyan}{MM^{-1}}\hat s_{t-1}\)
  • \(y_t = \hat H\textcolor{cyan}{MM^{-1}}\hat s_{t} \)

Then so does \(\tilde F=M^{-1}\hat F M, \tilde H=\hat HM, \tilde s_0=M^{-1}\hat s_0\)
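A numerical check of this on the Fibonacci example: \(M=\mathrm{diag}(1/2,\,2)\) maps the first realization to the second, and both produce identical outputs (the helper `outputs` is illustrative):

```python
import numpy as np

F_hat = np.array([[1.0, 1.0], [1.0, 0.0]])  # first Fibonacci realization
H_hat = np.array([[1.0, 0.0]])
s0_hat = np.array([1.0, 0.0])

M = np.diag([0.5, 2.0])                     # any invertible M gives an equivalent realization
F_tilde = np.linalg.inv(M) @ F_hat @ M      # [[1, 4], [1/4, 0]]
H_tilde = H_hat @ M                         # [[1/2, 0]]
s0_tilde = np.linalg.inv(M) @ s0_hat        # [2, 0]

def outputs(F, H, s0, T=8):
    s, ys = s0.copy(), []
    for _ in range(T):
        ys.append(float(H @ s))
        s = F @ s
    return ys

print(outputs(F_hat, H_hat, s0_hat))        # [1, 1, 2, 3, 5, 8, 13, 21]
print(outputs(F_tilde, H_tilde, s0_tilde))  # identical output sequence
```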

Subspace identification

Recall the observability matrix $$\mathcal O_L = \begin{bmatrix}H\\HF\\\vdots \\ HF^{L-1}\end{bmatrix} $$

Under a similarity transform \(M\), $$\tilde {\mathcal O}_L = \begin{bmatrix}\tilde H\\\tilde H\tilde F\\\vdots \\ \tilde H\tilde F^{L-1}\end{bmatrix} =\begin{bmatrix}HM\\HM M^{-1} FM\\\vdots \\  HM (M^{-1} F M)^{L-1}\end{bmatrix} = \mathcal O_L M $$

Approach: subspace identification of the rank \(d_s\) column space of \(\mathcal O_L\) (for \(L\geq d_s\))

Note: the column space is invariant under similarity transforms

What we do:

$$\begin{bmatrix}y_0\\\vdots \\ y_{L-1}\end{bmatrix}  =\mathcal O_L s_0 $$

$$\begin{bmatrix}y_1\\\vdots \\ y_{L}\end{bmatrix}  =\mathcal O_L s_{1} $$

\(\dots\)

$$\begin{bmatrix}y_m\\\vdots \\ y_{L+m-1}\end{bmatrix}  =\mathcal O_L s_{m} $$

Hankel matrix

Construct the Hankel matrix and consider its column space $$Y_{L,m} = \begin{bmatrix}y_0 & \dots &y_m\\\vdots & \ddots&\vdots\\ y_{L-1}&\dots& y_{L+m-1}\end{bmatrix} $$

Stacking the equations above, $$Y_{L,m} =\mathcal O_L \begin{bmatrix}s_0 & s_1 &\dots & s_m\end{bmatrix}$$

  • The column space of the Hankel matrix is the same as that of \(\mathcal O_L\) (as long as \(\begin{bmatrix}s_0 & \dots & s_m\end{bmatrix}\) has rank \(d_s\))
  • It is also the same as the span of the least squares covariates \(\bar y_{k:k-L+1}\), \(k=L, L+1,...\)
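A sketch of the Hankel construction for scalar outputs (the helper `hankel` is illustrative); on a trajectory of the observable \((\theta,\omega)\) example the rank is \(d_s=2\) even with \(L>2\):

```python
import numpy as np

def hankel(ys, L):
    """Hankel matrix with column j equal to [y_j; y_{j+1}; ...; y_{j+L-1}] (scalar outputs)."""
    m = len(ys) - L
    return np.array([[ys[i + j] for j in range(m + 1)] for i in range(L)])

# Outputs of the observable (theta, omega) example
F = np.array([[0.9, 0.1], [0.0, 0.9]])
H = np.array([[1.0, 0.0]])
s, ys = np.array([1.0, 3.0]), []
for _ in range(20):
    ys.append(float(H @ s))
    s = F @ s

Y = hankel(ys, L=4)                         # shape (4, 17)
print(np.linalg.matrix_rank(Y))             # 2: the rank is d_s, not L
```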

Why we do it:

Step 1: Estimate \(\mathcal O_L\)

Step 2: Recover \(\hat F\) and \(\hat H\) from \(\hat{\mathcal O}_L\)

$$\hat{\mathcal O}_L = \begin{bmatrix}\hat{\mathcal O}_L [0]\\\hat{\mathcal O}_L [1]\\\vdots \\ \hat{\mathcal O}_L [L-1]\end{bmatrix}\approx \begin{bmatrix}H\\HF\\\vdots \\ HF^{L-1}\end{bmatrix} $$

So we set \(\hat H = \hat {\mathcal O}_L[0]\) and \(\hat F\) as the least squares solution of

  • \(\hat {\mathcal O}_L[0] F = \hat {\mathcal O}_L[1] \)
  • ...
  • \(\hat {\mathcal O}_L[L-2] F = \hat {\mathcal O}_L[L-1] \)

Subspace identification

Extract the column space using the singular value decomposition (SVD) \(Y_{L,m} = \begin{bmatrix}U_{d_s} & U_{2}\end{bmatrix} \begin{bmatrix}\Sigma_{d_s} \\& 0\end{bmatrix}V^\top \)

Set \(\hat{\mathcal O}_L = U_{d_s}\)
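A minimal end-to-end sketch of the two steps (assuming noiseless scalar outputs and known \(d_s\)); since \(\hat F,\hat H\) are recovered only up to a similarity transform, the check compares eigenvalues rather than the matrices themselves:

```python
import numpy as np

# Simulate noiseless outputs of the observable (theta, omega) example
F = np.array([[0.9, 0.1], [0.0, 0.9]])
H = np.array([[1.0, 0.0]])
s, ys = np.array([1.0, 3.0]), []
for _ in range(30):
    ys.append(float(H @ s))
    s = F @ s

L, d_s = 4, 2
m = len(ys) - L
Y = np.array([[ys[i + j] for j in range(m + 1)] for i in range(L)])   # L x (m+1) Hankel matrix

# Step 1: estimate the column space of O_L via the SVD
U, S, Vt = np.linalg.svd(Y)
O_hat = U[:, :d_s]                          # basis for the column space (equals O_L M for some M)

# Step 2: recover H and F from the (block) rows of O_hat using shift invariance
H_hat = O_hat[:1, :]                        # first block row
F_hat, *_ = np.linalg.lstsq(O_hat[:-1, :], O_hat[1:, :], rcond=None)

print(np.sort(np.linalg.eigvals(F_hat).real))   # approximately [0.9, 0.9], matching eig(F)
```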

Summary

  • Linear AR model: $$\hat y_{t+1} = \hat\Theta^\top \hat{\bar y}_{t:t-L+1} = \sum_{\ell=0}^{L-1} \hat \Theta_{\ell+1}^\top \hat y_{t-\ell} $$
  • Fact 1: Linear AR model \(\iff\) partially observed LDS $$s_{t+1} = Fs_t,\quad y_t = Hs_t$$
  • Fact 2: from an output perspective, PO-LDS do not have a unique state space representation, but they do define a unique subspace: the column space of \(\mathcal O\)
    • equal to the column space of the Hankel matrix
    • equal to the span of the AR covariates

Recap

  • Autoregressive models
  • Partially observed dynamical systems
  • Observability and subspace ID

Next time: stochastic dynamics and filtering

Announcements

  • Third assignment due Thursday
  • Useful posts on Edstem about formatting PRs for submission

Reference: Callier & Desoer, "Linear System Theory" and Verhaegen & Verdult, "Filtering and System Identification"
