Linear Autoregressive Models

ML in Feedback Sys #7

Fall 2025, Prof Sarah Dean

Linear auto-regression

"What we do"

  • Given: data \(\{y_k\}_{k=1}^{n+1}\) with \(y_k\in\mathbb R^{d_y}\) and window length \(L\)
  • The linear auto-regressive least squares problem:
    • Split data into predictors $$\bar y_{k:k-L+1} = \begin{bmatrix} y_k^\top & ... & y_{k-L+1}^\top\end{bmatrix}^\top \in\mathbb R^{Ld_y}$$ and targets \(y_{k+1}\) for \(k=L,...,n\)
    • Solve linear least squares optimization $$\min_{\Theta\in \mathbb R^{Ld_y\times d_y}} \sum_{k=L}^n\| \Theta^\top \bar y_{k:k-L+1} - y_{k+1}\|_2^2$$
  • Make predictions with $$\hat y_{t+1} = \hat\Theta^\top \hat{\bar y}_{t:t-L+1} = \sum_{\ell=0}^{L-1} \hat \Theta_{\ell+1}^\top \hat y_{t-\ell} $$
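A minimal NumPy sketch of the fit and one-step prediction above (the helper names `fit_ar` and `predict_next` and the Fibonacci-style test data are illustrative, not from the lecture):

```python
import numpy as np

def fit_ar(Y, L):
    """Least-squares fit of a linear AR model with window length L.
    Y has shape (T, d_y) with rows y_0, ..., y_{T-1}; returns Theta of shape (L*d_y, d_y)."""
    T, d_y = Y.shape
    # Predictor rows are ybar_{t:t-L+1} = [y_t; y_{t-1}; ...; y_{t-L+1}] for t = L-1, ..., T-2
    X = np.stack([np.concatenate([Y[t - l] for l in range(L)]) for t in range(L - 1, T - 1)])
    Z = Y[L:]                                          # targets y_{t+1}
    Theta, *_ = np.linalg.lstsq(X, Z, rcond=None)      # minimizes sum_t ||Theta^T ybar_t - y_{t+1}||^2
    return Theta

def predict_next(Theta, recent):
    """One-step prediction from the L most recent outputs, newest first."""
    return Theta.T @ np.concatenate(list(recent))

# Example: a noiseless order-2 scalar sequence (Fibonacci), d_y = 1
y = [1.0, 1.0]
for _ in range(12):
    y.append(y[-1] + y[-2])
Y = np.array(y).reshape(-1, 1)
Theta_hat = fit_ar(Y, L=2)                             # approximately [[1.], [1.]]
print(predict_next(Theta_hat, [Y[-1], Y[-2]]))         # approximately y[-1] + y[-2]
```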

Linear auto-regression

"Why we do it"

  • Fact 1: Auto-regressive models with length \(L\) are equivalent to partially observed linear dynamical systems (PO-LDS) which satisfy:
    • The PO-LDS is observable
    • Sufficient state dimension \(d_s\)
  • Fact 2: from an output perspective, PO-LDS do not have a unique state space representation, but they do define a unique subspace which can be learned from data

Partially observed dynamics 

  • A partially observed dynamical system is defined by a difference equation and an observation (or measurement) equation $$s_{t+1} = F(s_t),\quad y_t = H(s_t)$$
  • Linear dynamics (PO-LDS) are defined by a dynamics matrix and an observation matrix $$s_{t+1} = Fs_t,\quad y_t = Hs_t$$
  • Fact 1a: given a linear autoregressive model, we can always construct a PO-LDS with \(d_s= Ld_y\) which has identical outputs

Partially observed dynamics 

  • Linear dynamics (PO-LDS) are defined by a dynamics matrix and an observation matrix $$s_{t+1} = Fs_t,\quad y_t = Hs_t, \quad s_0$$
  • Fact 1a: given a linear autoregressive model, we can always construct a PO-LDS with \(d_s= Ld_y\) which has identical outputs
  • Example: Fibonacci sequence: $$1,1, 2, 3, 5, 8, 13, 21, ...$$

    • Auto-regressive model:
      • \(y_{t+1} = y_t + y_{t-1}\)
    • State space model:
      • \(s_{t+1} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} s_t,~~ y_t = \begin{bmatrix} 1 &0\end{bmatrix}s_t,~~s_0=\begin{bmatrix} 1 \\ 0\end{bmatrix}\)

Partially observed dynamics 

  • Linear dynamics (PO-LDS) are defined by a dynamics matrix and an observation matrix $$s_{t+1} = Fs_t,\quad y_t = Hs_t,\quad s_0$$
  • Fact 1a: given a linear autoregressive model \( y_{t+1} = \Theta^\top {\bar y}_{t:t-L+1}\) we can always construct a PO-LDS with \(d_s= Ld_y\) which has identical outputs
  • Proof by construction

    • let \(s_t = \begin{bmatrix}y_t^\top & ... & y_{t-L+1}^\top \end{bmatrix}^\top \in\mathbb R^{Ld_y}\)

    • $$ \text{let}~~ F = \begin{bmatrix}\Theta_1^\top & \cdots & \Theta_{L-1}^\top & \Theta_L^\top \\ I & & & 0\\ & \ddots & & \vdots \\ & & I & 0\end{bmatrix},\quad H =\begin{bmatrix} I & 0 & \cdots & 0 \end{bmatrix}$$ where the first block row of \(F\) is \(\Theta^\top = \begin{bmatrix}\Theta_1^\top & \cdots & \Theta_L^\top\end{bmatrix}\)

    • \(y_k\) for \(k\leq 0\) are free variables
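A sketch of this construction (the helper `ar_to_polds` is an illustrative name), checked on the Fibonacci model from the previous slide:

```python
import numpy as np

def ar_to_polds(Theta, d_y):
    """Companion-form PO-LDS (F, H) matching the AR model y_{t+1} = Theta^T ybar_{t:t-L+1}."""
    Ld_y = Theta.shape[0]
    F = np.zeros((Ld_y, Ld_y))
    F[:d_y, :] = Theta.T                       # first block row: Theta^T
    F[d_y:, :-d_y] = np.eye(Ld_y - d_y)        # remaining block rows shift y_t, ..., y_{t-L+2} down
    H = np.zeros((d_y, Ld_y))
    H[:, :d_y] = np.eye(d_y)                   # H = [I 0 ... 0] reads off y_t
    return F, H

# Fibonacci: y_{t+1} = y_t + y_{t-1}, so Theta = [[1], [1]] and d_y = 1
F, H = ar_to_polds(np.array([[1.0], [1.0]]), d_y=1)
s = np.array([1.0, 0.0])                       # s_0 = [y_0; y_{-1}], with free variable y_{-1} = 0
outputs = []
for _ in range(8):
    outputs.append(float(H @ s))
    s = F @ s
print(outputs)                                 # [1, 1, 2, 3, 5, 8, 13, 21]
```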

Limits of partial observation

Example: The state \(s=[\theta,\omega]\), output \(y=\theta\), and

$$\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$

  • What can we predict about \(y_t\) and \(s_t\) as \(t\to\infty\)?
  • Suppose \(y_0 = 1\). Can we predict \(y_1\) or estimate \(s_0\)?
  • What if we also know that \(y_1 = 1\)?

Example: The state \(s=[\theta,\omega]\), output \(y=\theta\), and

$$\theta_{t+1} = 0.9\theta_t ,\quad \omega_{t+1} = 0.9 \omega_t$$

Definition: A PO system is observable if outputs \(y_{0:t}\) uniquely determine the state trajectory \(s_{0:t}\) for some finite \(t\).

  • i.e., the map \(\Phi_{0\to t}:\mathcal S\to \mathcal Y^{t+1}\) is injective

Observability

\(s_{t+1} = F(s_t)\)

\(y_t = H(s_t)\)

\((y_0,...,y_t) = \Phi_{0\to t}(s_0)\)

Theorem: A linear system is observable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}H\\HF\\\vdots \\ HF^{d_s-1}\end{bmatrix}}_{\mathcal O}\Big) = d_s$$

  • \(y_0 = H(s_0)\)
  • \(y_1 = H(F(s_0))\)
  • \(y_2 = H(F(F(s_0)))\)
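A quick numerical check of the rank condition on the two \((\theta,\omega)\) examples from the previous slides (a sketch; `observability_matrix` is an illustrative helper):

```python
import numpy as np

def observability_matrix(F, H):
    """Stack H, HF, ..., HF^{d_s - 1}."""
    d_s = F.shape[0]
    blocks, row = [], H.copy()
    for _ in range(d_s):
        blocks.append(row)
        row = row @ F
    return np.vstack(blocks)

H = np.array([[1.0, 0.0]])                          # output y = theta
F_coupled   = np.array([[0.9, 0.1], [0.0, 0.9]])    # theta_{t+1} = 0.9 theta_t + 0.1 omega_t
F_decoupled = np.array([[0.9, 0.0], [0.0, 0.9]])    # omega_t never influences the output

print(np.linalg.matrix_rank(observability_matrix(F_coupled, H)))    # 2 = d_s: observable
print(np.linalg.matrix_rank(observability_matrix(F_decoupled, H)))  # 1 < d_s: not observable
```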

Observability

\(s_{t+1} = F s_t\)

\(y_t = H s_t\)

Theorem: A linear system is observable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}H\\HF\\\vdots \\ HF^{d_s-1}\end{bmatrix}}_{\mathcal O}\Big) = d_s$$

\(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\begin{bmatrix} Hs_0\\HFs_0\\\vdots\\HF^ts_0\end{bmatrix}\)

Proof: The system response \(\Phi_{0\to t}\) is defined in terms of \(F\in\mathbb R^{d_s\times d_s}\) and \(H\in\mathbb R^{d_y\times d_s}\)

1) \(\mathrm{rank}(\mathcal O) = d_s \implies\) observable

$$\begin{bmatrix} y_0 \\ \vdots \\ y_{d_s-1}\end{bmatrix} = \begin{bmatrix}H\\HF\\\vdots \\ HF^{d_s-1}\end{bmatrix} s_0 $$

2) \(\mathrm{rank}(\mathcal O) = d_s \impliedby\) observable

  • There is a unique solution \(s_0\) when \(\mathcal O\) is rank \(d_s\)
    • \(s_0 = (\mathcal O^\top \mathcal O)^{-1} \mathcal O^\top y_{0:d_s-1}\)
  • Thus, \(s_{0:t}\) is uniquely determined (for \(t\geq d_s\))
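A sketch of this recovery on the observable \((\theta,\omega)\) example (the initial state value is an arbitrary illustration):

```python
import numpy as np

F = np.array([[0.9, 0.1], [0.0, 0.9]])     # observable (theta, omega) example
H = np.array([[1.0, 0.0]])
d_s = 2

s0_true = np.array([1.0, 3.0])             # "unknown" initial state to recover
ys, s = [], s0_true.copy()
for _ in range(d_s):                       # collect y_0, ..., y_{d_s - 1}
    ys.append(H @ s)
    s = F @ s
y_stack = np.concatenate(ys)

O = np.vstack([H, H @ F])                  # observability matrix, rank d_s
s0_hat = np.linalg.solve(O.T @ O, O.T @ y_stack)   # (O^T O)^{-1} O^T y_{0:d_s-1}
print(s0_hat)                              # approximately [1.0, 3.0]
```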

Proof:

2) \(\mathrm{rank}(\mathcal O) < d_s \implies\) not observable

  • Claim: \(\mathrm{rank}(\mathcal O_t) \leq \mathrm{rank}(\mathcal O)\) for all \(t\).
    Thus \(\mathrm{rank}(\mathcal O_t) <d_s\) so \(s_0\) is not uniquely determined
  • Need to justify claim for \(t\geq d_s\), when \(\mathcal O_t\) has more rows than \(\mathcal O\)
  • Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree \(d_s\)).
  • Therefore, \(F^k\) for \(k\geq d_s\) is a linear combo of \(I, F,\dots ,F^{d_s-1}\)
  • Thus, \(\mathrm{rank}(\mathcal O_t) \leq \mathrm{rank}(\mathcal O)\) as claimed.

\(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\underbrace{\begin{bmatrix} H\\HF\\\vdots\\HF^t\end{bmatrix}}_{\mathcal O_t} s_0\)

Proof:

1) \(\mathrm{rank}(\mathcal O) = d_s \implies\) observable

$$\begin{bmatrix} y_0 \\ \vdots \\ y_{d_s-1}\end{bmatrix} = \begin{bmatrix}H\\HF\\\vdots \\ HF^{d_s-1}\end{bmatrix} s_0 $$

  • There is a unique solution \(s_0\) when \(\mathcal O\) is rank \(d_s\)
    • \(s_0 = (\mathcal O^\top \mathcal O)^{-1} \mathcal O^\top y_{0:d_s-1}\)
  • Thus, \(s_{0:t}\) is uniquely determined (for \(t\geq d_s\))

Equivalence

  • Fact 1: Auto-regressive models with length \(L\) are equivalent to partially observed linear dynamical systems (PO-LDS) which are observable and have sufficient dimension
    • We proved 1a: AR (\(L\)) \(\implies\) PO-LDS (\(d_s=Ld_y\))
  • Now, we prove 1b: PO-LDS \(\implies\) AR
    • High-level idea: \(d_s\) outputs are sufficient to back out the state, and the state is sufficient to predict the future
    • Given \(F\) and \(H\), construct $$\begin{bmatrix} y_t\\y_{t-1}\\\vdots\\y_{t-d_s+1}\end{bmatrix}= \begin{bmatrix}HF^{d_s-1} \\ \vdots \\ HF\\H \end{bmatrix}s_{t-d_s+1} =\tilde{\mathcal O} s_{t-d_s+1}$$
    • Then, for \(L=d_s\), $$y_{t+1} = Hs_{t+1} = HF^{d_s}  s_{t-d_s+1} = HF^{d_s} (\tilde{\mathcal O}^\top \tilde{\mathcal O})^{-1} \tilde{\mathcal O}^\top \bar y_{t:t-L+1}$$
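A numerical sketch of this 1b construction on the observable \((\theta,\omega)\) example, forming \(\Theta^\top = HF^{d_s}(\tilde{\mathcal O}^\top\tilde{\mathcal O})^{-1}\tilde{\mathcal O}^\top\) and checking the one-step predictions against a simulated trajectory:

```python
import numpy as np

F = np.array([[0.9, 0.1], [0.0, 0.9]])     # observable example, d_s = 2
H = np.array([[1.0, 0.0]])
d_s = 2                                    # take window length L = d_s

# Reversed observability matrix: block rows HF^{d_s-1}, ..., HF, H
O_tilde = np.vstack([H @ np.linalg.matrix_power(F, d_s - 1 - i) for i in range(d_s)])
Theta_T = H @ np.linalg.matrix_power(F, d_s) @ np.linalg.pinv(O_tilde)   # shape (d_y, L*d_y)

# Check against a simulated trajectory
s, ys = np.array([1.0, 3.0]), []
for _ in range(7):
    ys.append(float(H @ s))
    s = F @ s
for t in range(1, 6):
    ybar = np.array([ys[t], ys[t - 1]])    # ybar_{t:t-L+1}
    print(ys[t + 1], float(Theta_T @ ybar))   # the two values should agree
```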


Non-unique state space

There is inherent ambiguity in state space representations
for partially observed dynamics

Example: Consider the sequence: $$1,1, 2, 3, 5, 8, 13, 21, ...$$

  • \(s_{t+1} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} s_t\), \(s_0 = \begin{bmatrix} 1 \\ 0\end{bmatrix}\)
  • \(y_t = \begin{bmatrix} 1 & 0\end{bmatrix}s_t\)
  • \(s_{t+1} = \begin{bmatrix} 1 & 4 \\ \frac{1}{4} & 0 \end{bmatrix} s_t\), \(s_0 = \begin{bmatrix} 2 \\ 0\end{bmatrix}\)
  • \(y_t = \begin{bmatrix} \frac{1}{2} & 0\end{bmatrix}s_t\)

Non-unique state space

There is inherent ambiguity in state space representations
for partially observed dynamics

  • \(y_0 =\hat H\hat s_0\)
  • \(\hat s_1 = \hat F\hat s_0\)
  • ...
  • \(\hat s_t = \hat F\hat s_{t-1}\)
  • \(y_t = \hat H\hat s_{t} \)

Suppose \(\hat F, \hat H, \hat s_0\) satisfy the equations

  • \(y_0 =\hat H\textcolor{cyan}{MM^{-1}}\hat s_0\)
  • \(\textcolor{cyan}{M^{-1}}\hat s_1 = \textcolor{cyan}{M^{-1}}\hat F\textcolor{cyan}{MM^{-1}}\hat s_0\)
  • ...
  • \(\textcolor{cyan}{M^{-1}}\hat s_t = \textcolor{cyan}{M^{-1}}\hat F\textcolor{cyan}{MM^{-1}}\hat s_{t-1}\)
  • \(y_t = \hat H\textcolor{cyan}{MM^{-1}}\hat s_{t} \)

Then so does \(\tilde F=M^{-1}\hat F M, \tilde H=\hat HM, \tilde s_0=M^{-1}\hat s_0\)
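A numerical check of this on the Fibonacci example: \(M=\mathrm{diag}(1/2,\,2)\) maps the first realization to the second, and both produce identical outputs (the helper `outputs` is illustrative):

```python
import numpy as np

F_hat = np.array([[1.0, 1.0], [1.0, 0.0]])  # first Fibonacci realization
H_hat = np.array([[1.0, 0.0]])
s0_hat = np.array([1.0, 0.0])

M = np.diag([0.5, 2.0])                     # any invertible M gives an equivalent realization
F_tilde = np.linalg.inv(M) @ F_hat @ M      # [[1, 4], [1/4, 0]]
H_tilde = H_hat @ M                         # [[1/2, 0]]
s0_tilde = np.linalg.inv(M) @ s0_hat        # [2, 0]

def outputs(F, H, s0, T=8):
    s, ys = s0.copy(), []
    for _ in range(T):
        ys.append(float(H @ s))
        s = F @ s
    return ys

print(outputs(F_hat, H_hat, s0_hat))        # [1, 1, 2, 3, 5, 8, 13, 21]
print(outputs(F_tilde, H_tilde, s0_tilde))  # identical output sequence
```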

Subspace identification

Recall the observability matrix $$\mathcal O_L = \begin{bmatrix}H\\HF\\\vdots \\ HF^{L-1}\end{bmatrix} $$

Under a similarity transform \(M\), $$\tilde {\mathcal O}_L = \begin{bmatrix}\tilde H\\\tilde H\tilde F\\\vdots \\ \tilde H\tilde F^{L-1}\end{bmatrix} =\begin{bmatrix}HM\\HM M^{-1} FM\\\vdots \\  HM (M^{-1} F M)^{L-1}\end{bmatrix} = \mathcal O_L M $$

Approach: subspace identification of the rank \(d_s\) column space of \(\mathcal O_L\) (for \(L\geq d_s\))

Note: the column space is invariant under similarity transforms

What we do:

$$\begin{bmatrix}y_0\\\vdots \\ y_{L-1}\end{bmatrix}  =\mathcal O_L s_0 $$

$$\begin{bmatrix}y_1\\\vdots \\ y_{L}\end{bmatrix}  =\mathcal O_L s_{1} $$

\(\dots\)

$$\begin{bmatrix}y_m\\\vdots \\ y_{L+m-1}\end{bmatrix}  =\mathcal O_L s_{m} $$

Hankel matrix

Construct the Hankel matrix and consider its column space $$Y_{L,m} = \begin{bmatrix}y_0 & \dots &y_m\\\vdots & \ddots&\vdots\\ y_{L-1}&\dots& y_{L+m-1}\end{bmatrix} $$

Stacking the equations above, $$Y_{L,m} =\mathcal O_L \begin{bmatrix}s_0 & s_1 &\dots & s_m\end{bmatrix}$$

  • The column space of the Hankel matrix is the same as that of \(\mathcal O_L\) (as long as \(\begin{bmatrix}s_0 & \dots & s_m\end{bmatrix}\) has rank \(d_s\))
  • It is also the same as the span of the least squares covariates \(\bar y_{k:k-L+1}\), \(k=L, L+1,...\)
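A sketch of the Hankel construction for scalar outputs (the helper `hankel` is illustrative); on a trajectory of the observable \((\theta,\omega)\) example the rank is \(d_s=2\) even with \(L>2\):

```python
import numpy as np

def hankel(ys, L):
    """Hankel matrix with column j equal to [y_j; y_{j+1}; ...; y_{j+L-1}] (scalar outputs)."""
    m = len(ys) - L
    return np.array([[ys[i + j] for j in range(m + 1)] for i in range(L)])

# Outputs of the observable (theta, omega) example
F = np.array([[0.9, 0.1], [0.0, 0.9]])
H = np.array([[1.0, 0.0]])
s, ys = np.array([1.0, 3.0]), []
for _ in range(20):
    ys.append(float(H @ s))
    s = F @ s

Y = hankel(ys, L=4)                         # shape (4, 17)
print(np.linalg.matrix_rank(Y))             # 2: the rank is d_s, not L
```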

Why we do it:

Step 1: Estimate \(\mathcal O_L\)

Step 2: Recover \(\hat F\) and \(\hat H\) from \(\hat{\mathcal O}_L\)

$$\hat{\mathcal O}_L = \begin{bmatrix}\hat{\mathcal O}_L [0]\\\hat{\mathcal O}_L [1]\\\vdots \\ \hat{\mathcal O}_L [L-1]\end{bmatrix}\approx \begin{bmatrix}H\\HF\\\vdots \\ HF^{L-1}\end{bmatrix} $$

So we set \(\hat H = \hat {\mathcal O}_L[0]\) and \(\hat F\) as the least squares solution of

  • \(\hat {\mathcal O}_L[0] F = \hat {\mathcal O}_L[1] \)
  • ...
  • \(\hat {\mathcal O}_L[L-2] F = \hat {\mathcal O}_L[L-1] \)

Subspace identification

Extract the column space using the singular value decomposition (SVD) \(Y_{L,m} = \begin{bmatrix}U_{d_s} & U_{2}\end{bmatrix} \begin{bmatrix}\Sigma_{d_s} \\& 0\end{bmatrix}V^\top \)

Set \(\hat{\mathcal O}_L = U_{d_s}\)
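A minimal end-to-end sketch of the two steps (assuming noiseless scalar outputs and known \(d_s\)); since \(\hat F,\hat H\) are recovered only up to a similarity transform, the check compares eigenvalues rather than the matrices themselves:

```python
import numpy as np

# Simulate noiseless outputs of the observable (theta, omega) example
F = np.array([[0.9, 0.1], [0.0, 0.9]])
H = np.array([[1.0, 0.0]])
s, ys = np.array([1.0, 3.0]), []
for _ in range(30):
    ys.append(float(H @ s))
    s = F @ s

L, d_s = 4, 2
m = len(ys) - L
Y = np.array([[ys[i + j] for j in range(m + 1)] for i in range(L)])   # L x (m+1) Hankel matrix

# Step 1: estimate the column space of O_L via the SVD
U, S, Vt = np.linalg.svd(Y)
O_hat = U[:, :d_s]                          # basis for the column space (equals O_L M for some M)

# Step 2: recover H and F from the (block) rows of O_hat using shift invariance
H_hat = O_hat[:1, :]                        # first block row
F_hat, *_ = np.linalg.lstsq(O_hat[:-1, :], O_hat[1:, :], rcond=None)

print(np.sort(np.linalg.eigvals(F_hat).real))   # approximately [0.9, 0.9], matching eig(F)
```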

Summary

  • Linear AR model: $$\hat y_{t+1} = \hat\Theta^\top \hat{\bar y}_{t:t-L+1} = \sum_{\ell=0}^{L-1} \hat \Theta_{\ell+1}^\top \hat y_{t-\ell} $$
  • Fact 1: Linear AR model \(\iff\) partially observed LDS $$s_{t+1} = Fs_t,\quad y_t = Hs_t$$
  • Fact 2: from an output perspective, PO-LDS do not have a unique state space representation, but they do define a unique subspace: the column space of \(\mathcal O\)
    • equal to the column space of the Hankel matrix
    • equal to the span of the AR covariates

Recap

  • Autoregressive models
  • Partially observed dynamical systems
  • Observability and subspace ID

Next time: stochastic dynamics and filtering

Announcements

  • Third assignment due Thursday
  • Useful posts on Edstem about formatting PRs for submission

Reference: Callier & Desoer, "Linear System Theory" and Verhaegen & Verdult, "Filtering and System Identification"
