Fall 2025, Prof Sarah Dean
"What we do"
"Why we do it"
Last lecture: AR of length \(L\) is equivalent to deterministic LDS with state dimension \(L\)
This lecture: AR of length \(L=O(\log(1/\epsilon))\) can \(\epsilon\)-approximate optimal prediction for a stochastic LDS
$$ s_{t+1} = F s_t + w_t,\quad y_t = Hs_t + v_t$$
"What we do"
"What we do"
"Why we do it"
Fact 1: if \(w_k\) and \(v_k\) are Gaussian random variables, then the Kalman filter computes the posterior distribution of the latent state $$P(s_t | y_0,...,y_t) =\mathcal N(\hat s_{t|t}, P_{t|t})$$
Fact 2: the Kalman filter solves the following weighted least squares problem online $$\min_{{s_0, ..., s_t}} ~~~\sum_{k=0}^t \|\Sigma_v^{-1/2}(H s_k-y_k)\|_2^2+ \sum_{k=0}^{t-1} \|\Sigma_w^{-1/2}(F s_k-s_{k+1})\|_2^2+ \|s_0\|_2^2 $$
Fact 3: A linear autoregressive model of length \(L=O(\log(1/\epsilon))\) can \(\epsilon\)-approximate the Kalman filter's estimates
$$ s_{t+1} = F s_t + w_t,\quad y_t = Hs_t + v_t$$
\(\mathbb E[y_t] = \sum_{k=0}^{t}HF^{k}\mu_w + \mu_v\)
\(\mathrm{Cov}[y_t] = \sum_{k=0}^{t}HF^{k}\Sigma_w(F^{k})^\top H^\top + \Sigma_v\)
For Gaussian noise, the mean and covariance characterize the entire (prior) distribution of the outputs (similar expressions hold for the states); here the initial state is treated as an additional noise draw, \(s_0\sim\mathcal N(\mu_w,\Sigma_w)\), so the sums run to \(k=t\)
Converges to a steady state distribution as \(t\to\infty\) if \(F\) is stable
\(\displaystyle y_{t} = HF^t s_0+ \sum_{k=1}^{t}HF^{k-1}w_{t-k}+v_t\)
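As a quick sanity check of the mean and covariance formulas above, here is a minimal Monte Carlo sketch in Python (not from the lecture); the matrices \(F\), \(H\), the noise statistics, and the draw \(s_0\sim\mathcal N(\mu_w,\Sigma_w)\) are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
F = np.array([[0.9, 0.2], [0.0, 0.7]])        # stable dynamics (assumed example)
H = np.array([[1.0, 0.0]])                    # output map (assumed example)
mu_w, Sigma_w = np.array([0.1, 0.0]), 0.1 * np.eye(2)
mu_v, Sigma_v = np.array([0.0]), 0.2 * np.eye(1)
t, n_samples = 10, 100_000

# Monte Carlo: draw s_0 ~ N(mu_w, Sigma_w), roll the dynamics forward t steps,
# then measure y_t = H s_t + v_t
S = rng.multivariate_normal(mu_w, Sigma_w, size=n_samples)
for _ in range(t):
    S = S @ F.T + rng.multivariate_normal(mu_w, Sigma_w, size=n_samples)
Y = S @ H.T + rng.multivariate_normal(mu_v, Sigma_v, size=n_samples)

# Closed-form prior statistics from the slide: sums of H F^k (mu_w / Sigma_w) terms
Fk, mean, cov = np.eye(2), mu_v.copy(), Sigma_v.copy()
for _ in range(t + 1):
    mean = mean + H @ Fk @ mu_w
    cov = cov + H @ Fk @ Sigma_w @ Fk.T @ H.T
    Fk = F @ Fk

print("empirical mean", Y.mean(axis=0), "  formula", mean)
print("empirical var ", Y.var(axis=0), "  formula", cov.ravel())
```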
Fact 1: if \(w_k\) and \(v_k\) are Gaussian random variables, then the Kalman filter computes the posterior distribution of the latent state $$P(s_t | y_0,...,y_t) =\mathcal N(\hat s_{t|t}, P_{t|t})$$
Consider a single KF update
At time \(t\) we have \(P(s_t|y_{t-1},...) = \mathcal N(\hat s_{t\mid t-1} ,P_{t\mid t-1} )\) given by the Extrapolate step
Incorporating the new measurement \(y_t\) in the Update step (Bayes' rule for Gaussians) yields the posterior \(P(s_t \mid y_0,\dots,y_t) = \mathcal N(\hat s_{t\mid t}, P_{t\mid t})\)
As a result, the Kalman filter's state estimate is statistically optimal: it is both the Minimum Variance Unbiased Estimator and the Maximum Likelihood Estimator
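For concreteness, a minimal sketch of one Extrapolate/Update cycle in Python; the matrices and the \(\mathcal N(0, I)\) prior are assumed example values, not the lecture's.
```python
# Minimal Kalman filter sketch for s_{t+1} = F s_t + w_t, y_t = H s_t + v_t
# with w_t ~ N(0, Q), v_t ~ N(0, R). All numerical values are illustrative.
import numpy as np

def extrapolate(s_hat, P, F, Q):
    """Push the posterior N(s_hat, P) through the dynamics."""
    return F @ s_hat, F @ P @ F.T + Q

def update(s_pred, P_pred, y, H, R):
    """Condition the prediction N(s_pred, P_pred) on a new measurement y."""
    S = H @ P_pred @ H.T + R                          # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)               # Kalman gain
    s_post = s_pred + K @ (y - H @ s_pred)            # posterior mean  \hat s_{t|t}
    P_post = (np.eye(len(s_pred)) - K @ H) @ P_pred   # posterior covariance P_{t|t}
    return s_post, P_post

# Example usage with arbitrary (assumed) matrices
F = np.array([[1.0, 0.1], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q, R = 0.01 * np.eye(2), 0.1 * np.eye(1)
s_hat, P = np.zeros(2), np.eye(2)                     # assumed N(0, I) prior on the state
s_pred, P_pred = extrapolate(s_hat, P, F, Q)
s_hat, P = update(s_pred, P_pred, y=np.array([0.8]), H=H, R=R)
print(s_hat, P)
```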
Recall the least squares optimization $$ \min_{\textcolor{yellow}{\theta}} \sum_{k=0}^t (y_k - \textcolor{yellow}{\theta^\top} x_k )^2 $$
Equivalent to minimizing the squared residuals of a linear model:
$$ \min_{\textcolor{yellow}{\theta, v_k}} \sum_{k=0}^t \textcolor{yellow}{v_k}^2 \quad\text{s.t.}\quad y_k = \textcolor{yellow}{\theta^\top} x_k + \textcolor{yellow}{v_k}, \quad k = 0,\dots,t$$
The closed-form solution is \(\hat\theta = \left(\sum_{k=0}^tx_kx_k^\top\right)^{-1}\sum_{k=0}^t x_k y_k\)
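A small numerical check (with made-up data) that the summation closed form matches a generic least-squares solver:
```python
# Illustrative sketch: verify the closed-form least squares estimate numerically.
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])                    # assumed "true" parameter
X = rng.normal(size=(50, 2))                          # rows are x_k^T
Y = X @ theta_true + 0.1 * rng.normal(size=50)        # y_k = theta^T x_k + v_k

# Summation form: (sum x_k x_k^T)^{-1} sum x_k y_k  ==  (X^T X)^{-1} X^T Y
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])   # the two should agree
```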
In matrix form,
$$\underbrace{\begin{bmatrix} y_0 \\ \vdots \\ y_t \end{bmatrix} }_Y= \underbrace{ \begin{bmatrix} x_0^\top\\ \vdots \\ x_t^\top \end{bmatrix} }_{X}\textcolor{yellow}{\theta} + \textcolor{yellow}{\begin{bmatrix} v_0\\ \vdots \\ v_t\end{bmatrix}}$$
and the closed-form solution is \(\hat\theta = \left(X^\top X\right)^{-1}X^\top Y\)
We can also understand the KF as solving a specific (familiar) optimization problem
At time \(t\), due to our observations, measurement model, and dynamics model:
\(y_k = H\textcolor{yellow}{s_k} + \textcolor{yellow}{v_k}\) for \(k=0,\dots,t\) and \(\textcolor{yellow}{s_{k+1}} = F\textcolor{yellow}{s_k} + \textcolor{yellow}{w_k}\) for \(k=0,\dots,t-1\)
Our least-squares estimation problem is*
$$\min_{\textcolor{yellow}{s}} ~~~\sum_{k=0}^t \|H\textcolor{yellow}{s_k}-y_k\|_2^2+ \sum_{k=0}^{t-1}\|F\textcolor{yellow}{s_k}-\textcolor{yellow}{s_{k+1}}\|_2^2+ \|\textcolor{yellow}{s_0}\|_2^2 $$
*for simplicity, for the rest of the lecture let \(\Sigma_w=I\) and \(\Sigma_v=I\)
Stacking these equations into matrix form:
$$\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \underbrace{\begin{bmatrix} H\\ F&-I \\ &H\\ & F&-I \\ &&\ddots \\ &&&H \end{bmatrix}}_{A} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$
The least squares estimator minimizes the squared residuals
$$\min_{\textcolor{yellow}{s,v,w}} ~~~\sum_{k=0}^{t-1} \left(\textcolor{yellow}{\|w_k\|_2^2 + \|v_k\|_2^2}\right) + \textcolor{yellow}{\|v_t\|_2^2+\|s_0\|_2^2} $$
$$\text{s.t.}~~~\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = A \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$
This is equivalent to the optimization problem above, and its solution in closed form is
$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A^\top A)^{-1} A^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$
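A sketch of the batch formulation in Python, assuming \(\Sigma_w=\Sigma_v=I\) and toy values for \(F\), \(H\), and the measurements; it builds the stacked matrix \(A\) from the slide and recovers all smoothed states at once.
```python
# Illustrative sketch (assumed example matrices): build the stacked system and
# recover the smoothed states \hat s_{k|t} by ordinary least squares.
import numpy as np

def stacked_system(F, H, ys):
    """Return (A, b) interleaving measurement rows H s_k = y_k and
    dynamics rows F s_k - s_{k+1} = 0, as on the slide."""
    n, p, t = F.shape[0], H.shape[0], len(ys) - 1
    rows_A, rows_b = [], []
    for k in range(t + 1):
        meas = np.zeros((p, n * (t + 1)))
        meas[:, k * n:(k + 1) * n] = H
        rows_A.append(meas); rows_b.append(ys[k])
        if k < t:
            dyn = np.zeros((n, n * (t + 1)))
            dyn[:, k * n:(k + 1) * n] = F
            dyn[:, (k + 1) * n:(k + 2) * n] = -np.eye(n)
            rows_A.append(dyn); rows_b.append(np.zeros(n))
    return np.vstack(rows_A), np.concatenate(rows_b)

# Example usage (F, H, ys are assumed toy values)
F = np.array([[0.9, 0.1], [0.0, 0.8]])
H = np.array([[1.0, 0.0]])
ys = [np.array([1.0]), np.array([1.2]), np.array([0.7])]
A, b = stacked_system(F, H, ys)
s_hat = np.linalg.lstsq(A, b, rcond=None)[0]          # = (A^T A)^{-1} A^T b
print(s_hat.reshape(len(ys), -1))                     # rows: \hat s_{0|t}, ..., \hat s_{t|t}
```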
Fact 2: the Kalman filter efficiently solves the above least squares problem online, i.e. \(\hat s_{t|t}\) exactly coincides with the solution above
(see extra slides below)
Writing out the solution \((A^\top A)^{-1}A^\top\) explicitly:
$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}H^\top H + F^\top F & -F^\top \\ -F& H^\top H+F^\top F + I & -F^\top \\ & -F & \ddots & \\ &&& H^\top H+I\end{bmatrix}^{-1}\begin{bmatrix}H^\top y_0 \\ \vdots \\ H^\top y_t \end{bmatrix} $$
Block tri-diagonal matrix inverse
Abbreviating the diagonal blocks as \(D_1\), \(D_2\), \(D_3\):
$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}D_1 & -F^\top \\ -F &D_2 & -F^\top \\ & -F & \ddots & \\ &&&D_3\end{bmatrix}^{-1}\begin{bmatrix}H^\top y_0 \\ \vdots \\ H^\top y_t \end{bmatrix} $$
When the next measurement \(y_{t+1}\) arrives, the system grows by one block (new terms in yellow):
$$\textcolor{yellow}{\begin{bmatrix}\hat s_{0\mid t+1}\\ \hat s_{1\mid t+1} \\ \vdots \\ \hat s_{t\mid t+1} \\ \hat s_{t+1\mid t+1} \end{bmatrix}} =\begin{bmatrix}D_1 & -F^\top \\ -F& D_2 & -F^\top \\ & -F & \ddots & \\ &&& D_3+\textcolor{yellow}{F^\top F} & \textcolor{yellow}{-F^\top}\\ &&& \textcolor{yellow}{-F} &\textcolor{yellow}{ H^\top H + I}\end{bmatrix}^{-1}\begin{bmatrix}H^\top y_0 \\ \vdots \\ H^\top y_t \\ \textcolor{yellow}{H^\top y_{t+1}}\end{bmatrix} $$
Exploiting the block tri-diagonal structure, it is possible to write \(\hat s_{t+1\mid t+1}\) as a linear combination of \(\hat s_{t\mid t}\) and \(y_{t+1}\): this is exactly the recursive Kalman filter update
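A numerical check of Fact 2 (illustrative, with assumed toy matrices): running the Kalman filter initialized with prior \(\mathcal N(0,I)\) and \(\Sigma_w=\Sigma_v=I\), then comparing \(\hat s_{t\mid t}\) against the last block of the batch solution of the regularized problem (the version including the \(\|s_0\|_2^2\) term).
```python
# Sketch: online Kalman filter estimate vs. batch least squares (assumed toy system,
# Q = R = I, and an assumed N(0, I) prior matching the ||s_0||^2 term).
import numpy as np

rng = np.random.default_rng(1)
n, p, t = 2, 1, 6
F = np.array([[0.9, 0.2], [0.0, 0.7]])
H = np.array([[1.0, 0.5]])
ys = [rng.normal(size=p) for _ in range(t + 1)]       # arbitrary measurements

# --- Kalman filter (extrapolate + update), prior s_0 ~ N(0, I) ---
s_hat, P = np.zeros(n), np.eye(n)
for k, y in enumerate(ys):
    if k > 0:                                         # extrapolate
        s_hat, P = F @ s_hat, F @ P @ F.T + np.eye(n)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + np.eye(p))
    s_hat, P = s_hat + K @ (y - H @ s_hat), (np.eye(n) - K @ H) @ P

# --- Batch: min ||s_0||^2 + sum_k ||H s_k - y_k||^2 + sum_k ||F s_k - s_{k+1}||^2 ---
N = n * (t + 1)
rows = [np.hstack([np.eye(n), np.zeros((n, N - n))])]  # row enforcing the ||s_0||^2 term
rhs = [np.zeros(n)]
for k in range(t + 1):
    meas = np.zeros((p, N)); meas[:, k * n:(k + 1) * n] = H
    rows.append(meas); rhs.append(ys[k])
    if k < t:
        dyn = np.zeros((n, N))
        dyn[:, k * n:(k + 1) * n] = F
        dyn[:, (k + 1) * n:(k + 2) * n] = -np.eye(n)
        rows.append(dyn); rhs.append(np.zeros(n))
s_batch = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)[0]

print(s_hat)          # online KF estimate \hat s_{t|t}
print(s_batch[-n:])   # last block of the batch solution: identical up to round-off
```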
1. Classic least squares regression
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\). Treating the heart rate as a constant (features \(x_k = 1\), parameter \(\theta = s\)), the closed-form estimate above is just the average \(\hat s = \frac{1}{t+1}\sum_{k=0}^t y_k\)
2. Least squares filtering
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\), where the heart rate drifts as \(s_{t+1} = s_t + w_t\)
At \(t = 0\): \(y_0 = s_0 + v_0\), so the estimate is simply \(\hat s_{0\mid 0} = y_0\)
At \(t = 1\), the stacked system and its least-squares solution are
$$ \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\end{bmatrix}$$
$$ \begin{bmatrix}\hat s_{0\mid 1}\\ \hat s_{1\mid 1}\end{bmatrix} = \begin{bmatrix} 2 & -1 \\ -1 & 2\end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 0\\ 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} $$
\(\hat s_{0\mid 1} = \frac{2y_0+y_1}{3}\)
\(\hat s_{1\mid 1} = \frac{y_0+2y_1}{3}\)
At \(t = 2\):
$$ \begin{bmatrix} y_0\\ 0 \\ y_1\\ 0 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1\\ & 1 & -1 \\ && 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\\ s_2\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\\w_1\\v_2\end{bmatrix}$$
\(\hat s_{0\mid 2} = \frac{5y_0+2y_1+1y_2}{8}\)
\(\hat s_{1\mid 2} = \frac{2y_0+4y_1+2y_2}{8}\)
\(\hat s_{2\mid 2} = \frac{y_0+2y_1+5y_2}{8}\)
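The \(t=2\) weights can be reproduced numerically: this small sketch builds the stacked matrix from the slide and prints \(8\) times the rows of the pseudoinverse applied to the measurements.
```python
# Reproducing the t = 2 filtering weights (F = H = 1, unit noise, no prior on s_0).
import numpy as np

A = np.array([[1, 0, 0],     # y_0 = s_0 + v_0
              [1, -1, 0],    # 0   = s_0 - s_1 + w_0
              [0, 1, 0],     # y_1 = s_1 + v_1
              [0, 1, -1],    # 0   = s_1 - s_2 + w_1
              [0, 0, 1]])    # y_2 = s_2 + v_2
b_map = np.zeros((5, 3)); b_map[[0, 2, 4], [0, 1, 2]] = 1   # places (y_0, y_1, y_2) in the RHS
W = np.linalg.solve(A.T @ A, A.T @ b_map)                   # \hat s = W @ [y_0, y_1, y_2]
print(8 * W)   # rows: [5 2 1], [2 4 2], [1 2 5], matching the slide
```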
[Plot: noisy heart-rate measurements \(y_t\) together with the estimates from 1. classic least squares regression and 2. least squares filtering]
\(s_{t+1} = Fs_t + w_t\)
\(y_t = Hs_t + v_t\)
[Block diagram: the system with state \(s\) and noise \(w_t\), \(v_t\) feeds measurements through \(H\) into the Kalman filter, which uses gain \(L_t\) and internal dynamics \(F-L_tHF\) to produce \(\hat s_t\)]
\(\hat s_{t+1} = F\hat s_t + L_t(y_{t+1} -HF\hat s_t)\)
\(\hat y_t = H\hat s_t\)
The Kalman filter in state space $$ \hat s_{t+1} = \underbrace{(F-L_tHF)}_{F_{L,t}}\hat s_t + L_ty_{t+1},\quad \hat y_t = H\hat s_t$$
"Unrolling" to get equivalent linear output model $$\hat y_{t} = H\Big(\prod_{k=0}^{t-1} F_{L,k}\Big) \hat s_{0}+ \sum_{k=1}^{t}H \Big(\prod_{\ell=0}^{k-1} F_{L,\ell}\Big) L_{t-1}y_{t-k+1}$$
The prediction \(\hat y_t\) is a linear combination of all past observations \(y_k\)
Claim: the first \(t-L\) terms can be upper bounded by a function scaling as \(C\rho^L\) for some \(\rho<1\) and \(C\geq 0\) (next assignment)
Fact 3: A linear autoregressive model of length \(L\geq \log(C/\epsilon) / \log(1/\rho)\) can \(\epsilon\)-approximate the Kalman filter's state estimates
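A rough numerical illustration of Fact 3 (assumed toy system, and using the steady-state gain \(L\) for simplicity): the truncated sum \(\sum_{k=1}^{L}HF_L^{k-1}L\,y_{t-k+1}\) closely matches the filter's output estimate, and the coefficient norms decay geometrically.
```python
# Sketch: AR(L) approximation of the steady-state Kalman filter (toy system, assumed values).
import numpy as np

rng = np.random.default_rng(2)
F = np.array([[0.9, 0.2], [0.0, 0.7]])
H = np.array([[1.0, 0.0]])
Q, R = np.eye(2), np.eye(1)

# Steady-state predicted covariance by iterating the Riccati recursion
P = np.eye(2)
for _ in range(500):
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    P = F @ (P - K @ H @ P) @ F.T + Q
L_gain = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # steady-state gain
F_L = (np.eye(2) - L_gain @ H) @ F                    # closed-loop filter dynamics F - L H F

# Simulate the system and run the steady-state filter \hat s_{t+1} = F_L \hat s_t + L y_{t+1}
T = 200
s, s_hat, ys, y_hats = np.zeros(2), np.zeros(2), [], []
for _ in range(T):
    s = F @ s + rng.multivariate_normal(np.zeros(2), Q)
    y = H @ s + rng.multivariate_normal(np.zeros(1), R)
    s_hat = F_L @ s_hat + L_gain @ y
    ys.append(y); y_hats.append(H @ s_hat)

# AR approximation: \hat y_t ~ sum_{k=1}^{L} H F_L^{k-1} L y_{t-k+1}
L_len = 10
coeffs = [H @ np.linalg.matrix_power(F_L, k - 1) @ L_gain for k in range(1, L_len + 1)]
t0 = T - 1
y_ar = sum(coeffs[k - 1] @ ys[t0 - k + 1] for k in range(1, L_len + 1))
print("KF output estimate:", y_hats[t0], " AR(%d) approximation:" % L_len, y_ar)
print("coefficient norms (decaying):", [float(np.linalg.norm(c)) for c in coeffs[:5]])
```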
$$ s_{t+1} = F s_t + w_t,\quad y_t = Hs_t + v_t$$
Fact 1: the Kalman filter is statistically optimal if noise is Gaussian
Fact 2: the Kalman filter provides an online solution to a least-squares optimization problem
Fact 3: A linear autoregressive model of length \(L=O(\log(1/\epsilon))\) can \(\epsilon\)-approximate the Kalman filter's estimates
\(\implies\) AR is approximately optimal for LDS
Next time: discrete state and observation space
$$ \min_{{s,v,w}} ~~~\sum_{k=0}^{t-1} \|\Sigma_{w,k}^{-1/2}w_k\|_2^2 + \sum_{k=0}^{t}\|\Sigma_{v,k}^{-1/2}v_k\|_2^2+\|\Sigma_{s}^{-1/2}s_0\|_2^2 $$
$$\text{s.t.}~~~\bar{y}_{0:t} = A \bar s_{0:t}+ \bar w_{0:t} + \bar v_{0:t}$$
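A brief sketch (arbitrary example matrices) of how such covariance weights are handled numerically, by whitening residuals with a square root of the covariance before solving ordinary least squares.
```python
# Sketch: weighted least squares via whitening. ||Lc^{-1} r||^2 = r^T Sigma^{-1} r
# when Sigma = Lc Lc^T, so any square root of Sigma works for the squared norm.
import numpy as np

def whiten(M, Sigma):
    """Return Lc^{-1} M using the Cholesky factor Lc of Sigma."""
    return np.linalg.solve(np.linalg.cholesky(Sigma), M)

# Example: weighted measurement residual ||Sigma_v^{-1/2} (H s - y)||_2^2 (assumed values)
Sigma_v = np.array([[0.5, 0.1], [0.1, 0.3]])
H = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, -0.5])
Hw, yw = whiten(H, Sigma_v), whiten(y, Sigma_v)
s_hat = np.linalg.lstsq(Hw, yw, rcond=None)[0]   # ordinary LS on the whitened data
print(s_hat)
```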