State Estimation
ML in Feedback Sys #8
Prof Sarah Dean
Announcements
- Final project details posted
  - proposal due October 7
- Office hours 10-11am this Thursday
- Reminders
  - Working in pairs/groups, self-assessment
- Paper presentation schedule posted
  - Required: meet with Atul at least 2 days before you are scheduled to present
  - schedule your meeting with Atul a week or more in advance!
Recap: Paper presentations
- HSNL18 Fairness without demographics in repeated loss minimization
- Motivation: fairness via equitable allocation of accuracy across groups
- Population update: group retention \(\lambda_t^{(k)}\) based on risk
- Learner update: (robust) risk minimization \(\theta_t\) based on mixture \(\sum_k \frac{\lambda^{(k)}_t}{\sum_\ell \lambda^{(\ell)}_t}\mathcal D_k\)
- Linearization argument shows ERM is unstable; induction argument shows that DRO controls worst-case risk
[Diagram: population state \(\lambda_t\) evolving under dynamics \(F\)]
Recap: Paper presentations
- PZMDH20 Performative prediction
- Motivation: general model of reactive distribution shift
- Population update: smooth dependence \(\mathcal D(\theta)\)
- Learner update: risk minimization \(\theta_{t+1}\) based on \(\mathcal D(\theta_t)\)
- Induction argument shows convergence to \(\theta_{PS}\), proximity to \(\theta_{PO}\)
[Diagram: model parameters \(\theta_t\) evolving under dynamics \(F\)]
Recap: Stability via Lyapunov
- Verify stability by finding a Lyapunov function (positive definite and decreasing)
- Stable linear dynamics have quadratic Lyapunov functions
- For example, \(V(s)= s^\top \sum_{t=0}^\infty (A^\top)^t A^t s\)
- For nonlinear dynamics, stability of the Jacobian determines (local) stability
- Stochastic dynamics: convergence to sets around stable equilibria
Reference: Bof, Carli, Schenato, "Lyapunov Theory for Discrete Time Systems"
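As a sanity check on the quadratic construction above, here is a minimal numpy/scipy sketch (our own, not from the slides): it computes \(P=\sum_{t=0}^\infty (A^\top)^t A^t\) by solving the discrete Lyapunov equation \(A^\top P A - P + I = 0\) and checks that \(V(s)=s^\top P s\) decreases along trajectories.

```python
# Sketch: quadratic Lyapunov function for stable linear dynamics s_{t+1} = A s_t.
# P = sum_{t>=0} (A^T)^t A^t solves the discrete Lyapunov equation A^T P A - P + I = 0.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1], [0.0, 0.9]])       # Schur stable: eigenvalues inside unit circle
P = solve_discrete_lyapunov(A.T, np.eye(2))  # solves A^T P A - P + I = 0

V = lambda s: s @ P @ s
s = np.array([1.0, -2.0])
for _ in range(5):
    s_next = A @ s
    assert V(s_next) < V(s)                  # V decreases along every trajectory
    s = s_next
```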
Example: stochastic gradient descent
Consider the dynamics of stochastic gradient descent on a twice differentiable function \(g:\mathbb R^d\to\mathbb R\)
$$\theta_{t+1} = \theta_t - \alpha g_t = \theta_t - \alpha \nabla g(\theta_t) + \underbrace{\alpha (\nabla g(\theta_t) - g_t) }_{w_t}$$
- may converge to a ball around minima (see the sketch below)
- in general, could jump between minima
- in practice, decreasing step size \(\alpha_t\propto 1/t\)
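The sketch below (a toy simulation, not from the slides) illustrates these behaviors on the quadratic \(g(\theta)=\tfrac12\|\theta\|^2\): a constant step size leaves the iterates hovering in a noise ball around the minimum, while \(\alpha_t\propto 1/t\) drives them toward it.

```python
# Sketch: SGD on g(theta) = 0.5*||theta||^2 with additive gradient noise w_t.
# Constant step size -> iterates hover in a ball around the minimum;
# decreasing step size alpha_t ~ 1/t -> iterates converge.
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(step_schedule, T=5000, d=2, noise=0.5):
    theta = np.ones(d)
    for t in range(1, T + 1):
        grad = theta + noise * rng.standard_normal(d)  # noisy gradient g_t
        theta = theta - step_schedule(t) * grad
    return np.linalg.norm(theta)

print("constant alpha: |theta_T| =", run_sgd(lambda t: 0.1))
print("alpha_t = 1/t:  |theta_T| =", run_sgd(lambda t: 1.0 / t))
```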

Dynamical System
[Diagram: system \(F\) with internal state \(s\), input \(w_t\), output \(y_t\)]
\(s_{t+1} = F(s_t, w_t)\)
\(y_t = G(s_t)\)
- input signal \(w_t\): external phenomena
- output signal \(y_t\): measurements

Dynamical System
[Diagram: system \(F\) with internal state \(s\), input \(w_t\), output \(y_t\), measurement noise \(v_t\)]
\(s_{t+1} = F(s_t, w_t)\)
\(y_t = G(s_t)+v_t\)
- input signal \(w_t\): external phenomena
- output signal \(y_t\): noisy measurements

System Response
[Diagram: system response map \(\Phi_{F,G}\) taking initial state \(s_0\) and inputs \(w_t\) to outputs \(y_t\), plus measurement noise \(v_t\)]
$$(y_0, y_1,...) = \Phi_{F,G}[(s_0, w_0, w_1,...)]+(v_0,v_1,...)$$
If \(F,G\) are known (e.g. from physics), then we can compute \(y_{t+1}\) given
- \(s_0\), \( w_{0:t}\), and \( v_{0:t+1}\)
- (or given \(s_t\), \(w_t\), and \(v_{t+1}\))
But we only observe \(y_{0:t}\)!
Related Goals:
- Predict future outputs $$ y_{0:t} \to \hat y_{t+1}$$
- Estimate underlying state $$ y_{0:t} \to \hat s_t $$
Challenges:
- Unknown initial state \(s_0\)
- Unpredictable process due to \(w_t\)
- Noisy measurements due to \(v_t\)
Example
The state \(s=[\theta,\omega]\), output \(y=\theta\), and
$$\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$
- What can we predict about \(y_t\) and \(s_t\) as \(t\to\infty\)?
- Suppose \(y_0 = 1\). Can we predict \(y_1\) or estimate \(s_0\)?
- What if we also know that \(y_1 = 1\)?
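A quick numeric check of these questions (a sketch; this model is noiseless, so the answers are exact): stability of the dynamics gives \(s_t\to 0\); \(y_0=1\) determines \(\theta_0\) but not \(\omega_0\); adding \(y_1=1\) pins down \(\omega_0\).

```python
# Sketch for the example: s = [theta, omega], y = theta, no noise.
import numpy as np

A = np.array([[0.9, 0.1],
              [0.0, 0.9]])

# 1) As t -> infty, s_t -> 0 (eigenvalues of A are 0.9, 0.9), hence y_t -> 0.
s = np.array([1.0, 1.0])
for _ in range(100):
    s = A @ s
print("s_100 =", s)  # close to zero

# 2) y_0 = 1 gives theta_0 = 1 but leaves omega_0 free, so
#    y_1 = 0.9 + 0.1 * omega_0 is not determined by y_0 alone.
# 3) Knowing y_1 = 1 as well pins it down: 0.9 + 0.1 * omega_0 = 1.
omega0 = (1.0 - 0.9) / 0.1
print("omega_0 =", omega0)  # 1.0
```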
Setting: linear state estimation
Linear dynamics and measurement:
\(s_{t+1} = As_t + w_t\)
\(y_t = Cs_t + v_t\)
- State space \(\mathbb R^n\) and output space \(\mathbb R^p\)
- \(A\in\mathbb R^{n\times n}\) and \(C\in\mathbb R^{p\times n}\)
At time \(t\), we have observed \(y_{0:t}\), which we can use to compute \(\hat s_{k\mid t}\): our guess of \(s_k\) given observations up to time \(t\)
Estimation via least squares
At time \(t\), our observations and models tell us that
- \(y_0 = Cs_0 + v_0\)
- \(s_1 = As_0 + w_0\)
- \(y_1 = Cs_1 + v_1\)
\(\quad\vdots\)
- \(s_t = As_{t-1} + w_{t-1}\)
- \(y_t = Cs_t + v_t\)
Estimation via least squares
Rearranging so that the knowns sit on the left and the unknowns on the right:
- \(y_0 = Cs_0 + v_0\)
- \(0 = As_0 - s_1 + w_0\)
- \(y_1 = Cs_1 + v_1\)
\(\quad\vdots\)
- \(0 = As_{t-1} - s_t + w_{t-1}\)
- \(y_t = Cs_t + v_t\)
Estimation via least squares
In matrix form:
$$\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \begin{bmatrix} C\\ A&-I \\ &C\\ & A&-I \\ &&\ddots \\ &&&C \end{bmatrix} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$
Estimation via least squares
The least squares estimator minimizes the squared error
$$\min_{\textcolor{yellow}{s,v,w}} ~~~\sum_{k=0}^{t-1} \textcolor{yellow}{\left(\|w_k\|_2^2 + \|v_k\|_2^2\right)} + \textcolor{yellow}{\|v_t\|_2^2+\|s_0\|_2^2} $$
$$\text{s.t.}~~~\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \underbrace{\begin{bmatrix} C\\ A&-I \\ &C \\ & A&-I \\ &&\ddots \\ &&&C\end{bmatrix}}_{A_C} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$
Estimation via least squares
Solving gives the least squares estimate
$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A_C^\top A_C)^{-1} A_C^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$
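A minimal numpy sketch of this estimator (the name `estimate_states` is ours; we omit the \(\|s_0\|_2^2\) regularization so that `lstsq` matches the closed form above): it stacks the measurement and dynamics equations into \(A_C\) and solves the least-squares problem.

```python
# Sketch: batch least-squares state estimation for s_{t+1} = A s_t + w_t, y_t = C s_t + v_t.
import numpy as np

def estimate_states(A, C, ys):
    """ys: outputs y_0, ..., y_t; returns the stack [s_hat_{0|t}, ..., s_hat_{t|t}]."""
    n, p, T = A.shape[0], C.shape[0], len(ys)
    rows, rhs = [], []
    for k in range(T):
        meas = np.zeros((p, n * T))
        meas[:, k * n:(k + 1) * n] = C          # y_k = C s_k + v_k
        rows.append(meas)
        rhs.append(np.atleast_1d(ys[k]))
        if k < T - 1:
            dyn = np.zeros((n, n * T))
            dyn[:, k * n:(k + 1) * n] = A       # 0 = A s_k - s_{k+1} + w_k
            dyn[:, (k + 1) * n:(k + 2) * n] = -np.eye(n)
            rows.append(dyn)
            rhs.append(np.zeros(n))
    A_C, b = np.vstack(rows), np.concatenate(rhs)
    s_hat, *_ = np.linalg.lstsq(A_C, b, rcond=None)
    return s_hat.reshape(T, n)

# Scalar heart-rate model from the next slides (A = C = 1):
print(estimate_states(np.eye(1), np.eye(1), [1.0, 2.0, 3.0]))  # [[1.5], [2.0], [2.5]]
```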
Example: effect of modelling drift
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
1. Classic least squares regression
- assume fixed \(s_t = s\)
- least squares solution to measurement model equations $$\begin{bmatrix} y_0\\\vdots\\y_t\end{bmatrix} = \begin{bmatrix} s\\\vdots\\s\end{bmatrix} + \begin{bmatrix} v_0\\\vdots\\v_t\end{bmatrix}$$
- what is \(\hat s_{\mid 0}\), \(\hat s_{\mid 1}\), and \(\hat s_{\mid 2}\)?
- predict average $$\hat s_{\mid t} = \frac{1}{t+1}\sum_{k=0}^t y_k$$
2. Least squares filtering
- allow for drift \(s_{t+1} = s_{t} + w_t\)
- least squares solution to measurement model and drift model equations
- what is \(\hat s_{k\mid 0}\), \(\hat s_{k\mid 1}\), and \(\hat s_{k\mid 2}\)?
At \(t=0\): \(y_0 = s_0 + v_0\) gives \(\hat s_{0\mid 0} = y_0\)
At \(t=1\):
$$ \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\end{bmatrix}$$
$$ \begin{bmatrix}\hat s_{0\mid 1}\\ \hat s_{1\mid 1}\end{bmatrix} = \begin{bmatrix} 2 & -1 \\ -1 & 2\end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 0\\ 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} $$
\(\hat s_{0\mid 1} = \frac{2y_0+y_1}{3}\), \(\hat s_{1\mid 1} = \frac{y_0+2y_1}{3}\)
At \(t=2\):
$$ \begin{bmatrix} y_0\\ 0 \\ y_1\\ 0 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1\\ & 1 & -1 \\ && 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\\ s_2\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\\w_1\\v_2\end{bmatrix}$$
\(\hat s_{0\mid 2} = \frac{5y_0+2y_1+y_2}{8}\), \(\hat s_{1\mid 2} = \frac{2y_0+4y_1+2y_2}{8}\), \(\hat s_{2\mid 2} = \frac{y_0+2y_1+5y_2}{8}\)
Example: effect of modelling drift
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
1. Classic least squares regression
- assume fixed \(s_t = s\)
- least squares solution to measurement model equation
- predict average $$\hat s_{\mid t} = \frac{1}{t+1}\sum_{k=0}^t y_k$$
2. Least squares filtering
- allow for drift \(s_{t+1} = s_{t} + w_t\)
- least squares solution to measurement model and drift model equations
- predict average weighted by proximity in time
- \(\hat s_{k\mid t}\) more heavily weights \(y_k\)
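The filtering weights derived above can be checked numerically; the sketch below (ours) reads off the weight each \(y_k\) receives in \(\hat s_{0\mid 2}, \hat s_{1\mid 2}, \hat s_{2\mid 2}\) by solving the least-squares problem with unit vectors on the right-hand side.

```python
# Verify the t = 2 filtering weights from the heart-rate example.
import numpy as np

A_C = np.array([[1.,  0.,  0.],   # y_0 = s_0 + v_0
                [1., -1.,  0.],   # 0 = s_0 - s_1 + w_0
                [0.,  1.,  0.],   # y_1 = s_1 + v_1
                [0.,  1., -1.],   # 0 = s_1 - s_2 + w_1
                [0.,  0.,  1.]])  # y_2 = s_2 + v_2

# Setting the stacked right-hand side to a unit vector at y_k isolates
# the weight y_k contributes to each estimate s_hat_{0|2}, s_hat_{1|2}, s_hat_{2|2}.
for k in range(3):
    b = np.zeros(5)
    b[[0, 2, 4][k]] = 1.0
    w, *_ = np.linalg.lstsq(A_C, b, rcond=None)
    print(f"8 * weights of y_{k}:", np.round(8 * w, 3))
# Prints (5, 2, 1), (2, 4, 2), (1, 2, 5), matching the fractions above.
```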
Online estimation
$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A_C^\top A_C)^{-1} A_C^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$
$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}C^\top C + A^\top A & -A^\top \\ -A & C^\top C+A^\top A + I & -A^\top \\ & -A & \ddots & \\ &&& C^\top C+I\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \end{bmatrix} $$
Block tri-diagonal matrix inverse
$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}D_1 & -A^\top \\ -A &D_2 & -A^\top \\ & -A & \ddots & \\ &&&D_3\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \end{bmatrix} $$
Online estimation
$$\textcolor{yellow}{\begin{bmatrix}\hat s_{0\mid t+1}\\ \hat s_{1\mid t+1} \\ \vdots \\ \hat s_{t\mid t+1} \\ \hat s_{t+1\mid t+1} \end{bmatrix}} =\begin{bmatrix}D_1 & -A^\top \\ -A & D_2 & -A^\top \\ & -A & \ddots & \\ &&& D_3+\textcolor{yellow}{A^\top A} & \textcolor{yellow}{-A^\top}\\ &&& \textcolor{yellow}{-A} &\textcolor{yellow}{ C^\top C + I}\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \\ \textcolor{yellow}{C^\top y_{t+1}}\end{bmatrix} $$
Block tri-diagonal matrix inverse
Possible to write \(\hat s_{t+1\mid t+1}\) as a linear combination of \(\hat s_{t\mid t}\) and \(y_{t+1}\)
Kalman filter algorithm
- Initialize \(P_{0\mid0} =(C^\top C)^{\dagger}\) and \(\hat s_{0\mid 0} = P_{0\mid 0} C^\top y_0\)
- For \(t= 1, 2, \ldots\):
  - Extrapolate
    - State \(\hat s_{t\mid t-1} =A\hat s_{t-1\mid t-1} \)
    - Info matrix \(P_{t\mid t-1} = AP_{t-1\mid t-1} A^\top + I\)
  - Compute gain $$L_{t} = P_{t\mid t-1}C^\top ( CP_{t\mid t-1} C^\top+I)^{-1}$$
  - Update
    - State \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C\hat s_{t\mid t-1})\)
    - Info matrix \(P_{t\mid t} = (I - L_{t}C)P_{t\mid t-1}\)
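A direct numpy transcription of the algorithm (a sketch, with the identity noise covariances used above; `pinv` implements the pseudo-inverse \((C^\top C)^\dagger\)):

```python
# Sketch of the Kalman filter exactly as stated above (identity covariances).
import numpy as np

def kalman_filter(A, C, ys):
    """ys: outputs y_0, ..., y_T; returns [s_hat_{0|0}, ..., s_hat_{T|T}]."""
    n = A.shape[0]
    P = np.linalg.pinv(C.T @ C)             # P_{0|0} = (C^T C)^dagger
    s = P @ C.T @ np.atleast_1d(ys[0])      # s_hat_{0|0}
    estimates = [s]
    for y in ys[1:]:
        # Extrapolate
        s_pred = A @ s
        P_pred = A @ P @ A.T + np.eye(n)
        # Compute gain
        L = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + np.eye(C.shape[0]))
        # Update
        s = s_pred + L @ (np.atleast_1d(y) - C @ s_pred)
        P = (np.eye(n) - L @ C) @ P_pred
        estimates.append(s)
    return estimates

# e.g. the example system from earlier:
A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
print(kalman_filter(A, C, [1.0, 1.0])[-1])  # s_hat_{1|1}
```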
Kalman filter
[Diagram: measurements \(y_t\) from the system \(s_{t+1} = As_t + w_t\), \(y_t = Cs_t + v_t\) flow into the filter (Extrapolate, Compute Gain, Update), which outputs \(\hat s_t\)]
Combining the extrapolate and update steps:
$$\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)$$
Kalman filter
[Diagram: the system (\(A\), \(C\), state \(s\), noise \(w_t, v_t\)) feeds measurements \(y_t\) to the estimator, itself a linear system with state \(\hat s\), dynamics matrix \(A-L_tCA\), and input gain \(L_t\)]
$$\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t) = (A-L_tCA)\hat s_t + L_t y_{t+1}$$
Extensions
- Time-varying dynamics/measurements \(A_t, C_t\)
  - replace \(A, C\) with \(A_t, C_t\) in the filter updates
- Weighted least-squares models characteristics of the noise
$$\min_{{s,v,w}} ~~~\sum_{k=0}^{t-1} \|\Sigma_{w,k}^{-1/2}w_k\|_2^2 + \sum_{k=0}^{t}\|\Sigma_{v,k}^{-1/2}v_k\|_2^2+\|\Sigma_{s}^{-1/2}s_0\|_2^2 $$
$$\text{s.t.}~~~\bar{y}_{0:t} = A_C s_{0:t}+ \bar w_{0:t} + \bar v_{0:t}$$
- KF is the Minimum Variance Linear Unbiased Estimator
  - if \(s_0, w_k, v_k\) stochastic and independent \(\forall k\)
  - and \(\mathbb E[s_0] = \mathbb E[w_k] = \mathbb E[v_k] = 0\)
  - and covariances \(\Sigma_s,\Sigma_{w,k},\Sigma_{v,k}\)
- KF is the Minimum Variance Unbiased Estimator for Gaussian noise
  - if \(s_0\sim N(0,\Sigma_s)\), \(w_k\sim N(0,\Sigma_{w,k})\), \(v_k\sim N(0,\Sigma_{v,k})\)
  - in other words, KF exactly computes the conditional expectation \(\hat s_{t\mid t} = \mathbb E[s_t\mid y_{0:t}]\)
  - furthermore, \(P_{t\mid t}\) is the error covariance
The general algorithm:
- Initialize \(P_{0\mid0} \) and \(\hat s_{0\mid 0}\)
- For \(t= 1, 2, \ldots\):
  - Extrapolate
    - State \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1} \)
    - Info matrix \(P_{t\mid t-1} = A_t P_{t-1\mid t-1} A_t^\top + \Sigma_{w,t}\)
  - Compute gain \(L_{t} = P_{t\mid t-1}C_t^\top ( C_t P_{t\mid t-1} C_t^\top+\Sigma_{v,t})^{-1}\)
  - Update
    - State \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
    - Info matrix \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)
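For concreteness, a sketch of one cycle of this general filter (the helper `kf_step` is our own; with \(\Sigma_{w}=\Sigma_{v}=I\) it reduces to the earlier algorithm):

```python
# Sketch: one step of the general (weighted, time-varying) Kalman filter update.
import numpy as np

def kf_step(s, P, y, A_t, C_t, Sigma_w, Sigma_v):
    """One extrapolate/gain/update cycle with noise covariances Sigma_w, Sigma_v."""
    s_pred = A_t @ s                        # state extrapolation
    P_pred = A_t @ P @ A_t.T + Sigma_w      # extrapolation of P
    L = P_pred @ C_t.T @ np.linalg.inv(C_t @ P_pred @ C_t.T + Sigma_v)
    s_new = s_pred + L @ (y - C_t @ s_pred) # measurement update
    P_new = (np.eye(len(s)) - L @ C_t) @ P_pred
    return s_new, P_new

# Under the Gaussian assumptions above, s_new is E[s_t | y_{0:t}]
# and P_new is the error covariance of that estimate.
```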
Next time: when is estimation possible? What if dynamics matrices are unknown?
Recap
- Paper recaps & stability
- System response and inverse problem
- Linear system & least squares
- Kalman filter & extensions