State Estimation
ML in Feedback Sys #8
Prof Sarah Dean
Announcements
- Final project details posted
  - proposal due October 7
- Office hours 10-11am this Thursday
- Reminders
  - Working in pairs/groups, self-assessment
- Paper presentation schedule posted
  - Required: meet with Atul at least 2 days before you are scheduled to present
  - schedule your meeting with Atul a week or more in advance!
Recap: Paper presentations
- HSNL18 Fairness without demographics in repeated loss minimization
- Motivation: fairness via equitable allocation of accuracy across groups
- Population update: group retention \(\lambda_t^{(k)}\) based on risk
- Learner update: (robust) risk minimization \(\theta_t\) based on mixture \(\sum_k \frac{\lambda^{(k)}_t}{\sum_\ell \lambda^{(\ell)}_t}\mathcal D_k\)
- Linearization argument shows ERM is unstable; induction argument shows that DRO controls worst-case risk
[Diagram: population state \(\lambda_t\) evolving under dynamics \(F\)]
Recap: Paper presentations
- PZMDH20 Performative prediction
- Motivation: general model of reactive distribution shift
- Population update: smooth dependence \(\mathcal D(\theta)\)
- Learner update: risk minimization \(\theta_{t+1}\) based on \(\mathcal D(\theta_t)\)
- Induction argument shows convergence to \(\theta_{PS}\), proximity to \(\theta_{PO}\)
[Diagram: model parameters \(\theta_t\) evolving under dynamics \(F\)]
Recap: Stability via Lyapunov
- Verify stability by finding a Lyapunov function (positive definite and decreasing)
- Stable linear dynamics have quadratic Lyapunov functions
- For example, \(V(s)= s^\top \sum_{t=0}^\infty (A^\top)^t A^t s\)
- For nonlinear dynamics, stability of the Jacobian determines (local) stability
- Stochastic dynamics: convergence to sets around stable equilibria
Reference: Bof, Carli, Schenato, "Lyapunov Theory for Discrete Time Systems"
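As a sanity check on the quadratic construction above, here is a minimal numpy/scipy sketch (our own, not from the slides): it computes \(P=\sum_{t=0}^\infty (A^\top)^t A^t\) by solving the discrete Lyapunov equation \(A^\top P A - P + I = 0\) and checks that \(V(s)=s^\top P s\) decreases along trajectories.

```python
# Sketch: quadratic Lyapunov function for stable linear dynamics s_{t+1} = A s_t.
# P = sum_{t>=0} (A^T)^t A^t solves the discrete Lyapunov equation A^T P A - P + I = 0.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1], [0.0, 0.9]])       # Schur stable: eigenvalues inside unit circle
P = solve_discrete_lyapunov(A.T, np.eye(2))  # solves A^T P A - P + I = 0

V = lambda s: s @ P @ s
s = np.array([1.0, -2.0])
for _ in range(5):
    s_next = A @ s
    assert V(s_next) < V(s)                  # V decreases along every trajectory
    s = s_next
```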
Example: stochastic gradient descent
Consider the dynamics of stochastic gradient descent on a twice differentiable function \(g:\mathbb R^d\to\mathbb R\)
$$\theta_{t+1} = \theta_t - \alpha g_t = \theta_t - \alpha \nabla g(\theta_t) + \underbrace{\alpha (\nabla g(\theta_t) - g_t) }_{w_t}$$
- may converge to a ball around minima (see the sketch below)
- in general, could jump between minima
- in practice, decreasing step size \(\alpha_t\propto 1/t\)
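The sketch below (a toy simulation, not from the slides) illustrates these behaviors on the quadratic \(g(\theta)=\tfrac12\|\theta\|^2\): a constant step size leaves the iterates hovering in a noise ball around the minimum, while \(\alpha_t\propto 1/t\) drives them toward it.

```python
# Sketch: SGD on g(theta) = 0.5*||theta||^2 with additive gradient noise w_t.
# Constant step size -> iterates hover in a ball around the minimum;
# decreasing step size alpha_t ~ 1/t -> iterates converge.
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(step_schedule, T=5000, d=2, noise=0.5):
    theta = np.ones(d)
    for t in range(1, T + 1):
        grad = theta + noise * rng.standard_normal(d)  # noisy gradient g_t
        theta = theta - step_schedule(t) * grad
    return np.linalg.norm(theta)

print("constant alpha: |theta_T| =", run_sgd(lambda t: 0.1))
print("alpha_t = 1/t:  |theta_T| =", run_sgd(lambda t: 1.0 / t))
```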

Dynamical System
[Diagram: system \(F\) with internal state \(s\), input \(w_t\), output \(y_t\)]
\(s_{t+1} = F(s_t, w_t)\)
\(y_t = G(s_t)\)
- input signal \(w_t\): external phenomena
- output signal \(y_t\): measurements

Dynamical System
[Diagram: system \(F\) with internal state \(s\), input \(w_t\), output \(y_t\), measurement noise \(v_t\)]
\(s_{t+1} = F(s_t, w_t)\)
\(y_t = G(s_t)+v_t\)
- input signal \(w_t\): external phenomena
- output signal \(y_t\): noisy measurements

System Response
[Diagram: system response map \(\Phi_{F,G}\) taking initial state \(s_0\) and inputs \(w_t\) to outputs \(y_t\), plus measurement noise \(v_t\)]
$$(y_0, y_1,...) = \Phi_{F,G}[(s_0, w_0, w_1,...)]+(v_0,v_1,...)$$
If \(F,G\) are known (e.g. from physics), then we can compute \(y_{t+1}\) given
- \(s_0\), \( w_{0:t}\), and \( v_{0:t+1}\)
- (or given \(s_t\), \(w_t\), and \(v_{t+1}\))
But we only observe \(y_{0:t}\)!
Related Goals:
- Predict future outputs $$ y_{0:t} \to \hat y_{t+1}$$
- Estimate underlying state $$ y_{0:t} \to \hat s_t $$
Challenges:
- Unknown initial state \(s_0\)
- Unpredictable process due to \(w_t\)
- Noisy measurements due to \(v_t\)
Example
The state \(s=[\theta,\omega]\), output \(y=\theta\), and
$$\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$
- What can we predict about \(y_t\) and \(s_t\) as \(t\to\infty\)?
- Suppose \(y_0 = 1\). Can we predict \(y_1\) or estimate \(s_0\)?
- What if we also know that \(y_1 = 1\)?
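A quick numeric check of these questions (a sketch; this model is noiseless, so the answers are exact): stability of the dynamics gives \(s_t\to 0\); \(y_0=1\) determines \(\theta_0\) but not \(\omega_0\); adding \(y_1=1\) pins down \(\omega_0\).

```python
# Sketch for the example: s = [theta, omega], y = theta, no noise.
import numpy as np

A = np.array([[0.9, 0.1],
              [0.0, 0.9]])

# 1) As t -> infty, s_t -> 0 (eigenvalues of A are 0.9, 0.9), hence y_t -> 0.
s = np.array([1.0, 1.0])
for _ in range(100):
    s = A @ s
print("s_100 =", s)  # close to zero

# 2) y_0 = 1 gives theta_0 = 1 but leaves omega_0 free, so
#    y_1 = 0.9 + 0.1 * omega_0 is not determined by y_0 alone.
# 3) Knowing y_1 = 1 as well pins it down: 0.9 + 0.1 * omega_0 = 1.
omega0 = (1.0 - 0.9) / 0.1
print("omega_0 =", omega0)  # 1.0
```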
Setting: linear state estimation
Linear dynamics and measurement:
\(s_{t+1} = As_t + w_t\)
\(y_t = Cs_t + v_t\)
- State space \(\mathbb R^n\) and output space \(\mathbb R^p\)
- \(A\in\mathbb R^{n\times n}\) and \(C\in\mathbb R^{p\times n}\)
At time \(t\), we have observed \(y_{0:t}\), which we can use to compute \(\hat s_{k\mid t}\): our guess of \(s_k\) given observations up to time \(t\)
Estimation via least squares
At time \(t\), our observations and models tell us that
- \(y_0 = Cs_0 + v_0\)
- \(s_1 = As_0 + w_0\)
- \(y_1 = Cs_1 + v_1\)
\(\quad\vdots\)
- \(s_t = As_{t-1} + w_{t-1}\)
- \(y_t = Cs_t + v_t\)
Estimation via least squares
Rearranging so that the knowns sit on the left and the unknowns on the right:
- \(y_0 = Cs_0 + v_0\)
- \(0 = As_0 - s_1 + w_0\)
- \(y_1 = Cs_1 + v_1\)
\(\quad\vdots\)
- \(0 = As_{t-1} - s_t + w_{t-1}\)
- \(y_t = Cs_t + v_t\)
Estimation via least squares
In matrix form:
$$\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \begin{bmatrix} C\\ A&-I \\ &C\\ & A&-I \\ &&\ddots \\ &&&C \end{bmatrix} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$
Estimation via least squares
The least squares estimator minimizes the squared error
$$\min_{\textcolor{yellow}{s,v,w}} ~~~\sum_{k=0}^{t-1} \textcolor{yellow}{\left(\|w_k\|_2^2 + \|v_k\|_2^2\right)} + \textcolor{yellow}{\|v_t\|_2^2+\|s_0\|_2^2} $$
$$\text{s.t.}~~~\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \underbrace{\begin{bmatrix} C\\ A&-I \\ &C \\ & A&-I \\ &&\ddots \\ &&&C\end{bmatrix}}_{A_C} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$
Estimation via least squares
Solving gives the least squares estimate
$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A_C^\top A_C)^{-1} A_C^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$
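A minimal numpy sketch of this estimator (the name `estimate_states` is ours; we omit the \(\|s_0\|_2^2\) regularization so that `lstsq` matches the closed form above): it stacks the measurement and dynamics equations into \(A_C\) and solves the least-squares problem.

```python
# Sketch: batch least-squares state estimation for s_{t+1} = A s_t + w_t, y_t = C s_t + v_t.
import numpy as np

def estimate_states(A, C, ys):
    """ys: outputs y_0, ..., y_t; returns the stack [s_hat_{0|t}, ..., s_hat_{t|t}]."""
    n, p, T = A.shape[0], C.shape[0], len(ys)
    rows, rhs = [], []
    for k in range(T):
        meas = np.zeros((p, n * T))
        meas[:, k * n:(k + 1) * n] = C          # y_k = C s_k + v_k
        rows.append(meas)
        rhs.append(np.atleast_1d(ys[k]))
        if k < T - 1:
            dyn = np.zeros((n, n * T))
            dyn[:, k * n:(k + 1) * n] = A       # 0 = A s_k - s_{k+1} + w_k
            dyn[:, (k + 1) * n:(k + 2) * n] = -np.eye(n)
            rows.append(dyn)
            rhs.append(np.zeros(n))
    A_C, b = np.vstack(rows), np.concatenate(rhs)
    s_hat, *_ = np.linalg.lstsq(A_C, b, rcond=None)
    return s_hat.reshape(T, n)

# Scalar heart-rate model from the next slides (A = C = 1):
print(estimate_states(np.eye(1), np.eye(1), [1.0, 2.0, 3.0]))  # [[1.5], [2.0], [2.5]]
```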
Example: effect of modelling drift
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
1. Classic least squares regression
- assume fixed \(s_t = s\)
- least squares solution to measurement model equations $$\begin{bmatrix} y_0\\\vdots\\y_t\end{bmatrix} = \begin{bmatrix} s\\\vdots\\s\end{bmatrix} + \begin{bmatrix} v_0\\\vdots\\v_t\end{bmatrix}$$
- what is \(\hat s_{\mid 0}\), \(\hat s_{\mid 1}\), and \(\hat s_{\mid 2}\)?
- predict average $$\hat s_{\mid t} = \frac{1}{t+1}\sum_{k=0}^t y_k$$
2. Least squares filtering
- allow for drift \(s_{t+1} = s_{t} + w_t\)
- least squares solution to measurement model and drift model equations
- what is \(\hat s_{k\mid 0}\), \(\hat s_{k\mid 1}\), and \(\hat s_{k\mid 2}\)?
At \(t=0\): \(y_0 = s_0 + v_0\) gives \(\hat s_{0\mid 0} = y_0\)
At \(t=1\):
$$ \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\end{bmatrix}$$
$$ \begin{bmatrix}\hat s_{0\mid 1}\\ \hat s_{1\mid 1}\end{bmatrix} = \begin{bmatrix} 2 & -1 \\ -1 & 2\end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & 0\\ 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} $$
\(\hat s_{0\mid 1} = \frac{2y_0+y_1}{3}\), \(\hat s_{1\mid 1} = \frac{y_0+2y_1}{3}\)
At \(t=2\):
$$ \begin{bmatrix} y_0\\ 0 \\ y_1\\ 0 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1\\ & 1 & -1 \\ && 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\\ s_2\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\\w_1\\v_2\end{bmatrix}$$
\(\hat s_{0\mid 2} = \frac{5y_0+2y_1+y_2}{8}\), \(\hat s_{1\mid 2} = \frac{2y_0+4y_1+2y_2}{8}\), \(\hat s_{2\mid 2} = \frac{y_0+2y_1+5y_2}{8}\)
Example: effect of modelling drift
Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)
1. Classic least squares regression
- assume fixed \(s_t = s\)
- least squares solution to measurement model equation
- predict average $$\hat s_{\mid t} = \frac{1}{t+1}\sum_{k=0}^t y_k$$
2. Least squares filtering
- allow for drift \(s_{t+1} = s_{t} + w_t\)
- least squares solution to measurement model and drift model equations
- predict average weighted by proximity in time
- \(\hat s_{k\mid t}\) more heavily weights \(y_k\)
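The filtering weights derived above can be checked numerically; the sketch below (ours) reads off the weight each \(y_k\) receives in \(\hat s_{0\mid 2}, \hat s_{1\mid 2}, \hat s_{2\mid 2}\) by solving the least-squares problem with unit vectors on the right-hand side.

```python
# Verify the t = 2 filtering weights from the heart-rate example.
import numpy as np

A_C = np.array([[1.,  0.,  0.],   # y_0 = s_0 + v_0
                [1., -1.,  0.],   # 0 = s_0 - s_1 + w_0
                [0.,  1.,  0.],   # y_1 = s_1 + v_1
                [0.,  1., -1.],   # 0 = s_1 - s_2 + w_1
                [0.,  0.,  1.]])  # y_2 = s_2 + v_2

# Setting the stacked right-hand side to a unit vector at y_k isolates
# the weight y_k contributes to each estimate s_hat_{0|2}, s_hat_{1|2}, s_hat_{2|2}.
for k in range(3):
    b = np.zeros(5)
    b[[0, 2, 4][k]] = 1.0
    w, *_ = np.linalg.lstsq(A_C, b, rcond=None)
    print(f"8 * weights of y_{k}:", np.round(8 * w, 3))
# Prints (5, 2, 1), (2, 4, 2), (1, 2, 5), matching the fractions above.
```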
Online estimation
$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A_C^\top A_C)^{-1} A_C^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$
$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}C^\top C + A^\top A & -A^\top \\ -A & C^\top C+A^\top A + I & -A^\top \\ & -A & \ddots & \\ &&& C^\top C+I\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \end{bmatrix} $$
Block tri-diagonal matrix inverse
$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}D_1 & -A^\top \\ -A &D_2 & -A^\top \\ & -A & \ddots & \\ &&&D_3\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \end{bmatrix} $$
Online estimation
$$\textcolor{yellow}{\begin{bmatrix}\hat s_{0\mid t+1}\\ \hat s_{1\mid t+1} \\ \vdots \\ \hat s_{t\mid t+1} \\ \hat s_{t+1\mid t+1} \end{bmatrix}} =\begin{bmatrix}D_1 & -A^\top \\ -A & D_2 & -A^\top \\ & -A & \ddots & \\ &&& D_3+\textcolor{yellow}{A^\top A} & \textcolor{yellow}{-A^\top}\\ &&& \textcolor{yellow}{-A} &\textcolor{yellow}{ C^\top C + I}\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots \\ C^\top y_t \\ \textcolor{yellow}{C^\top y_{t+1}}\end{bmatrix} $$
Block tri-diagonal matrix inverse
Possible to write \(\hat s_{t+1\mid t+1}\) as a linear combination of \(\hat s_{t\mid t}\) and \(y_{t+1}\)
Kalman filter algorithm
- Initialize \(P_{0\mid0} =(C^\top C)^{\dagger}\) and \(\hat s_{0\mid 0} = P_{0\mid 0} C^\top y_0\)
- For \(t= 1, 2, \ldots\):
  - Extrapolate
    - State \(\hat s_{t\mid t-1} =A\hat s_{t-1\mid t-1} \)
    - Info matrix \(P_{t\mid t-1} = AP_{t-1\mid t-1} A^\top + I\)
  - Compute gain $$L_{t} = P_{t\mid t-1}C^\top ( CP_{t\mid t-1} C^\top+I)^{-1}$$
  - Update
    - State \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C\hat s_{t\mid t-1})\)
    - Info matrix \(P_{t\mid t} = (I - L_{t}C)P_{t\mid t-1}\)
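A direct numpy transcription of the algorithm (a sketch, with the identity noise covariances used above; `pinv` implements the pseudo-inverse \((C^\top C)^\dagger\)):

```python
# Sketch of the Kalman filter exactly as stated above (identity covariances).
import numpy as np

def kalman_filter(A, C, ys):
    """ys: outputs y_0, ..., y_T; returns [s_hat_{0|0}, ..., s_hat_{T|T}]."""
    n = A.shape[0]
    P = np.linalg.pinv(C.T @ C)             # P_{0|0} = (C^T C)^dagger
    s = P @ C.T @ np.atleast_1d(ys[0])      # s_hat_{0|0}
    estimates = [s]
    for y in ys[1:]:
        # Extrapolate
        s_pred = A @ s
        P_pred = A @ P @ A.T + np.eye(n)
        # Compute gain
        L = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + np.eye(C.shape[0]))
        # Update
        s = s_pred + L @ (np.atleast_1d(y) - C @ s_pred)
        P = (np.eye(n) - L @ C) @ P_pred
        estimates.append(s)
    return estimates

# e.g. the example system from earlier:
A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
print(kalman_filter(A, C, [1.0, 1.0])[-1])  # s_hat_{1|1}
```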
Kalman filter
[Diagram: measurements \(y_t\) from the system \(s_{t+1} = As_t + w_t\), \(y_t = Cs_t + v_t\) flow into the filter (Extrapolate, Compute Gain, Update), which outputs \(\hat s_t\)]
Combining the extrapolate and update steps:
$$\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)$$
Kalman filter
[Diagram: the system (\(A\), \(C\), state \(s\), noise \(w_t, v_t\)) feeds measurements \(y_t\) to the estimator, itself a linear system with state \(\hat s\), dynamics matrix \(A-L_tCA\), and input gain \(L_t\)]
$$\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t) = (A-L_tCA)\hat s_t + L_t y_{t+1}$$
Extensions
- Time-varying dynamics/measurements \(A_t, C_t\)
  - replace \(A, C\) with \(A_t, C_t\) in the filter updates
- Weighted least-squares models characteristics of the noise
$$\min_{{s,v,w}} ~~~\sum_{k=0}^{t-1} \|\Sigma_{w,k}^{-1/2}w_k\|_2^2 + \sum_{k=0}^{t}\|\Sigma_{v,k}^{-1/2}v_k\|_2^2+\|\Sigma_{s}^{-1/2}s_0\|_2^2 $$
$$\text{s.t.}~~~\bar{y}_{0:t} = A_C s_{0:t}+ \bar w_{0:t} + \bar v_{0:t}$$
- KF is the Minimum Variance Linear Unbiased Estimator
  - if \(s_0, w_k, v_k\) stochastic and independent \(\forall k\)
  - and \(\mathbb E[s_0] = \mathbb E[w_k] = \mathbb E[v_k] = 0\)
  - and covariances \(\Sigma_s,\Sigma_{w,k},\Sigma_{v,k}\)
- KF is the Minimum Variance Unbiased Estimator for Gaussian noise
  - if \(s_0\sim N(0,\Sigma_s)\), \(w_k\sim N(0,\Sigma_{w,k})\), \(v_k\sim N(0,\Sigma_{v,k})\)
  - in other words, KF exactly computes the conditional expectation \(\hat s_{t\mid t} = \mathbb E[s_t\mid y_{0:t}]\)
  - furthermore, \(P_{t\mid t}\) is the error covariance
The general algorithm:
- Initialize \(P_{0\mid0} \) and \(\hat s_{0\mid 0}\)
- For \(t= 1, 2, \ldots\):
  - Extrapolate
    - State \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1} \)
    - Info matrix \(P_{t\mid t-1} = A_t P_{t-1\mid t-1} A_t^\top + \Sigma_{w,t}\)
  - Compute gain \(L_{t} = P_{t\mid t-1}C_t^\top ( C_t P_{t\mid t-1} C_t^\top+\Sigma_{v,t})^{-1}\)
  - Update
    - State \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
    - Info matrix \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)
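For concreteness, a sketch of one cycle of this general filter (the helper `kf_step` is our own; with \(\Sigma_{w}=\Sigma_{v}=I\) it reduces to the earlier algorithm):

```python
# Sketch: one step of the general (weighted, time-varying) Kalman filter update.
import numpy as np

def kf_step(s, P, y, A_t, C_t, Sigma_w, Sigma_v):
    """One extrapolate/gain/update cycle with noise covariances Sigma_w, Sigma_v."""
    s_pred = A_t @ s                        # state extrapolation
    P_pred = A_t @ P @ A_t.T + Sigma_w      # extrapolation of P
    L = P_pred @ C_t.T @ np.linalg.inv(C_t @ P_pred @ C_t.T + Sigma_v)
    s_new = s_pred + L @ (y - C_t @ s_pred) # measurement update
    P_new = (np.eye(len(s)) - L @ C_t) @ P_pred
    return s_new, P_new

# Under the Gaussian assumptions above, s_new is E[s_t | y_{0:t}]
# and P_new is the error covariance of that estimate.
```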
Next time: when is estimation possible? What if dynamics matrices are unknown?
Recap
- Paper recaps & stability
- System response and inverse problem
- Linear system & least squares
- Kalman filter & extensions