State Estimation

ML in Feedback Systems #8

Prof Sarah Dean

Announcements

  • Final project details posted
    • proposal due October 7
  • Office hours 10-11am this Thursday
  • Reminders
    • Working in pairs/groups, self-assessment
    • Paper presentation schedule posted
      • Required: meet with Atul at least 2 days before you are scheduled to present
        • schedule meeting with Atul a week or more in advance!

Recap: Paper presentations

  • HSNL18 Fairness without demographics in repeated loss minimization
    • Motivation: fairness via equitable allocation of accuracy across groups
    • Population update: group retention \(\lambda_t^{(k)}\) based on risk
    • Learner update: (robust) risk minimization \(\theta_t\) based on mixture \(\sum_k \frac{\lambda^{(k)}_t}{\sum_\ell \lambda^{(\ell)}_t}\mathcal D_k\)
    • Linearization argument shows ERM is unstable; induction argument shows that DRO controls worst-case risk

[Diagram: closed loop \(F\) acting on group proportions \(\lambda_t\)]

Recap: Paper presentations

  • PZMDH20 Performative prediction
    • Motivation: general model of reactive distribution shift
    • Population update: smooth dependence \(\mathcal D(\theta)\)
    • Learner update: risk minimization \(\theta_{t+1}\) based on \(\mathcal D(\theta_t)\)
    • Induction argument shows convergence to \(\theta_{PS}\), proximity to \(\theta_{PO}\)

[Diagram: closed loop \(F\) acting on model parameters \(\theta_t\)]

Recap: Stability via Lyapunov

  • Verify stability by finding a Lyapunov function (positive definite and decreasing)
  • Stable linear dynamics have quadratic Lyapunov functions
    • For example, \(V(s)= s^\top \sum_{t=0}^\infty (A^\top)^t A^t s\)
  • For nonlinear dynamics, stability of the Jacobian determines (local) stability
  • Stochastic dynamics: convergence to sets around stable equilibria

Reference: Bof, Carli, Schenato, "Lyapunov Theory for Discrete Time Systems"
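As a numerical check, here is a minimal sketch (assuming numpy and scipy; the matrix \(A\) is an illustrative stable example) that constructs the quadratic Lyapunov function above and verifies that it decreases:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1],
              [0.0, 0.9]])  # spectral radius 0.9 < 1: stable dynamics

# P = sum_t (A^T)^t A^t solves the discrete Lyapunov equation A^T P A - P = -I
P = solve_discrete_lyapunov(A.T, np.eye(2))

s = np.array([1.0, -2.0])
V_now = s @ P @ s
V_next = (A @ s) @ P @ (A @ s)
print(V_next - V_now)  # -||s||^2 = -5.0 < 0, so V decreases along trajectories
```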

Example: stochastic gradient descent

Consider the dynamics of stochastic gradient descent on a twice differentiable function \(g:\mathbb R^d\to\mathbb R\), where \(g_t\) is a stochastic estimate of \(\nabla g(\theta_t)\):

\(\theta_{t+1} = \theta_t - \alpha g_t = \theta_t -  \alpha \nabla g(\theta_t) + \underbrace{\alpha (\nabla g(\theta_t) - g_t) }_{w_t} \)

  • may converge to a ball around minima
  • in general, could jump between minima
  • in practice, use a decreasing step size \(\alpha_t\propto 1/t\)
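A minimal simulation sketch (assuming numpy; the quadratic objective \(g(\theta)=\tfrac12\|\theta\|^2\) is an illustrative choice) showing convergence to a noise ball around the minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
theta = np.array([5.0])

for t in range(1, 2001):
    grad_est = theta + rng.normal(size=1)   # g_t = nabla g(theta_t) + noise
    theta = theta - alpha * grad_est        # fixed step: hovers in a noise ball
    # a decreasing step alpha/t would instead drive theta to the minimum at 0

print(theta)  # close to 0, but within a noise floor of order sqrt(alpha)
```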

Dynamical System

[Block diagram: input \(w_t\) enters the system \(F\) with internal state \(s\), producing output \(y_t\)]

\(s_{t+1} = F(s_t, w_t)\)

\(y_t = G(s_t)\)

input signal \(w_t\): external phenomena

output signal \(y_t\): measurements

Dynamical System

[Block diagram: as above, with noise \(v_t\) added to the output]

\(s_{t+1} = F(s_t, w_t)\)

\(y_t = G(s_t)+v_t\)

input signal \(w_t\): external phenomena

output signal \(y_t\): noisy measurements

System Response

[Block diagram: initial state \(s_0\) and inputs \(w_t\) pass through the response operator \(\Phi_{F,G}\); noise \(v_t\) is added to produce \(y_t\)]

\((y_0, y_1,...) = \Phi_{F,G}[(s_0, w_0, w_1,...)]+(v_0,v_1,...)\)

If \(F,G\) are known (e.g. from physics), then we can compute \(y_{t+1}\) given

  • \(s_0\), \( w_{0:t}\), and \( v_{0:t+1}\)
  • (or given \(s_t\), \(w_t\), and \(v_{t+1}\))

But we only observe \(y_{0:t}\)!

Related Goals:

  • Predict future outputs $$ y_{0:t} \to \hat y_{t+1}$$
  • Estimate underlying state $$ y_{0:t} \to \hat s_t $$

Challenges:

  1. Unknown initial state \(s_0\)
  2. Unpredictable process due to \(w_t\)
  3. Noisy measurements due to \(v_t\)

Example

The state \(s=[\theta,\omega]\), output \(y=\theta\), and

$$\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$

  • What can we predict about \(y_t\) and \(s_t\) as \(t\to\infty\)?
  • Suppose \(y_0 = 1\). Can we predict \(y_1\) or estimate \(s_0\)?
  • What if we also know that \(y_1 = 1\)?
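A minimal simulation sketch (assuming numpy) for checking these questions numerically:

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.0, 0.9]])
C = np.array([[1.0, 0.0]])

def outputs(s0, T):
    """Outputs y_0, ..., y_{T-1} of the noiseless dynamics from state s0."""
    s, ys = np.array(s0, dtype=float), []
    for _ in range(T):
        ys.append((C @ s).item())
        s = A @ s
    return ys

print(np.linalg.eigvals(A))    # both 0.9 < 1, so y_t and s_t -> 0 as t -> infty
print(outputs([1.0, 0.0], 2))  # y_0 = 1 but y_1 = 0.9 ...
print(outputs([1.0, 5.0], 2))  # ... or y_1 = 1.4: omega_0 is not identified
# knowing y_1 = 1 as well forces 0.9 + 0.1*omega_0 = 1, i.e. omega_0 = 1:
print(outputs([1.0, 1.0], 2))  # [1.0, 1.0]
```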

Setting: linear state estimation

Linear dynamics and measurement:

\(s_{t+1} = As_t + w_t\)

\(y_t = Cs_t + v_t\)

  • State space \(\mathbb R^n\) and output space \(\mathbb R^p\)
  • \(A\in\mathbb R^{n\times n}\) and \(C\in\mathbb R^{p\times n}\)

At time \(t\), we have observed \(y_{0:t}\), which we can use to form the estimate \(\hat s_{k\mid t}\): our guess of \(s_k\) given observations up to time \(t\).

Estimation via least squares

At time \(t\), our observations and models tell us that

  • \(y_0=Cs_0 + v_0\)
  • \(s_1 = As_0 + w_0\)
  • \(y_1=Cs_1 + v_1\)
  • \(\vdots\)
  • \(s_t = As_{t-1} + w_{t-1}\)
  • \(y_t=Cs_t + v_t\)

Rewriting the dynamics equations so that the known quantities appear on the left:

  • \(y_0=Cs_0 + v_0\)
  • \(0= As_0-s_1 + w_0\)
  • \(y_1=Cs_1 + v_1\)
  • \(\vdots\)
  • \(0= As_{t-1}-s_t + w_{t-1}\)
  • \(y_t=Cs_t + v_t\)

Stacking all equations in matrix form, with unknowns in yellow:

$$\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \begin{bmatrix} C\\ A&-I \\ &C\\ & A&-I \\ &&\ddots \\ &&&C \end{bmatrix} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$

The least squares estimator minimizes the squared error

$$\min_{\textcolor{yellow}{s,v,w}} ~~~\sum_{k=0}^{t-1} \textcolor{yellow}{\left(\|w_k\|_2^2 + \|v_k\|_2^2\right) + \|v_t\|_2^2+\|s_0\|_2^2} $$

$$\text{s.t.}~~~\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix} = \underbrace{\begin{bmatrix} C\\ A&-I \\ &C \\ & A&-I \\ &&\ddots \\ &&&C\end{bmatrix}}_{A_C} \textcolor{yellow}{\begin{bmatrix}s_0\\ s_1 \\ \vdots \\ s_t \end{bmatrix}} + \textcolor{yellow}{\begin{bmatrix} v_0\\ w_0\\ v_1 \\ \vdots \\ w_{t-1} \\ v_t\end{bmatrix}}$$

Solving the normal equations yields

$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A_C^\top A_C)^{-1} A_C^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$
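Here is a minimal sketch (assuming numpy; `batch_estimate` is an illustrative helper, not from the notes) that builds \(A_C\) from the measurement and dynamics equations and solves the least squares problem:

```python
import numpy as np

def batch_estimate(A, C, ys):
    """Stack measurement and dynamics equations into A_C and solve for s_{0:t}."""
    n, p, T = A.shape[0], C.shape[0], len(ys)
    rows, rhs = [], []
    for k in range(T):
        # measurement equation: y_k = C s_k + v_k
        row = np.zeros((p, n * T))
        row[:, k*n:(k+1)*n] = C
        rows.append(row)
        rhs.append(np.atleast_1d(ys[k]))
        if k < T - 1:
            # dynamics equation: 0 = A s_k - s_{k+1} + w_k
            row = np.zeros((n, n * T))
            row[:, k*n:(k+1)*n] = A
            row[:, (k+1)*n:(k+2)*n] = -np.eye(n)
            rows.append(row)
            rhs.append(np.zeros(n))
    A_C, y_bar = np.vstack(rows), np.concatenate(rhs)
    s_hat, *_ = np.linalg.lstsq(A_C, y_bar, rcond=None)
    return s_hat.reshape(T, n)  # row k is s_hat_{k|t}
```

For the heart rate example below (\(A = C = 1\)), `batch_estimate(np.eye(1), np.eye(1), [y0, y1])` reproduces \(\hat s_{0\mid 1} = (2y_0+y_1)/3\) and \(\hat s_{1\mid 1} = (y_0+2y_1)/3\).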

Example: effect of modelling drift

Suppose we take noisy measurements of heart rate \(y_t = s_t + v_t\)

1. Classic least squares regression

  • assume fixed \(s_t = s\)
  • least squares solution to the measurement model equations $$\begin{bmatrix} y_0\\\vdots\\y_t\end{bmatrix} =  \begin{bmatrix} s\\\vdots\\s\end{bmatrix} + \begin{bmatrix} v_0\\\vdots\\v_t\end{bmatrix}$$
  • what is \(\hat s_{\mid 0}\), \(\hat s_{\mid 1}\), and \(\hat s_{\mid 2}\)?
  • predict the average $$\hat s_{\mid t} = \frac{1}{t+1}\sum_{k=0}^t y_k$$

2. Least squares filtering

  • allow for drift \(s_{t+1} = s_{t} + w_t\)
  • least squares solution to the measurement model and drift model equations
  • what is \(\hat s_{k\mid 0}\), \(\hat s_{k\mid 1}\), and \(\hat s_{k\mid 2}\)?

For \(t=0\), the only equation is $$ y_0 = s_0 + v_0$$ so \(\hat s_{0\mid 0} = y_0\).

For \(t=1\), the model equations are

$$ \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\end{bmatrix}$$

with least squares solution

$$ \begin{bmatrix}\hat s_{0\mid 1}\\ \hat s_{1\mid 1}\end{bmatrix} = \begin{bmatrix} 2 & -1 \\  -1 & 2\end{bmatrix}^{-1} \begin{bmatrix} 1 & 1\\ & -1 & 1 \end{bmatrix} \begin{bmatrix} y_0\\ 0 \\ y_1\end{bmatrix}  $$

\(\hat s_{0\mid 1} = \frac{2y_0+y_1}{3}\)

\(\hat s_{1\mid 1} = \frac{y_0+2y_1}{3}\)

For \(t=2\), the model equations are

$$ \begin{bmatrix} y_0\\ 0 \\ y_1\\ 0 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 & -1 \\ & 1\\ & 1 & -1 \\ && 1 \end{bmatrix} \begin{bmatrix}s_0\\s_1\\ s_2\end{bmatrix} + \begin{bmatrix} v_0 \\ w_0 \\ v_1\\w_1\\v_2\end{bmatrix}$$

with least squares solution

\(\hat s_{0\mid 2} = \frac{5y_0+2y_1+y_2}{8}\)

\(\hat s_{1\mid 2} = \frac{2y_0+4y_1+2y_2}{8}\)

\(\hat s_{2\mid 2} = \frac{y_0+2y_1+5y_2}{8}\)
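(Under the same assumptions as the sketch above, `batch_estimate(np.eye(1), np.eye(1), [y0, y1, y2])` reproduces these \(\tfrac{5}{8}, \tfrac{2}{8}, \tfrac{1}{8}\) weightings numerically.)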

To summarize the effect of modelling drift:

1. Classic least squares regression
  • assume fixed \(s_t = s\): least squares predicts the unweighted average of \(y_{0:t}\)

2. Least squares filtering
  • allow for drift \(s_{t+1} = s_{t} + w_t\): least squares predicts an average weighted by proximity in time
    • \(\hat s_{k\mid t}\) more heavily weights \(y_k\)

Online estimation

$$ \begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} = (A_C^\top A_C)^{-1} A_C^\top \begin{bmatrix} y_0 \\ 0 \\ y_1 \\ \vdots \\ 0 \\ y_t \end{bmatrix}$$

$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}C^\top C + A^\top A & -A^\top \\ -A & C^\top C+A^\top A + I & -A^\top \\ & -A & \ddots &  \\ &&& C^\top C+I\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots  \\ C^\top y_t \end{bmatrix} $$

Block tri-diagonal matrix inverse

$$\begin{bmatrix}\hat s_{0\mid t}\\ \hat s_{1\mid t} \\ \vdots \\ \hat s_{t\mid t} \end{bmatrix} =\begin{bmatrix}D_1 & -A^\top \\ -A &D_2 & -A^\top \\ & -A & \ddots &  \\ &&&D_3\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots  \\ C^\top y_t \end{bmatrix} $$

Online estimation

$$\textcolor{yellow}{\begin{bmatrix}\hat s_{0\mid t+1}\\ \hat s_{1\mid t+1} \\ \vdots \\ \hat s_{t\mid t+1} \\ \hat s_{t+1\mid t+1} \end{bmatrix}} =\begin{bmatrix}D_1 & -A^\top \\ -A & D_2 & -A^\top \\ & -A & \ddots &  \\ &&& D_3+\textcolor{yellow}{A^\top A} & \textcolor{yellow}{-A^\top}\\ &&& \textcolor{yellow}{-A} &\textcolor{yellow}{ C^\top C + I}\end{bmatrix}^{-1}\begin{bmatrix}C^\top y_0 \\ \vdots  \\ C^\top y_t \\ \textcolor{yellow}{C^\top y_{t+1}}\end{bmatrix}  $$

Block tri-diagonal matrix inverse

Possible to write \(\hat s_{t+1\mid t+1}\) as a linear combination of \(\hat s_{t\mid t}\) and \(y_{t+1}\)

Reference: Notes 23 in ECE 6250 taught by Justin Romberg at Georgia Tech

Kalman filter algorithm

  • Initialize \(P_{0\mid0} =(C^\top C)^{\dagger}\) and \(\hat s_{0\mid 0} = P_{0\mid 0} C^\top y_0\)
  • For \(t= 1, 2...\):
    • Extrapolate
      • State \(\hat s_{t\mid t-1} =A\hat s_{t-1\mid t-1} \)
      • Info matrix \(P_{t\mid t-1} = AP_{t-1\mid t-1} A^\top + I\)
    • Compute gain $$L_{t} = P_{t\mid t-1}C^\top ( CP_{t\mid t-1} C^\top+I)^{-1}$$
    • Update
      • State \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C\hat s_{t\mid t-1})\)
      • Info matrix \(P_{t\mid t} = (I - L_{t}C)P_{t\mid t-1}\)

Reference: Notes 23 in ECE 6250 taught by Justin Romberg at Georgia Tech
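A minimal sketch (assuming numpy; `kalman_filter` is an illustrative name) of the algorithm above, which matches the batch least squares estimates \(\hat s_{t\mid t}\):

```python
import numpy as np

def kalman_filter(A, C, ys):
    """Run the Kalman filter with unit noise covariances on y_0, ..., y_t."""
    n = A.shape[0]
    P = np.linalg.pinv(C.T @ C)                # P_{0|0} = (C^T C)^dagger
    s = P @ C.T @ np.atleast_1d(ys[0])         # s_hat_{0|0}
    for y in ys[1:]:
        s = A @ s                               # extrapolate state
        P = A @ P @ A.T + np.eye(n)             # extrapolate info matrix
        L = P @ C.T @ np.linalg.inv(C @ P @ C.T + np.eye(C.shape[0]))  # gain
        s = s + L @ (np.atleast_1d(y) - C @ s)  # update state
        P = (np.eye(n) - L @ C) @ P             # update info matrix
    return s, P

# heart rate example (A = C = 1): recovers s_hat_{1|1} = (y0 + 2*y1)/3
s, P = kalman_filter(np.eye(1), np.eye(1), [1.0, 4.0])
print(s)  # [3.]
```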

Kalman filter

[Block diagram: the true system has dynamics \(A\) with state \(s\), driven by \(w_t\), and measurement matrix \(C\) with noise \(v_t\); the filter receives \(y_t\) and produces the estimate \(\hat s_t\) via the extrapolate, compute gain, and update steps, with closed-loop dynamics \(A-L_tCA\) and input gain \(L_t\)]

\(s_{t+1} = As_t + w_t\)

\(y_t = Cs_t + v_t\)

\(\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)\)

Extensions

  1. Time-varying dynamics/measurements \(A_t, C_t\): replace \(A, C\) with \(A_t, C_t\) throughout the algorithm
  2. Weighted least squares models the characteristics of the noise:

$$\min_{{s,v,w}} ~~~\sum_{k=0}^{t-1} \left(\|\Sigma_{w,k}^{-1/2}w_k\|_2^2 + \|\Sigma_{v,k}^{-1/2}v_k\|_2^2\right) + \|\Sigma_{v,t}^{-1/2}v_t\|_2^2+\|\Sigma_{s}^{-1/2}s_0\|_2^2 $$

$$\text{s.t.}~~~\bar{y}_{0:t} = A_C s_{0:t}+ \bar w_{0:t} + \bar v_{0:t}$$

With both extensions, the algorithm becomes (a code sketch follows the extensions list below):

  • Initialize \(P_{0\mid0} \) and \(\hat s_{0\mid 0}\)
  • For \(t= 1, 2, \dots\):
    • Extrapolate
      • State \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1} \)
      • Info matrix \(P_{t\mid t-1} = A_t P_{t-1\mid t-1} A_t^\top + \Sigma_{w,t}\)
    • Compute gain $$L_{t} = P_{t\mid t-1}C_t^\top ( C_t P_{t\mid t-1} C_t^\top+\Sigma_{v,t})^{-1}$$
    • Update
      • State \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
      • Info matrix \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)

  3. The KF is the Minimum Variance Linear Unbiased Estimator
    • if \(s_0, w_k, v_k\) are stochastic and independent \(\forall k\)
    • and \(\mathbb E[s_0] = \mathbb E[w_k] = \mathbb E[v_k] = 0\)
    • and the covariances are \(\Sigma_s,\Sigma_{w,k},\Sigma_{v,k}\)
  4. The KF is the Minimum Variance Unbiased Estimator
    • if \(s_0\sim N(0,\Sigma_s)\), \(w_k\sim N(0,\Sigma_{w,k})\), \(v_k\sim N(0,\Sigma_{v,k})\)
    • in other words, the KF exactly computes the conditional expectation \(\hat s_{t\mid t} = \mathbb E[s_t\mid y_{0:t}]\)
    • furthermore, \(P_{t\mid t}\) is the error covariance
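To make the extensions concrete, here is a minimal sketch (assuming numpy; `kalman_step` is an illustrative helper) of one extrapolate / gain / update cycle of the time-varying, covariance-weighted filter above:

```python
import numpy as np

def kalman_step(s, P, y, A_t, C_t, Sigma_w, Sigma_v):
    """One cycle of the time-varying, covariance-weighted Kalman filter."""
    s = A_t @ s                                               # extrapolate state
    P = A_t @ P @ A_t.T + Sigma_w                             # extrapolate info matrix
    L = P @ C_t.T @ np.linalg.inv(C_t @ P @ C_t.T + Sigma_v)  # compute gain
    s = s + L @ (y - C_t @ s)                                 # update state
    P = (np.eye(len(s)) - L @ C_t) @ P                        # update info matrix
    return s, P
```

Setting \(\Sigma_{w}=\Sigma_{v}=I\) recovers the unweighted algorithm from earlier.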


Next time: when is estimation possible? What if dynamics matrices are unknown?

Recap

  • Paper recaps & stability
  • System response and inverse problem
  • Linear system & least squares
  • Kalman filter & extensions
