System Identification

ML in Feedback Sys #9

Prof Sarah Dean

Announcements

  • Office hours 10-11am this Thursday
  • Zoom attendance and participation
  • Guest lecture by Max Simchowitz on Monday, Zoom only
  • Reminders
    • Final project details posted, proposal due October 7
    • Working in pairs/groups, self-assessment
    • Paper presentation schedule posted
      • Meet with Atul at least 2 days before you are scheduled to present, schedule meeting a week or more in advance

Recap: Kalman filter

\(s_{t+1} = As_t + w_t\)

\(y_t = Cs_t + v_t\)

\(\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)\)

[Block diagram: the system (dynamics \(A\), state \(s\), noise \(w_t, v_t\), output map \(C\)) produces measurements \(y_t\); the Kalman filter combines them with gain \(L_t\) to form the estimate \(\hat s\), and the filter's internal dynamics are \(A-L_tCA\).]

Recap: Kalman filter

  1. Least-squares estimate of state
  • Initialize \(P_{0\mid0}\) and \(\hat s_{0\mid 0}\)
  • For \(t= 1, 2, \dots\):
    • Extrapolate
      • State \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1}\)
      • Error covariance
        \(P_{t\mid t-1} = A_tP_{t-1\mid t-1} A_t^\top + \Sigma_{w,t}\)
    • Compute gain
      \(L_{t} = P_{t\mid t-1}C_t^\top ( C_tP_{t\mid t-1} C_t^\top+\Sigma_{v,t})^{-1}\)
    • Update
      • State
        \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
      • Error covariance
        \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)
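A minimal numpy sketch of the recursion above, assuming time-invariant \(A, C\) and constant noise covariances; function and variable names are illustrative.

```python
import numpy as np

# Sketch of the Kalman filter recursion (illustrative names).
# ys holds y_1, ..., y_T; s_hat and P are the initial \hat{s}_{0|0} and P_{0|0}.
def kalman_filter(A, C, Sigma_w, Sigma_v, ys, s_hat, P):
    estimates = []
    for y in ys:
        # Extrapolate
        s_pred = A @ s_hat                          # state prediction
        P_pred = A @ P @ A.T + Sigma_w              # covariance prediction
        # Compute gain
        L = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + Sigma_v)
        # Update
        s_hat = s_pred + L @ (y - C @ s_pred)
        P = (np.eye(len(s_hat)) - L @ C) @ P_pred
        estimates.append(s_hat)
    return np.array(estimates)
```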

$$ \min_{\textcolor{yellow}{\theta}} \sum_{k=0}^t (y_k - \textcolor{yellow}{\theta^\top} x_k )^2 $$

Recap: Least squares regression

\(y_0 = \textcolor{yellow}{\theta^\top} x_0 + \textcolor{yellow}{v_0}\)

\(\vdots\)

\(y_t = \textcolor{yellow}{\theta^\top} x_t + \textcolor{yellow}{v_t}\)

$$ \min_{\textcolor{yellow}{\theta, v_k}} \sum_{k=0}^t \textcolor{yellow}{v_k}^2 $$

Recap: Least squares regression

s.t.

\(y_0 = \textcolor{yellow}{\theta^\top} x_0 + \textcolor{yellow}{v_0}\)

\(\vdots\)

\(y_t = \textcolor{yellow}{\theta^\top} x_t + \textcolor{yellow}{v_t}\)

$$ \min_{\textcolor{yellow}{s_k, v_k, w_k}} \sum_{k=0}^t \textcolor{yellow}{\|v_k\|_2^2+\|w_k\|_2^2 + \|s_0\|_2^2} $$

Recap: Least squares filtering

\(s_1 = A_0\textcolor{yellow}{s_0}+\textcolor{yellow}{w_0}\)

\(\vdots\)

\(s_t = A_{t-1}\textcolor{yellow}{s_{t-1}}+\textcolor{yellow}{w_{t-1}}\)

s.t.

\(y_0 = C_0\textcolor{yellow}{s_0} + \textcolor{yellow}{v_0}\)

\(\vdots\)

\(y_t = C_t\textcolor{yellow}{s_t} + \textcolor{yellow}{v_t}\)

Recap: Least squares filtering

\(s_1 = A_0\textcolor{yellow}{s_0}+\textcolor{yellow}{w_0}\)

\(\vdots\)

\(s_t = A_{t-1}\textcolor{yellow}{s_{t-1}}+\textcolor{yellow}{w_{t-1}}\)

s.t.

$$ \min_{\textcolor{yellow}{s_k, v_k, w_k}} \sum_{k=0}^t \textcolor{yellow}{\|v_k\|^2+\|w_k\|_2^2 + \|s_0\|_2^2} $$

Recap: Least squares filtering

$$ \min_{\textcolor{yellow}{s_k}} \sum_{k=0}^t \|C_k\textcolor{yellow}{s_k}-y_k\|_2^2+ \|A_k\textcolor{yellow}{s_k-s_{k+1}}\|_2^2+ \|\textcolor{yellow}{s_0}\|_2^2$$

(obtained by eliminating \(v_k = y_k - C_k s_k\) and \(w_k = s_{k+1} - A_k s_k\) using the constraints)
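The same batch problem can be set up directly as one stacked least-squares solve. Below is a minimal numpy sketch, assuming a time-invariant system and the unit weights used in the objective above; the function name and interface are illustrative.

```python
import numpy as np

# Batch least-squares filtering (sketch): given y_0,...,y_t, solve
#   min_{s_0,...,s_t}  sum_k ||C s_k - y_k||^2 + ||A s_k - s_{k+1}||^2 + ||s_0||^2
def batch_ls_filter(A, C, ys):
    t = len(ys) - 1
    n = A.shape[0]
    N = (t + 1) * n                       # stacked variable [s_0; ...; s_t]

    def block(k, M):                      # place M in the columns belonging to s_k
        R = np.zeros((M.shape[0], N))
        R[:, k * n:(k + 1) * n] = M
        return R

    rows, rhs = [block(0, np.eye(n))], [np.zeros(n)]           # ||s_0||^2 term
    for k in range(t + 1):                                      # measurement terms
        rows.append(block(k, C)); rhs.append(np.atleast_1d(ys[k]))
    for k in range(t):                                          # dynamics terms
        rows.append(block(k, A) - block(k + 1, np.eye(n))); rhs.append(np.zeros(n))

    H, b = np.vstack(rows), np.concatenate(rhs)
    s_hat = np.linalg.lstsq(H, b, rcond=None)[0]
    return s_hat.reshape(t + 1, n)        # row k is the smoothed estimate of s_k
```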

Recap: Kalman filter

  1. Least-squares estimate of state
    • Time-varying dynamics/measurements
    • Noise covariance
  2. KF is the minimum-variance linear unbiased estimator (MVLUE) for stochastic, independent noise
  3. KF is the minimum-variance unbiased estimator (MVUE) for Gaussian noise
  • Initialize \(P_{0\mid0}\) and \(\hat s_{0\mid 0}\)
  • For \(t= 1, 2, \dots\):
    • Extrapolate
      • State \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1}\)
      • Error covariance
        \(P_{t\mid t-1} = A_tP_{t-1\mid t-1} A_t^\top + \Sigma_{w,t}\)
    • Compute gain
      \(L_{t} = P_{t\mid t-1}C_t^\top ( C_tP_{t\mid t-1} C_t^\top+\Sigma_{v,t})^{-1}\)
    • Update
      • State
        \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
      • Error covariance
        \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)

When is state estimation possible?

Example

The state \(s=[\theta,\omega]\), output \(y=\theta\), and

$$\theta_{t+1} = 0.9\theta_t,\quad \omega_{t+1} = 0.9 \omega_t$$

Related Goals

  • Predict future outputs \(y_{0:t} \to \hat y_{t+1}\)
  • Estimate underlying state \(y_{0:t} \to \hat s_t \)

Challenges

  1. Unknown initial state \(s_0\)
  2. Unpredictable process due to \(w_t\)
  3. Noisy measurements due to \(v_t\)

A deterministic system is observable if outputs \(y_{0:t}\) uniquely determine the state trajectory \(s_{0:t}\).

Observability

\(s_{t+1} = F(s_t)\)

\(y_t = G(s_t)\)

\((y_0, y_1,...) = \Phi_{F,G}[s_0]\)

A linear system is observable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix}}_{\mathcal O}\Big) = n$$

Observability

\(s_{t+1} = As_t\)

\(y_t = Cs_t\)

\(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\begin{bmatrix} Cs_0\\CAs_0\\\vdots\\CA^ts_0\end{bmatrix}\)

A linear system is observable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix}}_{\mathcal O}\Big) = n$$
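A small numpy check of the rank condition. As an assumed illustration it uses the decoupled example from the "When is state estimation possible?" slide (where \(\omega\) never influences \(y\)) and the coupled variant that appears later in the lecture.

```python
import numpy as np

# Build O = [C; CA; ...; CA^{n-1}] and check its rank.
def observability_matrix(A, C):
    n = A.shape[0]
    blocks, Ak = [], np.eye(n)
    for _ in range(n):
        blocks.append(C @ Ak)
        Ak = A @ Ak
    return np.vstack(blocks)

C = np.array([[1.0, 0.0]])                       # y = theta

A_decoupled = np.array([[0.9, 0.0],
                        [0.0, 0.9]])             # omega never affects y
print(np.linalg.matrix_rank(observability_matrix(A_decoupled, C)))  # 1 < n: not observable

A_coupled = np.array([[0.9, 0.1],
                      [0.0, 0.9]])               # theta_{t+1} depends on omega_t
print(np.linalg.matrix_rank(observability_matrix(A_coupled, C)))    # 2 = n: observable
```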

Proof:

1) \(\mathrm{rank}(\mathcal O) = n \implies\) observable

$$\begin{bmatrix} y_0 \\ \vdots \\ y_{n-1}\end{bmatrix} = \begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix} s_0 $$

  • There is a unique solution \(s_0\) when \(\mathcal O\) has rank \(n\).
  • Thus, \(s_{0:t}\) is uniquely determined.

2) observable \(\implies \mathrm{rank}(\mathcal O) = n\) (equivalently, \(\mathrm{rank}(\mathcal O) < n \implies\) not observable)

\(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\underbrace{\begin{bmatrix} C\\CA\\\vdots\\CA^t\end{bmatrix}}_{\mathcal O_t} s_0\)

  • For \(t\geq n\), \(\mathcal O_t\) has more rows than \(\mathcal O\)
  • Theorem (Cayley–Hamilton): a matrix satisfies its own characteristic polynomial (of degree \(n\)).
  • Therefore, \(A^k\) for \(k\geq n\) is a linear combination of \(I, A,\dots ,A^{n-1}\)
  • Thus, \(\mathrm{rank}(\mathcal O_t) \leq \mathrm{rank}(\mathcal O)<n\), so \(s_0\) is not uniquely determined

Steady-state Kalman Filter

If \((A,C)\) is observable, then as \(t\to\infty\) the Kalman filter covariance and gain converge to \(P, L\) satisfying

\(P = A\left(P-PC^\top (CPC^\top +\Sigma_v)^{-1} CP\right)A^\top + \Sigma_w\)

\(L = PC^\top (\Sigma_v+CPC^\top)^{-1}\)

[Block diagram: measurements \(y_t\) drive the steady-state Kalman filter, which has constant gain \(L\), estimate \(\hat s\), and internal dynamics \(A-LCA\).]

\(\hat s_{t} = A\hat s_{t-1} + L(y_{t} -CA\hat s_{t-1})\)

Discrete Algebraic Riccati Equation

Steady-state Kalman Filter

If \((A,C)\) is observable, then as \(t\to\infty\) the Kalman filter covariance and gain converge to \(P, L\) satisfying

\(P, L = \mathrm{DARE}(A^\top, C^\top ,\Sigma_w, \Sigma_v)\)
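A numeric sketch of computing the steady-state quantities: scipy's discrete algebraic Riccati solver returns \(P\) when called with the transposed arguments above, and \(L\) then follows from the gain formula. The example matrices here are placeholders.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1],
              [0.0, 0.9]])        # placeholder system
C = np.array([[1.0, 0.0]])
Sigma_w = 0.01 * np.eye(2)
Sigma_v = 0.10 * np.eye(1)

# Solves P = A (P - P C^T (C P C^T + Sigma_v)^{-1} C P) A^T + Sigma_w
P = solve_discrete_are(A.T, C.T, Sigma_w, Sigma_v)
L = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma_v)
```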


Steady-state Kalman Filter

[Block diagram: measurements \(y_t\) drive the steady-state filter (gain \(L\), estimate \(\hat s\), dynamics \(A-LCA\)); rewriting in terms of the innovations \(\varepsilon_t\):]

\(\hat s_{t} = A\hat s_{t-1} + L\varepsilon_t\)

\(y_{t} =CA\hat s_{t-1}+\varepsilon_t\)


Interconnection of the world (linear dynamics) with a filter (linear dynamics)

Exercise: show that for some \(C_\varepsilon, D_\varepsilon, A_e, D_e\),
the innovations are given by \(\varepsilon_t = C_\varepsilon e_t+ D_\varepsilon v_t\) where

\(e_{t+1} =A_ee_{t} -w_{t}+D_e v_t\) and \(e_0 = \hat s_0- s_0\)

Steady-state Kalman Filter

[Block diagram: the same interconnection redrawn in terms of the innovations: \(\varepsilon_t\) drives the filter state \(\hat s\) through gain \(L\) and dynamics \(A\), while \(\varepsilon_t\) itself is generated from the estimation error \(e\) by a linear system with dynamics \(A_e\), noise inputs \(w_t, v_t\), and output map \(C_\varepsilon\).]

What if we don't know exactly how the system evolves or how measurements are generated?

System identification

Related Goals

  • Predict future outputs \(y_{0:t} \to \hat y_{t+1}\)
  • Estimate underlying state \(y_{0:t} \to \hat s_t \)
  • Identify dynamics and measurement model $$y_{0:t} \to (\hat A, \hat C)$$

[Block diagram: the system produces outputs \(y_t\) from state \(s\) and noise \(w_t, v_t\), but the dynamics and measurement maps are unknown (marked "?").]

What if we don't know exactly how the system evolves?

Sys id with state observation

Related Goals

  • Predict future states \(s_{0:t} \to \hat s_{t+1}\)
  • Identify dynamics model $$s_{0:t} \to \hat A$$

[Block diagram: the state \(s_t\) is observed directly; the dynamics map ("?") and process noise \(w_t\) drive the state.]

What if we don't know exactly how the system evolves?

Sys id with state observation

  • \(s_1 = \textcolor{yellow}{F}(s_0) + \textcolor{yellow}{w_0}\)
  • \(s_2 = \textcolor{yellow}{F}(s_1) + \textcolor{yellow}{w_1}\)
  • ...
  • \(s_t = \textcolor{yellow}{F}(s_{t-1}) + \textcolor{yellow}{w_{t-1}}\)

What if we don't know exactly how the system evolves?

Sys id with state observation

$$\hat F = \arg\min_{\textcolor{yellow}{F}\in\mathcal F} ~~\sum_{k=0}^{t-1}\|s_{k+1} - \textcolor{yellow}{F}(s_k)\|^2 $$

Use supervised learning with labels \(s_{k+1}\) and features \(s_k\)!

Linear sys id with state observation

$$\hat A = \arg\min_{A} ~~\sum_{k=0}^{t-1}\|s_{k+1} - As_k\|^2 $$

$$= \arg\min_{A} \Big\|\underbrace{\begin{bmatrix} s_1^\top \\ \vdots \\ s_{t}^\top\end{bmatrix}}_{S_+} - \underbrace{\begin{bmatrix} s_0^\top \\ \vdots \\ s_{t-1}^\top\end{bmatrix}}_{S_-}A^\top\Big\|_F^2 $$

$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$
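A minimal numpy sketch of this estimator: simulate a short trajectory from an assumed "true" \(A\) with process noise, stack \(S_-\) and \(S_+\), and solve the least-squares problem (names are illustrative).

```python
import numpy as np

# Fit \hat A = argmin_A sum_k ||s_{k+1} - A s_k||^2 from an observed state trajectory.
def fit_dynamics(states):
    S_minus = states[:-1]                          # rows s_0^T, ..., s_{t-1}^T
    S_plus = states[1:]                            # rows s_1^T, ..., s_t^T
    A_T, *_ = np.linalg.lstsq(S_minus, S_plus, rcond=None)   # solves S_- A^T ~ S_+
    return A_T.T

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.9]])
s, traj = np.array([1.0, 1.0]), []
for _ in range(200):
    traj.append(s)
    s = A_true @ s + 0.01 * rng.standard_normal(2)   # s_{k+1} = A s_k + w_k
A_hat = fit_dynamics(np.array(traj))                  # close to A_true for long trajectories
```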

Linear sys id with state observation

$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$

Example

Suppose there is no noise and we observe \(s_0 = [0, 1]\), \(s_1 = [\frac{1}{2}, 0]\), and \(s_2 = [0, \frac{1}{4}]\). What is \(s_3\)? \(A\)?

What if \(s_0 = [0,1]\), \(s_1 = [0, \frac{1}{2}]\), and \(s_2 = [0, \frac{1}{4}]\)?

Linear sys id with state observation

$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$

  • Unlike standard linear regression, the "features" \((s_0,\dots,s_{t-1})\) are not fixed or drawn independently
    • dependence \(s_{k+1} = As_k+w_k\)
  • Exercise: excess risk of fixed design (Hint: see Lecture 2)
    • Simple query model: suppose \(S_-\) is populated with fixed set of queries \((s_0,\dots s_{t-1})\) and \(S_+\) populated with single step of model \((s_{0,+},\dots,s_{t-1,+})\) where  $$s_{i,+} = As_{i} + w_i$$
    • Show that if \(w_i\) is i.i.d., zero mean, and \(\mathbb E[w_iw_i^\top]=\sigma^2I\), the excess risk of the fixed design query model is \(\frac{\sigma^2n^2}{t}\)

Even if \((A,C)\) is observable, non-unique solution!

Noiseless linear sys id

What if we don't know exactly how the system evolves or how measurements are generated?

  • \(y_0 = \textcolor{yellow}{Cs_0}\)
  • \(s_1 = \textcolor{yellow}{As_0}\)
  • ...
  • \(s_t = \textcolor{yellow}{A}s_{t-1}\)
  • \(y_t = \textcolor{yellow}{C}s_{t} \)

Example

  • \(s_{t+1} = \frac{1}{2}s_t\) from \(s_0=4\) and \(y_t = s_t\)
  • could just as easily be \(s_{t+1} = \frac{1}{2}s_t\) from \(s_0=2\) and \(y_t = 2s_t\)

Even if \((A,C)\) is observable, non-unique solution!

Noiseless linear sys id

  • \(y_0 =\hat C\hat s_0\)
  • \(\hat s_1 = \hat A\hat s_0\)
  • ...
  • \(\hat s_t = \hat A\hat s_{t-1}\)
  • \(y_t = \hat C\hat s_{t} \)

Suppose \(\hat A, \hat C, \hat s_0\) satisfy the equations

  • \(y_0 =\hat C\textcolor{cyan}{MM^{-1}}\hat s_0\)
  • \(\textcolor{cyan}{M^{-1}}\hat s_1 = \textcolor{cyan}{M^{-1}}\hat A\textcolor{cyan}{MM^{-1}}\hat s_0\)
  • ...
  • \(\textcolor{cyan}{M^{-1}}\hat s_t = \textcolor{cyan}{M^{-1}}\hat A\textcolor{cyan}{MM^{-1}}\hat s_{t-1}\)
  • \(y_t = \hat C\textcolor{cyan}{MM^{-1}}\hat s_{t} \)

Then so do \(\tilde A = M^{-1}\hat AM\), \(\tilde C = \hat CM\), \(\tilde s_0 = M^{-1}\hat s_0\), for any invertible \(M\)

Example

The state \(s_t=[\theta_t,\omega_t]\), output \(y_t=\theta_t\), and

$$\theta_{t+1} = 0.9\theta_t+0.1\omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$

What if we define \(\tilde s_0 = [\alpha\theta_0, \beta \omega_0]\) ?

Then equivalent by having \(y_t =\begin{bmatrix}\frac{1}{\alpha} & 0\end{bmatrix} \tilde s_t\) and

$$\tilde \theta_{t+1} = 0.9\tilde \theta_t+0.1\frac{\alpha}{\beta}\tilde \omega_t,\quad \tilde \omega_{t+1} = 0.9 \tilde \omega_t$$

Example

The state \(s_t=[\theta_t,\omega_t]\), output \(y_t=\theta_t\), process noise \(w_t\), and

$$\theta_{t+1} = 0.9\theta_t+0.1\omega_t,\quad \omega_{t+1} = 0.9 \omega_t + w_t$$

What if we define \(\tilde s_0 = [\alpha\theta_0, \beta \omega_0]\) ?

Then equivalent by having \(y_t =\begin{bmatrix}\frac{1}{\alpha} & 0\end{bmatrix} \tilde s_t\) and

$$\tilde \theta_{t+1} = 0.9\tilde \theta_t+0.1\frac{\alpha}{\beta}\tilde \omega_t,\quad \tilde \omega_{t+1} = 0.9 \tilde \omega_t+?$$

Equivalent representations

Set of equivalent state space representations for all invertible and square \(M\)

\(s_{t+1} = As_t + Bw_t\)

\(y_t = Cs_t+v_t\)

\(\tilde s_{t+1} = \tilde A\tilde s_t + \tilde B w_t\)

\(y_t = \tilde C\tilde s_t+v_t\)

\(\tilde s = M^{-1}s\)

\(\tilde A = M^{-1}AM\)

\(\tilde B = M^{-1}B\)

\(\tilde C = CM\)

\( s = M\tilde s\)

\( A = M\tilde AM^{-1}\)

\( B = M\tilde B\)

\(C = \tilde CM^{-1}\)
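A quick numpy check (with placeholder values) that a similarity transform leaves the outputs unchanged: for any invertible \(M\), the transformed system driven by the same noise and started from \(\tilde s_0 = M^{-1}s_0\) produces identical \(y_t\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.eye(2)
C = np.array([[1.0, 0.0]])
M = rng.standard_normal((2, 2))                   # generically invertible
M_inv = np.linalg.inv(M)
A_t, B_t, C_t = M_inv @ A @ M, M_inv @ B, C @ M   # equivalent realization

s = np.array([1.0, -1.0])
s_t = M_inv @ s                                   # \tilde s_0 = M^{-1} s_0
for _ in range(20):
    w = 0.1 * rng.standard_normal(2)
    assert np.allclose(C @ s, C_t @ s_t)          # same output at every step
    s, s_t = A @ s + B @ w, A_t @ s_t + B_t @ w
```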

[Block diagrams: the original system (\(A\), \(B\), \(C\)) with state \(s\) and noise \(w_t, v_t\), and the transformed system (\(\tilde A\), \(\tilde B\), \(\tilde C\)) with state \(\tilde s\), produce the same outputs \(y_t\).]

Goal: identify dynamics \(A\) and measurement model \(C\) up to similarity transform

Linear system identification

$$\mathcal O_t = \begin{bmatrix}C\\CA\\\vdots \\ CA^{t}\end{bmatrix} $$

Recall the observability matrix

$$\tilde{\mathcal O}_t = \begin{bmatrix}\tilde C\\\tilde C\tilde A\\\vdots \\ \tilde C\tilde A^{t}\end{bmatrix} $$

$$=\begin{bmatrix}CM\\CMM^{-1} AM\\\vdots \\  CMM^{-1} A^{t}M\end{bmatrix} $$

$$=\mathcal O_t M $$

Approach: subspace identification of the rank \(n\) column space of \(\mathcal O_t\)

Step 1: Estimate \(\mathcal O_t\)

Linear system identification

$$\begin{bmatrix}y_0\\\vdots \\ y_t\end{bmatrix}  =\mathcal O_t s_0 $$

$$\begin{bmatrix}y_1\\\vdots \\ y_{t+1}\end{bmatrix}  =\mathcal O_t s_{1} $$

\(\dots\)

$$\begin{bmatrix}y_k\\\vdots \\ y_{t+k}\end{bmatrix}  =\mathcal O_t s_{k} $$

Step 1: Estimate \(\mathcal O_t\)

Linear system identification

$$\begin{bmatrix}y_0 & \dots &y_k\\\vdots & \ddots&\vdots\\ y_t&\dots& y_{t+k}\end{bmatrix}  =\mathcal O_t \begin{bmatrix}s_0 & s_1 &\dots & s_k\end{bmatrix}$$

As long as \(s_{0:k}\) form a rank \(n\) matrix, the column space of \(\mathcal O_t\) is the same as the column space of the matrix of outputs (Hankel matrix)

\(\hat{\mathcal O}_t = U_n\)

where \(Y_{t,k}\) is the Hankel matrix of outputs above, with SVD \(Y_{t,k} = \begin{bmatrix}U_n & U_\perp\end{bmatrix} \begin{bmatrix}\Sigma_n & \\ & 0\end{bmatrix}V^\top \)

Step 1: Estimate \(\mathcal O_t\)

Linear system identification

Step 2: Recover \(\hat A\) and \(\hat C\) from \(\hat{\mathcal O}\)

$$\hat{\mathcal O}_t = \begin{bmatrix}\hat{\mathcal O}_t [0]\\\hat{\mathcal O}_t [1]\\\vdots \\ \hat{\mathcal O}_t [t]\end{bmatrix}\approx \begin{bmatrix}C\\CA\\\vdots \\ CA^{t}\end{bmatrix} $$

So we set \(\hat C = \hat {\mathcal O}_t[0]\) and take \(\hat A\) to solve

  • \(\hat {\mathcal O}_t[0] \hat A = \hat {\mathcal O}_t[1] \)
  • ...
  • \(\hat {\mathcal O}_t[t-1] \hat A = \hat {\mathcal O}_t[t] \)

\(\hat{\mathcal O}_t = U_n\) from SVD of Hankel matrix \(\begin{bmatrix}y_0 & \dots &y_k\\\vdots & \ddots&\vdots\\ y_t&\dots& y_{t+k}\end{bmatrix}\)
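A minimal numpy sketch of the two steps for the noiseless case: form the Hankel matrix of outputs, use its SVD to estimate the column space of \(\mathcal O_t\), read off \(\hat C\), and solve the shift equations for \(\hat A\). The recovered \((\hat A, \hat C)\) agree with \((A, C)\) only up to a similarity transform, so the check below compares eigenvalues; the test system and the choices of \(t, k\) are placeholders.

```python
import numpy as np

def subspace_id(ys, n, t, k):
    m = len(ys[0])                              # output dimension
    # Step 1: Hankel matrix, block (i, j) = y_{i+j}; its SVD gives the column space of O_t.
    Y = np.vstack([np.column_stack(ys[i:i + k + 1]) for i in range(t + 1)])
    U, _, _ = np.linalg.svd(Y)
    O_hat = U[:, :n]                            # t+1 block rows of size m
    # Step 2: C_hat is the first block row; A_hat solves O_hat[i] A = O_hat[i+1].
    C_hat = O_hat[:m, :]
    A_hat, *_ = np.linalg.lstsq(O_hat[:-m, :], O_hat[m:, :], rcond=None)
    return A_hat, C_hat

# Noiseless test system (placeholder values).
A = np.array([[0.9, 0.1], [0.0, 0.5]])
C = np.array([[1.0, 0.0]])
s, ys = np.array([1.0, 1.0]), []
for _ in range(30):
    ys.append(C @ s)
    s = A @ s
A_hat, C_hat = subspace_id(ys, n=2, t=3, k=10)
print(np.round(np.linalg.eigvals(A_hat), 4), np.linalg.eigvals(A))   # eigenvalues match
```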

Next time: cutting-edge work on finite-sample guarantees for system identification!

Recap

  • Kalman filter recap
  • Observability
  • Steady-state linear filter
  • System identification
  • Equivalent realizations

References: Callier & Desoer, "Linear System Theory"; Verhaegen & Verdult, "Filtering and System Identification".