System Identification
ML in Feedback Sys #9
Prof Sarah Dean
Announcements
- Office hours 10-11am this Thursday
- Zoom attendance and participation
- Guest lecture by Max Simchowitz on Monday, Zoom only
- Reminders
- Final project details posted, proposal due October 7
- Working in pairs/groups, self-assessment
- Paper presentation schedule posted
- Meet with Atul at least 2 days before you are scheduled to present; schedule the meeting a week or more in advance
Recap: Kalman filter
[Block diagram] The world evolves as \(s_{t+1} = As_t + w_t\) with measurements \(y_t = Cs_t + v_t\) (state \(s\), dynamics \(A\), measurement map \(C\), process noise \(w_t\), measurement noise \(v_t\)). The Kalman filter takes in \(y_t\) and maintains the estimate \(\hat s\) via \(\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)\), i.e., internal dynamics \(A-L_tCA\) with gain \(L_t\).
Recap: Kalman filter
- Least-squares estimate of state
- Initialize \(P_{0\mid0} \) and \(\hat s_{0\mid 0}\)
- For \(t= 1, 2...\):
- Extrapolate
  - State: \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1}\)
  - Info matrix: \(P_{t\mid t-1} = A_tP_{t-1\mid t-1}A_t^\top + \Sigma_{w,t}\)
- Compute gain: \(L_{t} = P_{t\mid t-1}C_t^\top ( C_tP_{t\mid t-1}C_t^\top+\Sigma_{v,t})^{-1}\)
- Update
  - State: \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
  - Info matrix: \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)
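To make the recursion above concrete, here is a minimal NumPy sketch, assuming time-invariant \(A\), \(C\), \(\Sigma_w\), \(\Sigma_v\) for simplicity; the function and variable names are illustrative, not course reference code.

```python
import numpy as np

def kalman_filter(A, C, Sigma_w, Sigma_v, ys, s_hat0, P0):
    """Kalman filter recursion from the slide above, assuming time-invariant
    A, C, Sigma_w, Sigma_v. `ys` holds the measurements y_1, ..., y_T."""
    n = A.shape[0]
    s_hat, P = s_hat0, P0
    filtered = []
    for y in ys:
        # Extrapolate
        s_pred = A @ s_hat                        # s_{t|t-1}
        P_pred = A @ P @ A.T + Sigma_w            # P_{t|t-1}
        # Compute gain
        L = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + Sigma_v)
        # Update
        s_hat = s_pred + L @ (y - C @ s_pred)     # s_{t|t}
        P = (np.eye(n) - L @ C) @ P_pred          # P_{t|t}
        filtered.append(s_hat)
    return np.array(filtered), P, L
```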
$$ \min_{\textcolor{yellow}{\theta}} \sum_{k=0}^t (y_k - \textcolor{yellow}{\theta^\top} x_k )^2 $$
Recap: Least squares regression
\(y_0 = \textcolor{yellow}{\theta^\top} x_0 + \textcolor{yellow}{v_0}\)
\(\vdots\)
\(y_t = \textcolor{yellow}{\theta^\top} x_t + \textcolor{yellow}{v_t}\)
$$ \min_{\textcolor{yellow}{\theta, v_k}} \sum_{k=0}^t \textcolor{yellow}{v_k}^2 $$
Recap: Least squares regression
s.t.
\(y_0 = \textcolor{yellow}{\theta_0^\top} x_0 + \textcolor{yellow}{v_0}\)
\(\vdots\)
\(y_t = \textcolor{yellow}{\theta_t^\top} x_t + \textcolor{yellow}{v_t}\)
$$ \min_{\textcolor{yellow}{\theta_k, v_k, w_k}} \sum_{k=0}^t \textcolor{yellow}{v_k^2}+\sum_{k=0}^{t-1}\textcolor{yellow}{\|w_k\|_2^2} + \textcolor{yellow}{\|\theta_0\|_2^2} $$
Recap: Least squares filtering
\(\theta_1 = A_0\textcolor{yellow}{\theta_0}+\textcolor{yellow}{w_0}\)
\(\vdots\)
\(\theta_t = A_{t-1}\textcolor{yellow}{\theta_{t-1}}+\textcolor{yellow}{w_{t-1}}\)
s.t.
\(y_0 = C_0\textcolor{yellow}{s_0} + \textcolor{yellow}{v_0}\)
\(\vdots\)
\(y_t = C_t\textcolor{yellow}{s_t} + \textcolor{yellow}{v_t}\)
Recap: Least squares filtering
\(s_1 = A_0\textcolor{yellow}{s_0}+\textcolor{yellow}{w_0}\)
\(\vdots\)
\(s_t = A_{t-1}\textcolor{yellow}{s_{t-1}}+\textcolor{yellow}{w_{t-1}}\)
s.t.
$$ \min_{\textcolor{yellow}{s_k, v_k, w_k}} \sum_{k=0}^t \textcolor{yellow}{\|v_k\|^2}+\sum_{k=0}^{t-1}\textcolor{yellow}{\|w_k\|_2^2} + \textcolor{yellow}{\|s_0\|_2^2} $$
Recap: Least squares filtering
$$ \min_{\textcolor{yellow}{s_k}} \sum_{k=0}^t \|C_k\textcolor{yellow}{s_k}-y_k\|_2^2+ \sum_{k=0}^{t-1}\|A_k\textcolor{yellow}{s_k-s_{k+1}}\|_2^2+ \|\textcolor{yellow}{s_0}\|_2^2$$
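The unconstrained form above can be checked in code by assembling it as one large linear least-squares problem. A minimal sketch, assuming time-invariant \(A_k = A\), \(C_k = C\) and the unit weights implied by the objective; names are illustrative.

```python
import numpy as np

def batch_ls_filter(A, C, y):
    """Solve min over s_0,...,s_T of
       sum_k ||C s_k - y_k||^2 + sum_k ||A s_k - s_{k+1}||^2 + ||s_0||^2
    as one linear least-squares problem; returns the estimated trajectory."""
    T = len(y) - 1
    n, p = A.shape[0], C.shape[0]
    rows = p * (T + 1) + n * T + n        # measurement + dynamics + prior residuals
    H = np.zeros((rows, n * (T + 1)))
    b = np.zeros(rows)
    r = 0
    for k in range(T + 1):                # measurement residuals: C s_k - y_k
        H[r:r+p, k*n:(k+1)*n] = C
        b[r:r+p] = y[k]
        r += p
    for k in range(T):                    # dynamics residuals: A s_k - s_{k+1}
        H[r:r+n, k*n:(k+1)*n] = A
        H[r:r+n, (k+1)*n:(k+2)*n] = -np.eye(n)
        r += n
    H[r:r+n, :n] = np.eye(n)              # prior residual: s_0
    s = np.linalg.lstsq(H, b, rcond=None)[0]
    return s.reshape(T + 1, n)
```

Under a Kalman filter initialization consistent with the \(\|s_0\|^2\) prior term and unit noise covariances, the final block \(s_t\) of this batch solution should agree with the filtered estimate \(\hat s_{t\mid t}\).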
Recap: Kalman filter
- Least-squares estimate of state
- Time-varying dynamics/measurements
- Noise covariance
- KF is the MVLUE for stochastic and independent noise
- KF is the MVUE for Gaussian noise
- Initialize \(P_{0\mid0} \) and \(\hat s_{0\mid 0}\)
- For \(t= 1, 2...\):
- Extrapolate
  - State: \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1}\)
  - Info matrix: \(P_{t\mid t-1} = A_tP_{t-1\mid t-1}A_t^\top + \Sigma_{w,t}\)
- Compute gain: \(L_{t} = P_{t\mid t-1}C_t^\top ( C_tP_{t\mid t-1}C_t^\top+\Sigma_{v,t})^{-1}\)
- Update
  - State: \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
  - Info matrix: \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)
When is state estimation possible?
Example
The state \(s=[\theta,\omega]\), output \(y=\theta\), and
$$\theta_{t+1} = 0.9\theta_t,\quad \omega_{t+1} = 0.9 \omega_t$$
Related Goals
- Predict future outputs \(y_{0:t} \to \hat y_{t+1}\)
- Estimate underlying state \(y_{0:t} \to \hat s_t \)
Challenges
- Unknown initial state \(s_0\)
- Unpredictable process due to \(w_t\)
- Noisy measurements due to \(v_t\)
A deterministic system is observable if outputs \(y_{0:t}\) uniquely determine the state trajectory \(s_{0:t}\).
Observability
\(s_{t+1} = F(s_t)\)
\(y_t = G(s_t)\)
\((y_0, y_1,...) = \Phi_{F,G}[s_0]\)
A linear system is observable if and only if
$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix}}_{\mathcal O}\Big) = n$$
Observability
\(s_{t+1} = As_t\)
\(y_t = Cs_t\)
\(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}= \begin{bmatrix} Cs_0\\CAs_0\\\vdots\\CA^ts_0\end{bmatrix}\)
A linear system is observable if and only if
$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix}}_{\mathcal O}\Big) = n$$
Proof:
1) \(\mathrm{rank}(\mathcal O) = n \implies\) observable
- The outputs satisfy \(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\underbrace{\begin{bmatrix} C\\CA\\\vdots\\CA^t\end{bmatrix}}_{\mathcal O_t} s_0\), and in particular $$\begin{bmatrix} y_0 \\ \vdots \\ y_{n-1}\end{bmatrix} = \begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix} s_0 $$
- There is a unique solution \(s_0\) when \(\mathcal O\) is rank \(n\).
- Thus, \(s_{0:t}\) is uniquely determined.
2) \(\mathrm{rank}(\mathcal O) = n \impliedby\) observable, i.e. \(\mathrm{rank}(\mathcal O) < n \implies\) not observable
- For \(t\geq n\), \(\mathcal O_t\) has more rows than \(\mathcal O\).
- Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree \(n\)).
- Therefore, \(A^k\) for \(k\geq n\) is a linear combination of \(I, A,\dots ,A^{n-1}\).
- Thus, \(\mathrm{rank}(\mathcal O_t) \leq \mathrm{rank}(\mathcal O)<n\), so \(s_0\) is not uniquely determined.
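A quick numerical check of the rank condition, using the two example systems from this lecture (\(y=\theta\), with and without coupling between \(\theta\) and \(\omega\)); a minimal sketch with illustrative names.

```python
import numpy as np

def observability_matrix(A, C):
    """Stack [C; CA; ...; CA^{n-1}]."""
    n = A.shape[0]
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

C = np.array([[1.0, 0.0]])                      # y = theta
A_decoupled = np.array([[0.9, 0.0], [0.0, 0.9]])
A_coupled   = np.array([[0.9, 0.1], [0.0, 0.9]])

for name, A in [("decoupled", A_decoupled), ("coupled", A_coupled)]:
    O = observability_matrix(A, C)
    print(name, "rank(O) =", np.linalg.matrix_rank(O))
# decoupled: rank 1 < 2, not observable; coupled: rank 2, observable
```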
Steady-state Kalman Filter
If \((A,C)\) is observable, then as \(t\to\infty\) the Kalman filter covariance and gain converge to \(P,L\) satisfying
\(P = A\left(P-PC^\top (CPC^\top +\Sigma_v)^{-1} CP\right)A^\top + \Sigma_w\)
\(L = PC^\top (\Sigma_v+CPC^\top)^{-1}\)
[Block diagram] The steady-state filter takes in \(y_t\) and maintains the estimate \(\hat s\) with fixed gain \(L\) and internal dynamics \(A-LCA\):
\(\hat s_{t} = A\hat s_{t-1} + L(y_{t} -CA\hat s_{t-1})\)
Discrete Algebraic Riccati Equation
Steady-state Kalman Filter
If \((A,C)\) is observable, then as \(t\to\infty\) the Kalman filter covariance and gain converge to \(P,L\) satisfying
\(P, L = \mathrm{DARE}(A^\top, C^\top ,\Sigma_w, \Sigma_v)\)
[Block diagram] The steady-state filter takes in \(y_t\) and maintains the estimate \(\hat s\) with fixed gain \(L\) and internal dynamics \(A-LCA\):
\(\hat s_{t} = A\hat s_{t-1} + L(y_{t} -CA\hat s_{t-1})\)
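The steady-state \(P, L\) can be computed with SciPy's DARE solver via the duality stated above, \(\mathrm{DARE}(A^\top, C^\top, \Sigma_w, \Sigma_v)\). A minimal sketch; the particular matrices are illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
Sigma_w = 0.1 * np.eye(2)
Sigma_v = np.array([[1.0]])

# Filtering Riccati equation solved via the dual (control-form) DARE
P = solve_discrete_are(A.T, C.T, Sigma_w, Sigma_v)
L = P @ C.T @ np.linalg.inv(Sigma_v + C @ P @ C.T)
print("steady-state gain L =", L.ravel())
```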
Steady-state Kalman Filter
[Block diagram] Writing the filter in innovations form, \(\hat s\) is driven by the innovation \(\varepsilon_t\) through the gain \(L\):
\(\hat s_{t} = A\hat s_{t-1} + L\varepsilon_t\)
\(y_{t} =CA\hat s_{t-1}+\varepsilon_t\)
[Block diagram] The world (state \(s\), dynamics \(A\), measurement map \(C\), noise \(w_t, v_t\)) generates \(y_t\).
Interconnection of the world (linear dynamics) with a filter (linear dynamics)
Exercise: show that for some \(C_\varepsilon, D_\varepsilon, A_e, D_e\),
the innovations are given by \(\varepsilon_t = C_\varepsilon e_t+ D_\varepsilon v_t\) where
\(e_{t+1} =A_ee_{t} -w_{t}+D_e v_t\) and \(e_0 = \hat s_0- s_0\)
Steady-state Kalman Filter
[Block diagram] Equivalently, the interconnection can be redrawn in terms of the error system: the error state \(e\) has dynamics \(A_e\), is driven by \(w_t\) and \(v_t\), and produces the innovations \(\varepsilon_t\) through \(C_\varepsilon\); the filter \(\hat s_{t} = A\hat s_{t-1} + L\varepsilon_t\), \(y_{t} =CA\hat s_{t-1}+\varepsilon_t\) is driven by \(\varepsilon_t\).
What if we don't know exactly how the system evolves or how measurements are generated?
System identification
Related Goals
- Predict future outputs \(y_{0:t} \to \hat y_{t+1}\)
- Estimate underlying state \(y_{0:t} \to \hat s_t \)
- Identify dynamics and measurement model $$y_{0:t} \to (\hat A, \hat C)$$
[Block diagram] The world generates \(y_t\) from state \(s\) driven by noise \(w_t, v_t\), but now the dynamics (\(?\)) and measurement map (\(?\)) are unknown.
What if we don't know exactly how the system evolves?
Sys id with state observation
Related Goals
- Predict future states \(s_{0:t} \to \hat s_{t+1}\)
- Identify dynamics model $$s_{0:t} \to \hat A$$
[Block diagram] We observe the state \(s_t\) directly; the unknown dynamics (\(?\)) and process noise \(w_t\) act on the state \(s\).
What if we don't know exactly how the system evolves?
Sys id with state observation
- \(s_1 = \textcolor{yellow}{F}(s_0) + \textcolor{yellow}{w_0}\)
- \(s_2 = \textcolor{yellow}{F}(s_1) + \textcolor{yellow}{w_1}\)
- ...
- \(s_t = \textcolor{yellow}{F}(s_{t-1}) + \textcolor{yellow}{w_{t-1}}\)
What if we don't know exactly how the system evolves?
Sys id with state observation
$$\hat F = \arg\min_{\textcolor{yellow}{F}\in\mathcal F} ~~\sum_{k=0}^{t-1}\|s_{k+1} - \textcolor{yellow}{F}(s_k)\|^2 $$
Use supervised learning with labels \(s_{k+1}\) and features \(s_k\)!
Linear sys id with state observation
$$\hat A = \arg\min_{A} ~~\sum_{k=0}^{t-1}\|s_{k+1} - As_k\|^2 $$
$$= \arg\min_{A} \Big\|\underbrace{\begin{bmatrix} s_1^\top \\ \vdots \\ s_{t}^\top\end{bmatrix}}_{S_+} - \underbrace{\begin{bmatrix} s_0^\top \\ \vdots \\ s_{t-1}^\top\end{bmatrix}}_{S_-}A^\top\Big\|_F^2 $$
$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$
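A minimal NumPy sketch of this least-squares estimator; the data-generating values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.9]])
t, n = 500, 2

# Roll out s_{k+1} = A s_k + w_k
S = np.zeros((t + 1, n))
S[0] = rng.standard_normal(n)
for k in range(t):
    S[k + 1] = A_true @ S[k] + 0.1 * rng.standard_normal(n)

S_minus, S_plus = S[:-1], S[1:]
# hat A = S_+^T S_- (S_-^T S_-)^{-1}, i.e. row-wise least squares s_{k+1} ≈ A s_k
A_hat = np.linalg.lstsq(S_minus, S_plus, rcond=None)[0].T
print(np.round(A_hat, 3))
```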
Linear sys id with state observation
$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$
Example
Suppose there is no noise and we observe \(s_0 = [0, 1]\), \(s_1 = [\frac{1}{2}, 0]\), and \(s_2 = [0, \frac{1}{4}]\). What is \(s_3\)? \(A\)?
What if \(s_0 = [0,1]\), \(s_1 = [0, \frac{1}{2}]\), and \(s_2 = [0, \frac{1}{4}]\)?
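A quick numerical check of this example (a sketch; the pseudoinverse is used so the rank-deficient case still returns one of many minimizers, and the helper name is illustrative).

```python
import numpy as np

def fit_A(states):
    S_minus, S_plus = states[:-1], states[1:]
    # Minimum-norm least-squares solution of S_- A^T ≈ S_+
    return (np.linalg.pinv(S_minus) @ S_plus).T

case1 = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.25]])
case2 = np.array([[0.0, 1.0], [0.0, 0.5], [0.0, 0.25]])

for states in (case1, case2):
    A_hat = fit_A(states)
    print("rank of S_- :", np.linalg.matrix_rank(states[:-1]))
    print("A_hat =\n", A_hat, "\npredicted s_3 =", A_hat @ states[-1])
```

In the first case \(S_-\) has rank 2, so the data pin down \(A\) uniquely; in the second, \(S_-\) is rank deficient, so many \(A\) are consistent with the data and the closed-form formula above (with an inverse) does not apply.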
Linear sys id with state observation
$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$
- Unlike standard linear regression, the "features" \((s_0,\dots,s_{t-1})\) are not fixed or drawn independently
- dependence \(s_{k+1} = As_k+w_k\)
- Exercise: excess risk of fixed design (Hint: see Lecture 2)
- Simple query model: suppose \(S_-\) is populated with fixed set of queries \((s_0,\dots s_{t-1})\) and \(S_+\) populated with single step of model \((s_{0,+},\dots,s_{t-1,+})\) where $$s_{i,+} = As_{i} + w_i$$
- Show that if \(w_i\) is i.i.d., zero mean, and \(\mathbb E[w_iw_i^\top]=\sigma^2I\), the excess risk of the fixed design query model is \(\frac{\sigma^2n^2}{t}\)
Even if \((A,C)\) is observable, non-unique solution!
Noiseless linear sys id
What if we don't know exactly how the system evolves or how measurements are generated?
- \(y_0 = \textcolor{yellow}{Cs_0}\)
- \(s_1 = \textcolor{yellow}{As_0}\)
- ...
- \(s_t = \textcolor{yellow}{A}s_{t-1}\)
- \(y_t = \textcolor{yellow}{C}s_{t} \)
Example
- \(s_{t+1} = \frac{1}{2}s_t\) from \(s_0=4\) and \(y_t = s_t\)
- could just as easily be \(s_{t+1} = \frac{1}{2}s_t\) from \(s_0=2\) and \(y_t = 2s_t\)
Even if \((A,C)\) is observable, non-unique solution!
Noiseless linear sys id
- \(y_0 =\hat C\hat s_0\)
- \(\hat s_1 = \hat A\hat s_0\)
- ...
- \(\hat s_t = \hat A\hat s_{t-1}\)
- \(y_t = \hat C\hat s_{t} \)
Suppose \(\hat A, \hat C, \hat s_0\) satisfy the equations
- \(y_0 =\hat C\textcolor{cyan}{MM^{-1}}\hat s_0\)
- \(\textcolor{cyan}{M^{-1}}\hat s_1 = \textcolor{cyan}{M^{-1}}\hat A\textcolor{cyan}{MM^{-1}}\hat s_0\)
- ...
- \(\textcolor{cyan}{M^{-1}}\hat s_t = \textcolor{cyan}{M^{-1}}\hat A\textcolor{cyan}{MM^{-1}}\hat s_{t-1}\)
- \(y_t = \hat C\textcolor{cyan}{MM^{-1}}\hat s_{t} \)
Then so do \(\tilde A = M^{-1}\hat AM\), \(\tilde C = \hat CM\), \(\tilde s_0 = M^{-1}\hat s_0\)
The state \(s_t=[\theta_t,\omega_t]\), output \(y_t=\theta_t\), and
$$\theta_{t+1} = 0.9\theta_t+0.1\omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$
Example
What if we define \(\tilde s_0 = [\alpha\theta_0, \beta \omega_0]\) ?
Then equivalent by having \(y_t =\begin{bmatrix}\frac{1}{\alpha} & 0\end{bmatrix} \tilde s_t\) and
$$\tilde \theta_{t+1} = 0.9\tilde \theta_t+0.1\frac{\alpha}{\beta}\tilde \omega_t,\quad \tilde \omega_{t+1} = 0.9 \tilde \omega_t$$
The state \(s_t=[\theta_t,\omega_t]\), output \(y_t=\theta_t\), process noise \(w_t\), and
$$\theta_{t+1} = 0.9\theta_t+0.1\omega_t,\quad \omega_{t+1} = 0.9 \omega_t + w_t$$
Example
What if we define \(\tilde s_0 = [\alpha\theta_0, \beta \omega_0]\) ?
Then equivalent by having \(y_t =\begin{bmatrix}\frac{1}{\alpha} & 0\end{bmatrix} \tilde s_t\) and
$$\tilde \theta_{t+1} = 0.9\tilde \theta_t+0.1\frac{\alpha}{\beta}\tilde \omega_t,\quad \tilde \omega_{t+1} = 0.9 \tilde \omega_t+?$$
Equivalent representations
Set of equivalent state space representations for all invertible and square \(M\)
\(s_{t+1} = As_t + Bw_t\)
\(y_t = Cs_t+v_t\)
\(\tilde s_{t+1} = \tilde A\tilde s_t + \tilde B w_t\)
\(y_t = \tilde C\tilde s_t+v_t\)
\(\tilde s = M^{-1}s\)
\(\tilde A = M^{-1}AM\)
\(\tilde B = M^{-1}B\)
\(\tilde C = CM\)
\( s = M\tilde s\)
\( A = M\tilde AM^{-1}\)
\( B = M\tilde B\)
\(C = \tilde CM^{-1}\)
[Block diagrams] Two realizations of the same input-output behavior: the world with \((A, B, C)\) and state \(s\), and the transformed world with \((\tilde A, \tilde B, \tilde C)\) and state \(\tilde s\), each driven by \(w_t, v_t\) and producing the same \(y_t\).
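A numerical sanity check of this equivalence: for any invertible \(M\), both realizations produce identical outputs given the same noise sequence. A minimal sketch with illustrative matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.eye(2)
C = np.array([[1.0, 0.0]])
M = rng.standard_normal((2, 2))                 # any invertible M
Minv = np.linalg.inv(M)

A_t, B_t, C_t = Minv @ A @ M, Minv @ B, C @ M   # transformed realization

s = np.array([1.0, -1.0])
s_t = Minv @ s                                  # tilde s = M^{-1} s
for _ in range(20):
    w, v = 0.1 * rng.standard_normal(2), 0.1 * rng.standard_normal(1)
    assert np.allclose(C @ s + v, C_t @ s_t + v)   # same output y_t
    s = A @ s + B @ w
    s_t = A_t @ s_t + B_t @ w
print("outputs identical at every step")
```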
Goal: identify dynamics \(A\) and measurement model \(C\) up to similarity transform
Linear system identification
$$\mathcal O_t = \begin{bmatrix}C\\CA\\\vdots \\ CA^{t}\end{bmatrix} $$
Recall the observability matrix
$$\tilde{\mathcal O}_t = \begin{bmatrix}\tilde C\\\tilde C\tilde A\\\vdots \\ \tilde C\tilde A^{t}\end{bmatrix} $$
$$=\begin{bmatrix}CM\\CMM^{-1} AM\\\vdots \\ CMM^{-1} A^{t}M\end{bmatrix} $$
$$=\mathcal O_t M $$
Approach: subspace identification of the rank \(n\) column space of \(\mathcal O_t\)
Step 1: Estimate \(\mathcal O_t\)
Linear system identification
$$\begin{bmatrix}y_0\\\vdots \\ y_t\end{bmatrix} =\mathcal O_t s_0 $$
$$\begin{bmatrix}y_1\\\vdots \\ y_{t+1}\end{bmatrix} =\mathcal O_t s_{1} $$
\(\dots\)
$$\begin{bmatrix}y_k\\\vdots \\ y_{t+k}\end{bmatrix} =\mathcal O_t s_{k} $$
Step 1: Estimate \(\mathcal O_t\)
Linear system identification
$$\begin{bmatrix}y_0 & \dots &y_k\\\vdots & \ddots&\vdots\\ y_t&\dots& y_{t+k}\end{bmatrix} =\mathcal O_t \begin{bmatrix}s_0 & s_1 &\dots & s_k\end{bmatrix}$$
As long as \(s_{0:k}\) form a rank \(n\) matrix, the column space of \(\mathcal O_t\) is the same as the column space of the matrix of outputs (Hankel matrix)
\(\hat{\mathcal O}_t = U_n\)
where the SVD is given by \(Y_{t,k} = \begin{bmatrix}U_n & U_{\perp}\end{bmatrix} \begin{bmatrix}\Sigma_n & \\ & 0\end{bmatrix}V^\top \)
Step 1: Estimate \(\mathcal O_t\)
Linear system identification
Step 2: Recover \(\hat A\) and \(\hat C\) from \(\hat{\mathcal O}\)
$$\hat{\mathcal O}_t = \begin{bmatrix}\hat{\mathcal O}_t [0]\\\hat{\mathcal O}_t [1]\\\vdots \\ \hat{\mathcal O}_t [t]\end{bmatrix}\approx \begin{bmatrix}C\\CA\\\vdots \\ CA^{t}\end{bmatrix} $$
So we set \(\hat C = \hat {\mathcal O}_t[0]\) and \(\hat A\) as solving
- \(\hat {\mathcal O}_t[0] A = \hat {\mathcal O}_t[1] \)
- ...
- \(\hat {\mathcal O}_t[t-1] A = \hat {\mathcal O}_t[t] \)
\(\hat{\mathcal O}_t = U_n\) from SVD of Hankel matrix \(\begin{bmatrix}y_0 & \dots &y_k\\\vdots & \ddots&\vdots\\ y_t&\dots& y_{t+k}\end{bmatrix}\)
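Putting the two steps together for a noiseless autonomous system: build the Hankel matrix of outputs, take its SVD to estimate the column space of \(\mathcal O_t\), then read off \(\hat C\) and solve the shifted equations for \(\hat A\). A minimal sketch with illustrative matrices and dimensions; since \((\hat A, \hat C)\) are recovered only up to a similarity transform, the check compares eigenvalues, which are similarity-invariant.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
n = 2
t, k = 5, 10                                    # Hankel block-row / column counts

# Noiseless rollout y_j = C A^j s_0 (single output, p = 1)
s = rng.standard_normal(n)
ys = []
for _ in range(t + k + 1):
    ys.append((C @ s)[0])
    s = A @ s

# Hankel matrix of outputs: entry (i, j) is y_{i+j}
Y = np.array([[ys[i + j] for j in range(k + 1)] for i in range(t + 1)])

# Step 1: estimate the column space of O_t from the rank-n SVD
U, svals, Vt = np.linalg.svd(Y)
O_hat = U[:, :n]                                # equals O_t T for some invertible T

# Step 2: hat C is the first block row; hat A solves the shifted equations
C_hat = O_hat[:1]
A_hat, *_ = np.linalg.lstsq(O_hat[:-1], O_hat[1:], rcond=None)

print("eig(A_hat):", np.round(np.linalg.eigvals(A_hat), 4))
print("eig(A)    :", np.round(np.linalg.eigvals(A), 4))
```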
Next time: cutting edge work on finite sample guarantees for system identification!
Recap
- Kalman filter recap
- Observability
- Steady-state linear filter
- System identification
- Equivalent realizations
Reference: Callier & Desoer, "Linear System Theory"; Verhaegen & Verdult, "Filtering and System Identification"