System Identification
ML in Feedback Sys #9
Prof Sarah Dean
Announcements
- Office hours 10-11am this Thursday
- Zoom attendance and participation
- Guest lecture by Max Simchowitz on Monday, Zoom only
- Reminders
- Final project details posted, proposal due October 7
- Working in pairs/groups, self-assessment
- Paper presentation schedule posted
- Meet with Atul at least 2 days before you are scheduled to present; schedule the meeting a week or more in advance
Recap: Kalman filter
[Block diagram] The world evolves as \(s_{t+1} = As_t + w_t\) with measurements \(y_t = Cs_t + v_t\) (state \(s\), dynamics \(A\), measurement map \(C\), process noise \(w_t\), measurement noise \(v_t\)). The Kalman filter takes in \(y_t\) and maintains the estimate \(\hat s\) via \(\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)\), i.e., internal dynamics \(A-L_tCA\) with gain \(L_t\).
Recap: Kalman filter
- Least-squares estimate of state
- Initialize \(P_{0\mid0} \) and \(\hat s_{0\mid 0}\)
- For \(t= 1, 2...\):
- Extrapolate
  - State: \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1}\)
  - Info matrix: \(P_{t\mid t-1} = A_tP_{t-1\mid t-1}A_t^\top + \Sigma_{w,t}\)
- Compute gain: \(L_{t} = P_{t\mid t-1}C_t^\top ( C_tP_{t\mid t-1}C_t^\top+\Sigma_{v,t})^{-1}\)
- Update
  - State: \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
  - Info matrix: \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)
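To make the recursion above concrete, here is a minimal NumPy sketch, assuming time-invariant \(A\), \(C\), \(\Sigma_w\), \(\Sigma_v\) for simplicity; the function and variable names are illustrative, not course reference code.

```python
import numpy as np

def kalman_filter(A, C, Sigma_w, Sigma_v, ys, s_hat0, P0):
    """Kalman filter recursion from the slide above, assuming time-invariant
    A, C, Sigma_w, Sigma_v. `ys` holds the measurements y_1, ..., y_T."""
    n = A.shape[0]
    s_hat, P = s_hat0, P0
    filtered = []
    for y in ys:
        # Extrapolate
        s_pred = A @ s_hat                        # s_{t|t-1}
        P_pred = A @ P @ A.T + Sigma_w            # P_{t|t-1}
        # Compute gain
        L = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + Sigma_v)
        # Update
        s_hat = s_pred + L @ (y - C @ s_pred)     # s_{t|t}
        P = (np.eye(n) - L @ C) @ P_pred          # P_{t|t}
        filtered.append(s_hat)
    return np.array(filtered), P, L
```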
$$ \min_{\textcolor{yellow}{\theta}} \sum_{k=0}^t (y_k - \textcolor{yellow}{\theta^\top} x_k )^2 $$
Recap: Least squares regression
\(y_0 = \textcolor{yellow}{\theta^\top} x_0 + \textcolor{yellow}{v_0}\)
\(\vdots\)
\(y_t = \textcolor{yellow}{\theta^\top} x_t + \textcolor{yellow}{v_t}\)
$$ \min_{\textcolor{yellow}{\theta, v_k}} \sum_{k=0}^t \textcolor{yellow}{v_k}^2 $$
Recap: Least squares regression
s.t.
\(y_0 = \textcolor{yellow}{\theta_0^\top} x_0 + \textcolor{yellow}{v_0}\)
\(\vdots\)
\(y_t = \textcolor{yellow}{\theta_t^\top} x_t + \textcolor{yellow}{v_t}\)
$$ \min_{\textcolor{yellow}{\theta_k, v_k, w_k}} \sum_{k=0}^t \textcolor{yellow}{v_k^2}+\sum_{k=0}^{t-1}\textcolor{yellow}{\|w_k\|_2^2} + \textcolor{yellow}{\|\theta_0\|_2^2} $$
Recap: Least squares filtering
\(\theta_1 = A_0\textcolor{yellow}{\theta_0}+\textcolor{yellow}{w_0}\)
\(\vdots\)
\(\theta_t = A_{t-1}\textcolor{yellow}{\theta_{t-1}}+\textcolor{yellow}{w_{t-1}}\)
s.t.
\(y_0 = C_0\textcolor{yellow}{s_0} + \textcolor{yellow}{v_0}\)
\(\vdots\)
\(y_t = C_t\textcolor{yellow}{s_t} + \textcolor{yellow}{v_t}\)
Recap: Least squares filtering
\(s_1 = A_0\textcolor{yellow}{s_0}+\textcolor{yellow}{w_0}\)
\(\vdots\)
\(s_t = A_{t-1}\textcolor{yellow}{s_{t-1}}+\textcolor{yellow}{w_{t-1}}\)
s.t.
$$ \min_{\textcolor{yellow}{s_k, v_k, w_k}} \sum_{k=0}^t \textcolor{yellow}{\|v_k\|^2}+\sum_{k=0}^{t-1}\textcolor{yellow}{\|w_k\|_2^2} + \textcolor{yellow}{\|s_0\|_2^2} $$
Recap: Least squares filtering
$$ \min_{\textcolor{yellow}{s_k}} \sum_{k=0}^t \|C_k\textcolor{yellow}{s_k}-y_k\|_2^2+ \sum_{k=0}^{t-1}\|A_k\textcolor{yellow}{s_k-s_{k+1}}\|_2^2+ \|\textcolor{yellow}{s_0}\|_2^2$$
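The unconstrained form above can be checked in code by assembling it as one large linear least-squares problem. A minimal sketch, assuming time-invariant \(A_k = A\), \(C_k = C\) and the unit weights implied by the objective; names are illustrative.

```python
import numpy as np

def batch_ls_filter(A, C, y):
    """Solve min over s_0,...,s_T of
       sum_k ||C s_k - y_k||^2 + sum_k ||A s_k - s_{k+1}||^2 + ||s_0||^2
    as one linear least-squares problem; returns the estimated trajectory."""
    T = len(y) - 1
    n, p = A.shape[0], C.shape[0]
    rows = p * (T + 1) + n * T + n        # measurement + dynamics + prior residuals
    H = np.zeros((rows, n * (T + 1)))
    b = np.zeros(rows)
    r = 0
    for k in range(T + 1):                # measurement residuals: C s_k - y_k
        H[r:r+p, k*n:(k+1)*n] = C
        b[r:r+p] = y[k]
        r += p
    for k in range(T):                    # dynamics residuals: A s_k - s_{k+1}
        H[r:r+n, k*n:(k+1)*n] = A
        H[r:r+n, (k+1)*n:(k+2)*n] = -np.eye(n)
        r += n
    H[r:r+n, :n] = np.eye(n)              # prior residual: s_0
    s = np.linalg.lstsq(H, b, rcond=None)[0]
    return s.reshape(T + 1, n)
```

Under a Kalman filter initialization consistent with the \(\|s_0\|^2\) prior term and unit noise covariances, the final block \(s_t\) of this batch solution should agree with the filtered estimate \(\hat s_{t\mid t}\).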
Recap: Kalman filter
- Least-squares estimate of state
- Time-varying dynamics/measurements
- Noise covariance
- KF is the MVLUE for stochastic and independent noise
- KF is the MVUE for Gaussian noise
- Initialize \(P_{0\mid0} \) and \(\hat s_{0\mid 0}\)
- For \(t= 1, 2...\):
- Extrapolate
  - State: \(\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1}\)
  - Info matrix: \(P_{t\mid t-1} = A_tP_{t-1\mid t-1}A_t^\top + \Sigma_{w,t}\)
- Compute gain: \(L_{t} = P_{t\mid t-1}C_t^\top ( C_tP_{t\mid t-1}C_t^\top+\Sigma_{v,t})^{-1}\)
- Update
  - State: \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})\)
  - Info matrix: \(P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}\)
When is state estimation possible?
Example
The state \(s=[\theta,\omega]\), output \(y=\theta\), and
$$\theta_{t+1} = 0.9\theta_t,\quad \omega_{t+1} = 0.9 \omega_t$$
Related Goals
- Predict future outputs \(y_{0:t} \to \hat y_{t+1}\)
- Estimate underlying state \(y_{0:t} \to \hat s_t \)
Challenges
- Unknown initial state \(s_0\)
- Unpredictable process due to \(w_t\)
- Noisy measurements due to \(v_t\)
A deterministic system is observable if outputs \(y_{0:t}\) uniquely determine the state trajectory \(s_{0:t}\).
Observability
\(s_{t+1} = F(s_t)\)
\(y_t = G(s_t)\)
\((y_0, y_1,...) = \Phi_{F,G}[s_0]\)
A linear system is observable if and only if
$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix}}_{\mathcal O}\Big) = n$$
Observability
\(s_{t+1} = As_t\)
\(y_t = Cs_t\)
\(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}= \begin{bmatrix} Cs_0\\CAs_0\\\vdots\\CA^ts_0\end{bmatrix}\)
A linear system is observable if and only if
$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix}}_{\mathcal O}\Big) = n$$
Proof:
1) \(\mathrm{rank}(\mathcal O) = n \implies\) observable
- The outputs satisfy \(\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\underbrace{\begin{bmatrix} C\\CA\\\vdots\\CA^t\end{bmatrix}}_{\mathcal O_t} s_0\), and in particular $$\begin{bmatrix} y_0 \\ \vdots \\ y_{n-1}\end{bmatrix} = \begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix} s_0 $$
- There is a unique solution \(s_0\) when \(\mathcal O\) is rank \(n\).
- Thus, \(s_{0:t}\) is uniquely determined.
2) \(\mathrm{rank}(\mathcal O) = n \impliedby\) observable, i.e. \(\mathrm{rank}(\mathcal O) < n \implies\) not observable
- For \(t\geq n\), \(\mathcal O_t\) has more rows than \(\mathcal O\).
- Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree \(n\)).
- Therefore, \(A^k\) for \(k\geq n\) is a linear combination of \(I, A,\dots ,A^{n-1}\).
- Thus, \(\mathrm{rank}(\mathcal O_t) \leq \mathrm{rank}(\mathcal O)<n\), so \(s_0\) is not uniquely determined.
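A quick numerical check of the rank condition, using the two example systems from this lecture (\(y=\theta\), with and without coupling between \(\theta\) and \(\omega\)); a minimal sketch with illustrative names.

```python
import numpy as np

def observability_matrix(A, C):
    """Stack [C; CA; ...; CA^{n-1}]."""
    n = A.shape[0]
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

C = np.array([[1.0, 0.0]])                      # y = theta
A_decoupled = np.array([[0.9, 0.0], [0.0, 0.9]])
A_coupled   = np.array([[0.9, 0.1], [0.0, 0.9]])

for name, A in [("decoupled", A_decoupled), ("coupled", A_coupled)]:
    O = observability_matrix(A, C)
    print(name, "rank(O) =", np.linalg.matrix_rank(O))
# decoupled: rank 1 < 2, not observable; coupled: rank 2, observable
```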
Steady-state Kalman Filter
If \((A,C)\) is observable, then as \(t\to\infty\) the Kalman filter covariance and gain converge to \(P,L\) satisfying
\(P = A\left(P-PC^\top (CPC^\top +\Sigma_v)^{-1} CP\right)A^\top + \Sigma_w\)
\(L = PC^\top (\Sigma_v+CPC^\top)^{-1}\)
[Block diagram] The steady-state filter takes in \(y_t\) and maintains the estimate \(\hat s\) with fixed gain \(L\) and internal dynamics \(A-LCA\):
\(\hat s_{t} = A\hat s_{t-1} + L(y_{t} -CA\hat s_{t-1})\)
Discrete Algebraic Riccati Equation
Steady-state Kalman Filter
If \((A,C)\) is observable, then as \(t\to\infty\) the Kalman filter covariance and gain converge to \(P,L\) satisfying
\(P, L = \mathrm{DARE}(A^\top, C^\top ,\Sigma_w, \Sigma_v)\)
[Block diagram] The steady-state filter takes in \(y_t\) and maintains the estimate \(\hat s\) with fixed gain \(L\) and internal dynamics \(A-LCA\):
\(\hat s_{t} = A\hat s_{t-1} + L(y_{t} -CA\hat s_{t-1})\)
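The steady-state \(P, L\) can be computed with SciPy's DARE solver via the duality stated above, \(\mathrm{DARE}(A^\top, C^\top, \Sigma_w, \Sigma_v)\). A minimal sketch; the particular matrices are illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
Sigma_w = 0.1 * np.eye(2)
Sigma_v = np.array([[1.0]])

# Filtering Riccati equation solved via the dual (control-form) DARE
P = solve_discrete_are(A.T, C.T, Sigma_w, Sigma_v)
L = P @ C.T @ np.linalg.inv(Sigma_v + C @ P @ C.T)
print("steady-state gain L =", L.ravel())
```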
Steady-state Kalman Filter
[Block diagram] Writing the filter in innovations form, \(\hat s\) is driven by the innovation \(\varepsilon_t\) through the gain \(L\):
\(\hat s_{t} = A\hat s_{t-1} + L\varepsilon_t\)
\(y_{t} =CA\hat s_{t-1}+\varepsilon_t\)
[Block diagram] The world (state \(s\), dynamics \(A\), measurement map \(C\), noise \(w_t, v_t\)) generates \(y_t\).
Interconnection of the world (linear dynamics) with a filter (linear dynamics)
Exercise: show that for some \(C_\varepsilon, D_\varepsilon, A_e, D_e\),
the innovations are given by \(\varepsilon_t = C_\varepsilon e_t+ D_\varepsilon v_t\) where
\(e_{t+1} =A_ee_{t} -w_{t}+D_e v_t\) and \(e_0 = \hat s_0- s_0\)
Steady-state Kalman Filter
[Block diagram] Equivalently, the interconnection can be redrawn in terms of the error system: the error state \(e\) has dynamics \(A_e\), is driven by \(w_t\) and \(v_t\), and produces the innovations \(\varepsilon_t\) through \(C_\varepsilon\); the filter \(\hat s_{t} = A\hat s_{t-1} + L\varepsilon_t\), \(y_{t} =CA\hat s_{t-1}+\varepsilon_t\) is driven by \(\varepsilon_t\).
What if we don't know exactly how the system evolves or how measurements are generated?
System identification
Related Goals
- Predict future outputs \(y_{0:t} \to \hat y_{t+1}\)
- Estimate underlying state \(y_{0:t} \to \hat s_t \)
- Identify dynamics and measurement model $$y_{0:t} \to (\hat A, \hat C)$$
[Block diagram] The world generates \(y_t\) from state \(s\) driven by noise \(w_t, v_t\), but now the dynamics (\(?\)) and measurement map (\(?\)) are unknown.
What if we don't know exactly how the system evolves?
Sys id with state observation
Related Goals
- Predict future states \(s_{0:t} \to \hat s_{t+1}\)
- Identify dynamics model $$s_{0:t} \to \hat A$$
[Block diagram] We observe the state \(s_t\) directly; the unknown dynamics (\(?\)) and process noise \(w_t\) act on the state \(s\).
What if we don't know exactly how the system evolves?
Sys id with state observation
- \(s_1 = \textcolor{yellow}{F}(s_0) + \textcolor{yellow}{w_0}\)
- \(s_2 = \textcolor{yellow}{F}(s_1) + \textcolor{yellow}{w_1}\)
- ...
- \(s_t = \textcolor{yellow}{F}(s_{t-1}) + \textcolor{yellow}{w_{t-1}}\)
What if we don't know exactly how the system evolves?
Sys id with state observation
$$\hat F = \arg\min_{\textcolor{yellow}{F}\in\mathcal F} ~~\sum_{k=0}^{t-1}\|s_{k+1} - \textcolor{yellow}{F}(s_k)\|^2 $$
Use supervised learning with labels \(s_{k+1}\) and features \(s_k\)!
Linear sys id with state observation
$$\hat A = \arg\min_{A} ~~\sum_{k=0}^{t-1}\|s_{k+1} - As_k\|^2 $$
$$= \arg\min_{A} \Big\|\underbrace{\begin{bmatrix} s_1^\top \\ \vdots \\ s_{t}^\top\end{bmatrix}}_{S_+} - \underbrace{\begin{bmatrix} s_0^\top \\ \vdots \\ s_{t-1}^\top\end{bmatrix}}_{S_-}A^\top\Big\|_F^2 $$
$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$
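A minimal NumPy sketch of this least-squares estimator; the data-generating values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.9]])
t, n = 500, 2

# Roll out s_{k+1} = A s_k + w_k
S = np.zeros((t + 1, n))
S[0] = rng.standard_normal(n)
for k in range(t):
    S[k + 1] = A_true @ S[k] + 0.1 * rng.standard_normal(n)

S_minus, S_plus = S[:-1], S[1:]
# hat A = S_+^T S_- (S_-^T S_-)^{-1}, i.e. row-wise least squares s_{k+1} ≈ A s_k
A_hat = np.linalg.lstsq(S_minus, S_plus, rcond=None)[0].T
print(np.round(A_hat, 3))
```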
Linear sys id with state observation
$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$
Example
Suppose there is no noise and we observe \(s_0 = [0, 1]\), \(s_1 = [\frac{1}{2}, 0]\), and \(s_2 = [0, \frac{1}{4}]\). What is \(s_3\)? \(A\)?
What if \(s_0 = [0,1]\), \(s_1 = [0, \frac{1}{2}]\), and \(s_2 = [0, \frac{1}{4}]\)?
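A quick numerical check of this example (a sketch; the pseudoinverse is used so the rank-deficient case still returns one of many minimizers, and the helper name is illustrative).

```python
import numpy as np

def fit_A(states):
    S_minus, S_plus = states[:-1], states[1:]
    # Minimum-norm least-squares solution of S_- A^T ≈ S_+
    return (np.linalg.pinv(S_minus) @ S_plus).T

case1 = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.25]])
case2 = np.array([[0.0, 1.0], [0.0, 0.5], [0.0, 0.25]])

for states in (case1, case2):
    A_hat = fit_A(states)
    print("rank of S_- :", np.linalg.matrix_rank(states[:-1]))
    print("A_hat =\n", A_hat, "\npredicted s_3 =", A_hat @ states[-1])
```

In the first case \(S_-\) has rank 2, so the data pin down \(A\) uniquely; in the second, \(S_-\) is rank deficient, so many \(A\) are consistent with the data and the closed-form formula above (with an inverse) does not apply.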
Linear sys id with state observation
$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1} $$
- Unlike standard linear regression, the "features" \((s_0,\dots,s_{t-1})\) are not fixed or drawn independently
- dependence \(s_{k+1} = As_k+w_k\)
- Exercise: excess risk of fixed design (Hint: see Lecture 2)
- Simple query model: suppose \(S_-\) is populated with fixed set of queries \((s_0,\dots s_{t-1})\) and \(S_+\) populated with single step of model \((s_{0,+},\dots,s_{t-1,+})\) where $$s_{i,+} = As_{i} + w_i$$
- Show that if \(w_i\) is i.i.d., zero mean, and \(\mathbb E[w_iw_i^\top]=\sigma^2I\), the excess risk of the fixed design query model is \(\frac{\sigma^2n^2}{t}\)
Even if \((A,C)\) is observable, non-unique solution!
Noiseless linear sys id
What if we don't know exactly how the system evolves or how measurements are generated?
- \(y_0 = \textcolor{yellow}{Cs_0}\)
- \(s_1 = \textcolor{yellow}{As_0}\)
- ...
- \(s_t = \textcolor{yellow}{A}s_{t-1}\)
- \(y_t = \textcolor{yellow}{C}s_{t} \)
Example
- \(s_{t+1} = \frac{1}{2}s_t\) from \(s_0=4\) and \(y_t = s_t\)
- could just as easily be \(s_{t+1} = \frac{1}{2}s_t\) from \(s_0=2\) and \(y_t = 2s_t\)
Even if \((A,C)\) is observable, non-unique solution!
Noiseless linear sys id
- \(y_0 =\hat C\hat s_0\)
- \(\hat s_1 = \hat A\hat s_0\)
- ...
- \(\hat s_t = \hat A\hat s_{t-1}\)
- \(y_t = \hat C\hat s_{t} \)
Suppose \(\hat A, \hat C, \hat s_0\) satisfy the equations
- \(y_0 =\hat C\textcolor{cyan}{MM^{-1}}\hat s_0\)
- \(\textcolor{cyan}{M^{-1}}\hat s_1 = \textcolor{cyan}{M^{-1}}\hat A\textcolor{cyan}{MM^{-1}}\hat s_0\)
- ...
- \(\textcolor{cyan}{M^{-1}}\hat s_t = \textcolor{cyan}{M^{-1}}\hat A\textcolor{cyan}{MM^{-1}}\hat s_{t-1}\)
- \(y_t = \hat C\textcolor{cyan}{MM^{-1}}\hat s_{t} \)
Then so do \(\tilde A = M^{-1}\hat AM\), \(\tilde C = \hat CM\), \(\tilde s_0 = M^{-1}\hat s_0\)
The state \(s_t=[\theta_t,\omega_t]\), output \(y_t=\theta_t\), and
$$\theta_{t+1} = 0.9\theta_t+0.1\omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$
Example
What if we define \(\tilde s_0 = [\alpha\theta_0, \beta \omega_0]\) ?
Then equivalent by having \(y_t =\begin{bmatrix}\frac{1}{\alpha} & 0\end{bmatrix} \tilde s_t\) and
$$\tilde \theta_{t+1} = 0.9\tilde \theta_t+0.1\frac{\alpha}{\beta}\tilde \omega_t,\quad \tilde \omega_{t+1} = 0.9 \tilde \omega_t$$
The state \(s_t=[\theta_t,\omega_t]\), output \(y_t=\theta_t\), process noise \(w_t\), and
$$\theta_{t+1} = 0.9\theta_t+0.1\omega_t,\quad \omega_{t+1} = 0.9 \omega_t + w_t$$
Example
What if we define \(\tilde s_0 = [\alpha\theta_0, \beta \omega_0]\) ?
Then equivalent by having \(y_t =\begin{bmatrix}\frac{1}{\alpha} & 0\end{bmatrix} \tilde s_t\) and
$$\tilde \theta_{t+1} = 0.9\tilde \theta_t+0.1\frac{\alpha}{\beta}\tilde \omega_t,\quad \tilde \omega_{t+1} = 0.9 \tilde \omega_t+?$$
Equivalent representations
Set of equivalent state space representations for all invertible and square \(M\)
\(s_{t+1} = As_t + Bw_t\)
\(y_t = Cs_t+v_t\)
\(\tilde s_{t+1} = \tilde A\tilde s_t + \tilde B w_t\)
\(y_t = \tilde C\tilde s_t+v_t\)
\(\tilde s = M^{-1}s\)
\(\tilde A = M^{-1}AM\)
\(\tilde B = M^{-1}B\)
\(\tilde C = CM\)
\( s = M\tilde s\)
\( A = M\tilde AM^{-1}\)
\( B = M\tilde B\)
\(C = \tilde CM^{-1}\)
[Block diagrams] Two realizations of the same input-output behavior: the world with \((A, B, C)\) and state \(s\), and the transformed world with \((\tilde A, \tilde B, \tilde C)\) and state \(\tilde s\), each driven by \(w_t, v_t\) and producing the same \(y_t\).
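A numerical sanity check of this equivalence: for any invertible \(M\), both realizations produce identical outputs given the same noise sequence. A minimal sketch with illustrative matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.eye(2)
C = np.array([[1.0, 0.0]])
M = rng.standard_normal((2, 2))                 # any invertible M
Minv = np.linalg.inv(M)

A_t, B_t, C_t = Minv @ A @ M, Minv @ B, C @ M   # transformed realization

s = np.array([1.0, -1.0])
s_t = Minv @ s                                  # tilde s = M^{-1} s
for _ in range(20):
    w, v = 0.1 * rng.standard_normal(2), 0.1 * rng.standard_normal(1)
    assert np.allclose(C @ s + v, C_t @ s_t + v)   # same output y_t
    s = A @ s + B @ w
    s_t = A_t @ s_t + B_t @ w
print("outputs identical at every step")
```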
Goal: identify dynamics \(A\) and measurement model \(C\) up to similarity transform
Linear system identification
$$\mathcal O_t = \begin{bmatrix}C\\CA\\\vdots \\ CA^{t}\end{bmatrix} $$
Recall the observability matrix
$$\tilde{\mathcal O}_t = \begin{bmatrix}\tilde C\\\tilde C\tilde A\\\vdots \\ \tilde C\tilde A^{t}\end{bmatrix} $$
$$=\begin{bmatrix}CM\\CMM^{-1} AM\\\vdots \\ CMM^{-1} A^{t}M\end{bmatrix} $$
$$=\mathcal O_t M $$
Approach: subspace identification of the rank \(n\) column space of \(\mathcal O_t\)
Step 1: Estimate \(\mathcal O_t\)
Linear system identification
$$\begin{bmatrix}y_0\\\vdots \\ y_t\end{bmatrix} =\mathcal O_t s_0 $$
$$\begin{bmatrix}y_1\\\vdots \\ y_{t+1}\end{bmatrix} =\mathcal O_t s_{1} $$
\(\dots\)
$$\begin{bmatrix}y_k\\\vdots \\ y_{t+k}\end{bmatrix} =\mathcal O_t s_{k} $$
Step 1: Estimate \(\mathcal O_t\)
Linear system identification
$$\begin{bmatrix}y_0 & \dots &y_k\\\vdots & \ddots&\vdots\\ y_t&\dots& y_{t+k}\end{bmatrix} =\mathcal O_t \begin{bmatrix}s_0 & s_1 &\dots & s_k\end{bmatrix}$$
As long as \(s_{0:k}\) form a rank \(n\) matrix, the column space of \(\mathcal O_t\) is the same as the column space of the matrix of outputs (Hankel matrix)
\(\hat{\mathcal O}_t = U_n\)
where the SVD is given by \(Y_{t,k} = \begin{bmatrix}U_n & U_{\perp}\end{bmatrix} \begin{bmatrix}\Sigma_n & \\ & 0\end{bmatrix}V^\top \)
Step 1: Estimate \(\mathcal O_t\)
Linear system identification
Step 2: Recover \(\hat A\) and \(\hat C\) from \(\hat{\mathcal O}\)
$$\hat{\mathcal O}_t = \begin{bmatrix}\hat{\mathcal O}_t [0]\\\hat{\mathcal O}_t [1]\\\vdots \\ \hat{\mathcal O}_t [t]\end{bmatrix}\approx \begin{bmatrix}C\\CA\\\vdots \\ CA^{t}\end{bmatrix} $$
So we set \(\hat C = \hat {\mathcal O}_t[0]\) and \(\hat A\) as solving
- \(\hat {\mathcal O}_t[0] A = \hat {\mathcal O}_t[1] \)
- ...
- \(\hat {\mathcal O}_t[t-1] A = \hat {\mathcal O}_t[t] \)
\(\hat{\mathcal O}_t = U_n\) from SVD of Hankel matrix \(\begin{bmatrix}y_0 & \dots &y_k\\\vdots & \ddots&\vdots\\ y_t&\dots& y_{t+k}\end{bmatrix}\)
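Putting the two steps together for a noiseless autonomous system: build the Hankel matrix of outputs, take its SVD to estimate the column space of \(\mathcal O_t\), then read off \(\hat C\) and solve the shifted equations for \(\hat A\). A minimal sketch with illustrative matrices and dimensions; since \((\hat A, \hat C)\) are recovered only up to a similarity transform, the check compares eigenvalues, which are similarity-invariant.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
n = 2
t, k = 5, 10                                    # Hankel block-row / column counts

# Noiseless rollout y_j = C A^j s_0 (single output, p = 1)
s = rng.standard_normal(n)
ys = []
for _ in range(t + k + 1):
    ys.append((C @ s)[0])
    s = A @ s

# Hankel matrix of outputs: entry (i, j) is y_{i+j}
Y = np.array([[ys[i + j] for j in range(k + 1)] for i in range(t + 1)])

# Step 1: estimate the column space of O_t from the rank-n SVD
U, svals, Vt = np.linalg.svd(Y)
O_hat = U[:, :n]                                # equals O_t T for some invertible T

# Step 2: hat C is the first block row; hat A solves the shifted equations
C_hat = O_hat[:1]
A_hat, *_ = np.linalg.lstsq(O_hat[:-1], O_hat[1:], rcond=None)

print("eig(A_hat):", np.round(np.linalg.eigvals(A_hat), 4))
print("eig(A)    :", np.round(np.linalg.eigvals(A), 4))
```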
Next time: cutting edge work on finite sample guarantees for system identification!
Recap
- Kalman filter recap
- Observability
- Steady-state linear filter
- System identification
- Equivalent realizations
Reference: Callier & Desoer, "Linear System Theory"; Verhaegen & Verdult, "Filtering and System Identification"