Prof Sarah Dean

## Announcements

• Office hours 10-11am this Thursday
• Zoom attendance and participation
• Guest lecture by Max Simchowitz on Monday, Zoom only
• Reminders
  • Final project details posted, proposal due October 7
  • Working in pairs/groups, self-assessment
  • Paper presentation schedule posted
  • Meet with Atul at least 2 days before you are scheduled to present; schedule the meeting a week or more in advance

$$s_{t+1} = As_t + w_t$$

$$y_t = Cs_t + v_t$$

$$\hat s_{t+1} = A\hat s_t + L_t(y_{t+1} -CA\hat s_t)$$

*(Block diagram: the world, a linear system with dynamics $$A$$, state $$s$$, process noise $$w_t$$, measurement matrix $$C$$, and measurement noise $$v_t$$, produces outputs $$y_t$$; the Kalman filter consumes $$y_t$$ through gain $$L_t$$ and propagates its estimate $$\hat s$$ with closed-loop dynamics $$A-L_tCA$$.)*

## Recap: Kalman filter

1. Least-squares estimate of state

• Initialize $$P_{0\mid0}$$ and $$\hat s_{0\mid 0}$$
• For $$t = 1, 2, \dots$$:
  • Extrapolate
    • State: $$\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1}$$
    • Covariance: $$P_{t\mid t-1} = A_tP_{t-1\mid t-1}A_t^\top + \Sigma_{w,t}$$
  • Compute gain: $$L_{t} = P_{t\mid t-1}C_t^\top(C_tP_{t\mid t-1}C_t^\top+\Sigma_{v,t})^{-1}$$
  • Update
    • State: $$\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})$$
    • Covariance: $$P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}$$
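For concreteness, a minimal NumPy sketch of one cycle of this recursion (the function name and shapes are illustrative, not from the slides):

```python
import numpy as np

def kalman_step(s_hat, P, y, A, C, Sigma_w, Sigma_v):
    """One extrapolate / gain / update cycle of the recursion above."""
    # extrapolate state and covariance
    s_pred = A @ s_hat
    P_pred = A @ P @ A.T + Sigma_w
    # compute gain
    L = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + Sigma_v)
    # measurement update
    s_new = s_pred + L @ (y - C @ s_pred)
    P_new = (np.eye(len(s_hat)) - L @ C) @ P_pred
    return s_new, P_new
```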

## Recap: Least squares regression

$$y_0 = \textcolor{yellow}{\theta^\top} x_0 + \textcolor{yellow}{v_0}$$

$$\vdots$$

$$y_t = \textcolor{yellow}{\theta^\top} x_t + \textcolor{yellow}{v_t}$$

$$\min_{\textcolor{yellow}{\theta}} \sum_{k=0}^t (y_k - \textcolor{yellow}{\theta^\top} x_k )^2$$

## Recap: Least squares regression

$$\min_{\textcolor{yellow}{\theta, v_k}} \sum_{k=0}^t \textcolor{yellow}{v_k}^2$$

s.t.

$$y_0 = \textcolor{yellow}{\theta^\top} x_0 + \textcolor{yellow}{v_0}$$

$$\vdots$$

$$y_t = \textcolor{yellow}{\theta^\top} x_t + \textcolor{yellow}{v_t}$$

## Recap: Least squares filtering

$$\min_{\textcolor{yellow}{\theta_k, v_k, w_k}} \|\textcolor{yellow}{\theta_0}\|_2^2+\sum_{k=0}^t \textcolor{yellow}{v_k}^2+\sum_{k=0}^{t-1}\|\textcolor{yellow}{w_k}\|_2^2$$

s.t.

$$y_0 = \textcolor{yellow}{\theta_0^\top} x_0 + \textcolor{yellow}{v_0}$$

$$\vdots$$

$$y_t = \textcolor{yellow}{\theta_t^\top} x_t + \textcolor{yellow}{v_t}$$

$$\theta_1 = A_0\textcolor{yellow}{\theta_0}+\textcolor{yellow}{w_0}$$

$$\vdots$$

$$\theta_t = A_{t-1}\textcolor{yellow}{\theta_{t-1}}+\textcolor{yellow}{w_{t-1}}$$

## Recap: Least squares filtering

$$\min_{\textcolor{yellow}{s_k, v_k, w_k}} \|\textcolor{yellow}{s_0}\|_2^2+\sum_{k=0}^t \|\textcolor{yellow}{v_k}\|_2^2+\sum_{k=0}^{t-1}\|\textcolor{yellow}{w_k}\|_2^2$$

s.t.

$$y_0 = C_0\textcolor{yellow}{s_0} + \textcolor{yellow}{v_0}$$

$$\vdots$$

$$y_t = C_t\textcolor{yellow}{s_t} + \textcolor{yellow}{v_t}$$

$$s_1 = A_0\textcolor{yellow}{s_0}+\textcolor{yellow}{w_0}$$

$$\vdots$$

$$s_t = A_{t-1}\textcolor{yellow}{s_{t-1}}+\textcolor{yellow}{w_{t-1}}$$

## Recap: Least squares filtering

$$\min_{\textcolor{yellow}{s_{0:t}}} \|\textcolor{yellow}{s_0}\|_2^2+\sum_{k=0}^t \|C_k\textcolor{yellow}{s_k}-y_k\|_2^2+ \sum_{k=0}^{t-1}\|A_k\textcolor{yellow}{s_k}-\textcolor{yellow}{s_{k+1}}\|_2^2$$

Substituting the constraints into the objective gives an unconstrained least-squares problem over the state trajectory; the Kalman filter computes its solution recursively.
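To make the connection concrete, here is a sketch (all data simulated; unit covariances and prior $$s_0\sim\mathcal N(0,I)$$ assumed) that solves this batch problem with `np.linalg.lstsq` and checks that its final state matches the Kalman filter recursion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 2, 1, 20
A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])

# simulate s_{k+1} = A s_k + w_k, y_k = C s_k + v_k (unit covariances)
s, ys = rng.normal(size=n), []
for _ in range(T + 1):
    ys.append(C @ s + rng.normal(size=m))
    s = A @ s + rng.normal(size=n)

# batch problem: stack prior, measurement, dynamics residuals over x = [s_0; ...; s_T]
N = n * (T + 1)
H, z = [np.hstack([np.eye(n), np.zeros((n, N - n))])], [np.zeros(n)]  # ||s_0||^2
for k in range(T + 1):                                  # ||C s_k - y_k||^2
    row = np.zeros((m, N)); row[:, n*k:n*(k+1)] = C
    H.append(row); z.append(ys[k])
for k in range(T):                                      # ||A s_k - s_{k+1}||^2
    row = np.zeros((n, N))
    row[:, n*k:n*(k+1)] = -A; row[:, n*(k+1):n*(k+2)] = np.eye(n)
    H.append(row); z.append(np.zeros(n))
x = np.linalg.lstsq(np.vstack(H), np.concatenate(z), rcond=None)[0]

# Kalman filter with matching prior: P = I before seeing y_0
P = np.eye(n)
for k in range(T + 1):
    L = P @ C.T @ np.linalg.inv(C @ P @ C.T + np.eye(m))
    s_hat = L @ ys[0] if k == 0 else s_hat + L @ (ys[k] - C @ s_hat)
    P = (np.eye(n) - L @ C) @ P
    if k < T:  # extrapolate
        s_hat, P = A @ s_hat, A @ P @ A.T + np.eye(n)

print(np.allclose(x[-n:], s_hat))  # True: batch solution at time T = filtered estimate
```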

## Recap: Kalman filter

1. Least-squares estimate of state
  • Time-varying dynamics/measurements
  • Noise covariance
2. KF is the MVLUE (minimum variance linear unbiased estimator) for stochastic and independent noise
3. KF is the MVUE (minimum variance unbiased estimator) for Gaussian noise

• Initialize $$P_{0\mid0}$$ and $$\hat s_{0\mid 0}$$
• For $$t = 1, 2, \dots$$:
  • Extrapolate
    • State: $$\hat s_{t\mid t-1} = A_t\hat s_{t-1\mid t-1}$$
    • Covariance: $$P_{t\mid t-1} = A_tP_{t-1\mid t-1}A_t^\top + \Sigma_{w,t}$$
  • Compute gain: $$L_{t} = P_{t\mid t-1}C_t^\top(C_tP_{t\mid t-1}C_t^\top+\Sigma_{v,t})^{-1}$$
  • Update
    • State: $$\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-C_t\hat s_{t\mid t-1})$$
    • Covariance: $$P_{t\mid t} = (I - L_{t}C_t)P_{t\mid t-1}$$

## When is state estimation possible?

Example

The state $$s=[\theta,\omega]$$, output $$y=\theta$$, and

$$\theta_{t+1} = 0.9\theta_t,\quad \omega_{t+1} = 0.9 \omega_t$$

Since $$\theta$$ and $$\omega$$ are decoupled, the outputs never reveal anything about $$\omega$$, so the full state cannot be recovered.

Related Goals

• Predict future outputs $$y_{0:t} \to \hat y_{t+1}$$
• Estimate underlying state $$y_{0:t} \to \hat s_t$$

Challenges

1. Unknown initial state $$s_0$$
2. Unpredictable process due to $$w_t$$
3. Noisy measurements due to $$v_t$$

A deterministic system is observable if outputs $$y_{0:t}$$ uniquely determine the state trajectory $$s_{0:t}$$.

## Observability

$$s_{t+1} = F(s_t)$$

$$y_t = G(s_t)$$

$$(y_0, y_1,...) = \Phi_{F,G}[s_0]$$

A linear system is observable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix}}_{\mathcal O}\Big) = n$$
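A quick numerical check of the rank condition (NumPy sketch; the matrices come from the examples in these slides):

```python
import numpy as np

def obsv(A, C):
    """Observability matrix [C; CA; ...; CA^{n-1}]."""
    n = A.shape[0]
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

C = np.array([[1.0, 0.0]])                        # y = theta
A_decoupled = np.array([[0.9, 0.0], [0.0, 0.9]])
A_coupled   = np.array([[0.9, 0.1], [0.0, 0.9]])  # theta picks up omega

print(np.linalg.matrix_rank(obsv(A_decoupled, C)))  # 1 < n = 2: not observable
print(np.linalg.matrix_rank(obsv(A_coupled, C)))    # 2 = n: observable
```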

## Observability

$$s_{t+1} = As_t$$

$$y_t = Cs_t$$

$$\begin{bmatrix} y_0\\y_1\\\vdots\\y_t\end{bmatrix}=\begin{bmatrix} Cs_0\\CAs_0\\\vdots\\CA^ts_0\end{bmatrix}=\underbrace{\begin{bmatrix} C\\CA\\\vdots\\CA^t\end{bmatrix}}_{\mathcal O_t} s_0$$

A linear system is observable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix}}_{\mathcal O}\Big) = n$$

Proof:

1) $$\mathrm{rank}(\mathcal O) = n \implies$$ observable

$$\begin{bmatrix} y_0 \\ \vdots \\ y_{n-1}\end{bmatrix} = \begin{bmatrix}C\\CA\\\vdots \\ CA^{n-1}\end{bmatrix} s_0$$

• There is a unique solution $$s_0$$ when $$\mathcal O$$ is rank $$n$$.
• Thus, $$s_{0:t}$$ is uniquely determined (via $$s_{k+1}=As_k$$).

2) $$\mathrm{rank}(\mathcal O) < n \implies$$ not observable

• For $$t\geq n$$, $$\mathcal O_t$$ has more rows than $$\mathcal O$$.
• Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree $$n$$).
• Therefore, $$A^k$$ for $$k\geq n$$ is a linear combination of $$I, A,\dots ,A^{n-1}$$.
• Thus, $$\mathrm{rank}(\mathcal O_t) \leq \mathrm{rank}(\mathcal O)<n$$, so $$s_0$$ is not uniquely determined.
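The Cayley-Hamilton step can also be checked numerically; a small sketch with a random $$(A, C)$$ pair (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A, C = rng.normal(size=(n, n)), rng.normal(size=(1, n))

# A satisfies its own characteristic polynomial: p(A) = 0
coeffs = np.poly(A)  # characteristic polynomial coefficients, highest degree first
pA = sum(c * np.linalg.matrix_power(A, n - i) for i, c in enumerate(coeffs))
print(np.allclose(pA, 0))  # True

# consequently, appending rows CA^n, CA^{n+1}, ... never increases the rank
O = lambda t: np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(t + 1)])
print([np.linalg.matrix_rank(O(t)) for t in (n - 1, n, 2 * n, 5 * n)])  # all equal
```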

## Steady-state Kalman Filter

If $$(A,C)$$ is observable, then as $$t\to\infty$$ the covariance and Kalman gain converge to $$P, L$$ satisfying

$$P = A\left(P-PC^\top (CPC^\top +\Sigma_v)^{-1} CP\right)A^\top + \Sigma_w$$

$$L = PC^\top (\Sigma_v+CPC^\top)^{-1}$$

$$\hat s_{t} = A\hat s_{t-1} + L(y_{t} -CA\hat s_{t-1})$$

The fixed-point equation for $$P$$ is a Discrete Algebraic Riccati Equation (DARE).

*(Block diagram: steady-state Kalman filter with constant gain $$L$$ and closed-loop dynamics $$A-LCA$$, processing outputs $$y_t$$ into the estimate $$\hat s$$.)*
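In SciPy, the fixed point can be computed with `scipy.linalg.solve_discrete_are`; a sketch on the running example (matrices and covariances chosen for illustration):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
Sigma_w, Sigma_v = np.eye(2), np.eye(1)

# filter Riccati equation: pass transposes, per P, L = DARE(A^T, C^T, Sigma_w, Sigma_v)
P = solve_discrete_are(A.T, C.T, Sigma_w, Sigma_v)
L = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma_v)

# check the fixed-point equation from the slide
lhs = A @ (P - P @ C.T @ np.linalg.inv(C @ P @ C.T + Sigma_v) @ C @ P) @ A.T + Sigma_w
print(np.allclose(lhs, P))  # True
```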

## Steady-state Kalman Filter

If $$(A,C)$$ is observable, then as $$t\to\infty$$ the covariance and Kalman gain converge to $$P, L$$ satisfying

$$P, L = \mathrm{DARE}(A^\top, C^\top ,\Sigma_w, \Sigma_v)$$

$$\hat s_{t} = A\hat s_{t-1} + L(y_{t} -CA\hat s_{t-1})$$

*(Block diagram: the same interconnection, now with the constant steady-state gain $$L$$.)*

## Steady-state Kalman Filter


$$\hat s_{t} = A\hat s_{t-1} + L\varepsilon_t$$

$$y_{t} =CA\hat s_{t-1}+\varepsilon_t$$


Interconnection of the world (linear dynamics) with a filter (linear dynamics)

Exercise: show that for some $$C_\varepsilon, D_\varepsilon, A_e, D_e$$,
the innovations are given by $$\varepsilon_t = C_\varepsilon e_t+ D_\varepsilon v_t$$ where

$$e_{t+1} =A_ee_{t} -w_{t}+D_e v_t$$ and $$e_0 = \hat s_0- s_0$$
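The exercise can be sanity-checked numerically. Below is a sketch that verifies one consistent choice of $$(A_e, C_\varepsilon, D_\varepsilon, D_e)$$; note it assumes the prediction-error convention $$e_t = A\hat s_{t-1} - s_t$$, which may differ from the indexing intended above, so treat it as a check of a candidate answer rather than the intended solution:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(2)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
P = solve_discrete_are(A.T, C.T, np.eye(2), np.eye(1))
L = P @ C.T @ np.linalg.inv(C @ P @ C.T + np.eye(1))

# conjectured solution under the convention e_t = A s_hat_{t-1} - s_t (assumption)
A_e, C_eps, D_eps, D_e = A - A @ L @ C, -C, np.eye(1), A @ L

s, s_hat, pending = rng.normal(size=2), np.zeros(2), None
for _ in range(50):
    w, v = rng.normal(size=2), rng.normal(size=1)
    s = A @ s + w                      # s_t = A s_{t-1} + w_{t-1}
    e = A @ s_hat - s                  # prediction error
    if pending is not None:            # e_t = A_e e_{t-1} - w_{t-1} + D_e v_{t-1}
        assert np.allclose(e, pending - w)
    eps = (C @ s + v) - C @ A @ s_hat  # innovation
    assert np.allclose(eps, C_eps @ e + D_eps @ v)
    s_hat = A @ s_hat + L @ eps        # steady-state filter update
    pending = A_e @ e + D_e @ v
print("both identities hold")
```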

## Steady-state Kalman Filter

*(Block diagram: the interconnection can be redrawn as an error system with state $$e$$, dynamics $$A_e$$, inputs $$w_t, v_t$$, and output map $$C_\varepsilon$$ generating the innovations $$\varepsilon_t$$, which drive the filter $$\hat s_{t} = A\hat s_{t-1} + L\varepsilon_t$$.)*

## System identification

What if we don't know exactly how the system evolves or how measurements are generated?

Related Goals

• Predict future outputs $$y_{0:t} \to \hat y_{t+1}$$
• Estimate underlying state $$y_{0:t} \to \hat s_t$$
• Identify dynamics and measurement model $$y_{0:t} \to (\hat A, \hat C)$$

*(Block diagram: the same linear system with state $$s$$ and noises $$w_t, v_t$$ producing $$y_t$$, but with the dynamics and measurement maps replaced by $$?$$.)*

## Sys id with state observation

What if we don't know exactly how the system evolves?

Related Goals

• Predict future states $$s_{0:t} \to \hat s_{t+1}$$
• Identify dynamics model $$s_{0:t} \to \hat A$$

*(Block diagram: unknown dynamics $$?$$ with state $$s$$ and process noise $$w_t$$; the state $$s_t$$ is observed directly.)*

## Sys id with state observation

What if we don't know exactly how the system evolves?

• $$s_1 = \textcolor{yellow}{F}(s_0) + \textcolor{yellow}{w_0}$$
• $$s_2 = \textcolor{yellow}{F}(s_1) + \textcolor{yellow}{w_1}$$
• ...
• $$s_t = \textcolor{yellow}{F}(s_{t-1}) + \textcolor{yellow}{w_{t-1}}$$

## Sys id with state observation

What if we don't know exactly how the system evolves?

$$\hat F = \arg\min_{\textcolor{yellow}{F}\in\mathcal F} ~~\sum_{k=0}^{t-1}\|s_{k+1} - \textcolor{yellow}{F}(s_k)\|^2$$

Use supervised learning with labels $$s_{k+1}$$ and features $$s_k$$!
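A sketch of this recipe with a linear-in-features model class (the feature map, dynamics, and noise scale are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(s):
    """Hypothetical feature map defining the class F: linear + quadratic + bias."""
    return np.concatenate([s, s**2, [1.0]])

# simulate a mildly nonlinear system s_{k+1} = F(s_k) + w_k
F_true = lambda s: np.array([0.9 * s[0] + 0.1 * s[1] ** 2, 0.9 * s[1]])
traj = [rng.normal(size=2)]
for _ in range(500):
    traj.append(F_true(traj[-1]) + 0.01 * rng.normal(size=2))

# supervised learning: features phi(s_k), labels s_{k+1}
Phi = np.stack([phi(s) for s in traj[:-1]])
S_plus = np.stack(traj[1:])
W, *_ = np.linalg.lstsq(Phi, S_plus, rcond=None)
F_hat = lambda s: phi(s) @ W  # learned one-step predictor
```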

## Linear sys id with state observation

$$\hat A = \arg\min_{A} ~~\sum_{k=0}^{t-1}\|s_{k+1} - As_k\|^2$$

$$= \arg\min_{A} \Big\|\underbrace{\begin{bmatrix} s_1^\top \\ \vdots \\ s_{t}^\top\end{bmatrix}}_{S_+} - \underbrace{\begin{bmatrix} s_0^\top \\ \vdots \\ s_{t-1}^\top\end{bmatrix}}_{S_-}A^\top\Big\|_F^2$$

$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1}$$
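A numerical sketch (system matrix and noise level invented for illustration; `lstsq` solves the same normal equations with better conditioning than forming the inverse):

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[0.9, 0.1], [0.0, 0.9]])

# simulate a trajectory s_{k+1} = A s_k + w_k
S = [rng.normal(size=2)]
for _ in range(200):
    S.append(A @ S[-1] + 0.1 * rng.normal(size=2))
S_minus = np.stack(S[:-1])  # rows s_0^T, ..., s_{t-1}^T
S_plus  = np.stack(S[1:])   # rows s_1^T, ..., s_t^T

# lstsq solves min ||S_- X - S_+||_F with X = A_hat^T
A_hat = np.linalg.lstsq(S_minus, S_plus, rcond=None)[0].T
print(np.round(A_hat, 2))   # approx [[0.9, 0.1], [0.0, 0.9]]
```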

## Linear sys id with state observation

$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1}$$

Example

Suppose there is no noise and we observe $$s_0 = [0, 1]$$, $$s_1 = [\frac{1}{2}, 0]$$, and $$s_2 = [0, \frac{1}{4}]$$. What is $$s_3$$? $$A$$?

What if $$s_0 = [0,1]$$, $$s_1 = [0, \frac{1}{2}]$$, and $$s_2 = [0, \frac{1}{4}]$$?
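One way to check this example numerically (a hedged sketch; in the second case $$S_-$$ is rank deficient, so `lstsq` silently returns just one of many consistent solutions):

```python
import numpy as np

# first case: S_- = [s_0; s_1] has rank 2, so A is pinned down
S_minus = np.array([[0.0, 1.0], [0.5, 0.0]])
S_plus  = np.array([[0.5, 0.0], [0.0, 0.25]])
A_hat = np.linalg.lstsq(S_minus, S_plus, rcond=None)[0].T
print(A_hat)                        # [[0, 0.5], [0.5, 0]]
print(A_hat @ np.array([0, 0.25]))  # s_3 prediction

# second case: every state lies on one line, so S_- has rank 1
# and many A are consistent with the data
S_minus = np.array([[0.0, 1.0], [0.0, 0.5]])
print(np.linalg.matrix_rank(S_minus))  # 1
```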

## Linear sys id with state observation

$$\hat A = S_+^\top S_-(S_-^\top S_-)^{-1}$$

• Unlike standard linear regression, the "features" $$(s_0,\dots,s_{t-1})$$ are not fixed or drawn independently
  • dependence: $$s_{k+1} = As_k+w_k$$
• Exercise: excess risk of fixed design (Hint: see Lecture 2)
  • Simple query model: suppose $$S_-$$ is populated with a fixed set of queries $$(s_0,\dots, s_{t-1})$$ and $$S_+$$ is populated with a single step of the model, $$(s_{0,+},\dots,s_{t-1,+})$$, where $$s_{i,+} = As_{i} + w_i$$
  • Show that if $$w_i$$ is i.i.d., zero mean, and $$\mathbb E[w_iw_i^\top]=\sigma^2I$$, the excess risk of the fixed design query model is $$\frac{\sigma^2n^2}{t}$$

## Noiseless linear sys id

What if we don't know exactly how the system evolves or how measurements are generated?

Even if $$(A,C)$$ is observable, there is no unique solution!

• $$y_0 = \textcolor{yellow}{Cs_0}$$
• $$s_1 = \textcolor{yellow}{As_0}$$
• ...
• $$s_t = \textcolor{yellow}{A}s_{t-1}$$
• $$y_t = \textcolor{yellow}{C}s_{t}$$

Example

• $$s_{t+1} = \frac{1}{2}s_t$$ from $$s_0=4$$ and $$y_t = s_t$$
• could just as easily be $$s_{t+1} = \frac{1}{2}s_t$$ from $$s_0=2$$ and $$y_t = 2s_t$$

## Noiseless linear sys id

Even if $$(A,C)$$ is observable, there is no unique solution!

• $$y_0 =\hat C\hat s_0$$
• $$\hat s_1 = \hat A\hat s_0$$
• ...
• $$\hat s_t = \hat A\hat s_{t-1}$$
• $$y_t = \hat C\hat s_{t}$$

Suppose $$\hat A, \hat C, \hat s_0$$ satisfy the equations

• $$y_0 =\hat C\textcolor{cyan}{MM^{-1}}\hat s_0$$
• $$\textcolor{cyan}{M^{-1}}\hat s_1 = \textcolor{cyan}{M^{-1}}\hat A\textcolor{cyan}{MM^{-1}}\hat s_0$$
• ...
• $$\textcolor{cyan}{M^{-1}}\hat s_t = \textcolor{cyan}{M^{-1}}\hat A\textcolor{cyan}{MM^{-1}}\hat s_{t-1}$$
• $$y_t = \hat C\textcolor{cyan}{MM^{-1}}\hat s_{t}$$

Then so does $$\tilde A = M^{-1}\hat AM$$, $$\tilde C = \hat CM$$, $$\tilde s_0 = M^{-1}\hat s_0$$, for any invertible $$M$$

## Example

The state $$s_t=[\theta_t,\omega_t]$$, output $$y_t=\theta_t$$, and

$$\theta_{t+1} = 0.9\theta_t+0.1\omega_t,\quad \omega_{t+1} = 0.9 \omega_t$$

What if we define $$\tilde s_0 = [\alpha\theta_0, \beta \omega_0]$$ ?

Then an equivalent representation is

$$\tilde \theta_{t+1} = 0.9\tilde \theta_t+0.1\frac{\alpha}{\beta}\tilde \omega_t,\quad \tilde \omega_{t+1} = 0.9 \tilde \omega_t$$

with output $$y_t =\begin{bmatrix}\frac{1}{\alpha} & 0\end{bmatrix} \tilde s_t$$

## Example

The state $$s_t=[\theta_t,\omega_t]$$, output $$y_t=\theta_t$$, process noise $$w_t$$, and

$$\theta_{t+1} = 0.9\theta_t+0.1\omega_t,\quad \omega_{t+1} = 0.9 \omega_t + w_t$$

What if we define $$\tilde s_0 = [\alpha\theta_0, \beta \omega_0]$$ ?

Then an equivalent representation is

$$\tilde \theta_{t+1} = 0.9\tilde \theta_t+0.1\frac{\alpha}{\beta}\tilde \omega_t,\quad \tilde \omega_{t+1} = 0.9 \tilde \omega_t+?$$

with output $$y_t =\begin{bmatrix}\frac{1}{\alpha} & 0\end{bmatrix} \tilde s_t$$

## Equivalent representations

Set of equivalent state space representations for all invertible and square $$M$$:

$$s_{t+1} = As_t + Bw_t,\qquad y_t = Cs_t+v_t$$

$$\tilde s_{t+1} = \tilde A\tilde s_t + \tilde B w_t,\qquad y_t = \tilde C\tilde s_t+v_t$$

where

$$\tilde s = M^{-1}s,\quad \tilde A = M^{-1}AM,\quad \tilde B = M^{-1}B,\quad \tilde C = CM$$

or equivalently

$$s = M\tilde s,\quad A = M\tilde AM^{-1},\quad B = M\tilde B,\quad C = \tilde CM^{-1}$$

*(Block diagrams: the original system $$(A,B,C)$$ and the transformed system $$(\tilde A,\tilde B,\tilde C)$$, driven by the same noise $$w_t, v_t$$, produce identical outputs $$y_t$$.)*
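A quick simulation check that the two realizations are indistinguishable from outputs (random invertible $$M$$; all matrices illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.eye(2)
C = np.array([[1.0, 0.0]])
M = rng.normal(size=(2, 2))  # invertible with probability one
Minv = np.linalg.inv(M)
At, Bt, Ct = Minv @ A @ M, Minv @ B, C @ M

# same noise realization drives both realizations
s = rng.normal(size=2)
st = Minv @ s                # \tilde s_0 = M^{-1} s_0
for _ in range(20):
    w, v = rng.normal(size=2), rng.normal(size=1)
    assert np.allclose(C @ s + v, Ct @ st + v)  # identical outputs
    s  = A @ s + B @ w
    st = At @ st + Bt @ w
print("outputs match for all 20 steps")
```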

## Linear system identification

Goal: identify dynamics $$A$$ and measurement model $$C$$ up to similarity transform

Recall the observability matrix

$$\mathcal O_t = \begin{bmatrix}C\\CA\\\vdots \\ CA^{t}\end{bmatrix},\qquad \tilde{\mathcal O}_t = \begin{bmatrix}\tilde C\\\tilde C\tilde A\\\vdots \\ \tilde C\tilde A^{t}\end{bmatrix}=\begin{bmatrix}CM\\CMM^{-1} AM\\\vdots \\ CMM^{-1} A^{t}M\end{bmatrix}=\mathcal O_t M$$

Approach: subspace identification of the rank $$n$$ column space of $$\mathcal O_t$$

Step 1: Estimate $$\mathcal O_t$$

## Linear system identification

Step 1: Estimate $$\mathcal O_t$$

$$\begin{bmatrix}y_0\\\vdots \\ y_t\end{bmatrix} =\mathcal O_t s_0,\qquad \begin{bmatrix}y_1\\\vdots \\ y_{t+1}\end{bmatrix} =\mathcal O_t s_{1},\qquad \dots,\qquad \begin{bmatrix}y_k\\\vdots \\ y_{t+k}\end{bmatrix} =\mathcal O_t s_{k}$$

## Linear system identification

Step 1: Estimate $$\mathcal O_t$$

$$\underbrace{\begin{bmatrix}y_0 & \dots &y_k\\\vdots & \ddots&\vdots\\ y_t&\dots& y_{t+k}\end{bmatrix}}_{Y_{t,k}} =\mathcal O_t \begin{bmatrix}s_0 & s_1 &\dots & s_k\end{bmatrix}$$

As long as $$s_{0:k}$$ form a rank $$n$$ matrix, the column space of $$\mathcal O_t$$ is the same as the column space of the matrix of outputs (Hankel matrix), so we set

$$\hat{\mathcal O}_t = U_n$$

where the SVD is given by $$Y_{t,k} = \begin{bmatrix}U_n & U_{\perp}\end{bmatrix} \begin{bmatrix}\Sigma_n & \\ & 0\end{bmatrix}V^\top$$

## Linear system identification

Step 2: Recover $$\hat A$$ and $$\hat C$$ from $$\hat{\mathcal O}_t$$

$$\hat{\mathcal O}_t = \begin{bmatrix}\hat{\mathcal O}_t [0]\\\hat{\mathcal O}_t [1]\\\vdots \\ \hat{\mathcal O}_t [t]\end{bmatrix}\approx \begin{bmatrix}C\\CA\\\vdots \\ CA^{t}\end{bmatrix}$$

So we set $$\hat C = \hat {\mathcal O}_t[0]$$ and choose $$\hat A$$ by solving

• $$\hat {\mathcal O}_t[0] \hat A = \hat {\mathcal O}_t[1]$$
• ...
• $$\hat {\mathcal O}_t[t-1] \hat A = \hat {\mathcal O}_t[t]$$

where $$\hat{\mathcal O}_t = U_n$$ from the SVD of the Hankel matrix $$\begin{bmatrix}y_0 & \dots &y_k\\\vdots & \ddots&\vdots\\ y_t&\dots& y_{t+k}\end{bmatrix}$$
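Putting the two steps together, a minimal noiseless sketch (dimensions, matrices, and trajectory length invented for illustration; since $$(\hat A,\hat C)$$ are recovered only up to similarity, we compare eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(6)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
n, t, k = 2, 4, 30

# noiseless trajectory long enough to fill the Hankel matrix
s, ys = rng.normal(size=n), []
for _ in range(t + k + 1):
    ys.append((C @ s)[0])
    s = A @ s
Y = np.array([[ys[i + j] for j in range(k + 1)] for i in range(t + 1)])

# Step 1: rank-n SVD of the Hankel matrix estimates the column space of O_t
U, _, _ = np.linalg.svd(Y)
O_hat = U[:, :n]

# Step 2: C_hat is the first (block) row; A_hat solves the shifted equations
C_hat = O_hat[:1]
A_hat = np.linalg.lstsq(O_hat[:-1], O_hat[1:], rcond=None)[0]

print(np.sort(np.abs(np.linalg.eigvals(A_hat))))  # both approx 0.9, matching A
```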

Next time: cutting-edge work on finite-sample guarantees for system identification!

## Recap

• Kalman filter recap
• Observability
• Steady-state linear filter
• System identification
• Equivalent realizations

Reference: Callier & Desoer, "Linear System Theory"; Verhaegen & Verdult, "Filtering and System Identification"

By Sarah Dean
