Control with Partial Observation

ML in Feedback Sys #14

Fall 2025, Prof Sarah Dean

Control by Separation Principle

"What we do"

  • Given a state feedback policy \(\pi:\mathcal S\to\mathcal A\) (e.g. MPC) and a state estimation filtering algorithm
  • For \(t=1,2,...\)
    • Observe the output \(y_t\)
    • Update the state estimate to \(\hat s_{t|t}\) using \(y_t\) according to the filtering algorithm
    • Select an action \(a_t = \pi(\hat s_{t|t})\)
    • Update the state prediction \(\hat s_{t+1|t}\) using \(a_t\) according to the filtering algorithm (see the sketch below)
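A minimal Python sketch of this loop, assuming hypothetical `policy`, `filter_update`, and `filter_predict` functions and an environment exposing `observe` and `step` (none of these interfaces are specified in the lecture):

```python
def control_with_partial_observation(env, policy, filter_update, filter_predict,
                                     s_pred, T):
    """Separation-principle loop: filter the output, then act on the estimate.

    `env`, `policy`, `filter_update`, and `filter_predict` are assumed
    interfaces; any state-feedback policy (e.g. MPC) and any filtering
    algorithm (e.g. a Kalman filter) can be plugged in.
    """
    for t in range(T):
        y = env.observe()                   # observe the output y_t
        s_est = filter_update(s_pred, y)    # update estimate \hat s_{t|t}
        a = policy(s_est)                   # select a_t = pi(\hat s_{t|t})
        env.step(a)                         # apply the action
        s_pred = filter_predict(s_est, a)   # predict \hat s_{t+1|t} using a_t
```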

Control by Separation Principle

"Why we do it"

  • Fact 1: For partially observed linear-quadratic (LQ) control, the separation principle policy is optimal, i.e. $$\pi_t^\star(a_{0:t-1},y_{0:t}) = K_t^\star \mathbb E[s_t|a_{0:t-1},y_{0:t}] = \pi^{LQ}_t(\hat s_{t|t})$$
  • Fact 2: For LQ control with Gaussian noise (LQG), the Kalman filter is the optimal filtering algorithm
  • Fact 3: For LQG, the steady state optimal policy is described by control gains \(K_\star\) and filter gains \(L_\star\), resulting in linear time-invariant dynamics with   $$a_t = K_\star \hat s_{t},\quad  \hat s_{t+1} = F\hat s_t + G a_t + L_\star (y_{t+1}-H(F\hat s_t + Ga_t))  $$
  • Fact 4: The separation principle is convenient, but in general it is not optimal

Partially observed LQ control

  • Linear dynamics, linear observations, quadratic costs
  • Stochastic and independent noise $$\mathbb E[w_k] = 0,~~\mathbb E[w_kw_k^\top] =\sigma_w^2 I,~~\mathbb E[v_k] = 0,~~\mathbb E[v_kv_k^\top] = \sigma_v^2 I$$
  • Information structure: policy at time \(t\) can depend on $$\mathcal I_t =\{ y_0, a_0, y_1, a_1, ..., a_{t-1}, y_t\}$$

PO-LQ Optimal Control Problem

$$ \min_{\pi_{0:T}} ~~\mathbb E_{w,v}\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t.}\quad s_{k+1} = F s_k+ Ga_k+w_k $$

$$a_k=\pi_k(a_{0:k-1}, y_{0:k}) $$

$$y_k=Hs_k+v_k $$

Partially observed optimal control

Dynamic Programming Algorithm

  • Initialize \(J_{T+1}^\star (\mathcal I_{T+1}) = 0\)
  • For \(k=T,T-1,\dots,0\):
    • Compute \(J_k^\star (\mathcal I_{k}) = \min_{a\in\mathcal A} \mathbb E[c(s_k, a)+ J_{k+1}^\star (\mathcal I_{k+1})| \mathcal I_{k}, a ]\)
    • Record minimizing argument as \(\pi_k^\star(\mathcal I_{k})\)

Reference: Ch 4 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas

By the principle of optimality, the resulting policy is optimal.

Information state \(\mathcal I_k\) acts as our state

The information state evolves depending on \(a_k, w_k, v_{k+1}\)

$$\{ y_0, a_0, ..., a_{k-1}, y_k\} \to \{ y_0, a_0, ..., a_{k}, y_{k+1}\}$$

Dynamic Programming for PO-LQ

  • The cost-to-go has the form $$J^\star_k(\mathcal I_k) = \mathbb E\Big[s_k^\top P_k s_k + (s_k-\mu_k)^\top P_k(s_k-\mu_k) + w_k^\top Q w_k\Big|\mathcal I_k\Big]$$ where \(\mu_k = \mathbb E[s_k|\mathcal I_k]\)
  • Lemma 4.2.1: the difference \(s_k - \mu_k\) depends on \(s_0,w_0,w_1,...,w_{k-1},v_0,v_1,...,v_k\) and is entirely independent of actions
    • this is a separation between estimation and control
  • As a result, only the first term in the cost-to-go will depend on the action, so the DP iterations proceed exactly as in the state observed case

DP: \(J_k^\star (\mathcal I_{k}) = \min_{a\in\mathcal A} \mathbb E[c(s_k, a)+ J_{k+1}^\star (\mathcal I_{k+1})| \mathcal I_{k}, a ]\)

Reference: Ch 4 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas

Separation in PO-LQ control

Theorem:  The optimal policy is linear in the estimated state
\(\pi_t^\star(\mathcal I_t) = K_t \mathbb E[s_t|\mathcal I_t]\) and coincides with the
optimal state feedback policy

  • \(K_t = -(R+G^\top P_{t+1}G)^{-1}G^\top P_{t+1}F\)
  • \(P_t = Q+F^\top P_{t+1}F - F^\top P_{t+1}G(R+G^\top P_{t+1}G)^{-1}G^\top P_{t+1}F\)
  • Define \(P_t = Ricc(P_{t+1}, F, G, Q, R)\) as shorthand for this recursion (see the sketch below)

Reference: Ch 4 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas
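As a concrete illustration, here is a minimal numpy sketch of the backward Riccati recursion and the resulting time-varying gains; the terminal condition (taken to be \(Q_T\)) and any matrices a user passes in are illustrative assumptions.

```python
import numpy as np

def riccati_step(P_next, F, G, Q, R):
    """One step of the backward recursion, P_t = Ricc(P_{t+1}, F, G, Q, R)."""
    GPF = G.T @ P_next @ F
    return Q + F.T @ P_next @ F - GPF.T @ np.linalg.solve(R + G.T @ P_next @ G, GPF)

def finite_horizon_gains(F, G, Q, R, Q_T, T):
    """Backward pass producing K_0, ..., K_{T-1} (boundary condition Q_T assumed)."""
    P = Q_T
    gains = []
    for _ in range(T):
        # K_t uses P_{t+1}, which is the current value of P in the backward pass
        gains.append(-np.linalg.solve(R + G.T @ P @ G, G.T @ P @ F))
        P = riccati_step(P, F, G, Q, R)
    return gains[::-1]
```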

Linear Quadratic Gaussian

Fact from Lecture 8: if \(w_k\) and \(v_k\) are Gaussian random variables, then the Kalman filter computes the posterior distribution of the latent state $$P(s_t | y_0,\dots,y_t, a_0,\dots,a_{t-1}) =\mathcal N(\hat s_{t|t}, P_{t|t})$$

Kalman filter at \(t\)

  • Extrapolate \(\hat s_{t\mid t-1} =F\hat s_{t-1\mid t-1} + Ga_{t-1}\) $$P_{t\mid t-1} = FP_{t-1\mid t-1} F^\top + \Sigma_w$$
  • Compute gain \(L_{t} = P_{t\mid t-1}H^\top ( HP_{t\mid t-1} H^\top+\Sigma_v)^{-1}\)
  • Update \(\hat s_{t\mid t} = \hat s_{t\mid t-1}+ L_{t}(y_{t}-H\hat s_{t\mid t-1})\) $$P_{t\mid t} = (I - L_{t}H)P_{t\mid t-1}$$
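For concreteness, a numpy sketch of one filter step matching the extrapolate/gain/update equations above (function and variable names are our own):

```python
import numpy as np

def kalman_step(s_est, P_est, a_prev, y, F, G, H, Sigma_w, Sigma_v):
    """One Kalman filter step: extrapolate with a_{t-1}, then correct with y_t."""
    # Extrapolate
    s_pred = F @ s_est + G @ a_prev                       # \hat s_{t|t-1}
    P_pred = F @ P_est @ F.T + Sigma_w                    # P_{t|t-1}
    # Compute gain
    L = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + Sigma_v)
    # Update
    s_new = s_pred + L @ (y - H @ s_pred)                 # \hat s_{t|t}
    P_new = (np.eye(len(s_new)) - L @ H) @ P_pred         # P_{t|t}
    return s_new, P_new
```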

Linear Quadratic Gaussian

Kalman filter at \(t\)

  • State estimate $$\hat s_{t\mid t} = (F\hat s_{t-1\mid t-1} +{Ga_{t-1}})+ L_{t}(y_{t}-H(F\hat s_{t-1\mid t-1} +{Ga_{t-1}}))$$
  • With gain \(L_{t} = P_{t\mid t-1}H^\top ( HP_{t\mid t-1} H^\top+\Sigma_v)^{-1}\)
  • where \(P_{t\mid t-1} = FP_{t-1\mid t-1} F^\top + \Sigma_w\) and \(P_{t\mid t} = (I - L_{t}H)P_{t\mid t-1}\)

Theorem:  The optimal policy is linear in the KF state
\(\pi_t^\star(\mathcal I_t) = K_t \hat s_{t|t}\) and coincides with the
optimal state feedback policy

  • \(K_t = -(R+G^\top P_{t+1}G)^{-1}G^\top P_{t+1}F\)
  • \(P_t = Ricc(P_{t+1}, F, G, Q, R)\)

Steady state LQG

  • Assume \((F,G)\) controllable, \((F,H)\) observable, \(Q,\Sigma_w\succ 0\)
  • Feedback policy steady state
    • A fixed point exists \(P_\star = Ricc(P_{\star}, F, G, Q, R)\)
    • If \(Q_T = P_\star\) or \(T\to\infty\) then \(K_\star = -(R+G^\top P_{\star}G)^{-1}G^\top P_{\star}F\)
  • Kalman filter steady state
    • Combining predict and update equations, $$P_{t+1|t} = Ricc(P_{t|t-1}, F^\top, H^\top, \Sigma_w, \Sigma_v)$$
    • Similarly, a fixed point exists \(\Sigma_{\star} = Ricc(\Sigma_{\star}, F^\top, H^\top, \Sigma_w, \Sigma_v)\)
    • Fact: if \(P_{0|-1} = \Sigma_\star\) or \(t\to\infty\) then \(L_{\star} = \Sigma_{\star}H^\top ( H\Sigma_{\star} H^\top+\Sigma_v)^{-1}\)
  • Fact 3: For LQG, the steady state optimal policy is described by control gains \(K_\star\) and filter gains \(L_\star\), resulting in LTI dynamics  $$a_t = K_\star \hat s_{t},\quad  \hat s_{t+1} = F\hat s_t + G a_t + L_\star (y_{t+1}-H(F\hat s_t + Ga_t))  $$

For a fixed point of the Riccati equation $$P_\star = Ricc(P_{\star}, F, G, Q, R)$$ to exist, it is sufficient that

  • \((F,G)\) is stabilizable
    • guaranteed if controllable
  • \((F,Q^{1/2})\) is detectable
    • guaranteed if observable

Steady state LQG

  • Assume \((F,G)\) controllable, \((F,H)\) observable, \(Q,\Sigma_w\succ 0\)
  • Feedback policy
    • \(P_\star = Ricc(P_{\star}, F, G, Q, R)\)
    • \(K_\star = -(R+G^\top P_{\star}G)^{-1}G^\top P_{\star}F\)
  • Kalman filter
    • \(\Sigma_{\star} = Ricc(\Sigma_{\star}, F^\top, H^\top, \Sigma_w, \Sigma_v)\)
    • \(L_{\star} = \Sigma_{\star}H^\top ( H\Sigma_{\star} H^\top+\Sigma_v)^{-1}\)
  • Fact 3: For LQG, the steady state optimal policy is described by control gains \(K_\star\) and filter gains \(L_\star\), resulting in LTI dynamics  $$a_t = K_\star \hat s_{t},\quad  \hat s_{t+1} = F\hat s_t + G a_t + L_\star (y_{t+1}-H(F\hat s_t + Ga_t))  $$
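A numpy sketch that approximates the steady-state gains by iterating both Riccati equations and then simulates the resulting LTI closed loop; the iteration count, initializations, and random seed are illustrative assumptions.

```python
import numpy as np

def riccati(P, F, G, Q, R):
    """Generic Ricc(P, F, G, Q, R), reused for the control and filter recursions."""
    GPF = G.T @ P @ F
    return Q + F.T @ P @ F - GPF.T @ np.linalg.solve(R + G.T @ P @ G, GPF)

def steady_state_lqg(F, G, H, Q, R, Sigma_w, Sigma_v, iters=1000):
    """Iterate to approximate fixed points P_*, Sigma_*, then form K_* and L_*."""
    P, Sigma = Q.copy(), Sigma_w.copy()
    for _ in range(iters):
        P = riccati(P, F, G, Q, R)                          # control Riccati
        Sigma = riccati(Sigma, F.T, H.T, Sigma_w, Sigma_v)  # filter (dual) Riccati
    K = -np.linalg.solve(R + G.T @ P @ G, G.T @ P @ F)
    L = Sigma @ H.T @ np.linalg.inv(H @ Sigma @ H.T + Sigma_v)
    return K, L

def simulate_closed_loop(F, G, H, K, L, Sigma_w, Sigma_v, s0, T, seed=0):
    """Run a_t = K * s_hat_t with the steady-state filter recursion from Fact 3."""
    rng = np.random.default_rng(seed)
    s, s_hat, traj = s0.copy(), np.zeros_like(s0), []
    for _ in range(T):
        a = K @ s_hat
        s = F @ s + G @ a + rng.multivariate_normal(np.zeros(len(s)), Sigma_w)
        y = H @ s + rng.multivariate_normal(np.zeros(H.shape[0]), Sigma_v)
        s_hat = F @ s_hat + G @ a + L @ (y - H @ (F @ s_hat + G @ a))
        traj.append(s)
    return np.array(traj)
```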

Failure of Separation Principle

  • In general, optimal policies do not exhibit the separation principle
  • Even in the simple case of linear dynamics, Gaussian noise, and a general cost
    • KF still computes the state posterior \(\mathcal N(\hat s_{t|t}, P_{t|t})\)
  • Example: scalar \(s_{t+1} = s_t+a_t+w_t\) with quartic cost $$\min_a \mathbb E[s_{t+1}^4] + 2a^2$$
    • For \(s\sim\mathcal N(\mu, \sigma^2)\), we have \(\mathbb E[s^4] =\mu^4+6\mu^2\sigma^2 + 3\sigma^4\)
    • Then, with \(\sigma^2\) the conditional variance of \(s_{t+1}\) given \(\mathcal I_t\), the minimization becomes $$\min_a (\hat s+a)^4+6(\hat s+a)^2\sigma^2 + 3\sigma^4+2a^2$$
    • So the optimal action satisfies $$a^3+3\hat s a^2+(3\hat s^2+3\sigma^2+1)a+(\hat s^3+3\hat s\sigma^2)=0$$
    • Numerically solving, we see the effect of \(\sigma\) on the optimal action (see the sketch below)
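A quick numerical check of this effect, using the cubic stationarity condition above; the values of \(\hat s\) and \(\sigma\) below are illustrative assumptions.

```python
import numpy as np

def optimal_action(s_hat, sigma):
    """Unique real root of a^3 + 3*s*a^2 + (3*s^2 + 3*sig^2 + 1)*a + (s^3 + 3*s*sig^2) = 0."""
    coeffs = [1.0, 3 * s_hat, 3 * s_hat**2 + 3 * sigma**2 + 1, s_hat**3 + 3 * s_hat * sigma**2]
    roots = np.roots(coeffs)
    # The objective is strictly convex in a, so exactly one stationary point is real
    return roots[np.argmin(np.abs(roots.imag))].real

for sigma in [0.0, 0.5, 1.0, 2.0]:   # illustrative noise levels
    print(f"sigma={sigma}: a* = {optimal_action(s_hat=1.0, sigma=sigma):.3f}")
```

For \(\hat s = 1\), the magnitude of the optimal action grows with \(\sigma\) (toward \(-\hat s\) in the large-noise limit), so the optimal action depends on the posterior variance and not only on the mean \(\hat s\).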

Failure of Separation Principle

  • In general, optimal policies do not exhibit the separation principle
  • Even in the simple case of linear dynamics, Gaussian noise, and a general cost
    • KF still computes the state posterior \(\mathcal N(\hat s_{t|t}, P_{t|t})\)
  • Example: scalar \(s_{t+1} = s_t+a_t(1+w_t)\) with multiplicative noise for \(\sigma_w=1\) and quartic cost $$\min_a \mathbb E[s_{t+1}^4] + 2a^2$$
    • For \(s\sim\mathcal N(\mu, \sigma^2)\), we have \(\mathbb E[s^4] =\mu^4+6\mu^2\sigma^2 + 3\sigma^4\)
    • Then, since \(s_{t+1}\mid\mathcal I_t\sim\mathcal N(\hat s + a,\ p^2+a^2)\) with \(p^2\) the posterior variance of \(s_t\), the minimization becomes $$\min_a (\hat s+a)^4+6(\hat s+a)^2(p^2+a^2) + 3(p^2+a^2)^2+2a^2$$
    • Numerically solving, we see even more complex effects (see the sketch below)
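A grid-search sketch for the multiplicative-noise objective; the values of \(\hat s\) and \(p\) are illustrative assumptions.

```python
import numpy as np

def objective(a, s_hat, p):
    """E[s_{t+1}^4] + 2a^2 with s_{t+1} ~ N(s_hat + a, p^2 + a^2), i.e. sigma_w = 1."""
    mu, var = s_hat + a, p**2 + a**2
    return mu**4 + 6 * mu**2 * var + 3 * var**2 + 2 * a**2

a_grid = np.linspace(-5, 5, 200001)
for p in [0.0, 0.5, 1.0, 2.0]:   # illustrative posterior standard deviations
    a_star = a_grid[np.argmin(objective(a_grid, s_hat=1.0, p=p))]
    print(f"p={p}: a* = {a_star:.3f}")
```

Here the action itself injects noise through the \(a^2\) term in the variance, so the optimal action trades off driving the mean toward zero against amplifying uncertainty.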

Classes of Policies

  1. Closed Loop: a policy choosing actions with knowledge of \(F,H,c\) and noise characteristics over the full horizon \(0,..., T\)
  2. Feedback: a policy choosing actions at \(t\) with knowledge of
    1. \(F,c\) and process noise characteristics over the full horizon
    2. \(H\) and measurement noise characteristics over the horizon \(0,...,t\)
    • Certainty Equivalent: a policy depending only on the estimated state
    • contrast with Uncertainty Aware: a policy depending on posterior state distribution
  3. Open Loop: a policy choosing actions with knowledge of \(F,c\) over the full horizon but no knowledge about \(H\) or measurement noise

Stochastic Optimal Control Problem

$$ \min_{\pi_{0:T}}~~ \mathbb E_{w,v}\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big ]\quad \text{s.t.}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$

$$a_k=\pi_k(a_{0:k-1}, y_{0:k}) $$

$$y_k=H(s_k,v_k) $$

Classes of Policies

  1. Closed Loop: gold standard for optimality, may exhibit "probing" or "active learning" behavior to reduce uncertainties
  2. Feedback: common in practice (especially certainty equivalent), may be overly cautious (especially uncertainty aware) because it cannot plan for future uncertainty reduction
  3. Open Loop: "eyes closed" policy does not respond to new observations

Reference: Ch 4 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas

Recap

  • Separation principle
  • Steady state LQG

Next time: adaptive control

Announcements

  • Project proposal due tomorrow, no assignment over Fall break