Learning and Control in the Presence of Observer Effects

Sarah Dean, Cornell University

IMSI Workshop, May 2026

Observer effects occur when there is coupling between actuation and observation

 

  • examples: electronic circuits, quantum wave collapse, human psychology, robotics, ...

Observer Effects

Example: Personalization

(diagram: recommender policy chooses recommended content \(u_t\); a user with unknown preference parameters \(\theta\) returns expressed preferences \(y_t\))

\(\mathbb E[y_t] = \theta^\top u_t  \)

approach: identify \(\theta\) sufficiently well to make good recommendations

Classically studied as an online decision problem (e.g. multi-armed bandits)


Example: Preference Dynamics

However, interests may be impacted by recommended content

(diagram: recommender policy chooses recommended content \(u_t\); preference state \(x_t\) produces expressed preferences \(y_t\) and updates to \(x_{t+1}\))

\(\mathbb E[y_t] =  u_t^\top  C x_t  \)

  • Dean & Morgenstern. Preference dynamics under personalized recommendations, EC'22
  • Chee et al. Harm Mitigation in Recommender Systems under User Preference Dynamics, KDD'24

$$x_{t+1} = Ax_t + Bu_t + w_t\\ y_t = u_t^\top Cx_t + v_t$$

 

  • input \(u\in\mathbb R^p\)
  • output \(y\in\mathbb R\)
  • state \(x\in\mathbb R^n\)
  • measurement noise \(v\in\mathbb R\)
  • process noise \(w\in\mathbb R^n\)

Setting: bilinearly observed linear dynamical system (BO-LDS)
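For concreteness, a BO-LDS trajectory can be simulated in a few lines (a minimal sketch; the matrices and noise scales below are illustrative placeholders, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, T = 2, 2, 50                        # state dim, input dim, horizon (illustrative)
A = np.array([[0.8, 0.1], [0.0, 0.7]])    # strictly stable: rho(A) < 1
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, n))

x = np.zeros(n)
us, ys = [], []
for t in range(T):
    u = rng.uniform(-1, 1, size=p)                     # bounded random-design inputs
    y = u @ C @ x + 0.1 * rng.standard_normal()        # bilinear observation + noise v_t
    x = A @ x + B @ u + 0.1 * rng.standard_normal(n)   # linear state update + noise w_t
    us.append(u)
    ys.append(y)
us, ys = np.array(us), np.array(ys)
```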

Outline

1. Identification

2. Separation Principle

(figure: input/output trajectories over time)

3. Optimal Control

(figure: Kalman Filter maps \(y \mapsto \hat x\); State Feedback maps \(\hat x \mapsto u\))

Outline

i) Setting

ii) Algorithm

iii) Results


1. Identification from Bilinear Observations

Problem Setting: Identification

  • Unknown dynamics and measurement matrices \(A, B, C\)
  • Observed trajectory of inputs \(u\in\mathbb R^p\) and outputs \(y\in\mathbb R\) $$u_0,y_0,u_1,y_1,...,u_T,y_T$$
  • Goal: identify dynamics and measurement models from data

e.g. playlist attributes

e.g. listen time

inputs \(u_t\)

 

\( \)

 

 

outputs \(y_t\)

Identification Algorithm

Input: data \((u_0,y_0,...,u_T,y_T)\), history length \(L\), state dim \(n\)

Step 1: Regression

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - u_t^\top \textstyle \sum_{k=1}^L G[k] u_{t-k} \big)^2 $$

Step 2: Decomposition \(\hat A,\hat B,\hat C = \mathrm{HoKalman}(\hat G, n)\)
(Oymak & Ozay, 2019)
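Step 1 is an ordinary least-squares problem once the degree-2 features are formed. A minimal sketch (assuming input/output arrays `us`, `ys`; the Ho-Kalman decomposition of Step 2 is omitted; the function name is mine):

```python
import numpy as np

def estimate_markov_params(us, ys, L):
    """Least-squares estimate of G = [CB, CAB, ..., CA^{L-1}B] from
    bilinear observations, y_t ~ (ubar_{t-1}^T kron u_t^T) vec(G)."""
    T, p = us.shape
    Z, Y = [], []
    for t in range(L, T):
        ubar = us[t-L:t][::-1].reshape(-1)     # [u_{t-1}; u_{t-2}; ...; u_{t-L}]
        Z.append(np.kron(ubar, us[t]))         # feature row ubar^T kron u_t^T
        Y.append(ys[t])
    g, *_ = np.linalg.lstsq(np.array(Z), np.array(Y), rcond=None)
    return g.reshape(p, p * L, order='F')      # column-major vec convention
```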


Joint work with Yahya Sattar and Yassir Jedra

Note: \(\begin{bmatrix} u_{t-1}^\top & ... & u_{t-L}^\top \end{bmatrix} \otimes u_t^\top\) are degree 2 polynomial features, so the model \(y_t \approx \bar u_{t-1}^\top \otimes u_t^\top \,\mathrm{vec}(G)\) is linear in \(\mathrm{vec}(G)\)

Estimation Errors

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - u_t^\top \textstyle \sum_{k=1}^L G[k] u_{t-k} \big)^2 $$

  • (Biased) estimate of Markov parameters $$ G  =\begin{bmatrix} C B & CA B & \dots & CA^{L-1} B \end{bmatrix} $$
  • Regress \(y_t\) against $$ \underbrace{ \begin{bmatrix} u_{t-1}^\top & ... & u_{t-L}^\top \end{bmatrix}}_{\bar u_{t-1}^\top } \otimes u_t^\top $$
  • Data matrix: circulant-like structure $$Z = \begin{bmatrix}\bar u_{L-1}^\top  \otimes u_L^\top \\ \vdots \\ \bar u_{T-1}^\top \otimes u_T^\top\end{bmatrix} $$


Estimation Result

Under the following assumptions:

  1. Process and measurement noise \(w_t,v_t\) are i.i.d., zero mean, and have bounded second moments
  2. Inputs \(u_t\) are bounded
  3. The dynamics are strictly stable, i.e. \(\rho(A)<1\)
  4. (For state space recovery: \((A,B,C)\) are observable, controllable)

Informal Summary Theorem

Choosing \(L=\log(T)/\log(\rho(A)^{-1})\) guarantees that with high probability, for bounded random design inputs \(u_{0:T}\), $$\mathrm{estimation~errors} \lesssim \sqrt{ \frac{\mathsf{poly}(\mathrm{dimension})}{T}}$$

Main Results

Assumptions:

  1. Process and measurement noise \(w_t,v_t\) are i.i.d., zero mean, and have bounded second moments
  2. Inputs \(u_t\) are bounded
  3. The dynamics are strictly stable, i.e. \(\rho(A)<1\)

Informal Theorem (Markov parameter estimation)

With probability at least \(1-\delta\), $$\|G-\hat G\|_{Z^\top Z} \lesssim \sqrt{ \frac{p^2 L}{\delta} \cdot c_{\mathrm{stability,noise}} }+ \rho(A)^L\sqrt{T} c_{\mathrm{stability}}$$


Proof Sketch

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - \bar u_{t-1}^\top \otimes u_t^\top \mathrm{vec}(G)  \big)^2 $$

  • Claim: this is a biased estimate of Markov parameters $$ G_\star  =\begin{bmatrix} C B & CA B & \dots & CA^{L-1} B \end{bmatrix} $$
    • Observe that \(x_t = \sum_{k=1}^L A^{k-1} (B u_{t-k} + w_{t-k}) + A^L x_{t-L}\)
    • Hence, \(y_t= \bar u_{t-1}^\top \otimes u_t^\top \mathrm{vec}(G_\star) +u_t^\top \textstyle \sum_{k=1}^L CA^{k-1} w_{t-k} + u_t^\top CA^L x_{t-L} + v_t \)
  • Least squares: for \(y_t = z_t^\top \theta + n_t\), the estimate  \(\hat\theta=\arg \min\sum_t (z_t^\top \theta - y_t)^2\) $$= \textstyle\arg \min  \|Z \theta - Y\|^2_2 = (Z^\top Z)^\dagger Z^\top Y= \theta_\star + (Z^\top Z)^\dagger Z^\top N$$
  • Estimation errors are therefore \(\|G_\star -\hat G\|_{Z^\top Z} = \|Z^\top N\| \)
  • Blocking technique to bound minimum singular value of \(Z\)


Informal Summary Theorem

Choosing \(L=\log(T)/\log(\rho(A)^{-1})\) guarantees that with high probability, for bounded random design inputs \(u_{0:T}\), $$\mathrm{estimation~errors} \lesssim \sqrt{ \frac{\mathsf{poly}(\mathrm{dimension})}{T}}$$

  • Analysis requires blocking technique for dependent covariates $$\begin{bmatrix} u_{t-1}^\top & ... & u_{t-L}^\top \end{bmatrix} \otimes u_t^\top$$
  • Open question: marginally stable \(\rho(A)=1\)?
    • Regressing additionally against past outputs \(y_{t-1:t-L}\) introduces higher order polynomial dependence (degree \(L\))


Outline

1. Identification

2. Separation Principle


3. Optimal Control


Outline

i) Control Setting

ii) SP Policy

2. Separation Principle for Control from Bilinear Observations


Problem Setting: Optimal Control

  • Linear state update with \(A\in\mathbb R^{n\times n}\), \(B\in\mathbb R^{n\times p}\) $$x_{t+1} = Ax_t + Bu_t + w_t $$
  • Bilinear measurements with \(C_i\in\mathbb R^{m\times n}\) $$y_t = \underbrace{\Big(C_0 + \sum_{i=1}^p u_t[i] C_i \Big)}_{C(u_t)}x_t + v_t$$
  • Quadratic costs with \(Q,R\succ 0\) $$c(x,u) = x^\top Q x + u^\top R u $$
  • Gaussian process noise, measurement noise, and initial state $$\{w_t\} \sim \mathcal N(0,\Sigma_w),\quad \{v_t\} \sim \mathcal N(0, \Sigma_v),\quad x_0 \sim \mathcal N(0,\Sigma_0)$$
  • Information set for decision-making $$\mathcal I_t = \{u_0,...,u_{t-1}, y_0, ..., y_{t-1}\}$$

Problem Setting: Optimal Control

$$\min_{u_t=\pi_t(\mathcal I_t)}~\mathbb E\left[x_T^\top Q x_T+ \sum_{t=1}^{T-1} x_t^\top Q x_t + u_t^\top R u_t \right]\\ \text{s.t.} \quad x_{t+1} = Ax_t + Bu_t + w_t, \qquad y_t =\Big(C_0 + \sum_{i=1}^p u_t[i] C_i \Big)x_t + v_t$$

Small departure from classic LQG control

Joint work with Sunmook Choi, Yahya Sattar, Yassir Jedra, Maryam Fazel, and Leo Maynard-Zhang

Separation Principle

  • Separation principle (SP): independently design estimation & control
    • Optimal for partially observed LQ control (\(C_i=0\) for \(i>0\))
  • The SP policy has two components:
    1. State estimation \(\hat x_t = \mathbb E[x_t|\mathcal I_t]\)
    2. State-dependent policy \(u_t = K^\star_t \hat x_t\)


Partial Observation LQ Control

Simplest problem: linear dynamics, quadratic cost, zero mean noise

minimize \(\mathbb{E}\left[ \sum_{t=0}^{T-1} x_t^\top Q x_t + u_t^\top R u_t\right]\)

          s.t.  \(x_{t+1} = Ax_t+Bu_t+w_t\)

                 \(y_{t} = Cx_t+v_t\)

Linear policy is optimal and can be computed in closed form (separation principle):

\(u_t =  K_t^\star \hat x_t,\quad \hat x_t = \mathbb E[x_t|\mathcal I_t]\)

  • where \(K^\star = \{K^\star_0,...,K^\star_T\}\) is defined recursively from \(A,B,Q,R\)
  • and \(\hat x_t\) depends on \(A,B,C,\Sigma_w,\Sigma_v,\Sigma_0\)
  • when noise is Gaussian, \(\hat x_t\) is computed efficiently with the Kalman filter


Separation Principle Policy

The posterior distribution is given by the Kalman filter: \(x_t|\mathcal I_t \sim \mathcal N(\hat x_t,\Sigma_t)\)

1. State Estimation with the Kalman Filter

\(\hat x_{t+1} = A\hat x_t + Bu_t - L_t\big(y_t-C(u_t)\hat x_t\big)\)

\(\Sigma_{t+1} = (A+ L_tC(u_t))\Sigma_tA^\top + \Sigma_w\)

\(L_t = -A\Sigma_tC(u_t)^\top(C(u_t)\Sigma_tC(u_t)^\top+\Sigma_v)^{-1}\)

2. State Feedback Control via LQR

$$u_t =  K_t^\star \hat x_t$$

where \(K^\star = \{K^\star_0,...,K^\star_T\}\) is defined recursively from \(A,B,Q,R\)

Open question: does \(\varepsilon\) estimation error in dynamics lead to performance degradation scaling with \(\varepsilon^2\) (as in LQG) or \(\varepsilon\)?

  • Conjecture: \(\varepsilon\) (due to \(\Sigma_t\)), thus certainty equivalent control is not efficient
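Both components of the SP policy can be sketched in code (a minimal sketch: `lqr_gains` runs the backward Riccati recursion for the \(K^\star_t\), and `kf_step` performs one filter update with the input-dependent \(C(u_t)\); the function names are mine, not from the talk):

```python
import numpy as np

def lqr_gains(A, B, Q, R, T):
    """Backward Riccati recursion; returns K_0, ..., K_{T-1} with u_t = K_t xhat_t."""
    P = Q.copy()
    Ks = []
    for _ in range(T):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A + A.T @ P @ B @ K
        Ks.append(K)
    return Ks[::-1]

def kf_step(xhat, Sig, u, y, A, B, C_of_u, Sig_w, Sig_v):
    """One Kalman filter update with input-dependent measurement matrix C(u_t)."""
    Cu = C_of_u(u)
    L = -A @ Sig @ Cu.T @ np.linalg.inv(Cu @ Sig @ Cu.T + Sig_v)
    xhat_next = A @ xhat + B @ u - L @ (y - Cu @ xhat)
    Sig_next = (A + L @ Cu) @ Sig @ A.T + Sig_w
    return xhat_next, Sig_next
```

Note that here, unlike in classical LQG, the covariance update depends on the chosen input through \(C(u_t)\).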

 

Outline

1. Identification

2. Separation Principle


3. Optimal Control


Outline

i) SP is not Optimal

ii) Belief-space MPC

3. Optimal Control from Bilinear Observations

Optimality Results

  • Theorem: For \(T\geq 2\), the optimal policy is not affine in the estimated state
    • as a consequence, the SP policy is not optimal
  • Theorem: There exist instances in which the SP policy locally maximizes the cost
    • in these instances, the optimal controller is nonlinear and not unique, i.e. for scalar system at \(t=T-2\), $$ u^\star_{t} = -\alpha\hat x_{t}\left(1\pm \frac{1}{K_{t} \hat x_{t}}\sqrt{-\frac{\Sigma_z}{\Sigma_{t}} +\beta K_{t}}\right) $$

Proof Sketch

  • Strategy: analyze solution to dynamic programming
  • At \(t=T\), the value function is \(V_T(x) = x^\top Q x\)
  • At \(t=T-1\), $$V_{T-1}(x_t)= \min_u \underbrace{ \mathbb E[ c(x_t, u) +V_T(x_{t+1})|\mathcal I_{T-1}]}_{f_{T-1}(u) = f_{T-1}^\mathrm{LQ}(u)}$$
    • The solution coincides with LQG $$u^\star_{T-1} = K^\star_{T-1}\mathbb E[x_{T-1}|\mathcal I_{T-1}]$$
  • At \(t=T-2\), due to dependence of state estimation on input $$ \min_u\underbrace{\mathbb E[ c(x_t, u) +V_{T-1}(x_{t+1})|\mathcal I_{T-2}]}_{f_{T-2}(u)  = f_{T-2}^\mathrm{LQ}(u) + f^\mathrm{obs}_{T-2}(u)} $$


Examples where SP is bad

$$\begin{align*} x_{t+1} &= \begin{bmatrix} 1 & 0.3 \\ 0 & 1\end{bmatrix} x_t + \begin{bmatrix}0.3 \\ 0 \end{bmatrix} u_t + w_t \\ y_t &= u_t\begin{bmatrix} 1 & 0\end{bmatrix} x_t + v_t \end{align*}$$

with \(Q=I\) and \(R=1000\)

$$\begin{align*} x_{t+1} &= x_t + \begin{bmatrix}0 & 1 \end{bmatrix} u_t + w_t \\ y_t &= u_t^\top \begin{bmatrix} 1\\ 0\end{bmatrix} x_t + v_t \end{align*}$$

with \(Q=\frac{1}{2}\) and \(R=I\)

\(\implies\) infinite horizon \(K_\star = \begin{bmatrix} 0 \\ \frac{1}{2}\end{bmatrix}\), so under the SP policy \(u_t = K_\star \hat x_t\),

\(y_t = \hat x_t K_\star^\top \begin{bmatrix} 1 \\ 0\end{bmatrix} x_t + v_t = 0\cdot x_t + v_t\): only noise is observed!
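The degeneracy in the second example can be verified numerically (a quick sketch of the computation on the slide):

```python
import numpy as np

K_star = np.array([0.0, 0.5])   # infinite-horizon LQR gain from the slide
c = np.array([1.0, 0.0])        # observation direction: y_t = u_t^T c x_t + v_t

xhat, x = 3.0, 2.7              # arbitrary state estimate and true state
u = K_star * xhat               # SP policy: u_t = K* xhat_t = [0, xhat/2]
signal = (u @ c) * x            # state-dependent part of the observation
# signal is exactly 0 for any xhat and x, so y_t = v_t: only noise is observed
```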

Belief Space Objective

Sunmook Choi, Yahya Sattar

$$\begin{align*}\min_{u_t=\pi_t(\hat x_t,\Sigma_t)}~~& \mathbb E\left[ \sum_{t=1}^{T-1} \hat x_t^\top Q \hat x_t + \mathrm{tr}(Q\Sigma_t) + u_t^\top R u_t \right]\\ \text{s.t.}~~& \hat x_{t+1} = A\hat x_t + Bu_t + L_t(C(u_t)\hat x_t - y_t)\\ &\Sigma_{t+1} = (A+ L_tC(u_t))\Sigma_tA^\top + \Sigma_w \end{align*}$$

$$\mathbb E\left[x_t^\top Q x_t \mid \mathcal I_t \right]= \hat x_t^\top Q\hat x_t + \mathrm{tr}(Q\Sigma_t)$$
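This identity follows by writing \(x_t = \hat x_t + e_t\) with \(e_t \mid \mathcal I_t \sim \mathcal N(0, \Sigma_t)\):

$$\mathbb E[x_t^\top Q x_t \mid \mathcal I_t] = \hat x_t^\top Q \hat x_t + 2\,\hat x_t^\top Q\, \underbrace{\mathbb E[e_t \mid \mathcal I_t]}_{=0} + \underbrace{\mathbb E[e_t^\top Q e_t \mid \mathcal I_t]}_{=\mathrm{tr}(Q\Sigma_t)}$$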

$$\begin{align*} \min_{u_t=\pi_t(\mathcal I_t)} ~~&\mathbb E\left[\sum_{t=1}^{T-1} x_t^\top Q x_t + u_t^\top R u_t \right]\\ \text{s.t.}~~& x_{t+1} = Ax_t + Bu_t + w_t \\& y_t =C(u_t)x_t + v_t\end{align*}$$

\(L_t = -A\Sigma_tC(u_t)^\top(C(u_t)\Sigma_tC(u_t)^\top+\Sigma_v)^{-1}\)

Andrew Lowitt, Beixi Du, Daniel Cao

Rewrite OCP in terms of the belief space from KF

This is now a fully observed, nonlinear, stochastic optimal control problem

innovations \(\sim \mathcal N(0, C(u_t)\Sigma_t C(u_t)^\top + \Sigma_v)\)

Belief Space MPC

$$\begin{align*}\pi(\hat x_t,\Sigma_t)=\arg \min_{u_k}~~& \mathbb E\left[ \sum_{k=1}^{H} \bar x_k^\top Q \bar x_k + \mathrm{tr}(Q\bar\Sigma_k) + u_k^\top R u_k \right]\\ \text{s.t.}~~& \bar x_{k+1} = A\bar x_k + Bu_k\\ &\bar\Sigma_{k+1} = (A+ L_kC(u_k))\bar\Sigma_kA^\top + \Sigma_w \\ &\bar x_0 = \hat x_t ,~~ \bar \Sigma_0=\Sigma_t\end{align*}$$

\(L_k = -A\bar\Sigma_kC(u_k)^\top(C(u_k)\bar\Sigma_kC(u_k)^\top+\Sigma_v)^{-1}\)

Rewrite OCP in terms of the belief space from KF

Expected dynamics, open loop inputs, finite horizon \(\rightarrow\) MPC

"Solve" MPC problem with autograd and L-BFGS

Summary

1. Identification

2. Separation Principle


3. Optimal Control


$$x_{t+1} = Ax_t + Bu_t + w_t\qquad  y_t = u_t^\top Cx_t + v_t$$

Stability is still crucial for identification

Bilinear observation changes the character of the learning and control problem

Certainty equivalent control may not be efficient

Optimal policy does not follow separation principle

What lessons did we learn about RL & ML-enabled control?

  1. Simple model-based approaches work (no need for model-free)
  2. Naive exploration is sufficient, or even no exploration
  3. No need to account for finite sample uncertainty*

Lessons from LQ Control

\(\implies\) Problem does not capture all issues of interest!

*Exceptions: low data regime, safety/actuation limits

Approach: ID then Control

1. Collect \(N\) observations and estimate \(\widehat A,\widehat B, \widehat C\)

2. Design policy as if estimate is true ("certainty equivalent")

 

(diagram: true system \((A_\star, B_\star,C_\star)\) generates data \(\to\) certainty-equivalent policy \(\widehat \pi\) \(\to\) deployed on \((A_\star, B_\star,C_\star)\))

Control Result (Informal):

sub-opt. of \(\widehat \pi\lesssim(\)param. err.\()^2 \lesssim \frac{1}{N}\)

Learning Result (Informal):

parameter error \( \lesssim \frac{1}{\sqrt{N}}\)

least squares regression

Naive exploration is essentially optimal!

white noise inputs

  • Learning Linear Dynamics from Bilinear Observations, ACC'25 (arXiv:2409.16499), with Yahya Sattar and Yassir Jedra
  • Explore-then-Commit for Nonstationary Linear Bandits with Latent Dynamics, AISTATS'26 (arXiv:2510.16208), with Sunmook Choi, Yahya Sattar, Yassir Jedra, Maryam Fazel
  • Sub-optimality of the Separation Principle for Quadratic Control from Bilinear Observations, CDC'25 (arXiv:2504.11555), with Yahya Sattar, Sunmook Choi, Yassir Jedra, Maryam Fazel
  • Dual Control of Linear Systems from Bilinear Observations with Belief Space Model Predictive Control (arXiv:2604.24663), with Daniel Cao, Beixi Du, Andrew Lowitt, Sunmook Choi, Yahya Sattar

Thanks! Questions?

Sunmook Choi

Yahya Sattar

Yassir Jedra

Maryam Fazel

Leo Maynard-Zhang

Andrew Lowitt

Beixi Du

Daniel Cao
