Sarah Dean
Assistant Professor of Computer Science, Cornell University
IMSI Workshop, May 2026
Observer effects occur when there is coupling between actuation and observation
[Diagram: a recommender policy selects recommended content \(u_t\); a user with unknown preference parameters \(\theta\) returns expressed preferences \(y_t\), where \(\mathbb E[y_t] = \theta^\top u_t\).]
Approach: identify \(\theta\) sufficiently well to make good recommendations.
Classically studied as an online decision problem (e.g. multi-armed bandits).
However, interests may be impacted by recommended content
[Diagram: the recommender policy selects recommended content \(u_t\); the user's preference state \(x_t\) generates expressed preferences \(y_t\) with \(\mathbb E[y_t] = u_t^\top C x_t\), and the recommended content also updates the state to \(x_{t+1}\).]
Setting: bilinearly observed linear dynamical system (BO-LDS)
$$x_{t+1} = Ax_t + Bu_t + w_t\\ y_t = u_t^\top Cx_t + v_t$$
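A quick simulation makes the setting concrete. The sketch below rolls out a BO-LDS under random design inputs; the matrices are illustrative placeholders, not the ones from the talk.

```python
import numpy as np

# Sketch: simulate a bilinearly observed linear dynamical system (BO-LDS)
#   x_{t+1} = A x_t + B u_t + w_t,   y_t = u_t^T C x_t + v_t.
# All matrices below are illustrative placeholders.
rng = np.random.default_rng(0)
n, p = 2, 2                                  # state and input dimensions
A = np.array([[0.9, 0.1], [0.0, 0.9]])       # stable: rho(A) < 1
B = np.eye(n)
C = np.array([[1.0, 0.0], [0.0, 1.0]])

def step(x, u, noise_std=0.1):
    """One BO-LDS transition: returns (next state, scalar observation)."""
    y = u @ C @ x + noise_std * rng.standard_normal()    # bilinear in (u, x)
    x_next = A @ x + B @ u + noise_std * rng.standard_normal(n)
    return x_next, y

x = np.zeros(n)
traj = []
for t in range(100):
    u = rng.standard_normal(p)               # random design inputs
    x, y = step(x, u)
    traj.append((u, y))
```

Note that the observation is scalar even though the state is not: information about \(x_t\) only arrives through the direction selected by \(u_t\).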
Outline:
1. Identification
2. Separation Principle
3. Optimal Control
[Icons: an input/output time-series plot; a block diagram in which a Kalman Filter maps \(y\) to \(\hat x\) and State Feedback maps \(\hat x\) to \(u\)]
1. Identification from Bilinear Observations
[Plot: inputs \(u_t\) (e.g. playlist attributes) and outputs \(y_t\) (e.g. listen time) over time]
Input: data \((u_0,y_0,...,u_T,y_T)\), history length \(L\), state dim \(n\)
Step 1: Regression
$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - u_t^\top \textstyle \sum_{k=1}^L G[k] u_{t-k} \big)^2 $$
Step 2: Decomposition \(\hat A,\hat B,\hat C = \mathrm{HoKalman}(\hat G, n)\)
(Oymak & Ozay, 2019)
[Plot: a length-\(L\) window of past inputs ending at time \(t\)]
With Yahya Sattar and Yassir Jedra
The inner sum is linear in \(G\): stacking the history as \(\bar u_{t-1} = \begin{bmatrix} u_{t-1}^\top & \cdots & u_{t-L}^\top \end{bmatrix}^\top\),
$$u_t^\top \textstyle\sum_{k=1}^L G[k] u_{t-k} = \big(\bar u_{t-1}^\top \otimes u_t^\top\big)\, \mathrm{vec}(G),$$
so the regression is linear least squares over degree-2 polynomial features of the inputs:
$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - \bar u_{t-1}^\top \otimes u_t^\top \,\mathrm{vec}(G) \big)^2 $$
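A small numerical sketch of this regression step, on synthetic data with illustrative dimensions: observations are generated as \(y_t = u_t^\top G\, \bar u_{t-1} + \text{noise}\), and \(\mathrm{vec}(G)\) is recovered by least squares on the Kronecker features.

```python
import numpy as np

# Sketch: Step 1 regression via the Kronecker rewrite
#   y_t = (ubar_{t-1}^T kron u_t^T) vec(G) + noise,
# where ubar_{t-1} = [u_{t-1}; ...; u_{t-L}]. Data are synthetic.
rng = np.random.default_rng(1)
p, L, T = 2, 3, 500
G_true = rng.standard_normal((p, p * L))     # stacked blocks [G[1] ... G[L]]
us = rng.standard_normal((T + 1, p))
ys = np.zeros(T + 1)
for t in range(L, T + 1):
    ubar = us[t - L:t][::-1].ravel()         # [u_{t-1}; ...; u_{t-L}]
    ys[t] = us[t] @ G_true @ ubar + 0.01 * rng.standard_normal()

# Design matrix Z: row t holds the degree-2 feature ubar_{t-1}^T kron u_t^T
Z = np.stack([np.kron(us[t - L:t][::-1].ravel(), us[t])
              for t in range(L, T + 1)])
vecG, *_ = np.linalg.lstsq(Z, ys[L:], rcond=None)
G_hat = vecG.reshape(p, p * L, order="F")    # invert column-major vec(G)
```

The `order="F"` reshape matches the column-stacking convention in the identity \((a^\top \otimes b^\top)\,\mathrm{vec}(X) = b^\top X a\). Step 2 (Ho-Kalman) would then factor \(\hat G\) into \(\hat A,\hat B,\hat C\).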
Theorem (informal). Let \(Z = \begin{bmatrix}\bar u_{L-1}^\top \otimes u_L^\top \\ \vdots \\ \bar u_{T-1}^\top \otimes u_T^\top\end{bmatrix}\) denote the design matrix of the regression for \(\hat G\). Under stability and noise assumptions, with probability at least \(1-\delta\),
$$\|G-\hat G\|_{Z^\top Z} \lesssim \sqrt{ \frac{p^2 L}{\delta} \cdot c_{\mathrm{stability,noise}} }+ \rho(A)^L\sqrt{T}\, c_{\mathrm{stability}}$$
Choosing \(L=\log(T)/\log(\rho(A)^{-1})\) guarantees that with high probability, for bounded random design inputs \(u_{0:T}\),
$$\mathrm{estimation~errors} \lesssim \sqrt{ \frac{\mathsf{poly}(\mathrm{dimension})}{T}}$$
2. Separation Principle for Control from Bilinear Obs
[Block diagram: Kalman Filter maps \(y\) to \(\hat x\); State Feedback maps \(\hat x\) to \(u\)]
$$\min_{u_t=\pi_t(\mathcal I_t)} \mathbb E\left[x_T^\top Q x_T+ \sum_{t=1}^{T-1} x_t^\top Q x_t + u_t^\top R u_t \right]\\ \text{s.t.} \quad x_{t+1} = Ax_t + Bu_t + w_t \\\qquad\qquad\qquad y_t =\Big(C_0 + \sum_{i=1}^p u_t[i] C_i \Big)x_t + v_t$$
Small departure from classic LQG control
With Sunmook Choi, Yahya Sattar, Yassir Jedra, Maryam Fazel, and Leo Maynard-Zhang
Simplest problem: linear dynamics, quadratic cost, zero-mean noise:
minimize \(\mathbb{E}\left[ \sum_{t=0}^{T-1} x_t^\top Q x_t + u_t^\top R u_t\right]\)
s.t. \(x_{t+1} = Ax_t+Bu_t+w_t\)
\(\qquad y_{t} = Cx_t+v_t\)
A linear policy is optimal and can be computed in closed form (separation principle):
\(u_t = K_t^\star \hat x_t,\quad \hat x_t = \mathbb E[x_t|u_0,...,u_t,y_0,...,y_t]\)
The posterior distribution is given by the Kalman filter: \(x_t|\mathcal I_t \sim \mathcal N(\hat x_t,\Sigma_t)\)
Open question: does \(\varepsilon\) estimation error in dynamics lead to performance degradation scaling with \(\varepsilon^2\) (as in LQG) or \(\varepsilon\)?
1. State Estimation with the Kalman Filter
\(\hat x_{t+1} = A\hat x_t + Bu_t - L_t\big(y_t-C(u_t)\hat x_t\big)\)
\(\Sigma_{t+1} = (A+ L_tC(u_t))\Sigma_tA^\top + \Sigma_w\)
\(L_t = -A\Sigma_tC(u_t)^\top(C(u_t)\Sigma_tC(u_t)^\top+\Sigma_v)^{-1}\)
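These updates are a small modification of the standard predictor-form Kalman filter: the observation matrix is re-evaluated at each step's input. A sketch with an illustrative system (the gain `K` below equals \(-L_t\) in the slide's sign convention):

```python
import numpy as np

# Sketch: Kalman filter with input-dependent observation map C(u).
def kf_step(x_hat, Sigma, u, y, A, B, C_of_u, Sigma_w, Sigma_v):
    """One predictor-form update for y_t = C(u_t) x_t + v_t (scalar y)."""
    Cu = C_of_u(u)                                   # shape (1, n), depends on u_t
    S = Cu @ Sigma @ Cu.T + Sigma_v                  # innovation covariance
    K = A @ Sigma @ Cu.T @ np.linalg.inv(S)          # gain (= -L_t on the slide)
    x_next = A @ x_hat + B @ u + K @ (y - Cu @ x_hat)
    Sigma_next = (A - K @ Cu) @ Sigma @ A.T + Sigma_w
    return x_next, Sigma_next

# Illustrative run with C(u) = u^T C, C = I, and exciting random inputs
n, p = 2, 2
A = np.array([[0.9, 0.1], [0.0, 0.9]]); B = np.zeros((n, p))
C_of_u = lambda u: u.reshape(1, n)                   # u^T C with C = I
Sw, Sv = 0.01 * np.eye(n), np.array([[0.01]])
rng = np.random.default_rng(2)
x, x_hat, Sigma = np.array([1.0, -1.0]), np.zeros(n), np.eye(n)
for t in range(200):
    u = rng.standard_normal(p)
    y = C_of_u(u) @ x + 0.1 * rng.standard_normal(1)
    x_hat, Sigma = kf_step(x_hat, Sigma, u, y, A, B, C_of_u, Sw, Sv)
    x = A @ x + B @ u + 0.1 * rng.standard_normal(n)
```

Unlike in classic LQG, here \(\Sigma_t\) depends on the realized inputs, so the filter's accuracy is shaped by the control policy.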
2. State Feedback Control via LQR
$$u_t = K_t^\star \hat x_t$$
where \(K^\star = \{K^\star_0,...,K^\star_T\}\) is defined recursively in terms of \(A,B,Q,R\)
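The recursion is the standard backward Riccati pass; a sketch with an illustrative system (not the talk's matrices):

```python
import numpy as np

# Sketch: finite-horizon LQR gains K*_0,...,K*_{T-1} from (A, B, Q, R)
# via the backward Riccati recursion, with u_t = K_t x_t.
def lqr_gains(A, B, Q, R, T):
    P = Q.copy()                         # terminal cost-to-go
    gains = []
    for _ in range(T):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)    # Riccati update
        gains.append(K)
    return gains[::-1]                   # gains[t] = K*_t

# Illustrative double integrator
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
gains = lqr_gains(A, B, np.eye(2), np.eye(1), T=200)
K0 = gains[0]
```

For a long horizon the early gains approach the infinite-horizon solution, and the closed loop \(A + BK_0\) is stable.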
3. Optimal Control from Bilinear Observations
$$\begin{align*} x_{t+1} &= \begin{bmatrix} 1 & 0.3 \\ 0 & 1\end{bmatrix} x_t + \begin{bmatrix}0.3 \\ 0 \end{bmatrix} u_t + w_t \\ y_t &= u_t\begin{bmatrix} 1 & 0\end{bmatrix} x_t + v_t \end{align*}$$
with \(Q=I\) and \(R=1000\)
$$\begin{align*} x_{t+1} &= x_t + \begin{bmatrix}0 & 1 \end{bmatrix} u_t + w_t \\ y_t &= u_t^\top \begin{bmatrix} 1\\ 0\end{bmatrix} x_t + v_t \end{align*}$$
with \(Q=\frac{1}{2}\) and \(R=I\)
\(\implies\) infinite horizon \(K_\star = \begin{bmatrix} 0 \\ \frac{1}{2}\end{bmatrix}\), so under the certainty-equivalent input \(u_t = K_\star \hat x_t\),
\(y_t = \hat x_t \begin{bmatrix} 0 \\ \frac{1}{2}\end{bmatrix}^\top \begin{bmatrix} 1 \\ 0\end{bmatrix} x_t + v_t = 0\cdot x_t + v_t:\) only noise is observed!
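This pathology is easy to verify numerically: under \(u_t = K_\star \hat x_t\), the coefficient multiplying \(x_t\) in \(y_t\) vanishes for any estimate \(\hat x_t\).

```python
import numpy as np

# Check the failure mode: with K* = [0, 1/2]^T, the separation-principle
# input makes the observation carry no information about the state.
K_star = np.array([0.0, 0.5])
c = np.array([1.0, 0.0])          # observation direction: y_t = (u_t @ c) x_t + v_t
x_hat = 3.7                       # any state estimate
u = K_star * x_hat                # u_t = K* x_hat_t
signal_gain = u @ c               # coefficient on x_t in y_t
# signal_gain == 0, so y_t = v_t: pure noise
```

The LQR gain puts all its weight on the input coordinate that moves the state and none on the coordinate that observes it.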
Rewrite the OCP in terms of the belief space from the KF. Starting from
$$\begin{align*} \min_{u_t=\pi_t(\mathcal I_t)} ~~&\mathbb E\left[\sum_{t=1}^{T-1} x_t^\top Q x_t + u_t^\top R u_t \right]\\ \text{s.t.}~~& x_{t+1} = Ax_t + Bu_t + w_t \\& y_t =C(u_t)x_t + v_t,\end{align*}$$
apply the identity \(\mathbb E\left[x_t^\top Q x_t \mid \mathcal I_t \right]= \hat x_t^\top Q\hat x_t + \mathrm{tr}(Q\Sigma_t)\) to obtain
$$\begin{align*}\min_{u_t=\pi_t(\hat x_t,\Sigma_t)}~~& \mathbb E\left[ \sum_{t=1}^{T-1} \hat x_t^\top Q \hat x_t + \mathrm{tr}(Q\Sigma_t) + u_t^\top R u_t \right]\\ \text{s.t.}~~& \hat x_{t+1} = A\hat x_t + Bu_t + L_t(C(u_t)\hat x_t - y_t)\\ &\Sigma_{t+1} = (A+ L_tC(u_t))\Sigma_tA^\top + \Sigma_w \end{align*}$$
where \(L_t = -A\Sigma_tC(u_t)^\top(C(u_t)\Sigma_tC(u_t)^\top+\Sigma_v)^{-1}\) and the innovations \(y_t - C(u_t)\hat x_t \sim \mathcal N(0, C(u_t)\Sigma_t C(u_t)^\top + \Sigma_v)\).
This is now a fully observed, nonlinear, stochastic optimal control problem.
With Andrew Lowitt, Beixi Du, and Daniel Cao
Expected dynamics, open-loop inputs, finite horizon \(\rightarrow\) MPC:
$$\begin{align*}\pi(\hat x_t,\Sigma_t)=\arg \min_{u_k}~~& \sum_{k=1}^{H} \bar x_k^\top Q \bar x_k + \mathrm{tr}(Q\bar\Sigma_k) + u_k^\top R u_k \\ \text{s.t.}~~& \bar x_{k+1} = A\bar x_k + Bu_k\\ &\bar\Sigma_{k+1} = (A+ L_kC(u_k))\bar\Sigma_kA^\top + \Sigma_w \\ &\bar x_0 = \hat x_t ,~~ \bar \Sigma_0=\Sigma_t\end{align*}$$
with \(L_k = -A\bar\Sigma_kC(u_k)^\top(C(u_k)\bar\Sigma_kC(u_k)^\top+\Sigma_v)^{-1}\).
"Solve" the MPC problem with autograd and L-BFGS
$$x_{t+1} = Ax_t + Bu_t + w_t\qquad y_t = u_t^\top Cx_t + v_t$$
Stability is still crucial for identification
Bilinear observation changes the character of the learning and control problem
Certainty equivalent control may not be efficient
Optimal policy does not follow separation principle
What lessons did we learn about RL & ML-enabled control? In the classic LQG story:
1. Collect \(N\) observations from \((A_\star, B_\star,C_\star)\) under white-noise inputs and estimate \(\widehat A,\widehat B, \widehat C\) by least-squares regression.
2. Design the policy \(\widehat \pi\) as if the estimate is true ("certainty equivalent").
Learning Result (Informal): parameter error \( \lesssim \frac{1}{\sqrt{N}}\)
Control Result (Informal): sub-opt. of \(\widehat \pi\lesssim(\)param. err.\()^2 \lesssim \frac{1}{N}\)
Naive exploration is essentially optimal! \(\implies\) The problem does not capture all issues of interest!*
*Exceptions: low data regime, safety/actuation limits
Thanks to collaborators: Sunmook Choi, Yahya Sattar, Yassir Jedra, Maryam Fazel, Leo Maynard-Zhang, Andrew Lowitt, Beixi Du, and Daniel Cao