Learning and Decision-Making Under Observer Effects

Sarah Dean, Cornell University

Simons Workshop, April 2025

Large-scale automated systems enabled by machine learning

learning & decision-making algorithm

control theory + learning theory

Outline

1. Motivation and Background

2. Learning Dynamics


3. Optimal Control

Outline

1. Motivation and Background

i) LQ Control

ii) Preference Dynamics

Sample Complexity of Control

Work with Horia Mania, Nikolai Matni, Ben Recht, and Stephen Tu in 2017

Sample Complexity: How much data is necessary to control a system?

  • Given error \(\epsilon\) and failure probability \(\delta\), how many samples are necessary to ensure \(\mathbb P(\text{sub-opt.}\geq \epsilon)\leq\delta\)?

Motivation: foundation for understanding RL & ML-enabled control

Classic RL setting: discrete problems and inspired by games

RL techniques applied to continuous systems interacting with the physical world

Linear Quadratic Control

Simplest problem: linear dynamics, quadratic cost, zero mean noise

minimize \(\mathbb{E}\left[ \sum_{t=0}^{T-1} x_t^\top Q x_t + u_t^\top R u_t\right]\)

          s.t.  \(x_{t+1} = Ax_t+Bu_t+w_t\)

Linear policy is optimal and can be computed in closed-form:

\(u_t =  K_t^\star x_t\)

  • where \(K^\star = \{K^\star_0,...,K^\star_T\}\) defined recursively depending on \(A,B,Q,R\)

Partial Observation LQ Control

Simplest problem: linear dynamics, quadratic cost, zero mean noise

minimize \(\mathbb{E}\left[ \sum_{t=0}^{T-1} x_t^\top Q x_t + u_t^\top R u_t\right]\)

          s.t.  \(x_{t+1} = Ax_t+Bu_t+w_t\)

                 \(y_{t} = Cx_t+v_t\)

Linear policy is optimal and can be computed in closed-form (separation principle):

\(u_t =  K_t^\star \hat x_t,\quad \hat x_t = \mathbb E[x_t|u_0,...,u_t,y_0,...,y_t]\)

  • where \(K^\star = \{K^\star_0,...,K^\star_T\}\) defined recursively depending on \(A,B,Q,R\)
  • and \(\hat x_t\) depends on \(A,B,C,\Sigma_w,\Sigma_v,\Sigma_0\)
  • when noise is Gaussian, \(\hat x_t\) computed efficiently with Kalman filter

Approach: ID then Control

1. Collect \(N\) observations and estimate \(\widehat A,\widehat B, \widehat C\)

2. Design policy as if estimate is true ("certainty equivalent")

 

[Diagram: data collected from the true system \((A_\star, B_\star,C_\star)\) is used to design the policy \(\widehat \pi\), which is then deployed on \((A_\star, B_\star,C_\star)\)]
Learning Result (Informal): parameter error \( \lesssim \frac{1}{\sqrt{N}}\)

  • via least squares regression with white noise inputs

Control Result (Informal): sub-opt. of \(\widehat \pi\lesssim(\)param. err.\()^2 \lesssim \frac{1}{N}\)

Naive exploration is essentially optimal!

Lessons from LQ Control

What lessons did we learn about RL & ML-enabled control?

  1. Simple model-based approaches work (no need for model-free)
  2. Naive exploration is sufficient, or even no exploration
  3. No need to account for finite sample uncertainty*

*Exceptions: low data regime, safety/actuation limits

\(\implies\) Problem does not capture all issues of interest!

New Setting: Observer Effects

Observer effect: coupling between actuation and observation

  • examples: electronic circuits, quantum wave collapse, human psychology, robotics ...

Example: Personalization

[Diagram: recommender policy chooses recommended content \(u_t\); user with unknown preference parameters \(\theta\) responds with expressed preferences \(y_t\)]

\(\mathbb E[y_t] = \theta^\top u_t  \)

Classically studied as an online decision problem (e.g. multi-armed bandits)

approach: identify \(\theta\) sufficiently well to make good recommendations

Example: Preference Dynamics

However, interests may be impacted by recommended content

[Diagram: recommender policy chooses recommended content \(u_t\); user with preference state \(x_t\) responds with expressed preferences \(y_t\) and updates to \(x_{t+1}\)]

\(\mathbb E[y_t] =  x_t^\top  C u_t  \)

  • Simple dynamics that capture assimilation (adapted from opinion dynamics); see the simulation sketch after this list $$x_{t+1} \propto x_t + \eta_t u_t,\qquad y_t = x_t^\top u_t + v_t$$
  • If \(\eta_t\) constant, tends to homogenization globally
  • If \(\eta_t \propto x_t^\top u_t\) (i.e. biased assimilation), tends to polarization (Hązła et al., 2019)

Example: Preference Dynamics

Implications for personalization [DM22]

  1. It is not necessary to estimate preferences to make "good" recommendations

  2. Instead of polarization, preferences "collapse" towards whatever users are often recommended

  3. Randomization can prevent such outcomes

  4. Even if harmful content is never recommended, can cause harm through preference shifts [CKEWDI24]

[Figure: initial preference, recommendation, and resulting preference]


Outline

2. Learning Dynamics from Bilinear Observations

i) Setting

ii) Algorithm

iii) Results


Problem Setting: Identification

  • Unknown dynamics and measurement functions
  • Observed trajectory of inputs \(u\in\mathbb R^p\) and outputs \(y\in\mathbb R\) $$u_0,y_0,u_1,y_1,...,u_T,y_T$$
  • Goal: identify dynamics and measurement models from data
  • Setting: linear/bilinear with \(A\in\mathbb R^{n\times n}\), \(B\in\mathbb R^{n\times p}\), \(C\in\mathbb R^{p\times n}\) $$x_{t+1} = Ax_t + Bu_t + w_t\\ y_t = u_t^\top Cx_t + v_t$$

[Diagram: inputs \(u_t\) (e.g. playlist attributes) and outputs \(y_t\) (e.g. listen time) over time]

Identification Algorithm

Input: data \((u_0,y_0,...,u_T,y_T)\), history length \(L\), state dim \(n\)

Step 1: Regression

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - u_t^\top \textstyle \sum_{k=1}^L G[k] u_{t-k} \big)^2 $$

Step 2: Decomposition \(\hat A,\hat B,\hat C = \mathrm{HoKalman}(\hat G, n)\) (Oymak & Ozay, 2019)

[Figure: trajectory of inputs and outputs over time; each output is regressed on the window of \(L\) preceding inputs]

Joint work with Yahya Sattar and Yassir Jedra

Estimation Errors

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - u_t^\top \textstyle \sum_{k=1}^L G[k] u_{t-k} \big)^2 $$

  • (Biased) estimate of Markov parameters $$ G  =\begin{bmatrix} C B & CA B & \dots & CA^{L-1} B \end{bmatrix} $$
  • Regress \(y_t\) against $$ \underbrace{ \begin{bmatrix} u_{t-1}^\top & ... & u_{t-L}^\top \end{bmatrix}}_{\bar u_{t-1}^\top } \otimes u_t^\top $$
  • Data matrix: circulant-like structure $$Z = \begin{bmatrix}\bar u_{L-1}^\top  \otimes u_L^\top \\ \vdots \\ \bar u_{T-1}^\top \otimes u_T^\top\end{bmatrix} $$

[Figure: each output is predicted from the preceding \(L\) inputs via \(\bar u_{t-1}^\top \otimes u_t^\top \,\mathrm{vec}(G) \)]

Main Results

Assumptions:

  1. Process and measurement noise \(w_t,v_t\) are i.i.d., zero mean, and have bounded second moments
  2. Inputs \(u_t\) are bounded
  3. The dynamics are strictly stable, i.e. \(\rho(A)<1\)

Informal Theorem (Markov parameter estimation)

With probability at least \(1-\delta\), $$\|G-\hat G\|_{Z^\top Z} \lesssim \sqrt{ \frac{p^2 L}{\delta} \cdot c_{\mathrm{stability,noise}} }+ \rho(A)^L\sqrt{T} c_{\mathrm{stability}}$$


Main Results

Assumptions:

  1. Process and measurement noise \(w_t,v_t\) are i.i.d., zero mean, and have bounded second moments
  2. Inputs \(u_t\) are bounded
  3. The dynamics are strictly stable, i.e. \(\rho(A)<1\)
  4. Choosing \(L=\log(T)/\log(\rho(A)^{-1})\)

  5. (For state space recovery: \((A,B,C)\) are observable, controllable)

Informal Summary Theorem

With high probability, for bounded random design inputs \(u_{0:T}\), $$\mathrm{est.~errors} \lesssim \sqrt{ \frac{\mathsf{poly}(\mathrm{dimension})}{\sigma_{\min}(Z^\top Z)}} \lesssim \sqrt{ \frac{\mathsf{poly}(\mathrm{dim.})}{T}}$$

Proof Sketch

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - \bar u_{t-1}^\top \otimes u_t^\top \mathrm{vec}(G)  \big)^2 $$

  • Claim: this is a biased estimate of Markov parameters $$ G_\star  =\begin{bmatrix} C B & CA B & \dots & CA^{L-1} B \end{bmatrix} $$
    • Observe that \(x_t = \sum_{k=1}^L A^{k-1} (Bu_{t-k} + w_{t-k}) + A^L x_{t-L}\)
    • Hence, \(y_t= \bar u_{t-1}^\top \otimes u_t^\top \mathrm{vec}(G_\star) +u_t^\top \textstyle \sum_{k=1}^L CA^{k-1} w_{t-k} + u_t^\top CA^L x_{t-L} + v_t \)
  • Least squares: for \(y_t = z_t^\top \theta + n_t\), the estimate  \(\hat\theta=\arg \min\sum_t (z_t^\top \theta - y_t)^2\) $$= \textstyle\arg \min  \|Z \theta - Y\|^2_2 = (Z^\top Z)^\dagger Z^\top Y= \theta_\star + (Z^\top Z)^\dagger Z^\top N$$
  • Estimation errors are therefore \(\|G_\star -\hat G\|_{Z^\top Z} = \|Z^\top N\| \)
  • Blocking technique to bound minimum singular value of \(Z\)

Summary

2. Learning Dynamics from Bilinear Observations

Algorithm: nonlinear features

Analysis: blocking technique

Exploration: similar to linear setting


Outline

3. Optimal Control with Bilinear Observations

i) Setting

ii) Separation Principle

iii) Results

Problem Setting: Optimal Control

  • Linear state update with \(A\in\mathbb R^{n\times n}\), \(B\in\mathbb R^{n\times p}\) $$x_{t+1} = Ax_t + Bu_t + w_t $$
  • Bilinear measurements with \(C_i\in\mathbb R^{m\times n}\) $$y_t = \underbrace{\Big(C_0 + \sum_{i=1}^p u_t[i] C_i \Big)}_{C(u_t)}x_t + v_t$$
  • Quadratic costs with \(Q,R\succ 0\) $$c(x,u) = x^\top Q x + u^\top R u $$
  • Gaussian process noise, measurement noise, and initial state $$\{w_t\} \sim \mathcal N(0,\Sigma_w),\quad \{v_t\} \sim \mathcal N(0, \Sigma_v),\quad x_0 \sim \mathcal N(0,\Sigma_0)$$
  • Information set for decision-making $$\mathcal I_t = \{u_0,...,u_{t-1}, y_0, ..., y_{t-1}\}$$

Problem Setting: Optimal Control

$$\min_{u_t=\pi_t(\mathcal I_t)} \mathbb E\left[x_T^\top Q x_T+ \sum_{t=1}^{T-1} x_t^\top Q x_t + u_t^\top R u_t \right]\\ \text{s.t.} \quad x_{t+1} = Ax_t + Bu_t + w_t \\\qquad\qquad\qquad  y_t =\Big(C_0 + \sum_{i=1}^p u_t[i] C_i \Big)x_t + v_t$$

Small departure from classic LQG control

Separation Principle

  • Separation principle (SP): for partially observed optimal control, it suffices to independently design estimation & control
    • Optimal for partially observed LQ control
  • The SP policy has two components:
    1. State estimation \(\hat x_t = \mathbb E[x_t|\mathcal I_t]\)
    2. State dependent policy \(u_t = K^\star_t \hat x_t\)

State Estimation

  • As in LQG, we use the Kalman Filter
    • \(\hat x_{t+1} = A\hat x_t + Bu_t - L_t\big(y_t-C(u_t)\hat x_t\big)\)
    • \(\Sigma_{t+1} = (A+ L_tC(u_t))\Sigma_tA^\top + \Sigma_w\)
    • \(L_t = -A\Sigma_tC(u_t)^\top(C(u_t)\Sigma_tC(u_t)^\top+\Sigma_v)^{-1}\)
  • Lemma: the posterior distribution is given by the Kalman filter $$x_t|\mathcal I_t \sim \mathcal N(\hat x_t,\Sigma_t)$$
  • Unlike the standard linear setting, there is a nonlinear dependence on the inputs (see the filter-update sketch below)

Example

$$ x_{t+1} = \begin{bmatrix} 1 & 0.3 \\ 0 & 1\end{bmatrix} x_t + \begin{bmatrix}0.3 \\ 0 \end{bmatrix} u_t + w_t $$

$$ y_t = (C_0+C_1 u_t)\begin{bmatrix} 1 & 0\end{bmatrix} x_t + v_t$$

with \(Q=I\) and \(R=1000\)

Main Results

  • Theorem: For \(T\geq 2\), the optimal policy is not affine in the estimated state
    • as a consequence, the SP policy is not optimal
  • Theorem: There exist instances in which the SP policy locally maximizes the cost
    • in these instances, the optimal controller is nonlinear and not unique, i.e. for scalar system at \(t=T-2\), $$ u^\star_{t} = -\alpha\hat x_{t}\left(1\pm \frac{1}{K_{t} \hat x_{t}}\sqrt{-\frac{\Sigma_z}{\Sigma_{t}} +\beta K_{t}}\right) $$

Proof Sketch

  • Strategy: analyze solution to dynamic programming
  • At \(t=T\), the value function is \(V_T(x) = x^\top Q x\)
  • At \(t=T-1\), $$V_{T-1}(x_t)= \min_u \underbrace{ \mathbb E[ c(x_t, u) +V_T(x_{t+1})|\mathcal I_{T-1}]}_{f_{T-1}(u) = f_{T-1}^\mathrm{LQ}(u)}$$
    • The solution coincides with LQG $$u^\star_{T-1} = K^\star_{T-1}\mathbb E[x_{T-1}|\mathcal I_{T-1}]$$
  • At \(t=T-2\), due to dependence of state estimation on input $$ \min_u\underbrace{\mathbb E[ c(x_t, u) +V_{T-1}(x_{t+1})|\mathcal I_{T-2}]}_{f_{T-2}(u)  = f_{T-2}^\mathrm{LQ}(u) + f^\mathrm{obs}_{T-2}(u)} $$


Summary

1. Motivation and Background

2. Learning Dynamics


3. Optimal Control

Open Directions:

  • Learning without stability assumption on \(A\)
    • Idea: regress \(y_t\) against \(u_{0:t},y_{0:t-1}\)
  • Control with guarantees of stability or sub-optimality
    • Idea: observability-aware inputs
  • Full sample complexity analysis
    • State estimation & control with imperfect models
  • Learning Linear Dynamics from Bilinear Observations at ACC25 (arxiv:2409.16499) with Yahya Sattar and Yassir Jedra
  • Sub-optimality of the Separation Principle for Quadratic Control from Bilinear Observations (arxiv:2504.11555) with Yahya Sattar, Sunmook Choi, Yassir Jedra, Maryam Fazel

Thanks! Questions?

  • On the Sample Complexity of the Linear Quadratic Regulator in FoCM (arxiv:1710.01688) with Horia Mania, Nikolai Matni, Benjamin Recht, Stephen Tu
  • Preference Dynamics Under Personalized Recommendations at EC22 (arxiv:2205.13026) with Jamie Morgenstern
  • Harm Mitigation in Recommender Systems under User Preference Dynamics at KDD24 (arxiv:2406.09882) with Chee, Kalyanaraman, Ernala, Weinsberg, Ioannidis
