Learning Dynamics from Bilinear Observations

Sarah Dean, Cornell University

June 2024

Large scale automated systems, powered by machine learning

historical interactions \(\to\) probability of new interaction

Feedback arises when actions impact the world

Training data is correlated due to dynamics and feedback

Outline

1. Motivation: Implications for Personalization

2. Learning Dynamics from Bilinear Observations

[Figure: inputs and outputs over time]

Outline

1. Motivation: Implications for Personalization

i) Setting

ii) Assimilation

iii) Harm

Setting: Preference Dynamics

Interests may be impacted by recommended content

  • recommender policy chooses recommended content \(u_t\)
  • preference state \(x_t\)
  • expressed preferences \(y_t = \langle x_t, u_t\rangle + v_t\)
  • the bilinear form \(y_{ij} \approx x_i^\top u_j\) underlies factorization-based methods
  • state \(x_t\) updates to \(x_{t+1}\)

Setting: Preference Dynamics

\(y_t = \langle x_t, u_t\rangle + v_t  \)

\(x_{t+1} = f_t(x_t, u_t)\)

preferences \(x\in\mathcal X=\mathcal S^{d-1}\)

recommendations \(u_t\in\mathcal U\subseteq \mathcal S^{d-1}\)

Examples of Preference Dynamics

  • Assimilation: interests become more similar to recommended content $$x_{t+1} \propto x_t + \eta_t u_t$$
  • Biased Assimilation: interest update is proportional to affinity $$ x_{t+1} \propto x_t + \eta_t\langle x_t, u_t\rangle u_t$$
    • Proposed by Hązła et al. (2019) as a model of opinion dynamics

[Figure: initial preference and resulting preference]
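The two update rules can be sketched in a few lines. This is a minimal illustration, assuming preferences and recommendations are unit vectors and a constant step size; the function names and the example vectors are hypothetical and not taken from the talk.

```python
import numpy as np

def normalize(x):
    """Rescale a nonzero vector to unit norm (back onto the sphere)."""
    return x / np.linalg.norm(x)

def assimilation_step(x, u, eta):
    """Assimilation: x_{t+1} is proportional to x_t + eta * u_t."""
    return normalize(x + eta * u)

def biased_assimilation_step(x, u, eta):
    """Biased assimilation: x_{t+1} is proportional to x_t + eta * <x_t, u_t> * u_t."""
    return normalize(x + eta * np.dot(x, u) * u)

# Illustrative example: the user initially dislikes the recommended item.
x0 = np.array([1.0, 0.0])
u = np.array([-0.6, 0.8])          # unit norm, affinity <x0, u> = -0.6
x_assim = assimilation_step(x0, u, eta=0.5)
x_biased = biased_assimilation_step(x0, u, eta=0.5)

print("initial affinity:        ", np.dot(x0, u))
print("after assimilation:      ", np.dot(x_assim, u))
print("after biased assimilation:", np.dot(x_biased, u))
```

In this example the assimilation step moves the preference toward the recommendation, while the biased step moves it away because the initial affinity is negative.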

Prior Work

When recommendations are made globally, the outcomes differ:

1. Assimilation \(x_{t+1} \propto x_t + \eta_t u_t\): homogenized preferences

2. Biased assimilation \(x_{t+1} \propto x_t + \eta_t\langle x_t, u_t\rangle u_t\): polarization (Hązła et al. 2019; Gaitonde et al. 2021)

[Figure: initial preference and resulting preference]

Personalized Recommendations

Personalized fixed recommendation \(u_t=u\)

Regardless of whether assimilation is biased, $$ x_t = \alpha_t x_0 +  \beta_t u$$

  • \(\alpha_t\): positive and decreasing
  • \(\beta_t\): increasing magnitude (same sign as \(\langle x_0, u\rangle\) if biased assimilation)

\(x_{t+1} \propto x_t + \eta_t\langle x_t, u_t\rangle u_t\)

\(x_{t+1} \propto x_t + \eta_t u_t\)
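The decomposition \(x_t = \alpha_t x_0 + \beta_t u\) can be tracked directly in simulation. The sketch below assumes a constant step size and, for plain assimilation, a positive initial affinity; the function name and example values are illustrative only.

```python
import numpy as np

def simulate_fixed_recommendation(x0, u, eta, T, biased):
    """Track (alpha_t, beta_t) with x_t = alpha_t * x0 + beta_t * u for a fixed u."""
    alpha, beta = 1.0, 0.0
    history = [(alpha, beta)]
    for _ in range(T):
        x = alpha * x0 + beta * u
        # Step along u: scaled by the affinity <x_t, u> in the biased case.
        gain = eta * np.dot(x, u) if biased else eta
        # Normalizing x + gain * u rescales both coefficients by the same factor.
        scale = np.linalg.norm(x + gain * u)
        alpha, beta = alpha / scale, (beta + gain) / scale
        history.append((alpha, beta))
    return np.array(history)

x0 = np.array([1.0, 0.0])
u_pos = np.array([0.6, 0.8])    # positive initial affinity
u_neg = np.array([-0.6, 0.8])   # negative initial affinity

plain = simulate_fixed_recommendation(x0, u_pos, eta=0.2, T=30, biased=False)
biased_pos = simulate_fixed_recommendation(x0, u_pos, eta=0.2, T=30, biased=True)
biased_neg = simulate_fixed_recommendation(x0, u_neg, eta=0.2, T=30, biased=True)

print("final (alpha, beta), assimilation:          ", plain[-1])
print("final (alpha, beta), biased, affinity > 0:  ", biased_pos[-1])
print("final (alpha, beta), biased, affinity < 0:  ", biased_neg[-1])
```

In the biased case with negative initial affinity, \(\beta_t\) becomes negative, so the preference moves toward \(-u\).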


Personalized Recommendations

Regardless of whether assimilation is biased,

[Figure: initial preference and resulting preference]

Implications [DM22]

  1. It is not necessary to identify preferences to make high affinity recommendations

  2. Preferences "collapse" towards whatever users are often recommended

  3. Non-manipulation (and other goals) can be achieved through randomization

\(x_{t+1} \propto x_t + \eta_t\langle x_t, u_t\rangle u_t\)

\(x_{t+1} \propto x_t + \eta_t u_t\)

Harmful Recommendations

Simple choice model: given a recommendation, a user

  1. Selects the recommendation with probability determined by affinity
  2. Otherwise, selects from among all content based on affinities
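A minimal sketch of this two-stage choice model follows. The slide does not specify functional forms, so the sigmoid acceptance probability, the softmax fallback, and the `temperature` parameter are all assumptions made for illustration.

```python
import numpy as np

def choice_probabilities(x, items, rec_index, temperature=1.0):
    """Probability that each item is consumed, given one recommended item.

    Assumed forms (not from the talk): the recommendation is accepted with
    probability sigmoid(affinity / temperature); otherwise the user picks
    among all items with softmax(affinities / temperature).
    """
    affinities = items @ x
    p_accept = 1.0 / (1.0 + np.exp(-affinities[rec_index] / temperature))
    logits = affinities / temperature
    fallback = np.exp(logits - logits.max())   # numerically stable softmax
    fallback /= fallback.sum()
    probs = (1.0 - p_accept) * fallback
    probs[rec_index] += p_accept
    return probs

x = np.array([1.0, 0.0])                                   # user preference
items = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])    # item vectors (rows)

probs_rec0 = choice_probabilities(x, items, rec_index=0)
probs_rec1 = choice_probabilities(x, items, rec_index=1)
print("recommend item 0:", probs_rec0)
print("recommend item 1:", probs_rec1)
```

Under these assumptions, an item the user never sees recommended can still be consumed through the fallback stage, which is why harm can occur even when no harmful content is recommended.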

Preference dynamics lead to a new perspective on harm

Simple definition: harm caused by consumption of harmful content

[Figure: click probability \(\mathbb P\{\mathrm{click}\}\) for each content item]

Harmful Recommendations

Without preference dynamics, harm minimizing policy is the engagement maximizing policy (excluding harmful items)

[Figure: click probability \(\mathbb P\{\mathrm{click}\}\) for each content item, shown for two different recommendations]


Harmful Recommendations

With preference dynamics, there may be downstream harm, even when no harmful content is recommended

[Figure: click probability \(\mathbb P\{\mathrm{click}\}\) for each content item, shown for two different recommendations]

This motivates a new recommendation objective which takes into account the probability of future harm [CDEIKW24]


Outline

2. Learning Dynamics from Bilinear Observations

i) Setting

ii) Algorithm

iii) Results


Problem Setting

  • Unknown dynamics and measurement functions
  • Observed trajectory of inputs \(u\in\mathbb R^p\) and outputs \(y\in\mathbb R\) $$u_0,y_0,u_1,y_1,...,u_T,y_T$$
  • Goal: identify dynamics and measurement models from data
  • Setting: linear/bilinear with \(A\in\mathbb R^{n\times n}\), \(B\in\mathbb R^{n\times p}\), \(C\in\mathbb R^{p\times n}\) $$\begin{aligned} x_{t+1} &= Ax_t + Bu_t + w_t\\ y_t &= u_t^\top Cx_t + v_t \end{aligned}$$

inputs \(u_t\), e.g. playlist attributes

outputs \(y_t\), e.g. listen time
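To make the setting concrete, the following sketch simulates a trajectory from the model above. It assumes a zero initial state and Gaussian noise; the function name, noise parameters, and example matrices are illustrative choices rather than anything specified in the talk.

```python
import numpy as np

def simulate_bilinear(A, B, C, inputs, process_std=0.0, meas_std=0.0, rng=None):
    """Simulate x_{t+1} = A x_t + B u_t + w_t and y_t = u_t^T C x_t + v_t from x_0 = 0."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(A.shape[0])
    outputs = np.empty(len(inputs))
    for t, u in enumerate(inputs):
        outputs[t] = u @ C @ x + meas_std * rng.standard_normal()
        x = A @ x + B @ u + process_std * rng.standard_normal(A.shape[0])
    return outputs

# Small illustrative system with n = 2 states and p = 2 input dimensions.
A = np.array([[0.5, 0.2], [0.0, 0.3]])
B = np.array([[1.0, 0.0], [0.5, 1.0]])
C = np.array([[1.0, 0.3], [0.0, 1.0]])

rng = np.random.default_rng(0)
us = rng.standard_normal((5, 2))
ys = simulate_bilinear(A, B, C, us)   # noiseless by default
print(ys)
```

Without noise and with \(x_0 = 0\), the first output is zero and the second is \(u_1^\top C B u_0\), which is the first Markov parameter acting bilinearly on consecutive inputs.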

Identification Algorithm

Input: data \((u_0,y_0,...,u_T,y_T)\), history length \(L\), state dim \(n\)

Step 1: Regression

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - u_t^\top \textstyle \sum_{k=1}^L G[k] u_{t-k} \big)^2 $$

Step 2: Decomposition \(\hat A,\hat B,\hat C = \mathrm{HoKalman}(\hat G, n)\)
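Step 2 can be sketched with a textbook Ho-Kalman decomposition. This is one standard variant under the assumption that the Hankel matrix built from the Markov parameters has rank \(n\); it is not necessarily the exact variant used in the talk, and the function name `ho_kalman` is a placeholder.

```python
import numpy as np

def ho_kalman(G, n):
    """Recover (A, B, C) up to similarity from G = [CB, CAB, ..., CA^{L-1}B]."""
    p = G.shape[0]
    L = G.shape[1] // p
    blocks = [G[:, k * p:(k + 1) * p] for k in range(L)]   # blocks[k] ~ C A^k B
    r, c = L // 2, L - L // 2
    # Hankel matrix H = O Q and its shifted version H_shift = O A Q.
    H = np.block([[blocks[i + j] for j in range(c)] for i in range(r)])
    H_shift = np.block([[blocks[i + j + 1] for j in range(c)] for i in range(r)])
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    sqrt_s = np.sqrt(s[:n])
    O = U[:, :n] * sqrt_s              # estimated observability matrix
    Q = sqrt_s[:, None] * Vt[:n]       # estimated controllability matrix
    A_hat = np.linalg.pinv(O) @ H_shift @ np.linalg.pinv(Q)
    B_hat = Q[:, :p]
    C_hat = O[:p, :]
    return A_hat, B_hat, C_hat

# Illustrative check with exact Markov parameters of a small system.
A = np.array([[0.5, 0.2], [0.0, 0.3]])
B = np.array([[1.0, 0.0], [0.5, 1.0]])
C = np.array([[1.0, 0.3], [0.0, 1.0]])
L = 6
G = np.hstack([C @ np.linalg.matrix_power(A, k) @ B for k in range(L)])

A_hat, B_hat, C_hat = ho_kalman(G, n=2)
print("eigenvalues of A_hat:", np.sort(np.linalg.eigvals(A_hat).real))
```

The recovered matrices generally differ from the true ones by a similarity transform, so comparisons are made through invariants such as eigenvalues or the Markov parameters themselves.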

[Figure: timeline of inputs and outputs, with a history window of length \(L\) ending at time \(t\)]

Estimation Errors

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - u_t^\top \textstyle \sum_{k=1}^L G[k] u_{t-k} \big)^2 $$

  • (Biased) estimate of Markov parameters $$ G  =\begin{bmatrix} C B & CA B & \dots & CA^{L-1} B \end{bmatrix} $$
  • Regress \(y_t\) against $$ \underbrace{ \begin{bmatrix} u_{t-1}^\top & ... & u_{t-L}^\top \end{bmatrix}}_{\bar u_{t-1}^\top } \otimes u_t^\top $$

  • Each prediction is linear in \(\mathrm{vec}(G)\): $$u_t^\top \textstyle \sum_{k=1}^L G[k] u_{t-k} = (\bar u_{t-1}^\top \otimes u_t^\top)\, \mathrm{vec}(G)$$
  • Define the data matrix $$\tilde U = \begin{bmatrix}\bar u_{L-1}^\top  \otimes u_L^\top \\ \vdots \\ \bar u_{T-1}^\top \otimes u_T^\top\end{bmatrix} $$
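The regression step can be sketched as an ordinary least-squares problem on the Kronecker-structured rows. The example below is a sketch under simplifying assumptions: no noise, and a nilpotent \(A\) chosen so that the truncation bias vanishes and the estimate is exact. Function names and the example system are hypothetical.

```python
import numpy as np

def build_data_matrix(inputs, L):
    """Stack the rows kron(u_bar_{t-1}, u_t) for t = L, ..., T."""
    rows = []
    for t in range(L, len(inputs)):
        # u_bar_{t-1} = [u_{t-1}; ...; u_{t-L}]
        u_bar = np.concatenate([inputs[t - k] for k in range(1, L + 1)])
        rows.append(np.kron(u_bar, inputs[t]))
    return np.array(rows)

def estimate_markov_parameters(inputs, outputs, L):
    """Least-squares estimate of G (p x pL) from inputs (T+1 x p) and outputs (T+1,)."""
    p = inputs.shape[1]
    U_tilde = build_data_matrix(inputs, L)
    theta, *_ = np.linalg.lstsq(U_tilde, outputs[L:], rcond=None)
    # theta is vec(G) in column-major order, matching the Kronecker identity.
    return theta.reshape((p, p * L), order="F")

# Illustrative noiseless system; A is nilpotent (A @ A = 0), so L = 2 has no bias.
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[1.0, 0.0], [0.5, 1.0]])
C = np.array([[1.0, 0.3], [0.0, 1.0]])
L = 2

inputs = rng.standard_normal((200, 2))
outputs = np.empty(200)
x = np.zeros(2)
for t, u in enumerate(inputs):
    outputs[t] = u @ C @ x
    x = A @ x + B @ u

G_true = np.hstack([C @ B, C @ A @ B])
G_hat = estimate_markov_parameters(inputs, outputs, L)
print(np.round(G_hat, 6))
```

With process or measurement noise, or with \(A^L \neq 0\), the same code returns a noisy and biased estimate, which is what the error bounds in the talk quantify.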


Main Results

Assumptions:

  1. Process and measurement noise \(w_t,v_t\) are i.i.d., zero mean, and have bounded second moments
  2. Inputs \(u_t\) are bounded
  3. The dynamics \((A,B,C)\) are observable, controllable, and strictly stable, i.e. \(\rho(A)<1\)

Informal Theorem (Markov parameter estimation)

With probability at least \(1-\delta\), $$\epsilon_G=\|G-\hat G\|_{\tilde U^\top \tilde U} \lesssim \sqrt{ \frac{p^2 L}{\delta} \cdot c_{\mathrm{stability,noise}} }+ \rho(A)^L\sqrt{T} c_{\mathrm{stability}}$$


Informal Theorem (system identification)

Suppose \(L\) is sufficiently large. Then, there exists a nonsingular matrix \(S\) (i.e. a similarity transform) such that

\(\|A-S\hat AS^{-1}\|_{F}\), \(\| B-S\hat B\|_{F}\), and \(\| C-\hat CS^{-1}\|_{F} \) are all

$$\lesssim c_{\mathrm{contr,obs,dim}}  \frac{\|G-\hat G\|_{F}}{\sqrt{\sigma_{\min}(\tilde U^\top \tilde U)}} $$


Informal Summary Theorem

With high probability, $$\mathrm{est.~errors} \lesssim \sqrt{ \frac{\mathsf{poly}(\mathrm{dimension})}{\sigma_{\min}(\tilde U^\top \tilde U)}}$$

Sample Complexity

How large does \(T\) need to be to guarantee bounded estimation error?

Informal Conjecture

When \(u_t\) are chosen i.i.d. and sub-Gaussian and \(T\) is large enough, whp $$\sigma_{\min}({\tilde U^\top \tilde U} )\gtrsim T$$

Informal Corollary

For i.i.d. and sub-Gaussian inputs, whp $$\mathrm{est.~errors} \lesssim \sqrt{ \frac{\mathsf{poly}(\mathrm{dim.})}{T}}$$

Formal analysis involves the structured random matrix \(\tilde U\)
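The conjectured growth of \(\sigma_{\min}(\tilde U^\top \tilde U)\) can be probed numerically. The sketch below assumes standard Gaussian inputs and small illustrative dimensions; it is an empirical check only and says nothing about the formal statement. Function names are hypothetical.

```python
import numpy as np

def build_data_matrix(inputs, L):
    """Stack the rows kron(u_bar_{t-1}, u_t) for t = L, ..., T."""
    rows = []
    for t in range(L, len(inputs)):
        u_bar = np.concatenate([inputs[t - k] for k in range(1, L + 1)])
        rows.append(np.kron(u_bar, inputs[t]))
    return np.array(rows)

def sigma_min_gram(T, p=2, L=3, seed=0):
    """Smallest eigenvalue of U_tilde^T U_tilde for i.i.d. standard Gaussian inputs."""
    rng = np.random.default_rng(seed)
    U_tilde = build_data_matrix(rng.standard_normal((T + 1, p)), L)
    return np.linalg.eigvalsh(U_tilde.T @ U_tilde)[0]   # eigenvalues are ascending

for T in (500, 1000, 2000, 4000):
    s = sigma_min_gram(T)
    print(f"T = {T:5d}   sigma_min = {s:10.2f}   sigma_min / T = {s / T:.3f}")
```

If the conjecture holds for these inputs, the ratio in the last column should stay bounded away from zero as \(T\) grows.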

Conclusion & Discussion

  1. Motivation: preference dynamics
    • Assimilation dynamics & harm
  2. Learning from bilinear observations

  • sample complexity
  • marginal stability
  • prediction (filtering)
  • optimal & adaptive control
  • applications

References

  1. Preference Dynamics Under Personalized Recommendations at EC22 (arxiv:2205.13026) with Jamie Morgenstern (more details on affinity maximization, preference stationarity, and mode collapse)
  2. Harm Mitigation in Recommender Systems under User Preference Dynamics at KDD24 (arxiv:2406.09882) with Jerry Chee, Sindhu Ernala, Stratis Ioannidis, Shankar Kalyanaraman, Udi Weinsberg
  3. Learning Linear Dynamics from Bilinear Observations (poster here!) with Yahya Sattar

Other References

  • Gaitonde, Kleinberg, Tardos, 2021. Polarization in geometric opinion dynamics. EC.
  • Hązła, Jin, Mossel, Ramnarayan, 2019. A geometric model of opinion polarization. Mathematics of Operations Research.
  • Oymak & Ozay, 2019. Non-asymptotic Identification of LTI Systems from a Single Trajectory. ACC.

Thanks! Questions?

(Oymak & Ozay, 2019)

Proof Sketch

$$\hat G = \arg\min_{G\in\mathbb R^{p\times pL}} \sum_{t=L}^T \big( y_t - \bar u_{t-1}^\top \otimes u_t^\top \mathrm{vec}(G)  \big)^2 $$

  • Claim: this is a biased estimate of Markov parameters $$ G_\star  =\begin{bmatrix} C B & CA B & \dots & CA^{L-1} B \end{bmatrix} $$
    • Observe that \(x_t = \sum_{k=1}^L A^{k-1} (B u_{t-k} + w_{t-k}) + A^L x_{t-L}\)
    • Hence, \(y_t= (\bar u_{t-1}^\top \otimes u_t^\top) \mathrm{vec}(G_\star) +u_t^\top \textstyle \sum_{k=1}^L CA^{k-1} w_{t-k} + u_t^\top CA^L x_{t-L} + v_t \)
  • Least squares: for \(y_t = z_t^\top \theta_\star + n_t\), the estimate  \(\hat\theta=\arg \min\sum_t (z_t^\top \theta - y_t)^2\) $$= \textstyle\arg \min  \|Z \theta - Y\|^2_2 = (Z^\top Z)^\dagger Z^\top Y= \theta_\star + (Z^\top Z)^\dagger Z^\top N$$ when \(Z\) has full column rank
  • Estimation errors are therefore \(\|G_\star -\hat G\|_{\tilde U^\top\tilde U} = \|\tilde U^\top N\|_{(\tilde U^\top\tilde U)^{-1}} \)
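The least-squares error decomposition can be checked numerically on a generic regression problem. This is a sanity check on random data, assuming \(Z\) has full column rank; all names and sizes are illustrative. It confirms that the error in the \(Z^\top Z\)-weighted norm equals \(Z^\top N\) measured in the \((Z^\top Z)^{-1}\)-weighted norm.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 4))          # regressors, full column rank
theta_star = rng.standard_normal(4)       # true parameter
N = 0.1 * rng.standard_normal(50)         # noise
Y = Z @ theta_star + N

gram = Z.T @ Z
theta_hat = np.linalg.pinv(gram) @ Z.T @ Y
err = theta_hat - theta_star

# Error equals (Z^T Z)^{-1} Z^T N when Z has full column rank.
predicted_err = np.linalg.solve(gram, Z.T @ N)

# Weighted norms: ||err||_{Z^T Z} versus ||Z^T N||_{(Z^T Z)^{-1}}.
lhs = np.sqrt(err @ gram @ err)
rhs = np.sqrt((Z.T @ N) @ np.linalg.solve(gram, Z.T @ N))
print(lhs, rhs)
```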

Equivalent representations

Set of equivalent state space representations for all invertible and square \(M\)

\(s_{t+1} = As_t + Bw_t\)

\(y_t = Cs_t+v_t\)

\(\tilde s_{t+1} = \tilde A\tilde s_t + \tilde B w_t\)

\(y_t = \tilde C\tilde s_t+v_t\)

\(\tilde s = M^{-1}s\)

\(\tilde A = M^{-1}AM\)

\(\tilde B = M^{-1}B\)

\(\tilde C = CM\)

\( s = M\tilde s\)

\( A = M\tilde AM^{-1}\)

\( B = M\tilde B\)

\(C = \tilde CM^{-1}\)

[Figure: block diagrams of the two equivalent representations, \((A, B, C)\) with state \(s\) and \((\tilde A, \tilde B, \tilde C)\) with state \(\tilde s\), each driven by \(w_t\) and \(v_t\) and producing \(y_t\)]

OLC Talk: Learning Dynamics from Bilinear Observations

By Sarah Dean
