Adaptive Control

ML in Feedback Sys #15

Fall 2025, Prof Sarah Dean

Adaptive Control

"What we do"

  • Given a model class (e.g. autoregressive), a model-based method for designing a policy (e.g. MPC), and exploration schedule \(\sigma_t\)
  • For \(t=0,1,2,...\)
    • Estimate a model \(\hat\Theta\) using observations so far \(\{(y_k,a_k,c_k)\}_{k=0}^{t-1}\)
    • Design a policy \(\hat\pi_t\) using the estimated model
    • Sample exploration noise \(z_t\sim\mathcal N(0,\sigma^2_t)\)
    • Select action \(a_t = \hat \pi_t (y_{0:t}, a_{0:t-1}) + z_t\)
  • This is called the "epsilon-greedy" approach (see the sketch below)
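A minimal Python/NumPy sketch of this loop, assuming hypothetical helpers `estimate_model` (e.g. least squares over the model class) and `design_policy` (e.g. MPC or LQR on the estimate), plus a generic `env` interface; none of these names come from the lecture.

```python
import numpy as np

def epsilon_greedy(env, estimate_model, design_policy, sigma, T, action_dim):
    """Epsilon-greedy adaptive control (sketch).

    Hypothetical interfaces:
      estimate_model(obs, actions, costs) -> Theta_hat   (fit on data up to t-1)
      design_policy(Theta_hat) -> policy(obs, actions)   (certainty-equivalent design)
      env.step(a) -> (y, c)                              (next observation and cost)
      sigma(t) -> exploration standard deviation at time t
    """
    obs, actions, costs = [], [], []
    y = env.reset()
    for t in range(T):
        Theta_hat = estimate_model(obs, actions, costs)   # estimate from {(y_k, a_k, c_k)}, k < t
        policy = design_policy(Theta_hat)                 # design \hat\pi_t from the estimate
        obs.append(y)                                     # y_t is now available to the policy
        z = sigma(t) * np.random.randn(action_dim)        # exploration noise z_t ~ N(0, sigma_t^2 I)
        a = policy(obs, actions) + z                      # a_t = \hat\pi_t(y_{0:t}, a_{0:t-1}) + z_t
        y, c = env.step(a)
        actions.append(a)
        costs.append(c)
    return obs, actions, costs
```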

Adaptive Control

"What we do" (simplified)

  • Given a model class (e.g. autoregressive), a model-based method for designing a policy (e.g. MPC), and exploration horizon \(N\)
  • For \(N\) steps
    • Sample exploration noise \(z_t\sim\mathcal N(0,I)\)
    • Select action \(a_t = z_t\)
  • Estimate a model using observations \(\{(y_k,a_k,c_k)\}_{k=0}^{N-1}\)
  • Design a policy \(\hat\pi\) using the estimated model
  • For remaining \(T-N\) steps
    • Select action \(a_t = \hat \pi (y_{0:t}, a_{0:t-1}) \)
  • This is called "explore-then-commit" (see the sketch below)
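The same sketch for the simplified variant, using the same hypothetical interfaces as the epsilon-greedy sketch above:

```python
import numpy as np

def explore_then_commit(env, estimate_model, design_policy, N, T, action_dim):
    """Explore-then-commit (sketch): N steps of pure exploration, then a fixed policy."""
    obs, actions, costs = [], [], []
    y = env.reset()
    for t in range(N):                          # Phase 1: white-noise exploration
        obs.append(y)
        a = np.random.randn(action_dim)         # a_t = z_t ~ N(0, I)
        y, c = env.step(a)
        actions.append(a)
        costs.append(c)
    Theta_hat = estimate_model(obs, actions, costs)   # fit once on the N exploration steps
    policy = design_policy(Theta_hat)                 # design \hat\pi once
    for t in range(N, T):                       # Phase 2: commit to \hat\pi
        obs.append(y)
        a = policy(obs, actions)
        y, c = env.step(a)
        actions.append(a)
        costs.append(c)
    return obs, actions, costs
```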

Adaptive Control

"Why we do it"

  • Fact 1: For linear dynamics and linear models, exploration noise ensures persistence of excitation and we have that \(\|\hat\Theta - \Theta_\star\| \leq \frac{C }{\sigma \sqrt{N}}\)
  • Fact 2: For LQ control, an \(\epsilon\) error in model estimation results in an \(\epsilon^2\) sub-optimal policy for small enough \(\epsilon\) $$\|\hat\Theta - \Theta_\star\| \leq O(\epsilon) \implies J(\hat \pi) - J(\pi_\star) \leq O(\epsilon^2)$$
  • Fact 3: For LQ control, exploration horizon \(N=O(\sqrt{T})\) or exploration schedule \(\sigma^2_t = O(t^{-1/2})\) leads to average sub-optimality $$\textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star) \leq O(\frac{1}{\sqrt{T}} )$$
  • Fact 4: For PO-LQ control, measurement noise suffices for exploration so with \(\sigma_t=0\) we can achieve average sub-optimality $$ \textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star)\leq O(\frac{\mathrm{poly}\log(T)}{T}) $$

Persistence of Excitation

  • Fact 1: For linear dynamics and linear models, exploration noise ensures persistence of excitation and we have that \(\|\hat\Theta - \Theta_\star\| \leq \frac{C  }{\sigma \sqrt{N}}\)
  • Definition: covariates of a least-squares problem satisfy persistence of excitation if there exists \(H,\mu>0\) such that for all \(0\leq s\leq N-H\), $$\sum_{k=s}^{s+H} x_kx_k^\top\succeq \mu I$$
  • Recall the following from Lectures 10 and 11
    • Suppose \(\mathbb E[y_t|x_t] = \Theta_\star^\top x_t\) and \(y_t\) has bounded variance. Define \(V=\sum_{k=1}^N x_kx_k^\top \). Then with high probability,

      $$\|\Theta_\star-\hat\Theta\|_{V}^2 = tr\big((\Theta_\star-\hat\Theta)^\top V(\Theta_\star-\hat\Theta)\big)\leq \beta $$

  • Summing the excitation bound over the \(N/H\) disjoint windows gives \(V=\sum_{k=1}^N x_kx_k^\top \succeq (N/H)\mu I\)
  • By the definition of minimum eigenvalue, $$\|\Theta_\star-\hat\Theta\|_F\leq\sqrt{ \frac{\beta}{\lambda_{\min}(V)}} \leq \sqrt{ \frac{\beta H}{  \mu }} \frac{1}{\sqrt{N}}$$
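A quick numerical check of these quantities on synthetic i.i.d. Gaussian covariates (an assumption made only for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, H = 500, 4, 20
X = rng.normal(size=(N, d))                          # covariates x_1, ..., x_N

# mu: smallest eigenvalue of any length-H window of covariate outer products
mu = min(np.linalg.eigvalsh(X[s:s + H].T @ X[s:s + H]).min()
         for s in range(N - H + 1))

V = X.T @ X                                          # V = sum_k x_k x_k^T
print("windowed excitation mu:", mu)
print("lambda_min(V):", np.linalg.eigvalsh(V).min(),
      "vs (N/H) * mu:", (N / H) * mu)                # V >= (N/H) mu I when windows tile 1..N
```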

Linear Dynamics: State Observed

  • State observed linear dynamics $$s_{t+1} = Fs_t + Ga_t + w_t$$
  • Then set \(y_t\leftarrow s_{t+1}\), \(x_t \leftarrow \begin{bmatrix} s_t^\top & a_t^\top \end{bmatrix}^\top \), and \(\Theta_\star \leftarrow  \begin{bmatrix} F & G \end{bmatrix}^\top \)
  • This is learning an "autoregressive" model with \(L=1\) from \(N\) data points (simulated in the sketch below)
  • For random actions, persistence of excitation is guaranteed with high probability for \(\mu=\Omega(\sigma^2)\) $$\|\Theta_\star-\hat\Theta\|_F = \sqrt{\|F_\star-\hat F\|_F^2 + \|G_\star-\hat G\|_F^2}\leq \frac{C}{ \sigma\sqrt{N }}  $$
  • Why are random actions necessary?
    • Example: with \(a_t=0\), the action block of every \(x_t\) is zero, so \(V\) is singular and \(G\) cannot be identified
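A minimal simulation of this setting with made-up stable dynamics (the matrices and noise level below are illustrative only): excite the system with Gaussian actions of standard deviation \(\sigma\), fit \(\Theta=\begin{bmatrix} F & G\end{bmatrix}^\top\) by least squares, and observe the error shrink roughly like \(1/(\sigma\sqrt{N})\).

```python
import numpy as np

rng = np.random.default_rng(1)
F = np.array([[0.9, 0.1], [0.0, 0.8]])      # illustrative stable dynamics
G = np.array([[0.0], [1.0]])
Theta_star = np.hstack([F, G]).T            # Theta_star = [F  G]^T, shape (3, 2)

def estimation_error(N, sigma):
    s = np.zeros(2)
    X, Y = [], []
    for _ in range(N):
        a = sigma * rng.normal(size=1)                        # random exploratory action
        s_next = F @ s + G @ a + 0.1 * rng.normal(size=2)     # s_{t+1} = F s_t + G a_t + w_t
        X.append(np.concatenate([s, a]))                      # covariate x_t = [s_t; a_t]
        Y.append(s_next)                                      # regression target y_t = s_{t+1}
        s = s_next
    Theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return np.linalg.norm(Theta_hat - Theta_star)

for N in [200, 800, 3200]:
    print(N, estimation_error(N, sigma=1.0))    # error should roughly halve as N quadruples
```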

Linear Dynamics: Partially Observed

  • Partially observed linear dynamics $$s_{t+1} = Fs_t + Ga_t + w_t,\quad y_t=Hs_t + v_t$$
  • Consider an autoregressive model for some \(L\),
    • \(y_t\leftarrow y_{t+1}\) and \(x_t \leftarrow \begin{bmatrix}{\bar y}_{t:t-L+1}^\top & {\bar a}^\top_{t:t-L+1} \end{bmatrix}^\top \)
  • Claim: There exists some parameter such that \(\mathbb E[y_{t+1}|x_t] = \Theta_\star^\top x_t\)
    • The (steady-state) Kalman filter (where \(\tilde F=F-LHF\) and \(\tilde G=G-LHG\)) $$  \hat s_{t+1} = \tilde F\hat s_t + \tilde G a_t + Ly_{t+1},\quad \hat y_t = H\hat s_t$$

    • "Unrolling" shows linear model $$\hat s_{t} = \tilde F^{L}\hat s_{t-L}+ \sum_{k=1}^{L} \tilde F^{k-1}  (Ly_{t-k+1}+ \tilde Ga_{t-k}), \quad \hat y_{t+1} = H(F\hat s_t + Ga_t)$$

    • For truncated history, \(\mathbb E[ \hat s_{t-L}| x_t] = 0\) (due to zero mean initial state and exploration actions)

Linear Dynamics: Partially Observed

  • Partially observed linear dynamics $$s_{t+1} = Fs_t + Ga_t + w_t,\quad y_t=Hs_t + v_t$$
  • Consider an autoregressive model for some \(L\),
    • \(y_t\leftarrow y_{t+1}\) and \(x_t \leftarrow \begin{bmatrix}{\bar y}_{t:t-L+1}^\top & {\bar a}^\top_{t:t-L+1} \end{bmatrix}^\top \)
  • Claim: There exists some parameter such that \(\mathbb E[y_{t+1}|x_t] = \Theta_\star^\top x_t\)
    • "Unrolling" (steady-state) Kalman filter $$\mathbb E[y_{t+1}|x_t] = H F \Big(\sum_{k=1}^{L} \tilde F^{k-1}  (Ly_{t-k+1}+ \tilde Ga_{t-k})\Big)  + H Ga_t$$

    • Thus \(\Theta_\star\) depends on \(F,G,H\) and the Kalman gain \(L\) (a least-squares fit of \(\hat\Theta\) is sketched below)

  • For random actions, persistence of excitation is guaranteed with high probability for \(\mu=\Omega(\sigma^2)\) $$\|\Theta_\star-\hat\Theta\|_F \leq \frac{C}{ \sigma\sqrt{N }}  $$
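A sketch of fitting this order-\(L\) autoregressive model by least squares on data simulated from an illustrative partially observed system (the matrices and noise levels below are assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
F = np.array([[0.9, 0.2], [0.0, 0.7]])      # illustrative system
G = np.array([[0.0], [1.0]])
H = np.array([[1.0, 0.0]])
L, N = 5, 5000                              # ARX history length, number of steps

# Simulate with random exploratory actions
s, ys, acts = np.zeros(2), [], []
for _ in range(N):
    a = rng.normal(size=1)
    y = H @ s + 0.1 * rng.normal(size=1)                  # y_t = H s_t + v_t
    ys.append(y); acts.append(a)
    s = F @ s + G @ a + 0.1 * rng.normal(size=2)          # s_{t+1} = F s_t + G a_t + w_t
ys, acts = np.array(ys), np.array(acts)

# Covariate x_t stacks the last L outputs and actions; target is y_{t+1}
X = np.array([np.concatenate([ys[t - L + 1:t + 1][::-1].ravel(),
                              acts[t - L + 1:t + 1][::-1].ravel()])
              for t in range(L - 1, N - 1)])
Y = ys[L:]
Theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(Theta_hat.shape)                      # (2L, 1): coefficients mapping x_t to predicted y_{t+1}
```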

Control Sub-optimality

  • Fact 2: For LQ control, an \(\epsilon\) error in model estimation results in an \(\epsilon^2\) sub-optimal policy for small enough \(\epsilon\) $$\|\hat\Theta - \Theta_\star\| \leq O(\epsilon) \implies J(\hat \pi) - J(\pi_\star) \leq O(\epsilon^2)$$
  • Define \(J(\pi) =  \lim_{T\to\infty}\frac{1}{T}\mathbb E\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\) $$ \text{s.t}\quad s_{k+1} = F s_k+ Ga_k+w_k ,~~y_k=Hs_k+v_k,~~a_k=\pi(y_{0:k},a_{0:k-1})$$
  • Recall from last lecture that the optimal LQ policy depends on the solution to the discrete-time algebraic Riccati equation (DARE)
  • Mania et al. (2019) show an \(O(\epsilon^2)\) perturbation bound on the DARE
  • Parameters \(\hat F,\hat G,\hat H\) can be extracted from \(\hat \Theta\) (Oymak & Ozay, 2021) and then used to construct \(\hat K \) and \(\hat L\)
  • Feedback policy (see the Riccati iteration sketch below)
    • \(P_\star = Ricc(P_{\star}, F, G, Q, R)\)
    • \(K_\star = -(R+G^\top P_{\star}G)^{-1}G^\top P_{\star}F\)
  • Kalman filter
    • \(\Sigma_{\star} = Ricc(\Sigma_{\star}, F^\top, H^\top, \Sigma_w, \Sigma_v)\)
    • \(L_{\star} = F\Sigma_{\star}H^\top ( H\Sigma_{\star} H^\top+\Sigma_v)^{-1}\)
  • Fact 3: For LQ control, exploration horizon \(N=O(\sqrt{T})\) or exploration schedule \(\sigma^2_t = O(t^{-1/2})\) leads to average sub-optimality $$\textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star) \leq O(\frac{1}{\sqrt{T}} )$$
  • Proof sketch:
    • We will consider suboptimality separately during the two phases
    • \(  \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star)= \sum_{t=1}^N J(\pi^{exp}) - J(\pi_\star) +  \sum_{t=N+1}^T J(\hat \pi) - J(\pi_\star)\)
    • \(\leq \sum_{t=1}^N C_1 +  \sum_{t=N+1}^T C_2 \epsilon(N)^2\) where \(\epsilon(N)\) is the estimation error
    • \(=  C_1 N +  C_2 \epsilon(N)^2 (T-N)\)
    • \(\leq C_1 N +  C_2 \epsilon(N)^2 T\)
    • \(\leq C_1 N +  C_2 \frac{C_3^2}{\sqrt{N}^2 }T\)
    • \(= (C_1  +  C_2C_3^2)\sqrt{T}\) if we set \(N=\sqrt{T}\)
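A numerical sketch of the certainty-equivalent design above: iterate the Riccati recursion to (approximately) its fixed point, then form the feedback gain \(\hat K\) and Kalman gain \(\hat L\). The estimated parameters below are placeholders for whatever \(\hat F,\hat G,\hat H\) the identification step returns, not values from the lecture.

```python
import numpy as np

def riccati_fixed_point(F, G, Q, R, iters=2000):
    """Iterate P <- Q + F^T P F - F^T P G (R + G^T P G)^{-1} G^T P F to the DARE fixed point."""
    P = Q.copy()
    for _ in range(iters):
        GPF = G.T @ P @ F
        P = Q + F.T @ P @ F - GPF.T @ np.linalg.solve(R + G.T @ P @ G, GPF)
    return P

# Placeholder estimates and weights (illustrative values only)
F_hat = np.array([[0.9, 0.1], [0.0, 0.8]])
G_hat = np.array([[0.0], [1.0]])
H_hat = np.array([[1.0, 0.0]])
Q, R = np.eye(2), np.eye(1)
Sigma_w, Sigma_v = 0.1 * np.eye(2), 0.1 * np.eye(1)

# Feedback policy: K = -(R + G^T P G)^{-1} G^T P F
P_hat = riccati_fixed_point(F_hat, G_hat, Q, R)
K_hat = -np.linalg.solve(R + G_hat.T @ P_hat @ G_hat, G_hat.T @ P_hat @ F_hat)

# Kalman filter via the dual Riccati equation: Sigma solves the filter DARE
Sigma_hat = riccati_fixed_point(F_hat.T, H_hat.T, Sigma_w, Sigma_v)
L_hat = F_hat @ Sigma_hat @ H_hat.T @ np.linalg.inv(H_hat @ Sigma_hat @ H_hat.T + Sigma_v)
print("K_hat:", K_hat, "\nL_hat:", L_hat.ravel())
```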

State Feedback Control

  • Fact 3: For LQ control, exploration horizon \(N=O(\sqrt{T})\) or exploration schedule \(\sigma^2_t = O(t^{-1/2})\) leads to average sub-optimality $$\textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star) \leq O(\frac{1}{\sqrt{T}} )$$
  • Proof sketch:
    • We will separate the exploration noise from the policy
    • \(  \sum_{t=1}^T J(\hat \pi^\sigma_t) - J(\pi_\star)\approx  \sum_{t=1}^T J(\hat \pi^{\exp}_t) - J(\pi_\star)+\sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star) \)
    • \( \leq \sum_{t=1}^T C_1 \sigma_t^2 + \sum_{t=1}^T C_2 \epsilon_t(\sigma_t)^2 \)
    • \( \leq \sum_{t=1}^T C_1 \sigma_t^2 + \sum_{t=1}^T C_2 \frac{C_3^2}{\sigma_t^2 t} \)
    • \( = \sum_{t=1}^T C_1 t^{-1/2} + \sum_{t=1}^T C_2 \frac{C_3^2}{ t^{1/2}} \)
    • \( \leq 2(C_1 +C_2 {C_3^2})\sqrt{T} \)
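The final inequality uses the integral comparison $$\textstyle \sum_{t=1}^T t^{-1/2} \leq 1 + \int_1^T s^{-1/2}\,ds = 2\sqrt{T}-1 \leq 2\sqrt{T}$$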

Output Feedback Control

  • Fact 4: For PO-LQ control, measurement noise suffices for exploration so with \(\sigma_t=0\) we can achieve average sub-optimality $$ \textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star)\leq O(\frac{\mathrm{poly}\log(T)}{T}) $$
  • Random actions are not required to ensure persistence of excitation; instead, it is guaranteed for the noiseless actions \(a_t=\hat K_t \hat s_{t|t}\) with \(\mu=\Omega(\min\{\sigma_v^2,\sigma_w^2\})\) $$\|\Theta_\star-\hat\Theta\|_F \leq \frac{C}{ \sqrt{N }}  $$
  • Proof sketch:
    • \(  \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star)\)
    • \( \leq  \sum_{t=1}^T C_1 \epsilon_t^2 \)
    • \( \leq \sum_{t=1}^T C_1 \frac{C_2^2}{t} \)
    • \( \leq C_1 C_2^2 (\log T +1)\)
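The last step uses the harmonic-sum bound $$\textstyle \sum_{t=1}^T \tfrac{1}{t} \leq 1 + \int_1^T \tfrac{ds}{s} = 1 + \log T$$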


Recap

  • Epsilon-greedy policy
  • Persistence of excitation
  • Sub-optimality

Next time: Policy optimization

Announcements

  • Next assignment posted, due next Thursday
  • Paper presentations start on 10/28

15 - Adaptive Control - ML in Feedback Sys F25

By Sarah Dean
