Adaptive Control

ML in Feedback Sys #15

Fall 2025, Prof Sarah Dean

Adaptive Control

"What we do"

  • Given a model class (e.g. autoregressive), a model-based method for designing a policy (e.g. MPC), and exploration schedule \(\sigma_t\)
  • For \(t=0,1,2,...\)
    • Estimate a model \(\hat\Theta\) using observations so far \(\{(y_k,a_k,c_k)\}_{k=0}^{t-1}\)
    • Design a policy \(\hat\pi_t\) using the estimated model
    • Sample exploration noise \(z_t\sim\mathcal N(0,\sigma^2_t)\)
    • Select action \(a_t = \hat \pi_t (y_{0:t}, a_{0:t-1}) + z_t\)
  • This is called the "epsilon-greedy" approach (see the sketch below)
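A minimal Python/NumPy sketch of this loop, assuming hypothetical helpers `estimate_model` (e.g. least squares over the model class) and `design_policy` (e.g. MPC or LQR on the estimate), plus a generic `env` interface; none of these names come from the lecture.

```python
import numpy as np

def epsilon_greedy(env, estimate_model, design_policy, sigma, T, action_dim):
    """Epsilon-greedy adaptive control (sketch).

    Hypothetical interfaces:
      estimate_model(obs, actions, costs) -> Theta_hat   (fit on data up to t-1)
      design_policy(Theta_hat) -> policy(obs, actions)   (certainty-equivalent design)
      env.step(a) -> (y, c)                              (next observation and cost)
      sigma(t) -> exploration standard deviation at time t
    """
    obs, actions, costs = [], [], []
    y = env.reset()
    for t in range(T):
        Theta_hat = estimate_model(obs, actions, costs)   # estimate from {(y_k, a_k, c_k)}, k < t
        policy = design_policy(Theta_hat)                 # design \hat\pi_t from the estimate
        obs.append(y)                                     # y_t is now available to the policy
        z = sigma(t) * np.random.randn(action_dim)        # exploration noise z_t ~ N(0, sigma_t^2 I)
        a = policy(obs, actions) + z                      # a_t = \hat\pi_t(y_{0:t}, a_{0:t-1}) + z_t
        y, c = env.step(a)
        actions.append(a)
        costs.append(c)
    return obs, actions, costs
```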

Adaptive Control

"What we do" (simplified)

  • Given a model class (e.g. autoregressive), a model-based method for designing a policy (e.g. MPC), and exploration horizon \(N\)
  • For \(N\) steps
    • Sample exploration noise \(z_t\sim\mathcal N(0,I)\)
    • Select action \(a_t = z_t\)
  • Estimate a model using observations \(\{(y_k,a_k,c_k)\}_{k=0}^{N-1}\)
  • Design a policy \(\hat\pi\) using the estimated model
  • For remaining \(T-N\) steps
    • Select action \(a_t = \hat \pi (y_{0:t}, a_{0:t-1}) \)
  • This is called "explore-then-commit" (see the sketch below)
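The same sketch for the simplified variant, using the same hypothetical interfaces as the epsilon-greedy sketch above:

```python
import numpy as np

def explore_then_commit(env, estimate_model, design_policy, N, T, action_dim):
    """Explore-then-commit (sketch): N steps of pure exploration, then a fixed policy."""
    obs, actions, costs = [], [], []
    y = env.reset()
    for t in range(N):                          # Phase 1: white-noise exploration
        obs.append(y)
        a = np.random.randn(action_dim)         # a_t = z_t ~ N(0, I)
        y, c = env.step(a)
        actions.append(a)
        costs.append(c)
    Theta_hat = estimate_model(obs, actions, costs)   # fit once on the N exploration steps
    policy = design_policy(Theta_hat)                 # design \hat\pi once
    for t in range(N, T):                       # Phase 2: commit to \hat\pi
        obs.append(y)
        a = policy(obs, actions)
        y, c = env.step(a)
        actions.append(a)
        costs.append(c)
    return obs, actions, costs
```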

Adaptive Control

"Why we do it"

  • Fact 1: For linear dynamics and linear models, exploration noise ensures persistence of excitation and we have that \(\|\hat\Theta - \Theta_\star\| \leq \frac{C }{\sigma \sqrt{N}}\)
  • Fact 2: For LQ control, an \(\epsilon\) error in model estimation results in an \(\epsilon^2\) sub-optimal policy for small enough \(\epsilon\) $$\|\hat\Theta - \Theta_\star\| \leq O(\epsilon) \implies J(\hat \pi) - J(\pi_\star) \leq O(\epsilon^2)$$
  • Fact 3: For LQ control, exploration horizon \(N=O(\sqrt{T})\) or exploration schedule \(\sigma^2_t = O(t^{-1/2})\) leads to average sub-optimality $$\textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star) \leq O(\frac{1}{\sqrt{T}} )$$
  • Fact 4: For PO-LQ control, measurement noise suffices for exploration so with \(\sigma_t=0\) we can achieve average sub-optimality $$ \textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star)\leq O(\frac{\mathrm{poly}\log(T)}{T}) $$

Persistence of Excitation

  • Fact 1: For linear dynamics and linear models, exploration noise ensures persistence of excitation and we have that \(\|\hat\Theta - \Theta_\star\| \leq \frac{C  }{\sigma \sqrt{N}}\)
  • Definition: covariates of a least-squares problem satisfy persistence of excitation if there exists \(H,\mu>0\) such that for all \(0\leq s\leq N-H\), $$\sum_{k=s}^{s+H} x_kx_k^\top\succeq \mu I$$
  • Recall the following from Lectures 10 and 11
    • Suppose \(\mathbb E[y_t|x_t] = \Theta_\star^\top x_t\) and \(y_t\) has bounded variance. Define \(V=\sum_{k=1}^N x_kx_k^\top \). Then with high probability,

      $$\|\Theta_\star-\hat\Theta\|_{V}^2 = tr\big((\Theta_\star-\hat\Theta)^\top V(\Theta_\star-\hat\Theta)\big)\leq \beta $$

  • Summing the excitation bound over the \(N/H\) disjoint windows gives \(V=\sum_{k=1}^N x_kx_k^\top \succeq (N/H)\mu I\)
  • By the definition of minimum eigenvalue, $$\|\Theta_\star-\hat\Theta\|_F\leq\sqrt{ \frac{\beta}{\lambda_{\min}(V)}} \leq \sqrt{ \frac{\beta H}{  \mu }} \frac{1}{\sqrt{N}}$$
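A quick numerical check of these quantities on synthetic i.i.d. Gaussian covariates (an assumption made only for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, H = 500, 4, 20
X = rng.normal(size=(N, d))                          # covariates x_1, ..., x_N

# mu: smallest eigenvalue of any length-H window of covariate outer products
mu = min(np.linalg.eigvalsh(X[s:s + H].T @ X[s:s + H]).min()
         for s in range(N - H + 1))

V = X.T @ X                                          # V = sum_k x_k x_k^T
print("windowed excitation mu:", mu)
print("lambda_min(V):", np.linalg.eigvalsh(V).min(),
      "vs (N/H) * mu:", (N / H) * mu)                # V >= (N/H) mu I when windows tile 1..N
```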

Linear Dynamics: State Observed

  • State observed linear dynamics $$s_{t+1} = Fs_t + Ga_t + w_t$$
  • Then set \(y_t\leftarrow s_{t+1}\), \(x_t \leftarrow \begin{bmatrix} s_t^\top & a_t^\top \end{bmatrix}^\top \), and \(\Theta_\star \leftarrow  \begin{bmatrix} F & G \end{bmatrix}^\top \)
  • This is learning an "autoregressive" model with \(L=1\) from \(N\) data points (simulated in the sketch below)
  • For random actions, persistence of excitation is guaranteed with high probability for \(\mu=\Omega(\sigma^2)\) $$\|\Theta_\star-\hat\Theta\|_F = \sqrt{\|F_\star-\hat F\|_F^2 + \|G_\star-\hat G\|_F^2}\leq \frac{C}{ \sigma\sqrt{N }}  $$
  • Why are random actions necessary?
    • Example: with \(a_t=0\), the action block of every \(x_t\) is zero, so \(V\) is singular and \(G\) cannot be identified
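A minimal simulation of this setting with made-up stable dynamics (the matrices and noise level below are illustrative only): excite the system with Gaussian actions of standard deviation \(\sigma\), fit \(\Theta=\begin{bmatrix} F & G\end{bmatrix}^\top\) by least squares, and observe the error shrink roughly like \(1/(\sigma\sqrt{N})\).

```python
import numpy as np

rng = np.random.default_rng(1)
F = np.array([[0.9, 0.1], [0.0, 0.8]])      # illustrative stable dynamics
G = np.array([[0.0], [1.0]])
Theta_star = np.hstack([F, G]).T            # Theta_star = [F  G]^T, shape (3, 2)

def estimation_error(N, sigma):
    s = np.zeros(2)
    X, Y = [], []
    for _ in range(N):
        a = sigma * rng.normal(size=1)                        # random exploratory action
        s_next = F @ s + G @ a + 0.1 * rng.normal(size=2)     # s_{t+1} = F s_t + G a_t + w_t
        X.append(np.concatenate([s, a]))                      # covariate x_t = [s_t; a_t]
        Y.append(s_next)                                      # regression target y_t = s_{t+1}
        s = s_next
    Theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return np.linalg.norm(Theta_hat - Theta_star)

for N in [200, 800, 3200]:
    print(N, estimation_error(N, sigma=1.0))    # error should roughly halve as N quadruples
```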

Linear Dynamics: Partially Observed

  • Partially observed linear dynamics $$s_{t+1} = Fs_t + Ga_t + w_t,\quad y_t=Hs_t + v_t$$
  • Consider an autoregressive model for some \(L\),
    • \(y_t\leftarrow y_{t+1}\) and \(x_t \leftarrow \begin{bmatrix}{\bar y}_{t:t-L+1}^\top & {\bar a}^\top_{t:t-L+1} \end{bmatrix}^\top \)
  • Claim: There exists some parameter such that \(\mathbb E[y_{t+1}|x_t] = \Theta_\star^\top x_t\)
    • The (steady-state) Kalman filter (where \(\tilde F=F-LHF\) and \(\tilde G=G-LHG\)) $$  \hat s_{t+1} = \tilde F\hat s_t + \tilde G a_t + Ly_{t+1},\quad \hat y_t = H\hat s_t$$

    • "Unrolling" shows linear model $$\hat s_{t} = \tilde F^{L}\hat s_{t-L}+ \sum_{k=1}^{L} \tilde F^{k-1}  (Ly_{t-k+1}+ \tilde Ga_{t-k}), \quad \hat y_{t+1} = H(F\hat s_t + Ga_t)$$

    • For truncated history, \(\mathbb E[ \hat s_{t-L}| x_t] = 0\) (due to zero mean initial state and exploration actions)

Linear Dynamics: Partially Observed

  • Partially observed linear dynamics $$s_{t+1} = Fs_t + Ga_t + w_t,\quad y_t=Hs_t + v_t$$
  • Consider an autoregressive model for some \(L\),
    • \(y_t\leftarrow y_{t+1}\) and \(x_t \leftarrow \begin{bmatrix}{\bar y}_{t:t-L+1}^\top & {\bar a}^\top_{t:t-L+1} \end{bmatrix}^\top \)
  • Claim: There exists some parameter such that \(\mathbb E[y_{t+1}|x_t] = \Theta_\star^\top x_t\)
    • "Unrolling" (steady-state) Kalman filter $$\mathbb E[y_{t+1}|x_t] = H F \Big(\sum_{k=1}^{L} \tilde F^{k-1}  (Ly_{t-k+1}+ \tilde Ga_{t-k})\Big)  + H Ga_t$$

    • Thus \(\Theta_\star\) depends on \(F,G,H\) and the Kalman gain \(L\) (a least-squares fit of \(\hat\Theta\) is sketched below)

  • For random actions, persistence of excitation is guaranteed with high probability for \(\mu=\Omega(\sigma^2)\) $$\|\Theta_\star-\hat\Theta\|_F \leq \frac{C}{ \sigma\sqrt{N }}  $$
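A sketch of fitting this order-\(L\) autoregressive model by least squares on data simulated from an illustrative partially observed system (the matrices and noise levels below are assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
F = np.array([[0.9, 0.2], [0.0, 0.7]])      # illustrative system
G = np.array([[0.0], [1.0]])
H = np.array([[1.0, 0.0]])
L, N = 5, 5000                              # ARX history length, number of steps

# Simulate with random exploratory actions
s, ys, acts = np.zeros(2), [], []
for _ in range(N):
    a = rng.normal(size=1)
    y = H @ s + 0.1 * rng.normal(size=1)                  # y_t = H s_t + v_t
    ys.append(y); acts.append(a)
    s = F @ s + G @ a + 0.1 * rng.normal(size=2)          # s_{t+1} = F s_t + G a_t + w_t
ys, acts = np.array(ys), np.array(acts)

# Covariate x_t stacks the last L outputs and actions; target is y_{t+1}
X = np.array([np.concatenate([ys[t - L + 1:t + 1][::-1].ravel(),
                              acts[t - L + 1:t + 1][::-1].ravel()])
              for t in range(L - 1, N - 1)])
Y = ys[L:]
Theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(Theta_hat.shape)                      # (2L, 1): coefficients mapping x_t to predicted y_{t+1}
```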

Control Sub-optimality

  • Fact 2: For LQ control, an \(\epsilon\) error in model estimation results in an \(\epsilon^2\) sub-optimal policy for small enough \(\epsilon\) $$\|\hat\Theta - \Theta_\star\| \leq O(\epsilon) \implies J(\hat \pi) - J(\pi_\star) \leq O(\epsilon^2)$$
  • Define \(J(\pi) =  \lim_{T\to\infty}\frac{1}{T}\mathbb E\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\) $$ \text{s.t}\quad s_{k+1} = F s_k+ Ga_k+w_k ,~~y_k=Hs_k+v_k,~~a_k=\pi(y_{0:k},a_{0:k-1})$$
  • Recall from last lecture that the optimal LQ policy depends on the solution to the discrete-time algebraic Riccati equation (DARE)
  • Mania et al. (2019) show an \(O(\epsilon^2)\) perturbation bound on the DARE
  • Parameters \(\hat F,\hat G,\hat H\) can be extracted from \(\hat \Theta\) (Oymak & Ozay, 2021) and then used to construct \(\hat K \) and \(\hat L\)
  • Feedback policy (see the Riccati iteration sketch below)
    • \(P_\star = Ricc(P_{\star}, F, G, Q, R)\)
    • \(K_\star = -(R+G^\top P_{\star}G)^{-1}G^\top P_{\star}F\)
  • Kalman filter
    • \(\Sigma_{\star} = Ricc(\Sigma_{\star}, F^\top, H^\top, \Sigma_w, \Sigma_v)\)
    • \(L_{\star} = F\Sigma_{\star}H^\top ( H\Sigma_{\star} H^\top+\Sigma_v)^{-1}\)
  • Fact 3: For LQ control, exploration horizon \(N=O(\sqrt{T})\) or exploration schedule \(\sigma^2_t = O(t^{-1/2})\) leads to average sub-optimality $$\textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star) \leq O(\frac{1}{\sqrt{T}} )$$
  • Proof sketch:
    • We will consider suboptimality separately during the two phases
    • \(  \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star)= \sum_{t=1}^N J(\pi^{exp}) - J(\pi_\star) +  \sum_{t=N+1}^T J(\hat \pi) - J(\pi_\star)\)
    • \(\leq \sum_{t=1}^N C_1 +  \sum_{t=N+1}^T C_2 \epsilon(N)^2\) where \(\epsilon(N)\) is the estimation error
    • \(=  C_1 N +  C_2 \epsilon(N)^2 (T-N)\)
    • \(\leq C_1 N +  C_2 \epsilon(N)^2 T\)
    • \(\leq C_1 N +  C_2 \frac{C_3^2}{\sqrt{N}^2 }T\)
    • \(= (C_1  +  C_2C_3^2)\sqrt{T}\) if we set \(N=\sqrt{T}\)
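A numerical sketch of the certainty-equivalent design above: iterate the Riccati recursion to (approximately) its fixed point, then form the feedback gain \(\hat K\) and Kalman gain \(\hat L\). The estimated parameters below are placeholders for whatever \(\hat F,\hat G,\hat H\) the identification step returns, not values from the lecture.

```python
import numpy as np

def riccati_fixed_point(F, G, Q, R, iters=2000):
    """Iterate P <- Q + F^T P F - F^T P G (R + G^T P G)^{-1} G^T P F to the DARE fixed point."""
    P = Q.copy()
    for _ in range(iters):
        GPF = G.T @ P @ F
        P = Q + F.T @ P @ F - GPF.T @ np.linalg.solve(R + G.T @ P @ G, GPF)
    return P

# Placeholder estimates and weights (illustrative values only)
F_hat = np.array([[0.9, 0.1], [0.0, 0.8]])
G_hat = np.array([[0.0], [1.0]])
H_hat = np.array([[1.0, 0.0]])
Q, R = np.eye(2), np.eye(1)
Sigma_w, Sigma_v = 0.1 * np.eye(2), 0.1 * np.eye(1)

# Feedback policy: K = -(R + G^T P G)^{-1} G^T P F
P_hat = riccati_fixed_point(F_hat, G_hat, Q, R)
K_hat = -np.linalg.solve(R + G_hat.T @ P_hat @ G_hat, G_hat.T @ P_hat @ F_hat)

# Kalman filter via the dual Riccati equation: Sigma solves the filter DARE
Sigma_hat = riccati_fixed_point(F_hat.T, H_hat.T, Sigma_w, Sigma_v)
L_hat = F_hat @ Sigma_hat @ H_hat.T @ np.linalg.inv(H_hat @ Sigma_hat @ H_hat.T + Sigma_v)
print("K_hat:", K_hat, "\nL_hat:", L_hat.ravel())
```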

State Feedback Control

  • Fact 3: For LQ control, exploration horizon \(N=O(\sqrt{T})\) or exploration schedule \(\sigma^2_t = O(t^{-1/2})\) leads to average sub-optimality $$\textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star) \leq O(\frac{1}{\sqrt{T}} )$$
  • Proof sketch:
    • We will separate the exploration noise from the policy
    • \(  \sum_{t=1}^T J(\hat \pi^\sigma_t) - J(\pi_\star)\approx  \sum_{t=1}^T J(\hat \pi^{\exp}_t) - J(\pi_\star)+\sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star) \)
    • \( \leq \sum_{t=1}^T C_1 \sigma_t^2 + \sum_{t=1}^T C_2 \epsilon_t(\sigma_t)^2 \)
    • \( \leq \sum_{t=1}^T C_1 \sigma_t^2 + \sum_{t=1}^T C_2 \frac{C_3^2}{\sigma_t^2 t} \)
    • \( = \sum_{t=1}^T C_1 t^{-1/2} + \sum_{t=1}^T C_2 \frac{C_3^2}{ t^{1/2}} \)
    • \( \leq 2(C_1 +C_2 {C_3^2})\sqrt{T} \)
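The final inequality uses the integral comparison $$\textstyle \sum_{t=1}^T t^{-1/2} \leq 1 + \int_1^T s^{-1/2}\,ds = 2\sqrt{T}-1 \leq 2\sqrt{T}$$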

Output Feedback Control

  • Fact 4: For PO-LQ control, measurement noise suffices for exploration so with \(\sigma_t=0\) we can achieve average sub-optimality $$ \textstyle \frac{1}{T} \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star)\leq O(\frac{\mathrm{poly}\log(T)}{T}) $$
  • Random actions are not required to ensure persistence of excitation; instead, it is guaranteed for the noiseless actions \(a_t=\hat K_t \hat s_{t|t}\) with \(\mu=\Omega(\min\{\sigma_v^2,\sigma_w^2\})\) $$\|\Theta_\star-\hat\Theta\|_F \leq \frac{C}{ \sqrt{N }}  $$
  • Proof sketch:
    • \(  \sum_{t=1}^T J(\hat \pi_t) - J(\pi_\star)\)
    • \( \leq  \sum_{t=1}^T C_1 \epsilon_t^2 \)
    • \( \leq \sum_{t=1}^T C_1 \frac{C_2^2}{t} \)
    • \( \leq C_1 C_2^2 (\log T +1)\)
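The last step uses the harmonic-sum bound $$\textstyle \sum_{t=1}^T \tfrac{1}{t} \leq 1 + \int_1^T \tfrac{ds}{s} = 1 + \log T$$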


Recap

  • Epsilon-greedy policy
  • Persistence of excitation
  • Sub-optimality

Next time: Policy optimization

Announcements

  • Next assignment posted, due next Thursday
  • Paper presentations start on 10/28

15 - Adaptive Control - ML in Feedback Sys F25

By Sarah Dean
