Safely Learning to Control the Linear Quadratic Regulator
Sarah Dean, UC Berkeley EECS
joint work with Stephen Tu, Nikolai Matni, and Ben Recht
American Control Conference 2019
Motivation for Safe Learning

High performance in the real world involves complex dynamics and safety constraints

- How can we learn while maintaining safety?
- How well do we have to know a system to safely control it?
Problem Setting
- performance optimization
- safety constraint satisfaction
- stochastic process noise and uncertain dynamics
minimize \(\mathbb{E}[\text{cost}(x_0,u_0,x_1,\dots)]\)
s.t. \(x_{t+1} = Ax_t + Bu_t + w_t\)
\(x_t\in\mathcal X,~~u_t\in\mathcal U\) for all \(t\) and all \(\|w_t\|\leq \sigma_w\)
Initial Estimates
\(\|\widehat A_0 - A\|\leq \varepsilon_A\), \(\|\widehat B_0 - B\|\leq \varepsilon_B\)

Goal: Analyze learning and performance in the presence of state and input constraints
How to Learn Dynamics while Maintaining Safety
Persistent Excitation
- stochastic and bounded
Robust control
- process noise
- injected excitation
- dynamics uncertainty

- Run system with control \(u_t = \eta_t +\mathbf{K}_0(x_t, x_{t-1},\dots)\)
- Least squares estimation on trajectory \(\{(x_t, u_t)\}_{t=0}^T\)
- Synthesize new robust controller \(\widehat{\mathbf K}\) using estimates
How well do we have to know a system to safely control it?
Main Result:
Where \(M\) is the safety margin cost gap of \(\mathbf{K}_*\), as long as \(T\) is large enough,
rel. error of \(\widehat{\mathbf K}\lesssim \frac{\sigma_w C_u}{\sigma_\eta} \sqrt{\frac{n+d}{T}} \|\)CL\((A,B,\mathbf K_*)\|_{H_\infty} (1+M) +M\)

SNR \(=\frac{\text{process noise}}{\text{excitation}}\)
sample complexity
safety margin optimal cost gap
robustness of optimal controller
Ingredients:
- Statistical learning rate
- Robust control for safety during learning
- Sub-optimality analysis of robust control
Finite Sample Learning Rate
Least squares estimate \((\widehat A, \widehat B) \in \arg\min_{(A,B)} \sum_{t=0}^{T-1} \|Ax_t +B u_t - x_{t+1}\|^2 \)
Assume that the process noise \(w_t\) and excitation \(\eta_t\) are zero mean, independent over time, and have fourth moments bounded by \(\sigma_w\) and \(\sigma_\eta\), respectively.
Informal Theorem (Learning):
For stabilizing control of the form \(u_t = \mathbf{K}(x_t, x_{t-1},\dots) + \eta_t\) and large enough \(T\), we have with probability at least \(1-\delta\)
\(\Big\|\begin{bmatrix} \widehat A - A \\ \widehat B - B\end{bmatrix}\Big\| \lesssim \frac{\sigma_w C_u}{\sigma_\eta} \sqrt{\frac{n+d }{T} \log(1/\delta)} \)
where \(C_u\) is the gain from disturbance to control input.
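A minimal sketch of the least squares step, assuming a static gain in place of the general controller \(\mathbf K\), uniformly distributed (hence bounded) noise and excitation, and a hypothetical stabilizing gain `K`. The function names and numerical values are illustrative and not taken from the talk.

```python
import numpy as np

def collect_trajectory(A, B, K, T, sigma_w, sigma_eta, rng):
    """Roll out x_{t+1} = A x_t + B u_t + w_t with u_t = K x_t + eta_t."""
    n, d = B.shape
    xs = np.zeros((T + 1, n))
    us = np.zeros((T, d))
    for t in range(T):
        eta = rng.uniform(-sigma_eta, sigma_eta, size=d)  # bounded excitation
        w = rng.uniform(-sigma_w, sigma_w, size=n)        # bounded process noise
        us[t] = K @ xs[t] + eta
        xs[t + 1] = A @ xs[t] + B @ us[t] + w
    return xs, us

def least_squares_estimate(xs, us):
    """Solve min over (A, B) of sum_t ||A x_t + B u_t - x_{t+1}||^2."""
    n = xs.shape[1]
    Z = np.hstack([xs[:-1], us])   # regressors [x_t, u_t], shape (T, n + d)
    Y = xs[1:]                     # targets x_{t+1}, shape (T, n)
    Theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # Theta = [A B]^T
    return Theta[:n].T, Theta[n:].T

# Illustrative system and gain (assumptions for this sketch).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.5, -1.0]])  # hypothetical stabilizing gain
rng = np.random.default_rng(0)
xs, us = collect_trajectory(A, B, K, T=2000, sigma_w=0.1, sigma_eta=0.5, rng=rng)
A_hat, B_hat = least_squares_estimate(xs, us)
err = np.linalg.norm(np.hstack([A_hat - A, B_hat - B]), 2)  # spectral norm of the error
```

Raising `sigma_eta` or `T` should shrink `err`, in line with the \(\frac{\sigma_w}{\sigma_\eta}\sqrt{1/T}\) scaling of the bound.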
Maintaining Safety with System Level Synthesis
Instead of reasoning about a controller \(\mathbf{K}\),
\(\begin{bmatrix} \mathbf{x}\\ \mathbf{u}\end{bmatrix} = \text{CL}\left(\begin{bmatrix} A & B \\ I & 0\end{bmatrix},\mathbf{K}\right) \mathbf{w} \)
we reason about the interconnection \(\mathbf\Phi\) directly.
\(\begin{bmatrix} \mathbf{x}\\ \mathbf{u}\end{bmatrix} = \mathbf{\Phi}\mathbf{w} \)
[Diagram: plant \((A,B)\) and controller \(\mathbf{K}\) with signals \(\mathbf x\), \(\mathbf u\), \(\mathbf w\); interconnection \(\mathbf{\Phi}\)]
This correspondence holds for all \(\mathbf\Phi\) constrained to lie in an affine space defined by the true dynamics
\(\begin{bmatrix}zI- A&- B\end{bmatrix} \mathbf\Phi = I\)
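In the time domain, the affine constraint reads \(\Phi_x(1) = I\) and \(\Phi_x(k+1) = A\Phi_x(k) + B\Phi_u(k)\). A minimal sketch that checks this numerically, assuming a static gain \(K\) (so that \(\Phi_x(k) = (A+BK)^{k-1}\) and \(\Phi_u(k) = K\Phi_x(k)\)) and a finite horizon; the helper names and the gain are hypothetical.

```python
import numpy as np

def static_gain_response(A, B, K, horizon):
    """Phi_x(k) = (A + B K)^(k-1) and Phi_u(k) = K Phi_x(k) for k = 1..horizon."""
    Phi_x = [np.eye(A.shape[0])]
    for _ in range(horizon - 1):
        Phi_x.append((A + B @ K) @ Phi_x[-1])
    Phi_u = [K @ P for P in Phi_x]
    return Phi_x, Phi_u

def affine_constraint_residual(A, B, Phi_x, Phi_u):
    """Largest violation of Phi_x(1) = I and Phi_x(k+1) = A Phi_x(k) + B Phi_u(k)."""
    res = np.linalg.norm(Phi_x[0] - np.eye(A.shape[0]))
    for k in range(len(Phi_x) - 1):
        step = Phi_x[k + 1] - A @ Phi_x[k] - B @ Phi_u[k]
        res = max(res, np.linalg.norm(step))
    return res

# Illustrative values (assumptions for this sketch).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.5, -1.0]])  # hypothetical stabilizing gain
Phi_x, Phi_u = static_gain_response(A, B, K, horizon=20)
residual_true = affine_constraint_residual(A, B, Phi_x, Phi_u)
# The same response is not achievable for a perturbed model of the dynamics.
residual_wrong = affine_constraint_residual(A + 0.05 * np.eye(2), B, Phi_x, Phi_u)
```

The nonzero residual under the perturbed model is the mismatch that the robust formulation has to account for.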
Constrained LQR with System Level Synthesis
minimize \(\mathbb{E}[\text{cost}(x_0,u_0,x_1,\dots)]\)
s.t. \(x_{t+1} = Ax_t + Bu_t + w_t\)
\(x_t\in\mathcal X,~~u_t\in\mathcal U\) for all \(t\) and all \(\|w_t\|\leq \sigma_w\)

minimize \(\text{cost}(\mathbf{\Phi})\) (quadratic cost)
s.t. \(\begin{bmatrix}zI- A&- B\end{bmatrix} \mathbf\Phi = I\) (achievable subspace)
\( \mathbf\Phi\in \text{constraints}_{\sigma_w}(\mathcal{X},\mathcal{U})\) (polytope constraints)
Robust Constrained LQR with System Level Synthesis
\( \underset{\mathbf u=\mathbf{Kx}}{\min}~\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}~\mathbb{E}[\text{cost}(x_0,u_0,x_1,\dots)]\)
s.t. \(x_{t+1} = Ax_t + Bu_t + w_t\)
\(x_t\in\mathcal X,~~u_t\in\mathcal U\) for all \(t\) and all \(A, B, w_t\)

\( \underset{\mathbf{\Phi}}{\min}~\frac{1}{1-\gamma}\text{cost}(\mathbf{\Phi})\) (robust cost)
s.t. \(\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\) (nominal achievable subspace)
\(\Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\Big\|_{H_\infty}\leq\gamma,~ \Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf\Phi\Big\|_{L_1}\leq\tau\) (sensitivity constraints)
\( \mathbf\Phi\in \text{constraints}_{\sigma_w,\tau}(\mathcal{X},\mathcal{U})\) (tightened polytope constraints)
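A minimal sketch of how the two sensitivity norms could be evaluated for a finite impulse response, assuming the response is truncated to a list of matrices \(\Phi_x(k)\), \(\Phi_u(k)\). The \(L_1\) norm is the largest absolute row sum over the whole response. The \(H_\infty\) norm is only approximated on a frequency grid, which gives a lower bound. All function names are hypothetical.

```python
import numpy as np

def weighted_response(Phi_x, Phi_u, eps_A, eps_B):
    """Stack [eps_A * Phi_x(k); eps_B * Phi_u(k)] for each k."""
    return [np.vstack([eps_A * Px, eps_B * Pu]) for Px, Pu in zip(Phi_x, Phi_u)]

def l1_norm_fir(M):
    """Induced l_inf -> l_inf gain: max over rows of the summed absolute entries."""
    return np.abs(np.hstack(M)).sum(axis=1).max()

def hinf_norm_fir_grid(M, n_freq=512):
    """Grid approximation (lower bound) of the H_inf norm of sum_k M(k) z^{-k}."""
    best = 0.0
    for omega in np.linspace(0.0, np.pi, n_freq):
        G = sum(Mk * np.exp(-1j * omega * (k + 1)) for k, Mk in enumerate(M))
        best = max(best, np.linalg.svd(G, compute_uv=False)[0])
    return best

# Scalar example: M(1) = 1, M(2) = 0.5, so both norms equal 1.5.
M_example = [np.array([[1.0]]), np.array([[0.5]])]
tau_example = l1_norm_fir(M_example)
gamma_example = hinf_norm_fir_grid(M_example)
```

In a synthesis problem these quantities would be convex constraints on the decision variable \(\mathbf\Phi\) rather than values computed after the fact; the sketch only evaluates them for a given response.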
Maintaining Safety While Learning
find \(\mathbf\Phi\)
s.t. \(\begin{bmatrix}zI- \widehat A_0&- \widehat B_0\end{bmatrix} \mathbf\Phi = I\)
\(\Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\Big\|_{H_\infty}\leq\gamma,~ \Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf\Phi\Big\|_{L_1}\leq\tau\)
\( \mathbf\Phi\in \text{constraints}_{\tilde\sigma_w,\tau}(\mathcal{X},\mathcal{U}_{\sigma_\eta})\)
where \(\tilde\sigma_w = \sigma_w+(\|\widehat B\|+\varepsilon_B)\sigma_\eta\)
Informal Theorem (Safety):
Using any feasible \(\mathbf K\) with \(0\leq\gamma,\tau< 1\) for learning yields a stable interconnection that satisfies the state and input constraints for any system in the uncertainty set.
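A short derivation of where the inflated noise bound \(\tilde\sigma_w\) comes from, as a sketch that assumes the excitation satisfies \(\|\eta_t\|\leq\sigma_\eta\) and that \(\|\cdot\|\) on \(B\) is the matrix norm induced by the vector norm used for \(w_t\) and \(\eta_t\). Substituting \(u_t = \mathbf K(x_t, x_{t-1},\dots) + \eta_t\) into the dynamics folds the excitation into the disturbance:
\(x_{t+1} = Ax_t + B\,\mathbf K(x_t, x_{t-1},\dots) + (w_t + B\eta_t)\)
\(\|w_t + B\eta_t\| \leq \|w_t\| + \|B\|\,\|\eta_t\| \leq \sigma_w + \|B\|\sigma_\eta\)
\(\|B\| \leq \|\widehat B\| + \|B - \widehat B\| \leq \|\widehat B\| + \varepsilon_B\)
so the effective disturbance is bounded by \(\sigma_w + (\|\widehat B\| + \varepsilon_B)\sigma_\eta\).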
Suboptimality Analysis
Robust synthesis using estimated dynamics:
\(\underset{\gamma,\tau}{\min} ~\underset{\mathbf{\Phi}}{\min}~\frac{1}{1-\gamma}\text{cost}(\mathbf{\Phi})\)
s.t. \(\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\)
\(\Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\Big\|_{H_\infty}\leq\gamma,~ \Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf\Phi\Big\|_{L_1}\leq\tau\)
\( \mathbf\Phi\in \text{constraints}_{\sigma_w,\tau}(\mathcal{X},\mathcal{U})\)
Informal Theorem (Suboptimality):
The relative suboptimality is bounded as
\(\frac{\text{cost}(\widehat{\mathbf{K}})-\text{cost}(\mathbf{K}_*) }{\text{cost}(\mathbf{K}_*)}\leq 4\sqrt{2}(1+M) \Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi_*\Big\|_{H_\infty}+M\)
where \(M\) is the safety margin sub-optimality gap of the optimal controller.
Example: Constrained Double Integrator
The double integrator dynamics \(x_{t+1} = \begin{bmatrix}1&0.1\\0&1\end{bmatrix}x_t + \begin{bmatrix}0\\1\end{bmatrix}u_t + w_t\)
[Figure: Learning with \(u_t = \eta_t +\mathbf{K}_0(x_t, x_{t-1},\dots)\)]
[Figure: Controlling the system with \(\widehat{\mathbf{K}}\)]
Future Work
- Online analysis of adaptive control
- Output feedback
- Receding horizon control
- Nonlinear dynamics
Thank you! Questions?
S. Dean, S. Tu, N. Matni, and B. Recht, "Safely Learning to Control the Constrained Linear Quadratic Regulator," arXiv:1809.10121
Based on work supported by the NSF Graduate Research Fellowship under Grant No. DGE 1752814




Backup Slides + Details
In the Gaussian case, the statistical bound comes from
\(\Big\|\begin{bmatrix} \widehat A - A \\ \widehat B - B\end{bmatrix}\Big\| \lesssim\sqrt{\frac{\sigma_w^2(n+d ) }{T\lambda_{\min}(\Sigma_{x,u})} } \)
For open loop Gaussian inputs (result due to [Simchowitz et al. 2018]),
\(\Sigma_{x,u}= \sum_{k=0}^\infty A^k (\sigma_w^2 I + \sigma_u^2BB^\top)(A^k)^\top \)
Now with \(u_k = Kx_k + \eta_k\),
\(\Sigma_{x,u}= \begin{bmatrix} \Sigma & \Sigma K^\top \\ K\Sigma & K\Sigma K^\top + \sigma_\eta^2 I\end{bmatrix}\)
with \(\Sigma= \sum_{k=0}^\infty (A+BK)^k (\sigma_w^2 I + \sigma_\eta^2BB^\top)((A+BK)^k)^\top \)
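A minimal sketch of evaluating the closed loop covariance and its smallest eigenvalue, assuming a stabilizing static gain and writing `sigma_eta` for the standard deviation of the excitation \(\eta_k\). The infinite sum is computed as the fixed point of \(\Sigma = (A+BK)\Sigma(A+BK)^\top + Q\); the function names and values are hypothetical.

```python
import numpy as np

def stationary_state_covariance(A_cl, Q, tol=1e-12, max_iter=100000):
    """Sum_{k>=0} A_cl^k Q (A_cl^k)^T via the iteration Sigma <- A_cl Sigma A_cl^T + Q."""
    Sigma = Q.copy()
    for _ in range(max_iter):
        nxt = A_cl @ Sigma @ A_cl.T + Q
        if np.max(np.abs(nxt - Sigma)) < tol:
            return nxt
        Sigma = nxt
    raise RuntimeError("iteration did not converge; is A_cl stable?")

def joint_covariance(A, B, K, sigma_w, sigma_eta):
    """Covariance of (x, u) under u = K x + eta with Gaussian w and eta."""
    n, d = B.shape
    Q = sigma_w**2 * np.eye(n) + sigma_eta**2 * (B @ B.T)
    Sigma = stationary_state_covariance(A + B @ K, Q)
    return np.block([[Sigma, Sigma @ K.T],
                     [K @ Sigma, K @ Sigma @ K.T + sigma_eta**2 * np.eye(d)]])

# Illustrative values (assumptions for this sketch).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.5, -1.0]])  # hypothetical stabilizing gain
S = joint_covariance(A, B, K, sigma_w=0.1, sigma_eta=0.5)
lam_min = np.linalg.eigvalsh(S)[0]  # eigvalsh returns eigenvalues in ascending order
```

A larger `lam_min` means better excited data and hence a smaller statistical bound.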
System Level Synthesis
With static feedback \(u_t = Kx_t\) and the convention \(w_{-1} = x_0\):
\(x_t = \sum_{k=1}^{t+1} (A+BK)^{k-1}w_{t-k}\)
\(u_t = \sum_{k=1}^{t+1} K(A+BK)^{k-1}w_{t-k}\)
More generally, in terms of the system response elements:
\(x_t = \sum_{k=1}^{t+1} \Phi_x(k) w_{t-k}\)
\(u_t = \sum_{k=1}^{t+1} \Phi_u(k) w_{t-k}\)
\(\begin{bmatrix} \mathbf{x}\\ \mathbf{u}\end{bmatrix} = \begin{bmatrix} \mathbf{\Phi_x}\\ \mathbf{\Phi_u} \end{bmatrix} \mathbf{w} \)
\(\mathbf{K} = \mathbf{\Phi_u} \mathbf{\Phi_x}^{-1} \)
The specific form of the constraints for \(\mathcal{X} = \{x : Fx\leq b\}\), where \(F_j^\top\) are the rows of \(F\), is
\(\text{constraints}_{\sigma_w}(\mathcal{X}) = \{ \mathbf\Phi : F_j^\top \Phi(k+1)x_0 + \sigma_w\|F_j^\top[\Phi(k) ~\dots~\Phi(1)]\|_1 \leq b_j ~~\forall~j,k\}\)
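A minimal sketch of checking the nominal polytope constraint for a given finite response, assuming the list `Phi` holds the state response matrices \(\Phi(1),\dots,\Phi(H)\) and that the noise is bounded in the \(\ell_\infty\) norm, so that the worst case of \(F_j^\top\sum_i \Phi(i)w\) is \(\sigma_w\) times an \(\ell_1\) norm. The function name and the scalar example are hypothetical.

```python
import numpy as np

def worst_case_margin(F, b, Phi, x0, sigma_w):
    """Smallest slack b_j - (F_j^T Phi(k+1) x0 + sigma_w ||F_j^T [Phi(k) ... Phi(1)]||_1).

    Phi[i] stores Phi(i+1). A nonnegative return value means every constraint
    holds at every step of the horizon for all ||w_t||_inf <= sigma_w.
    """
    margins = []
    for k in range(len(Phi)):
        nominal = F @ Phi[k] @ x0                 # F_j^T Phi(k+1) x0 for all rows j
        if k > 0:
            stacked = np.hstack(Phi[:k][::-1])    # [Phi(k) ... Phi(1)]
            noise = sigma_w * np.abs(F @ stacked).sum(axis=1)
        else:
            noise = np.zeros(F.shape[0])          # no noise has entered yet
        margins.append(b - nominal - noise)
    return float(np.min(margins))

# Scalar example: Phi(k) = 0.5^(k-1), constraint |x| <= 2, x0 = 1, sigma_w = 0.5.
F = np.array([[1.0], [-1.0]])
b = np.array([2.0, 2.0])
Phi = [np.array([[0.5 ** i]]) for i in range(3)]
margin = worst_case_margin(F, b, Phi, x0=np.array([1.0]), sigma_w=0.5)
```

Here the tightest step leaves a slack of 1.0, so the constraint is satisfied with room to spare; shrinking `b` below 1.0 would make the check fail.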
The robust constraint condition is instead
\( F_j^\top \Phi(k+1)x_0 + \sigma_w\|F_j^\top[\Phi(k) ~...~\Phi(1)]\|_1 + \max(\sigma_w,\|x_0\|_\infty)\frac{\tau}{1-\tau}\|F_j^\top[\Phi(k+1) ~...~\Phi(1)]\|_1 \leq b_j \)
Under the dynamics uncertainty, the system response is
\(\mathbf{\Phi}(I+\mathbf\Delta)^{-1}\mathbf w\)
where \(\mathbf \Delta = [\varepsilon_A~~\varepsilon_B]\mathbf\Phi\), so the robust synthesis constraints essentially come from considering the expanded noise process
\(\tilde{\mathbf w} = (I+\mathbf\Delta)^{-1}\mathbf w = \mathbf w - \mathbf\Delta(I+\mathbf\Delta)^{-1}\mathbf w\)
vs. the additive disturbance model common in MPC, \(\tilde w_k = \Delta_A x_k + \Delta_B u_k + w_k\):
\(\tilde{\mathbf w} = \mathbf w + \Delta_A \mathbf x + \Delta_B \mathbf u \)