Safely Learning to Control the Linear Quadratic Regulator

Sarah Dean, UC Berkeley EECS

joint work with Stephen Tu, Nikolai Matni, and Ben Recht

American Controls Conference 2019

Motivation for Safe Learning

High performance in the real world involves complex dynamics and safety constraints

  • How can we learn while maintaining safety?
  • How well do we have to know a system to safely control it?

Problem Setting

  • performance optimization
     
  • safety constraint satisfaction
     
  • stochastic process noise
    and uncertain dynamics

minimize  \(\mathbb{E}[\)cost\((x_0,u_0,x_1...)]\)

s.t.  \(x_{t+1} = Ax_t + Bu_t + w_t\)

\(x_t\in\mathcal X,~~u_t\in\mathcal U\) for all \(t\)

Initial Estimates
\(\|\widehat A_0 - A\|\leq \epsilon_A\), \(\|\widehat B_0 - B\|\leq \epsilon_B\)

        and all \(\|w_t\|\leq \sigma_w\)

Goal: Analyze learning and performance in the presence of state and input constraints

Learn Dynamics + Robust Control

The procedure interleaves learning (driven by process noise and injected excitation) with robust control (which must handle the remaining dynamics uncertainty); steps 1-2 are sketched in code below:

  1. Run the system with control \(u_t = \eta_t +\mathbf{K}_0(x_t, x_{t-1}...)\)
  2. Least squares estimation on the trajectory \(\{(x_t, u_t)\}_{t=0}^T\)
  3. Synthesize a new robust controller \(\widehat{\mathbf K}\) using the estimates
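To make steps 1-2 concrete, here is a minimal numpy sketch (not the authors' code): it assumes a static stabilizing gain `K0` in place of the general controller \(\mathbf K_0\) and Gaussian noise and excitation; the robust synthesis of step 3 is left to the SLS programs described later.

```python
import numpy as np

def collect_and_estimate(A, B, K0, T, sigma_w=0.1, sigma_eta=1.0, rng=None):
    """Steps 1-2: roll out u_t = K0 x_t + eta_t, then least-squares fit of (A, B)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = B.shape
    x = np.zeros(n)
    X, U, Xnext = [], [], []
    for _ in range(T):
        eta = sigma_eta * rng.standard_normal(d)      # injected excitation
        u = K0 @ x + eta                              # exploratory control
        w = sigma_w * rng.standard_normal(n)          # process noise
        x_next = A @ x + B @ u + w
        X.append(x); U.append(u); Xnext.append(x_next)
        x = x_next
    # Least squares: x_{t+1} ~ [A B] [x_t; u_t]
    Z = np.hstack([np.array(X), np.array(U)])         # shape (T, n + d)
    Theta, *_ = np.linalg.lstsq(Z, np.array(Xnext), rcond=None)
    A_hat, B_hat = Theta.T[:, :n], Theta.T[:, n:]
    return A_hat, B_hat
```

With sufficient excitation, the operator-norm error of \((\widehat A, \widehat B)\) shrinks roughly like \(\frac{\sigma_w}{\sigma_\eta}\sqrt{(n+d)/T}\), as formalized below.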

How to maintain safety while persistently exciting the system with stochastic, bounded inputs?

How well do we have to know a system to safely control it?

Main Result:

Where \(M\) is the safety margin cost gap of \(\mathbf{K}_*\), as long as \(T\) is large enough,

rel. error of \(\widehat{\mathbf K}\lesssim \frac{\sigma_w C_u}{\sigma_\eta} \sqrt{\frac{n+d}{T}}\, \|\mathrm{CL}(A,B,\mathbf K_*)\|_{H_\infty} (1+M) +M\)

  • \(\frac{\sigma_w C_u}{\sigma_\eta}\): process noise relative to excitation (inverse SNR)
  • \(\sqrt{\frac{n+d}{T}}\): sample complexity
  • \(\|\mathrm{CL}(A,B,\mathbf K_*)\|_{H_\infty}\): robustness of the optimal controller
  • \(M\): safety margin optimal cost gap

Ingredients:

  1. Statistical learning rate
  2. Robust control for safety during learning
  3. Sub-optimality analysis of robust control

Finite Sample Learning Rate

Informal Theorem (Learning):

For stabilizing control of the form \(u_t = \mathbf{K}(x_t, x_{t-1}...) + \eta_t\) and large enough \(T\), we have w.p. \(1-\delta\)

 

\(\Big\|\begin{bmatrix} \widehat A - A \\ \widehat B - B\end{bmatrix}\Big\| \lesssim \frac{\sigma_w C_u}{\sigma_\eta}  \sqrt{\frac{n+d  }{T} \log(1/\delta)} \)

where \(C_u\) is the gain from disturbance to control input.

Here the least squares estimate is \((\widehat A, \widehat B) \in \arg\min \sum_{t=0}^T \|Ax_t +B u_t - x_{t+1}\|^2\), and we assume that the process noise \(w_t\) and excitation \(\eta_t\) are zero mean, independent over time, with fourth moments bounded by \(\sigma_w\) and \(\sigma_\eta\).

Maintaining Safety with System Level Synthesis

Instead of reasoning about a controller \(\mathbf{K}\) in feedback with the plant \((A,B)\) (disturbance \(\mathbf w\), state \(\mathbf x\), input \(\mathbf u\)), we reason about the closed-loop interconnection \(\mathbf\Phi\) directly:

\(\begin{bmatrix} \mathbf{x}\\ \mathbf{u}\end{bmatrix} = \mathbf{\Phi}\mathbf{w} = \mathrm{CL}\left(\begin{bmatrix} A & B\\ I & 0\end{bmatrix},\mathbf{K}\right) \mathbf{w} \)

This correspondence holds for all \(\mathbf\Phi\) constrained to lie in an affine space defined by the true dynamics

\(\begin{bmatrix}zI- A&- B\end{bmatrix} \mathbf\Phi = I\)
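As a sanity check (for the special case of static state feedback \(u = Kx\), an assumption for this illustration), the induced responses

\(\mathbf\Phi_x = (zI - A - BK)^{-1}, \qquad \mathbf\Phi_u = K(zI - A - BK)^{-1}\)

satisfy the affine constraint:

\(\begin{bmatrix} zI - A & -B \end{bmatrix}\begin{bmatrix} \mathbf\Phi_x \\ \mathbf\Phi_u \end{bmatrix} = (zI - A)\mathbf\Phi_x - BK\mathbf\Phi_x = (zI - A - BK)(zI - A - BK)^{-1} = I,\)

and conversely \(\mathbf K = \mathbf\Phi_u \mathbf\Phi_x^{-1}\) recovers the controller.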

Constrained LQR with System Level Synthesis

The constrained LQR problem

minimize  \(\mathbb{E}[\)cost\((x_0,u_0,x_1...)]\)

s.t.  \(x_{t+1} = Ax_t + Bu_t + w_t\)

        \(x_t\in\mathcal X,~~u_t\in\mathcal U\) for all \(t\)

        and \(\|w_t\|\leq \sigma_w\) for all \(t\)

becomes an equivalent program over system responses, with a quadratic cost, the achievable subspace as an affine constraint, and polytope constraints:

minimize  cost(\(\mathbf{\Phi}\))

s.t. \(\begin{bmatrix}zI- A&- B\end{bmatrix} \mathbf\Phi = I\)

       \( \mathbf\Phi\in\) constraints\(_{\sigma_w}\)(\(\mathcal{X},\mathcal{U}\))
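As an illustration (not the exact formulation from the paper), a finite impulse response (FIR) truncation of this program can be written directly in cvxpy; the horizon, cost weights `Q`, `R`, and the single set of state constraints are all assumptions of this sketch.

```python
import cvxpy as cp
import numpy as np

def sls_constrained_lqr(A, B, Q, R, Fx, bx, sigma_w, horizon=20):
    """FIR System Level Synthesis sketch for constrained LQR (x0 = 0).

    Decision variables are the impulse responses Phi_x(k), Phi_u(k), k = 1..horizon;
    state constraints Fx @ x <= bx are enforced for all ||w_t||_inf <= sigma_w
    via l1 norms of the response rows.
    """
    n, d = B.shape
    Phi_x = [cp.Variable((n, n)) for _ in range(horizon)]
    Phi_u = [cp.Variable((d, n)) for _ in range(horizon)]

    constraints = [Phi_x[0] == np.eye(n)]
    for k in range(horizon - 1):                      # achievability: [zI-A, -B] Phi = I
        constraints += [Phi_x[k + 1] == A @ Phi_x[k] + B @ Phi_u[k]]
    constraints += [A @ Phi_x[-1] + B @ Phi_u[-1] == 0]   # FIR closure

    # Steady-state LQR cost (up to the sigma_w^2 scaling):
    # sum_k ||Q^{1/2} Phi_x(k)||_F^2 + ||R^{1/2} Phi_u(k)||_F^2
    Qh, Rh = np.linalg.cholesky(Q), np.linalg.cholesky(R)
    cost = sum(cp.sum_squares(Qh.T @ Px) + cp.sum_squares(Rh.T @ Pu)
               for Px, Pu in zip(Phi_x, Phi_u))

    # Polytope state constraints, tightened by sigma_w (cf. constraints_{sigma_w}(X))
    for k in range(1, horizon):
        resp = cp.hstack(Phi_x[:k])                   # [Phi_x(1) ... Phi_x(k)]
        for j in range(Fx.shape[0]):
            constraints += [sigma_w * cp.sum(cp.abs(Fx[j, :] @ resp)) <= bx[j]]

    prob = cp.Problem(cp.Minimize(cost), constraints)
    prob.solve()
    return [P.value for P in Phi_x], [P.value for P in Phi_u]
```

Input constraints from \(\mathcal U\) would be handled identically via the rows of \(\Phi_u\).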

Robust Constrained LQR with System Level Synthesis

The robust constrained LQR problem

\( \underset{\mathbf u=\mathbf{Kx}}{\min}\)  \(\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}\) \(\mathbb{E}[\)cost\((x_0,u_0,x_1...)]\)

s.t.  \(x_{t+1} = Ax_t + Bu_t + w_t\)

        \(x_t\in\mathcal X,~~u_t\in\mathcal U\) for all \(t\)

        for all admissible \(A\), \(B\), and all \(\|w_t\|\leq\sigma_w\)

is addressed by a robust SLS program with a robust cost, the nominal achievable subspace, sensitivity constraints, and tightened polytope constraints:

\( \underset{\mathbf{\Phi}}{\min}~\frac{1}{1-\gamma}\,\text{cost}(\mathbf{\Phi})\)

\(\text{s.t.}~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\)

       \(\|[{\varepsilon_A\atop ~} ~{~\atop \varepsilon_B}]\mathbf \Phi\|_{H_\infty}\leq\gamma,~ \|[{\varepsilon_A\atop ~} ~{~\atop \varepsilon_B}]\mathbf\Phi\|_{L_1}\leq\tau, \)

       \( \mathbf\Phi\in \text{constraints}\)\(_{\sigma_w,\tau}\)(\(\mathcal{X},\mathcal{U})\)


Maintaining Safety While Learning

Informal Theorem (Safety):

Using any \(\mathbf K\) that is feasible for the program below with \(0\leq\gamma,\tau< 1\) during learning yields a stable interconnection and satisfies the state and input constraints for every system in the uncertainty set.

\(\text{find}~\mathbf\Phi~~\text{s.t.}~~\begin{bmatrix}zI- \widehat A_0&- \widehat B_0\end{bmatrix} \mathbf\Phi = I\)

       \(\|[{\varepsilon_A\atop ~} ~{~\atop \varepsilon_B}]\mathbf \Phi\|_{H_\infty}\leq\gamma,~ \|[{\varepsilon_A\atop ~} ~{~\atop \varepsilon_B}]\mathbf\Phi\|_{L_1}\leq\tau, \)

       \( \mathbf\Phi\in \text{constraints}\)\(_{\tilde\sigma_w,\tau}\)(\(\mathcal{X},\mathcal{U}_{\sigma_\eta})\)

where \(\tilde\sigma_w = \sigma_w+(\|\widehat B_0\|+\varepsilon_B)\sigma_\eta\) accounts for the injected excitation entering the dynamics through the input.
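A sketch of why feasibility with \(\gamma<1\) gives safety during learning (using the uncertainty expansion detailed in the backup slides): for any admissible errors \(\|\Delta_A\|\leq\varepsilon_A\), \(\|\Delta_B\|\leq\varepsilon_B\), the true closed-loop response is

\(\begin{bmatrix}\mathbf x\\ \mathbf u\end{bmatrix} = \mathbf\Phi(I+\mathbf\Delta)^{-1}\mathbf w, \qquad \mathbf\Delta = [\Delta_A~~\Delta_B]\mathbf\Phi,\)

and \(\|\mathbf\Delta\|_{H_\infty}\lesssim \|[{\varepsilon_A\atop ~} ~{~\atop \varepsilon_B}]\mathbf \Phi\|_{H_\infty}\leq\gamma<1\) (up to a small constant handled explicitly in the paper), so \((I+\mathbf\Delta)^{-1}\) is stable and the interconnection remains stable; the \(L_1\) bound \(\tau\) plays the analogous role for the tightened state and input constraints.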


Informal Theorem (Suboptimality):

The relative suboptimality is bounded as

\(\frac{\text{cost}(\widehat{\mathbf{K}})-\text{cost}(\mathbf{K}_*) }{\text{cost}(\mathbf{K}_*)}\leq 4\sqrt{2}(1+M) \|[{\varepsilon_A\atop ~} ~{~\atop \varepsilon_B}]\mathbf \Phi_*\|_{H_\infty}+M\)

where \(M\) is the safety margin sub-optimality gap of the optimal controller.

Suboptimality Analysis

Robust synthesis using estimated dynamics:

\(\underset{\gamma,\tau}{\min} ~\underset{\mathbf{\Phi}}{\min}~\frac{1}{1-\gamma}\,\text{cost}(\mathbf{\Phi})\)

\(\text{s.t.}~~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\)

       \(\|[{\varepsilon_A\atop ~} ~{~\atop \varepsilon_B}]\mathbf \Phi\|_{H_\infty}\leq\gamma,~ \|[{\varepsilon_A\atop ~} ~{~\atop \varepsilon_B}]\mathbf\Phi\|_{L_1}\leq\tau, \)

       \( \mathbf\Phi\in \text{constraints}\)\(_{\sigma_w,\tau}\)(\(\mathcal{X},\mathcal{U})\)


Example: Constrained Double Integrator

The double integrator dynamics are   \(x_{t+1} = \begin{bmatrix}1&0.1\\0&1\end{bmatrix}x_t + \begin{bmatrix}0\\1\end{bmatrix}u_t + w_t\)

Learning with \(u_t = \eta_t +\mathbf{K}_0(x_t, x_{t-1}...)\)

Controlling the system with \(\widehat{\mathbf{K}}\)
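A minimal simulation sketch of this example (the LQR weights, noise levels, box constraint, and the certainty-equivalent gain standing in for \(\mathbf K_0\) are all assumptions here, not the paper's robust SLS controller):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Double integrator from the example (dt = 0.1)
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])

# A nominal LQR gain used as K0 (weights are illustrative assumptions)
Q, R = np.eye(2), np.array([[1.0]])
P = solve_discrete_are(A, B, Q, R)
K0 = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u = K0 x

# Roll out u_t = K0 x_t + eta_t and count violations of |x| <= x_max
rng = np.random.default_rng(0)
sigma_w, sigma_eta, x_max = 0.05, 0.5, 3.0
x, violations = np.zeros(2), 0
for t in range(500):
    u = K0 @ x + sigma_eta * rng.uniform(-1, 1, size=1)
    x = A @ x + B @ u + sigma_w * rng.uniform(-1, 1, size=2)
    violations += np.any(np.abs(x) > x_max)
print("constraint violations:", violations)
```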

Future Work

  • Online analysis of adaptive control
  • Output feedback
  • Receding horizon control
    • nonlinear dynamics

Thank you! Questions?

S. Dean, S. Tu, N. Matni, and B. Recht, Safely Learning to Control the Constrained Linear Quadratic Regulator. arXiv:1809.10121

Based on work supported by NSF Graduate Research Fellowship under Grant No. DGE 1752814

Backup Slides + Details

In the Gaussian case, the statistical bound comes from

\(\Big\|\begin{bmatrix} \widehat A - A \\ \widehat B - B\end{bmatrix}\Big\| \lesssim\sqrt{\frac{\sigma_w^2(n+d ) }{T\lambda_{\min}(\Sigma_{x,u})} } \)

Open-loop Gaussian inputs (result due to [Simchowitz et al. 2018])

where \(\Sigma_{x,u}= \sum_{k=0}^\infty A^k (\sigma_w^2 I + \sigma_u^2BB^\top)(A^k)^\top \)

Now with \(u_k = Kx_k + \eta_k\):

where \(\Sigma_{x,u}= \begin{bmatrix} \Sigma & \Sigma K^\top \\ K\Sigma & K\Sigma K^\top + \sigma_u^2 I\end{bmatrix}\)

with \(\Sigma= \sum_{k=0}^\infty (A+BK)^k (\sigma_w^2 I + \sigma_u^2BB^\top)((A+BK)^k)^\top \)
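Since the learning rate depends on \(\lambda_{\min}(\Sigma_{x,u})\), this covariance can be computed numerically; a short sketch using scipy's discrete Lyapunov solver (the gain \(K\) and noise levels are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def excitation_covariance(A, B, K, sigma_w, sigma_u):
    """Closed-loop state covariance Sigma and the joint (x, u) covariance."""
    A_cl = A + B @ K
    # Sigma = A_cl Sigma A_cl^T + sigma_w^2 I + sigma_u^2 B B^T
    Sigma = solve_discrete_lyapunov(A_cl, sigma_w**2 * np.eye(A.shape[0])
                                    + sigma_u**2 * B @ B.T)
    n, d = B.shape
    Sigma_xu = np.block([[Sigma,      Sigma @ K.T],
                         [K @ Sigma,  K @ Sigma @ K.T + sigma_u**2 * np.eye(d)]])
    return Sigma, Sigma_xu

# Example: the minimum eigenvalue governs the estimation error bound
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.7, -1.2]])            # assumed stabilizing gain
_, Sigma_xu = excitation_covariance(A, B, K, sigma_w=0.05, sigma_u=0.5)
print("lambda_min(Sigma_xu) =", np.linalg.eigvalsh(Sigma_xu).min())
```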

System Level Synthesis

\(x_t =  \sum_{k=0}^{t-1} (A+BK)^{k}w_{t-1-k}\)

\(u_t =  \sum_{k=0}^{t-1} K(A+BK)^{k}w_{t-1-k}\)

\(x_t =  \sum_{k=1}^{t} \Phi_x(k) w_{t-k}\)

\(u_t =  \sum_{k=1}^{t} \Phi_u(k) w_{t-k}\)

\(\begin{bmatrix} \mathbf{x}\\ \mathbf{u}\end{bmatrix} = \begin{bmatrix} \mathbf{\Phi_x}\\ \mathbf{\Phi_u} \end{bmatrix} \mathbf{w} \)

\(\mathbf{K} =  \mathbf{\Phi_u} \mathbf{\Phi_x}^{-1} \)

The specific form of the constraints

 constraints\(_{\sigma_w}\)(\(\mathcal{X}\))\( = \{ F_j^\top \Phi(k+1)x_0 + \sigma_w\|F_j^\top[\Phi(k) ~...~\Phi(1)]\|_1 \leq b_j ~~\forall~j,k\}\)

for \(\mathcal{X} = \{Fx\leq b\}\)

where \(F_j\) are rows of \(F\)

The robust constraint condition is instead

\(  F_j^\top \Phi(k+1)x_0 + \sigma_w\|F_j^\top[\Phi(k) ~...~\Phi(1)]\|_1 + \max(\sigma_w,\|x_0\|_\infty)\frac{\tau}{1-\tau}\|F_j^\top[\Phi(k+1) ~...~\Phi(1)]\|_1 \leq b_j \)

Under the dynamics uncertainty, the true system response is

\(\mathbf{\Phi}(I+\mathbf\Delta)^{-1}\mathbf w\)

where \(\mathbf \Delta = [\Delta_A~~\Delta_B]\mathbf\Phi\) with \(\|\Delta_A\|\leq\varepsilon_A\) and \(\|\Delta_B\|\leq\varepsilon_B\), so the robust synthesis constraints essentially come from considering the expanded noise process

\(\tilde{\mathbf w} = (I+\mathbf\Delta)^{-1}\mathbf w = \mathbf w - \mathbf\Delta(I+\mathbf\Delta)^{-1}\mathbf w.\)
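To see where the \(\frac{\tau}{1-\tau}\) factor in the tightened constraint above comes from, a short sketch using submultiplicativity of the induced \(L_1\) norm and a Neumann-series bound:

\(\|\mathbf\Delta(I+\mathbf\Delta)^{-1}\|_{L_1} \leq \|\mathbf\Delta\|_{L_1}\,\|(I+\mathbf\Delta)^{-1}\|_{L_1} \leq \frac{\|\mathbf\Delta\|_{L_1}}{1-\|\mathbf\Delta\|_{L_1}} \leq \frac{\tau}{1-\tau} \quad\text{whenever}~\|\mathbf\Delta\|_{L_1}\leq\tau<1,\)

so the extra term \(\mathbf\Delta(I+\mathbf\Delta)^{-1}\mathbf w\) perturbs each constraint row by at most \(\max(\sigma_w,\|x_0\|_\infty)\frac{\tau}{1-\tau}\) times the corresponding \(\ell_1\) row norm.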

For comparison, in MPC it is common to model the uncertainty in an additive-disturbance manner, \(\tilde w_k = \Delta_A x_k + \Delta_B u_k + w_k\), i.e.

\(\tilde{\mathbf w} = \mathbf w + \Delta_A \mathbf x + \Delta_B \mathbf u \)