# Safely Learning to Control the Linear Quadratic Regulator

## Motivation for Safe Learning

High performance in the real world involves complex dynamics and safety constraints.

• How can we learn while maintaining safety?
• How well do we have to know a system to safely control it?

## Problem Setting

• performance optimization

• safety constraint satisfaction

• stochastic process noise and uncertain dynamics

minimize  $$\mathbb{E}[\text{cost}(x_0,u_0,x_1,\dots)]$$

s.t.  $$x_{t+1} = Ax_t + Bu_t + w_t$$

$$x_t\in\mathcal X,~~u_t\in\mathcal U$$ for all $$t$$ and all $$\|w_t\|\leq \sigma_w$$

Initial estimates: $$\|\widehat A_0 - A\|\leq \varepsilon_A$$, $$\|\widehat B_0 - B\|\leq \varepsilon_B$$

Goal: Analyze learning and performance in the presence of state and input constraints

### Learn Dynamics

1. Run system with control $$u_t = \eta_t +\mathbf{K}_0(x_t, x_{t-1},\dots)$$
2. Least squares estimation on trajectory $$\{(x_t, u_t)\}_{t=0}^T$$
3. Synthesize new robust controller $$\widehat{\mathbf K}$$ using estimates

Persistent excitation

• stochastic and bounded

Robust control

• process noise
• injected excitation
• dynamics uncertainty
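The first two steps of the procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a static gain `K0` in place of the general controller $$\mathbf{K}_0(x_t, x_{t-1},\dots)$$, draws noise and excitation uniformly from boxes of radius `sigma_w` and `sigma_eta` (one possible bounded, zero-mean choice), and leaves out step 3, the robust synthesis. The function name `collect_and_estimate` is hypothetical.

```python
import numpy as np

def collect_and_estimate(A, B, K0, T, sigma_w, sigma_eta, seed=0):
    """Roll out u_t = K0 x_t + eta_t for T steps, then fit (A, B) by least squares.

    Assumptions for illustration: K0 is a static gain, and both the process
    noise and the excitation are uniform on a box (bounded, zero mean).
    """
    rng = np.random.default_rng(seed)
    n, d = B.shape
    x = np.zeros(n)
    Z = np.zeros((T, n + d))      # rows are covariates z_t = [x_t; u_t]
    X_next = np.zeros((T, n))     # rows are the next states x_{t+1}
    for t in range(T):
        eta = rng.uniform(-sigma_eta, sigma_eta, size=d)   # injected excitation
        u = K0 @ x + eta
        w = rng.uniform(-sigma_w, sigma_w, size=n)         # process noise
        x_next = A @ x + B @ u + w
        Z[t] = np.concatenate([x, u])
        X_next[t] = x_next
        x = x_next
    # Solve min_Theta ||Z Theta - X_next||_F^2, where Theta = [A B]^T.
    Theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    A_hat = Theta[:n].T
    B_hat = Theta[n:].T
    return A_hat, B_hat
```

Calling it with a known system and a stabilizing gain returns estimates whose error shrinks as `T` grows; the true `A` and `B` are only used to simulate the rollout.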

### Main Result:

With $$M$$ denoting the safety margin cost gap of $$\mathbf{K}_*$$, as long as $$T$$ is large enough,

$$\text{rel. error of }\widehat{\mathbf K}\lesssim \frac{\sigma_w C_u}{\sigma_\eta} \sqrt{\frac{n+d}{T}}\, \|\text{CL}(A,B,\mathbf K_*)\|_{H_\infty} (1+M) +M$$

The terms of the bound:

• $$\frac{\sigma_w}{\sigma_\eta}$$: SNR $$=\frac{\text{process noise}}{\text{excitation}}$$
• $$\sqrt{\frac{n+d}{T}}$$: sample complexity
• $$\|\text{CL}(A,B,\mathbf K_*)\|_{H_\infty}$$: robustness of optimal controller
• $$M$$: safety margin optimal cost gap


Ingredients:

1. Statistical learning rate
2. Robust control for safety during learning
3. Sub-optimality analysis of robust control

## Finite Sample Learning Rate

Assume that the process noise $$w_t$$ and excitation $$\eta_t$$ are zero mean, independent over time, and have fourth moments bounded by $$\sigma_w$$ and $$\sigma_\eta$$, respectively.

### Informal Theorem (Learning):

For stabilizing control of the form $$u_t = \mathbf{K}(x_t, x_{t-1},\dots) + \eta_t$$ and large enough $$T$$, we have with probability $$1-\delta$$

$$\Big\|\begin{bmatrix} \widehat A - A \\ \widehat B - B\end{bmatrix}\Big\| \lesssim \frac{\sigma_w C_u}{\sigma_\eta} \sqrt{\frac{n+d }{T} \log(1/\delta)}$$

where $$C_u$$ is the gain from disturbance to control input and $$(\widehat A, \widehat B)$$ is the least squares estimate

$$(\widehat A, \widehat B) \in \arg\min_{A,B} \sum_{t=0}^T \|Ax_t +B u_t - x_{t+1}\|^2$$
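For reference, the least squares estimate above has a closed form, which also shows where the estimation error comes from. The stacked covariate $$z_t$$ is notation introduced here for illustration, and the formula assumes the matrix $$\sum_t z_t z_t^\top$$ is invertible, which is what the excitation $$\eta_t$$ is meant to ensure:

$$z_t = \begin{bmatrix}x_t\\u_t\end{bmatrix},\qquad \begin{bmatrix}\widehat A & \widehat B\end{bmatrix} = \Big(\sum_{t=0}^{T} x_{t+1} z_t^\top\Big)\Big(\sum_{t=0}^{T} z_t z_t^\top\Big)^{-1}$$

Substituting $$x_{t+1} = \begin{bmatrix}A & B\end{bmatrix} z_t + w_t$$ gives

$$\begin{bmatrix}\widehat A - A & \widehat B - B\end{bmatrix} = \Big(\sum_{t=0}^{T} w_t z_t^\top\Big)\Big(\sum_{t=0}^{T} z_t z_t^\top\Big)^{-1}$$

so the error is driven by the process noise and shrinks as the covariates become better excited.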

## Maintaining Safety with System Level Synthesis

Instead of reasoning about a controller $$\mathbf{K}$$, we reason about the interconnection $$\mathbf\Phi$$ directly.

[Figure: block diagram of the plant $$(A,B)$$ in feedback with the controller $$\mathbf{K}$$, with signals $$\mathbf x$$, $$\mathbf u$$, $$\mathbf w$$ and interconnection $$\mathbf{\Phi}$$]

$$\begin{bmatrix} \mathbf{x}\\ \mathbf{u}\end{bmatrix} = \text{CL}\left(\begin{bmatrix} A & B\\ I & 0\end{bmatrix},\mathbf{K}\right) \mathbf{w}$$

This correspondence holds for all $$\mathbf\Phi$$ constrained to lie in an affine space defined by the true dynamics:

$$\begin{bmatrix}zI- A&- B\end{bmatrix} \mathbf\Phi = I$$
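In the time domain, the affine constraint above says that the impulse response elements satisfy $$\Phi_x(1) = I$$ and $$\Phi_x(k+1) = A\Phi_x(k) + B\Phi_u(k)$$. The sketch below checks this numerically for the special case of a static gain $$u_t = Kx_t$$, where $$\Phi_x(k) = (A+BK)^{k-1}$$ and $$\Phi_u(k) = K\Phi_x(k)$$. The helper names and the finite horizon are assumptions made for illustration.

```python
import numpy as np

def static_gain_response(A, B, K, horizon):
    """Return lists [Phi_x(1), ..., Phi_x(horizon)] and [Phi_u(1), ...] for u = K x."""
    n = A.shape[0]
    Phi_x, Phi_u = [], []
    M = np.eye(n)                      # (A + BK)^0
    for _ in range(horizon):
        Phi_x.append(M)
        Phi_u.append(K @ M)
        M = (A + B @ K) @ M
    return Phi_x, Phi_u

def affine_constraint_residual(A, B, Phi_x, Phi_u):
    """Largest violation of Phi_x(1) = I and Phi_x(k+1) = A Phi_x(k) + B Phi_u(k)."""
    n = A.shape[0]
    res = np.linalg.norm(Phi_x[0] - np.eye(n))
    for k in range(len(Phi_x) - 1):
        step = Phi_x[k + 1] - A @ Phi_x[k] - B @ Phi_u[k]
        res = max(res, np.linalg.norm(step))
    return res
```

A residual near zero confirms that the response generated by a controller lies in the affine space; perturbing any element of `Phi_x` makes the residual nonzero.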

### Constrained LQR with System Level Synthesis

The constrained LQR problem

minimize  $$\mathbb{E}[\text{cost}(x_0,u_0,x_1,\dots)]$$

s.t.  $$x_{t+1} = Ax_t + Bu_t + w_t$$

$$x_t\in\mathcal X,~~u_t\in\mathcal U$$ for all $$t$$ and all $$\|w_t\|\leq \sigma_w$$

can be written in terms of the system response as

minimize  $$\text{cost}(\mathbf{\Phi})$$

s.t. $$\begin{bmatrix}zI- A&- B\end{bmatrix} \mathbf\Phi = I$$ (achievable subspace)

$$\mathbf\Phi\in \text{constraints}_{\sigma_w}(\mathcal{X},\mathcal{U})$$ (polytope constraints)

### Robust Constrained LQR with System Level Synthesis

The robust problem

$$\underset{\mathbf u=\mathbf{Kx}}{\min}~\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}~\mathbb{E}[\text{cost}(x_0,u_0,x_1,\dots)]$$

s.t.  $$x_{t+1} = Ax_t + Bu_t + w_t$$

$$x_t\in\mathcal X,~~u_t\in\mathcal U$$ for all $$t$$ and all $$A, B, w_t$$

is addressed through the synthesis problem

$$\underset{\mathbf{\Phi}}{\min}~\frac{1}{1-\gamma}\text{cost}(\mathbf{\Phi})$$ (robust cost)

s.t. $$\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I$$ (nominal achievable subspace)

$$\Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\Big\|_{H_\infty}\leq\gamma,~ \Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf\Phi\Big\|_{L_1}\leq\tau$$ (sensitivity constraints)

$$\mathbf\Phi\in \text{constraints}_{\sigma_w,\tau}(\mathcal{X},\mathcal{U})$$ (tightened polytope constraints)
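To make the two sensitivity norms concrete, the sketch below evaluates them for a finite impulse response. It is an illustration under stated assumptions rather than the synthesis procedure itself: $$\mathbf\Phi$$ is truncated to a finite list of matrices, the $$L_1$$ norm is computed as the induced $$\ell_\infty \to \ell_\infty$$ gain (largest absolute row sum over all taps), and the $$H_\infty$$ norm is approximated on a frequency grid, which can only under-estimate the true value. All function names are hypothetical.

```python
import numpy as np

def weighted_response(Phi_x, Phi_u, eps_A, eps_B):
    """Stack [eps_A * Phi_x(k); eps_B * Phi_u(k)] for each tap k."""
    return [np.vstack([eps_A * Px, eps_B * Pu]) for Px, Pu in zip(Phi_x, Phi_u)]

def l1_norm(impulse):
    """Induced l_inf -> l_inf gain of an FIR system: max absolute row sum over taps."""
    total = sum(np.abs(M) for M in impulse)
    return float(np.max(total.sum(axis=1)))

def hinf_norm_grid(impulse, n_freq=512):
    """Grid approximation (a lower bound) of the H_inf norm of an FIR system."""
    best = 0.0
    for theta in np.linspace(0.0, np.pi, n_freq):
        # Frequency response G(e^{j theta}) = sum_k Phi(k) e^{-j theta k}, k = 1, 2, ...
        G = sum(M * np.exp(-1j * theta * (k + 1)) for k, M in enumerate(impulse))
        best = max(best, float(np.linalg.svd(G, compute_uv=False)[0]))
    return best
```

A candidate response is consistent with the sensitivity constraints when both values, computed on `weighted_response(...)`, stay below the chosen $$\gamma$$ and $$\tau$$.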

## Maintaining Safety While Learning

### Informal Theorem (Safety):

Learning with any feasible $$\mathbf K$$ with $$0\leq\gamma,\tau< 1$$ yields a stable interconnection and satisfies the state and input constraints for any system in the uncertainty set.

find $$\mathbf\Phi$$ s.t. $$\begin{bmatrix}zI- \widehat A_0&- \widehat B_0\end{bmatrix} \mathbf\Phi = I$$

$$\Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\Big\|_{H_\infty}\leq\gamma,~ \Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf\Phi\Big\|_{L_1}\leq\tau,$$

$$\mathbf\Phi\in \text{constraints}_{\tilde\sigma_w,\tau}(\mathcal{X},\mathcal{U}_{\sigma_\eta})$$

where $$\tilde\sigma_w = \sigma_w+(\|\widehat B\|+\varepsilon_B)\sigma_\eta$$, which treats the injected excitation as an additional disturbance.

## Suboptimality Analysis

Robust synthesis using estimated dynamics:

$$\underset{\gamma,\tau}{\min} ~\underset{\mathbf{\Phi}}{\min}~\frac{1}{1-\gamma}\text{cost}(\mathbf{\Phi})$$

s.t. $$\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I$$

$$\Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\Big\|_{H_\infty}\leq\gamma,~ \Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf\Phi\Big\|_{L_1}\leq\tau,$$

$$\mathbf\Phi\in \text{constraints}_{\sigma_w,\tau}(\mathcal{X},\mathcal{U})$$

### Informal Theorem (Suboptimality):

The relative suboptimality is bounded as

$$\frac{\text{cost}(\widehat{\mathbf K})-\text{cost}(\mathbf{K}_*) }{\text{cost}(\mathbf{K}_*)}\leq 4\sqrt{2}(1+M) \Big\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi_*\Big\|_{H_\infty}+M$$

where $$M$$ is the safety margin sub-optimality gap of the optimal controller.
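As a quick numerical reading of the suboptimality bound $$4\sqrt{2}(1+M)\|\cdot\|_{H_\infty}+M$$, the helper below evaluates its right-hand side. The function name and the example numbers are purely illustrative; the weighted $$H_\infty$$ norm of $$\mathbf\Phi_*$$ is taken as a given input rather than computed.

```python
import math

def suboptimality_bound(weighted_hinf_norm, M):
    """Right-hand side of the relative suboptimality bound.

    weighted_hinf_norm: H_inf norm of the epsilon-weighted optimal response (input).
    M: safety margin sub-optimality gap of the optimal controller (input).
    """
    return 4.0 * math.sqrt(2.0) * (1.0 + M) * weighted_hinf_norm + M
```

The bound is linear in the weighted norm, so halving the estimation errors $$\varepsilon_A, \varepsilon_B$$ halves the first term, while the $$M$$ term is unaffected by additional data.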

### Example: Constrained Double Integrator

The double integrator dynamics are   $$x_{t+1} = \begin{bmatrix}1&0.1\\0&1\end{bmatrix}x_t + \begin{bmatrix}0\\1\end{bmatrix}u_t + w_t$$

[Figure: learning with $$u_t = \eta_t +\mathbf{K}_0(x_t, x_{t-1},\dots)$$]

[Figure: controlling the system with $$\widehat{\mathbf{K}}$$]
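A closed-loop rollout of the double integrator can be sketched as follows. The matrices are the ones given for the example; everything else is an assumption for illustration: the static gain passed in is a stand-in (it is not the synthesized $$\widehat{\mathbf K}$$), the box constraints `x_max` and `u_max` are hypothetical, and the noise is drawn uniformly from a box of radius `sigma_w`.

```python
import numpy as np

# Double integrator dynamics from the example.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])

def count_violations(K, T, sigma_w, x_max, u_max, seed=0):
    """Simulate u_t = K x_t and count steps where |x|_inf > x_max or |u|_inf > u_max.

    K is a hypothetical static gain; x_max and u_max are illustrative box limits.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    violations = 0
    for _ in range(T):
        u = K @ x
        if np.max(np.abs(x)) > x_max or np.max(np.abs(u)) > u_max:
            violations += 1
        w = rng.uniform(-sigma_w, sigma_w, size=2)   # bounded process noise
        x = A @ x + B @ u + w
    return violations
```

An empirical count of zero is only evidence, not a guarantee; the constraint sets in the synthesis problems are what certify safety for every admissible noise sequence.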

## Future Work

• Online analysis of adaptive control
• Output feedback
• Receding horizon control
• Nonlinear dynamics

## Thank you! Questions?

S. Dean, S. Tu, N. Matni, and B. Recht, Safely Learning to Control the Constrained Linear Quadratic Regulator. arXiv:1809.10121

Based on work supported by NSF Graduate Research Fellowship under Grant No. DGE 1752814

# Backup Slides + Details

In the Gaussian case, the statistical bound comes from

$$\Big\|\begin{bmatrix} \widehat A - A \\ \widehat B - B\end{bmatrix}\Big\| \lesssim\sqrt{\frac{\sigma_w^2(n+d ) }{T\lambda_{\min}(\Sigma_{x,u})} }$$

For open loop Gaussian inputs (result due to [Simchowitz et al. 2018]):

$$\Sigma_{x,u}= \sum_{k=0}^\infty A^k (\sigma_w^2 I + \sigma_u^2BB^\top)(A^k)^\top$$

Now with $$u_k = Kx_k + \eta_k$$:

$$\Sigma_{x,u}= \begin{bmatrix} \Sigma & \Sigma K^\top \\ K\Sigma & K\Sigma K^\top + \sigma_u^2 I\end{bmatrix}$$

with $$\Sigma= \sum_{k=0}^\infty (A+BK)^k (\sigma_w^2 I + \sigma_u^2BB^\top)((A+BK)^k)^\top$$
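The closed-loop covariance and the quantity $$\lambda_{\min}(\Sigma_{x,u})$$ that enters the bound can be computed numerically. The sketch below evaluates the infinite sum for $$\Sigma$$ by fixed-point iteration of $$\Sigma \leftarrow (A+BK)\Sigma(A+BK)^\top + Q$$, which converges when $$A+BK$$ is stable. The function names and the iteration count are assumptions for illustration.

```python
import numpy as np

def stationary_state_cov(A_cl, Q, iters=5000):
    """Approximate sum_k A_cl^k Q (A_cl^k)^T by iterating Sigma <- A_cl Sigma A_cl^T + Q."""
    Sigma = np.zeros_like(Q)
    for _ in range(iters):
        Sigma = A_cl @ Sigma @ A_cl.T + Q
    return Sigma

def joint_cov(A, B, K, sigma_w, sigma_u):
    """Return (Sigma, Sigma_xu) for u_k = K x_k + eta_k with excitation variance sigma_u^2."""
    n, d = B.shape
    Q = sigma_w**2 * np.eye(n) + sigma_u**2 * (B @ B.T)
    Sigma = stationary_state_cov(A + B @ K, Q)
    top = np.hstack([Sigma, Sigma @ K.T])
    bottom = np.hstack([K @ Sigma, K @ Sigma @ K.T + sigma_u**2 * np.eye(d)])
    return Sigma, np.vstack([top, bottom])
```

The smallest eigenvalue, `np.linalg.eigvalsh(Sigma_xu)[0]`, is strictly positive whenever `sigma_u > 0`, which is why injected excitation keeps the estimation bound finite.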

### System Level Synthesis

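For a static gain $$u_t = Kx_t$$, with the convention $$w_{-1} = x_0$$:

$$x_t = \sum_{k=1}^{t+1} (A+BK)^{k-1}w_{t-k}$$

$$u_t = \sum_{k=1}^{t+1} K(A+BK)^{k-1}w_{t-k}$$

In general, the system response is

$$x_t = \sum_{k=1}^{t+1} \Phi_x(k) w_{t-k}$$

$$u_t = \sum_{k=1}^{t+1} \Phi_u(k) w_{t-k}$$

and the controller is recovered as $$\mathbf{K} = \mathbf{\Phi_u} \mathbf{\Phi_x}^{-1}$$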

The specific form of the constraints, for $$\mathcal{X} = \{x : Fx\leq b\}$$ where $$F_j$$ are rows of $$F$$:

$$\text{constraints}_{\sigma_w}(\mathcal{X}) = \{ \mathbf\Phi : F_j^\top \Phi(k+1)x_0 + \sigma_w\|F_j^\top[\Phi(k) ~\dots~\Phi(1)]\|_1 \leq b_j ~~\forall~j,k\}$$

The robust constraint condition is instead

$$F_j^\top \Phi(k+1)x_0 + \sigma_w\|F_j^\top[\Phi(k) ~\dots~\Phi(1)]\|_1 + \max(\sigma_w,\|x_0\|_\infty)\frac{\tau}{1-\tau}\|F_j^\top[\Phi(k+1) ~\dots~\Phi(1)]\|_1 \leq b_j$$
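The nominal (non-robust) polytope condition can be checked directly for a finite impulse response. The sketch below is an illustration under stated assumptions: `Phi_x` is a finite list with `Phi_x[i]` standing for $$\Phi(i+1)$$, the index $$k$$ is taken to start at 0 (so the first check is $$Fx_0 \leq b$$), and the function name is hypothetical. The robust version would add the $$\frac{\tau}{1-\tau}$$ term in the same way.

```python
import numpy as np

def state_constraints_hold(F, b, Phi_x, x0, sigma_w, tol=1e-9):
    """Check F_j^T Phi(k+1) x0 + sigma_w * ||F_j^T [Phi(k) ... Phi(1)]||_1 <= b_j for all j, k.

    Phi_x[i] holds Phi(i+1); k ranges over 0, ..., len(Phi_x) - 1 (finite horizon assumed).
    """
    for k in range(len(Phi_x)):
        nominal = F @ Phi_x[k] @ x0                 # effect of the initial state
        if k > 0:
            # [Phi(k) ... Phi(1)] stacked side by side; row-wise 1-norm bounds the noise effect.
            stacked = np.hstack(Phi_x[:k][::-1])
            margin = sigma_w * np.sum(np.abs(F @ stacked), axis=1)
        else:
            margin = np.zeros(F.shape[0])
        if np.any(nominal + margin > b + tol):
            return False
    return True
```

Because the 1-norm term is the worst case over all noise sequences with $$\|w_t\|_\infty \leq \sigma_w$$, passing this check certifies the state constraint over the horizon rather than for one sampled trajectory.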

Under the dynamics uncertainty, the system response is

$$\mathbf{\Phi}(I+\mathbf\Delta)^{-1}\mathbf w$$

where $$\mathbf \Delta = [\varepsilon_A~~\varepsilon_B]\mathbf\Phi$$, so the robust synthesis constraints essentially come from considering the expanded noise process

$$\tilde{\mathbf w} = (I+\mathbf\Delta)^{-1}\mathbf w = \mathbf w - \mathbf\Delta(I+\mathbf\Delta)^{-1}\mathbf w$$

In contrast, in MPC it is common to model the uncertainty as an additive disturbance, $$\tilde w_k = \Delta_A x_k + \Delta_B u_k + w_k$$, that is,

$$\tilde{\mathbf w} = \mathbf w + \Delta_A \mathbf x + \Delta_B \mathbf u$$