# Sample Complexity of the Linear Quadratic Regulator

Sarah Dean¹    Horia Mania¹    Nikolai Matni²    Ben Recht¹    Stephen Tu³

¹University of California, Berkeley           ²University of Pennsylvania           ³Google Brain

## Motivation

The classic RL setting features discrete problems inspired by games. Here we consider RL techniques applied to continuous systems interacting with the physical world.

## Setting: Feedback Control

Control theory:

• long history of studying feedback
• relies on fully specified models
• specified models

Optimal Control Problem:

minimize $$\mathbb{E}\left[\sum_{t=0}^T \mathrm{cost}(x_t, u_t)\right]$$

s.t.  $$x_{t+1} = f_t(x_t, u_t, w_t)$$

*(Block diagram: the controller maps the state $$x$$ to an input $$u$$; the plant maps the input and noise to the next state and a cost.)*

## Linear Quadratic Regulator

Simplest OCP: linear dynamics, quadratic cost, Gaussian process noise

minimize $$\mathbb{E}\left[ \displaystyle\lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^T x_t^\top Q x_t + u_t^\top R u_t\right]$$

s.t.  $$x_{t+1} = Ax_t+Bu_t+w_t$$

A static feedback controller is optimal and can be computed in closed form:

$$u_t = \underbrace{-(R+B^\top P B)^{-1} B^\top P A}_{K_\star}x_t$$

where $$P=\mathrm{DARE}(A,B,Q,R)$$ also defines the value function $$V(x) = x^\top P x$$.
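As a concrete sketch of the closed-form solution above (using SciPy's `solve_discrete_are`; the function name `lqr_gain` is ours, not from the paper):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Return the optimal static gain K_star (so u_t = K_star @ x_t)
    and the value-function matrix P solving the DARE."""
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P
```

For stabilizable $$(A, B)$$ and positive definite costs, the closed loop $$A + BK_\star$$ is stable.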

## Sample Complexity Problem

How many observations are necessary to control an unknown system?

### Main Result (Informal):

As long as $$N$$ is large enough, with probability at least $$1-\delta$$,
$$\mathrm{rel.~error~of}~\widehat{\mathbf K}\lesssim \frac{\mathrm{size~of~noise}}{\mathrm{size~of~excitation}} \sqrt{\frac{\mathrm{dimension}}{N} \log(1/\delta)} \cdot\mathrm{robustness~of~}K_\star$$

*(Diagram: excitation drives the unknown system $$(A_\star, B_\star)$$, producing $$N$$ observations.)*
## Coarse-ID Control

1. Collect $$N$$ observations and estimate $$\widehat A,\widehat B$$ and confidence intervals

2. Use estimates to synthesize robust controller $$\widehat{\mathbf{K}}$$

*(Diagram: data collected from $$(A_\star, B_\star)$$ yields model estimates; the synthesized robust controller $$\widehat{\mathbf{K}}$$ is then deployed on $$(A_\star, B_\star)$$.)*

## System Identification

Least squares estimate: $$(\widehat A, \widehat B) \in \arg\min \sum_{\ell=1}^N \|Ax_{T}^{(\ell)} +B u_{T}^{(\ell)} - x_{T+1}^{(\ell)}\|^2$$

### Learning Result:

As long as $$N\gtrsim n+p+\log(1/\delta)$$, with probability at least $$1-\delta$$,

$$\|\widehat A - A_\star\| \lesssim \frac{\sigma_w}{\sqrt{\lambda_{\min}(\sigma_u^2 G_T G_T^\top + \sigma_w^2 F_T F_T^\top )}} \sqrt{\frac{n+p }{N} \log(1/\delta)}$$,        $$\|\widehat B - B_\star\| \lesssim \frac{\sigma_w}{\sigma_u} \sqrt{\frac{n+p }{N} \log(1/\delta)}$$

with the input and noise controllability matrices defined as

$$G_T = \begin{bmatrix}A_\star^{T-1}B_\star&A_\star^{T-2}B_\star&\dots&B_\star\end{bmatrix} \qquad F_T = \begin{bmatrix}A_\star^{T-1}&A_\star^{T-2}&\dots&I\end{bmatrix}$$

*(Diagram: excite $$(A_\star, B_\star)$$ with inputs $$u_t^{(\ell)} \sim \mathcal{N}(0, \sigma_u^2 I)$$ and observe the resulting states $$\{x_t^{(\ell)}\}$$.)*
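A minimal simulation of this identification experiment (our own sketch; `estimate_dynamics` and its default parameters are illustrative, not from the paper):

```python
import numpy as np

def estimate_dynamics(A, B, N=500, T=6, sigma_u=1.0, sigma_w=0.5, seed=0):
    """Roll out N independent trajectories with Gaussian excitation and
    regress x_{T+1} on (x_T, u_T), as in the least-squares estimator above."""
    rng = np.random.default_rng(seed)
    n, p = B.shape
    Z, X = [], []  # regressors [x_T; u_T] and targets x_{T+1}
    for _ in range(N):
        x = np.zeros(n)
        for t in range(T + 1):
            u = sigma_u * rng.standard_normal(p)
            x_next = A @ x + B @ u + sigma_w * rng.standard_normal(n)
            if t == T:
                Z.append(np.concatenate([x, u]))
                X.append(x_next)
            x = x_next
    Z, X = np.array(Z), np.array(X)
    theta, *_ = np.linalg.lstsq(Z, X, rcond=None)  # theta estimates [A B]^T
    AB = theta.T
    return AB[:, :n], AB[:, n:]
```

On the unstable example system from the paper, the estimates concentrate around $$(A_\star, B_\star)$$ at the $$\sqrt{(n+p)/N}$$ rate described above.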

## Random Matrix Analysis

Across independent rollouts, the regressors and noise have independent Gaussian entries:

$$\begin{bmatrix} x_{T}^{(\ell)} \\u_{T}^{(\ell)} \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix}\sigma_u^2 G_TG_T^\top + \sigma_w^2 F_TF_T^\top &\\ & \sigma_u^2 I\end{bmatrix}\right), \qquad w_t^{(\ell)} \sim \mathcal{N}\left(0, \sigma_w^2 I\right)$$

The least-squares estimate is

$$\arg \min \|Z_N \begin{bmatrix} A & B\end{bmatrix} ^\top - X_N\|^2_F = (Z_N^\top Z_N)^\dagger Z_N^\top X_N$$

$$= \begin{bmatrix} A_\star & B_\star \end{bmatrix} ^\top + (Z_N^\top Z_N)^\dagger Z_N^\top W_N$$

Data and noise matrices

$$X_N = \begin{bmatrix} x_{T+1}^{(1)} & \dots & x_{T+1}^{(N)} \end{bmatrix}^\top$$

$$Z_N = \begin{bmatrix} x_{T}^{(1)} & \dots & x_{T}^{(N)} \\u_{T}^{(1)} & \dots & u_{T}^{(N)} \end{bmatrix}^\top$$

$$W_N = \begin{bmatrix} w_{T}^{(1)} & \dots & w_{T}^{(N)} \end{bmatrix}^\top$$

To control the error term, lower bound the minimum singular value of $$Z_N$$ (or compute a data-dependent bound) and upper bound the inner products between the columns of $$Z_N$$ and $$W_N$$.
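The error decomposition above is easy to verify numerically (a sketch with synthetic data; all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, N = 3, 2, 50
A_star = 0.3 * rng.standard_normal((n, n))
B_star = rng.standard_normal((n, p))
Z = rng.standard_normal((N, n + p))        # rows play the role of [x_T; u_T]
W = 0.1 * rng.standard_normal((N, n))      # rows play the role of w_T
X = Z @ np.hstack([A_star, B_star]).T + W  # targets x_{T+1}

theta = np.linalg.pinv(Z.T @ Z) @ Z.T @ X  # least-squares estimate of [A B]^T
error = theta - np.hstack([A_star, B_star]).T
noise_term = np.linalg.pinv(Z.T @ Z) @ Z.T @ W
assert np.allclose(error, noise_term)      # estimate = truth + projected noise
```

The estimation error is exactly the noise matrix passed through $$(Z_N^\top Z_N)^\dagger Z_N^\top$$, which is what the random matrix analysis bounds.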

## What about learning from single trajectory?

• Ellipsoidal confidence intervals using self-normalized martingale techniques [Abbasi-Yadkori & Szepesvári, 2011]
• Recent results [Simchowitz et al., 2018] for linear time series:

$$\Big\|\begin{bmatrix} \widehat A - A_\star \\ \widehat B - B_\star\end{bmatrix}\Big\| \lesssim \frac{\sigma_w}{C_u \sigma_u}\sqrt{\frac{n+p }{T} \log(d/\delta)}$$

• We suggest bootstrapped confidence intervals, though we do not present formal guarantees on their accuracy

## Robust Control Problem

To guarantee worst-case performance, use estimates $$\widehat A,\widehat B$$ and confidence intervals for robust synthesis:

$$\underset{\mathbf u=\mathbf{Kx}}{\min}$$  $$\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}$$ $$\mathbb{E}\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^T x_t^\top Q x_t + u_t^\top R u_t\right]$$

s.t.  $$x_{t+1} = Ax_t + Bu_t + w_t$$

To tackle this nonconvex problem, we use an alternate parametrization.

## A System Level Perspective

*(Block diagrams: the closed-loop interconnection of the plant $$(A,B)$$ and the controller $$\mathbf{K}$$, mapping the disturbance $$\mathbf{w}$$ to the state $$\mathbf{x}$$ and input $$\mathbf{u}$$, is summarized by a single system response $$\mathbf{\Phi}$$.)*

In closed loop, the state and input are linear functions of the disturbance:

$$x_t = \sum_{k=0}^t A^{k}(Bu_{t-k} + w_{t-k})$$

$$u_t = \sum_{k=0}^t K_kx_{t-k}$$

$$\begin{bmatrix} x_t\\u_t \end{bmatrix} = \sum_{k=0}^t \begin{bmatrix} \Phi_x(k)\\ \Phi_u(k) \end{bmatrix} w_{t-k}$$

Instead of reasoning about a controller $$\mathbf{K}$$, we reason about the interconnection $$\mathbf\Phi$$ directly.

## Optimal control with SLS

$$\underset{\color{red}\mathbf{K}}{\min}$$   $$\mathbb{E}\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^T x_t^\top Q x_t + u_t^\top R u_t\right]$$

s.t.  $$x_{t+1} = Ax_t + Bu_t + w_t$$

$$u_{t} = {\color{red}\mathbf{K}}(x_t)$$

Equivalently, in the frequency domain,

$$\underset{\color{red}\mathbf{K}}{\min}$$ $$\mathbb{E}\left\|\begin{bmatrix} Q^{1/2} \mathbf x\\ R^{1/2} \mathbf u\end{bmatrix}\right\|_{2}^2$$

s.t.  $$z\mathbf x = A\mathbf x + B\mathbf u + \mathbf w$$

$$\mathbf u = {\color{red}\mathbf{K}}\mathbf x$$

Instead of reasoning about $$\mathbf{K}$$, reason about the interconnection $$\mathbf\Phi$$ directly:

$$\underset{\color{blue}\mathbf{\Phi}}{\min}$$ $$\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix}{\color{blue} \mathbf{\Phi}} \right\|_{\mathcal{H}_2}^2$$

s.t. $${\color{blue}\mathbf\Phi }\in\mathrm{Affine}(A, B)$$

Theorem [Anderson et al., 2019]: Correspondence between feedback controller and system response,

$${\mathbf\Phi }\in\mathrm{Affine}(A, B)$$ $$\iff$$ $$\mathbf K = \mathbf{\Phi_u\Phi_x}^{-1}$$ achieves response $$\mathbf \Phi$$ in closed loop with $$A,B$$.
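For a static stabilizing gain, this correspondence can be checked on FIR truncations of the response (a sketch under our own indexing convention, where tap 0 is the identity; `system_response` is an illustrative name):

```python
import numpy as np

def system_response(A, B, K, L=20):
    """First L FIR taps of the system response induced by the static
    controller u = K x: Phi_x[k] = (A + B K)^k and Phi_u[k] = K Phi_x[k]."""
    Acl = A + B @ K
    Phi_x = [np.eye(A.shape[0])]
    for _ in range(L - 1):
        Phi_x.append(Acl @ Phi_x[-1])
    Phi_u = [K @ P for P in Phi_x]
    return Phi_x, Phi_u

# Membership in Affine(A, B) is the convolution recursion
# Phi_x[0] = I and Phi_x[k+1] = A @ Phi_x[k] + B @ Phi_u[k];
# the controller is recovered tap-wise as K = Phi_u[k] @ inv(Phi_x[k]).
```

Numerically verifying the recursion for a stabilizing $$K$$ confirms both directions of the theorem on the truncated response.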

## SLS with uncertain dynamics

When the dynamics are unknown, we synthesize against the estimated dynamics

$$\widehat x_{t+1} = \widehat A\widehat x_t + \widehat B \widehat u_t$$

while the actual trajectories obey

$$x_{t+1} = Ax_t + Bu_t$$

For system responses,

$$\widehat{\mathbf{\Phi}} \in\mathrm{Affine}(\widehat A,\widehat B)$$

if and only if

$$\widehat{\mathbf{\Phi}}\left(I - \begin{bmatrix}\Delta_A & \Delta_B \end{bmatrix} \widehat{\mathbf{\Phi}}\right)^{-1} \in\mathrm{Affine}(A,B)$$

as long as the inverse exists.

Theorem [Anderson et al., 2019]: Robust correspondence between feedback controller and system response:

$$\widehat{\mathbf\Phi }\in\mathrm{Affine}(\widehat A, \widehat B)$$ $$\iff$$ $$\mathbf K = \widehat{\mathbf{\Phi}}_u\widehat{\mathbf{\Phi}}_x^{-1}$$ achieves response
$$\widehat{\mathbf\Phi }(I-\mathbf\Delta)^{-1}$$ in closed loop with $$A, B$$, where $$\mathbf\Delta = \begin{bmatrix}\Delta_A&\Delta_B\end{bmatrix}\widehat{\mathbf\Phi}$$, as long as $$\|\mathbf\Delta\|<1$$.

## Robust Synthesis with SLS

Using insights from SLS, we rewrite and upper bound the robust synthesis problem:

$$\underset{\mathbf u=\mathbf{Kx}}{\min}$$  $$\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}$$ $$\mathbb{E}\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^T x_t^\top Q x_t + u_t^\top R u_t\right]$$

$$\text{s.t.}~x_{t+1} = Ax_t + Bu_t + w_t$$

In the system level parametrization, this becomes

$$\underset{\mathbf{\Phi}}{\min}$$ $$\underset{\|\Delta_A\|\leq \varepsilon_A \atop \|\Delta_B\|\leq \varepsilon_B}{\max}$$ $$\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi}{\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_2}^2$$

$$\text{s.t.}~ {\mathbf\Phi }\in\mathrm{Affine}(\widehat A, \widehat B), \qquad \color{teal} \mathbf \Delta = \begin{bmatrix}\Delta_A&\Delta_B\end{bmatrix}\mathbf{\Phi}$$

which is upper bounded by the quasi-convex problem

$$\widehat{\mathbf\Phi} = \underset{\mathbf{\Phi}, {\color{teal} \gamma}}{\arg\min}$$ $$\frac{1}{1-\gamma}$$$$\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi} \right\|_{\mathcal{H}_2}^2$$

$$\qquad\qquad\text{s.t.}~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I, \qquad \left\|\begin{bmatrix}\varepsilon_A I &\\& \varepsilon_B I\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\leq\gamma$$

### Suboptimality Result:

For $$\widehat{\mathbf{K}} = \widehat{\mathbf{\Phi}}_u\widehat{\mathbf{\Phi}}_x^{-1}$$ synthesized as above and $$K_\star$$ the true optimal controller,

$$\frac{\text{cost}(\widehat{\mathbf{K}})-\text{cost}({K}_\star) }{\text{cost}({K}_\star)}\leq 5(\varepsilon_A + \varepsilon_B\|K_\star\|) \|\mathscr{R}_{A_\star+B_\star K_\star}\|_{\mathcal H_\infty}$$

as long as $$(\varepsilon_A + \varepsilon_B\|K_\star\|) \|\mathscr{R}_{A_\star+B_\star K_\star}\|_{\mathcal H_\infty} \leq 1/5$$.

## Main Result

As long as $$N\gtrsim (n+p)\sigma_w^2\|\mathscr R_{A_\star+B_\star K_\star}\|_{\mathcal H_\infty}(1/\lambda_G + \|K_\star\|^2/\sigma_u^2)\log(1/\delta)$$, the robustly synthesized $$\widehat{\mathbf{K}}$$ satisfies, with probability at least $$1-\delta$$,

rel. error of $$\widehat{\mathbf K}$$ $$\lesssim \sigma_w \left(\frac{1}{\sqrt{\lambda_G}} + \frac{\|K_\star\|}{\sigma_u}\right) \|\mathscr{R}_{A_\star+B_\star K_\star}\|_{\mathcal H_\infty} \sqrt{\frac{n+p }{N} \log(1/\delta)}$$

Each factor has an interpretation: $$\|\mathscr{R}_{A_\star+B_\star K_\star}\|_{\mathcal H_\infty}$$ measures the robustness of the optimal closed loop, $$\lambda_G = \lambda_{\min}(\sigma_u^2 G_T G_T^\top + \sigma_w^2 F_T F_T^\top)$$ and $$\sigma_u$$ the excitability of the system, and $$\|K_\star\|$$ the optimal controller gain.

## Controller Synthesis Implementation

System response variables are infinite-dimensional!

Approximations yield a finite-dimensional SDP:

$$\underset{\mathbf{\Phi}}{\min}$$ $$\frac{1}{1-\gamma}$$$$\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi} \right\|_{\mathcal{H}_2}^2$$

$$\text{s.t.}~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I$$

$$\left\|\begin{bmatrix}\varepsilon_A I &\\& \varepsilon_B I\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\leq\gamma$$

• FIR approximation: optimize over horizon $$L$$
  • Maintains the guarantee as long as $$L \approx \log(1/\mathrm{subopt.})$$
  • Total of $$n(n+p)L$$ decision variables
• Common Lyapunov approximation: static controller
  • No theoretical guarantee
  • Total of $$(n+p)(2n+p)$$ decision variables

## Numerical Demonstration

We consider a stylized temperature regulation example

$$A_\star = \begin{bmatrix} 1.01 & 0.01 & 0 \\ 0.01 & 1.01 & 0.01 \\ 0 & 0.01 & 1.01\end{bmatrix}, ~~B_\star = I$$

with relatively low penalty on state

$$Q=10^{-3}\cdot I, ~~R = I$$
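A quick check of this example (a sketch using the nominal LQR solution on the true parameters, not the robust SLS synthesis): the open loop is unstable, while the optimal controller stabilizes it.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
Q, R = 1e-3 * np.eye(3), np.eye(3)

rho_open = np.max(np.abs(np.linalg.eigvals(A)))  # > 1: open loop unstable

P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
rho_closed = np.max(np.abs(np.linalg.eigvals(A + B @ K)))  # < 1: stabilized
```

Because the state penalty is small relative to the input penalty, stabilizing against model error is the dominant concern, which makes this a natural stress test for robust synthesis.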

## Extensions

1. Robust adaptive control [DMMRT18]
   • Greedy exploration + robust exploitation achieves $$T^{2/3}$$ regret
2. Learning while satisfying safety constraints [DTMR19]
   • Robust controller compensates for bounded exploration noise to keep the system safe during exploration

### What is interesting about LQR?

• Exploration is easy due to linearity
• Exploitation is easy: new certainty equivalent result [Mania et al., 2019]

Nonlinear system identification and control are both much more challenging, and many strategies use LQR as a building block (iLQR, MPC).

### References

1. Abbasi-Yadkori, Szepesvári. "Regret bounds for the adaptive control of linear quadratic systems." COLT, 2011.
2. Anderson, Doyle, Low, Matni. "System level synthesis." Annual Reviews in Control (2019).
3. Dean, Mania, Matni, Recht, Tu. "Regret bounds for robust adaptive control of the linear quadratic regulator." NeurIPS, 2018.
4. Dean, Tu, Matni, Recht. "Safely learning to control the constrained linear quadratic regulator." ACC, 2019.
5. Mania, Tu, Recht. "Certainty equivalence is efficient for linear quadratic control." NeurIPS, 2019.
6. Simchowitz, Mania, Tu, Jordan, Recht. "Learning without mixing: Towards a sharp analysis of linear system identification." COLT, 2018.

For details, see our paper: "On the sample complexity of the linear quadratic regulator." Foundations of Computational Mathematics (2019): 1-47.
