Data-Driven Optimal Control
ML in Feedback Sys #16
Prof Sarah Dean
Reminders
- Office hours this week moved to Friday 9-10am
- cancelled next week due to travel
- Feedback on final project proposal
- Upcoming paper presentations starting next week
- Project midterm update due 11/11
Recap: System Level LQR
\(a_t = {\color{Goldenrod} K_t }s_{t}\)
\( \underset{\mathbf a }{\min}\) \(\displaystyle\mathbb{E}\left[\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)
\(\text{s.t.}~~s_{t+1} = As_t + Ba_t + w_t\)
\(\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w \)
\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}\bar Q^{1/2} &\\& \bar R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix} \right\|_{F}^2\)
\(\text{s.t.}~~ \begin{bmatrix} I - \mathcal Z \bar A & - \mathcal Z \bar B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s\\ \color{teal} \mathbf{\Phi}_a \end{bmatrix}= I \)
[Block diagrams: in closed loop, the controller \(\mathbf{K}\) maps state \(s_t\) to action \(a_t\), and the plant \((A,B)\) is driven by \(a_t\) and disturbance \(w_t\); under the system response parametrization, \(\mathbf{\Phi}\) maps the disturbances directly to \((\mathbf s, \mathbf a)\): instead of a loop, the system looks like a line.]
References: System Level Synthesis by Anderson, Doyle, Low, Matni
System Level Synthesis
Theorem: For a linear system in feedback with a linear controller over the horizon \(t=0,\dots, T\):
- The affine subspace \(\{(I - \mathcal Z \bar A )\mathbf \Phi_s- \mathcal Z \bar B \mathbf \Phi_a = I\} \) parametrizes all possible system responses.
- For any block-lower-triangular matrices \((\mathbf \Phi_s,\mathbf \Phi_a)\) in the affine subspace, there exists a linear feedback controller achieving this response.
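A minimal cvxpy sketch of this finite-horizon program (assuming the dynamics matrices A, B, cost square roots Q_sqrt, R_sqrt, horizon T, and dimensions n = dim(s), p = dim(a) are already defined as numpy arrays):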
import cvxpy as cvx
import numpy as np

# System responses mapping the stacked disturbance (s_0, w_0, ..., w_{T-2})
# to the stacked state and action trajectories
Phi_s = cvx.Variable((T*n, T*n), name="Phi_s")
Phi_a = cvx.Variable((T*p, T*n), name="Phi_a")

# Affine dynamics constraint (I - Z A_bar) Phi_s - Z B_bar Phi_a = I, block row by block row
I_full = np.eye(T*n)
constr = [Phi_s[:n, :] == I_full[:n, :]]
for k in range(T-1):
    constr.append(Phi_s[n*(k+1):n*(k+2), :]
                  == A @ Phi_s[n*k:n*(k+1), :] + B @ Phi_a[p*k:p*(k+1), :]
                  + I_full[n*(k+1):n*(k+2), :])
# Causality: Phi_a block-lower-triangular (Phi_s then inherits this from the recursion)
constr += [Phi_a[p*k:p*(k+1), n*(k+1):] == 0 for k in range(T-1)]

# Quadratic cost: Frobenius norm of blkdiag(Q^1/2, R^1/2) [Phi_s; Phi_a]
cost_matrix = cvx.bmat([[Q_sqrt @ Phi_s[n*k:n*(k+1), :]] for k in range(T)]
                       + [[R_sqrt @ Phi_a[p*k:p*(k+1), :]] for k in range(T)])
objective = cvx.norm(cost_matrix, 'fro')  # minimizing the norm = minimizing its square

prob = cvx.Problem(cvx.Minimize(objective), constr)
prob.solve()
Phi_s = np.array(Phi_s.value)
Phi_a = np.array(Phi_a.value)
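Given the solution, a linear feedback controller achieving this response can be recovered from the response variables; in the SLS framework it is \(\mathbf K = \mathbf \Phi_a \mathbf \Phi_s^{-1}\), which is well-defined since \(\mathbf\Phi_s\) is block-lower-triangular with identity diagonal blocks.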
Policies via convex programming
Optimal Control on Arbitrary Horizons
- In many cases, task horizon is long or not pre-defined
- Allow \(T\to\infty\)
- average cost $$ \min_{\pi} ~~\mathbb E_w\Big[\lim_{T\to\infty} \frac{1}{T}\sum_{k=0}^{T} c(s_k, \pi(s_k)) \Big]$$
- discounted cost $$ \min_{\pi} ~~\mathbb E_w\Big[\sum_{k=0}^{\infty} \gamma^k c(s_k, \pi(s_k)) \Big]$$
- Policy is stationary (no longer depends on time) \(\pi:\mathcal S\to\mathcal A\)
Steady State LQR
Infinite Horizon LQR Problem
$$ \min_{\pi} ~~\lim_{T\to\infty}\mathbb E_w\Big[\frac{1}{T}\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t.}\quad s_{k+1} = A s_k+ Ba_k+w_k $$
Claim: The optimal cost-to-go function is quadratic and the optimal policy is linear $$J^\star (s) = s^\top P s,\qquad \pi^\star(s) = K s$$
- \(P = Q+A^\top PA - A^\top PB(R+B^\top PB)^{-1}B^\top PA\)
- Discrete Algebraic Riccati Equation: \(P=\mathrm{DARE}(A,B,Q,R)\)
- \(K = -(R+B^\top PB)^{-1}B^\top PA\)
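Numerically, the steady-state solution is one call to scipy's Riccati solver (a minimal sketch, instantiated with the example system from the following slide; any stabilizable \((A,B)\) with positive definite costs works similarly):

import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1], [0.0, 0.9]])  # example dynamics
B = np.array([[0.0], [1.0]])
Q = np.diag([10.0, 0.1])                 # state cost
R = np.array([[5.0]])                    # action cost

P = solve_discrete_are(A, B, Q, R)       # cost-to-go: J*(s) = s' P s
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # pi*(s) = K s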
Infinite Horizon Optimal Control
Stochastic Infinite Horizon Optimal Control Problem
$$ \min_{\pi} ~~\underbrace{\lim_{T\to\infty} \mathbb E_w\Big[\frac{1}{T}\sum_{k=0}^{T} c(s_k, \pi(s_k)) \Big]}_{J^\pi(s_0)}\quad \text{s.t.}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi(s_k),w_k) $$
Bellman Optimality Equation
- \(\underbrace{J^\star (s)}_{\text{value function}} = \min_{a\in\mathcal A} \underbrace{c(s, a)+\mathbb E_w[J^\star (F(s,a,w))]}_{\text{state-action function}}\)
- Minimizing argument is \(\pi^\star(s)\)
Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas
LQR Example
$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$
The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).
Goal: stay near origin and be energy efficient
- \(c(s,a) = s^\top \begin{bmatrix} 10 & \\ & 0.1 \end{bmatrix}s + 5a^2 \)
\(\pi_\star(s) \approx -\begin{bmatrix} 7.0\times 10^{-2}& 3.7\times 10^{-2}\end{bmatrix} s\)
\(J^\star(s) \approx s^\top \begin{bmatrix} 33.5 & 5.8 \\ 5.8 & 2.4 \end{bmatrix} s\)
Steady State System Response
[Block diagram: the plant \((A,B)\) in closed loop with \(a_t = \mathbf K s_t\), driven by disturbance \(w_t\).]
\(s_{t+1} = As_{t}+Ba_{t}+w_{t}\) with \(a_t = K s_t\)
Unrolling the closed loop:
\(s_{t+1} = (A+BK)^{t+1} s_0 + \sum_{k=0}^{t} (A+BK)^{t-k} w_{k}\)
\(a_{t+1} = K(A+BK)^{t+1} s_0 + \sum_{k=0}^{t} K(A+BK)^{t-k} w_{k}\)
so states and actions are linear in the disturbances:
\(s_{t} = \Phi_s^{t} s_0 + \sum_{k=1}^t \Phi_s^{k-1}w_{t-k}\)
\(a_{t} = \Phi_a^{t} s_0 + \sum_{k=1}^t \Phi_a^{k-1}w_{t-k}\)
Stacked over the horizon, \(\mathbf{\Phi}\) is block-lower-triangular and Toeplitz:
\(\begin{bmatrix} s_{0}\\\vdots \\s_T\end{bmatrix} = \begin{bmatrix} \Phi_s^{0}\\ \Phi_s^{ 1}& \Phi_s^{0}\\ \vdots & \ddots & \ddots \\ \Phi_s^{T} & \Phi_s^{T-1} & \dots & \Phi_s^{0} \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}\)
\(\begin{bmatrix} a_{0}\\\vdots \\a_T\end{bmatrix} = \begin{bmatrix} \Phi_a^{0}\\ \Phi_a^{1}& \Phi_a^{0}\\ \vdots & \ddots & \ddots \\ \Phi_a^{T} & \Phi_a^{T-1} & \dots & \Phi_a^{0} \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}\)
- Cost depends on the (semi-)infinite sequence \(\mathbf s = (s_0, s_1, s_2,\dots)\)
- Generated by (semi-)infinite operator \(\mathbf \Phi_s = (\Phi_s^0, \Phi_s^1,\dots)\) acting on the disturbance sequence \(\mathbf w = (w_{-1}, w_0, w_1,\dots)\), with the convention \(w_{-1}=s_0\)
- the operation is a convolution \(s_{t} = \sum_{k=1}^{t+1} \Phi_s^{k-1}w_{t-k}\)
- We represent this operation with the notation \(\mathbf s = \mathbf \Phi_s\mathbf w\)
- Concretely, this operation can be represented with semi-infinite vectors and Toeplitz matrices, or in the frequency domain (both detailed below)
Sequences & Operators
- \(\mathbf s = (s_0, s_1, s_2,\dots)\), \(\mathbf w = (w_{-1}, w_0, w_1,\dots)\), and \(\mathbf \Phi_s = (\Phi_s^0, \Phi_s^1,\dots)\) $$\mathbf s = \mathbf \Phi_s\mathbf w$$
- Concretely,
- semi-infinite vectors and Toeplitz matrices $$\begin{bmatrix} s_{0}\\\vdots \\s_t\\\vdots \end{bmatrix} = \begin{bmatrix} \Phi_s^{0}\\ \Phi_s^{ 1}& \Phi_s^{0}\\ \vdots & \ddots & \ddots \\ \Phi_s^{t} & \Phi_s^{t-1} & \dots & \Phi_s^{0} \\ \vdots & & \ddots &&\ddots \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{t-1} \\\vdots \end{bmatrix}$$
- frequency domain
- define time shift operator \(z\) such that $$z(s_0, s_1,s_2 \dots) = (s_1, s_2,\dots)$$
- represent \(\mathbf s(z) = \sum_{t=0}^\infty z^{-t}s_t\) and \(\mathbf \Phi_s(z) = \sum_{t=0}^\infty z^{-t}\Phi_s^t\)
- multiplication of power series, with \(\mathbf w(z) = \sum_{t=-1}^\infty z^{-(t+1)}w_t\) so that each disturbance enters after a one-step delay (and \(w_{-1}=s_0\)): $$ \mathbf \Phi_s(z) \mathbf w(z) = \Big(\sum_{t=0}^\infty z^{-t}\Phi_s^t\Big)\Big(\sum_{t=-1}^\infty z^{-(t+1)}w_{t}\Big) = \sum_{t=0}^\infty z^{-t} \sum_{k=1}^{t+1} \Phi_s^{k-1} w_{t-k} $$
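The operator picture can be checked numerically. A small numpy sketch (using the example gain from the LQR slides; the block-Toeplitz matrix is built from the closed-loop impulse response and compared against a direct simulation):

import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
K = -np.array([[0.070, 0.037]])
Acl = A + B @ K                 # closed-loop matrix A + BK
n, T = 2, 6

# Impulse-response blocks Phi_s^k = (A + BK)^k
Phi = [np.linalg.matrix_power(Acl, k) for k in range(T)]

# Block-lower-triangular Toeplitz operator: (s_0, w_0, ..., w_{T-2}) -> (s_0, ..., s_{T-1})
Phi_mat = np.zeros((T * n, T * n))
for i in range(T):
    for j in range(i + 1):
        Phi_mat[i*n:(i+1)*n, j*n:(j+1)*n] = Phi[i - j]

# Direct simulation of s_{t+1} = Acl s_t + w_t agrees with the operator picture
rng = np.random.default_rng(0)
w = rng.standard_normal((T, n))  # w[0] plays the role of s_0 (i.e., w_{-1})
s = [w[0]]
for t in range(T - 1):
    s.append(Acl @ s[-1] + w[t + 1])
assert np.allclose(np.concatenate(s), Phi_mat @ w.reshape(-1))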
LQR Example
$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$
The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).
\(\pi_\star(s) \approx -\begin{bmatrix} 7.0\times 10^{-2}& 3.7\times 10^{-2}\end{bmatrix} s\)
\(\Phi_s^t \approx \begin{bmatrix}0.9 & 0.1 \\ -0.070 & 0.86\end{bmatrix}^{t-1} \quad \Phi_a^t \approx -\begin{bmatrix} 7.0\times 10^{-2}& 3.7\times 10^{-2}\end{bmatrix}\begin{bmatrix}0.9 & 0.1 \\ -0.070 & 0.86\end{bmatrix}^{t-1} \)
eigenvalues \(\approx 0.88\pm 0.082j\)
\(a_t = {\color{Goldenrod} K}s_{t}\)
\( \underset{\mathbf a }{\min}\) \(\displaystyle\lim_{T\to\infty}\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)
\(\text{s.t.}~~s_{t+1} = As_t + Ba_t + w_t\)
\(\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w \)
\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}Q^{1/2} &\\& R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix} \right\|_{\mathcal H_2}^2\)
\(\text{s.t.}~~ \begin{bmatrix} zI - A & - B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix}= I \)
Infinite Horizon LQR
Exercise: Using the frequency domain notation, derive the expression for the SLS cost and constraints. Hint: in signal notation, the dynamics can be written \(z\mathbf s = A\mathbf s + B\mathbf a + \mathbf w\)
Where we use the norm:
$$ \|\mathbf \Phi\|_{\mathcal H_2}^2 = \sum_{t=0}^\infty \|\Phi^t\|_F^2 $$
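This cost can be evaluated numerically for any stabilizing gain by truncating the sum above (a sketch; the gain is the example gain from the earlier slides, and the truncation length is a hypothetical choice that should exceed the closed loop's mixing time):

import numpy as np

def lqr_h2_cost(A, B, Q, R, K, T=500):
    # Truncated H2 cost: sum_t ||Q^(1/2) Phi_s^t||_F^2 + ||R^(1/2) Phi_a^t||_F^2
    Acl, M, cost = A + B @ K, np.eye(A.shape[0]), 0.0
    Q_sqrt, R_sqrt = np.linalg.cholesky(Q).T, np.linalg.cholesky(R).T
    for _ in range(T):
        cost += np.linalg.norm(Q_sqrt @ M, 'fro')**2 + np.linalg.norm(R_sqrt @ K @ M, 'fro')**2
        M = Acl @ M  # next power of (A + BK)
    return cost

A = np.array([[0.9, 0.1], [0.0, 0.9]]); B = np.array([[0.0], [1.0]])
K = -np.array([[0.070, 0.037]])
print(lqr_h2_cost(A, B, np.diag([10.0, 0.1]), np.array([[5.0]]), K))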
Recap: LQR
- Goal: minimize quadratic cost (\(Q,R\)) in a system with linear dynamics (\(A,B\))
- Classic approach: Dynamic programming/Bellman optimality
- \(P = \mathrm{DARE}(A,B,Q,R)\) and \(K_\star = -(R+B^\top PB)^{-1}B^\top PA\)
- System level synthesis: Convex optimization
- \( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}Q^{1/2} &\\& R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix} \right\|_{\mathcal H_2}^2~~\text{s.t.}~~ \begin{bmatrix} zI - A & - B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix}= I \)
- Both require knowledge of dynamics and costs!

Action in an unknown dynamic world
Goal: select actions \(a_t\) to bring environment to low-cost states
[Diagram: the policy \(\pi_t:\mathcal S\to\mathcal A\) selects an action \(a_{t}\), the unknown environment (\(?\)) returns an observation \(s_t\), and we accumulate the data \(\{(s_t, a_t, c_t)\}\).]
Setting: dynamics (and cost) functions are not known, but we have data \(\{s_k, a_k, c_k\}_{k=0}^N\). Approaches include a focus on:
- Model: learn dynamics/costs from data, then do policy design
- For LQR: estimate \(\hat A,\hat B,\hat Q,\hat R\) then design \(\hat K\)
- "model based"
- Bellman: learn value or state-action function
- For LQR: estimate \(\hat J\) then determine \(\hat K\) as the \(\arg\min\)
- "model free"
- Policy: estimate gradients and update policy directly
- For LQR: \(\hat K \leftarrow \hat K -\alpha\widehat{\nabla J}(\hat K)\)
- "model free"
Data-driven Policy Design
Setting: dynamics \(A,B\) are not known, but we have data \(\{s_k, a_k\}_{k=0}^N\)
- Learn Model:
- estimate \(\hat A,\hat B\) via least-squares (see the sketch below) $$\hat A,\hat B = \arg\min_{A,B} \sum_{k=0}^{N-1} \|s_{k+1}-As_k-Ba_k\|_2^2$$
- error bounds $$ \|A-\hat A\|_2\leq \varepsilon_A,\quad \|B-\hat B\|_2\leq \varepsilon_B$$
- system identification guarantees \(\max\{\varepsilon_A, \varepsilon_B\}\lesssim \sqrt{\frac{m+n}{N}}\)
- Design Policy:
- nominal or certainty equivalent approach uses \(\hat A, \hat B\)
- robust approach uses \(\hat A, \hat B, \varepsilon_A, \varepsilon_B\)
Model-based LQR
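A minimal least-squares identification sketch (assumptions: a single excitation trajectory; states S of shape (N+1, n), actions U of shape (N, m); estimate_dynamics is a hypothetical helper name):

import numpy as np

def estimate_dynamics(S, U):
    # Least-squares fit of s_{k+1} ~ A s_k + B a_k
    n = S.shape[1]
    Z = np.hstack([S[:-1], U])                  # regressors (s_k, a_k)
    Theta, *_ = np.linalg.lstsq(Z, S[1:], rcond=None)
    return Theta[:n].T, Theta[n:].T             # A_hat, B_hat

# Example: excite the true system with random inputs, then estimate
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]]); B = np.array([[0.0], [1.0]])
N, U = 500, rng.standard_normal((500, 1))
S = [np.zeros(2)]
for k in range(N):
    S.append(A @ S[-1] + B @ U[k] + 0.1 * rng.standard_normal(2))
A_hat, B_hat = estimate_dynamics(np.array(S), U)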
The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).
Goal: be energy efficient
- \(c(s,a) = s^\top \begin{bmatrix} 0.01& \\ & 0.01 \end{bmatrix}s + 100a^2 \)
\(\hat\pi_\star(s) \approx -\begin{bmatrix} 6.1\times 10^{-5}& 2.8\times 10^{-4}\end{bmatrix} s\) does not stabilize the system!
Even though \(\varepsilon=0.02\), \(J(\hat K)\) is infinite!
LQR Example
true dynamics \(\left(\begin{bmatrix} 1.01 & 0.1\\ & 1.01 \end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}\right) \) but we estimate \(\left(\begin{bmatrix} 0.99 & 0.1\\ & 0.99 \end{bmatrix},\begin{bmatrix}0\\1\end{bmatrix}\right )\)
Robust design is worst-case
\( \underset{\mathbf a=\mathbf{Ks}}{\min}\) \(\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}\) \(\mathbb{E}\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)
s.t. \(s_{t+1} = As_t + Ba_t + w_t\)
Challenge: translating predictions \(\hat s_{t+1} = \hat A\hat s_t + \hat B \hat a_t\) to reality \(s_{t+1} = As_t + Ba_t\)
Lemma: if the system response variables satisfy
- the nominal system constraint \( \begin{bmatrix} zI - \hat A & - \hat B\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\ \hat\mathbf{\Phi}_a \end{bmatrix}= I \)
- then if the inverse exists, \( \begin{bmatrix} zI - A & - B\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\ \hat\mathbf{\Phi}_a \end{bmatrix} (I-\mathbf \Delta)^{-1}= I \) where \(\mathbf \Delta = (\underbrace{A-\hat A}_{\Delta_A})\hat\mathbf{\Phi}_s + (\underbrace{B-\hat B}_{\Delta_B})\hat\mathbf{\Phi}_a\)
Robust synthesis with SLS
Proof:
- \((zI - \hat A)\hat\mathbf{\Phi}_s - \hat B \hat\mathbf{\Phi}_a = I\)
- \((zI - \hat A+A-A)\hat\mathbf{\Phi}_s -( \hat B-B+B) \hat\mathbf{\Phi}_a=I\)
- \((zI -A)\hat\mathbf{\Phi}_s -B\hat\mathbf{\Phi}_a + (A- \hat A)\hat\mathbf{\Phi}_s + (B-\hat B)\hat\mathbf{\Phi}_a=I\)
- \((zI -A)\hat\mathbf{\Phi}_s -B\hat\mathbf{\Phi}_a=I - \Delta_A\hat\mathbf{\Phi}_s - \Delta_B\hat\mathbf{\Phi}_a = I - \mathbf\Delta\)
- \( \begin{bmatrix} zI - A & - B\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\ \hat\mathbf{\Phi}_a \end{bmatrix} = I-\mathbf\Delta\)
Therefore, the estimated cost is $$ \hat J(\hat{\mathbf \Phi}) = \left\|\begin{bmatrix} Q^{1/2}\\ & R^{1/2}\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\ \hat\mathbf{\Phi}_a \end{bmatrix}\right\|_{\mathcal H_2}^2 $$ while the cost actually achieved is $$ J(\hat{\mathbf \Phi}) = \left\|\begin{bmatrix} Q^{1/2}\\ & R^{1/2}\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\ \hat\mathbf{\Phi}_a \end{bmatrix}(I-\mathbf \Delta)^{-1} \right\|_{\mathcal H_2}^2 $$
Robust synthesis with SLS
Theorem (Anderson et al., 2019): A policy designed from system responses satisfying \(\begin{bmatrix} zI - \hat A & - \hat B\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\ \hat\mathbf{\Phi}_a \end{bmatrix}= I\) will achieve response \(\begin{bmatrix} \hat\mathbf{\Phi}_s \\ \hat\mathbf{\Phi}_a \end{bmatrix} (I-\mathbf \Delta)^{-1}\)
where \(\mathbf \Delta = (\underbrace{A-\hat A}_{\Delta_A})\hat\mathbf{\Phi}_s + (\underbrace{B-\hat B}_{\Delta_B})\hat\mathbf{\Phi}_a\) if the inverse exists.
\( \widehat{\mathbf\Phi} = \underset{\mathbf{\Phi}, {\color{teal} \gamma}}{\arg\min}\) \(\frac{1}{1-\gamma}\)\(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi} \right\|_{\mathcal{H}_2}\)
\(\qquad\qquad\text{s.t.}~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\)
\(\qquad\qquad\left\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\leq\gamma\)
Robust synthesis with SLS
\( \underset{\mathbf{\Phi}}{\min}\) \(\underset{\|\Delta_A\|\leq \varepsilon_A \atop \|\Delta_B\|\leq \varepsilon_B}{\max}\) \(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi}{\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_2}\)
\(\text{s.t.}~ {\mathbf\Phi }\in\mathrm{Affine}(\widehat A, \widehat B)\)
\(~~~~~\color{teal} \mathbf \Delta = \begin{bmatrix}\Delta_A&\Delta_B\end{bmatrix}\mathbf{\Phi}\)
\( \underset{\mathbf a=\mathbf{Ks}}{\min}\) \(\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}\) \(\mathbb{E}\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)
s.t. \(s_{t+1} = As_t + Ba_t + w_t\)
Where we use the norm:
\(\|\mathbf \Phi\|_{\mathcal H_\infty} = \max_{\|\mathbf x\|_2\leq 1} \|\mathbf \Phi\mathbf x\|_2 \) induced by \(\|\mathbf x\|_2 =\sqrt{\sum_{t=0}^\infty \|\mathbf x_t\|_2^2} \)
Upper bounding this nonconvex objective leads to a convex program; the necessary norm inequalities are reviewed next.
Review of matrix norms
- Euclidean norm on vectors
- \(\|x\|_2 =\sqrt{\sum_{i=1}^n x_i^2} = \sqrt{x^\top x}\)
- Frobenius norm: how big are matrix entries
- \(\|A\|_F = \sqrt{\sum_{i=1}^n\sum_{j=1}^m A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}\)
- Operator norm: how big can this matrix make a vector
- \(\|A\|_2 = \max_{\|x\|_2\leq 1} \|Ax\|_2 = \sqrt{\lambda_{\max}(A^\top A)} = \sigma_{\max}(A)\)
- Relationships:
- \(\|A\|_2 \leq \|A\|_F\)
- \(\|Ax\|_2 \leq \|A\|_2\|x\|_2\)
- \(\|AB\|_F \leq \|A\|_2 \|B\|_F\)
Signal and operator norms
- \(\ell_2\) norm
- \(\|\mathbf x\|_2 =\sqrt{\sum_{t=0}^\infty \|\mathbf x_t\|_2^2} \)
- \(\mathcal H_2\) norm
- \(\|\mathbf \Phi\|_{\mathcal H_2} = \sqrt{\sum_{t=0}^\infty \|\Phi^t\|_F^2}\)
- \(\mathcal H_\infty\) norm
- \(\|\mathbf \Phi\|_{\mathcal H_\infty} = \max_{\|\mathbf x\|_2\leq 1} \|\mathbf \Phi\mathbf x\|_2 \)
- Relationships:
- \(\|\mathbf \Phi\|_{\mathcal H_\infty} \leq \sum_{t=0}^\infty \|\Phi^t\|_2\)
- \(\|\mathbf \Phi\mathbf x\|_2 \leq \|\mathbf \Phi\|_{\mathcal H_\infty} \|\mathbf x\|_2\)
- \(\|\mathbf \Phi \mathbf \Psi\|_{\mathcal H_2} \leq \|\mathbf \Phi \|_{\mathcal H_\infty} \|\mathbf \Psi\|_{\mathcal H_2}\)
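These operator norms can be approximated from a truncated impulse response: \(\mathcal H_2\) by summing Frobenius norms, \(\mathcal H_\infty\) by maximizing \(\sigma_{\max}(\mathbf\Phi(e^{j\omega}))\) over a frequency grid (a sketch; grid size and truncation length are hypothetical choices):

import numpy as np

def h2_norm(Phi):
    # ||Phi||_{H2} = sqrt(sum_t ||Phi^t||_F^2)
    return np.sqrt(sum(np.linalg.norm(P, 'fro')**2 for P in Phi))

def hinf_norm(Phi, n_grid=2000):
    # ||Phi||_{Hinf} ~ max over frequencies of sigma_max(sum_t Phi^t z^{-t}), z = e^{jw}
    grid = np.exp(-1j * np.linspace(0, 2 * np.pi, n_grid))
    return max(np.linalg.norm(sum(P * z**t for t, P in enumerate(Phi)), 2) for z in grid)

# Impulse response of the example closed loop: Phi^t = (A + BK)^t
Acl = np.array([[0.9, 0.1], [-0.070, 0.86]])
Phi = [np.linalg.matrix_power(Acl, t) for t in range(300)]
print(h2_norm(Phi), hinf_norm(Phi))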
Upper bounds follow by:
- \(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi}{\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_2} \leq \left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2} \end{bmatrix}\mathbf{\Phi}\right\|_{\mathcal{H}_2} \left\| {\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_\infty} \)
- \(\left\| {\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_\infty} \leq \frac{1}{1- \|\mathbf \Delta\|_{\mathcal{H}_\infty}}\) whenever \(\|\mathbf \Delta\|_{\mathcal H_\infty}<1\)
- \(\left\|\begin{bmatrix}\Delta_A&\Delta_B\end{bmatrix}\mathbf{\Phi}\right\|_{\mathcal{H}_\infty} \leq \left\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\)
Robust synthesis derivation
\( \underset{\mathbf{\Phi}}{\min}\) \(\underset{\|\Delta_A\|\leq \varepsilon_A \atop \|\Delta_B\|\leq \varepsilon_B}{\max}\) \(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi}{\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_2}\)
\(\text{s.t.}~ {\mathbf\Phi }\in\mathrm{Affine}(\widehat A, \widehat B)\)
\(~~~~~\color{teal} \mathbf \Delta = \begin{bmatrix}\Delta_A&\Delta_B\end{bmatrix}\mathbf{\Phi}\)
\( \widehat{\mathbf\Phi} = \underset{\mathbf{\Phi}, {\color{teal} \gamma}}{\arg\min}\) \(\frac{1}{1-\gamma}\)\(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi} \right\|_{\mathcal{H}_2}\)
\(\qquad\qquad\text{s.t.}~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\)
\(\qquad\qquad\left\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\leq\gamma\)
Informal Theorem (Suboptimality):
For \(\hat\mathbf{\Phi}\) synthesized as above and \(\mathbf\Phi_\star\) the true optimal system response,
$$ J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \lesssim J(\mathbf \Phi_\star)\left\|\begin{bmatrix} \varepsilon_A & \\ & \varepsilon_B\end{bmatrix} \mathbf \Phi_\star\right\|_{\mathcal H_\infty} $$
Robust synthesis with SLS
- Learn Model:
- estimate \(\hat A,\hat B\) via least-squares, guarantee \(\max\{\varepsilon_A, \varepsilon_B\}\lesssim \sqrt{\frac{m+n}{N}}\)
- Design Policy:
- robust approach uses \(\hat A, \hat B, \varepsilon_A, \varepsilon_B\)
- \(J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \lesssim J(\mathbf \Phi_\star)\left\|\mathbf \Phi_\star\right\|_{\mathcal H_\infty} \sqrt{\frac{m+n}{N}}\)
- nominal or certainty equivalent approach uses \(\hat A, \hat B\)
- for small enough \(\varepsilon\), can show that \(J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \lesssim \varepsilon^2\)
- thus faster rate, \(J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \lesssim \frac{m+n}{N}\)
Model-based LQR
Using an explore then commit algorithm, we have $$R(T) = R_{\text{explore}}(N) + R_{\text{commit}}(N, T)$$
- Robust: \(R(T) \leq C_1 N + C_2 \frac{T}{\sqrt{N}}\)
- \(N\propto T^{2/3}\implies R(T)\lesssim O(T^{2/3})\)
- stability guaranteed
- Certainty equivalent: \(R(T) \leq C_1 N + C_2 \frac{T}{N}\)
- \(N\propto \sqrt{T}\implies R(T)\lesssim O(\sqrt{T})\)
- only holds for \(T\) large enough that estimation errors are small
Online Model-based LQR
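A quick numerical check of the two tradeoffs (hypothetical constants \(C_1 = C_2 = 1\); only the scaling in \(T\) matters):

import numpy as np

T = 10_000
N = np.arange(1, T, dtype=float)
robust = N + T / np.sqrt(N)        # C1*N + C2*T/sqrt(N), minimized near N ~ T^(2/3)
cert_equiv = N + T / N             # C1*N + C2*T/N, minimized near N ~ sqrt(T)
print(N[np.argmin(robust)], T**(2/3))
print(N[np.argmin(cert_equiv)], T**0.5)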
Recap
- Steady-state controllers and infinite horizons
- \(\pi^\star(s) = Ks\)
- Taxonomy of RL
- policy, value, model
- Model-based LQR & Robustness
References: System Level Synthesis by Anderson, Doyle, Low, Matni and Ch 2-3 in Machine Learning in Feedback Systems by Sarah Dean