Data-Driven Optimal Control

ML in Feedback Sys #16

Prof Sarah Dean

Reminders

  • Office hours this week moved to Friday 9-10am
    • cancelled next week due to travel
  • Feedback on final project proposal
  • Upcoming paper presentations starting next week
  • Project midterm update due 11/11

Recap: System Level LQR

       \(a_t = {\color{Goldenrod} K_t }s_{t}\)

\( \underset{\mathbf a }{\min}\)   \(\displaystyle\mathbb{E}\left[\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)

\(\text{s.t.}~~s_{t+1} = As_t + Ba_t + w_t\)

\(\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w \)

\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}\bar Q^{1/2} &\\& \bar R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix} \right\|_{F}^2\)

\(\text{s.t.}~~ \begin{bmatrix} I - \mathcal Z \bar A & - \mathcal Z \bar B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s\\ \color{teal} \mathbf{\Phi}_a \end{bmatrix}= I \)

[Block diagram: the feedback loop \(s_{t+1}=As_t+Ba_t+w_t\), \(a_t=\mathbf K s_t\); instead of a loop, the system response \(\mathbf{\Phi}\) maps the disturbance \(\mathbf w\) directly to \((\mathbf s, \mathbf a)\), so the system looks like a line.]

References: System Level Synthesis by Anderson, Doyle, Low, Matni

System Level Synthesis

Theorem: For a linear system in feedback with a linear controller over the horizon \(t=0,\dots, T\):

  1. The affine subspace \(\{(I - \mathcal Z \bar A )\mathbf \Phi_s- \mathcal Z \bar B \mathbf \Phi_a = I\} \) parametrizes all possible system responses.
  2. For any block-lower-triangular matrices \((\mathbf \Phi_s,\mathbf \Phi_a)\) in the affine subspace, there exists a linear feedback controller achieving this response.
The corresponding finite-horizon program in CVXPY (a sketch; it assumes the system matrices A (n x n) and B (n x p), the horizon T, and the cost square roots Q_sqrt, R_sqrt are already defined):

import cvxpy as cvx
import numpy as np

Phi_s = cvx.Variable((T*n, T*n), name="Phi_s")
Phi_a = cvx.Variable((T*p, T*n), name="Phi_a")

# Affine dynamics constraint [I - Z A_bar, -Z B_bar][Phi_s; Phi_a] = I, block row by block row
I_big = np.eye(T*n)
constr = [Phi_s[:n, :] == I_big[:n, :]]
for k in range(T-1):
    constr.append(Phi_s[n*(k+1):n*(k+2), :]
                  == A @ Phi_s[n*k:n*(k+1), :] + B @ Phi_a[p*k:p*(k+1), :]
                  + I_big[n*(k+1):n*(k+2), :])
    # causality: actions cannot depend on future disturbances (block-lower-triangular Phi_a)
    constr.append(Phi_a[p*k:p*(k+1), n*(k+1):] == 0)

# Quadratic cost: Frobenius norm of diag(Q_sqrt, R_sqrt) [Phi_s; Phi_a]
# (minimizing the norm or its square gives the same argmin)
cost_matrix = cvx.bmat([[Q_sqrt @ Phi_s[n*k:n*(k+1), :]] for k in range(T)]
                       + [[R_sqrt @ Phi_a[p*k:p*(k+1), :]] for k in range(T)])
objective = cvx.norm(cost_matrix, 'fro')

prob = cvx.Problem(cvx.Minimize(objective), constr)
prob.solve()
Phi_s_val = np.array(Phi_s.value)
Phi_a_val = np.array(Phi_a.value)
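One way to recover a feedback controller achieving the synthesized response is the standard SLS construction \(\mathbf K = \mathbf\Phi_a\mathbf\Phi_s^{-1}\) (a sketch; Phi_s_val and Phi_a_val are the arrays extracted above):

# Phi_s has identity diagonal blocks, so it is invertible; the resulting gain is
# block-lower-triangular, i.e., a causal (time-varying) state-feedback law a = K s.
K = Phi_a_val @ np.linalg.inv(Phi_s_val)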

Policies via convex programming

Optimal Control on Arbitrary Horizons

  • In many cases, task horizon is long or not pre-defined
  • Allow \(T\to\infty\)
    • average cost $$ \min_{\pi} ~~\mathbb E_w\Big[\lim_{T\to\infty} \frac{1}{T}\sum_{k=0}^{T} c(s_k, \pi(s_k)) \Big]$$
    • discounted average cost $$ \min_{\pi} ~~\mathbb E_w\Big[\sum_{k=0}^{\infty} \gamma^k c(s_k, \pi(s_k)) \Big]$$
  • Policy is stationary (no longer depends on time) \(\pi:\mathcal S\to\mathcal A\)

Steady State LQR

Infinite Horizon LQR Problem

$$ \min_{\pi} ~~\lim_{T\to\infty}\mathbb E_w\Big[\frac{1}{T}\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t.}\quad s_{k+1} = A s_k+ Ba_k+w_k $$

Claim:  The optimal cost-to-go function is quadratic and the optimal policy is linear $$J^\star (s) = s^\top P s,\qquad \pi^\star(s) = K s$$

  • \(P = Q+A^\top PA - A^\top PB(R+B^\top PB)^{-1}B^\top PA\)
    • Discrete Algebraic Riccati Equation: \(P=\mathrm{DARE}(A,B,Q,R)\)
  • \(K = -(R+B^\top PB)^{-1}B^\top PA\)
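A small numerical sketch of this claim, using scipy's DARE solver (the helper name is illustrative):

import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Steady-state LQR: P solves the DARE, K = -(R + B'PB)^{-1} B'PA."""
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P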

Infinite Horizon Optimal Control

Stochastic Infinite Horizon Optimal Control Problem

$$ \min_{\pi} ~~\lim_{T\to\infty} \mathbb  E_w\Big[\frac{1}{T}\sum_{k=0}^{T} c(s_k, \pi(s_k)) \Big]\quad \text{s.t.}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi(s_k),w_k) $$

The objective, viewed as a function of the initial state for a fixed policy \(\pi\), defines the cost-to-go \(J^\pi(s_0)\).

Bellman Optimality Equation

  • \(\underbrace{J^\star (s)}_{\text{value function}} = \min_{a\in\mathcal A} \underbrace{c(s, a)+\mathbb E_w[J^\star (F(s,a,w))]}_{\text{state-action function}}\)
     
  • Minimizing argument is \(\pi^\star(s)\)

Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas
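For LQR, the Bellman backup above can be computed in closed form: substituting a quadratic guess \(J(s)=s^\top P s\) into the right-hand side and minimizing over \(a\) returns another quadratic, with \(P\) updated by the Riccati map. A minimal sketch of iterating this backup to a fixed point (names are illustrative; this is one way to solve the DARE):

import numpy as np

def riccati_backup(P, A, B, Q, R):
    """One Bellman backup for LQR with cost-to-go J(s) = s' P s.
    (Additive noise only shifts J by a constant, so it does not change P or the minimizer.)"""
    return Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def solve_dare_by_iteration(A, B, Q, R, iters=500):
    P = np.zeros_like(Q)
    for _ in range(iters):
        P = riccati_backup(P, A, B, Q, R)
    return P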

LQR Example

$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$

The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).

Goal: stay near origin and be energy efficient

  • \(c(s,a) = s^\top \begin{bmatrix} 10 & \\ & 0.1 \end{bmatrix}s + 5a^2 \)

\(\pi_\star(s) \approx -\begin{bmatrix} 7.0\times 10^{-2}& 3.7\times 10^{-2}\end{bmatrix} s\)

\(J^\star(s) \approx s^\top \begin{bmatrix} 33.5 & 5.8 \\ 5.8 & 2.4 \end{bmatrix} s\)

Steady State System Response

Closing the loop \(a_t = K s_t\) with the dynamics \(s_{t+1} = As_{t}+Ba_{t}+w_{t}\) and unrolling the recursion gives

\(s_{t+1} = (A+BK)^{t+1} s_0 + \sum_{k=0}^{t} (A+BK)^{t-k} w_{k}\)

\(a_{t+1} = K(A+BK)^{t+1} s_0 + \sum_{k=0}^{t} K(A+BK)^{t-k} w_{k}\)

Instead of a feedback loop with \(\mathbf{K}\), the closed-loop system looks like a line: the system response \(\mathbf{\Phi}\) maps the disturbances directly to states and actions,

\(s_{t} = \Phi_s^{t} s_0 + \sum_{k=1}^t \Phi_s^{k-1}w_{t-k}\)

\(a_{t} = \Phi_a^{t} s_0 + \sum_{k=1}^t \Phi_a^{k-1}w_{t-k}\)

or, stacked over the horizon,

\(\begin{bmatrix} s_{0}\\\vdots \\s_T\end{bmatrix} = \begin{bmatrix} \Phi_s^{0}\\ \Phi_s^{ 1}& \Phi_s^{0}\\ \vdots  & \ddots & \ddots \\ \Phi_s^{T} & \Phi_s^{T-1} & \dots & \Phi_s^{0} \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}\)

\(\begin{bmatrix} a_{0}\\\vdots \\a_T\end{bmatrix} = \begin{bmatrix} \Phi_a^{0}\\ \Phi_a^{1}& \Phi_a^{0}\\ \vdots  & \ddots & \ddots \\ \Phi_a^{T} & \Phi_a^{T-1} & \dots & \Phi_a^{0} \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}\)

 

Sequences & Operators

  • Cost depends on the (semi-)infinite sequence \(\mathbf s = (s_0, s_1, s_2,\dots)\)
  • Generated by the (semi-)infinite operator \(\mathbf \Phi_s = (\Phi_s^0, \Phi_s^1,\dots)\) acting on the disturbance sequence \(\mathbf w = (w_{-1}, w_0, w_1,\dots)\)
    • the operation is a convolution \(s_{t} = \sum_{k=1}^{t+1} \Phi_s^{k-1}w_{t-k}\)
  • We represent this operation with the notation \(\mathbf s = \mathbf \Phi_s\mathbf w\)
  • Concretely,
    • semi-infinite vectors and block-Toeplitz matrices $$\begin{bmatrix} s_{0}\\\vdots \\s_t\\\vdots \end{bmatrix} = \begin{bmatrix} \Phi_s^{0}\\ \Phi_s^{ 1}& \Phi_s^{0}\\ \vdots  & \ddots & \ddots \\ \Phi_s^{t} & \Phi_s^{t-1} & \dots & \Phi_s^{0} \\ \vdots & & \ddots &&\ddots \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{t-1} \\\vdots \end{bmatrix}$$
    • frequency domain
      • define the time shift operator \(z\) such that $$z(s_0, s_1,s_2, \dots) = (s_1, s_2,\dots)$$
      • represent \(\mathbf s(z) = \sum_{t=0}^\infty z^{-t}s_t\) and \(\mathbf \Phi_s(z) = \sum_{t=0}^\infty z^{-t}\Phi_s^t\)
      • multiplication of polynomials: $$ \mathbf \Phi_s(z) \mathbf w(z) = \Big(\sum_{t=0}^\infty z^{-t}\Phi_s^t\Big)\Big(\sum_{t=-1}^\infty z^{-t}w_{t}\Big) = \sum_{t=0}^\infty z^{-t} \sum_{k=1}^{t+1} \Phi_s^{k-1} w_{t-k} $$
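As a concrete illustration of the Toeplitz representation, here is a small numpy sketch (helper names are ours) that builds the finite block-Toeplitz matrix from impulse-response blocks and checks it against a direct simulation of the closed loop:

import numpy as np

def block_toeplitz(blocks):
    """Lower block-triangular Toeplitz matrix with blocks[k] on the k-th sub-diagonal."""
    T = len(blocks)
    n_out, n_in = blocks[0].shape
    M = np.zeros((T * n_out, T * n_in))
    for i in range(T):           # block row
        for j in range(i + 1):   # block column
            M[i*n_out:(i+1)*n_out, j*n_in:(j+1)*n_in] = blocks[i - j]
    return M

# closed-loop response Phi_s^k = (A + BK)^k applied to the stacked vector (s_0, w_0, ..., w_{T-2})
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.070, -0.037]])
Acl = A + B @ K
T = 5
Phi_blocks = [np.linalg.matrix_power(Acl, k) for k in range(T)]

rng = np.random.default_rng(0)
s0 = rng.standard_normal(2)
w = rng.standard_normal((T - 1, 2))
stacked = np.concatenate([s0, w.reshape(-1)])
states = block_toeplitz(Phi_blocks) @ stacked    # stacked (s_0, ..., s_{T-1})

# sanity check against a direct rollout of s_{t+1} = Acl s_t + w_t
s, traj = s0.copy(), [s0.copy()]
for t in range(T - 1):
    s = Acl @ s + w[t]
    traj.append(s.copy())
assert np.allclose(states, np.concatenate(traj))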

LQR Example

$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$

The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).

\(\pi_\star(s) \approx -\begin{bmatrix} 7.0\times 10^{-2}& 3.7\times 10^{-2}\end{bmatrix} s\)

\(\Phi_s^t \approx \begin{bmatrix}0.9 & 0.1 \\ -0.070 & 0.86\end{bmatrix}^{t-1} \quad \Phi_a^t \approx -\begin{bmatrix} 7.0\times 10^{-2}& 3.7\times 10^{-2}\end{bmatrix}\begin{bmatrix}0.9 & 0.1 \\ -0.070 & 0.86\end{bmatrix}^{t-1} \)

eigenvalues of \(A+BK\) are \(\approx 0.88\pm 0.082j\)

       \(a_t = {\color{Goldenrod} K}s_{t}\)

\( \underset{\mathbf a }{\min}\)   \(\displaystyle\lim_{T\to\infty}\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)

\(\text{s.t.}~~s_{t+1} = As_t + Ba_t + w_t\)

\(\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w \)

\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}Q^{1/2} &\\& R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix} \right\|_{\mathcal H_2}^2\)

\(\text{s.t.}~~ \begin{bmatrix} zI -  A & - B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix}= I \)

Infinite Horizon LQR

Exercise: Using the frequency domain notation, derive the expression for the SLS cost and constraints. Hint: in signal notation, the dynamics can be written \(z\mathbf s = A\mathbf s + B\mathbf a + \mathbf w\)

Where we use the \(\mathcal H_2\) norm:

$$ \|\mathbf \Phi\|_{\mathcal H_2}^2 = \sum_{t=0}^\infty \|\Phi^t\|_F^2 $$

Recap: LQR

  • Goal: minimize quadratic cost (\(Q,R\)) in a system with linear dynamics (\(A,B\))
  • Classic approach: Dynamic programming/Bellman optimality
    • \(P = \mathrm{DARE}(A,B,Q,R)\) and \(K_\star = -(R+B^\top PB)^{-1}B^\top PA\)
  • System level synthesis: Convex optimization
    • \( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}Q^{1/2} &\\& R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix} \right\|_{\mathcal H_2}^2~~\text{s.t.}~~ \begin{bmatrix} zI -  A & - B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix}= I \)

  • Both require knowledge of dynamics and costs!

Action in an unknown dynamic world

[Figure: agent-environment loop. The policy \(\pi_t:\mathcal S\to\mathcal A\) selects an action \(a_{t}\) from the observation \(s_t\); the unknown environment (\(?\)) returns the next state, and the agent accumulates data \(\{(s_t, a_t, c_t)\}\).]

Goal: select actions \(a_t\) to bring the environment to low-cost states

Setting: dynamics (and cost) functions are not known, but we have data \(\{s_k, a_k, c_k\}_{k=0}^N\). Approaches include a focus on:

  1. Model: learn dynamics/costs from data, then do policy design
    • For LQR: estimate \(\hat A,\hat B,\hat Q,\hat R\) then design \(\hat K\)
    • "model based"
  2. Bellman: learn value or state-action function
    • For LQR: estimate \(\hat J\), then determine \(\hat K\) as the minimizing argument
    • "model free"
  3. Policy: estimate gradients and update policy directly
    • For LQR: \(\hat K \leftarrow \hat K -\alpha\widehat{\nabla J}(\hat K)\)
    • "model free"

Data-driven Policy Design

Setting: dynamics \(A,B\) are not known, but we have data \(\{s_k, a_k\}_{k=0}^N\)

  1. Learn Model:
    • estimate \(\hat A,\hat B\) via least-squares (see the code sketch after this list) $$\hat A,\hat B = \arg\min_{A,B} \sum_{k=0}^{N-1} \|s_{k+1}-As_k-Ba_k\|_2^2$$
    • error bounds  $$ \|A-\hat A\|_2\leq  \varepsilon_A,\quad \|B-\hat B\|_2\leq \varepsilon_B$$
    • system identification guarantees \(\max\{\varepsilon_A, \varepsilon_B\}\lesssim \sqrt{\frac{m+n}{N}}\)
  2. Design Policy:
    • nominal or certainty equivalent approach uses \(\hat A, \hat B\)
    • robust approach uses \(\hat A, \hat B, \varepsilon_A, \varepsilon_B\)
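A minimal sketch of the least-squares step in item 1 (assuming a single trajectory stored as arrays `states` of shape (N+1, n) and `actions` of shape (N, m); names are illustrative):

import numpy as np

def estimate_dynamics(states, actions):
    """Least-squares fit of s_{k+1} ~ A s_k + B a_k."""
    X = np.hstack([states[:-1], actions])          # regressors [s_k, a_k]
    Y = states[1:]                                 # targets s_{k+1}
    Theta, *_ = np.linalg.lstsq(X, Y, rcond=None)  # Theta = [A^T; B^T]
    n = states.shape[1]
    A_hat, B_hat = Theta[:n].T, Theta[n:].T
    return A_hat, B_hat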

Model-based LQR

LQR Example

true dynamics \(\left(\begin{bmatrix} 1.01 & 0.1\\ & 1.01 \end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}\right) \) but we estimate \(\left(\begin{bmatrix} 0.99 & 0.1\\ & 0.99 \end{bmatrix},\begin{bmatrix}0\\1\end{bmatrix}\right )\)

The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).

Goal: be energy efficient

  • \(c(s,a) = s^\top \begin{bmatrix} 0.01& \\ & 0.01 \end{bmatrix}s + 100a^2 \)

\(\hat\pi_\star(s) \approx -\begin{bmatrix} 6.1\times 10^{-5}& 2.8\times 10^{-4}\end{bmatrix} s\) does not stabilize the system!

Even though \(\varepsilon=0.02\), \(J(\hat K)\) is infinite!
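A quick numerical check of this claim (a sketch: design the certainty-equivalent gain on the estimated model, then inspect the spectral radius of the true closed loop):

import numpy as np
from scipy.linalg import solve_discrete_are

A_true = np.array([[1.01, 0.1], [0.0, 1.01]])
A_hat  = np.array([[0.99, 0.1], [0.0, 0.99]])
B = np.array([[0.0], [1.0]])
Q = 0.01 * np.eye(2)
R = np.array([[100.0]])

# certainty-equivalent design on the estimated (stable) model
P_hat = solve_discrete_are(A_hat, B, Q, R)
K_hat = -np.linalg.solve(R + B.T @ P_hat @ B, B.T @ P_hat @ A_hat)

# the true system is unstable, and the tiny gain does not stabilize it:
# the spectral radius of A_true + B K_hat stays above 1, so J(K_hat) is infinite
rho = max(abs(np.linalg.eigvals(A_true + B @ K_hat)))
print(K_hat, rho)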

Robust design is worst-case

\( \underset{\mathbf a=\mathbf{Ks}}{\min}\)  \(\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}\) \(\mathbb{E}\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)

s.t.  \(s_{t+1} = As_t + Ba_t + w_t\)

Challenge: translating predictions \(\hat s_{t+1} = \hat A\hat s_t + \hat B \hat a_t\) to reality \(s_{t+1} = As_t + Ba_t\)

Lemma: if the system response variables satisfy

  • the nominal system constraint \( \begin{bmatrix} zI -  \hat A & - \hat B\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\  \hat\mathbf{\Phi}_a \end{bmatrix}= I  \)
  • then if the inverse exists,  \( \begin{bmatrix} zI -   A & -  B\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\  \hat\mathbf{\Phi}_a \end{bmatrix} (I-\mathbf \Delta)^{-1}= I \) where \(\mathbf \Delta = (\underbrace{A-\hat A}_{\Delta_A})\hat\mathbf{\Phi}_s + (\underbrace{B-\hat B}_{\Delta_B})\hat\mathbf{\Phi}_a\)

Robust synthesis with SLS

Proof:

  • \((zI -  \hat A)\hat\mathbf{\Phi}_s - \hat B \hat\mathbf{\Phi}_a = I\)
  • \((zI -  \hat A+A-A)\hat\mathbf{\Phi}_s -( \hat B-B+B) \hat\mathbf{\Phi}_a=I\)
  • \((zI -A)\hat\mathbf{\Phi}_s -B\hat\mathbf{\Phi}_a + (A-  \hat A)\hat\mathbf{\Phi}_s + (B-\hat B)\hat\mathbf{\Phi}_a=I\)
  • \((zI -A)\hat\mathbf{\Phi}_s -B\hat\mathbf{\Phi}_a=I - \Delta_A\hat\mathbf{\Phi}_s - \Delta_B\hat\mathbf{\Phi}_a\)
  • \( \begin{bmatrix} zI -   A & -  B\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\  \hat\mathbf{\Phi}_a \end{bmatrix} = I-\mathbf\Delta\)

Therefore, the estimated cost is $$ \hat J(\hat{\mathbf \Phi}) = \left\|\begin{bmatrix} Q^{1/2} & \\ & R^{1/2}\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\  \hat\mathbf{\Phi}_a \end{bmatrix}\right\|_{\mathcal H_2}^2  $$ while the cost actually achieved is  $$ J(\hat{\mathbf \Phi}) = \left\|\begin{bmatrix} Q^{1/2} & \\ & R^{1/2}\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\  \hat\mathbf{\Phi}_a \end{bmatrix}(I-\mathbf \Delta)^{-1} \right\|_{\mathcal H_2}^2   $$

Robust synthesis with SLS

Theorem (Anderson et al., 2019): A policy designed from system responses satisfying  \(\begin{bmatrix} zI -  \hat A & - \hat B\end{bmatrix} \begin{bmatrix} \hat\mathbf{\Phi}_s \\  \hat\mathbf{\Phi}_a \end{bmatrix}= I\) will achieve response \(\begin{bmatrix} \hat\mathbf{\Phi}_s \\  \hat\mathbf{\Phi}_a \end{bmatrix} (I-\mathbf \Delta)^{-1}\)

where \(\mathbf \Delta = (\underbrace{A-\hat A}_{\Delta_A})\hat\mathbf{\Phi}_s + (\underbrace{B-\hat B}_{\Delta_B})\hat\mathbf{\Phi}_a\) if the inverse exists.

\( \widehat{\mathbf\Phi} = \underset{\mathbf{\Phi}, {\color{teal} \gamma}}{\arg\min}\) \(\frac{1}{1-\gamma}\)\(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi} \right\|_{\mathcal{H}_2}\)

\(\qquad\qquad\text{s.t.}~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\)

       \(\qquad\qquad\left\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\leq\gamma\)

Robust synthesis with SLS

\( \underset{\mathbf{\Phi}}{\min}\) \(\underset{\|\Delta_A\|\leq \varepsilon_A \atop \|\Delta_B\|\leq \varepsilon_B}{\max}\) \(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi}{\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_2}\)

\(\text{s.t.}~ {\mathbf\Phi }\in\mathrm{Affine}(\widehat A, \widehat B)\)

\(~~~~~\color{teal} \mathbf \Delta = \begin{bmatrix}\Delta_A&\Delta_B\end{bmatrix}\mathbf{\Phi}\)

\( \underset{\mathbf a=\mathbf{Ks}}{\min}\)  \(\underset{\|A-\widehat A\|\leq \varepsilon_A \atop \|B-\widehat B\|\leq \varepsilon_B}{\max}\) \(\mathbb{E}\left[\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)

s.t.  \(s_{t+1} = As_t + Ba_t + w_t\)

Where we use the \(\mathcal H_\infty\) norm:

\(\|\mathbf \Phi\|_{\mathcal H_\infty} = \max_{\|\mathbf x\|_2\leq 1} \|\mathbf \Phi\mathbf x\|_2 \) induced by \(\|\mathbf x\|_2 =\sqrt{\sum_{t=0}^\infty \|\mathbf x_t\|_2^2} \)

Upper bounding this nonconvex objective leads to the convex program derived below, after a review of the relevant norms.

Review of matrix norms

  • Euclidean norm on vectors
    • \(\|x\|_2 =\sqrt{\sum_{i=1}^n x_i^2} = \sqrt{x^\top x}\)
  • Frobenius norm: how big are matrix entries
    • \(\|A\|_F = \sqrt{\sum_{i=1}^n\sum_{j=1}^m A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}\)
  • Operator norm: how big can this matrix make a vector
    • \(\|A\|_2 = \max_{\|x\|_2\leq 1} \|Ax\|_2 = \sqrt{\lambda_{\max}(A^\top A)} = \sigma_{\max}(A)\)
  • Relationships:
    • \(\|A\|_2 \leq \|A\|_F\)
    • \(\|Ax\|_2 \leq \|A\|_2\|x\|_2\)
    • \(\|AB\|_F \leq \|A\|_2 \|B\|_F\)
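A quick numerical sanity check of these relationships (sketch):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
x = rng.standard_normal(3)

op = np.linalg.norm(A, 2)       # operator norm: largest singular value
fro = np.linalg.norm(A, 'fro')  # Frobenius norm
assert op <= fro + 1e-12
assert np.linalg.norm(A @ x) <= op * np.linalg.norm(x) + 1e-12
assert np.linalg.norm(A @ B, 'fro') <= op * np.linalg.norm(B, 'fro') + 1e-12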

Signal and operator norms

  • \(\ell_2\) norm
    • \(\|\mathbf x\|_2 =\sqrt{\sum_{t=0}^\infty \|\mathbf x_t\|_2^2} \)
  • \(\mathcal H_2\) norm
    • \(\|\mathbf \Phi\|_{\mathcal H_2} = \sqrt{\sum_{t=0}^\infty \|\Phi^t\|_F^2}\)
  • \(\mathcal H_\infty\) norm
    • \(\|\mathbf \Phi\|_{\mathcal H_\infty} = \max_{\|\mathbf x\|_2\leq 1} \|\mathbf \Phi\mathbf x\|_2 \)
  • Relationships:
    • \(\|\mathbf \Phi\|_{\mathcal H_\infty} \leq \|\mathbf \Phi\|_{\mathcal H_2}\)
    • \(\|\mathbf \Phi\mathbf x\|_2 \leq \|\mathbf \Phi\|_{\mathcal H_\infty} \|\mathbf x\|_2\)
    • \(\|\mathbf \Phi \mathbf \Psi\|_{\mathcal H_2} \leq \|\mathbf \Phi \|_{\mathcal H_\infty} \|\mathbf \Psi\|_{\mathcal H_2}\)

Upper bounds follow by:

  • \(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi}{\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_2}  \leq \left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2} \end{bmatrix}\mathbf{\Phi}\right\|_{\mathcal{H}_2} \left\| {\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_\infty} \)
  • \(\left\| {\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_\infty} \leq \frac{1}{1- \|\mathbf \Delta\|_{\mathcal{H}_\infty}}\)
  • \(\left\|\begin{bmatrix}\Delta_A&\Delta_B\end{bmatrix}\mathbf{\Phi}\right\|_{\mathcal{H}_\infty} \leq \left\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\)

Robust synthesis derivation

\( \underset{\mathbf{\Phi}}{\min}\) \(\underset{\|\Delta_A\|\leq \varepsilon_A \atop \|\Delta_B\|\leq \varepsilon_B}{\max}\) \(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi}{\color{teal}(I-\mathbf \Delta)^{-1}} \right\|_{\mathcal{H}_2}\)

\(\text{s.t.}~ {\mathbf\Phi }\in\mathrm{Affine}(\widehat A, \widehat B)\)

\(~~~~~\color{teal} \mathbf \Delta = \begin{bmatrix}\Delta_A&\Delta_B\end{bmatrix}\mathbf{\Phi}\)

\( \underset{\mathbf{\Phi}, {\color{teal} \gamma}}{\min}\) \(\frac{1}{1-\gamma}\)\(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi} \right\|_{\mathcal{H}_2}\)

\(\text{s.t.}~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\)

       \(\left\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\leq\gamma\)

\( \widehat{\mathbf\Phi} = \underset{\mathbf{\Phi}, {\color{teal} \gamma}}{\arg\min}\) \(\frac{1}{1-\gamma}\)\(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \mathbf{\Phi} \right\|_{\mathcal{H}_2}\)

\(\qquad\qquad\text{s.t.}~\begin{bmatrix}zI- \widehat A&- \widehat B\end{bmatrix} \mathbf\Phi = I\)

       \(\qquad\qquad\left\|\begin{bmatrix}\varepsilon_A & \\ & \varepsilon_B\end{bmatrix}\mathbf \Phi\right\|_{\mathcal H_\infty}\leq\gamma\)
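A rough finite-horizon sketch of this program, reusing the CVXPY variables Phi_s, Phi_a, constr, and cost_matrix from the earlier finite-horizon SLS code (built with the estimates \(\hat A,\hat B\)); eps_A and eps_B denote the error bounds \(\varepsilon_A,\varepsilon_B\), the block-Toeplitz operator norm stands in for the \(\mathcal H_\infty\) norm, and \(\gamma\) is treated as a fixed parameter to be swept or bisected over \((0,1)\), since the \(1/(1-\gamma)\) factor makes the joint problem quasi-convex:

import cvxpy as cvx

gamma = 0.5   # illustrative value; in practice sweep/bisect over gamma in (0, 1)
robust_constr = constr + [
    cvx.sigma_max(cvx.bmat([[eps_A * Phi_s], [eps_B * Phi_a]])) <= gamma
]
robust_prob = cvx.Problem(
    cvx.Minimize(cvx.norm(cost_matrix, 'fro') / (1 - gamma)),
    robust_constr,
)
robust_prob.solve()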

Informal Theorem (Suboptimality):

For \(\hat\mathbf{\Phi}\) synthesized as above and \(\mathbf\Phi_\star\) the true optimal system response,

$$  J(\hat{\mathbf \Phi}) -  J(\mathbf \Phi_\star) \lesssim J(\mathbf \Phi_\star)\left\|\begin{bmatrix} \varepsilon_A & \\ & \varepsilon_B\end{bmatrix} \mathbf \Phi_\star\right\|_{\mathcal H_\infty} $$

Robust synthesis with SLS

  1. Learn Model:
    • estimate \(\hat A,\hat B\) via least-squares, guarantee \(\max\{\varepsilon_A, \varepsilon_B\}\lesssim \sqrt{\frac{m+n}{N}}\)
  2. Design Policy:
    • robust approach uses \(\hat A, \hat B, \varepsilon_A, \varepsilon_B\)
      • \(J(\hat{\mathbf \Phi}) -  J(\mathbf \Phi_\star) \lesssim J(\mathbf \Phi_\star)\left\|\mathbf \Phi_\star\right\|_{\mathcal H_\infty} \sqrt{\frac{m+n}{N}}\)

    • nominal or certainty equivalent approach uses \(\hat A, \hat B\)
      • for small enough \(\varepsilon\), can show that \(J(\hat{\mathbf \Phi}) -  J(\mathbf \Phi_\star) \lesssim \varepsilon^2\)
      • thus faster rate, \(J(\hat{\mathbf \Phi}) -  J(\mathbf \Phi_\star) \lesssim \frac{m+n}{N}\)

Model-based LQR

Using an explore-then-commit algorithm, we have $$R(T) = R_{\text{explore}}(N) + R_{\text{commit}}(N, T)$$

  • Robust: \(R(T) \leq C_1 N + C_2 \frac{T}{\sqrt{N}}\)
    • balancing the two terms with \(N\propto T^{2/3}\implies R(T)\lesssim O(T^{2/3})\)
    • stability guaranteed
  • Certainty equivalent: \(R(T) \leq C_1 N + C_2 \frac{T}{N}\)
    • balancing with \(N\propto \sqrt{T}\implies R(T)\lesssim O(\sqrt{T})\)
    • only holds for \(T\) large enough that estimation errors are small

Online Model-based LQR

Recap

  • Steady-state controllers and infinite horizons
    • \(\pi^\star(s) = Ks\)
  • Taxonomy of RL
    • policy, value, model
  • Model-based LQR & Robustness

References: System Level Synthesis by Anderson, Doyle, Low, Matni and Ch 2-3 in Machine Learning in Feedback Systems by Sarah Dean
