Prof. Sarah Dean
Assistant Professor of Computer Science, Cornell
[Figure: feedback loop between a policy and the environment. The policy \(\pi_t:\mathcal S\to\mathcal A\) receives the observation \(s_t\), takes the action \(a_t\), and accumulates data \(\{(s_t, a_t, c_t)\}\).]
Goal: select actions \(a_t\) to bring the environment to low-cost states
$$ s_{t+1} = F(s_t, a_t, w_t) $$
$$ s_{t+1} = As_t+B a_t+ w_t$$
Linear System: State space \(\mathcal S = \mathbb R^n\) and actions \(\mathcal A=\mathbb R^m\) with dynamics defined by \(A\in\mathbb R^{n\times n}\) and \(B\in\mathbb R^{n\times m}\)
The system is controllable if any \(s_\star\in\mathcal S\) is reachable from any \(s_0 \in\mathcal S\).
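For the noiseless linear dynamics \(s_{t+1}=As_t+Ba_t\), one standard way to check this reachability property is the Kalman rank test: the system is controllable exactly when \([B, AB, \dots, A^{n-1}B]\) has rank \(n\). The sketch below is a minimal illustration; the function names and the matrix values are assumptions chosen for the example, not part of the slides.

```python
import numpy as np

def controllability_matrix(A, B):
    """Stack [B, AB, ..., A^{n-1}B] column-wise."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def is_controllable(A, B):
    """Kalman rank test: controllable iff the matrix above has rank n."""
    return np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0]

# Illustrative values (assumed): position/velocity state, force input.
A = np.array([[0.9, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])      # input enters through the velocity
B_bad = np.array([[1.0],
                  [0.0]])  # input enters through the position only

print(is_controllable(A, B))      # velocity input reaches both states
print(is_controllable(A, B_bad))  # velocity can never be influenced
```

With `B_bad` the second state evolves as \(\omega_{t+1}=0.9\,\omega_t\) regardless of the input, which is why the test fails there.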
Stochastic Optimal Control Problem
$$ \min_{\pi_{0:T}} ~~\underbrace{\mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, \pi_k(s_k)) \Big]}_{J^\pi(s_0)}\quad \text{s.t.}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$
Dynamic Programming Algorithm
Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas
LQR Problem
$$ \min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t.}\quad s_{k+1} = A s_k+ Ba_k+w_k,~~a_k=\pi_k(s_k) $$
$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$
The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).
Goal: stay near origin and be energy efficient
DP: \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)
Claim: For \(t=0,\dots,T\), the optimal cost-to-go function is quadratic and the optimal policy is linear
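The claim can be made concrete with the backward Riccati recursion obtained by plugging a quadratic \(J_{k+1}^\star(s)=s^\top P_{k+1}s\) into the DP step. The sketch below is a minimal illustration under assumptions: the weights \(Q=I\), \(R=1\), the horizon, and the function names are chosen for the example and are not given in the slides. With zero-mean, independent disturbances the gains are the same and the cost-to-go only picks up a state-independent constant, so the noiseless rollout is enough to check the recursion.

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Riccati recursion for the cost sum_{k=0}^T s'Qs + a'Ra.

    Returns gains K[0..T] (a_k = K[k] s_k) and matrices P[0..T]
    with noiseless cost-to-go J_k(s) = s' P[k] s.
    """
    n, m = B.shape
    P = [None] * (T + 1)
    K = [None] * (T + 1)
    P[T] = Q.copy()
    K[T] = np.zeros((m, n))  # the last action only adds cost
    for k in range(T - 1, -1, -1):
        G = R + B.T @ P[k + 1] @ B
        K[k] = -np.linalg.solve(G, B.T @ P[k + 1] @ A)
        P[k] = Q + A.T @ P[k + 1] @ (A + B @ K[k])
    return K, P

def rollout_cost(A, B, Q, R, K, s0):
    """Noiseless cost of applying a_k = K[k] s_k from s0."""
    s, total = s0, 0.0
    for Kk in K:
        a = Kk @ s
        total += s @ Q @ s + a @ R @ a
        s = A @ s + B @ a
    return total

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)  # assumed weights, not from the slides
T = 20
K, P = lqr_backward(A, B, Q, R, T)

s0 = np.array([1.0, 0.0])
print(rollout_cost(A, B, Q, R, K, s0), s0 @ P[0] @ s0)  # the two agree
```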
For the position and velocity example above, the optimal policy is approximately \(\pi_\star(s) \approx -\begin{bmatrix} 7.0\times 10^{-2}& 3.7\times 10^{-2}\end{bmatrix} s\)
$$ \min_{a_{0:T}} ~~\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \quad \text{s.t.}\quad s_{k+1} = A s_k+ Ba_k $$
$$ \min_{\mathbf a} ~~\mathbf s^\top \bar Q\mathbf s + \mathbf a^\top \bar R\mathbf a\quad \text{s.t.}\quad \mathbf s_{1:T} = \bar A \mathbf s_{0:T-1}+ \bar B\mathbf a $$
$$ \min_{K_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top (Q + K_k^\top R K_k)s_k \Big]\quad \text{s.t.}\quad s_{k+1} = (A +BK_k)s_k+w_k $$
Exercise: Prove that \(s_{t+1}=A_t s_t + w_t\) implies $$s_t = (A_{t-1}A_{t-2}\cdots A_0) s_0 + \sum_{k=0}^{t-2} (A_{t-1}\cdots A_{k+1}) w_{k}+w_{t-1}$$
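Before proving the identity, it can help to check it numerically. The sketch below simulates the recursion with random matrices and compares the result against the unrolled expression; the dimensions, the random seed, and the helper `chain` are assumptions made for this illustration, and a numerical check is of course not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 3, 5
A_seq = [rng.normal(size=(n, n)) for _ in range(t)]  # A_0, ..., A_{t-1}
w_seq = [rng.normal(size=n) for _ in range(t)]       # w_0, ..., w_{t-1}
s0 = rng.normal(size=n)

# Left-hand side: run the recursion s_{k+1} = A_k s_k + w_k for t steps.
s = s0
for A_k, w_k in zip(A_seq, w_seq):
    s = A_k @ s + w_k

def chain(lo, hi):
    """Return A_hi A_{hi-1} ... A_lo (the identity when lo > hi)."""
    M = np.eye(n)
    for j in range(lo, hi + 1):
        M = A_seq[j] @ M
    return M

# Right-hand side: the unrolled expression; the k = t-1 term is w_{t-1}.
closed_form = chain(0, t - 1) @ s0 + sum(
    chain(k + 1, t - 1) @ w_seq[k] for k in range(t)
)

print(np.allclose(s, closed_form))
```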
Example: For a 1D system with \(A=B=1\) and zero-mean disturbances, \(\mathbb E[s_2] = (1+K_1)(1+K_0)s_0\), which is bilinear in \((K_0, K_1)\)
[Figure: block diagram of the system \(A\), \(B\) with state \(s\) and disturbance \(w_t\), in feedback with the controller \(\mathbf{K}\) mapping \(s_t\) to \(a_t\).]
\(s_{t+1} = As_{t}+Ba_{t}+w_{t} = \prod_{\ell=0}^{t}(A+BK_{\ell}) s_0 + \sum_{k=0}^{t} \prod_{\ell=k+1}^{t}(A+BK_\ell) w_{k}\)
\(a_t = K_t s_t = K_t \prod_{\ell=0}^{t-1}(A+BK_{\ell}) s_0 + K_t \sum_{k=0}^{t-1} \prod_{\ell=k+1}^{t-1}(A+BK_\ell) w_{k}\)
(Products are ordered with the largest index on the left, and an empty product is the identity.)
\(s_{t} = \Phi_s^{t,t} s_0 + \sum_{k=0}^{t-1} \Phi_s^{t, k}w_{t-1-k}\)
\(a_{t} = \Phi_a^{t,t} s_0 + \sum_{k=0}^{t-1} \Phi_a^{t, k}w_{t-1-k}\)
\(\begin{bmatrix} s_{0}\\\vdots \\s_T\end{bmatrix} = \begin{bmatrix} \Phi_s^{0,0}\\ \Phi_s^{1, 1}& \Phi_s^{1,0}\\ \vdots & \ddots & \ddots \\ \Phi_s^{T,T} & \Phi_s^{T,T-1} & \dots & \Phi_s^{T,0} \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}\)
\(\begin{bmatrix} a_{0}\\\vdots \\a_T\end{bmatrix} = \begin{bmatrix} \Phi_a^{0,0}\\ \Phi_a^{1, 1}& \Phi_a^{1,0}\\ \vdots & \ddots & \ddots \\ \Phi_a^{T,T} & \Phi_a^{T,T-1} & \dots & \Phi_a^{T,0} \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}\)
\(\mathbf s = \mathbf \Phi_s \mathbf w\)
\(\mathbf a = \mathbf \Phi_a \mathbf w\)
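As a sanity check on the stacked form, the sketch below builds \(\mathbf\Phi_s\) and \(\mathbf\Phi_a\) from a sequence of gains and compares \(\mathbf\Phi_s\mathbf w\), \(\mathbf\Phi_a\mathbf w\) with a direct simulation, taking \(\mathbf w = [s_0, w_0, \dots, w_{T-1}]\) as in the matrices above. The function name, the random gains, and the horizon are assumptions for illustration only.

```python
import numpy as np

def closed_loop_maps(A, B, K):
    """Block lower-triangular Phi_s, Phi_a for gains K[0..T].

    Block (i, j) of Phi_s is (A+BK_{i-1}) ... (A+BK_j), the effect of
    the j-th entry of w = [s_0, w_0, ..., w_{T-1}] on s_i.
    """
    n, m = B.shape
    N = len(K)  # T + 1
    Phi_s = np.zeros((N * n, N * n))
    Phi_a = np.zeros((N * m, N * n))
    for j in range(N):
        M = np.eye(n)  # diagonal block (j, j)
        for i in range(j, N):
            Phi_s[i * n:(i + 1) * n, j * n:(j + 1) * n] = M
            Phi_a[i * m:(i + 1) * m, j * n:(j + 1) * n] = K[i] @ M
            M = (A + B @ K[i]) @ M
    return Phi_s, Phi_a

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
T = 4
K = [0.1 * rng.normal(size=(1, 2)) for _ in range(T + 1)]  # assumed gains
w = rng.normal(size=(T + 1, 2))  # row 0 is s_0, row t+1 is w_t

# Direct simulation of s_{t+1} = A s_t + B a_t + w_t with a_t = K_t s_t.
s, a = [w[0]], []
for t in range(T + 1):
    a.append(K[t] @ s[t])
    if t < T:
        s.append(A @ s[t] + B @ a[t] + w[t + 1])

Phi_s, Phi_a = closed_loop_maps(A, B, K)
print(np.allclose(Phi_s @ w.ravel(), np.concatenate(s)))
print(np.allclose(Phi_a @ w.ravel(), np.concatenate(a)))
```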
Reparametrized objective: \( \mathbf s^\top \bar Q\mathbf s + \mathbf a^\top \bar R\mathbf a = \mathbf w^\top (\mathbf \Phi_s^\top \bar Q \mathbf \Phi_s + \mathbf \Phi_a^\top \bar R \mathbf \Phi_a )\mathbf w\)
Reparametrized constraints: the dynamics \(s_{t+1} = As_{t}+Ba_{t}+w_{t}\), i.e. \(\mathbf s_{1:T} = \bar A \mathbf s_{0:T-1}+ \bar B\mathbf a+\mathbf w_{0:T-1}\), stack as
$$\begin{bmatrix}s_0 \\ \vdots \\ s_T\end{bmatrix}= \underbrace{\begin{bmatrix}0\\ A \\ &\ddots \\&& A\end{bmatrix} }_{\mathcal Z \bar A}\begin{bmatrix}s_0 \\ \vdots \\ s_T\end{bmatrix} + \underbrace{\begin{bmatrix}0\\ B \\ &\ddots \\&& B\end{bmatrix} }_{\mathcal Z \bar B} \begin{bmatrix}a_0 \\ \vdots \\ a_{T}\end{bmatrix} +\begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}$$
\(\iff\quad\mathbf \Phi_s\mathbf w = \mathcal Z \bar A \mathbf \Phi_s\mathbf w + \mathcal Z \bar B \mathbf \Phi_a\mathbf w + \mathbf w \)
\(\sum_{k=0}^{t+1} \Phi_s^{t+1, k}w_{t-k} = A\sum_{k=0}^t \Phi_s^{t, k}w_{t-1-k} + B\sum_{k=0}^t \Phi_a^{t, k}w_{t-1-k} + w_{t}\)
(let \(w_{-1}=s_0\))
Claim: The above equality is implied by
$$\Phi_s^{t,0}=I,\quad \Phi_s^{t+1, k+1} = A \Phi_s^{t, k}+B\Phi_a^{t, k} \quad \forall ~t,~k\leq t $$
References: System Level Synthesis by Anderson, Doyle, Low, Matni
Theorem: For a linear system in feedback with a linear controller over the horizon \(t=0,\dots, T\):
Example: For a 1D system with \(A=B=1\),
$$ \min_{\mathbf \Phi} ~~\mathbb E_w\Big[ \mathbf w^\top(\mathbf \Phi_s^\top \bar Q \mathbf \Phi_s+ \mathbf \Phi_a^\top \bar R \mathbf \Phi_a )\mathbf w \Big]\quad \text{s.t.}\quad (I - \mathcal Z \bar A )\mathbf \Phi_s- \mathcal Z \bar B \mathbf \Phi_a = I $$
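To see how the affine constraint ties \(\mathbf\Phi_s\) and \(\mathbf\Phi_a\) to an actual controller, the sketch below picks an arbitrary block-lower-triangular \(\mathbf\Phi_a\), solves the constraint for \(\mathbf\Phi_s\), and forms \(\mathbf K = \mathbf\Phi_a\mathbf\Phi_s^{-1}\), which is the controller construction given in the System Level Synthesis reference. Simulating the loop with this \(\mathbf K\) reproduces \(\mathbf s=\mathbf\Phi_s\mathbf w\) and \(\mathbf a=\mathbf\Phi_a\mathbf w\). The random \(\mathbf\Phi_a\), the horizon, and the variable names are assumptions for illustration; note that the resulting \(\mathbf K\) may depend on past states, not only the current one.

```python
import numpy as np

n, m, T = 2, 1, 3
N = T + 1
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])

# Z A-bar and Z B-bar: A and B on the block subdiagonal.
shift = np.eye(N, k=-1)
ZA = np.kron(shift, A)
ZB = np.kron(shift, B)

# Any block-lower-triangular Phi_a (random here, purely illustrative) ...
rng = np.random.default_rng(0)
mask = np.kron(np.tril(np.ones((N, N))), np.ones((m, n)))
Phi_a = rng.normal(size=(N * m, N * n)) * mask
# ... determines Phi_s through (I - ZA) Phi_s - ZB Phi_a = I.
Phi_s = np.linalg.solve(np.eye(N * n) - ZA, np.eye(N * n) + ZB @ Phi_a)

# Controller achieving these responses; block lower triangular (causal).
K = Phi_a @ np.linalg.inv(Phi_s)

# Simulate the loop with w = [s_0, w_0, ..., w_{T-1}].
w = rng.normal(size=N * n)
s = np.zeros(N * n)
a = np.zeros(N * m)
s[:n] = w[:n]
for t in range(N):
    a[t * m:(t + 1) * m] = K[t * m:(t + 1) * m, :(t + 1) * n] @ s[:(t + 1) * n]
    if t < T:
        s[(t + 1) * n:(t + 2) * n] = (A @ s[t * n:(t + 1) * n]
                                      + B @ a[t * m:(t + 1) * m]
                                      + w[(t + 1) * n:(t + 2) * n])

print(np.allclose(s, Phi_s @ w), np.allclose(a, Phi_a @ w))
```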
In closed loop, state and input are linear functions of disturbance
\(x_{t+1} = \sum_{k=0}^t A^{k}(Bu_{t-k} + w_{t-k})\) (taking \(x_0=0\))
\(u_t = \sum_{k=0}^t K_kx_{t-k}\)
\(\begin{bmatrix} x_t\\u_t \end{bmatrix} = \sum_{k=0}^t \begin{bmatrix} \Phi_x(k)\\ \Phi_u(k) \end{bmatrix} w_{t-k}\)
Instead of reasoning about a controller \(\mathbf{K}\), we reason about the interconnection \(\mathbf\Phi\) directly.
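Instead of a loop, the system looks like a line.
[Figure: block diagrams of the feedback loop between \((A,B)\) and \(\mathbf{K}\) with signals \(\bf x\), \(\bf u\), \(\bf w\), and of the equivalent map from \(\bf w\) to \(\bf x\) and \(\bf u\).]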
\( \underset{\mathbf u }{\min}\) \(\displaystyle\lim_{T\to\infty}\mathbb{E}\left[ \frac{1}{T}\sum_{t=0}^T x_t^\top Q x_t + u_t^\top R u_t\right]\)
\(\text{s.t.}~~x_{t+1} = Ax_t + Bu_t + w_t,\quad u_t = \sum_{k=0}^t{\color{Goldenrod} K_k} x_{t-k}\)
\(\begin{bmatrix} \mathbf x\\ \mathbf u\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_x\\ \mathbf \Phi_u\end{bmatrix}\mathbf w \)
\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix} Q^{1/2} &\\& R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_x \\ \color{teal} \mathbf{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_2}^2\)
\(\text{s.t.}~~ {\color{teal}\mathbf\Phi }\in\mathrm{Affine}(A, B)\)
To implement the resulting controller:
$$ \min_{\mathbf \Phi} ~~\left\|\begin{bmatrix}\bar Q^{1/2} \\&\bar R^{1/2}\end{bmatrix} \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a \end{bmatrix}\right\|_F^2\quad \text{s.t.}\quad \begin{bmatrix} I - \mathcal Z \bar A & - \mathcal Z \bar B\end{bmatrix} \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a \end{bmatrix}= I $$
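The step from the expected quadratic cost to the Frobenius norm can be spelled out as follows. This is a sketch under an assumption that the slides do not state explicitly: the stacked vector \(\mathbf w=[s_0, w_0,\dots,w_{T-1}]\) is zero mean with \(\mathbb E[\mathbf w\mathbf w^\top]=I\), which treats \(s_0\) as random with identity covariance as well. Writing \(M=\mathbf \Phi_s^\top \bar Q \mathbf \Phi_s+ \mathbf \Phi_a^\top \bar R \mathbf \Phi_a\),

$$ \mathbb E\big[\mathbf w^\top M\mathbf w\big] = \mathbb E\big[\mathrm{tr}(M\mathbf w\mathbf w^\top)\big] = \mathrm{tr}\big(M\,\mathbb E[\mathbf w\mathbf w^\top]\big) = \mathrm{tr}(M) $$

$$ \mathrm{tr}(M) = \mathrm{tr}(\mathbf \Phi_s^\top \bar Q \mathbf \Phi_s) + \mathrm{tr}(\mathbf \Phi_a^\top \bar R \mathbf \Phi_a) = \big\|\bar Q^{1/2}\mathbf \Phi_s\big\|_F^2 + \big\|\bar R^{1/2}\mathbf \Phi_a\big\|_F^2 = \left\|\begin{bmatrix}\bar Q^{1/2} \\&\bar R^{1/2}\end{bmatrix} \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a \end{bmatrix}\right\|_F^2 $$

For a general covariance \(\Sigma=\mathbb E[\mathbf w\mathbf w^\top]\), the same steps give \(\mathrm{tr}(M\Sigma)\), i.e. the Frobenius norm with \(\mathbf\Phi\) replaced by \(\mathbf\Phi\Sigma^{1/2}\).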
\( \underset{\mathbf a }{\min}\) \(\displaystyle\mathbb{E}\left[\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)
\(\text{s.t.}~~s_{t+1} = As_t + Ba_t + w_t,\quad a_t = {\color{Goldenrod} K_t }s_{t}\)
\(\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w \)
\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}\bar Q^{1/2} &\\& \bar R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix} \right\|_{F}^2\)
\(\text{s.t.}~~ \begin{bmatrix} I - \mathcal Z \bar A & - \mathcal Z \bar B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix}= I \)
Infinite Horizon LQR Problem
$$ \min_{\pi} ~~\lim_{T\to\infty}\mathbb E_w\Big[\frac{1}{T}\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t.}\quad s_{k+1} = A s_k+ Ba_k+w_k,~~a_k=\pi(s_k) $$
Claim: The optimal cost-to-go function is quadratic and the optimal policy is linear $$J^\star (s) = s^\top P s,\qquad \pi^\star(s) = K s$$
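One way to compute such a \(P\) and \(K\) is to iterate the finite-horizon Riccati recursion until it stops changing, so that \(P\) satisfies the discrete algebraic Riccati equation. The sketch below is a minimal illustration under assumptions: the weights \(Q=I\), \(R=1\), the tolerance, and the function name are chosen for the example, and plain fixed-point iteration is used rather than a dedicated solver.

```python
import numpy as np

def lqr_infinite(A, B, Q, R, iters=10_000, tol=1e-12):
    """Iterate P <- Q + A'P(A + BK) with K = -(R + B'PB)^{-1} B'PA."""
    P = Q.copy()
    for _ in range(iters):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P_next = Q + A.T @ P @ (A + B @ K)
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)  # assumed weights, not from the slides

K, P = lqr_infinite(A, B, Q, R)
rho = max(abs(np.linalg.eigvals(A + B @ K)))  # closed-loop spectral radius
print(K, rho)
```

SciPy's `scipy.linalg.solve_discrete_are(A, B, Q, R)` solves the same equation directly and could replace the loop.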
\(s_{t+1} = As_{t}+Ba_{t}+w_{t} = (A+BK)^{t+1} s_0 + \sum_{k=0}^{t} (A+BK)^{t-k} w_{k}\)
\(a_t = K s_t = K(A+BK)^{t} s_0 + \sum_{k=0}^{t-1} K(A+BK)^{t-1-k} w_{k}\)
\(s_{t} = \Phi_s^{t} s_0 + \sum_{k=0}^{t-1} \Phi_s^{k}w_{t-1-k}\)
\(a_{t} = \Phi_a^{t} s_0 + \sum_{k=0}^{t-1} \Phi_a^{k}w_{t-1-k}\)
\(\begin{bmatrix} s_{0}\\\vdots \\s_T\end{bmatrix} = \begin{bmatrix} \Phi_s^{0}\\ \Phi_s^{ 1}& \Phi_s^{0}\\ \vdots & \ddots & \ddots \\ \Phi_s^{T} & \Phi_s^{T-1} & \dots & \Phi_s^{0} \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}\)
\(\begin{bmatrix} a_{0}\\\vdots \\a_T\end{bmatrix} = \begin{bmatrix} \Phi_a^{0}\\ \Phi_a^{1}& \Phi_a^{0}\\ \vdots & \ddots & \ddots \\ \Phi_a^{T} & \Phi_a^{T-1} & \dots & \Phi_a^{0} \end{bmatrix} \begin{bmatrix} s_0\\w_0\\ \vdots \\w_{T-1}\end{bmatrix}\)
\( \underset{\mathbf a }{\min}\) \(\displaystyle\lim_{T\to\infty}\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\right]\)
\(\text{s.t.}~~s_{t+1} = As_t + Ba_t + w_t,\quad a_t = {\color{Goldenrod} K}s_{t}\)
\(\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w \)
\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}Q^{1/2} &\\& R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix} \right\|_{\mathcal H_2}^2\)
\(\text{s.t.}~~ \begin{bmatrix} zI - A & - B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a \end{bmatrix}= I \)
Exercise: Using the frequency domain notation, derive the expression for the SLS cost and constraints, where we define the norm:
$$ \|\mathbf \Phi\|_{\mathcal H_2}^2 = \sum_{t=0}^\infty \|\Phi^t\|_F^2 $$
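The norm itself is easy to evaluate numerically by truncating the sum. The sketch below does this for the closed-loop responses \(\Phi_s^t=(A+BK)^t\), \(\Phi_a^t=K(A+BK)^t\) of a static gain, using the approximate gain quoted earlier for the position and velocity example. The weights \(Q=I\), \(R=1\), the truncation length, and the function name are assumptions for illustration; it does not replace the frequency-domain derivation asked for in the exercise.

```python
import numpy as np

def h2_norm_sq(Phi_s, Phi_a, Q, R):
    """Sum over t of ||Q^{1/2} Phi_s^t||_F^2 + ||R^{1/2} Phi_a^t||_F^2."""
    return sum(
        np.trace(Ps.T @ Q @ Ps) + np.trace(Pa.T @ R @ Pa)
        for Ps, Pa in zip(Phi_s, Phi_a)
    )

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)          # assumed weights
K = -np.array([[7.0e-2, 3.7e-2]])    # approximate gain from the example
A_cl = A + B @ K

horizon = 500  # truncation; terms decay geometrically for stable A_cl
Phi_s = [np.linalg.matrix_power(A_cl, t) for t in range(horizon)]
Phi_a = [K @ Ps for Ps in Phi_s]

cost = h2_norm_sq(Phi_s, Phi_a, Q, R)
print(cost)
```

Since the closed loop is stable here, the truncated sum matches the trace of the solution of the Lyapunov equation \(P = Q + K^\top R K + (A+BK)^\top P (A+BK)\), which gives an independent check.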
References: System Level Synthesis by Anderson, Doyle, Low, Matni and Ch 2 in Machine Learning in Feedback Systems by Sarah Dean