Prof Sarah Dean
policy
\(\pi_t:\mathcal S\to\mathcal A\)
observation
\(s_t\)
accumulate
\(\{(s_t, a_t, c_t)\}\)
Goal: select actions \(a_t\) to bring environment to low-cost states
while avoiding unsafe states
action
\(a_{t}\)
\(s\)
A state \(s\) is safe if \(s\in\mathcal S_\mathrm{safe}\).
A trajectory of states \((s_0,\dots,s_t)\) is safe if \(s_k\in\mathcal S_\mathrm{safe}\) for all \(0\leq k\leq t\).
A system \(s_{t+1}=F(s_t)\) is safe if some \(\mathcal S_\mathrm{inv}\subseteq \mathcal S_{\mathrm{safe}}\) is invariant and \(s_0\in \mathcal S_{\mathrm{inv}}\).
\(a_t = {\color{Goldenrod} K_t }s_{t}\)
\( \underset{\mathbf a }{\min}\) \(\displaystyle\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\)
\(\text{s.t.}~~s_{t+1} = As_t + Ba_t \)
\(s_t \in\mathcal S_\mathrm{safe},~~ a_t \in\mathcal A_\mathrm{safe}\)
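Dropping the safety constraints for a moment, this finite-horizon LQR problem has a closed-form solution via the backward Riccati recursion, which is what produces time-varying gains \(K_t\) as above. A minimal numpy sketch (the matrices at the bottom are illustrative; they are the double integrator that appears later in the lecture):

```python
import numpy as np

def finite_lqr_gains(A, B, Q, R, T):
    """Backward Riccati recursion for the finite-horizon LQR problem
    (safety constraints ignored). Returns K_0, ..., K_{T-1}, so the
    optimal action is a_t = K_t s_t."""
    P = Q                  # terminal cost-to-go
    gains = []
    for _ in range(T):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)
        gains.append(K)
    return gains[::-1]     # gains[t] is K_t

# illustrative matrices: the double integrator used later
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
gains = finite_lqr_gains(A, B, np.eye(2), np.eye(1), T=30)
```

For a long enough horizon, the first gain already stabilizes the system.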
\(\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w \)
\(\mathbf w = \begin{bmatrix}s_0\\ 0\\ \vdots \\0 \end{bmatrix}\)
\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}\bar Q^{1/2} &\\& \bar R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a\end{bmatrix} \mathbf w\right\|_{2}^2\)
\(\text{s.t.}~~ \begin{bmatrix} I - \mathcal Z \bar A & - \mathcal Z \bar B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a\end{bmatrix}= I \)
\(\mathbf \Phi_s\mathbf w \in\mathcal S_\mathrm{safe}^T,~~\mathbf \Phi_a\mathbf w\in\mathcal A_\mathrm{safe}^T\)
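The affine constraint above is the stacked dynamics in disguise: taking \(\mathcal Z\) to be the block-downshift operator and substituting \(\mathbf s = \mathbf\Phi_s\mathbf w\), \(\mathbf a = \mathbf\Phi_a\mathbf w\) into the dynamics gives

$$\mathbf s = \mathcal Z(\bar A\mathbf s + \bar B\mathbf a) + \mathbf w \iff \big[(I-\mathcal Z\bar A)\mathbf\Phi_s - \mathcal Z\bar B\,\mathbf\Phi_a\big]\mathbf w = \mathbf w,$$

which holds for all \(\mathbf w\) (in particular, for every initial state \(s_0\)) exactly when \(\begin{bmatrix} I - \mathcal Z \bar A & - \mathcal Z \bar B\end{bmatrix} \begin{bmatrix}\mathbf{\Phi}_s \\ \mathbf{\Phi}_a\end{bmatrix}= I\).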
Claim: Suppose that for all \(t\), the policy satisfies
$$\pi(s_t) = \arg\min_a \|a - \pi^\star_\mathrm{unc}(s_t)\|_2^2 \quad\text{s.t.}\quad C(F(s_t, a)) \leq \gamma C(s_t) $$
Then \(C(s_t)\leq \gamma^t C(s_0)\) for all \(t\), so the sublevel sets of \(C\) are invariant.
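One way to sketch this safety filter numerically is with a generic constrained solver; everything below (the dynamics \(F\), the quadratic safety value \(C\), and the chosen state) is illustrative, not from the lecture:

```python
import numpy as np
from scipy.optimize import minimize

def safety_filter(s, a_unc, F, C, gamma=0.9):
    """Return the action closest to the unconstrained optimum a_unc
    among actions that contract the safety value: C(F(s,a)) <= gamma*C(s)."""
    cons = {"type": "ineq", "fun": lambda a: gamma * C(s) - C(F(s, a))}
    res = minimize(lambda a: np.sum((a - a_unc) ** 2),
                   x0=np.atleast_1d(a_unc), constraints=[cons])
    return res.x

# illustrative double-integrator step and quadratic safety value
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
F = lambda s, a: A @ s + B @ np.atleast_1d(a)
C = lambda s: float(s @ s)

s = np.array([0.0, 1.0])
a = safety_filter(s, a_unc=np.zeros(1), F=F, C=C)
```

When the unconstrained action already satisfies the contraction condition, the filter leaves it unchanged; otherwise it projects onto the boundary of the constraint.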
size of \(s\)
size of \(a\)
safety constraint
\(C(s)=0\)
Instead of optimizing for open loop control...
\( \underset{a_0,\dots,a_T }{\min}\) \(\displaystyle\sum_{t=0}^T c(s_t, a_t)\)
\(\text{s.t.}~~s_0~\text{given},~~s_{t+1} = F(s_t, a_t)\)
\(s_t \in\mathcal S_\mathrm{safe},~~ a_t \in\mathcal A_\mathrm{safe}\)
...re-optimize to close the loop
model predicts the trajectory during planning
Also called Model Predictive Control
Figure from slides by Borrelli, Jones, Morari
[Figure: receding-horizon timeline, alternating Plan and Do steps over time]
\(a_t\)
\(s\)
\(s_t\)
\(a_t\)
\( \underset{a_0,\dots,a_H }{\min}\) \(\displaystyle\sum_{k=0}^H c(s_k, a_k)\)
\(\text{s.t.}~~s_0~\text{given},~~s_{k+1} = F(s_k, a_k)\)
\(s_k \in\mathcal S_\mathrm{safe},~~ a_k \in\mathcal A_\mathrm{safe}\)
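The receding-horizon loop itself is simple; a sketch where `plan` stands in for any solver of the \(H\)-step problem above (the scalar dynamics and the planner below are toy, hypothetical stand-ins, just to exercise the loop structure):

```python
def mpc_rollout(s0, plan, F, T):
    """Receding-horizon loop: re-plan from the current state each step,
    apply only the first planned action, then repeat."""
    s, trajectory = s0, []
    for t in range(T):
        u = plan(s)[0]          # a_t = u_0^*(s_t); discard u_1, ..., u_H
        trajectory.append((s, u))
        s = F(s, u)             # the real system moves one step
    return trajectory, s

# toy stand-ins: unstable scalar dynamics, hypothetical "planner"
F = lambda s, u: 1.1 * s + u
plan = lambda s: [-0.6 * s] * 5   # pretend this solves the H-step problem
traj, s_final = mpc_rollout(1.0, plan, F, T=50)
```

Even though five actions are planned each step, only the first is ever executed; here the closed loop contracts by a factor of 0.5 per step.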
We can:
\(\pi(s_t) = u_0^\star(s_t)\)
$$\min_{u_0,\dots, u_{H}} \quad\sum_{k=0}^{H} c(x_{k}, u_{k})$$
\(\text{s.t.}\quad x_0 = s_t,\quad x_{k+1} = F(x_{k}, u_{k})\)
\(x_k\in\mathcal S_\mathrm{safe},\quad u_k\in\mathcal A_\mathrm{safe}\quad~~~\)
Notation: distinguish real states and actions \(s_t\) and \(a_t\) from the planned optimization variables \(x_k\) and \(u_k\).
\([u_0^\star,\dots, u_{H}^\star](s_t) = \)
$$\arg\min_{u_0,\dots, u_{H}} \quad\sum_{k=0}^{H} c(x_{k}, u_{k})$$
\(\text{s.t.}\quad x_0 = s_t,\quad x_{k+1} = F(x_{k}, u_{k})\)
\(x_k\in\mathcal S_\mathrm{safe},\quad u_k\in\mathcal A_\mathrm{safe}\quad~~~\)
Notation: distinguish real states and actions \(s_t\) and \(a_t\) from the planned optimization variables \(x_k\) and \(u_k\).
\(s\)
\(s_t\)
\(a_t = u_0^\star(s_t)\)
Infinite Horizon LQR Problem
$$ \min ~~\lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T} s_t^\top Qs_t+ a_t^\top Ra_t\quad\\ \text{s.t}\quad s_{t+1} = A s_t+ Ba_t$$
We know that \(a^\star_t = \pi^\star(s_t)\) where \(\pi^\star(s) = K s\) and \(K = -(R + B^\top P B)^{-1} B^\top P A\), with \(P\) the solution of the discrete algebraic Riccati equation.
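In code, \(P\) and \(K\) can be computed with scipy's DARE solver; a sketch using the double-integrator matrices that appear later in the lecture, purely for illustration:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

# P solves the discrete algebraic Riccati equation;
# K is the optimal infinite-horizon gain in pi*(s) = K s
P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

The resulting closed loop \(A + BK\) is stable, and \(P\) satisfies the Riccati fixed-point equation.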
Finite LQR Problem
$$ \min ~~\sum_{k=0}^{H} x_k^\top Qx_k + u_k^\top Ru_k \quad \\ \text{s.t}\quad x_0=s,\quad x_{k+1} = A x_k+ Bu_k $$
MPC Policy \(a_t = u^\star_0(s_t)\) where
\(u^\star_0(s) = K_0s\) and \(K_0\) is computed by \(H\) steps of the backward Riccati recursion.
The state is position & velocity \(s=[\theta,\omega]\) with \( s_{t+1} = \begin{bmatrix} 1 & 0.1\\ 0 & 1 \end{bmatrix}s_t + \begin{bmatrix} 0\\ 1 \end{bmatrix}a_t\)
Goal: stay near origin and be energy efficient
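A quick closed-loop simulation of this double integrator under a linear policy (the gain below is a hypothetical stabilizing choice, not the optimal LQR gain):

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.5, -1.0]])       # hypothetical stabilizing gain

s = np.array([[1.0], [0.0]])       # start at theta = 1, omega = 0
for t in range(200):
    s = A @ s + B @ (K @ s)        # a_t = K s_t
```

The spectral radius of \(A + BK\) is below 1 for this gain, so the state contracts toward the origin.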
Figures from slides by Goulart, Borrelli
The state is position & velocity \(s=[\theta,\omega]\) with \( s_{t+1} = \begin{bmatrix} 1 & 0.1\\ 0 & 1 \end{bmatrix}s_t + \begin{bmatrix} 0\\ 1 \end{bmatrix}a_t\)
Goal: stay near origin and be energy efficient
References: Predictive Control for Linear and Hybrid Systems by Borrelli, Bemporad, Morari