Optimal Control

ML in Feedback Sys #14

Prof Sarah Dean

Reminders

Office hours this week moved to Friday 9-10am
Feedback on final project proposal within a week
Upcoming paper presentations starting 10/24
Project midterm update due 11/11

policy

$\pi_t:\mathcal X\to\mathcal A$

observation

action

$x_t$

accumulate

$\{(x_t, a_t, r_t)\}$

$a_{t}$

Recap: Action in a streaming world

Goal: select actions $a_t$ with high reward

Linear Contextual Bandits

for $t=1,2,...$
- receive context $x_t$
- take action $a_t\in\mathcal A$
- receive reward $\mathbb E[r_t] = \theta_\star^\top \varphi(x_t, a_t)$

ETC

For $t=1,\dots,N$
- play $\varphi_t$ at random
Estimate $\hat\theta$ with least squares
For $t=N+1,\dots,T$
- play $\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi$

With $N=T^{2/3}$, $R(T) \lesssim T^{2/3}$

UCB

Initialize $V_0=\lambda I$, $b_0=0$
For $t=1,\dots,T$
- play $\displaystyle \varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi$
- update $V_t = V_{t-1}+\varphi_t\varphi_t^\top$
  and $b_t = b_{t-1}+r_t\varphi_t$

$\hat\theta_t = V_t^{-1}b_t$
$\mathcal C_t = \{\|\theta-\hat\theta \|_{V_t}\leq \beta_t\}$

$$R(T) \lesssim \sqrt{T}$$

policy

$\pi_t:\mathcal X\to\mathcal A$

observation

action

$x_t$

accumulate

$\{(x_t, a_t, r_t)\}$

$a_{t}$

Action in a streaming world

Goal: select actions $a_t$ with high reward

policy

$\pi_t:\mathcal S\to\mathcal A$

observation

$s_t$

accumulate

$\{(s_t, a_t, c_t)\}$

Action in a dynamic world

Goal: select actions $a_t$ to bring environment to low-cost states

action

$a_{t}$

$F$

$s$

Controlled systems

$$ s_{t+1} = F(s_t, a_t, w_t) $$

$F$

$s$

$a_t$

$s_t$

$w_t$

$$ s_{t} = \Phi_{F}(s_0, w_{0:t-1}, a_{0:t-1}) $$

For a deterministic system, a state $s_\star$ is reachable from initial state $s_0 $ if there exists a sequence of actions $a_{0:t-1} \in\mathcal A^t$ such that $s_{t}=s_\star$ for some $t$.

Reachability & Controllability

$$ s_{t} = \Phi_{F}(s_0, a_{0:t-1}) $$

A deterministic system is controllable if any target state $s_\star\in\mathcal S$ is reachable form any initial state $s_0 \in\mathcal S$.

charging

working

out of
battery

Linear examples

Example

The state $s=[\theta,\omega]$, input $a\in\mathbb R$.

Which states are reachable if:

$\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t + a_t$
$\theta_{t+1} = 0.9\theta_t,\quad \omega_{t+1} = 0.9 \omega_t + a_t$

Linear Controllability

$s_{t+1} = As_t+Ba_t$

A linear system is controllable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}}_{\mathcal C}\Big) = n$$

$=A^{t+1} s_0 + A^t B a_0 + \dots + ABa_{t-1} + Ba_t $

State space $\mathcal S = \mathbb R^n$ and actions $\mathcal A=\mathbb R^m$ and dynamics defined by $A\in\mathbb R^{n\times n}$, $B\in\mathbb R^{n\times m}$

1) $\mathrm{rank}(\mathcal C) = n \implies$ controllable

$$s_n = A^n s_0 + \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix} \begin{bmatrix}a_{n-1} \\\vdots \\ a_0\end{bmatrix} $$

If $\mathcal C$ is rank $n$, we can choose $$ a_{n-1:0} = \mathcal C^\top ( \mathcal C \mathcal C^\top )^{-1}(s_\star-A^ns_0)$$
The range of $\mathcal C$ is $\mathbb R^n$

Proof:

2) $\mathrm{rank}(\mathcal C) < n \implies$ not controllable

for $t\geq n$, $\mathcal C_t$ has more columns than $\mathcal C$
Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree $n$).
Therefore, $A^k$ for $k\geq n$ is a linear combo of $I, A,\dots ,A^{n-1}$
Thus, $\mathrm{rank}(\mathcal C_t) \leq \mathrm{rank}(\mathcal C)<n$ so the range of $\mathcal C_t$ is not all of $\mathbb R^n$

$$s_t = A^t s_0 + \mathcal C_t \begin{bmatrix}a_{t-1} \\\vdots \\ a_0\end{bmatrix} $$

Optimal Control

Optimal Control Problem

$$ \min_{a_{0:T}} \sum_{k=0}^{T} c(s_k, a_k) \quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$

General perspective: goal encoded by a cost $c:\mathcal S\times \mathcal A\to\mathcal R$

Optimal Control

Optimal Control Problem

$$ \min_{a_{0:T}} \sum_{k=0}^{T} c(s_k, a_k) \quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$

If $w_{0:T-1}$ are known, solve optimization problem
Executing $a^\star_{0:T-1}$ directly is called open loop control

Optimal Control

Stochastic Optimal Control Problem

$$ \min_{\pi_{0:T}}~~ \mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big ]\quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$

If $w_{0:T-1}$ are unknown and stochastic, need to adapt
Closed loop control searches over state-feedback policies $a_t = \pi_t(s_t)$

$$a_k=\pi_k(s_k) $$

Optimal Control

Stochastic Optimal Control Problem

$$ \min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, \pi_k(s_k)) \Big]\quad \text{s.t}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$

If $w_{0:T-1}$ are unknown and stochastic, need to adapt
Closed loop control searches over state-feedback policies $a_t = \pi_t(s_t)$

$\underbrace{\qquad}_{J^\pi(s_0)}$

Principle of Optimality

Suppose $\pi_\star = (\pi^\star_0,\dots \pi^\star_{T})$ minimizes the optimal control problem

Then the cost-to-go $$ J^\pi_t(s) = \mathbb E_w\Big[\sum_{k=t}^{T} c(s_k, \pi_k(s_k)) \Big]\quad \text{s.t}\quad s_t=s,~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$

is minimized for all $s$ by the truncated policy $(\pi_t^\star,\dots\pi_T^\star)$

(i.e. $J^\pi(s)\geq J^{\pi^\star}(s)$ for all $\pi, s$)

Dynamic Programming

Algorithm

Initialize $J_{T+1}^\star (s) = 0$
For $k=T,T-1,\dots,0$:
- Compute $J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]$
- Record minimizing argument as $\pi_k^\star(s)$

Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas

Exercise: Prove that the resulting policy is optimal.

Linear Quadratic Regulator

Linear dynamics: $F(s, a, w) = A s+Ba+w$
Quadratic costs: $ c(s, a) = s^\top Qs + a^\top Ra $ where $Q,R\succ 0$
Stochastic and independent noise $\mathbb E[w_k] = 0$ and $\mathbb E[w_kw_k^\top] = \sigma^2 I$

LQR Problem

$$ \min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t}\quad s_{k+1} = A s_k+ Ba_k+w_k $$

$$a_k=\pi_k(s_k) $$

LQR Example

$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$

The state is position & velocity $s=[\theta,\omega]$, input is a force $a\in\mathbb R$.

Goal: stay near origin and be energy efficient

$c(s,a) = 10\theta^2+0.1\omega^2+5a^2$
$Q =\begin{bmatrix} 10 & \\ & 0.1 \end{bmatrix},\quad R = 5 $

LQR via DP

$k=T$: $\qquad\min_{a} s^\top Q s+a^\top Ra+0$
- $J_T^\star(s) = s^\top Q s$ and $\pi_T^\star(s) =0$
$k=T-1$: $\quad \min_{a} s^\top Q s+a^\top Ra+\mathbb E_w[(As+Ba+w)^\top Q (As+Ba+w)]$

DP: $J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]$

$\mathbb E[(As+Ba+w)^\top Q (As+Ba+w)]$
- $=(As+Ba)^\top Q (As+Ba)+\mathbb E[ 2w^\top Q(As+Ba) + w^\top Q w]$
- $=(As+Ba)^\top Q (As+Ba)+\mathrm{tr}( Q )$

LQR via DP

$k=T$: $\qquad\min_{a} s^\top Q s+a^\top Ra+0$
- $J_T^\star(s) = s^\top Q s$ and $\pi_T^\star(s) =0$
$k=T-1$: $\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba+\mathrm{tr}( Q )$

DP: $J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]$

$\min_a a^\top M a + m^\top a + c$
- $2Ma_\star + m = 0 \implies a_\star = -\frac{1}{2}M^{-1} m$
$\pi_{T-1}^\star(s)=-\frac{1}{2}(R+B^\top QB)^{-1}(2B^\top QAs)$

LQR via DP

$k=T$: $\qquad\min_{a} s^\top Q s+a^\top Ra+0$
- $J_T^\star(s) = s^\top Q s$ and $\pi_T^\star(s) =0$
$k=T-1$: $\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba+\mathrm{tr}( Q )$
- $\pi_{T-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs$
- $J_T^\star(s) = s^\top (Q+A^\top QA + A^\top QB(R+B^\top QB)^{-1}B^\top QA) s +\mathrm{tr}( Q )$

DP: $J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]$

Linear Quadratic Regulator

Claim: For $t=0,\dots T$, the optimal cost-to-go function is quadratic and the optimal policy is linear

$J^\star_t (s) = s^\top P_t s + p_t$ and $\pi_t^\star(s) = K_t s$

Exercise: Using DP and induction, prove the claim for:
- $P_t = Q+A^\top P_{t+1}A + A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A$
- $p_t = p_{t+1} + \sigma^2\mathrm{tr}(P_{t+1})$
- $K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top QP_{t+1}A$
Exercise: Derive expressions for optimal controllers when
1. Time varying cost: $c_t(s,a) = s^\top Q_t s+a^\top R_t a$
2. General noise covariance: $\mathbb E[w_tw_t^\top] = \Sigma_t$
3. Trajectory tracking: $c_t(s,a) = \|s-\bar s_t\|_2^2 + \|a\|_2^2$ for given $\bar s_t$

Recap

Controllability
- $\mathcal C = \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}$
Optimal control
Dynamic programming
Linear quadratic regulator
- $\pi_t^\star(s) = K_t s$

Reference: Ch 1&4 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas

14 - Optimal Control - ML in Feedback Sys

By Sarah Dean

14 - Optimal Control - ML in Feedback Sys

3 years ago

Sarah Dean PRO

asst prof in CS at Cornell

sdean.website

Optimal Control

ML in Feedback Sys #14

Reminders

Recap: Action in a streaming world

Action in a streaming world

Action in a dynamic world

\(F\)

Controlled systems

\(F\)

Reachability & Controllability

Linear examples

Linear Controllability

Optimal Control

Optimal Control

Optimal Control

Optimal Control

Principle of Optimality

Dynamic Programming

Linear Quadratic Regulator

LQR Example

LQR via DP

LQR via DP

LQR via DP

Linear Quadratic Regulator

Recap

14 - Optimal Control - ML in Feedback Sys

14 - Optimal Control - ML in Feedback Sys

Sarah Dean PRO

Optimal Control

ML in Feedback Sys #14

Reminders

Recap: Action in a streaming world

Action in a streaming world

Action in a dynamic world

\(F\)

Controlled systems

\(F\)

Reachability & Controllability

Linear examples

Linear Controllability

Optimal Control

Optimal Control

Optimal Control

Optimal Control

Principle of Optimality

Dynamic Programming

Linear Quadratic Regulator

LQR Example

LQR via DP

LQR via DP

LQR via DP

Linear Quadratic Regulator

Recap

14 - Optimal Control - ML in Feedback Sys

More from Sarah Dean