Optimal Control

ML in Feedback Sys #14

Prof Sarah Dean

Reminders

  • Office hours this week moved to Friday 9-10am
  • Feedback on final project proposal within a week
  • Upcoming paper presentations starting 10/24
  • Project midterm update due 11/11

Recap: Action in a streaming world

[Feedback-loop diagram: the policy \(\pi_t:\mathcal X\to\mathcal A\) maps observation \(x_t\) to action \(a_{t}\), and the data \(\{(x_t, a_t, r_t)\}\) accumulate.]

Goal: select actions \(a_t\) with high reward

Linear Contextual Bandits

  • for \(t=1,2,...\)
    • receive context \(x_t\)
    • take action \(a_t\in\mathcal A\)
    • receive reward \(\mathbb E[r_t] = \theta_\star^\top \varphi(x_t, a_t)\)

ETC (Explore-Then-Commit)

  • For \(t=1,\dots,N\)
    • play \(\varphi_t\) at random
  • Estimate \(\hat\theta\) with least squares
  • For \(t=N+1,\dots,T\)
    • play \(\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi\)

With \(N=T^{2/3}\), \(R(T) \lesssim T^{2/3}\)
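As a concrete picture of explore-then-commit, here is a minimal numpy sketch for a linear bandit with a fixed, finite set of feature vectors (no changing contexts); the environment `theta_star`, `features`, the noise level, and the horizon are made-up placeholders, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K, T = 5, 10, 10_000
N = int(T ** (2 / 3))                       # exploration length N = T^(2/3)
theta_star = rng.normal(size=d) / np.sqrt(d)
features = rng.normal(size=(K, d))          # one feature vector phi per arm (illustrative)

def reward(arm):
    return features[arm] @ theta_star + 0.1 * rng.normal()

# Explore: play arms uniformly at random and record (phi_t, r_t)
Phi, r = np.zeros((N, d)), np.zeros(N)
for t in range(N):
    arm = rng.integers(K)
    Phi[t], r[t] = features[arm], reward(arm)

# Estimate theta_hat by least squares
theta_hat, *_ = np.linalg.lstsq(Phi, r, rcond=None)

# Commit: play the greedy arm under theta_hat for the remaining rounds
best_arm = int(np.argmax(features @ theta_hat))
avg_reward = sum(reward(best_arm) for _ in range(N, T)) / (T - N)
print(best_arm, avg_reward)
```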

UCB (Upper Confidence Bound)

  • Initialize \(V_0=\lambda I\), \(b_0=0\)
  • For \(t=1,\dots,T\)
    • play \(\displaystyle \varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi\)
    • update \(V_t = V_{t-1}+\varphi_t\varphi_t^\top\)
      and \(b_t = b_{t-1}+r_t\varphi_t\)
  • where \(\hat\theta_t = V_t^{-1}b_t\) and \(\mathcal C_t = \{\theta : \|\theta-\hat\theta_t \|_{V_t}\leq \beta_t\}\)

$$R(T) \lesssim  \sqrt{T}$$
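A corresponding sketch of the UCB loop, using the closed form \(\max_{\theta\in\mathcal C_{t-1}}\theta^\top\varphi = \hat\theta_{t-1}^\top\varphi + \beta\,\|\varphi\|_{V_{t-1}^{-1}}\) for the optimistic value; the fixed action set, the constant \(\beta\), and \(\lambda\) are illustrative choices rather than the tuned \(\beta_t\) from the regret analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K, T, lam, beta = 5, 10, 5_000, 1.0, 2.0
theta_star = rng.normal(size=d) / np.sqrt(d)
features = rng.normal(size=(K, d))            # one feature vector phi per arm (illustrative)

V = lam * np.eye(d)                           # V_0 = lambda * I
b = np.zeros(d)                               # b_0 = 0
for t in range(T):
    theta_hat = np.linalg.solve(V, b)         # theta_hat = V^{-1} b
    V_inv = np.linalg.inv(V)
    width = np.sqrt(np.sum((features @ V_inv) * features, axis=1))  # ||phi||_{V^{-1}} per arm
    arm = int(np.argmax(features @ theta_hat + beta * width))       # optimistic arm
    phi = features[arm]
    r = phi @ theta_star + 0.1 * rng.normal()
    V += np.outer(phi, phi)                   # V_t = V_{t-1} + phi phi^T
    b += r * phi                              # b_t = b_{t-1} + r_t phi
```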

Action in a streaming world

[Feedback-loop diagram: the policy \(\pi_t:\mathcal X\to\mathcal A\) maps observation \(x_t\) to action \(a_{t}\), and the data \(\{(x_t, a_t, r_t)\}\) accumulate.]

Goal: select actions \(a_t\) with high reward

Action in a dynamic world

[Feedback-loop diagram: the policy \(\pi_t:\mathcal S\to\mathcal A\) maps observed state \(s_t\) to action \(a_{t}\), which drives the system \(F\); the data \(\{(s_t, a_t, c_t)\}\) accumulate.]

Goal: select actions \(a_t\) to bring the environment to low-cost states

Controlled systems

$$ s_{t+1} = F(s_t, a_t, w_t) $$


$$ s_{t} = \Phi_{F}(s_0, w_{0:t-1}, a_{0:t-1}) $$

Reachability & Controllability

$$ s_{t} = \Phi_{F}(s_0, a_{0:t-1}) $$

For a deterministic system, a state \(s_\star\) is reachable from initial state \(s_0\) if there exists a sequence of actions \(a_{0:t-1} \in\mathcal A^t\) such that \(s_{t}=s_\star\) for some \(t\).

A deterministic system is controllable if any target state \(s_\star\in\mathcal S\) is reachable from any initial state \(s_0 \in\mathcal S\).

[Example diagram: a discrete system with states "charging", "working", and "out of battery".]
Linear examples

Example

The state \(s=[\theta,\omega]\), input \(a\in\mathbb R\).

Which states are reachable if:

  1. \(\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t + a_t\)
  2. \(\theta_{t+1} = 0.9\theta_t,\quad \omega_{t+1} = 0.9 \omega_t + a_t\)

Linear Controllability

\(s_{t+1} = As_t+Ba_t\)

A linear system is controllable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}}_{\mathcal C}\Big) = n$$

State space \(\mathcal S = \mathbb R^n\), actions \(\mathcal A=\mathbb R^m\), and dynamics defined by \(A\in\mathbb R^{n\times n}\), \(B\in\mathbb R^{n\times m}\)

Unrolling the dynamics, \(s_{t+1} =A^{t+1} s_0 + A^t B a_0 + \dots + ABa_{t-1} + Ba_t \)

Proof:

1) \(\mathrm{rank}(\mathcal C) = n \implies\) controllable

$$s_n = A^n s_0 + \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix} \begin{bmatrix}a_{n-1} \\\vdots \\ a_0\end{bmatrix}  $$

  • If \(\mathcal C\) has rank \(n\), its range is all of \(\mathbb R^n\)
  • Then we can choose $$ a_{n-1:0} = \mathcal C^\top ( \mathcal C \mathcal C^\top )^{-1}(s_\star-A^ns_0)$$ which gives \(s_n = s_\star\)

2) \(\mathrm{rank}(\mathcal C) < n \implies\) not controllable

  • for \(t\geq n\), \(\mathcal C_t = \begin{bmatrix}B&AB &\dots & A^{t-1}B\end{bmatrix}\) has more columns than \(\mathcal C\)
  • Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree \(n\)).
  • Therefore, \(A^k\) for \(k\geq n\) is a linear combo of \(I, A,\dots ,A^{n-1}\)
  • Thus, \(\mathrm{rank}(\mathcal C_t) \leq \mathrm{rank}(\mathcal C)<n\) so the range of \(\mathcal C_t\) is not all of \(\mathbb R^n\)

$$s_t = A^t s_0 + \mathcal C_t \begin{bmatrix}a_{t-1} \\\vdots \\ a_0\end{bmatrix}  $$
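To make the rank test concrete, here is a short numpy check for the two systems from the "Linear examples" slide (the helper name `ctrb` is my own, not a standard import):

```python
import numpy as np

def ctrb(A, B):
    """Controllability matrix C = [B, AB, ..., A^(n-1) B]."""
    n = A.shape[0]
    blocks = [B.reshape(n, -1)]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

B = np.array([[0.0], [1.0]])
A1 = np.array([[0.9, 0.1], [0.0, 0.9]])   # theta is coupled to omega
A2 = np.array([[0.9, 0.0], [0.0, 0.9]])   # theta evolves on its own

for A in (A1, A2):
    C = ctrb(A, B)
    print(np.linalg.matrix_rank(C))        # prints 2 (controllable), then 1 (not controllable)
```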

Optimal Control

Optimal Control Problem

$$ \min_{a_{0:T}} \sum_{k=0}^{T} c(s_k, a_k) \quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$

General perspective: goal encoded by a cost \(c:\mathcal S\times \mathcal A\to\mathbb R\)


  • If \(w_{0:T-1}\) are known, solve the resulting optimization problem
  • Executing \(a^\star_{0:T-1}\) directly is called open loop control

Optimal Control

Stochastic Optimal Control Problem

$$ \min_{\pi_{0:T}}~~ \mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big ]\quad \text{s.t}\quad s_0~~\text{given},~~ a_k=\pi_k(s_k),~~ s_{k+1} = F(s_k, a_k,w_k) $$

  • If \(w_{0:T-1}\) are unknown and stochastic, need to adapt
  • Closed loop control searches over state-feedback policies \(a_t = \pi_t(s_t)\)

Substituting the policy, the objective defines the cost \(J^\pi(s_0)\) of policy \(\pi\) from initial state \(s_0\):

$$ \min_{\pi_{0:T}} ~~\underbrace{\mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, \pi_k(s_k)) \Big]}_{J^\pi(s_0)}\quad \text{s.t}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$

Principle of Optimality

Suppose \(\pi^\star = (\pi^\star_0,\dots, \pi^\star_{T})\) minimizes the stochastic optimal control problem

Then the cost-to-go $$ J^\pi_t(s) = \mathbb E_w\Big[\sum_{k=t}^{T} c(s_k, \pi_k(s_k)) \Big]\quad \text{s.t}\quad s_t=s,~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$

is minimized for all \(s\) by the truncated policy \((\pi_t^\star,\dots\pi_T^\star)\)

(i.e. \(J^\pi_t(s)\geq J^{\pi^\star}_t(s)\) for all \(\pi\), \(s\), and \(t\))

Dynamic Programming

Algorithm

  • Initialize \(J_{T+1}^\star (s) = 0\)
  • For \(k=T,T-1,\dots,0\):
    • Compute \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)
    • Record minimizing argument as \(\pi_k^\star(s)\)
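For intuition, here is a minimal sketch of the backward recursion for a finite state and action space, where the expectation over \(w\) becomes a sum over next states; the transition probabilities `P` and costs `C` are random placeholders, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 4, 3, 10                          # number of states, actions, and the horizon

C = rng.uniform(size=(S, A))                # cost c(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, s'] = Prob(next state s' | s, a)

J = np.zeros((T + 2, S))                    # J[T+1] = 0 is the terminal value
pi = np.zeros((T + 1, S), dtype=int)
for k in range(T, -1, -1):
    Qk = C + P @ J[k + 1]                   # Qk[s, a] = c(s, a) + E[ J_{k+1}(F(s, a, w)) ]
    J[k] = Qk.min(axis=1)                   # J_k^*(s)
    pi[k] = Qk.argmin(axis=1)               # pi_k^*(s), the minimizing action

print(J[0], pi[0])                          # optimal cost-to-go and policy at time 0
```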

Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas

Exercise: Prove that the resulting policy is optimal.

Linear Quadratic Regulator

  • Linear dynamics: \(F(s, a, w) = A s+Ba+w\)
  • Quadratic costs: \( c(s, a) = s^\top Qs + a^\top Ra \) where \(Q,R\succ 0\)
  • Stochastic and independent noise \(\mathbb E[w_k] = 0\) and \(\mathbb E[w_kw_k^\top] = \sigma^2 I\)

LQR Problem

$$ \min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t}\quad a_k=\pi_k(s_k),~~ s_{k+1} = A s_k+ Ba_k+w_k $$

LQR Example

$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$

The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).

Goal: stay near origin and be energy efficient

  • \(c(s,a) = 10\theta^2+0.1\omega^2+5a^2\)
  • \(Q =\begin{bmatrix} 10 & \\ & 0.1 \end{bmatrix},\quad R = 5 \)

LQR via DP

  • \(k=T\): \(\qquad\min_{a} s^\top Q s+a^\top Ra+0\)
    • \(J_T^\star(s) = s^\top Q s\) and \(\pi_T^\star(s) =0\)
  • \(k=T-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+\mathbb E_w[(As+Ba+w)^\top Q (As+Ba+w)]\)

DP: \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)

  • \(\mathbb E[(As+Ba+w)^\top Q (As+Ba+w)]\)
    • \(=(As+Ba)^\top Q (As+Ba)+\mathbb E[ 2w^\top Q(As+Ba) + w^\top Q w]\)
    • \(=(As+Ba)^\top Q (As+Ba)+\sigma^2\mathrm{tr}( Q )\), since \(\mathbb E[w]=0\) and \(\mathbb E[w^\top Q w] = \mathrm{tr}(Q\,\mathbb E[ww^\top]) = \sigma^2\mathrm{tr}(Q)\)

LQR via DP

  • \(k=T\): \(\qquad\min_{a} s^\top Q s+a^\top Ra+0\)
    • \(J_T^\star(s) = s^\top Q s\) and \(\pi_T^\star(s) =0\)
  • \(k=T-1\): \(\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba+\sigma^2\mathrm{tr}( Q )\)

DP: \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)

  • \(\min_a a^\top M a + m^\top a + c\)
    • \(2Ma_\star + m = 0 \implies a_\star = -\frac{1}{2}M^{-1} m\)
  • \(\pi_{T-1}^\star(s)=-\frac{1}{2}(R+B^\top QB)^{-1}(2B^\top QAs)\)

LQR via DP

  • \(k=T\): \(\qquad\min_{a} s^\top Q s+a^\top Ra+0\)
    • \(J_T^\star(s) = s^\top Q s\) and \(\pi_T^\star(s) =0\)
  • \(k=T-1\): \(\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba+\sigma^2\mathrm{tr}( Q )\)
    • \(\pi_{T-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
    • \(J_{T-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s +\sigma^2\mathrm{tr}( Q )\)

DP: \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)

Linear Quadratic Regulator

Claim:  For \(t=0,\dots T\), the optimal cost-to-go function is quadratic and the optimal policy is linear

  • \(J^\star_t (s) = s^\top P_t s + p_t\) and \(\pi_t^\star(s) = K_t s\)
  • Exercise: Using DP and induction, prove the claim for:
    • \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
    • \(p_t = p_{t+1} + \sigma^2\mathrm{tr}(P_{t+1})\)
    • \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
  • Exercise: Derive expressions for optimal controllers when
    1. Time varying cost: \(c_t(s,a) = s^\top Q_t s+a^\top R_t a\)
    2. General noise covariance: \(\mathbb E[w_tw_t^\top] = \Sigma_t\)
    3. Trajectory tracking: \(c_t(s,a) = \|s-\bar s_t\|_2^2 + \|a\|_2^2\) for given \(\bar s_t\)
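As a sanity check on the claim, here is a hedged numpy sketch of the backward recursion for the example system from the "LQR Example" slide, starting from the terminal values \(P_T = Q\), \(p_T = 0\); the horizon and noise variance are arbitrary choices.

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.diag([10.0, 0.1])
R = np.array([[5.0]])
T_hor, sigma2 = 50, 0.01                     # horizon and noise variance (arbitrary choices)

P, p = Q.copy(), 0.0                         # terminal cost-to-go: J_T^*(s) = s' Q s
gains = []
for _ in range(T_hor):                       # compute K_{T-1}, ..., K_0
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K_t = -(R + B'P_{t+1}B)^{-1} B'P_{t+1}A
    p = p + sigma2 * np.trace(P)                          # p_t = p_{t+1} + sigma^2 tr(P_{t+1})
    P = Q + A.T @ P @ (A + B @ K)                         # Riccati recursion for P_t
    gains.append(K)

print(gains[-1])                             # K_0; for long horizons it approaches a stationary gain
```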

Recap

  • Controllability
    • \(\mathcal C = \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}\)
  • Optimal control
  • Dynamic programming
  • Linear quadratic regulator
    • \(\pi_t^\star(s) = K_t s\)

Reference: Ch 1&4 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas