Optimal Control
ML in Feedback Sys #14
Prof Sarah Dean
Reminders
- Office hours this week moved to Friday 9-10am
- Feedback on final project proposal within a week
- Upcoming paper presentations starting 10/24
- Project midterm update due 11/11

Recap: Action in a streaming world
Goal: select actions \(a_t\) with high reward
[Diagram: a policy \(\pi_t:\mathcal X\to\mathcal A\) maps each observation \(x_t\) to an action \(a_t\), and the data \(\{(x_t, a_t, r_t)\}\) accumulates over time]
Linear Contextual Bandits
- for \(t=1,2,...\)
- receive context \(x_t\)
- take action \(a_t\in\mathcal A\)
- receive reward \(\mathbb E[r_t] = \theta_\star^\top \varphi(x_t, a_t)\)
ETC
- For \(t=1,\dots,N\)
- play \(\varphi_t\) at random
- Estimate \(\hat\theta\) with least squares
- For \(t=N+1,\dots,T\)
- play \(\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi\)
With \(N=T^{2/3}\), \(R(T) \lesssim T^{2/3}\)
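As a concrete illustration of ETC (not from the lecture), here is a minimal NumPy sketch; the problem instance below (the fixed feature vectors, the hidden \(\theta_\star\), and the Gaussian reward noise) is an assumption made up for the example, with the action set held fixed across rounds for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: K fixed feature vectors, linear mean rewards (assumed for illustration)
d, K, T = 5, 10, 5000
theta_star = rng.normal(size=d)                 # unknown reward parameter
features = rng.normal(size=(K, d))              # feature vector phi for each action
reward = lambda phi: theta_star @ phi + 0.1 * rng.normal()   # noisy reward with mean theta_star^T phi

# Explore: play actions uniformly at random for N = T^{2/3} rounds
N = int(T ** (2 / 3))
Phi = features[rng.integers(K, size=N)]
r = np.array([reward(phi) for phi in Phi])

# Estimate theta with least squares on the exploration data
theta_hat, *_ = np.linalg.lstsq(Phi, r, rcond=None)

# Commit: play greedily under theta_hat for the remaining rounds
for t in range(N, T):
    phi = features[np.argmax(features @ theta_hat)]
    _ = reward(phi)
```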
UCB
- Initialize \(V_0=\lambda I\), \(b_0=0\)
- For \(t=1,\dots,T\)
- play \(\displaystyle \varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi\)
- update \(V_t = V_{t-1}+\varphi_t\varphi_t^\top\)
and \(b_t = b_{t-1}+r_t\varphi_t\)
- \(\hat\theta_t = V_t^{-1}b_t\)
- \(\mathcal C_t = \{\theta : \|\theta-\hat\theta_t \|_{V_t}\leq \beta_t\}\)
$$R(T) \lesssim \sqrt{T}$$
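For UCB, the inner maximization over the ellipsoid \(\mathcal C_{t-1}\) has the closed form \(\hat\theta_{t-1}^\top\varphi + \beta_{t-1}\|\varphi\|_{V_{t-1}^{-1}}\). Below is a minimal sketch of the resulting update on the same kind of toy instance as above; the fixed action set, the noise model, and the constant value of \(\beta\) (which should really come from a concentration bound) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: K fixed feature vectors, linear mean rewards (assumed for illustration)
d, K, T = 5, 10, 5000
theta_star = rng.normal(size=d)
features = rng.normal(size=(K, d))

lam, beta = 1.0, 2.0              # regularization and (fixed) confidence radius
V = lam * np.eye(d)               # V_0 = lambda * I
b = np.zeros(d)                   # b_0 = 0
for t in range(T):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                                                    # least-squares estimate
    widths = np.sqrt(np.einsum('kd,de,ke->k', features, V_inv, features))   # ||phi||_{V^{-1}}
    phi = features[np.argmax(features @ theta_hat + beta * widths)]         # optimistic action
    r_t = theta_star @ phi + 0.1 * rng.normal()                             # observe noisy reward
    V += np.outer(phi, phi)                                                 # V_t = V_{t-1} + phi phi^T
    b += r_t * phi                                                          # b_t = b_{t-1} + r_t phi
```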

Action in a streaming world
Goal: select actions \(a_t\) with high reward

Action in a dynamic world
Goal: select actions \(a_t\) to bring the environment to low-cost states
[Diagram: a policy \(\pi_t:\mathcal S\to\mathcal A\) maps each observed state \(s_t\) to an action \(a_t\), and the data \(\{(s_t, a_t, c_t)\}\) accumulates over time]
Controlled systems
$$ s_{t+1} = F(s_t, a_t, w_t) $$

$$ s_{t} = \Phi_{F}(s_0, w_{0:t-1}, a_{0:t-1}) $$
Reachability & Controllability
For a deterministic system, $$ s_{t} = \Phi_{F}(s_0, a_{0:t-1}) $$
A state \(s_\star\) is reachable from an initial state \(s_0\) if there exists a sequence of actions \(a_{0:t-1} \in\mathcal A^t\) such that \(s_{t}=s_\star\) for some \(t\).
A deterministic system is controllable if any target state \(s_\star\in\mathcal S\) is reachable from any initial state \(s_0 \in\mathcal S\).
[Example diagram: a system with three states: charging, working, and out of battery]


Linear examples
Example
The state \(s=[\theta,\omega]\), input \(a\in\mathbb R\).
Which states are reachable if:
- \(\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t + a_t\)
- \(\theta_{t+1} = 0.9\theta_t,\quad \omega_{t+1} = 0.9 \omega_t + a_t\)
Linear Controllability
State space \(\mathcal S = \mathbb R^n\), actions \(\mathcal A=\mathbb R^m\), and dynamics defined by \(A\in\mathbb R^{n\times n}\), \(B\in\mathbb R^{n\times m}\):
$$s_{t+1} = As_t+Ba_t = A^{t+1} s_0 + A^t B a_0 + \dots + ABa_{t-1} + Ba_t $$
A linear system is controllable if and only if
$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}}_{\mathcal C}\Big) = n$$
Proof:
1) \(\mathrm{rank}(\mathcal C) = n \implies\) controllable
$$s_n = A^n s_0 + \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix} \begin{bmatrix}a_{n-1} \\\vdots \\ a_0\end{bmatrix} $$
- If \(\mathcal C\) has rank \(n\), its range is all of \(\mathbb R^n\), and we can choose $$ a_{n-1:0} = \mathcal C^\top ( \mathcal C \mathcal C^\top )^{-1}(s_\star-A^ns_0)$$
2) \(\mathrm{rank}(\mathcal C) < n \implies\) not controllable
$$s_t = A^t s_0 + \mathcal C_t \begin{bmatrix}a_{t-1} \\\vdots \\ a_0\end{bmatrix},\qquad \mathcal C_t = \begin{bmatrix}B&AB &\dots & A^{t-1}B\end{bmatrix} $$
- For \(t\geq n\), \(\mathcal C_t\) has more columns than \(\mathcal C\)
- Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree \(n\))
- Therefore, \(A^k\) for \(k\geq n\) is a linear combination of \(I, A,\dots ,A^{n-1}\)
- Thus, \(\mathrm{rank}(\mathcal C_t) \leq \mathrm{rank}(\mathcal C)<n\), so the range of \(\mathcal C_t\) is not all of \(\mathbb R^n\)
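As a quick numerical check (a sketch I'm adding, not part of the slides), the rank condition answers the "Linear examples" question above: with the coupled dynamics the controllability matrix has full rank, while without the \(0.1\,\omega_t\) coupling the position \(\theta\) cannot be steered by the input.

```python
import numpy as np

def controllability_matrix(A, B):
    """Stack [B, AB, ..., A^{n-1} B] column-wise."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

B = np.array([[0.0], [1.0]])

# Example 1: theta_{t+1} = 0.9 theta_t + 0.1 omega_t (coupled)
A1 = np.array([[0.9, 0.1], [0.0, 0.9]])
print(np.linalg.matrix_rank(controllability_matrix(A1, B)))   # 2: full rank, controllable

# Example 2: theta_{t+1} = 0.9 theta_t (decoupled from omega and a)
A2 = np.array([[0.9, 0.0], [0.0, 0.9]])
print(np.linalg.matrix_rank(controllability_matrix(A2, B)))   # 1: rank deficient, not controllable
```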
Optimal Control
Optimal Control Problem
$$ \min_{a_{0:T}} \sum_{k=0}^{T} c(s_k, a_k) \quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$
General perspective: goal encoded by a cost \(c:\mathcal S\times \mathcal A\to\mathbb R\)
Optimal Control
Optimal Control Problem
$$ \min_{a_{0:T}} \sum_{k=0}^{T} c(s_k, a_k) \quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$
- If \(w_{0:T-1}\) are known, solve the optimization problem directly
- Executing \(a^\star_{0:T-1}\) directly is called open loop control


Optimal Control
Stochastic Optimal Control Problem
$$ \min_{\pi_{0:T}}~~ \mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big ]\quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$
- If \(w_{0:T-1}\) are unknown and stochastic, need to adapt
- Closed loop control searches over state-feedback policies \(a_t = \pi_t(s_t)\)
$$a_k=\pi_k(s_k) $$


Optimal Control
Stochastic Optimal Control Problem
$$ \min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, \pi_k(s_k)) \Big]\quad \text{s.t}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$
- If \(w_{0:T-1}\) are unknown and stochastic, need to adapt
- Closed loop control searches over state-feedback policies \(a_t = \pi_t(s_t)\)


The objective above, as a function of the policy and the initial state, defines the cost \(J^\pi(s_0)\).
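To see why adapting matters, here is a toy comparison I'm adding (not from the slides): on the scalar system \(s_{t+1}=s_t+a_t+w_t\), an open-loop plan that drives the nominal noise-free state to the origin and then applies no input lets the state drift as a random walk, while the simple feedback policy \(a_t=-s_t\) corrects at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma, s0, n_trials = 50, 0.1, 1.0, 1000

cost_open, cost_closed = 0.0, 0.0
for _ in range(n_trials):
    w = sigma * rng.normal(size=T)

    # Open loop: drive the nominal (w = 0) state to the origin in one step, then apply zero input
    s = s0
    for t in range(T):
        a = -s0 if t == 0 else 0.0
        cost_open += s**2 + a**2
        s = s + a + w[t]

    # Closed loop: state feedback a_t = -s_t, recomputed from the observed state
    s = s0
    for t in range(T):
        a = -s
        cost_closed += s**2 + a**2
        s = s + a + w[t]

# Open-loop cost grows with the horizon (the state is a random walk after step 1);
# closed-loop cost stays on the order of the noise variance.
print(cost_open / n_trials, cost_closed / n_trials)
```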
Principle of Optimality
Suppose \(\pi^\star = (\pi^\star_0,\dots, \pi^\star_{T})\) solves the stochastic optimal control problem
Then the cost-to-go $$ J^\pi_t(s) = \mathbb E_w\Big[\sum_{k=t}^{T} c(s_k, \pi_k(s_k)) \Big]\quad \text{s.t}\quad s_t=s,~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$
is minimized for all \(s\) by the truncated policy \((\pi_t^\star,\dots\pi_T^\star)\)

(i.e. \(J^\pi_t(s)\geq J^{\pi^\star}_t(s)\) for all \(\pi\), \(s\), and \(t\))
Dynamic Programming
Algorithm
- Initialize \(J_{T+1}^\star (s) = 0\)
- For \(k=T,T-1,\dots,0\):
- Compute \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)
- Record minimizing argument as \(\pi_k^\star(s)\)
Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas
Exercise: Prove that the resulting policy is optimal.
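Here is a minimal sketch of the backward recursion for a small tabular problem; the transition probabilities and stage costs below are random placeholders (assumptions for illustration), and the expectation over \(w\) becomes an expectation over the next state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite problem: n_s states, n_a actions, horizon T
n_s, n_a, T = 6, 3, 20
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] = Prob(next state s' | s, a)
c = rng.uniform(size=(n_s, n_a))                   # stage cost c(s, a)

J = np.zeros((T + 2, n_s))                         # J[T+1] = 0
pi = np.zeros((T + 1, n_s), dtype=int)             # pi[k, s] = optimal action index
for k in range(T, -1, -1):
    Q = c + P @ J[k + 1]                           # Q[s, a] = c(s, a) + E[J_{k+1}(s') | s, a]
    J[k] = Q.min(axis=1)                           # J_k^*(s)
    pi[k] = Q.argmin(axis=1)                       # record the minimizing action

print(J[0])    # optimal cost-to-go from each initial state
print(pi[0])   # optimal first action in each state
```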
Linear Quadratic Regulator
- Linear dynamics: \(F(s, a, w) = A s+Ba+w\)
- Quadratic costs: \( c(s, a) = s^\top Qs + a^\top Ra \) where \(Q,R\succ 0\)
- Stochastic and independent noise \(\mathbb E[w_k] = 0\) and \(\mathbb E[w_kw_k^\top] = \sigma^2 I\)
LQR Problem
$$ \min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t}\quad s_{k+1} = A s_k+ Ba_k+w_k $$
$$a_k=\pi_k(s_k) $$
LQR Example
$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$
The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).
Goal: stay near origin and be energy efficient
- \(c(s,a) = 10\theta^2+0.1\omega^2+5a^2\)
- \(Q =\begin{bmatrix} 10 & \\ & 0.1 \end{bmatrix},\quad R = 5 \)
LQR via DP
- \(k=T\): \(\qquad\min_{a} s^\top Q s+a^\top Ra+0\)
- \(J_T^\star(s) = s^\top Q s\) and \(\pi_T^\star(s) =0\)
- \(k=T-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+\mathbb E_w[(As+Ba+w)^\top Q (As+Ba+w)]\)
DP: \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)
- \(\mathbb E[(As+Ba+w)^\top Q (As+Ba+w)]\)
- \(=(As+Ba)^\top Q (As+Ba)+\mathbb E[ 2w^\top Q(As+Ba) + w^\top Q w]\)
- \(=(As+Ba)^\top Q (As+Ba)+\sigma^2\mathrm{tr}( Q )\), since \(\mathbb E[w]=0\) and \(\mathbb E[w^\top Q w]=\mathrm{tr}(Q\,\mathbb E[ww^\top])=\sigma^2\mathrm{tr}(Q)\)
LQR via DP
- \(k=T\): \(\qquad\min_{a} s^\top Q s+a^\top Ra+0\)
- \(J_T^\star(s) = s^\top Q s\) and \(\pi_T^\star(s) =0\)
- \(k=T-1\): \(\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba+\sigma^2\mathrm{tr}( Q )\)
DP: \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)
- \(\min_a a^\top M a + m^\top a + c\)
- \(2Ma_\star + m = 0 \implies a_\star = -\frac{1}{2}M^{-1} m\)
- \(\pi_{T-1}^\star(s)=-\frac{1}{2}(R+B^\top QB)^{-1}(2B^\top QAs)\)
LQR via DP
- \(k=T\): \(\qquad\min_{a} s^\top Q s+a^\top Ra+0\)
- \(J_T^\star(s) = s^\top Q s\) and \(\pi_T^\star(s) =0\)
- \(k=T-1\): \(\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba+\sigma^2\mathrm{tr}( Q )\)
- \(\pi_{T-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
- \(J_{T-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s +\sigma^2\mathrm{tr}( Q )\)
DP: \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)
Linear Quadratic Regulator
Claim: For \(t=0,\dots,T\), the optimal cost-to-go function is quadratic and the optimal policy is linear:
- \(J^\star_t (s) = s^\top P_t s + p_t\) and \(\pi_t^\star(s) = K_t s\)
Exercise: Using DP and induction, prove the claim, with \(P_{T+1}=0\), \(p_{T+1}=0\), and:
- \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
- \(p_t = p_{t+1} + \sigma^2\mathrm{tr}(P_{t+1})\)
- \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
Exercise: Derive expressions for optimal controllers when
- Time varying cost: \(c_t(s,a) = s^\top Q_t s+a^\top R_t a\)
- General noise covariance: \(\mathbb E[w_tw_t^\top] = \Sigma_t\)
- Trajectory tracking: \(c_t(s,a) = \|s-\bar s_t\|_2^2 + \|a\|_2^2\) for given \(\bar s_t\)
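As a numerical sanity check on the claim (a sketch I'm adding, using the \(A, B, Q, R\) from the LQR example slide and an arbitrarily chosen \(\sigma=0.1\)), the recursion can be run backward from \(P_{T+1}=0\), \(p_{T+1}=0\) to obtain the gains \(K_t\):

```python
import numpy as np

# System and cost from the LQR example slide
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.diag([10.0, 0.1])
R = np.array([[5.0]])
sigma, T = 0.1, 50                    # noise scale chosen arbitrarily for this sketch

P = np.zeros((2, 2))                  # P_{T+1} = 0
p = 0.0                               # p_{T+1} = 0
gains = []
for t in range(T, -1, -1):
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K_t = -(R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
    p = p + sigma**2 * np.trace(P)                        # p_t = p_{t+1} + sigma^2 tr(P_{t+1})
    P = Q + A.T @ P @ (A + B @ K)                         # P_t (equivalent to the Riccati formula above)
    gains.append(K)

gains.reverse()                       # gains[t] = K_t, so pi_t^*(s) = gains[t] @ s
print(gains[-1])                      # K_T = 0, matching pi_T^*(s) = 0
print(gains[0])                       # K_0, close to the stationary gain for a long horizon
```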
Recap
- Controllability
- \(\mathcal C = \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}\)
- Optimal control
- Dynamic programming
- Linear quadratic regulator
- \(\pi_t^\star(s) = K_t s\)
Reference: Ch 1&4 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas