Prof Sarah Dean

Reminders

• Office hours this week moved to Friday 9-10am
• Feedback on final project proposal within a week
• Upcoming paper presentations starting 10/24
• Project midterm update due 11/11

Recap: Action in a streaming world

(Diagram: a policy $$\pi_t:\mathcal X\to\mathcal A$$ maps each observation $$x_t$$ to an action $$a_t$$; the data $$\{(x_t, a_t, r_t)\}$$ accumulate over time.)

Goal: select actions $$a_t$$ with high reward

Linear Contextual Bandits

• for $$t=1,2,...$$
• receive context $$x_t$$
• take action $$a_t\in\mathcal A$$
• receive reward $$\mathbb E[r_t] = \theta_\star^\top \varphi(x_t, a_t)$$

ETC (Explore then Commit)

• For $$t=1,\dots,N$$
• play $$\varphi_t$$ at random
• Estimate $$\hat\theta$$ with least squares
• For $$t=N+1,\dots,T$$
• play $$\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi$$

With $$N=T^{2/3}$$, $$R(T) \lesssim T^{2/3}$$
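The ETC loop above can be sketched in a few lines of numpy. This is a minimal illustration on a synthetic instance: the true parameter $$\theta_\star$$, the random decision sets, and the noise level are made up, not part of the slides.

```python
import numpy as np

# Hypothetical synthetic instance: theta_star, arm features, noise are made up.
rng = np.random.default_rng(1)
d, T = 2, 1000
N = int(T ** (2 / 3))                    # explore for N = T^{2/3} rounds
theta_star = np.array([0.8, -0.3])

feats, rewards = [], []
for t in range(T):
    arms = rng.normal(size=(5, d))       # decision set A_t: 5 feature vectors
    if t < N:
        phi = arms[rng.integers(5)]      # explore: play a feature at random
    else:
        phi = arms[np.argmax(arms @ theta_hat)]  # commit: argmax of estimated reward
    r = phi @ theta_star + 0.1 * rng.normal()    # E[r] = theta_star^T phi
    feats.append(phi)
    rewards.append(r)
    if t == N - 1:                       # least-squares estimate after exploring
        theta_hat = np.linalg.lstsq(np.array(feats), np.array(rewards), rcond=None)[0]

print(np.round(theta_hat, 2))
```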

UCB

• Initialize $$V_0=\lambda I$$, $$b_0=0$$
• For $$t=1,\dots,T$$
• play $$\displaystyle \varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi$$
• update $$V_t = V_{t-1}+\varphi_t\varphi_t^\top$$
and $$b_t = b_{t-1}+r_t\varphi_t$$
• $$\hat\theta_t = V_t^{-1}b_t$$
• $$\mathcal C_t = \{\theta : \|\theta-\hat\theta_t \|_{V_t}\leq \beta_t\}$$

$$R(T) \lesssim \sqrt{T}$$
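The UCB loop can also be sketched directly: for a linear reward with ellipsoidal confidence set $$\mathcal C_t$$, the inner maximization has the closed form $$\max_{\theta\in\mathcal C_{t-1}}\theta^\top\varphi = \hat\theta^\top\varphi + \beta\|\varphi\|_{V^{-1}}$$. Below is a minimal numpy sketch; the instance ($$\theta_\star$$, arm sets, noise, and a fixed $$\beta$$) is made up for illustration.

```python
import numpy as np

# Hypothetical synthetic instance; beta is held fixed for simplicity.
rng = np.random.default_rng(0)
d, T, lam, beta = 2, 2000, 1.0, 1.0
theta_star = np.array([1.0, 0.5])

V = lam * np.eye(d)   # V_0 = lambda * I
b = np.zeros(d)       # b_0 = 0
for t in range(T):
    arms = rng.normal(size=(5, d))        # decision set A_t
    theta_hat = np.linalg.solve(V, b)     # ridge/least-squares estimate
    Vinv = np.linalg.inv(V)
    # UCB score: theta_hat^T phi + beta * ||phi||_{V^{-1}}
    bonus = np.sqrt(np.einsum('id,de,ie->i', arms, Vinv, arms))
    phi = arms[np.argmax(arms @ theta_hat + beta * bonus)]
    r = phi @ theta_star + 0.1 * rng.normal()
    V += np.outer(phi, phi)               # V_t = V_{t-1} + phi phi^T
    b += r * phi                          # b_t = b_{t-1} + r_t phi

print(np.round(np.linalg.solve(V, b), 2))
```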


Action in a streaming world

Goal: select actions $$a_t$$ with high reward

Action in a dynamic world

(Diagram: a policy $$\pi_t:\mathcal S\to\mathcal A$$ maps each state observation $$s_t$$ to an action $$a_t$$; the data $$\{(s_t, a_t, c_t)\}$$ accumulate over time.)

Goal: select actions $$a_t$$ to bring the environment to low-cost states

Controlled systems

(Diagram: the system $$F$$ maps the current state $$s_t$$, action $$a_t$$, and noise $$w_t$$ to the next state.)

$$s_{t+1} = F(s_t, a_t, w_t)$$

Unrolling the dynamics, the state is a function of the initial state and the action and noise sequences:

$$s_{t} = \Phi_{F}(s_0, w_{0:t-1}, a_{0:t-1})$$

Reachability & Controllability

For a deterministic system, $$s_{t} = \Phi_{F}(s_0, a_{0:t-1})$$.

A state $$s_\star$$ is reachable from initial state $$s_0$$ if there exists a sequence of actions $$a_{0:t-1} \in\mathcal A^t$$ such that $$s_{t}=s_\star$$ for some $$t$$.

A deterministic system is controllable if any target state $$s_\star\in\mathcal S$$ is reachable from any initial state $$s_0 \in\mathcal S$$.

(Diagram: a discrete example with states "charging", "working", and "out of battery".)
Linear examples

Example

The state is $$s=[\theta,\omega]$$ and the input is $$a\in\mathbb R$$.

Which states are reachable if:

1. $$\theta_{t+1} = 0.9\theta_t + 0.1 \omega_t,\quad \omega_{t+1} = 0.9 \omega_t + a_t$$
2. $$\theta_{t+1} = 0.9\theta_t,\quad \omega_{t+1} = 0.9 \omega_t + a_t$$

Linear Controllability

State space $$\mathcal S = \mathbb R^n$$, actions $$\mathcal A=\mathbb R^m$$, and dynamics defined by $$A\in\mathbb R^{n\times n}$$, $$B\in\mathbb R^{n\times m}$$:

$$s_{t+1} = As_t+Ba_t =A^{t+1} s_0 + A^t B a_0 + \dots + ABa_{t-1} + Ba_t$$

A linear system is controllable if and only if

$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}}_{\mathcal C}\Big) = n$$

Proof:

1) $$\mathrm{rank}(\mathcal C) = n \implies$$ controllable

$$s_n = A^n s_0 + \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix} \begin{bmatrix}a_{n-1} \\\vdots \\ a_0\end{bmatrix}$$

• If $$\mathcal C$$ has rank $$n$$, its range is all of $$\mathbb R^n$$
• So we can choose $$a_{n-1:0} = \mathcal C^\top ( \mathcal C \mathcal C^\top )^{-1}(s_\star-A^ns_0)$$, which yields $$s_n = s_\star$$

2) $$\mathrm{rank}(\mathcal C) < n \implies$$ not controllable

Writing $$s_t = A^t s_0 + \mathcal C_t \begin{bmatrix}a_{t-1} \\\vdots \\ a_0\end{bmatrix}$$ with $$\mathcal C_t = \begin{bmatrix}B&AB &\dots & A^{t-1}B\end{bmatrix}$$:

• for $$t\geq n$$, $$\mathcal C_t$$ has more columns than $$\mathcal C$$
• Theorem (Cayley-Hamilton): a matrix satisfies its own characteristic polynomial (of degree $$n$$)
• Therefore, $$A^k$$ for $$k\geq n$$ is a linear combination of $$I, A,\dots ,A^{n-1}$$
• Thus, $$\mathrm{rank}(\mathcal C_t) \leq \mathrm{rank}(\mathcal C)<n$$, so the range of $$\mathcal C_t$$ is not all of $$\mathbb R^n$$ and some states are unreachable
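The rank test can be checked numerically for the two example systems from the "Linear examples" slide. A short numpy sketch:

```python
import numpy as np

def ctrb(A, B):
    """Controllability matrix [B, AB, ..., A^{n-1}B]."""
    n = A.shape[0]
    B = B.reshape(n, -1)
    return np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])

B = np.array([[0.0], [1.0]])
# Example 1: theta_{t+1} = 0.9 theta + 0.1 omega, omega_{t+1} = 0.9 omega + a
A1 = np.array([[0.9, 0.1], [0.0, 0.9]])
# Example 2: theta_{t+1} = 0.9 theta, omega_{t+1} = 0.9 omega + a
A2 = np.array([[0.9, 0.0], [0.0, 0.9]])

print(np.linalg.matrix_rank(ctrb(A1, B)))  # 2: controllable
print(np.linalg.matrix_rank(ctrb(A2, B)))  # 1: not controllable (theta unaffected by input)
```

This matches the intuition: in the second system the input never influences $$\theta$$, so states with $$\theta\neq 0.9^t\theta_0$$ are unreachable.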

Optimal Control

Optimal Control Problem

$$\min_{a_{0:T}} \sum_{k=0}^{T} c(s_k, a_k) \quad \text{s.t.}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k)$$

General perspective: goal encoded by a cost $$c:\mathcal S\times \mathcal A\to\mathbb R$$


• If $$w_{0:T-1}$$ are known, solve the optimization problem directly
• Executing the resulting $$a^\star_{0:T}$$ is called open loop control

Optimal Control

Stochastic Optimal Control Problem

$$\min_{\pi_{0:T}}~~ \mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big ]\quad \text{s.t.}\quad s_0~~\text{given},~~ a_k=\pi_k(s_k),~~ s_{k+1} = F(s_k, a_k,w_k)$$

• If $$w_{0:T-1}$$ are unknown and stochastic, the actions must adapt
• Closed loop control searches over state-feedback policies $$a_t = \pi_t(s_t)$$

Substituting $$a_k = \pi_k(s_k)$$:

$$\min_{\pi_{0:T}} ~~\underbrace{\mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, \pi_k(s_k)) \Big]}_{J^\pi(s_0)}\quad \text{s.t.}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi_k(s_k),w_k)$$

Principle of Optimality

Suppose $$\pi^\star = (\pi^\star_0,\dots,\pi^\star_{T})$$ solves the stochastic optimal control problem.

Then the cost-to-go $$J^\pi_t(s) = \mathbb E_w\Big[\sum_{k=t}^{T} c(s_k, \pi_k(s_k)) \Big]\quad \text{s.t.}\quad s_t=s,~~s_{k+1} = F(s_k, \pi_k(s_k),w_k)$$

is minimized for all $$s$$ by the truncated policy $$(\pi_t^\star,\dots,\pi_T^\star)$$

(i.e. $$J^\pi_t(s)\geq J^{\pi^\star}_t(s)$$ for all $$\pi$$, $$t$$, and $$s$$)

Dynamic Programming

Algorithm

• Initialize $$J_{T+1}^\star (s) = 0$$
• For $$k=T,T-1,\dots,0$$:
• Compute $$J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]$$
• Record minimizing argument as $$\pi_k^\star(s)$$

Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas

Exercise: Prove that the resulting policy is optimal.
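The backward recursion above is easy to instantiate when states and actions are finite. Below is a minimal tabular sketch on a made-up deterministic example (the states, actions, cost, and horizon are hypothetical, chosen only to illustrate the algorithm):

```python
import numpy as np

# Hypothetical instance: states 0..4, actions move -1/0/+1,
# deterministic dynamics F(s, a) = clip(s + a), cost c(s, a) = s^2 + |a|.
S, A, T = 5, (-1, 0, 1), 10
J = np.zeros(S)                          # J_{T+1}(s) = 0
policy = np.zeros((T + 1, S), dtype=int)
for k in range(T, -1, -1):               # k = T, T-1, ..., 0
    J_new = np.empty(S)
    for s in range(S):
        # c(s, a) + J_{k+1}(F(s, a)) for each action (no noise here)
        costs = [s**2 + abs(a) + J[min(max(s + a, 0), S - 1)] for a in A]
        best = int(np.argmin(costs))
        J_new[s] = costs[best]           # J_k(s)
        policy[k, s] = A[best]           # record minimizing argument pi_k(s)
    J = J_new

print(policy[0])                         # optimal first-step action per state
```

The resulting policy stays at the origin and otherwise moves toward it, as the quadratic state cost suggests.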

• Linear dynamics: $$F(s, a, w) = A s+Ba+w$$
• Quadratic costs: $$c(s, a) = s^\top Qs + a^\top Ra$$ where $$Q,R\succ 0$$
• Stochastic and independent noise $$\mathbb E[w_k] = 0$$ and $$\mathbb E[w_kw_k^\top] = \sigma^2 I$$

LQR Problem

$$\min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t.}\quad a_k=\pi_k(s_k),~~ s_{k+1} = A s_k+ Ba_k+w_k$$

LQR Example

$$s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ 0 & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t$$

The state is position & velocity $$s=[\theta,\omega]$$, input is a force $$a\in\mathbb R$$.

Goal: stay near origin and be energy efficient

• $$c(s,a) = 10\theta^2+0.1\omega^2+5a^2$$
• $$Q =\begin{bmatrix} 10 & 0\\ 0 & 0.1 \end{bmatrix},\quad R = 5$$

LQR via DP

• $$k=T$$: $$\qquad\min_{a} s^\top Q s+a^\top Ra+0$$
• $$J_T^\star(s) = s^\top Q s$$ and $$\pi_T^\star(s) =0$$
• $$k=T-1$$: $$\quad \min_{a} s^\top Q s+a^\top Ra+\mathbb E_w[(As+Ba+w)^\top Q (As+Ba+w)]$$

DP: $$J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]$$

• $$\mathbb E[(As+Ba+w)^\top Q (As+Ba+w)]$$
• $$=(As+Ba)^\top Q (As+Ba)+\mathbb E[ 2w^\top Q(As+Ba) + w^\top Q w]$$
• $$=(As+Ba)^\top Q (As+Ba)+\sigma^2\mathrm{tr}( Q )$$ using $$\mathbb E[w]=0$$ and $$\mathbb E[ww^\top]=\sigma^2 I$$

• $$k=T-1$$ continued: after taking the expectation, $$\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba+\sigma^2\mathrm{tr}( Q )$$

• $$\min_a a^\top M a + m^\top a + c$$
• $$2Ma_\star + m = 0 \implies a_\star = -\frac{1}{2}M^{-1} m$$
• $$\pi_{T-1}^\star(s)=-\frac{1}{2}(R+B^\top QB)^{-1}(2B^\top QAs)$$

• $$\pi_{T-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs$$
• $$J_{T-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s +\sigma^2\mathrm{tr}( Q )$$

Claim: For $$t=0,\dots, T$$, the optimal cost-to-go function is quadratic and the optimal policy is linear

• $$J^\star_t (s) = s^\top P_t s + p_t$$ and $$\pi_t^\star(s) = K_t s$$
• Exercise: Using DP and induction, prove the claim for:
• $$P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A$$
• $$p_t = p_{t+1} + \sigma^2\mathrm{tr}(P_{t+1})$$
• $$K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A$$
• Exercise: Derive expressions for optimal controllers when
1. Time varying cost: $$c_t(s,a) = s^\top Q_t s+a^\top R_t a$$
2. General noise covariance: $$\mathbb E[w_tw_t^\top] = \Sigma_t$$
3. Trajectory tracking: $$c_t(s,a) = \|s-\bar s_t\|_2^2 + \|a\|_2^2$$ for given $$\bar s_t$$
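The recursion for $$P_t$$, $$p_t$$, $$K_t$$ can be run directly. Below is a numpy sketch using $$A$$, $$B$$, $$Q$$, $$R$$ from the LQR example; the horizon and $$\sigma^2$$ are chosen arbitrarily for illustration.

```python
import numpy as np

# A, B, Q, R from the LQR example slide; T and sigma2 are arbitrary choices.
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.diag([10.0, 0.1])
R = np.array([[5.0]])
T, sigma2 = 50, 1.0

P, p = Q, 0.0                                    # P_T = Q (since pi_T = 0), p_T = 0
for t in range(T - 1, -1, -1):
    M = R + B.T @ P @ B                          # R + B^T P_{t+1} B
    K = -np.linalg.solve(M, B.T @ P @ A)         # K_t = -(R + B^T P B)^{-1} B^T P A
    p = p + sigma2 * np.trace(P)                 # p_t = p_{t+1} + sigma^2 tr(P_{t+1})
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(M, B.T @ P @ A)

print(np.round(K, 3))                            # feedback gain K_0
```

Over a long horizon the gains converge, so $$K_0$$ approximates the stationary (infinite-horizon) LQR controller; the closed-loop matrix $$A+BK_0$$ is stable.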

Recap

• Controllability
• $$\mathcal C = \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}$$
• Optimal control
• Dynamic programming
• $$\pi_t^\star(s) = K_t s$$