Prof Sarah Dean
The policy \(\pi_t:\mathcal X\to\mathcal A\) maps each observation \(x_t\) to an action \(a_t\); over time we accumulate data \(\{(x_t, a_t, r_t)\}\).
Goal: select actions \(a_t\) with high reward
Linear Contextual Bandits
ETC
With \(N=T^{2/3}\), \(R(T) \lesssim T^{2/3}\)
UCB
$$R(T) \lesssim \sqrt{T}$$
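As a concrete reference point, here is a minimal LinUCB-style sketch of the UCB approach for linear contextual bandits. The interface is hypothetical: `contexts` holds per-round, per-action feature vectors, `reward_fn` returns the observed reward, and the bonus scale `beta` is an illustrative constant rather than the tuned value from the regret analysis.

```python
import numpy as np

def linucb(contexts, reward_fn, beta=1.0, lam=1.0):
    """Minimal LinUCB sketch.

    contexts: array of shape (T, K, d), feature vectors per round and per action.
    reward_fn(t, a): returns the (noisy) reward of action a at round t.
    """
    T, K, d = contexts.shape
    V = lam * np.eye(d)      # regularized feature covariance
    b = np.zeros(d)          # sum of feature * reward
    actions, rewards = [], []
    for t in range(T):
        theta_hat = np.linalg.solve(V, b)          # ridge-regression estimate
        V_inv = np.linalg.inv(V)
        # optimistic score: estimated reward + exploration bonus
        bonus = np.sqrt(np.einsum("kd,de,ke->k", contexts[t], V_inv, contexts[t]))
        a = int(np.argmax(contexts[t] @ theta_hat + beta * bonus))
        r = reward_fn(t, a)
        V += np.outer(contexts[t, a], contexts[t, a])
        b += r * contexts[t, a]
        actions.append(a)
        rewards.append(r)
    return actions, rewards
```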
In contextual bandits, the policy \(\pi_t:\mathcal X\to\mathcal A\) maps each observation \(x_t\) to an action \(a_t\), we accumulate data \(\{(x_t, a_t, r_t)\}\), and the goal is to select actions with high reward.
In control, the policy \(\pi_t:\mathcal S\to\mathcal A\) maps the observed state \(s_t\) to an action \(a_t\), and we accumulate data \(\{(s_t, a_t, c_t)\}\).
Goal: select actions \(a_t\) to bring the environment to low-cost states
The next state is determined by the current state \(s_t\), the action \(a_t\), and a disturbance \(w_t\):
$$ s_{t+1} = F(s_t, a_t, w_t) $$
Rolling out the dynamics, the state depends on the initial state, past disturbances, and past actions:
$$ s_{t} = \Phi_{F}(s_0, w_{0:t-1}, a_{0:t-1}) $$
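A one-function sketch of this rollout, assuming the dynamics `F` and the action and disturbance sequences are given:

```python
def rollout(F, s0, actions, disturbances):
    """Simulate s_{t+1} = F(s_t, a_t, w_t); returns [s_0, s_1, ..., s_T]."""
    states = [s0]
    for a, w in zip(actions, disturbances):
        states.append(F(states[-1], a, w))
    return states
```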
For a deterministic system, a state \(s_\star\) is reachable from initial state \(s_0 \) if there exists a sequence of actions \(a_{0:t-1} \in\mathcal A^t\) such that \(s_{t}=s_\star\) for some \(t\).
$$ s_{t} = \Phi_{F}(s_0, a_{0:t-1}) $$
A deterministic system is controllable if any target state \(s_\star\in\mathcal S\) is reachable from any initial state \(s_0 \in\mathcal S\).
Example: a system with three discrete states, "charging", "working", and "out of battery".
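For a deterministic system with finitely many states and actions, like the charging / working / out of battery example, the reachable set can be enumerated by breadth-first search. A sketch, assuming the dynamics are given as a function `F(s, a)` returning hashable states:

```python
def reachable_set(F, s0, actions, max_steps=100):
    """States reachable from s0 under deterministic dynamics s' = F(s, a)."""
    reached = {s0}
    frontier = {s0}
    for _ in range(max_steps):
        frontier = {F(s, a) for s in frontier for a in actions} - reached
        if not frontier:
            break
        reached |= frontier
    return reached
```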
The state \(s=[\theta,\omega]\), input \(a\in\mathbb R\).
Which states are reachable if:
\(s_{t+1} = As_t+Ba_t\)
State space \(\mathcal S = \mathbb R^n\), actions \(\mathcal A=\mathbb R^m\), and dynamics defined by \(A\in\mathbb R^{n\times n}\), \(B\in\mathbb R^{n\times m}\).
Unrolling the dynamics, \(s_{t+1}=A^{t+1} s_0 + A^t B a_0 + \dots + ABa_{t-1} + Ba_t \).
A linear system is controllable if and only if
$$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix}}_{\mathcal C}\Big) = n$$
Proof:
1) \(\mathrm{rank}(\mathcal C) = n \implies\) controllable: taking \(t=n-1\) in the unrolled dynamics,
$$s_n = A^n s_0 + \begin{bmatrix}B&AB &\dots & A^{n-1}B\end{bmatrix} \begin{bmatrix}a_{n-1} \\\vdots \\ a_0\end{bmatrix} $$
and since \(\mathcal C\) has full row rank, the actions \(a_{0:n-1}\) can be chosen so that \(s_n\) equals any target \(s_\star\).
2) \(\mathrm{rank}(\mathcal C) < n \implies\) not controllable: for any \(t\),
$$s_t = A^t s_0 + \mathcal C_t \begin{bmatrix}a_{t-1} \\\vdots \\ a_0\end{bmatrix} $$
and the range of \(\mathcal C_t = \begin{bmatrix}B&AB &\dots & A^{t-1}B\end{bmatrix}\) never exceeds the range of \(\mathcal C\) (by Cayley-Hamilton), so from \(s_0=0\) only states in the range of \(\mathcal C\) are reachable.
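A numpy sketch of the rank test, plus the constructive direction of the proof: if \(\mathrm{rank}(\mathcal C)=n\), an action sequence reaching a target \(s_\star\) in \(n\) steps can be recovered by solving the linear system above. Function and variable names here are illustrative.

```python
import numpy as np

def controllability_matrix(A, B):
    """Build C = [B, AB, ..., A^{n-1} B]."""
    blocks, n = [B], A.shape[0]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def is_controllable(A, B):
    return np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0]

def reaching_actions(A, B, s0, s_target):
    """Solve C [a_{n-1}; ...; a_0] = s_target - A^n s0 for an n-step action sequence."""
    n, m = A.shape[0], B.shape[1]
    C = controllability_matrix(A, B)
    rhs = s_target - np.linalg.matrix_power(A, n) @ s0
    stacked, *_ = np.linalg.lstsq(C, rhs, rcond=None)
    # stacked is [a_{n-1}; ...; a_0]; reverse the blocks to get a_0, ..., a_{n-1}
    return [stacked[i * m:(i + 1) * m] for i in range(n)][::-1]
```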
Optimal Control Problem
$$ \min_{a_{0:T}} \sum_{k=0}^{T} c(s_k, a_k) \quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$
General perspective: goal encoded by a cost \(c:\mathcal S\times \mathcal A\to\mathbb R\)
Stochastic Optimal Control Problem
$$ \min_{\pi_{0:T}}~~ \mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big ]\quad \text{s.t}\quad s_0~~\text{given},~~ s_{k+1} = F(s_k, a_k,w_k) $$
$$a_k=\pi_k(s_k) $$
Substituting \(a_k=\pi_k(s_k)\) into the objective and dynamics:
$$ \min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} c(s_k, \pi_k(s_k)) \Big]\quad \text{s.t}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$
The objective is the expected total cost of policy \(\pi\) from initial state \(s_0\), denoted \(J^\pi(s_0)\).
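The objective \(J^\pi(s_0)\) can be approximated by simulation. A minimal Monte Carlo sketch, where `sample_w()` is an assumed user-supplied disturbance sampler:

```python
def estimate_cost(F, policy, cost, s0, T, sample_w, n_rollouts=1000):
    """Monte Carlo estimate of J^pi(s0) = E[sum_{k=0}^T c(s_k, pi_k(s_k))]."""
    total = 0.0
    for _ in range(n_rollouts):
        s = s0
        for k in range(T + 1):
            a = policy(k, s)           # a_k = pi_k(s_k)
            total += cost(s, a)
            s = F(s, a, sample_w())    # s_{k+1} = F(s_k, a_k, w_k)
    return total / n_rollouts
```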
Suppose \(\pi^\star = (\pi^\star_0,\dots, \pi^\star_{T})\) solves the stochastic optimal control problem.
Then the cost-to-go $$ J^\pi_t(s) = \mathbb E_w\Big[\sum_{k=t}^{T} c(s_k, \pi_k(s_k)) \Big]\quad \text{s.t}\quad s_t=s,~~s_{k+1} = F(s_k, \pi_k(s_k),w_k) $$
is minimized for all \(s\) by the truncated policy \((\pi_t^\star,\dots,\pi_T^\star)\)
(i.e. \(J^\pi_t(s)\geq J^{\pi^\star}_t(s)\) for all \(\pi\) and \(s\))
Algorithm (Dynamic Programming): initialize \(J_{T+1}^\star \equiv 0\); for \(k=T,\dots,0\), compute \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\) and let \(\pi_k^\star(s)\) be the minimizing action.
Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas
Exercise: Prove that the resulting policy is optimal.
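A tabular sketch of the dynamic programming algorithm for a finite state and action space, written with transition probabilities \(P(s'\mid s,a)\) in place of the disturbance form \(F(s,a,w)\) (the two are equivalent when the disturbance takes finitely many values). The array shapes are assumptions for illustration.

```python
import numpy as np

def dynamic_programming(cost, P, T):
    """Backward DP: cost[s, a] = c(s, a), P[s, a, s'] = P(s' | s, a).

    Returns J[t, s] (optimal cost-to-go) and pi[t, s] (optimal action).
    """
    n_states, n_actions = cost.shape
    J = np.zeros((T + 2, n_states))                  # J[T+1] = 0
    pi = np.zeros((T + 1, n_states), dtype=int)
    for t in range(T, -1, -1):
        Q = cost + P @ J[t + 1]                      # Q[s, a] = c(s, a) + E[J_{t+1}(s')]
        pi[t] = np.argmin(Q, axis=1)
        J[t] = Q[np.arange(n_states), pi[t]]
    return J[: T + 1], pi
```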
LQR Problem
$$ \min_{\pi_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top Qs_k + a_k^\top Ra_k \Big]\quad \text{s.t}\quad s_{k+1} = A s_k+ Ba_k+w_k $$
$$a_k=\pi_k(s_k) $$
$$ s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t + w_t $$
The state is position & velocity \(s=[\theta,\omega]\), input is a force \(a\in\mathbb R\).
Goal: stay near origin and be energy efficient
\(Q =\begin{bmatrix} 10 & \\ & 0.1 \end{bmatrix},\quad R = 5 \)
DP: \(J_k^\star (s) = \min_{a\in\mathcal A} c(s, a)+\mathbb E_w[J_{k+1}^\star (F(s,a,w))]\)
Claim: For \(t=0,\dots T\), the optimal cost-to-go function is quadratic and the optimal policy is linear
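A sketch of how the claim plays out computationally: the backward (Riccati) recursion produces quadratic cost-to-go matrices \(P_k\) and linear gains \(K_k\), shown here with the example \(A, B, Q, R\) from above. The horizon \(T=50\) is an arbitrary choice for illustration.

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Finite-horizon LQR: returns gains K_k (pi_k(s) = K_k s) and matrices P_k
    with J_k^*(s) = s^T P_k s + const (the constant comes from the noise w)."""
    n, m = B.shape
    P = [None] * (T + 1)
    K = [None] * (T + 1)
    P[T] = Q                              # at the final step the optimal action is 0
    K[T] = np.zeros((m, n))
    for k in range(T - 1, -1, -1):
        G = R + B.T @ P[k + 1] @ B
        K[k] = -np.linalg.solve(G, B.T @ P[k + 1] @ A)
        P[k] = Q + A.T @ P[k + 1] @ (A + B @ K[k])
    return K, P

# Example matrices from the slides
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.diag([10.0, 0.1])
R = np.array([[5.0]])
K, P = lqr_backward(A, B, Q, R, T=50)
```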
Reference: Ch 1&4 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas