Prof Sarah Dean
[Recap figures]
Prediction: a model \(f_t:\mathcal X\to\mathcal Y\) maps each observation \(x_t\) to a prediction \(\hat y_{t}\), and we accumulate data \(\{(x_t, y_t)\}\).
Action: a policy \(\pi_t:\mathcal X\to\mathcal A\) maps each observation \(x_t\) to an action \(a_{t}\), and we accumulate data \(\{(x_t, a_t, r_t)\}\).
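As a minimal illustration of the action loop, here is a sketch in Python; the `env` and `policy` callables are hypothetical stand-ins and not part of the lecture.

```python
# Minimal sketch of the action feedback loop (names are illustrative).
def run_action_loop(env, policy, T):
    """env.observe() -> context, env.reward(x, a) -> reward, policy(x, data) -> action."""
    data = []
    for t in range(T):
        x_t = env.observe()           # observation / context
        a_t = policy(x_t, data)       # action chosen from accumulated data
        r_t = env.reward(x_t, a_t)    # reward feedback
        data.append((x_t, a_t, r_t))  # accumulate {(x_t, a_t, r_t)}
    return data
```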
Linear Contextual Bandits
Related Goals:
How much should I trust my predicted reward \(\hat\theta^\top \varphi\) if observed rewards are corrupted by Gaussian noise, i.e. \(r_k = \theta_\star^\top\varphi_k + \varepsilon_k\) with \(\varepsilon_k\sim\mathcal N(0,\sigma^2)\)?
The least-squares estimate is $$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\sum_{k=1}^t\varphi_k\varphi_k^\top $$
The prediction error \((\hat\theta_t-\theta_\star)^\top \varphi =\sum_{k=1}^t \varepsilon_k (V_t^{-1}\varphi_k)^\top \varphi \sim \mathcal N(0,\sigma^2\underbrace{\varphi^\top V_t^{-1}\varphi}_{ \|\varphi\|^2_{V_t^{-1}} } )\)
With probability \(1-\delta\), we have \(|(\hat\theta_t-\theta_\star)^\top \varphi| \leq \sigma\sqrt{2\log(2/\delta) }\|\varphi\|_{V_t^{-1}}\)
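A quick numerical sanity check of this bound (a sketch: the dimension, horizon, noise level \(\sigma\), and \(\delta\) below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, sigma, delta = 3, 200, 0.5, 0.05
theta_star = rng.normal(size=d)
Phi = rng.normal(size=(t, d))                      # features phi_1, ..., phi_t
rewards = Phi @ theta_star + sigma * rng.normal(size=t)

V = Phi.T @ Phi                                    # V_t = sum_k phi_k phi_k^T
theta_hat = np.linalg.solve(V, Phi.T @ rewards)    # least-squares estimate

phi = rng.normal(size=d)                           # fixed test direction
err = abs((theta_hat - theta_star) @ phi)
width = sigma * np.sqrt(2 * np.log(2 / delta)) * np.sqrt(phi @ np.linalg.solve(V, phi))
print(err <= width)                                # holds with probability >= 1 - delta
```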
Correction: last lecture, expressions like \(\|Mx\|_2\) should have instead been $$\sqrt{x^\top Mx} = \|M^{1/2}x\|_2 =: \|x\|_M$$
where the notation \(M^{1/2}\) means a matrix such that $$(M^{1/2})^\top M^{1/2} = M$$
For symmetric matrices with non-negative eigenvalues (i.e. "positive semi-definite matrices"), we know that \(M=V\Lambda V^\top\) with \(V\) orthogonal and \(\Lambda\) diagonal
For a diagonal matrix, \(\Lambda^{1/2}=\mathrm{diag}(\sqrt{\lambda_1},\dots,\sqrt{\lambda_n})\)
So the "square root" of a PSD matrix is $$ M^{1/2} = \Lambda^{1/2}V^\top $$
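A small numpy check of this construction (the matrix below is arbitrary):

```python
import numpy as np

A = np.random.default_rng(1).normal(size=(3, 3))
M = A.T @ A                                   # a PSD matrix
lam, V = np.linalg.eigh(M)                    # M = V diag(lam) V^T
M_half = np.diag(np.sqrt(lam)) @ V.T          # M^{1/2} = Lambda^{1/2} V^T
print(np.allclose(M_half.T @ M_half, M))      # (M^{1/2})^T M^{1/2} = M

x = np.array([1.0, -2.0, 0.5])
print(np.isclose(np.sqrt(x @ M @ x), np.linalg.norm(M_half @ x)))  # ||x||_M two ways
```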
How much should I trust the estimate \(\hat \theta\)?
Define the confidence ellipsoid $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
For the right choice of \(\beta_t\), it is possible to guarantee \(\theta_\star\in\mathcal C_t\) with high probability
Exercise: For a fixed action \(\varphi\), show that $$\max_{\theta\in\mathcal C_t} \theta^\top \varphi \leq \hat\theta^\top \varphi +\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
$$\min_{\theta\in\mathcal C_t}\theta^\top \varphi \geq \hat\theta^\top\varphi -\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
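A numerical check of this exercise (a sketch; \(V_t\), \(\beta_t\), \(\hat\theta\), and \(\varphi\) below are arbitrary choices): sample points on the boundary of \(\mathcal C_t\) and compare against the closed-form upper bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, beta = 2, 4.0
V = np.array([[3.0, 1.0], [1.0, 3.0]])
theta_hat = rng.normal(size=d)
phi = np.array([0.0, 1.0])

# Points on the boundary of C_t = {||theta - theta_hat||_V^2 <= beta}:
# theta = theta_hat + sqrt(beta) * S u with ||u|| = 1 and S^T V S = I.
S = np.linalg.inv(np.linalg.cholesky(V)).T
u = rng.normal(size=(100000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
thetas = theta_hat + np.sqrt(beta) * u @ S.T

empirical_max = (thetas @ phi).max()
bound = theta_hat @ phi + np.sqrt(beta) * np.sqrt(phi @ np.linalg.solve(V, phi))
print(empirical_max, bound)   # empirical max approaches (and never exceeds) the bound
```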
example: \(K=2\) and we've pulled the arms 2 and 1 times respectively, so the observed features are $$ \left\{\begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 0\\ 1\end{bmatrix}\right \} $$ and \(V_t = \begin{bmatrix} 2& \\& 1\end{bmatrix}\)
The confidence set is an ellipse centered at \((\hat \mu_1,\hat\mu_2)\): \(2(\mu_1-\hat\mu_1)^2 + (\mu_2-\hat\mu_2)^2 \leq \beta_t\)
Pulling arm 1: \(\hat \mu_1 \pm \sqrt{\beta_t/2}\)
Pulling arm 2: \(\hat \mu_2 \pm \sqrt{\beta_t}\)
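The same interval widths, computed directly (a sketch with \(\beta_t=1\) chosen for concreteness):

```python
import numpy as np

beta = 1.0                                   # arbitrary choice for illustration
Phi = np.array([[1, 0], [1, 0], [0, 1]])     # arms pulled: 1, 1, 2
V = Phi.T @ Phi                              # diag(2, 1)
widths = np.sqrt(beta * np.diag(np.linalg.inv(V)))
print(widths)                                # [sqrt(beta/2), sqrt(beta)]
```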
example: \(d=2\) linear bandits with observed features $$ \left\{\begin{bmatrix} 1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ -1\end{bmatrix}\right \} $$ so that \(V_t = \begin{bmatrix} 3&1 \\1& 3\end{bmatrix}\) and \(V_t^{-1} = \frac{1}{8} \begin{bmatrix} 3&-1 \\-1& 3\end{bmatrix}\)
The confidence set is an ellipse centered at \(\hat \theta\)
Trying action \(a=[0,1]\): predicted reward lies in \(\hat\theta^\top a \pm \sqrt{\beta_t}\|a\|_{V_t^{-1}}\), with \(\|a\|^2_{V_t^{-1}} = 3/8\)
Exercise: For fixed \(\varphi\), show that the best/worst case elements of \(\mathcal C_t\) are given by $$\theta = \hat\theta_t \pm\frac{\sqrt{\beta_t} V_t^{-1}\varphi}{\|\varphi\|_{V_t^{-1}}}$$
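A sketch verifying the exercise on this example (with placeholder values \(\beta_t=1\) and \(\hat\theta=0\), which are not specified in the lecture):

```python
import numpy as np

beta = 1.0                                        # arbitrary for illustration
Phi = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0]])
V = Phi.T @ Phi                                   # [[3, 1], [1, 3]]
V_inv = np.linalg.inv(V)                          # (1/8) [[3, -1], [-1, 3]]

theta_hat = np.zeros(2)                           # placeholder estimate
phi = np.array([0.0, 1.0])                        # trying action a = [0, 1]
offset = np.sqrt(beta) * V_inv @ phi / np.sqrt(phi @ V_inv @ phi)
best, worst = theta_hat + offset, theta_hat - offset
print(best @ phi, worst @ phi)   # theta_hat^T phi +/- sqrt(beta) ||phi||_{V^{-1}}
```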
To handle cases where \(\{\varphi_k\}\) are not full rank, we consider regularized LS $$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2 + \lambda\|\theta\|_2^2$$
Now we have $$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\lambda I+\sum_{k=1}^t\varphi_k\varphi_k^\top $$
The confidence ellipsoid takes the same form $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
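A minimal implementation sketch of the regularized estimator (the function name is illustrative):

```python
import numpy as np

def ridge_estimate(Phi, rewards, lam=1.0):
    """Regularized least-squares estimate theta_hat and design matrix V_t."""
    d = Phi.shape[1]
    V = lam * np.eye(d) + Phi.T @ Phi          # V_t = lambda I + sum_k phi_k phi_k^T
    theta_hat = np.linalg.solve(V, Phi.T @ rewards)
    return theta_hat, V

# even a single, rank-deficient observation yields a well-defined estimate
theta_hat, V = ridge_estimate(np.array([[1.0, 0.0]]), np.array([0.7]))
print(theta_hat)
```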
How does estimation error affect suboptimality?
The suboptimality of the greedy action \(\hat\varphi_t = \arg\max_{\varphi\in\mathcal A_t}\hat\theta_t^\top\varphi\) is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\), where \(\varphi_t^\star = \arg\max_{\varphi\in\mathcal A_t}\theta_\star^\top\varphi\)
When \(\{\varphi_k\}_{k=1}^N\) are chosen at random from a "nice" distribution, \(\|V^{-1}\| = \frac{1}{\lambda_{\min}(V)} \lesssim \frac{d}{N}\) with high probability.
Informal Theorem: Let the norm of all \(\varphi\in\mathcal A_t\) be bounded by \(B\) for all \(t\). Suppose that \(\{\varphi_k\}_{k=1}^N\) are chosen at random from a "nice" distribution. Then with high probability, the suboptimality is \(\lesssim 2B\sqrt{\beta_N \frac{d}{N}}\).
This perspective motivates techniques in experiment design
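A simulation sketch of this scaling, assuming standard Gaussian features as the "nice" distribution (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma = 5, 0.5
theta_star = rng.normal(size=d)
for N in [100, 1000, 10000]:
    Phi = rng.normal(size=(N, d))             # isotropic feature distribution
    r = Phi @ theta_star + sigma * rng.normal(size=N)
    V = Phi.T @ Phi
    theta_hat = np.linalg.solve(V, Phi.T @ r)
    print(N, 1 / np.linalg.eigvalsh(V)[0], np.linalg.norm(theta_hat - theta_star))
    # 1/lambda_min(V) shrinks like 1/N here (so certainly <= d/N);
    # the estimation error shrinks like sqrt(d/N)
```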
For a fixed interaction horizon \(T\), how to trade-off exploration and exploitation?
Design algorithms with low regret $$R(T) = \sum_{t=1}^T\max_{a\in\mathcal A} r(x_t, a) - r(x_t, a_t) $$
$$R(T) = \sum_{t=1}^T \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t $$
ETC (Explore-Then-Commit): explore with randomly chosen actions for the first \(N\) rounds, then commit to the greedy action \(\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t}\hat\theta_N^\top\varphi\) for the remaining rounds
The regret has two components
$$R(T) = \underbrace{\sum_{t=1}^{N} \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_1} + \underbrace{\sum_{t=N+1}^{T} \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_2}$$
Suppose that \(B\) bounds the norm of \(\varphi\) and \(\|\theta_\star\|\leq 1\).
Then we have \(R_1\leq 2BN\)
Suppose that \(\max_t\beta_t\leq \beta\).
Using the sub-optimality result, with high probability \(R_2 \lesssim (T-N)2B\sqrt{\beta \frac{d}{N} }\)
The regret is bounded with high probability by
$$R(T) \lesssim 2BN + 2BT\sqrt{\beta \frac{d}{N} }$$
Choosing \(N=T^{2/3}\) balances the two terms and leads to sublinear regret \(R(T)\lesssim T^{2/3}\)
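A simulation sketch of explore-then-commit on a toy linear bandit; the random unit-norm action sets and noise level below are illustrative assumptions, not specified in the lecture:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, sigma = 3, 5000, 0.1
N = int(T ** (2 / 3))                          # exploration length
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)       # ||theta_star|| <= 1

def action_set(K=10):
    """Toy action set: K random unit-norm feature vectors per round."""
    A = rng.normal(size=(K, d))
    return A / np.linalg.norm(A, axis=1, keepdims=True)

regret, V, s = 0.0, np.eye(d), np.zeros(d)     # lambda = 1 regularization
theta_hat = np.zeros(d)
for t in range(T):
    A = action_set()
    if t < N:                                  # explore: pick a random action
        phi = A[rng.integers(len(A))]
    else:                                      # commit: greedy w.r.t. theta_hat
        phi = A[np.argmax(A @ theta_hat)]
    r = phi @ theta_star + sigma * rng.normal()
    regret += (A @ theta_star).max() - phi @ theta_star
    if t < N:                                  # only exploration data is used
        V += np.outer(phi, phi)
        s += phi * r
        theta_hat = np.linalg.solve(V, s)
print(N, regret)                               # total regret; theory: ~ T^{2/3}
```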
Adaptive perspective: optimism in the face of uncertainty
Instead of exploring randomly, focus exploration on promising actions
UCB (Upper Confidence Bound): at each round, choose the optimistic action $$\varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top\varphi = \arg\max_{\varphi\in\mathcal A_t}\ \hat\theta_{t-1}^\top\varphi + \sqrt{\beta_{t-1}}\|\varphi\|_{V_{t-1}^{-1}}$$
Proof Sketch: \(R(T) = \sum_{t=1}^T \theta_\star^\top \varphi_t^\star - \theta_\star^\top \varphi_t\). When \(\theta_\star\in\mathcal C_{t-1}\), optimism gives \(\theta_\star^\top\varphi_t^\star \leq \max_{\theta\in\mathcal C_{t-1}}\theta^\top\varphi_t\), so each per-step suboptimality is at most \(2\sqrt{\beta_t}\|\varphi_t\|_{V_{t-1}^{-1}}\); summing over \(t\) and applying Cauchy-Schwarz with the elliptical potential lemma gives the bound.
The regret is bounded with high probability by
$$R(T) \lesssim \sqrt{T}$$
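A sketch of the optimistic (LinUCB-style) strategy on the same toy setup as the ETC sketch above; the fixed \(\beta\) is an illustrative simplification (in the analysis \(\beta_t\) grows logarithmically):

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, sigma, lam, beta = 3, 5000, 0.1, 1.0, 4.0   # fixed beta for illustration
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

regret, V, s = 0.0, lam * np.eye(d), np.zeros(d)
for t in range(T):
    A = rng.normal(size=(10, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # toy action set, ||phi|| = 1
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ s
    ucb = A @ theta_hat + np.sqrt(beta) * np.sqrt(np.sum((A @ V_inv) * A, axis=1))
    phi = A[np.argmax(ucb)]                         # optimistic action
    r = phi @ theta_star + sigma * rng.normal()
    V += np.outer(phi, phi)
    s += phi * r
    regret += (A @ theta_star).max() - phi @ theta_star
print(regret)                                       # total regret; theory: ~ sqrt(T)
```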
After fall break: action in a dynamical world (optimal control)
Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari