Contextual Bandits
ML in Feedback Sys #13
Prof Sarah Dean
Reminders
- Final project proposal due October 7
- Upcoming paper presentations starting 10/24
- Meet with Atul in advance

Prediction in a streaming world
- at each time, receive observation \(x_t\)
- model \(f_t:\mathcal X\to\mathcal Y\) outputs prediction \(\hat y_{t}\)
- data \(\{(x_t, y_t)\}\) accumulates and is used to update the model

Action in a streaming world
- at each time, receive observation \(x_t\)
- policy \(\pi_t:\mathcal X\to\mathcal A\) selects action \(a_{t}\)
- data \(\{(x_t, a_t, r_t)\}\) accumulates and is used to update the policy
Linear Contextual Bandits
- for \(t=1,2,...\)
- receive context \(x_t\)
- take action \(a_t\in\mathcal A\)
- receive reward $$\mathbb E[r_t] = \theta_\star^\top \varphi(x_t, a_t)$$
Linear Contextual Bandits
- for \(t=1,2,...\)
- receive action set \(\mathcal A_t\subset\mathbb R^d\) (e.g. \(\mathcal A_t = \{\varphi(x_t,a) : a\in\mathcal A\}\))
- take action \(\varphi_t\in\mathcal A_t\)
- receive reward $$\mathbb E[r_t] = \theta_\star^\top \varphi_t$$
Related Goals:
- choose best action: \(\qquad\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi\)
- predict reward of action: \(\quad\hat r_t \approx \theta_\star^\top \varphi_t\)
- estimate reward function: \(\quad\hat \theta \approx \theta_\star\)
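To make the interaction protocol above concrete, here is a minimal simulation sketch (the random action sets, noise level, and uniformly random placeholder policy are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 2, 5
theta_star = rng.normal(size=d)              # unknown reward parameter (simulation only)

for t in range(T):
    # receive a finite action set A_t of feature vectors (3 random ones here)
    A_t = rng.normal(size=(3, d))
    # placeholder policy: pick an element of A_t uniformly at random
    phi_t = A_t[rng.integers(len(A_t))]
    # reward is noisy with mean theta_star^T phi_t
    r_t = theta_star @ phi_t + 0.1 * rng.normal()
    print(t, phi_t, r_t)
```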
Prediction errors
How much should I trust my predicted reward \(\hat\theta^\top \varphi\) if observed rewards are corrupted by Gaussian noise?
$$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\sum_{k=1}^t\varphi_k\varphi_k^\top $$
With \(r_k = \theta_\star^\top\varphi_k + \varepsilon_k\) and \(\varepsilon_k\sim\mathcal N(0,\sigma^2)\), the prediction error is \((\hat\theta_t-\theta_\star)^\top \varphi =\sum_{k=1}^t \varepsilon_k (V_t^{-1}\varphi_k)^\top \varphi \sim \mathcal N(0,\sigma^2\underbrace{\varphi^\top V_t^{-1}\varphi}_{ \|\varphi\|^2_{V_t^{-1}} } )\)
With probability \(1-\delta\), we have \(|(\hat\theta_t-\theta_\star)^\top \varphi| \leq \sigma\sqrt{2\log(2/\delta) }\|\varphi\|_{V_t^{-1}}\)
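A small simulation of this bound (a sketch; the dimensions, noise level, and feature distribution are arbitrary choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, sigma, delta = 3, 200, 0.5, 0.05
theta_star = rng.normal(size=d)                      # unknown parameter (simulation only)

# Simulate t rounds with random features and Gaussian reward noise.
Phi = rng.normal(size=(t, d))                        # rows are phi_k
r = Phi @ theta_star + sigma * rng.normal(size=t)    # E[r_k] = theta_star^T phi_k

V = Phi.T @ Phi                                      # V_t = sum_k phi_k phi_k^T
theta_hat = np.linalg.solve(V, Phi.T @ r)            # least-squares estimate

phi = rng.normal(size=d)                             # a query feature vector
width = sigma * np.sqrt(2 * np.log(2 / delta)) * np.sqrt(phi @ np.linalg.solve(V, phi))
print("predicted reward:", theta_hat @ phi, "+/-", width)
print("true reward:     ", theta_star @ phi)
```

Over many repetitions of the noise, the true reward should fall inside the interval roughly a \(1-\delta\) fraction of the time.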
Correction: last lecture, expressions like \(\|Mx\|_2\) should have instead been $$\sqrt{x^\top Mx} = \|M^{1/2}x\|_2 =: \|x\|_M$$
where the notation \(M^{1/2}\) means a matrix such that $$(M^{1/2})^\top M^{1/2} = M$$
For symmetric matrices with non-negative eigenvalues (i.e. "positive semi-definite matrices"), we know that \(M=V\Lambda V^\top\)
So the "square root" of a PSD matrix is $$ M^{1/2} = \Lambda^{1/2}V^\top $$
where for the diagonal matrix \(\Lambda\), \(\Lambda^{1/2}=\mathrm{diag}(\sqrt{\lambda_1},\dots,\sqrt{\lambda_n})\)
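A quick numerical check of this notation (a sketch; the matrix and vector are arbitrary choices):

```python
import numpy as np

# A toy PSD matrix (the same V_t as in the d=2 example later in the lecture).
M = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Eigendecomposition M = V diag(lam) V^T (symmetric, so eigh applies).
lam, V = np.linalg.eigh(M)

# "Square root" in the sense (M^{1/2})^T M^{1/2} = M.
M_half = np.diag(np.sqrt(lam)) @ V.T

x = np.array([0.0, 1.0])
assert np.allclose(M_half.T @ M_half, M)
# ||x||_M = sqrt(x^T M x) = ||M^{1/2} x||_2
assert np.isclose(np.sqrt(x @ M @ x), np.linalg.norm(M_half @ x))
```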
Confidence Ellipsoids
How much should I trust the estimate \(\hat \theta\)?
Define the confidence ellipsoid $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
For the right choice of \(\beta_t\), it is possible to guarantee \(\theta_\star\in\mathcal C_t\) with high probability
Exercise: For a fixed action \(\varphi\), show that $$\max_{\theta\in\mathcal C_t} \theta^\top \varphi \leq \hat\theta^\top \varphi +\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
$$\min_{\theta\in\mathcal C_t}\theta^\top \varphi \geq \hat\theta^\top\varphi -\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
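One possible derivation sketch for the upper bound (writing \(\theta = \hat\theta_t + \Delta\) and applying Cauchy-Schwarz in the \(V_t\)-weighted geometry):
$$\max_{\theta\in\mathcal C_t}\theta^\top\varphi = \hat\theta_t^\top\varphi + \max_{\|\Delta\|^2_{V_t}\leq\beta_t}\Delta^\top\varphi,\qquad \Delta^\top\varphi = (V_t^{1/2}\Delta)^\top(V_t^{-1/2}\varphi)\leq \|\Delta\|_{V_t}\,\|\varphi\|_{V_t^{-1}}\leq \sqrt{\beta_t}\,\|\varphi\|_{V_t^{-1}}$$
with equality when \(\Delta\propto V_t^{-1}\varphi\); the lower bound follows by taking \(-\Delta\).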
example: \(K=2\) and we've pulled the arms 2 and 1 times respectively, i.e. features $$ \left\{\begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 0\\ 1\end{bmatrix}\right \} \quad\Rightarrow\quad V_t = \begin{bmatrix} 2& \\& 1\end{bmatrix}$$
- Confidence set around \((\hat \mu_1,\hat\mu_2)\): \(\quad 2(\mu_1-\hat\mu_1)^2 + (\mu_2-\hat\mu_2)^2 \leq \beta_t\)
- Pulling arm 1: \(\hat \mu_1 \pm \sqrt{\beta_t/2}\)
- Pulling arm 2: \(\hat \mu_2 \pm \sqrt{\beta_t}\)
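A quick check of these interval widths (a sketch with a placeholder value of \(\beta_t\)):

```python
import numpy as np

beta = 1.0                                           # placeholder confidence parameter
Phi = np.array([[1, 0], [1, 0], [0, 1]], float)      # arm 1 pulled twice, arm 2 once
V = Phi.T @ Phi                                      # = diag(2, 1)

for e in np.eye(2):                                  # standard-basis actions = pulling each arm
    width = np.sqrt(beta * e @ np.linalg.solve(V, e))
    print(width)                                     # sqrt(beta/2) for arm 1, sqrt(beta) for arm 2
```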
example: \(d=2\) linear bandits with past actions $$ \left\{\begin{bmatrix} 1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ -1\end{bmatrix}\right \} \quad\Rightarrow\quad V_t = \begin{bmatrix} 3&1 \\1& 3\end{bmatrix},\quad V_t^{-1} = \frac{1}{8} \begin{bmatrix} 3&-1 \\-1& 3\end{bmatrix}$$
Confidence set centered at \(\hat \theta\). Trying action \(a=[0,1]\):
- \(\hat \theta^\top a\pm \sqrt{\beta_t a^\top V_t^{-1}a}\)
- = \(\hat \theta_2 \pm \sqrt{3\beta_t/8}\)
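The same check for this example (again with a placeholder \(\beta_t\)):

```python
import numpy as np

beta = 1.0                                           # placeholder confidence parameter
Phi = np.array([[1, 1], [-1, 1], [-1, -1]], float)
V = Phi.T @ Phi                                      # [[3, 1], [1, 3]]
print(np.linalg.inv(V))                              # (1/8) * [[3, -1], [-1, 3]]

a = np.array([0.0, 1.0])
print(np.sqrt(beta * a @ np.linalg.solve(V, a)))     # sqrt(3 * beta / 8) ≈ 0.612
```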
Exercise: For fixed \(\varphi\), show that the best/worst case elements of \(\mathcal C_t\) are given by $$\theta = \hat\theta_t \pm\frac{\sqrt{\beta_t}\, V_t^{-1}\varphi}{\|\varphi\|_{V_t^{-1}}}$$
Regularization
To handle cases where \(\{\varphi_k\}\) are not full rank, we consider regularized LS
$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2 + \lambda\|\theta\|_2^2$$
Now we have
$$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\lambda I+\sum_{k=1}^t\varphi_k\varphi_k^\top $$
The confidence ellipsoid takes the same form $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
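A minimal sketch of the regularized estimator (the rank-deficient toy data here is an illustrative assumption, not from the lecture):

```python
import numpy as np

def ridge_estimate(Phi, r, lam=1.0):
    """Regularized least squares: theta_hat = (lam*I + Phi^T Phi)^{-1} Phi^T r."""
    d = Phi.shape[1]
    V = lam * np.eye(d) + Phi.T @ Phi
    return np.linalg.solve(V, Phi.T @ r), V

# Works even when the features span only a subspace (here: one repeated direction).
Phi = np.array([[1.0, 0.0], [1.0, 0.0]])
r = np.array([0.9, 1.1])
theta_hat, V = ridge_estimate(Phi, r, lam=0.1)
print(theta_hat)   # ≈ [0.95, 0]; the unobserved second coordinate stays at 0
```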
From Estimation to Action
How does estimation error affect suboptimality?
- Define the optimal action \(\varphi_t^\star = \arg\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi\)
- Suppose we choose \(\hat\varphi_t = \arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top \varphi\)
The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)
- \(\leq \max_{\theta\in\mathcal C}\theta^\top \varphi_t^\star - \min_{\theta\in\mathcal C}\theta^\top \hat\varphi_t\qquad\) since we don't know \(\theta_\star\), only that \(\theta_\star\in\mathcal C\) with high probability
- \(\leq \hat\theta^\top\varphi_t^\star +\sqrt{\beta}\|\varphi_t^\star \|_{V^{-1}}-(\hat\theta^\top\hat\varphi_t -\sqrt{\beta}\|\hat\varphi_t\|_{V^{-1}})\quad\) from previous slide
- \(\leq \underbrace{\hat\theta^\top\varphi_t^\star - \hat\theta^\top\hat\varphi_t }_{\leq 0} +\sqrt{\beta}(\|\varphi_t^\star \|_{V^{-1}}+\|\hat\varphi_t\|_{V^{-1}})\quad\) by choice of \(\hat\varphi_t\)
From Estimation to Action
The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)
- \(\leq \sqrt{\beta}(\|\varphi_t^\star \|_{V^{-1}}+\|\hat\varphi_t\|_{V^{-1}})\)
- \(\leq 2\sqrt{\beta \max_{\varphi\in\mathcal A_t} \varphi^\top V^{-1}\varphi } \)
This perspective motivates techniques in experiment design
- e.g. select \(\{\varphi_k\}\) to ensure that \(V\) has a large minimum eigenvalue (so \(V^{-1}\) will have small eigenvalues)
When \(\{\varphi_k\}_{k=1}^N\) are chosen at random from a "nice" distribution,
\(\|V^{-1}\| = \frac{1}{\lambda_{\min}(V)} \lesssim \frac{d}{N}\) with high probability.
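A quick empirical look at this scaling (a sketch; "nice" is taken here to mean unit-norm features drawn uniformly from the sphere, an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
for N in [100, 1000, 10000]:
    Phi = rng.normal(size=(N, d))
    Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)   # unit-norm features (B = 1)
    V = Phi.T @ Phi
    lam_min = np.linalg.eigvalsh(V).min()
    print(N, lam_min, N / d)    # lam_min grows like N/d, so ||V^{-1}|| ~ d/N
```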
From Estimation to Action
Informal Theorem: Let the norm of all \(\varphi\in\mathcal A_t\) be bounded by \(B\) for all \(t\), and suppose that \(\{\varphi_k\}_{k=1}^N\) are chosen at random from a "nice" distribution.
Then with high probability, the suboptimality
- \(\theta_\star^\top \varphi^\star - \theta_\star^\top \hat\varphi \lesssim 2B\sqrt{\beta_t \frac{d}{N} }\)
Cumulative sub-optimality: regret
For a fixed interaction horizon \(T\), how should we trade off exploration and exploitation?
Design algorithms with low regret $$R(T) = \sum_{t=1}^T\max_{a\in\mathcal A} r(x_t, a) - r(x_t, a_t) $$
$$R(T) = \sum_{t=1}^T \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t $$
Explore-then-commit
ETC
- For \(t=1,\dots,N\)
- play \(\varphi_t\) at random
- Estimate \(\hat\theta\) with least squares
- For \(t=N+1,\dots,T\)
- play \(\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi\)
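A minimal implementation sketch of ETC (the `action_sets`/`reward_fn` interfaces and the regularization are placeholder assumptions, not from the lecture):

```python
import numpy as np

def etc_linear_bandit(action_sets, reward_fn, N, lam=1.0):
    """Explore-then-commit for linear bandits (a sketch).

    action_sets: list of length T, each an (n_actions, d) array of feature vectors
    reward_fn:   callable phi -> noisy reward with mean theta_star^T phi
    N:           number of exploration rounds
    """
    rng = np.random.default_rng(0)
    d = action_sets[0].shape[1]
    V, b = lam * np.eye(d), np.zeros(d)

    rewards = []
    for t, A_t in enumerate(action_sets):
        if t < N:                                   # explore: play a random action
            phi = A_t[rng.integers(len(A_t))]
            r = reward_fn(phi)
            V += np.outer(phi, phi)
            b += r * phi
        else:                                       # commit: greedy w.r.t. theta_hat
            if t == N:
                theta_hat = np.linalg.solve(V, b)   # (regularized) least squares
            phi = A_t[np.argmax(A_t @ theta_hat)]
            r = reward_fn(phi)
        rewards.append(r)
    return rewards
```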
The regret has two components
$$R(T) = \underbrace{\sum_{t=1}^{N} \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_1} + \underbrace{\sum_{t=N+1}^{T} \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_2}$$
Explore-then-commit
Suppose that \(B\) bounds the norm of \(\varphi\) and \(\|\theta_\star\|\leq 1\).
- \(\theta_\star^\top \varphi_\star - \theta_\star^\top \varphi \leq 2B\)
Then we have \(R_1\leq 2BN\)
Explore-then-commit
Suppose that \(\max_t\beta_t\leq \beta\).
Using the sub-optimality result, with high probability \(R_2 \lesssim (T-N)\,2B\sqrt{\beta \frac{d}{N} }\)
Explore-then-commit
The regret is bounded with high probability by
$$R(T) \lesssim 2BN + 2BT\sqrt{\beta \frac{d}{N} }$$
Choosing \(N=T^{2/3}\) balances the two terms (\(N\approx T/\sqrt N \Rightarrow N\approx T^{2/3}\)) and leads to sublinear regret
- \(R(T) \lesssim T^{2/3}\)
Upper Confidence Bound
Adaptive perspective: optimism in the face of uncertainty
Instead of exploring randomly, focus exploration on promising actions
UCB
- Initialize \(V_0=\lambda I\), \(b_0=0\)
- For \(t=1,\dots,T\)
- play \(\displaystyle \varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi\)
- update \(V_t = V_{t-1}+\varphi_t\varphi_t^\top\)
and \(b_t = b_{t-1}+r_t\varphi_t\)
- \(\hat\theta_t = V_t^{-1}b_t\)
- \(\mathcal C_t = \{\theta \mid \|\theta-\hat\theta_t \|^2_{V_t}\leq \beta_t\}\)
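A minimal implementation sketch of this rule (the `action_sets`/`reward_fn` interfaces and the fixed `beta` are placeholder assumptions; in the analysis \(\beta_t\) grows slowly with \(t\)):

```python
import numpy as np

def linucb(action_sets, reward_fn, beta=2.0, lam=1.0):
    """Optimistic (UCB) algorithm for linear bandits (a sketch).

    Plays argmax over actions of theta_hat^T phi + sqrt(beta) * ||phi||_{V^{-1}},
    which equals the maximum of theta^T phi over the confidence ellipsoid.
    """
    d = action_sets[0].shape[1]
    V, b = lam * np.eye(d), np.zeros(d)

    rewards = []
    for A_t in action_sets:
        theta_hat = np.linalg.solve(V, b)
        V_inv = np.linalg.inv(V)
        ucb = A_t @ theta_hat + np.sqrt(beta * np.einsum('id,dk,ik->i', A_t, V_inv, A_t))
        phi = A_t[np.argmax(ucb)]                  # optimism in the face of uncertainty
        r = reward_fn(phi)
        V += np.outer(phi, phi)
        b += r * phi
        rewards.append(r)
    return rewards
```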
Sub-optimality of optimism
- Define the optimal action \(\varphi_t^\star = \arg\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi\)
- Suppose we choose optimistic \(\hat\varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}} \theta^\top \varphi\)
- let \(\tilde\theta_t = \arg\max_{\theta\in\mathcal C_{t-1}} \theta^\top \hat\varphi_t\)
The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)
- \(\leq \tilde\theta_t^\top \hat\varphi_t - \theta_\star^\top \hat\varphi_t\quad \) by choice of \(\hat\varphi_t\) and since \(\theta_\star\in\mathcal C_{t-1}\) with high probability
- \(= (\tilde \theta_t - \theta_\star)^\top \hat\varphi_t\)
- \( = (V_{t-1}^{1/2}(\tilde \theta_t - \theta_\star))^\top(V_{t-1}^{-1/2}\hat\varphi_t )\)
- \(\leq \|\tilde \theta_t - \theta_\star\|_{V_{t-1}} \|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad\) by Cauchy-Schwarz
- \(\leq 2\sqrt{\beta_{t-1}}\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad \) since both \(\tilde \theta_t\) and \(\theta_\star\) lie in the confidence set \(\mathcal C_{t-1}\)
Upper Confidence Bound
Proof Sketch: \(R(T) = \sum_{t=1}^T \theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)
- \(\leq 2\sqrt{\beta}\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad \) from previous slide
- \(\leq 2\sqrt{T\beta\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}^2}\quad\) by Cauchy-Schwarz
- \(\lesssim \sqrt{T}\quad\) following Lemma 19.4 in Bandit Algorithms (\(\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}^2 \lesssim d\log T\)), hiding dependence on \(d\), \(\beta\), and log factors
The regret is bounded with high probability by
$$R(T) \lesssim \sqrt{T}$$
After fall break: action in a dynamical world (optimal control)
Recap
- Linear contextual bandits
- Confidence bounds
- Sub-optimality & regret
- ETC (\(T^{2/3}\)) and UCB (\(\sqrt{T}\))
Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari