Contextual Bandits

ML in Feedback Sys #13

Prof Sarah Dean


  • Final project proposal due October 7
  • Upcoming paper presentations starting 10/24
    • Meet with Atul in advance


Prediction in a streaming world

Data \(\{(x_t, y_t)\}\), model \(f_t:\mathcal X\to\mathcal Y\), prediction \(\hat y_{t}\)

Action in a streaming world

Data \(\{(x_t, a_t, r_t)\}\), policy \(\pi_t:\mathcal X\to\mathcal A\)

Linear Contextual Bandits

  • for \(t=1,2,...\)
    • receive context \(x_t\)
    • take action \(a_t\in\mathcal A\)
    • receive reward \(r_t\) with $$\mathbb E[r_t] = \theta_\star^\top \varphi(x_t, a_t)$$

Linear Contextual Bandits

  • for \(t=1,2,...\)
    • receive action set \(\mathcal A_t\)
    • take action \(\varphi_t\in\mathcal A_t\)
    • receive reward \(r_t\) with $$\mathbb E[r_t] = \theta_\star^\top \varphi_t$$
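The interaction protocol above can be sketched as a short simulation. This is only an illustration: the dimensions, the noise level `sigma`, the Gaussian action sets, and the uniformly random placeholder policy are all assumptions made for the example, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, sigma = 3, 5, 4, 0.1        # illustrative sizes (assumed)
theta_star = rng.normal(size=d)      # unknown to the learner

history = []
for t in range(T):
    A_t = rng.normal(size=(K, d))    # action set: K feature vectors in R^d
    phi_t = A_t[rng.integers(K)]     # placeholder policy: pick uniformly at random
    # Noisy reward whose mean is theta_star^T phi_t
    r_t = theta_star @ phi_t + sigma * rng.normal()
    history.append((phi_t, r_t))
```

A real algorithm replaces the placeholder policy with a rule that uses `history`.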

Related Goals:

  • choose best action: \(\qquad\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi\)
  • predict reward of action: \(\quad\hat r_t \approx \theta_\star^\top \varphi_t\)
  • estimate reward function: \(\quad\hat \theta \approx \theta_\star\)

Prediction errors

How much should I trust my predicted reward \(\hat\theta^\top \varphi\) if observed rewards are corrupted by Gaussian noise?

$$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\sum_{k=1}^t\varphi_k\varphi_k^\top $$

If \(r_k = \theta_\star^\top\varphi_k+\varepsilon_k\) with independent noise \(\varepsilon_k\sim\mathcal N(0,\sigma^2)\), the prediction error is \((\hat\theta_t-\theta_\star)^\top \varphi =\sum_{k=1}^t \varepsilon_k (V_t^{-1}\varphi_k)^\top \varphi \sim \mathcal N(0,\sigma^2\underbrace{\varphi^\top V_t^{-1}\varphi}_{ \|\varphi\|^2_{V_t^{-1}} } )\)

With probability at least \(1-\delta\), we have \(|(\hat\theta_t-\theta_\star)^\top \varphi| \leq  \sigma\sqrt{2\log(2/\delta) }\|\varphi\|_{V_t^{-1}}\)
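A minimal numerical sketch of the estimate and its confidence width, assuming features drawn independently of the noise. The helper names `least_squares` and `confidence_width`, and all numeric values, are hypothetical choices for illustration.

```python
import numpy as np

def least_squares(Phi, r):
    """Return (theta_hat, V) for features Phi (t x d) and rewards r (t,)."""
    V = Phi.T @ Phi                              # V_t = sum_k phi_k phi_k^T
    theta_hat = np.linalg.solve(V, Phi.T @ r)    # V_t^{-1} sum_k phi_k r_k
    return theta_hat, V

def confidence_width(phi, V, sigma, delta):
    """sigma * sqrt(2 log(2/delta)) * ||phi||_{V^{-1}}."""
    norm_sq = phi @ np.linalg.solve(V, phi)      # phi^T V^{-1} phi
    return sigma * np.sqrt(2 * np.log(2 / delta) * norm_sq)

rng = np.random.default_rng(0)
d, t, sigma, delta = 3, 200, 0.5, 0.05           # assumed values
theta_star = np.array([1.0, -0.5, 0.25])
Phi = rng.normal(size=(t, d))
r = Phi @ theta_star + sigma * rng.normal(size=t)

theta_hat, V = least_squares(Phi, r)
phi = np.array([1.0, 0.0, 0.0])
err = abs((theta_hat - theta_star) @ phi)        # actual prediction error
width = confidence_width(phi, V, sigma, delta)   # bound holding w.p. >= 1 - delta
```

Comparing `err` against `width` over many noise draws gives an empirical check of the \(1-\delta\) coverage.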

Correction: last lecture, expressions like \(\|Mx\|_2\) should have instead been $$\sqrt{x^\top Mx} = \|M^{1/2}x\|_2 =: \|x\|_M$$

where notation \(M^{1/2}\) means a matrix such that $$(M^{1/2})^\top M^{1/2} = M$$

For symmetric matrices with non-negative eigenvalues (i.e. "positive semi-definite matrices"), we know that \(M=V\Lambda V^\top\) with \(V\) orthogonal and \(\Lambda\) diagonal

So the "square root" of a PSD matrix is $$ M^{1/2} = \Lambda^{1/2}V^\top $$

For the diagonal matrix \(\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_n)\), we have \(\Lambda^{1/2}=\mathrm{diag}(\sqrt{\lambda_1},\dots,\sqrt{\lambda_n})\)
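As a quick check of this construction, the sketch below builds \(\Lambda^{1/2}V^\top\) from an eigendecomposition and uses it to evaluate a weighted norm. The function name `psd_sqrt` and the test matrix are illustrative assumptions.

```python
import numpy as np

def psd_sqrt(M):
    """Return S = Lambda^{1/2} V^T, so that S.T @ S equals the symmetric PSD matrix M."""
    eigvals, eigvecs = np.linalg.eigh(M)        # M = V diag(eigvals) V^T
    eigvals = np.clip(eigvals, 0.0, None)       # guard against tiny negative round-off
    return np.sqrt(eigvals)[:, None] * eigvecs.T

M = np.array([[3.0, 1.0], [1.0, 3.0]])
x = np.array([0.0, 1.0])

S = psd_sqrt(M)
weighted_norm = np.linalg.norm(S @ x)           # equals sqrt(x^T M x)
```

Note that this square root is not symmetric in general; it only needs to satisfy \((M^{1/2})^\top M^{1/2}=M\).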

Confidence Ellipsoids

How much should I trust the estimate \(\hat \theta\)?

Define the confidence ellipsoid $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$

For the right choice of \(\beta_t\), it is possible to guarantee \(\theta_\star\in\mathcal C_t\) with high probability

Exercise: For a fixed action \(\varphi\), show that $$\max_{\theta\in\mathcal C_t} \theta^\top \varphi \leq \hat\theta_t^\top \varphi +\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$

$$\min_{\theta\in\mathcal C_t}\theta^\top \varphi \geq \hat\theta_t^\top\varphi -\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$

example: \(K=2\) and we've pulled the arms 2 and 1 times respectively

\(V_t = \begin{bmatrix} 2& \\& 1\end{bmatrix}\)

\((\hat \mu_1,\hat\mu_2)\)

Confidence set

\(2(\mu_1-\hat\mu_1)^2 + (\mu_2-\hat\mu_2)^2 \leq \beta_t\)

Pulling arm 1: \(\hat \mu_1 \pm \sqrt{\beta_t/2}\)

Pulling arm 2: \(\hat \mu_2 \pm \sqrt{\beta_t}\)

$$ \left\{\begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 0\\ 1\end{bmatrix}\right \} $$

example: \(d=2\) linear bandits

\(V_t = \begin{bmatrix} 3&1 \\1& 3\end{bmatrix}\)

\(\hat \theta\)

Trying action \(a=[0,1]\):

  • \(\hat \theta^\top a\pm \sqrt{\beta_t a^\top V_t^{-1}a}\)
    • = \(\hat \theta_2 \pm \sqrt{3\beta_t/8}\)

$$ \left\{\begin{bmatrix} 1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ -1\end{bmatrix}\right \} $$

\(V_t^{-1} = \frac{1}{8} \begin{bmatrix} 3&-1 \\-1& 3\end{bmatrix}\)

Exercise: For fixed \(\varphi\), show that the best/worst case elements of \(\mathcal C_t\) are given by $$\theta = \hat\theta_t \pm\frac{\sqrt{\beta_t} V_t^{-1}\varphi}{\|\varphi\|_{V_t^{-1}}}$$
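The \(d=2\) example can be verified numerically. The sketch below rebuilds \(V_t\) from the three played actions, checks the width for \(a=[0,1]\), and confirms that the candidate \(\hat\theta_t + \sqrt{\beta_t}V_t^{-1}a/\|a\|_{V_t^{-1}}\) lies on the boundary of the ellipsoid and attains the upper bound. The values of `theta_hat` and `beta` are hypothetical.

```python
import numpy as np

# Actions played so far in the d = 2 example
actions = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0]])
V = actions.T @ actions                  # sum of outer products: [[3, 1], [1, 3]]
V_inv = np.linalg.inv(V)                 # (1/8) [[3, -1], [-1, 3]]

theta_hat = np.array([0.2, -0.1])        # hypothetical estimate
beta = 1.0                               # hypothetical ellipsoid radius
a = np.array([0.0, 1.0])

norm_a = np.sqrt(a @ V_inv @ a)          # ||a||_{V^{-1}} = sqrt(3/8)
upper = theta_hat @ a + np.sqrt(beta) * norm_a

# Candidate best-case parameter
theta_best = theta_hat + np.sqrt(beta) * (V_inv @ a) / norm_a
diff = theta_best - theta_hat
radius_sq = diff @ V @ diff              # squared V-norm of the offset
```

Here `radius_sq` equals `beta` and `theta_best @ a` equals `upper`, so the bound in the exercise is attained.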


To handle cases where \(\{\varphi_k\}\) do not span \(\mathbb R^d\) (so \(\sum_k\varphi_k\varphi_k^\top\) is not invertible), we consider regularized least squares

$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2 + \lambda\|\theta\|_2^2$$

Now we have

$$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\lambda I+\sum_{k=1}^t\varphi_k\varphi_k^\top $$

Confidence ellipsoid takes the same form $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
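A small sketch of why the regularizer helps, using a deliberately rank-deficient set of features. The function name `ridge_estimate` and the data are made up for illustration.

```python
import numpy as np

def ridge_estimate(Phi, r, lam):
    """Regularized least squares: V = lam*I + sum_k phi_k phi_k^T, theta_hat = V^{-1} sum_k phi_k r_k."""
    d = Phi.shape[1]
    V = lam * np.eye(d) + Phi.T @ Phi
    theta_hat = np.linalg.solve(V, Phi.T @ r)
    return theta_hat, V

# All features lie along the first axis, so Phi^T Phi is singular
Phi = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
r = np.array([1.0, 2.0, 3.0])

gram_rank = np.linalg.matrix_rank(Phi.T @ Phi)    # 1 < d = 2
theta_hat, V = ridge_estimate(Phi, r, lam=1.0)    # well defined since V is positive definite
```

With `lam=1.0` the estimate is `[14/15, 0]`: shrunk slightly toward zero along the observed direction and exactly zero along the unobserved one.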

From Estimation to Action

How does estimation error affect suboptimality?

  • Define the optimal action \(\varphi_t^\star = \arg\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi\)
  • Suppose we choose \(\hat\varphi_t = \arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top \varphi\)

The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq  \max_{\theta\in\mathcal C}\theta^\top \varphi_t^\star - \min_{\theta\in\mathcal C}\theta^\top \hat\varphi_t\qquad\) since \(\theta_\star\in\mathcal C\) with high probability
  • \(\leq  \hat\theta^\top\varphi_t^\star +\sqrt{\beta}\|\varphi_t^\star \|_{V^{-1}}-(\hat\theta^\top\hat\varphi_t -\sqrt{\beta}\|\hat\varphi_t\|_{V^{-1}})\quad\) by the confidence ellipsoid bounds
  • \(=  \underbrace{\hat\theta^\top\varphi_t^\star - \hat\theta^\top\hat\varphi_t }_{\leq 0} +\sqrt{\beta}(\|\varphi_t^\star \|_{V^{-1}}+\|\hat\varphi_t\|_{V^{-1}})\quad\) where the first term is non-positive by choice of \(\hat\varphi_t\)

From Estimation to Action

The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq \sqrt{\beta}(\|\varphi_t^\star \|_{V^{-1}}+\|\hat\varphi_t\|_{V^{-1}})\)
  • \(\leq 2\sqrt{\beta \max_{\varphi\in\mathcal A_t} \varphi^\top V^{-1}\varphi } \)

This perspective motivates techniques in experiment design

  • e.g. select \(\{\varphi_k\}\) to ensure that \(V\) has a large minimum eigenvalue (so \(V^{-1}\) will have small eigenvalues)

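When \(\{\varphi_k\}_{k=1}^N\) are chosen at random from a "nice" distribution,

\(\|V^{-1}\| = \frac{1}{\lambda_{\min}(V)} \lesssim \frac{d}{N}\) with high probability, so that \(\varphi^\top V^{-1}\varphi \leq \|\varphi\|_2^2\,\|V^{-1}\| \lesssim \|\varphi\|_2^2\,\frac{d}{N}\).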

From Estimation to Action

Informal Theorem: Let the norm of all \(\varphi\in\mathcal A_t\) be bounded by \(B\) for all \(t\). Suppose that \(\{\varphi_k\}_{k=1}^N\) are chosen at random from a "nice" distribution. Then with high probability, the suboptimality satisfies

  • \(\theta_\star^\top \varphi^\star - \theta_\star^\top \hat\varphi \lesssim 2B\sqrt{\beta_t \frac{d}{N} }\)

Cumulative sub-optimality: regret

For a fixed interaction horizon \(T\), how should we trade off exploration and exploitation?

Design algorithms with low regret $$R(T) = \sum_{t=1}^T\max_{a\in\mathcal A} r(x_t, a) - r(x_t, a_t) $$

$$R(T) = \sum_{t=1}^T \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t $$



  • For \(t=1,\dots,N\)
    • play \(\varphi_t\) at random
  • Estimate \(\hat\theta\) with least squares
  • For \(t=N+1,\dots,T\)
    • play \(\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi\)
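The explore-then-commit steps above can be sketched as follows. The environment is an assumption for illustration: action sets are `K` random unit-norm vectors (so \(B=1\)), rewards have Gaussian noise of scale `sigma`, and a tiny ridge term `lam` is added so the least-squares solve is always well posed.

```python
import numpy as np

def explore_then_commit(theta_star, T, N, K=10, sigma=0.1, lam=1e-6, seed=0):
    """Simulate explore-then-commit and return the cumulative regret R(T)."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    V = lam * np.eye(d)              # tiny ridge term (assumption, for invertibility)
    b = np.zeros(d)
    theta_hat = np.zeros(d)
    regret = 0.0
    for t in range(1, T + 1):
        A_t = rng.normal(size=(K, d))
        A_t /= np.linalg.norm(A_t, axis=1, keepdims=True)   # unit-norm actions
        if t <= N:
            idx = rng.integers(K)                    # explore: play at random
        else:
            idx = int(np.argmax(A_t @ theta_hat))    # commit: greedy w.r.t. theta_hat
        phi = A_t[idx]
        r = theta_star @ phi + sigma * rng.normal()
        if t <= N:
            V += np.outer(phi, phi)
            b += r * phi
            if t == N:
                theta_hat = np.linalg.solve(V, b)    # estimate once, after exploring
        regret += np.max(A_t @ theta_star) - theta_star @ phi
    return regret

theta_star = np.array([0.6, -0.8, 0.0])              # ||theta_star|| = 1
R = explore_then_commit(theta_star, T=500, N=63)     # N is roughly T^(2/3)
```

Rerunning with different values of `N` shows the trade-off: small `N` gives a poor estimate, large `N` wastes rounds on random play.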

The regret has two components

$$R(T) = \underbrace{\sum_{t=1}^{N}  \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_1} + \underbrace{\sum_{t=N+1}^{T}  \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_2}$$


Suppose that \(B\) bounds the norm of \(\varphi\) and \(\|\theta_\star\|\leq 1\).

  • \(\theta_\star^\top \varphi_\star - \theta_\star^\top \varphi \leq 2B\)

Then we have \(R_1\leq 2BN\)


Suppose that \(\max_t\beta_t\leq \beta\).

Using the sub-optimality result, with high probability \(R_2 \lesssim (T-N)2B\sqrt{\beta \frac{d}{N} }\)


The regret is bounded with high probability by

$$R(T) \lesssim  2BN +  2BT\sqrt{\beta \frac{d}{N} }$$

Choosing \(N=T^{2/3}\) leads to sublinear regret

  • \(R(T) \lesssim T^{2/3}\)
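To see where the exponent comes from, here is a short worked calculation treating \(B\), \(\beta\), and \(d\) as constants. Setting \(N=T^\alpha\) in the bound gives

$$2BN + 2BT\sqrt{\beta \tfrac{d}{N}} = 2B\,T^{\alpha} + 2B\sqrt{\beta d}\;T^{1-\alpha/2}$$

The first exponent grows with \(\alpha\) and the second shrinks, so the larger of the two is minimized when they are equal: \(\alpha = 1-\alpha/2\), i.e. \(\alpha = 2/3\). Substituting back,

$$R(T) \lesssim 2B\left(1+\sqrt{\beta d}\right)T^{2/3}$$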

Upper Confidence Bound

Adaptive perspective: optimism in the face of uncertainty

Instead of exploring randomly, focus exploration on promising actions


  • Initialize \(V_0=\lambda I\), \(b_0=0\), and \(\mathcal C_0\) centered at \(\hat\theta_0 = V_0^{-1}b_0 = 0\)
  • For \(t=1,\dots,T\)
    • play \(\displaystyle \varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi\)
    • update \(V_t = V_{t-1}+\varphi_t\varphi_t^\top\)
      and \(b_t = b_{t-1}+r_t\varphi_t\)
    • update \(\hat\theta_t = V_t^{-1}b_t\)
      and \(\mathcal C_t = \{\theta \mid \|\theta-\hat\theta_t \|_{V_t}^2\leq \beta_t\}\)
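A minimal sketch of this loop, using the closed form \(\max_{\theta\in\mathcal C_{t-1}}\theta^\top\varphi = \hat\theta_{t-1}^\top\varphi+\sqrt{\beta}\|\varphi\|_{V_{t-1}^{-1}}\) for the inner maximization. Several things here are assumptions for illustration: a constant radius `beta` instead of a schedule \(\beta_t\), random unit-norm action sets of size `K`, and Gaussian reward noise of scale `sigma`.

```python
import numpy as np

def lin_ucb(theta_star, T, K=10, sigma=0.1, lam=1.0, beta=1.0, seed=0):
    """Simulate the optimistic (UCB) rule and return the cumulative regret R(T)."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    V = lam * np.eye(d)                      # V_0 = lambda * I
    b = np.zeros(d)                          # b_0 = 0
    theta_hat = np.zeros(d)                  # theta_hat_0 = V_0^{-1} b_0
    regret = 0.0
    for t in range(T):
        A_t = rng.normal(size=(K, d))
        A_t /= np.linalg.norm(A_t, axis=1, keepdims=True)   # unit-norm actions
        V_inv = np.linalg.inv(V)
        # ||phi||_{V^{-1}} for every candidate action (one quadratic form per row)
        widths = np.sqrt(np.einsum("kd,de,ke->k", A_t, V_inv, A_t))
        ucb = A_t @ theta_hat + np.sqrt(beta) * widths      # optimistic reward estimates
        phi = A_t[int(np.argmax(ucb))]
        r = theta_star @ phi + sigma * rng.normal()
        V += np.outer(phi, phi)              # V_t = V_{t-1} + phi phi^T
        b += r * phi                         # b_t = b_{t-1} + r_t phi_t
        theta_hat = np.linalg.solve(V, b)    # theta_hat_t = V_t^{-1} b_t
        regret += np.max(A_t @ theta_star) - theta_star @ phi
    return regret

theta_star = np.array([0.6, -0.8, 0.0])      # ||theta_star|| = 1
R = lin_ucb(theta_star, T=500)
```

Recomputing `V_inv` each round keeps the sketch short; a rank-one update of the inverse would be cheaper for large `d`.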

Sub-optimality of optimism

  • Define the optimal action \(\varphi_t^\star = \arg\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi\)
  • Suppose we choose optimistic \(\hat\varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}} \theta^\top \varphi\)
    • let \(\tilde\theta_t = \arg\max_{\theta\in\mathcal C_{t-1}} \theta^\top \hat\varphi_t\)

The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq  \tilde\theta_t^\top \hat\varphi_t - \theta_\star^\top \hat\varphi_t\quad \) by optimism: \(\theta_\star\in\mathcal C_{t-1}\), so \(\theta_\star^\top\varphi_t^\star\leq\max_{\theta\in\mathcal C_{t-1}}\theta^\top\varphi_t^\star\leq\tilde\theta_t^\top\hat\varphi_t\)
  • \(=  (\tilde \theta_t - \theta_\star)^\top \hat\varphi_t\)
  • \( = (V_{t-1}^{1/2}(\tilde \theta_t - \theta_\star))^\top(V_{t-1}^{-1/2}\hat\varphi_t )\)
  • \(\leq \|\tilde \theta_t - \theta_\star\|_{V_{t-1}} \|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad\) by Cauchy-Schwarz
  • \(\leq 2\sqrt{\beta_{t-1}}\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad \) since \(\tilde\theta_t\) and \(\theta_\star\) both lie in the confidence ellipsoid \(\mathcal C_{t-1}\) (triangle inequality)

Upper Confidence Bound

Proof Sketch: \(R(T) = \sum_{t=1}^T \theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq 2\sqrt{\beta}\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad \) from previous slide
  • \(\leq 2\sqrt{T\beta\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}^2}\quad\) by Cauchy-Schwarz
  • \(\lesssim \sqrt{T}\) (up to logarithmic factors in \(T\)) following Lemma 19.4 in Bandit Algorithms

The regret is bounded with high probability by

$$R(T) \lesssim  \sqrt{T}$$

After fall break: action in a dynamical world (optimal control)


  • Linear contextual bandits
  • Confidence bounds
  • Sub-optimality & regret
  • ETC (\(T^{2/3}\)) and UCB (\(\sqrt{T}\))

Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari
