Contextual Bandits

ML in Feedback Sys #11

Fall 2025, Prof Sarah Dean

Action in a streaming world

(Feedback loop diagram) A policy \(\pi_t:\mathcal X\to\mathcal A\) maps each observation (context) \(x_t\) to an action \(a_t\), and over time we accumulate data \(\{(x_t, a_t, r_t)\}\).

Contextual bandits

"What we do"

  • Initialize a model \(\hat r(x, a) = \hat\theta_0^\top \varphi(x,a)\) and \(V_0=\lambda I\)
  • For \(t=1,2,...\)
    • receive context \(x_t\)
    • take action \(a_t\) according to $$a_t = \arg\max_{a\in\mathcal A} \hat\theta_{t-1}^\top \varphi(x_t,a) + \sqrt{\beta_t}\|\varphi(x_t,a)\|_{V_{t-1}^{-1}} $$
    • receive reward \(r_t\)
    • update model \(\hat\theta_{t}\) and confidence parameters \(V_{t}\) according to least squares with \(((x_t,a_t),r_t)\)
  • This algorithm is called LinUCB (Upper Confidence Bound)

Notation: \(\|x\|^2_M = x^\top Mx\)
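A minimal sketch of one LinUCB round in numpy (function names like `select_action` and `update` are illustrative, not from the slides; `Phi` holds the candidate feature vectors \(\varphi(x_t,a)\) for each arm):

```python
import numpy as np

def select_action(theta_hat, V, Phi, beta):
    """Phi[a] = phi(x_t, a); pick argmax of theta_hat^T phi + sqrt(beta) * ||phi||_{V^{-1}}."""
    Vinv = np.linalg.inv(V)
    ucb = Phi @ theta_hat + np.sqrt(beta * np.einsum("ad,de,ae->a", Phi, Vinv, Phi))
    return int(np.argmax(ucb))

def update(V, b, phi, r):
    """Update least-squares statistics after observing reward r for feature phi."""
    V = V + np.outer(phi, phi)
    b = b + r * phi
    theta_hat = np.linalg.solve(V, b)
    return V, b, theta_hat
```

In the loop of the pseudocode above, one would call `select_action` with \(\hat\theta_{t-1}\), \(V_{t-1}\), the features \(\varphi(x_t,\cdot)\), and \(\beta_t\), then `update` with the observed reward \(r_t\).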

Contextual bandits

"Why we do it"

  • Fact 1: The linear reward model assumption can represent many settings $$\mathbb E[r_t|x_t,a_t] = r(x_t,a_t) = \theta_\star^\top \varphi(x_t,a_t)$$
  • Fact 2: As long as rewards are bounded and linear, with high probability \(\|\theta_\star-\hat\theta_{t-1}\|_{V_{t-1}}^2\leq \beta_t \) and  $$\hat\theta_{t-1}^\top \varphi(x_t,a) + \sqrt{\beta_t}\|\varphi(x_t,a)\|_{V_{t-1}^{-1}} = \max_{\|\theta-\hat\theta_{t-1}\|_{V_{t-1}}^2\leq \beta_t} \theta^\top \varphi(x_t,a)$$
  • Fact 3: Under the assumptions above, the average reward of actions chosen by LinUCB converges to optimal for \(C\) depending only logarithmically on \(T\) $$ \frac{1}{T}\sum_{t=1}^T r(x_t,a_t) \geq \frac{1}{T}\sum_{t=1}^T \max_{a\in\mathcal A} r(x_t, a)  - \frac{C}{\sqrt{T}}$$

Linear reward model

e.g. personalization

  • Linear contextual bandits: contexts \(x_t\in\mathbb R^d\) and discrete action set \(\mathcal A = \{1,\dots,K\}\)
    • average reward \(\mathbb E[r_t\mid x, a] = r(x, a) = \theta_a^\top x\)
  • Linear bandits: continuous action set \(\mathcal A = \{a\mid\|a\|\leq 1\}\)
    • average reward \(\mathbb E[r_t\mid a]=r(a) = \theta_\star^\top a\)
  • \(K\)-armed bandits: discrete action set \(\mathcal A = \{1,\dots,K\}\)
    • each action has average reward \(\mathbb E[r_t\mid a] = r(a) = \mu_a\)

Taking action \(a_t\in\mathcal A\) in context \(x_t\) yields reward $$r_t = \langle\theta_\star, \varphi(x_t, a_t)\rangle + \varepsilon_t,\qquad \mathbb E[\varepsilon_t]=0,~~\mathbb E[\varepsilon_t^2] = \sigma^2$$

e.g. betting


Linear reward model

Taking action \(a_t\in\mathcal A\) in context \(x_t\) yields reward $$r_t = \langle\theta_\star, \varphi(x_t, a_t)\rangle + \varepsilon_t$$

  • \(K\)-armed bandits
    • \(\mathbb E[r(a)] = \begin{bmatrix} \mu_1\\\vdots\\ \mu_K\end{bmatrix}^\top \begin{bmatrix} \mathbf 1\{a=1\}\\\vdots\\ \mathbf 1\{a=K\}\end{bmatrix} \), so \(\varphi(a) = e_a\)
  • \(K\)-armed contextual linear bandits
    • \(\mathbb E[r(x, a)] = \begin{bmatrix} \theta_1 \\\vdots\\ \theta_K \end{bmatrix}^\top \begin{bmatrix} x\mathbf 1\{a=1\}\\\vdots\\ x\mathbf 1\{a=K\}\end{bmatrix} \), so \(\varphi(x,a) = e_a \otimes x\)
  • Linear bandits
    • \(\mathbb E[r(a)] = \theta_\star^\top a\), so \(\varphi(a)=a\)
  • Linear contextual bandits
    • \(\mathbb E[r(x, a)] = ( \Theta_\star x)^\top a = \mathrm{vec}(\Theta_\star)^\top \mathrm{vec}(ax^\top)\), so \(\varphi(x, a)= \mathrm{vec}(ax^\top)\)
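A sketch of these feature maps in numpy (helper names are illustrative, not from the slides), checking that each one reproduces the corresponding linear reward:

```python
import numpy as np

def phi_k_armed(a, K):
    """K-armed bandit: phi(a) = e_a (arms indexed 0, ..., K-1 here)."""
    e = np.zeros(K); e[a] = 1.0
    return e

def phi_contextual_k_armed(x, a, K):
    """K-armed contextual: phi(x, a) = e_a kron x."""
    return np.kron(phi_k_armed(a, K), x)

def phi_linear_contextual(x, a):
    """Linear contextual: phi(x, a) = vec(a x^T), with vec stacking columns."""
    return np.outer(a, x).flatten(order="F")

rng = np.random.default_rng(0)
d, K = 3, 4
x = rng.normal(size=d)

# K-armed: mu^T e_a = mu_a
mu = rng.normal(size=K)
assert np.isclose(mu @ phi_k_armed(1, K), mu[1])

# K-armed contextual: stacked parameters dotted with e_a kron x give theta_a^T x
thetas = rng.normal(size=(K, d))
assert np.isclose(thetas.flatten() @ phi_contextual_k_armed(x, 2, K), thetas[2] @ x)

# Linear contextual: (Theta x)^T a = vec(Theta)^T vec(a x^T)
Theta, a = rng.normal(size=(d, d)), rng.normal(size=d)
assert np.isclose((Theta @ x) @ a, Theta.flatten(order="F") @ phi_linear_contextual(x, a))
```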

Recap: least squares estimation

Let \(\varphi_t = \varphi(x_t,a_t)\). Then, using data \(\{(\varphi_k, r_k)\}_{k=1}^t\),

$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2$$

Assuming the features \(\{\varphi_k\}\) span \(\mathbb R^d\) (so that \(V_t\) is invertible), $$\hat\theta_t ={\underbrace{ \left(\sum_{k=1}^t \varphi_k\varphi_k^\top\right)}_{V_t}}^{-1}\sum_{k=1}^t \varphi_k r_k $$
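A quick numerical sanity check of this closed form (a sketch; dimensions and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 200
theta_star = rng.normal(size=d)
Phi = rng.normal(size=(n, d))                # features phi_1, ..., phi_n (full rank w.h.p.)
r = Phi @ theta_star + 0.1 * rng.normal(size=n)

V = Phi.T @ Phi                              # V_t = sum_k phi_k phi_k^T
theta_hat = np.linalg.solve(V, Phi.T @ r)    # V_t^{-1} sum_k phi_k r_k

# agrees with the least-squares minimizer of sum_k (theta^T phi_k - r_k)^2
assert np.allclose(theta_hat, np.linalg.lstsq(Phi, r, rcond=None)[0])
```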

Last lecture we discussed that a \(1-\delta\) confidence interval on the prediction \(\theta_\star^\top\varphi\) is $$\Big[\hat\theta_t^\top \varphi \pm  \sqrt{\beta_t} \|\varphi\|_{V_t^{-1}}\Big]$$

Confidence ellipsoid

Define the confidence ellipsoid $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$

Fact: For the right choice of \(\beta_t\), with high probability, \(\theta_\star\in\mathcal C_t\)

Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari

Exercise: For a fixed feature vector \(\varphi\), show that $$\max_{\theta\in\mathcal C_t} \theta^\top \varphi \leq \hat\theta_t^\top \varphi +\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}\qquad\text{and}\qquad \min_{\theta\in\mathcal C_t}\theta^\top \varphi \geq \hat\theta_t^\top\varphi -\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
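One way to see the upper bound (a sketch; the lower bound is symmetric): for any \(\theta\in\mathcal C_t\),

$$\theta^\top\varphi = \hat\theta_t^\top\varphi + \big(V_t^{1/2}(\theta-\hat\theta_t)\big)^\top\big(V_t^{-1/2}\varphi\big) \leq \hat\theta_t^\top\varphi + \|\theta-\hat\theta_t\|_{V_t}\,\|\varphi\|_{V_t^{-1}} \leq \hat\theta_t^\top\varphi + \sqrt{\beta_t}\,\|\varphi\|_{V_t^{-1}}$$

using Cauchy-Schwarz for the first inequality and the definition of \(\mathcal C_t\) for the second.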


example: \(K=2\) armed bandit, where we've pulled the arms 2 and 1 times respectively, so the played features are $$ \left\{\begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 0\\ 1\end{bmatrix}\right \} $$

\(V_t = \begin{bmatrix} 2& \\& 1\end{bmatrix}\) and \(\hat\theta_t = (\hat \mu_1,\hat\mu_2)\)

  • Confidence ellipsoid $$\{ \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
    • \(2(\mu_1-\hat\mu_1)^2 + (\mu_2-\hat\mu_2)^2 \leq \beta_t\)

  • Confidence interval $$\big[\hat\theta_t^\top \varphi \pm  \sqrt{\beta_t} \|\varphi\|_{V_t^{-1}}\big]$$
    • Pulling arm 1: \(\hat \mu_1 \pm \sqrt{\beta_t/2}\)
    • Pulling arm 2: \(\hat \mu_2 \pm \sqrt{\beta_t}\)

Example: \(d=2\) linear bandits, where the played features are $$ \left\{\begin{bmatrix} 1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ -1\end{bmatrix}\right \} $$

\(V_t = \begin{bmatrix} 3&1 \\1& 3\end{bmatrix}\), \(V_t^{-1} = \frac{1}{8} \begin{bmatrix} 3&-1 \\-1& 3\end{bmatrix}\), and least squares estimate \(\hat \theta\)

  • Confidence ellipsoid $$\{ \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
  • Confidence interval $$\big[\hat\theta_t^\top \varphi \pm  \sqrt{\beta_t} \|\varphi\|_{V_t^{-1}}\big]$$
    • Trying action \(a=[0,1]^\top\): \(\hat \theta^\top a\pm \sqrt{\beta_t\, a^\top V_t^{-1}a} = \hat \theta_2 \pm \sqrt{3\beta_t/8}\)
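A quick numerical check of these two examples (a sketch in numpy; the interval half-width is reported as a multiple of \(\sqrt{\beta_t}\), so \(\beta_t\) stays symbolic):

```python
import numpy as np

def interval_halfwidth(features, phi, lam=0.0):
    """Return ||phi||_{V^{-1}} where V = lam*I + sum_k phi_k phi_k^T.

    The confidence interval is theta_hat^T phi +/- sqrt(beta) * ||phi||_{V^{-1}},
    so this is the half-width in units of sqrt(beta).
    """
    d = features.shape[1]
    V = lam * np.eye(d) + features.T @ features
    return np.sqrt(phi @ np.linalg.solve(V, phi))

# Example 1: K=2 armed bandit, arms pulled 2 and 1 times
pulls = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(interval_halfwidth(pulls, np.array([1.0, 0.0])))  # sqrt(1/2) ~ 0.707
print(interval_halfwidth(pulls, np.array([0.0, 1.0])))  # 1.0

# Example 2: d=2 linear bandit with features (1,1), (-1,1), (-1,-1)
feats = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(interval_halfwidth(feats, np.array([0.0, 1.0])))  # sqrt(3/8) ~ 0.612
```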

Regularization

To handle cases where the features \(\{\varphi_k\}\) are not full rank, we consider regularized least squares

$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2 + \lambda\|\theta\|_2^2$$

Now we have

$$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\lambda I+\sum_{k=1}^t\varphi_k\varphi_k^\top $$

The confidence ellipsoid takes the same form $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$

Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari

Cumulative sub-optimality: regret

How to trade-off exploration and exploitation?

The regret of an algorithm choosing \(\{a_t\}\) is defined as $$R(T) = \sum_{t=1}^T\max_{a\in\mathcal A} r(x_t, a) - r(x_t, a_t) $$

Notice that  $$ \frac{1}{T}\sum_{t=1}^T r(x_t,a_t) \geq \frac{1}{T}\sum_{t=1}^T \max_{a\in\mathcal A} r(x_t, a)  - \frac{R(T)}{{T}}$$

  • Sublinear regret implies convergence to optimal average reward

Explore-then-commit

ETC

  • For \(t=1,\dots,N\)
    • play \(\varphi_t\) at random
  • Estimate \(\hat\theta\) with least squares
  • For \(t=N+1,\dots,T\)
    • play \(\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi\), where \(\mathcal A_t = \{\varphi(x_t,a) : a\in\mathcal A\}\)

The regret has two components

$$R(T) = \underbrace{\sum_{t=1}^{N}  \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_1} + \underbrace{\sum_{t=N+1}^{T}  \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_2}$$
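A sketch of ETC in numpy for the linear bandit with the unit-ball action set (a minimal version under assumptions not on the slides: unit-norm features, Gaussian noise, and the closed-form commit step \(\hat\varphi = \hat\theta/\|\hat\theta\|\)):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, N, lam = 5, 5000, 500, 1.0
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)

def random_unit(d):
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

def reward(phi):
    return theta_star @ phi + 0.1 * rng.normal()

# explore: play random unit-norm features for N rounds
Phi = np.stack([random_unit(d) for _ in range(N)])
r = np.array([reward(phi) for phi in Phi])

# estimate with (regularized) least squares
V = lam * np.eye(d) + Phi.T @ Phi
theta_hat = np.linalg.solve(V, Phi.T @ r)

# commit: best unit-ball action under theta_hat is theta_hat / ||theta_hat||
phi_commit = theta_hat / np.linalg.norm(theta_hat)
per_round_gap = theta_star @ theta_star - theta_star @ phi_commit  # optimal action is theta_star

# here B = 1 since features are unit norm, so R_1 <= 2BN = 2N
print(f"exploration regret <= {2 * N}, commit-phase regret ~= {(T - N) * per_round_gap:.2f}")
```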

Explore-then-commit


Suppose that \(B\) bounds the norm of \(\varphi\) and \(\|\theta_\star\|\leq 1\).

  • \(\theta_\star^\top \varphi_\star - \theta_\star^\top \varphi \leq \|\theta_\star\|\,(\|\varphi_\star\|+\|\varphi\|) \leq 2B\) in every round

Then we have \(R_1\leq 2BN\)

Explore-then-commit


  • Define the optimal action \(a^\star_t = \arg\max_{a\in\mathcal A} \theta_\star^\top \varphi(x_t, a)\) and \(\varphi_t^\star = \varphi(x_t,a^\star_t)\)
  • Suppose we choose \(\hat a_t = \arg\max_{a\in\mathcal A} \hat\theta^\top \varphi(x_t, a)\) and let \(\hat\varphi_t = \varphi(x_t,\hat a_t)\)

The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq  \max_{\theta\in\mathcal C}\theta^\top \varphi_t^\star - \min_{\theta\in\mathcal C}\theta^\top \hat\varphi_t\qquad\) since \(\theta_\star\in\mathcal C\) with high probability
  • \(\leq  \hat\theta^\top\varphi_t^\star +\sqrt{\beta}\|\varphi_t^\star \|_{V^{-1}}-(\hat\theta^\top\hat\varphi_t -\sqrt{\beta}\|\hat\varphi_t\|_{V^{-1}})\quad\) by the exercise above
  • \(\leq  \underbrace{\hat\theta^\top\varphi_t^\star - \hat\theta^\top\hat\varphi_t }_{\leq 0} +\sqrt{\beta}(\|\varphi_t^\star \|_{V^{-1}}+\|\hat\varphi_t\|_{V^{-1}})\quad\) where the first term is \(\leq 0\) by choice of \(\hat a_t\)

Explore-then-commit


Suppose that \(\max_t\beta_t\leq \beta\). After \(N\) well-spread exploration rounds, \(\lambda_{\min}(V_N)\gtrsim N/d\), so \(\|\varphi\|_{V_N^{-1}}\lesssim B\sqrt{d/N}\) for every feature.

Using the sub-optimality result, with high probability \(R_2 \lesssim (T-N)\,2B\sqrt{\beta \tfrac{d}{N} }\)

Explore-then-commit

The regret is bounded with high probability by

$$R(T) \lesssim  2BN +  2BT\sqrt{\beta \frac{d}{N} }$$

Choosing \(N=T^{2/3}\) leads to sublinear regret

  • \(R(T) \lesssim T^{2/3}\)
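Where the choice \(N\propto T^{2/3}\) comes from (a sketch, treating \(B\), \(\beta\), \(d\) as constants): balance the two terms of the bound,

$$2BN \approx 2BT\sqrt{\beta\tfrac{d}{N}} \iff N^{3/2} \approx T\sqrt{\beta d} \iff N \approx (\beta d)^{1/3}T^{2/3}, \qquad\text{giving}\quad R(T)\lesssim B(\beta d)^{1/3}T^{2/3}$$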

Upper Confidence Bound

Adaptive perspective: optimism in the face of uncertainty

  • focus exploration on promising actions

UCB

  • Initialize \(V_0=\lambda I\), \(b_0=0\), \(\hat\theta_0=0\)
  • For \(t=1,\dots,T\)
    • play \(\displaystyle a_t = \arg\max_{a\in\mathcal A} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi(x_t,a)\)
    • let \(\varphi_t=\varphi(x_t,a_t)\)
    • update \(V_t = V_{t-1}+\varphi_t\varphi_t^\top\)
      and \(b_t = b_{t-1}+r_t\varphi_t\)
    • \(\hat\theta_t = V_t^{-1}b_t\) and  \(\mathcal C_t = \{\theta : \|\theta-\hat\theta_t \|_{V_t}^2\leq \beta_t\}\)
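Implementation aside (not on the slides): since \(V_t\) changes by a rank-one term each round, its inverse can be maintained directly with the Sherman-Morrison formula, so each round costs \(O(d^2)\) rather than \(O(d^3)\). A minimal sketch:

```python
import numpy as np

def sherman_morrison_update(Vinv, phi):
    """Return (V + phi phi^T)^{-1} given Vinv = V^{-1} (V symmetric)."""
    Vphi = Vinv @ phi
    return Vinv - np.outer(Vphi, Vphi) / (1.0 + phi @ Vphi)

# e.g. after playing phi_t and observing r_t:
#   Vinv = sherman_morrison_update(Vinv, phi_t); b += r_t * phi_t; theta_hat = Vinv @ b
```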

Sub-optimality of optimism

  • Define the optimal action \(a_t^\star = \arg\max_{a\in\mathcal A} \theta_\star^\top \varphi(x_t,a)\)
  • Suppose we choose the optimistic \(\hat a_t = \arg\max_{a\in\mathcal A} \max_{\theta\in\mathcal C_{t-1}} \theta^\top \varphi(x_t,a)\)
    • let \(\varphi_t^\star =\varphi(x_t,a^\star_t)\), \(\hat\varphi_t=\varphi(x_t,\hat a_t)\), and \(\tilde\theta_t = \arg\max_{\theta\in\mathcal C_{t-1}} \theta^\top \varphi(x_t,\hat a_t)\)

The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq  \tilde\theta_t^\top \hat\varphi_t - \theta_\star^\top \hat\varphi_t\quad \) by optimism: \(\theta_\star\in\mathcal C_{t-1}\) w.h.p., and \((\hat a_t,\tilde\theta_t)\) jointly maximize \(\theta^\top\varphi(x_t,a)\)
  • \(=  (\tilde \theta_t - \theta_\star)^\top \hat\varphi_t\)
  • \( = (V_{t-1}^{1/2}(\tilde \theta_t - \theta_\star))^\top(V_{t-1}^{-1/2}\hat\varphi_t )\)
  • \(\leq \|\tilde \theta_t - \theta_\star\|_{V_{t-1}} \|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad\) by Cauchy-Schwarz
  • \(\leq 2\sqrt{\beta_{t-1}}\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad \) since both \(\tilde\theta_t\) and \(\theta_\star\) lie in \(\mathcal C_{t-1}\), so \(\|\tilde\theta_t-\theta_\star\|_{V_{t-1}}\leq 2\sqrt{\beta_{t-1}}\)

Upper Confidence Bound

Proof Sketch: \(R(T) = \sum_{t=1}^T \theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq 2\sqrt{\beta}\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad \) from previous slide
  • \(\leq 2\sqrt{T\beta\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}^2}\quad\) by Cauchy-Schwarz
  • \(\lesssim \sqrt{T}\) (up to \(d\) and logarithmic factors), following Lemma 19.4 (the elliptical potential lemma) in Bandit Algorithms, which gives \(\sum_{t=1}^T\min\{1,\|\hat\varphi_t\|^2_{V_{t-1}^{-1}}\}\lesssim d\log T\)

The regret is bounded with high probability by

$$R(T) \lesssim  \sqrt{T}$$
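A minimal end-to-end simulation sketch of this claim (assumptions not from the slides: a synthetic \(K\)-armed contextual problem, Gaussian noise, and a fixed \(\beta\) rather than the theoretically prescribed schedule); the cumulative regret should grow noticeably slower than \(T\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, lam, beta = 5, 2, 2000, 1.0, 2.0
theta_star = rng.normal(size=d * K)
theta_star /= np.linalg.norm(theta_star)          # so that ||theta_star|| <= 1

def features(x, a):
    """phi(x, a) = e_a kron x for the K-armed contextual setting."""
    phi = np.zeros(d * K)
    phi[a * d:(a + 1) * d] = x
    return phi

V, b, regret = lam * np.eye(d * K), np.zeros(d * K), 0.0
for t in range(1, T + 1):
    x = rng.normal(size=d)
    Vinv = np.linalg.inv(V)
    theta_hat = Vinv @ b
    # optimistic score per arm: theta_hat^T phi + sqrt(beta) * ||phi||_{V^{-1}}
    phis = [features(x, a) for a in range(K)]
    scores = [theta_hat @ p + np.sqrt(beta * p @ Vinv @ p) for p in phis]
    a_t = int(np.argmax(scores))
    r_t = theta_star @ phis[a_t] + 0.1 * rng.normal()
    V += np.outer(phis[a_t], phis[a_t])
    b += r_t * phis[a_t]
    regret += max(theta_star @ p for p in phis) - theta_star @ phis[a_t]

print(f"cumulative regret after T={T} rounds: {regret:.1f}")
```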

Prediction vs. Action

  • Data format: labels \(y\in\mathcal Y\) vs. actions \(a\in\mathcal A\) and rewards \(r\in\mathbb R\) $$ \{(x_t, y_t)\}_{t=1}^T  \quad \text{vs.}\quad \{(x_t, a_t, r_t )\}_{t=1}^T $$
  • Observations \(x_t\), previously called features, are now called contexts
  • Goal: predict \(\hat y\) with high accuracy vs. choose \(a_t\) for high reward $$\min_{\hat y \in \mathcal Y} \mathbb E[\ell(y, \hat y)\mid x]\quad \text{vs.} \quad \max_{a\in\mathcal A} \mathbb E[r_t\mid x_t, a]$$
  • Key difference: dataset is no longer fixed, but depends on our actions
  • Today's lecture: we assumed the reward depends only on the current context and action (not on past or future ones)

Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari

Recap

  • Linear contextual bandits
  • Confidence bounds
  • Sub-optimality & regret
  • UCB Algorithm

Next time: optimal action sequences

Announcements

  • Fifth assignment due Thursday
  • Thursday: we will discuss projects & paper presentations
