Contextual Bandits

ML in Feedback Sys #11

Fall 2025, Prof Sarah Dean

Action in a streaming world

(Feedback loop diagram) A policy \(\pi_t:\mathcal X\to\mathcal A\) maps each observation (context) \(x_t\) to an action \(a_t\), and over time we accumulate data \(\{(x_t, a_t, r_t)\}\).

Contextual bandits

"What we do"

  • Initialize a model \(\hat r(x, a) = \hat\theta_0^\top \varphi(x,a)\) and \(V_0=\lambda I\)
  • For \(t=1,2,...\)
    • receive context \(x_t\)
    • take action \(a_t\) according to $$a_t = \arg\max_{a\in\mathcal A} \hat\theta_{t-1}^\top \varphi(x_t,a) + \sqrt{\beta_t}\|\varphi(x_t,a)\|_{V_{t-1}^{-1}} $$
    • receive reward \(r_t\)
    • update model \(\hat\theta_{t}\) and confidence parameters \(V_{t}\) according to least squares with \(((x_t,a_t),r_t)\)
  • This algorithm is called LinUCB (Upper Confidence Bound)

Notation: \(\|x\|^2_M = x^\top Mx\)
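A minimal sketch of one LinUCB round in numpy (function names like `select_action` and `update` are illustrative, not from the slides; `Phi` holds the candidate feature vectors \(\varphi(x_t,a)\) for each arm):

```python
import numpy as np

def select_action(theta_hat, V, Phi, beta):
    """Phi[a] = phi(x_t, a); pick argmax of theta_hat^T phi + sqrt(beta) * ||phi||_{V^{-1}}."""
    Vinv = np.linalg.inv(V)
    ucb = Phi @ theta_hat + np.sqrt(beta * np.einsum("ad,de,ae->a", Phi, Vinv, Phi))
    return int(np.argmax(ucb))

def update(V, b, phi, r):
    """Update least-squares statistics after observing reward r for feature phi."""
    V = V + np.outer(phi, phi)
    b = b + r * phi
    theta_hat = np.linalg.solve(V, b)
    return V, b, theta_hat
```

In the loop of the pseudocode above, one would call `select_action` with \(\hat\theta_{t-1}\), \(V_{t-1}\), the features \(\varphi(x_t,\cdot)\), and \(\beta_t\), then `update` with the observed reward \(r_t\).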

Contextual bandits

"Why we do it"

  • Fact 1: The linear reward model assumption can represent many settings $$\mathbb E[r_t|x_t,a_t] = r(x_t,a_t) = \theta_\star^\top \varphi(x_t,a_t)$$
  • Fact 2: As long as rewards are bounded and linear, with high probability \(\|\theta_\star-\hat\theta_{t-1}\|_{V_{t-1}}^2\leq \beta_t \) and  $$\hat\theta_{t-1}^\top \varphi(x_t,a) + \sqrt{\beta_t}\|\varphi(x_t,a)\|_{V_{t-1}^{-1}} = \max_{\|\theta-\hat\theta_{t-1}\|_{V_{t-1}}^2\leq \beta_t} \theta^\top \varphi(x_t,a)$$
  • Fact 3: Under the assumptions above, the average reward of actions chosen by LinUCB converges to optimal for \(C\) depending only logarithmically on \(T\) $$ \frac{1}{T}\sum_{t=1}^T r(x_t,a_t) \geq \frac{1}{T}\sum_{t=1}^T \max_{a\in\mathcal A} r(x_t, a)  - \frac{C}{\sqrt{T}}$$

Linear reward model

e.g. personalization

  • Linear contextual bandits: contexts \(x_t\in\mathbb R^d\) and discrete action set \(\mathcal A = \{1,\dots,K\}\)
    • average reward \(\mathbb E[r_t\mid x, a] = r(x, a) = \theta_a^\top x\)
  • Linear bandits: continuous action set \(\mathcal A = \{a\mid\|a\|\leq 1\}\)
    • average reward \(\mathbb E[r_t\mid a]=r(a) = \theta_\star^\top a\)
  • \(K\)-armed bandits: discrete action set \(\mathcal A = \{1,\dots,K\}\)
    • each action has average reward \(\mathbb E[r_t\mid a] = r(a) = \mu_a\)

Taking action \(a_t\in\mathcal A\) in context \(x_t\) yields reward $$r_t = \langle\theta_\star, \varphi(x_t, a_t)\rangle + \varepsilon_t,\qquad \mathbb E[\varepsilon_t]=0,~~\mathbb E[\varepsilon_t^2] = \sigma^2$$

e.g. betting


Linear reward model

Taking action \(a_t\in\mathcal A\) in context \(x_t\) yields reward $$r_t = \langle\theta_\star, \varphi(x_t, a_t)\rangle + \varepsilon_t$$

  • \(K\)-armed bandits
    • \(\mathbb E[r(a)] = \begin{bmatrix} \mu_1\\\vdots\\ \mu_K\end{bmatrix}^\top \begin{bmatrix} \mathbf 1\{a=1\}\\\vdots\\ \mathbf 1\{a=K\}\end{bmatrix} \), so \(\varphi(a) = e_a\)
  • \(K\)-armed contextual linear bandits
    • \(\mathbb E[r(x, a)] = \begin{bmatrix} \theta_1 \\\vdots\\ \theta_K \end{bmatrix}^\top \begin{bmatrix} x\mathbf 1\{a=1\}\\\vdots\\ x\mathbf 1\{a=K\}\end{bmatrix} \), so \(\varphi(x,a) = e_a \otimes x\)
  • Linear bandits
    • \(\mathbb E[r(a)] = \theta_\star^\top a\), so \(\varphi(a)=a\)
  • Linear contextual bandits
    • \(\mathbb E[r(x, a)] = ( \Theta_\star x)^\top a = \mathrm{vec}(\Theta_\star)^\top \mathrm{vec}(ax^\top)\), so \(\varphi(x, a)= \mathrm{vec}(ax^\top)\)
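A sketch of these feature maps in numpy (helper names are illustrative, not from the slides), checking that each one reproduces the corresponding linear reward:

```python
import numpy as np

def phi_k_armed(a, K):
    """K-armed bandit: phi(a) = e_a (arms indexed 0, ..., K-1 here)."""
    e = np.zeros(K); e[a] = 1.0
    return e

def phi_contextual_k_armed(x, a, K):
    """K-armed contextual: phi(x, a) = e_a kron x."""
    return np.kron(phi_k_armed(a, K), x)

def phi_linear_contextual(x, a):
    """Linear contextual: phi(x, a) = vec(a x^T), with vec stacking columns."""
    return np.outer(a, x).flatten(order="F")

rng = np.random.default_rng(0)
d, K = 3, 4
x = rng.normal(size=d)

# K-armed: mu^T e_a = mu_a
mu = rng.normal(size=K)
assert np.isclose(mu @ phi_k_armed(1, K), mu[1])

# K-armed contextual: stacked parameters dotted with e_a kron x give theta_a^T x
thetas = rng.normal(size=(K, d))
assert np.isclose(thetas.flatten() @ phi_contextual_k_armed(x, 2, K), thetas[2] @ x)

# Linear contextual: (Theta x)^T a = vec(Theta)^T vec(a x^T)
Theta, a = rng.normal(size=(d, d)), rng.normal(size=d)
assert np.isclose((Theta @ x) @ a, Theta.flatten(order="F") @ phi_linear_contextual(x, a))
```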

Recap: least squares estimation

Let \(\varphi_t = \varphi(x_t,a_t)\). Then, using data \(\{(\varphi_k, r_k)\}_{k=1}^t\),

$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2$$

Assuming the features \(\{\varphi_k\}\) span \(\mathbb R^d\) (so that \(V_t\) is invertible), $$\hat\theta_t ={\underbrace{ \left(\sum_{k=1}^t \varphi_k\varphi_k^\top\right)}_{V_t}}^{-1}\sum_{k=1}^t \varphi_k r_k $$
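A quick numerical sanity check of this closed form (a sketch; dimensions and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 200
theta_star = rng.normal(size=d)
Phi = rng.normal(size=(n, d))                # features phi_1, ..., phi_n (full rank w.h.p.)
r = Phi @ theta_star + 0.1 * rng.normal(size=n)

V = Phi.T @ Phi                              # V_t = sum_k phi_k phi_k^T
theta_hat = np.linalg.solve(V, Phi.T @ r)    # V_t^{-1} sum_k phi_k r_k

# agrees with the least-squares minimizer of sum_k (theta^T phi_k - r_k)^2
assert np.allclose(theta_hat, np.linalg.lstsq(Phi, r, rcond=None)[0])
```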

Last lecture we discussed that a \(1-\delta\) confidence interval on the prediction \(\theta_\star^\top\varphi\) is $$\Big[\hat\theta_t^\top \varphi \pm  \sqrt{\beta_t} \|\varphi\|_{V_t^{-1}}\Big]$$

Confidence ellipsoid

Define the confidence ellipsoid $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$

Fact: For the right choice of \(\beta_t\), with high probability, \(\theta_\star\in\mathcal C_t\)

Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari

Exercise: For a fixed feature vector \(\varphi\), show that $$\max_{\theta\in\mathcal C_t} \theta^\top \varphi \leq \hat\theta_t^\top \varphi +\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}\qquad\text{and}\qquad \min_{\theta\in\mathcal C_t}\theta^\top \varphi \geq \hat\theta_t^\top\varphi -\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
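One way to see the upper bound (a sketch; the lower bound is symmetric): for any \(\theta\in\mathcal C_t\),

$$\theta^\top\varphi = \hat\theta_t^\top\varphi + \big(V_t^{1/2}(\theta-\hat\theta_t)\big)^\top\big(V_t^{-1/2}\varphi\big) \leq \hat\theta_t^\top\varphi + \|\theta-\hat\theta_t\|_{V_t}\,\|\varphi\|_{V_t^{-1}} \leq \hat\theta_t^\top\varphi + \sqrt{\beta_t}\,\|\varphi\|_{V_t^{-1}}$$

using Cauchy-Schwarz for the first inequality and the definition of \(\mathcal C_t\) for the second.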


example: \(K=2\) armed bandit, where we've pulled the arms 2 and 1 times respectively, so the played features are $$ \left\{\begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 0\\ 1\end{bmatrix}\right \} $$

\(V_t = \begin{bmatrix} 2& \\& 1\end{bmatrix}\) and \(\hat\theta_t = (\hat \mu_1,\hat\mu_2)\)

  • Confidence ellipsoid $$\{ \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
    • \(2(\mu_1-\hat\mu_1)^2 + (\mu_2-\hat\mu_2)^2 \leq \beta_t\)

  • Confidence interval $$\big[\hat\theta_t^\top \varphi \pm  \sqrt{\beta_t} \|\varphi\|_{V_t^{-1}}\big]$$
    • Pulling arm 1: \(\hat \mu_1 \pm \sqrt{\beta_t/2}\)
    • Pulling arm 2: \(\hat \mu_2 \pm \sqrt{\beta_t}\)

Example: \(d=2\) linear bandits, where the played features are $$ \left\{\begin{bmatrix} 1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ -1\end{bmatrix}\right \} $$

\(V_t = \begin{bmatrix} 3&1 \\1& 3\end{bmatrix}\), \(V_t^{-1} = \frac{1}{8} \begin{bmatrix} 3&-1 \\-1& 3\end{bmatrix}\), and least squares estimate \(\hat \theta\)

  • Confidence ellipsoid $$\{ \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
  • Confidence interval $$\big[\hat\theta_t^\top \varphi \pm  \sqrt{\beta_t} \|\varphi\|_{V_t^{-1}}\big]$$
    • Trying action \(a=[0,1]^\top\): \(\hat \theta^\top a\pm \sqrt{\beta_t\, a^\top V_t^{-1}a} = \hat \theta_2 \pm \sqrt{3\beta_t/8}\)
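A quick numerical check of these two examples (a sketch in numpy; the interval half-width is reported as a multiple of \(\sqrt{\beta_t}\), so \(\beta_t\) stays symbolic):

```python
import numpy as np

def interval_halfwidth(features, phi, lam=0.0):
    """Return ||phi||_{V^{-1}} where V = lam*I + sum_k phi_k phi_k^T.

    The confidence interval is theta_hat^T phi +/- sqrt(beta) * ||phi||_{V^{-1}},
    so this is the half-width in units of sqrt(beta).
    """
    d = features.shape[1]
    V = lam * np.eye(d) + features.T @ features
    return np.sqrt(phi @ np.linalg.solve(V, phi))

# Example 1: K=2 armed bandit, arms pulled 2 and 1 times
pulls = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(interval_halfwidth(pulls, np.array([1.0, 0.0])))  # sqrt(1/2) ~ 0.707
print(interval_halfwidth(pulls, np.array([0.0, 1.0])))  # 1.0

# Example 2: d=2 linear bandit with features (1,1), (-1,1), (-1,-1)
feats = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(interval_halfwidth(feats, np.array([0.0, 1.0])))  # sqrt(3/8) ~ 0.612
```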

Regularization

To handle cases where the features \(\{\varphi_k\}\) are not full rank, we consider regularized least squares

$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2 + \lambda\|\theta\|_2^2$$

Now we have

$$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\lambda I+\sum_{k=1}^t\varphi_k\varphi_k^\top $$

The confidence ellipsoid takes the same form $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$

Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari

Cumulative sub-optimality: regret

How to trade-off exploration and exploitation?

The regret of an algorithm choosing \(\{a_t\}\) is defined as $$R(T) = \sum_{t=1}^T\max_{a\in\mathcal A} r(x_t, a) - r(x_t, a_t) $$

Notice that  $$ \frac{1}{T}\sum_{t=1}^T r(x_t,a_t) \geq \frac{1}{T}\sum_{t=1}^T \max_{a\in\mathcal A} r(x_t, a)  - \frac{R(T)}{{T}}$$

  • Sublinear regret implies convergence to optimal average reward

Explore-then-commit

ETC

  • For \(t=1,\dots,N\)
    • play \(\varphi_t\) at random
  • Estimate \(\hat\theta\) with least squares
  • For \(t=N+1,\dots,T\)
    • play \(\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi\), where \(\mathcal A_t = \{\varphi(x_t,a) : a\in\mathcal A\}\)

The regret has two components

$$R(T) = \underbrace{\sum_{t=1}^{N}  \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_1} + \underbrace{\sum_{t=N+1}^{T}  \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_2}$$
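A sketch of ETC in numpy for the linear bandit with the unit-ball action set (a minimal version under assumptions not on the slides: unit-norm features, Gaussian noise, and the closed-form commit step \(\hat\varphi = \hat\theta/\|\hat\theta\|\)):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, N, lam = 5, 5000, 500, 1.0
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)

def random_unit(d):
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

def reward(phi):
    return theta_star @ phi + 0.1 * rng.normal()

# explore: play random unit-norm features for N rounds
Phi = np.stack([random_unit(d) for _ in range(N)])
r = np.array([reward(phi) for phi in Phi])

# estimate with (regularized) least squares
V = lam * np.eye(d) + Phi.T @ Phi
theta_hat = np.linalg.solve(V, Phi.T @ r)

# commit: best unit-ball action under theta_hat is theta_hat / ||theta_hat||
phi_commit = theta_hat / np.linalg.norm(theta_hat)
per_round_gap = theta_star @ theta_star - theta_star @ phi_commit  # optimal action is theta_star

# here B = 1 since features are unit norm, so R_1 <= 2BN = 2N
print(f"exploration regret <= {2 * N}, commit-phase regret ~= {(T - N) * per_round_gap:.2f}")
```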

Explore-then-commit


Suppose that \(B\) bounds the norm of \(\varphi\) and \(\|\theta_\star\|\leq 1\).

  • \(\theta_\star^\top \varphi_\star - \theta_\star^\top \varphi \leq \|\theta_\star\|\,(\|\varphi_\star\|+\|\varphi\|) \leq 2B\) in every round

Then we have \(R_1\leq 2BN\)

Explore-then-commit


  • Define the optimal action \(a^\star_t = \arg\max_{a\in\mathcal A} \theta_\star^\top \varphi(x_t, a)\) and \(\varphi_t^\star = \varphi(x_t,a^\star_t)\)
  • Suppose we choose \(\hat a_t = \arg\max_{a\in\mathcal A} \hat\theta^\top \varphi(x_t, a)\) and let \(\hat\varphi_t = \varphi(x_t,\hat a_t)\)

The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq  \max_{\theta\in\mathcal C}\theta^\top \varphi_t^\star - \min_{\theta\in\mathcal C}\theta^\top \hat\varphi_t\qquad\) since \(\theta_\star\in\mathcal C\) with high probability
  • \(\leq  \hat\theta^\top\varphi_t^\star +\sqrt{\beta}\|\varphi_t^\star \|_{V^{-1}}-(\hat\theta^\top\hat\varphi_t -\sqrt{\beta}\|\hat\varphi_t\|_{V^{-1}})\quad\) by the exercise above
  • \(\leq  \underbrace{\hat\theta^\top\varphi_t^\star - \hat\theta^\top\hat\varphi_t }_{\leq 0} +\sqrt{\beta}(\|\varphi_t^\star \|_{V^{-1}}+\|\hat\varphi_t\|_{V^{-1}})\quad\) where the first term is \(\leq 0\) by choice of \(\hat a_t\)

Explore-then-commit


Suppose that \(\max_t\beta_t\leq \beta\). After \(N\) well-spread exploration rounds, \(\lambda_{\min}(V_N)\gtrsim N/d\), so \(\|\varphi\|_{V_N^{-1}}\lesssim B\sqrt{d/N}\) for every feature.

Using the sub-optimality result, with high probability \(R_2 \lesssim (T-N)\,2B\sqrt{\beta \tfrac{d}{N} }\)

Explore-then-commit

The regret is bounded with high probability by

$$R(T) \lesssim  2BN +  2BT\sqrt{\beta \frac{d}{N} }$$

Choosing \(N=T^{2/3}\) leads to sublinear regret

  • \(R(T) \lesssim T^{2/3}\)
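Where the choice \(N\propto T^{2/3}\) comes from (a sketch, treating \(B\), \(\beta\), \(d\) as constants): balance the two terms of the bound,

$$2BN \approx 2BT\sqrt{\beta\tfrac{d}{N}} \iff N^{3/2} \approx T\sqrt{\beta d} \iff N \approx (\beta d)^{1/3}T^{2/3}, \qquad\text{giving}\quad R(T)\lesssim B(\beta d)^{1/3}T^{2/3}$$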

Upper Confidence Bound

Adaptive perspective: optimism in the face of uncertainty

  • focus exploration on promising actions

UCB

  • Initialize \(V_0=\lambda I\), \(b_0=0\), \(\hat\theta_0=0\)
  • For \(t=1,\dots,T\)
    • play \(\displaystyle a_t = \arg\max_{a\in\mathcal A} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi(x_t,a)\)
    • let \(\varphi_t=\varphi(x_t,a_t)\)
    • update \(V_t = V_{t-1}+\varphi_t\varphi_t^\top\)
      and \(b_t = b_{t-1}+r_t\varphi_t\)
    • \(\hat\theta_t = V_t^{-1}b_t\) and  \(\mathcal C_t = \{\theta : \|\theta-\hat\theta_t \|_{V_t}^2\leq \beta_t\}\)
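Implementation aside (not on the slides): since \(V_t\) changes by a rank-one term each round, its inverse can be maintained directly with the Sherman-Morrison formula, so each round costs \(O(d^2)\) rather than \(O(d^3)\). A minimal sketch:

```python
import numpy as np

def sherman_morrison_update(Vinv, phi):
    """Return (V + phi phi^T)^{-1} given Vinv = V^{-1} (V symmetric)."""
    Vphi = Vinv @ phi
    return Vinv - np.outer(Vphi, Vphi) / (1.0 + phi @ Vphi)

# e.g. after playing phi_t and observing r_t:
#   Vinv = sherman_morrison_update(Vinv, phi_t); b += r_t * phi_t; theta_hat = Vinv @ b
```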

Sub-optimality of optimism

  • Define the optimal action \(a_t^\star = \arg\max_{a\in\mathcal A} \theta_\star^\top \varphi(x_t,a)\)
  • Suppose we choose the optimistic \(\hat a_t = \arg\max_{a\in\mathcal A} \max_{\theta\in\mathcal C_{t-1}} \theta^\top \varphi(x_t,a)\)
    • let \(\varphi_t^\star =\varphi(x_t,a^\star_t)\), \(\hat\varphi_t=\varphi(x_t,\hat a_t)\), and \(\tilde\theta_t = \arg\max_{\theta\in\mathcal C_{t-1}} \theta^\top \varphi(x_t,\hat a_t)\)

The suboptimality is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq  \tilde\theta_t^\top \hat\varphi_t - \theta_\star^\top \hat\varphi_t\quad \) by optimism: \(\theta_\star\in\mathcal C_{t-1}\) w.h.p., and \((\hat a_t,\tilde\theta_t)\) jointly maximize \(\theta^\top\varphi(x_t,a)\)
  • \(=  (\tilde \theta_t - \theta_\star)^\top \hat\varphi_t\)
  • \( = (V_{t-1}^{1/2}(\tilde \theta_t - \theta_\star))^\top(V_{t-1}^{-1/2}\hat\varphi_t )\)
  • \(\leq \|\tilde \theta_t - \theta_\star\|_{V_{t-1}} \|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad\) by Cauchy-Schwarz
  • \(\leq 2\sqrt{\beta_{t-1}}\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad \) since both \(\tilde\theta_t\) and \(\theta_\star\) lie in \(\mathcal C_{t-1}\), so \(\|\tilde\theta_t-\theta_\star\|_{V_{t-1}}\leq 2\sqrt{\beta_{t-1}}\)

Upper Confidence Bound

Proof Sketch: \(R(T) = \sum_{t=1}^T \theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\)

  • \(\leq 2\sqrt{\beta}\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad \) from previous slide
  • \(\leq 2\sqrt{T\beta\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}^2}\quad\) by Cauchy-Schwarz
  • \(\lesssim \sqrt{T}\) (up to \(d\) and logarithmic factors), following Lemma 19.4 (the elliptical potential lemma) in Bandit Algorithms, which gives \(\sum_{t=1}^T\min\{1,\|\hat\varphi_t\|^2_{V_{t-1}^{-1}}\}\lesssim d\log T\)

The regret is bounded with high probability by

$$R(T) \lesssim  \sqrt{T}$$
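A minimal end-to-end simulation sketch of this claim (assumptions not from the slides: a synthetic \(K\)-armed contextual problem, Gaussian noise, and a fixed \(\beta\) rather than the theoretically prescribed schedule); the cumulative regret should grow noticeably slower than \(T\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, lam, beta = 5, 2, 2000, 1.0, 2.0
theta_star = rng.normal(size=d * K)
theta_star /= np.linalg.norm(theta_star)          # so that ||theta_star|| <= 1

def features(x, a):
    """phi(x, a) = e_a kron x for the K-armed contextual setting."""
    phi = np.zeros(d * K)
    phi[a * d:(a + 1) * d] = x
    return phi

V, b, regret = lam * np.eye(d * K), np.zeros(d * K), 0.0
for t in range(1, T + 1):
    x = rng.normal(size=d)
    Vinv = np.linalg.inv(V)
    theta_hat = Vinv @ b
    # optimistic score per arm: theta_hat^T phi + sqrt(beta) * ||phi||_{V^{-1}}
    phis = [features(x, a) for a in range(K)]
    scores = [theta_hat @ p + np.sqrt(beta * p @ Vinv @ p) for p in phis]
    a_t = int(np.argmax(scores))
    r_t = theta_star @ phis[a_t] + 0.1 * rng.normal()
    V += np.outer(phis[a_t], phis[a_t])
    b += r_t * phis[a_t]
    regret += max(theta_star @ p for p in phis) - theta_star @ phis[a_t]

print(f"cumulative regret after T={T} rounds: {regret:.1f}")
```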

Prediction vs. Action

  • Data format: labels \(y\in\mathcal Y\) vs. actions \(a\in\mathcal A\) and rewards \(r\in\mathbb R\) $$ \{(x_t, y_t)\}_{t=1}^T  \quad \text{vs.}\quad \{(x_t, a_t, r_t )\}_{t=1}^T $$
  • Observations \(x_t\), previously called features, are now called contexts
  • Goal: predict \(\hat y\) with high accuracy vs. choose \(a_t\) for high reward $$\min_{\hat y \in \mathcal Y} \mathbb E[\ell(y, \hat y)\mid x]\quad \text{vs.} \quad \max_{a\in\mathcal A} \mathbb E[r_t\mid x_t, a]$$
  • Key difference: dataset is no longer fixed, but depends on our actions
  • Today's lecture: we assumed the reward depends only on the current context and action (not on past or future ones)

Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari

Recap

  • Linear contextual bandits
  • Confidence bounds
  • Sub-optimality & regret
  • UCB Algorithm

Next time: optimal action sequences

Announcements

  • Fifth assignment due Thursday
  • Thursday: we will discuss projects & paper presentations
