CS 4/5789: Introduction to Reinforcement Learning
Lecture 21: Contextual Bandits
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due weekly on Mondays
- PSet 6 due tonight
- PSet 7 released tonight
- PA 4 due May 3
- Final exam is Saturday 5/13 at 2pm
- WICCxURMC Survey
Agenda
1. Recap: MAB
2. Contextual Bandits
3. Linear Model
4. LinUCB

Recap: Multi-Armed Bandit
A simplified setting for studying exploration
Recap: MAB Setting
- Simplified RL setting with no state and no transitions
- \(\mathcal A=\{1,\dots,K\}\) \(K\) discrete actions ("arms")
- Stochastic rewards \(r_t\sim r(a_t)\) with expectation \(\mathbb E[r(a)] = \mu_a\)
- Finite time horizon \(T\in\mathbb Z_+\)
Multi-Armed Bandits
- for \(t=1,2,...,T\)
- take action \(a_t\in\{1,\dots, K\}\)
- receive reward \(r_t\)
- \(\mathbb E[r_t] = \mu_{a_t}\)
Explore-then-Commit
- Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
- For \(t=NK+1,...,T\):
Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
Upper Confidence Bound
For \(t=1,...,T\):
- Pull \( a_t = \arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
- Update empirical means \(\widehat \mu_{a,t}\) and counts \(N_{a,t}\)
Recap: MAB
- Explore-then-Commit: exploring with \(N \approx T^{2/3}\) gives \(R(T) \lesssim T^{2/3}\)
- UCB: \(R(T) \lesssim \sqrt{T}\)
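As a reference point before moving to contexts, here is a minimal Python sketch of the UCB loop above (illustrative only; the `pull` callback and the Bernoulli arms are assumptions, not part of the slides):

```python
import numpy as np

def ucb(pull, K, T, C=2.0):
    """Run UCB for T rounds; `pull(a)` returns a stochastic reward for arm a."""
    means = np.zeros(K)   # empirical means mu_hat_{a,t}
    counts = np.zeros(K)  # pull counts N_{a,t}
    for t in range(T):
        if t < K:
            a = t  # pull each arm once so every confidence bonus is finite
        else:
            a = int(np.argmax(means + np.sqrt(C / counts)))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return means, counts

# Toy usage: 3 Bernoulli arms; the best arm (index 2) should dominate the counts
rng = np.random.default_rng(0)
mu = [0.3, 0.5, 0.7]
means, counts = ucb(lambda a: rng.binomial(1, mu[a]), K=3, T=5000)
print(means, counts)
```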
Agenda
1. Recap: MAB
2. Contextual Bandits
3. Linear Model
4. LinUCB
Motivation: Contextual Bandits
Example: online advertising
- "Arms" are different job ads, e.g. Journalism vs. Programming
- But consider different users, e.g. a CS Major vs. an English Major: the best ad depends on who is viewing it
Motivation: Contextual Bandits
Example: online shopping
"Arms" are various products
But what about search queries, browsing history, items in cart?

Motivation: Contextual Bandits
Example: social media feeds
"Arms" are various posts: images, videos
Personalized to each user based on demographics, behavioral data, etc.

Contextual Bandits Setting
- Simplified RL setting with context instead of state or transitions
- Contexts \(x_t\in\mathcal X\) drawn i.i.d. from distribution \(\mathcal D\in\Delta(\mathcal X)\)
- \(\mathcal A=\{1,\dots,K\}\) \(K\) discrete actions ("arms")
- Stochastic rewards \(r_t\sim r(x_t, a_t)\) with expectation \(\mathbb E[r(x, a)] = \mu_a(x)\)
- Finite time horizon \(T\in\mathbb Z_+\)
Contextual Bandits
- for \(t=1,2,...,T\)
- observe context \(x_t\)
- take action \(a_t\in\{1,\dots, K\}\)
- receive reward \(r_t\) with \(\mathbb E[r_t] = \mu_{a_t}(x_t)\)
Comparison
- What is the difference between contextual bandits and an MDP? PollEV
- State \(s\) vs. context \(x\)
- Transition \(P\) and initial distribution \(\mu_0\) vs. context distribution \(\mathcal D\)
- Contexts are memoryless: independent of previous contexts and unaffected by actions
Contextual Bandits
- for \(t=1,2,...,T\)
- observe context \(x_t\)
- take action \(a_t\in\{1,\dots, K\}\)
- receive reward \(r_t\) with \(\mathbb E[r_t] = \mu_{a_t}(x_t)\)
Optimal Policy and Regret
- Goal: maximize cumulative reward $$ \mathbb E\left[\sum_{t=1}^T r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu_{a_t}(x_t)]$$
- Optimal policy \(\pi_\star(x) = \arg\max_{a\in\mathcal A} \mu_a(x)\)
- Definition: The regret of an algorithm which chooses actions \(a_1,\dots,a_T\) is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t) \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)] $$ where \(\mu_\star(x) = \max_{a\in\mathcal A}\mu_a(x)\)
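Concretely, the protocol and regret can be written as a short simulation loop. A minimal sketch, assuming hypothetical `env`/`policy` interfaces named in the docstring:

```python
def run_contextual_bandit(policy, env, T):
    """Interaction protocol above, measuring regret against pi_star.

    Assumed (hypothetical) interfaces: `env.sample_context()`,
    `env.sample_reward(x, a)`, `env.mean(x, a)` returning mu_a(x),
    `env.K` arms; `policy.act(x)` and `policy.update(x, a, r)`.
    """
    regret = 0.0
    for t in range(T):
        x = env.sample_context()     # x_t ~ D, i.i.d.
        a = policy.act(x)            # a_t in {1, ..., K}
        r = env.sample_reward(x, a)  # E[r_t] = mu_{a_t}(x_t)
        policy.update(x, a, r)
        # per-round regret: mu_star(x_t) - mu_{a_t}(x_t)
        regret += max(env.mean(x, b) for b in range(env.K)) - env.mean(x, a)
    return regret
```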
Agenda
1. Recap: MAB
2. Contextual Bandits
3. Linear Model
4. LinUCB
Linear Reward Model
- If contexts \(x_1\) and \(x_2\) are similar, we expect similar actions to achieve high reward
- Linear assumption: context \(x\in\mathbb R^d\) and $$\mathbb E[r(x,a)] = \mu_a(x) = \theta_a^\top x $$
- Unknown parameters \(\theta_a\in\mathbb R^d\) for \(a\in[K]\)
- Example: music artist recommendation
- \(\theta_a\in\mathbb R^d\) represents attributes of artists
- \(x \in\mathbb R^d\) represents a user's affinities
[Figure: artist attribute vectors \(\theta_a\) and a user's affinity vector \(x\) in the (tempo, lyricism) plane]
Linear Regression
- In order to predict rewards, estimate \(\hat\theta_a\)
- Observations so far make up data: \(\{x_k, a_k, r_k\}_{k=1}^t\)
- Supervised learning problem: $$\hat\theta_a = \arg\min_\theta \sum_{k=1}^t \mathbf 1\{a_k=a \}(\theta^\top x_k - r_k)^2$$
- Lemma: Assume that \(\{x_k\}_{k:a_k=a}\) span \(\mathbb R^d\). Then $$\hat\theta_a ={\underbrace{ \Big(\sum_{k:a_k=a} x_k x_k^\top \Big)}_{A}}^{-1}\underbrace{\sum_{k:a_k=a} x_k r_k}_b = A^{-1} b $$
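In code, the lemma's estimator is a few lines of numpy. A minimal sketch, assuming the contexts observed for arm \(a\) are stacked as rows of `X` (an assumption of this snippet, not notation from the slides):

```python
import numpy as np

def fit_theta(X, r):
    """Per-arm least squares: theta_hat = (X^T X)^{-1} X^T r.

    X has shape (n, d), one observed context x_k (with a_k = a) per row,
    and r has shape (n,). Requires the rows of X to span R^d.
    """
    A = X.T @ X  # sum of x_k x_k^T
    b = X.T @ r  # sum of x_k r_k
    return np.linalg.solve(A, b)  # solve A theta = b (avoids explicit inverse)

# Tiny check against numpy's built-in least squares
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
r = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=10)
assert np.allclose(fit_theta(X, r), np.linalg.lstsq(X, r, rcond=None)[0])
```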
Linear Regression
- Proof: First take the gradient of the objective:
- \(\nabla_\theta \sum_{k:a_k=a} (\theta^\top x_k - r_k)^2 = 2 \sum_{k:a_k=a} x_k(\theta^\top x_k - r_k)\)
- Setting it equal to zero at the minimum \(\hat \theta_a\) $$ \sum_{k:a_k=a} x_k x_k^\top \hat \theta_a = \sum_{k:a_k=a} x_k r_k \iff A\hat \theta_a = b$$
- Under the spanning assumption, \(A\) is invertible
- To see why, define \(X\) containing stacked contexts such that \(A=X^\top X\). Then the assumption \(\implies \mathrm{rank}(X)=d \implies \mathrm{rank}(X^\top X)=d \)
- Therefore, \(\hat \theta_a=A^{-1} b\)
Linear Regression
- In order to predict rewards, estimate \(\hat\theta_a\)
- Observations so far make up data: \(\{x_k, a_k, r_k\}_{k=1}^t\)
- Supervised learning problem: $$\hat\theta_a = \Big(\sum_{k:a_k=a} x_k x_k^\top \Big)^{-1} \sum_{k:a_k=a} x_k r_k = A^{-1} b $$
- The context covariance matrix is \(\Sigma = \mathbb E_{x\sim \mathcal D}[xx^\top]\)
- The matrix \(A\) is related to the empirical covariance
- \(\hat \Sigma = \frac{1}{N_a} \sum_{k:a_k=a} x_k x_k^\top \) approximates the expectation with an empirical average
- The relationship is \(A = N_a \hat \Sigma\)
Example
[Figure: observed contexts plotted in the (tempo, lyricism) plane]
- Suppose the 6 observed contexts come from two users (coordinates (tempo, lyricism)):
- User 1 (5x): loves fast songs, indifferent to lyrics, i.e. \(x=(1,0)\)
- positive ratings (all \(1\))
- User 2 (1x): loves lyrical songs, indifferent to tempo, i.e. \(x=(0,1)\)
- negative ratings (all \(-1\))
- \(A^{-1} = \begin{bmatrix}\frac{1}{5} & \\ & 1\end{bmatrix}\)
- \(b=\begin{bmatrix} 5 \\ -1\end{bmatrix}\)
- \(\hat\theta = A^{-1}b = \begin{bmatrix}1\\ -1\end{bmatrix}\)
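A quick numpy check of this example (the explicit contexts \((1,0)\) and \((0,1)\) are read off the figure):

```python
import numpy as np

# Five observations of user 1 with context (1, 0) and reward +1,
# one observation of user 2 with context (0, 1) and reward -1
X = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]])
r = np.array([1.0] * 5 + [-1.0])

A = X.T @ X                   # diag(5, 1)
b = X.T @ r                   # (5, -1)
print(np.linalg.solve(A, b))  # [ 1. -1.], matching theta_hat above
```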
Agenda
1. Recap: MAB
2. Contextual Bandits
3. Linear Model
4. LinUCB
LinUCB
- An algorithm that adapts to linear confidence intervals
- Need to keep track of:
- \(A_{a,t} = \sum_{k=1}^t \mathbf 1\{a_k=a\}x_k x_k^\top \) and \(b_{a,t} = \sum_{k=1}^t \mathbf 1\{a_k=a\}r_k x_k \)
- \(\hat\theta_{a,t} = A_{a,t}^{-1} b_{a,t} \)
LinUCB
- Initialize means to zero and confidence intervals to infinite width
- for \(t=1,2,...,T\)
- \(a_t=\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}\) # largest UCB
- update \(A_{a,t+1}\), \(b_{a,t+1}\), and \(\hat \theta_{a,t+1}\)
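A compact Python sketch of this loop; the small ridge term standing in for the "infinite confidence interval" initialization is a practical assumption, not part of the pseudocode:

```python
import numpy as np

class LinUCB:
    """LinUCB with one linear model per arm, as in the pseudocode above."""

    def __init__(self, K, d, alpha=1.0, reg=1e-3):
        self.alpha = alpha
        # A_a accumulates sum of x x^T; the ridge term reg * I keeps A_a
        # invertible before arm a has d observations (practical assumption).
        self.A = np.stack([reg * np.eye(d) for _ in range(K)])
        self.b = np.zeros((K, d))

    def act(self, x):
        ucbs = []
        for a in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]  # theta_hat_{a,t} = A^{-1} b
            ucbs.append(theta_hat @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, x, a, r):
        self.A[a] += np.outer(x, x)  # A_{a,t+1}
        self.b[a] += r * x           # b_{a,t+1}

# Usage on a toy problem: K=2 arms, d=2, true parameters theta
rng = np.random.default_rng(0)
theta = np.array([[1.0, -1.0], [-1.0, 1.0]])
policy = LinUCB(K=2, d=2)
for t in range(2000):
    x = rng.normal(size=2)
    a = policy.act(x)
    r = theta[a] @ x + 0.1 * rng.normal()  # linear reward plus noise
    policy.update(x, a, r)
```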
LinUCB
- An algorithm that adapts to linear confidence intervals
- Upper confidence bound is: $$\hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
- Similar to UCB but with a mean and CI width that depend on the context \(x_t\)
- Geometric intuition:
- First term is large if \(x\) and \(\hat \theta\) are aligned
- Second term is large if \(x\) is not aligned with much historical data $$x^\top A^{-1} x = x^\top (N_a \hat \Sigma)^{-1} x = \frac{1}{N_a} x^\top \hat \Sigma^{-1} x$$
Example
[Figure: a new user context \(x\) alongside the observed contexts in the (tempo, lyricism) plane]
- Suppose the observed contexts come from the same two users:
- User 1 (5x): loves fast songs, indifferent to lyrics
- positive ratings (all \(1\))
- User 2 (1x): loves lyrical songs, indifferent to tempo
- negative ratings (all \(-1\))
- \(\hat\theta=\begin{bmatrix}\frac{1}{5} & \\ & 1\end{bmatrix}\begin{bmatrix} 5 \\ -1\end{bmatrix}= \begin{bmatrix}1\\ -1\end{bmatrix}\)
- For a new user \(x=(m, \ell )\), the UCB (taking \(\alpha=1\)) is:
- \(m-\ell + \sqrt{m^2/5 + \ell^2}\)
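A small numeric sketch of this closed-form UCB (the helper name `ucb_value` is illustrative):

```python
import numpy as np

A_inv = np.diag([1 / 5, 1.0])        # from the 5 + 1 observations above
theta_hat = np.array([1.0, -1.0])

def ucb_value(m, l, alpha=1.0):
    x = np.array([m, l])
    return theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)

print(ucb_value(1, 0))  # fast-song lover: 1 + sqrt(1/5) ≈ 1.447
print(ucb_value(0, 1))  # lyrics lover: -1 + 1 = 0 (large bonus: only 1 sample)
```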
Statistical Derivation
- We can derive the form of the confidence intervals more formally using statistics
- Claim: With high probability (over noisy rewards) $$\theta_a^\top x \leq \hat \theta_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x} $$ where \(\alpha\) depends on the failure probability and the variance of the rewards
- Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability}\quad 1-\frac{1}{\beta^2}$$
Statistical Derivation
- Claim: With high probability (over noisy rewards) $$\theta_a^\top x \leq \hat \theta_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x} $$ where \(\alpha\) depends on the failure probability and the variance of the rewards
- Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability}\quad 1-\frac{1}{\beta^2}$$
- Proof of Claim: using Chebyshev's inequality, we show that w.h.p. $$|\theta_a^\top x-\hat \theta_a^\top x|\leq \alpha \sqrt{x^\top A_a^{-1} x} $$
Statistical Derivation
- Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability}\quad 1-\frac{1}{\beta^2}$$
- Proof of Claim: using Chebyshev's inequality, we show that w.h.p. $$|\underbrace{\theta_a^\top x-\hat \theta_a^\top x}_{u}|\leq \alpha\underbrace{ \sqrt{x^\top A_a^{-1} x}}_{\sqrt{\mathbb E[u^2]}/\sigma} $$ so that \(\alpha = \beta\sigma\)
- Show that \(\mathbb E[u] = 0\)
- Compute variance \(\mathbb E[u^2]\)
Statistical Derivation
- Proof of Claim:
- Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
- Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
- \(\hat \theta_a = A_a^{-1} \sum_{k:a_k=a} (\theta_a^\top x_k + w_k) x_k \)
- \( = A_a^{-1} \sum_{k:a_k=a} x_k x_k^\top \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
- \( = \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
- \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x]=(A_a^{-1} \sum_{k:a_k=a} \mathbb E[w_k] x_k)^\top x =0\)
- Compute variance \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]\)
Statistical Derivation
- Proof of Claim:
- ✔ Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
- Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
- \(\hat \theta_a = \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
- Compute variance \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]\)
- \( = \mathbb E[((\hat\theta_a-\theta_a)^\top x)^2] = \mathbb E[((A_a^{-1} \sum_k w_k x_k)^\top x)^2]\)
- \(= x^\top A_a^{-1} \sum_k \sum_\ell \mathbb E[ w_k w_\ell ] x_k x_\ell^\top A_a^{-1} x\)
- \(\mathbb E[ w_k w_\ell ]=0\) if \(k\neq \ell\), otherwise variance \(\sigma^2\)
- \(= x^\top A_a^{-1} \sum_k\sigma^2 x_k x_k^\top A_a^{-1} x\)
- \(=\sigma^2 x^\top A_a^{-1} x\)
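A Monte Carlo sanity check of both steps (fixed contexts and Gaussian noise standing in for the \(w_k\); both choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 3, 20, 0.5
Xs = rng.normal(size=(n, d))   # fixed contexts x_k for arm a
theta = rng.normal(size=d)     # true parameter theta_a
A_inv = np.linalg.inv(Xs.T @ Xs)
x = rng.normal(size=d)         # query context

# Sample u = theta^T x - theta_hat^T x many times over fresh reward noise
us = []
for _ in range(20000):
    r = Xs @ theta + sigma * rng.normal(size=n)  # r_k = theta^T x_k + w_k
    theta_hat = A_inv @ (Xs.T @ r)
    us.append(theta @ x - theta_hat @ x)
us = np.array(us)

print(us.mean())                            # ≈ 0, matching E[u] = 0
print(us.var(), sigma**2 * x @ A_inv @ x)   # ≈ sigma^2 x^T A^{-1} x
```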
Statistical Derivation
- Proof of Claim:
- ✔ Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
- Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
- \(\hat \theta_a = \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
- ✔ Compute variance \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]=\sigma^2 x^\top A_a^{-1} x\)
LinUCB
- for \(t=1,2,...,T\)
- \(a_t=\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}\) # largest UCB
- update \(A_{a,t+1}\), \(b_{a,t+1}\), and \(\hat \theta_{a,t+1}\)
Recap
- PSet released/due tonight
- Contextual Bandits
- Linear Model
- LinUCB
- Next lecture: Exploration in MDPs