CS 4/5789: Introduction to Reinforcement Learning
Lecture 21: Contextual Bandits
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due weekly on Mondays
- PSet 6 due tonight
- PSet 7 released tonight
- PA 4 due May 3
- Final exam is Saturday 5/13 at 2pm
- WICCxURMC Survey
Agenda
1. Recap: MAB
2. Contextual Bandits
3. Linear Model
4. LinUCB

Recap: Multi-Armed Bandit
A simplified setting for studying exploration
Recap: MAB Setting
- Simplified RL setting with no state and no transitions
- \(\mathcal A=\{1,\dots,K\}\) \(K\) discrete actions ("arms")
- Stochastic rewards \(r_t\sim r(a_t)\) with expectation \(\mathbb E[r(a)] = \mu_a\)
- Finite time horizon \(T\in\mathbb Z_+\)
Multi-Armed Bandits
- for \(t=1,2,...,T\)
- take action \(a_t\in\{1,\dots, K\}\)
- receive reward \(r_t\)
- \(\mathbb E[r_t] = \mu_{a_t}\)
Explore-then-Commit
- Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
- For \(t=NK+1,...,T\):
Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
Upper Confidence Bound
For \(t=1,...,T\):
- Pull \( a_t = \arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
- Update empirical means \(\widehat \mu_{a,t}\) and counts \(N_{a,t}\)
Recap: MAB
- Explore-then-Commit: exploring with \(N \approx T^{2/3}\) gives \(R(T) \lesssim T^{2/3}\)
- UCB: \(R(T) \lesssim \sqrt{T}\)
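As a reference point before moving to contexts, here is a minimal Python sketch of the UCB loop above (illustrative only; the `pull` callback and the Bernoulli arms are assumptions, not part of the slides):

```python
import numpy as np

def ucb(pull, K, T, C=2.0):
    """Run UCB for T rounds; `pull(a)` returns a stochastic reward for arm a."""
    means = np.zeros(K)   # empirical means mu_hat_{a,t}
    counts = np.zeros(K)  # pull counts N_{a,t}
    for t in range(T):
        if t < K:
            a = t  # pull each arm once so every confidence bonus is finite
        else:
            a = int(np.argmax(means + np.sqrt(C / counts)))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return means, counts

# Toy usage: 3 Bernoulli arms; the best arm (index 2) should dominate the counts
rng = np.random.default_rng(0)
mu = [0.3, 0.5, 0.7]
means, counts = ucb(lambda a: rng.binomial(1, mu[a]), K=3, T=5000)
print(means, counts)
```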
Agenda
1. Recap: MAB
2. Contextual Bandits
3. Linear Model
4. LinUCB
Motivation: Contextual Bandits
Example: online advertising
- "Arms" are different job ads, e.g. Journalism vs. Programming
- But consider different users, e.g. a CS Major vs. an English Major: the best ad depends on who is viewing it
Motivation: Contextual Bandits
Example: online shopping
"Arms" are various products
But what about search queries, browsing history, items in cart?

Motivation: Contextual Bandits
Example: social media feeds
"Arms" are various posts: images, videos
Personalized to each user based on demographics, behavioral data, etc.

Contextual Bandits Setting
- Simplified RL setting with context instead of state or transitions
- Contexts \(x_t\in\mathcal X\) drawn i.i.d. from distribution \(\mathcal D\in\Delta(\mathcal X)\)
- \(\mathcal A=\{1,\dots,K\}\) \(K\) discrete actions ("arms")
- Stochastic rewards \(r_t\sim r(x_t, a_t)\) with expectation \(\mathbb E[r(x, a)] = \mu_a(x)\)
- Finite time horizon \(T\in\mathbb Z_+\)
Contextual Bandits
- for \(t=1,2,...,T\)
- observe context \(x_t\)
- take action \(a_t\in\{1,\dots, K\}\)
- receive reward \(r_t\) with \(\mathbb E[r_t] = \mu_{a_t}(x_t)\)
Comparison
- What is the difference between contextual bandits and an MDP? PollEV
- State \(s\) vs. context \(x\)
- Transition \(P\) and initial distribution \(\mu_0\) vs. context distribution \(\mathcal D\)
- Contexts are memoryless: independent of previous contexts and unaffected by actions
Contextual Bandits
- for \(t=1,2,...,T\)
- observe context \(x_t\)
- take action \(a_t\in\{1,\dots, K\}\)
- receive reward \(r_t\) with \(\mathbb E[r_t] = \mu_{a_t}(x_t)\)
Optimal Policy and Regret
- Goal: maximize cumulative reward $$ \mathbb E\left[\sum_{t=1}^T r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu_{a_t}(x_t)]$$
- Optimal policy \(\pi_\star(x) = \arg\max_{a\in\mathcal A} \mu_a(x)\)
- Definition: The regret of an algorithm which chooses actions \(a_1,\dots,a_T\) is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t) \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)] $$ where \(\mu_\star(x) = \max_{a\in\mathcal A}\mu_a(x)\)
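Concretely, the protocol and regret can be written as a short simulation loop. A minimal sketch, assuming hypothetical `env`/`policy` interfaces named in the docstring:

```python
def run_contextual_bandit(policy, env, T):
    """Interaction protocol above, measuring regret against pi_star.

    Assumed (hypothetical) interfaces: `env.sample_context()`,
    `env.sample_reward(x, a)`, `env.mean(x, a)` returning mu_a(x),
    `env.K` arms; `policy.act(x)` and `policy.update(x, a, r)`.
    """
    regret = 0.0
    for t in range(T):
        x = env.sample_context()     # x_t ~ D, i.i.d.
        a = policy.act(x)            # a_t in {1, ..., K}
        r = env.sample_reward(x, a)  # E[r_t] = mu_{a_t}(x_t)
        policy.update(x, a, r)
        # per-round regret: mu_star(x_t) - mu_{a_t}(x_t)
        regret += max(env.mean(x, b) for b in range(env.K)) - env.mean(x, a)
    return regret
```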
Agenda
1. Recap: MAB
2. Contextual Bandits
3. Linear Model
4. LinUCB
Linear Reward Model
- If contexts \(x_1\) and \(x_2\) are similar, we expect similar actions to achieve high reward
- Linear assumption: context \(x\in\mathbb R^d\) and $$\mathbb E[r(x,a)] = \mu_a(x) = \theta_a^\top x $$
- Unknown parameters \(\theta_a\in\mathbb R^d\) for \(a\in[K]\)
- Example: music artist recommendation
- \(\theta_a\in\mathbb R^d\) represents attributes of artists
- \(x \in\mathbb R^d\) represents a user's affinities
[Figure: artist attribute vectors \(\theta_a\) and a user's affinity vector \(x\) in the (tempo, lyricism) plane]
Linear Regression
- In order to predict rewards, estimate \(\hat\theta_a\)
- Observations so far make up data: \(\{x_k, a_k, r_k\}_{k=1}^t\)
- Supervised learning problem: $$\hat\theta_a = \arg\min_\theta \sum_{k=1}^t \mathbf 1\{a_k=a \}(\theta^\top x_k - r_k)^2$$
- Lemma: Assume that \(\{x_k\}_{k:a_k=a}\) span \(\mathbb R^d\). Then $$\hat\theta_a ={\underbrace{ \Big(\sum_{k:a_k=a} x_k x_k^\top \Big)}_{A}}^{-1}\underbrace{\sum_{k:a_k=a} x_k r_k}_b = A^{-1} b $$
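In code, the lemma's estimator is a few lines of numpy. A minimal sketch, assuming the contexts observed for arm \(a\) are stacked as rows of `X` (an assumption of this snippet, not notation from the slides):

```python
import numpy as np

def fit_theta(X, r):
    """Per-arm least squares: theta_hat = (X^T X)^{-1} X^T r.

    X has shape (n, d), one observed context x_k (with a_k = a) per row,
    and r has shape (n,). Requires the rows of X to span R^d.
    """
    A = X.T @ X  # sum of x_k x_k^T
    b = X.T @ r  # sum of x_k r_k
    return np.linalg.solve(A, b)  # solve A theta = b (avoids explicit inverse)

# Tiny check against numpy's built-in least squares
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
r = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=10)
assert np.allclose(fit_theta(X, r), np.linalg.lstsq(X, r, rcond=None)[0])
```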
Linear Regression
- Proof: First take the gradient of the objective:
- \(\nabla_\theta \sum_{k:a_k=a} (\theta^\top x_k - r_k)^2 = 2 \sum_{k:a_k=a} x_k(\theta^\top x_k - r_k)\)
- Setting it equal to zero at the minimum \(\hat \theta_a\) $$ \sum_{k:a_k=a} x_k x_k^\top \hat \theta_a = \sum_{k:a_k=a} x_k r_k \iff A\hat \theta_a = b$$
- Under the spanning assumption, \(A\) is invertible
- To see why, define \(X\) containing stacked contexts such that \(A=X^\top X\). Then the assumption \(\implies \mathrm{rank}(X)=d \implies \mathrm{rank}(X^\top X)=d \)
- Therefore, \(\hat \theta_a=A^{-1} b\)
Linear Regression
- In order to predict rewards, estimate \(\hat\theta_a\)
- Observations so far make up data: \(\{x_k, a_k, r_k\}_{k=1}^t\)
- Supervised learning problem: $$\hat\theta_a = \Big(\sum_{k:a_k=a} x_k x_k^\top \Big)^{-1} \sum_{k:a_k=a} x_k r_k = A^{-1} b $$
- The context covariance matrix is \(\Sigma = \mathbb E_{x\sim \mathcal D}[xx^\top]\)
- The matrix \(A\) is related to the empirical covariance
- \(\hat \Sigma = \frac{1}{N_a} \sum_{k:a_k=a} x_k x_k^\top \) approximates the expectation with an empirical average
- The relationship is \(A = N_a \hat \Sigma\)
Example
[Figure: observed contexts plotted in the (tempo, lyricism) plane]
- Suppose the 6 observed contexts come from two users (coordinates (tempo, lyricism)):
- User 1 (5x): loves fast songs, indifferent to lyrics, i.e. \(x=(1,0)\)
- positive ratings (all \(1\))
- User 2 (1x): loves lyrical songs, indifferent to tempo, i.e. \(x=(0,1)\)
- negative ratings (all \(-1\))
- \(A^{-1} = \begin{bmatrix}\frac{1}{5} & \\ & 1\end{bmatrix}\)
- \(b=\begin{bmatrix} 5 \\ -1\end{bmatrix}\)
- \(\hat\theta = A^{-1}b = \begin{bmatrix}1\\ -1\end{bmatrix}\)
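A quick numpy check of this example (the explicit contexts \((1,0)\) and \((0,1)\) are read off the figure):

```python
import numpy as np

# Five observations of user 1 with context (1, 0) and reward +1,
# one observation of user 2 with context (0, 1) and reward -1
X = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]])
r = np.array([1.0] * 5 + [-1.0])

A = X.T @ X                   # diag(5, 1)
b = X.T @ r                   # (5, -1)
print(np.linalg.solve(A, b))  # [ 1. -1.], matching theta_hat above
```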
Agenda
1. Recap: MAB
2. Contextual Bandits
3. Linear Model
4. LinUCB
LinUCB
- An algorithm that adapts to linear confidence intervals
- Need to keep track of:
- \(A_{a,t} = \sum_{k=1}^t \mathbf 1\{a_k=a\}x_k x_k^\top \) and \(b_{a,t} = \sum_{k=1}^t \mathbf 1\{a_k=a\}r_k x_k \)
- \(\hat\theta_{a,t} = A_{a,t}^{-1} b_{a,t} \)
LinUCB
- Initialize means to zero and confidence intervals to infinite width
- for \(t=1,2,...,T\)
- \(a_t=\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}\) # largest UCB
- update \(A_{a,t+1}\), \(b_{a,t+1}\), and \(\hat \theta_{a,t+1}\)
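A compact Python sketch of this loop; the small ridge term standing in for the "infinite confidence interval" initialization is a practical assumption, not part of the pseudocode:

```python
import numpy as np

class LinUCB:
    """LinUCB with one linear model per arm, as in the pseudocode above."""

    def __init__(self, K, d, alpha=1.0, reg=1e-3):
        self.alpha = alpha
        # A_a accumulates sum of x x^T; the ridge term reg * I keeps A_a
        # invertible before arm a has d observations (practical assumption).
        self.A = np.stack([reg * np.eye(d) for _ in range(K)])
        self.b = np.zeros((K, d))

    def act(self, x):
        ucbs = []
        for a in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]  # theta_hat_{a,t} = A^{-1} b
            ucbs.append(theta_hat @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, x, a, r):
        self.A[a] += np.outer(x, x)  # A_{a,t+1}
        self.b[a] += r * x           # b_{a,t+1}

# Usage on a toy problem: K=2 arms, d=2, true parameters theta
rng = np.random.default_rng(0)
theta = np.array([[1.0, -1.0], [-1.0, 1.0]])
policy = LinUCB(K=2, d=2)
for t in range(2000):
    x = rng.normal(size=2)
    a = policy.act(x)
    r = theta[a] @ x + 0.1 * rng.normal()  # linear reward plus noise
    policy.update(x, a, r)
```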
LinUCB
- An algorithm that adapts to linear confidence intervals
- Upper confidence bound is: $$\hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
- Similar to UCB but with a mean and CI width that depend on the context \(x_t\)
- Geometric intuition:
- First term is large if \(x\) and \(\hat \theta\) are aligned
- Second term is large if \(x\) is not aligned with much historical data $$x^\top A^{-1} x = x^\top (N_a \hat \Sigma)^{-1} x = \frac{1}{N_a} x^\top \hat \Sigma^{-1} x$$
Example
[Figure: a new user context \(x\) alongside the observed contexts in the (tempo, lyricism) plane]
- Suppose the observed contexts come from the same two users:
- User 1 (5x): loves fast songs, indifferent to lyrics
- positive ratings (all \(1\))
- User 2 (1x): loves lyrical songs, indifferent to tempo
- negative ratings (all \(-1\))
- \(\hat\theta=\begin{bmatrix}\frac{1}{5} & \\ & 1\end{bmatrix}\begin{bmatrix} 5 \\ -1\end{bmatrix}= \begin{bmatrix}1\\ -1\end{bmatrix}\)
- For a new user \(x=(m, \ell )\), the UCB (taking \(\alpha=1\)) is:
- \(m-\ell + \sqrt{m^2/5 + \ell^2}\)
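A small numeric sketch of this closed-form UCB (the helper name `ucb_value` is illustrative):

```python
import numpy as np

A_inv = np.diag([1 / 5, 1.0])        # from the 5 + 1 observations above
theta_hat = np.array([1.0, -1.0])

def ucb_value(m, l, alpha=1.0):
    x = np.array([m, l])
    return theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)

print(ucb_value(1, 0))  # fast-song lover: 1 + sqrt(1/5) ≈ 1.447
print(ucb_value(0, 1))  # lyrics lover: -1 + 1 = 0 (large bonus: only 1 sample)
```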
Statistical Derivation
- We can derive the form of the confidence intervals more formally using statistics
- Claim: With high probability (over noisy rewards) $$\theta_a^\top x \leq \hat \theta_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x} $$ where \(\alpha\) depends on the failure probability and the variance of the rewards
- Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability}\quad 1-\frac{1}{\beta^2}$$
Statistical Derivation
- Claim: With high probability (over noisy rewards) $$\theta_a^\top x \leq \hat \theta_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x} $$ where \(\alpha\) depends on the failure probability and the variance of the rewards
- Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability}\quad 1-\frac{1}{\beta^2}$$
- Proof of Claim: using Chebyshev's inequality, we show that w.h.p. $$|\theta_a^\top x-\hat \theta_a^\top x|\leq \alpha \sqrt{x^\top A_a^{-1} x} $$
Statistical Derivation
- Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability}\quad 1-\frac{1}{\beta^2}$$
- Proof of Claim: using Chebyshev's inequality, we show that w.h.p. $$|\underbrace{\theta_a^\top x-\hat \theta_a^\top x}_{u}|\leq \alpha\underbrace{ \sqrt{x^\top A_a^{-1} x}}_{\sqrt{\mathbb E[u^2]}/\sigma} $$ so that \(\alpha = \beta\sigma\)
- Show that \(\mathbb E[u] = 0\)
- Compute variance \(\mathbb E[u^2]\)
Statistical Derivation
- Proof of Claim:
- Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
- Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
- \(\hat \theta_a = A_a^{-1} \sum_{k:a_k=a} (\theta_a^\top x_k + w_k) x_k \)
- \( = A_a^{-1} \sum_{k:a_k=a} x_k x_k^\top \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
- \( = \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
- \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x]=(A_a^{-1} \sum_{k:a_k=a} \mathbb E[w_k] x_k)^\top x =0\)
- Compute variance \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]\)
Statistical Derivation
- Proof of Claim:
- ✔ Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
- Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
- \(\hat \theta_a = \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
- Compute variance \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]\)
- \( = \mathbb E[((\hat\theta_a-\theta_a)^\top x)^2] = \mathbb E[((A_a^{-1} \sum_k w_k x_k)^\top x)^2]\)
- \(= x^\top A_a^{-1} \sum_k \sum_\ell \mathbb E[ w_k w_\ell ] x_k x_\ell^\top A_a^{-1} x\)
- \(\mathbb E[ w_k w_\ell ]=0\) if \(k\neq \ell\), otherwise variance \(\sigma^2\)
- \(= x^\top A_a^{-1} \sum_k\sigma^2 x_k x_k^\top A_a^{-1} x\)
- \(=\sigma^2 x^\top A_a^{-1} x\)
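A Monte Carlo sanity check of both steps (fixed contexts and Gaussian noise standing in for the \(w_k\); both choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 3, 20, 0.5
Xs = rng.normal(size=(n, d))   # fixed contexts x_k for arm a
theta = rng.normal(size=d)     # true parameter theta_a
A_inv = np.linalg.inv(Xs.T @ Xs)
x = rng.normal(size=d)         # query context

# Sample u = theta^T x - theta_hat^T x many times over fresh reward noise
us = []
for _ in range(20000):
    r = Xs @ theta + sigma * rng.normal(size=n)  # r_k = theta^T x_k + w_k
    theta_hat = A_inv @ (Xs.T @ r)
    us.append(theta @ x - theta_hat @ x)
us = np.array(us)

print(us.mean())                            # ≈ 0, matching E[u] = 0
print(us.var(), sigma**2 * x @ A_inv @ x)   # ≈ sigma^2 x^T A^{-1} x
```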
Statistical Derivation
- Proof of Claim:
- ✔ Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
- Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
- \(\hat \theta_a = \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
- ✔ Compute variance \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]=\sigma^2 x^\top A_a^{-1} x\)
LinUCB
- for \(t=1,2,...,T\)
- \(a_t=\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}\) # largest UCB
- update \(A_{a,t+1}\), \(b_{a,t+1}\), and \(\hat \theta_{a,t+1}\)
Recap
- PSet released/due tonight
- Contextual Bandits
- Linear Model
- LinUCB
- Next lecture: Exploration in MDPs