CS 4/5789: Introduction to Reinforcement Learning

Lecture 21: Contextual Bandits

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • 5789 Paper Reviews due weekly on Mondays
    • PSet 6 due tonight
    • PSet 7 released tonight
    • PA 4 due May 3
  • Final exam is Saturday 5/13 at 2pm
  • WICCxURMC Survey

Agenda

1. Recap: MAB

2. Contextual Bandits

3. Linear Model

4. LinUCB

Recap: Multi-Armed Bandit

A simplified setting for studying exploration

Recap: MAB Setting

  • Simplified RL setting with no state and no transitions
  • \(\mathcal A=\{1,\dots,K\}\) \(K\) discrete actions ("arms")
  • Stochastic rewards \(r_t\sim r(a_t)\) with expectation \(\mathbb E[r(a)] = \mu_a\)
  • Finite time horizon \(T\in\mathbb Z_+\)

Multi-Armed Bandits

  • for \(t=1,2,...,T\)
    • take action \(a_t\in\{1,\dots, K\}\)
    • receive reward \(r_t\)
      • \(\mathbb E[r_t] = \mu_{a_t}\)

Explore-then-Commit

  1. Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
  2. For \(t=NK+1,...,T\):
        Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)

Upper Confidence Bound

For \(t=1,...,T\):

  • Pull \( a_t = \arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
  • Update empirical means \(\widehat \mu_{a,t}\) and counts \(N_{a,t}\)

Explore-then-Commit with \(N \approx T^{2/3}\) exploration pulls per arm achieves regret \(R(T) \lesssim T^{2/3}\).

UCB achieves regret \(R(T) \lesssim \sqrt{T}\).
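For concreteness, here is a minimal Python sketch of the UCB rule above. The bonus constant C, the Bernoulli reward model, and the seed are illustrative assumptions, not prescribed by the lecture.

```python
import numpy as np

def ucb_bandit(mu, T, C=2.0, seed=0):
    """Minimal UCB sketch for a K-armed Bernoulli bandit with means mu."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    counts = np.zeros(K)                               # N_{a,t}
    means = np.zeros(K)                                # empirical means \hat mu_{a,t}
    total = 0.0
    for t in range(T):
        bonus = np.sqrt(C / np.maximum(counts, 1e-12)) # huge bonus for unpulled arms
        a = int(np.argmax(means + bonus))              # pull arm with largest UCB
        r = float(rng.random() < mu[a])                # Bernoulli reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]         # running-average update
        total += r
    return total

print(ucb_bandit(mu=[0.3, 0.5, 0.7], T=5000))
```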

Recap: MAB

Agenda

1. Recap: MAB

2. Contextual Bandits

3. Linear Model

4. LinUCB

Motivation: Contextual Bandits

Example: online advertising

"Arms" are different job ads, e.g. Journalism vs. Programming.

But consider different users, e.g. a CS Major vs. an English Major: the best ad to show depends on who is viewing it.

Motivation: Contextual Bandits

Example: online shopping

"Arms" are various products

But what about search queries, browsing history, items in cart?

Motivation: Contextual Bandits

Example: social media feeds

"Arms" are various posts: images, videos

Personalized to each user based on demographics, behavioral data, etc.

Contextual Bandits Setting

  • Simplified RL setting with contexts in place of states, and no transitions
  • Contexts \(x_t\in\mathcal X\) drawn i.i.d. from distribution \(\mathcal D\in\Delta(\mathcal X)\)
  • \(\mathcal A=\{1,\dots,K\}\) \(K\) discrete actions ("arms")
  • Stochastic rewards \(r_t\sim r(x_t, a_t)\) with expectation \(\mathbb E[r(x, a)] = \mu_a(x)\)
  • Finite time horizon \(T\in\mathbb Z_+\)

Contextual Bandits

  • for \(t=1,2,...,T\)
    • observe context \(x_t\)
    • take action \(a_t\in\{1,\dots, K\}\)
    • receive reward \(r_t\) with \(\mathbb E[r_t] = \mu_{a_t}(x_t)\)
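A minimal sketch of this interaction protocol, assuming a small finite context set and a made-up table of means \(\mu_a(x)\); the uniform-random policy is only a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 2 contexts, 3 arms, and a table of means mu_a(x)
mu = {
    "cs_major":      np.array([0.2, 0.8, 0.4]),   # mu[x][a] = expected reward
    "english_major": np.array([0.7, 0.1, 0.4]),
}
contexts = list(mu)

T, K = 1000, 3
total_reward = 0.0
for t in range(T):
    x = contexts[rng.integers(len(contexts))]     # context drawn i.i.d. from D
    a = int(rng.integers(K))                      # placeholder policy: uniform random
    r = mu[x][a] + rng.normal(scale=0.1)          # noisy reward with E[r_t] = mu_a(x_t)
    total_reward += r

print(total_reward / T)
```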

Comparison

  • What is the difference between contextual bandits and an MDP? PollEV
    • State \(s\) vs. context \(x\)
    • Transition \(P\) and initial distribution \(\mu_0\) vs. context distribution \(\mathcal D\)
  • Contexts are memoryless: independent of previous contexts and unaffected by actions

Contextual Bandits

  • for \(t=1,2,...,T\)
    • observe context \(x_t\)
    • take action \(a_t\in\{1,\dots, K\}\)
    • receive reward \(r_t\) with \(\mathbb E[r_t] = \mu_{a_t}(x_t)\)

Optimal Policy and Regret

  • Goal: maximize cumulative reward $$  \mathbb E\left[\sum_{t=1}^T r(x_t, a_t)  \right] = \sum_{t=1}^T \mathbb E[\mu_{a_t}(x_t)]$$
  • Optimal policy \(\pi_\star(x) = \arg\max_{a\in\mathcal A} \mu_a(x)\)
  • Definition: The regret of an algorithm which chooses actions \(a_1,\dots,a_T\) is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t)  \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)], $$ where \(\mu_\star(x) = \max_{a\in\mathcal A} \mu_a(x)\)
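In simulation, where the means are known, the second expression for regret can be computed directly. A sketch, reusing the hypothetical \(\mu_a(x)\) table from the protocol sketch above:

```python
import numpy as np

# Hypothetical table of means mu_a(x): 2 contexts, 3 arms
mu = {"cs_major":      np.array([0.2, 0.8, 0.4]),
      "english_major": np.array([0.7, 0.1, 0.4])}

def expected_regret(history):
    """Sum of per-step gaps mu_*(x_t) - mu_{a_t}(x_t) over (context, action) pairs."""
    return sum(mu[x].max() - mu[x][a] for x, a in history)

# gaps: 0 (optimal arm), 0.6, 0.3
print(expected_regret([("cs_major", 1), ("cs_major", 0), ("english_major", 2)]))
```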

Agenda

1. Recap: MAB

2. Contextual Bandits

3. Linear Model

4. LinUCB

  • If contexts \(x_1\) and \(x_2\) are similar, expect similar actions to achieve high reward
  • Linear assumption: context \(x\in\mathbb R^d\) and $$\mathbb E[r(x,a)] = \mu_a(x) = \theta_a^\top x $$
  • Unknown parameters \(\theta_a\in\mathbb R^d\) for \(a\in[K]\)
  • Example: music artist recommendation
    • \(\theta_a\in\mathbb R^d\) represents attributes of artists
    • \(x \in\mathbb R^d\) represents a user's affinities (see the sketch below)
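A minimal sketch of sampling rewards under this linear model; the dimension \(d=2\), the particular \(\theta_a\) values, and the Gaussian noise are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical artist parameters theta_a (one row per arm), on (tempo, lyricism) axes
theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])

def pull(x, a, sigma=0.1):
    """Stochastic reward with mean theta_a^T x (Gaussian noise is an assumed model)."""
    return float(theta[a] @ x + sigma * rng.normal())

x = np.array([0.9, 0.2])            # a user's affinities for (tempo, lyricism)
print([round(pull(x, a), 3) for a in range(3)])
```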

Linear Reward Model

Figure: artist parameter and user context vectors plotted along tempo and lyricism axes.

  • In order to predict rewards, estimate \(\hat\theta_a\)
  • Observations so far make up data: \(\{x_k, a_k, r_k\}_{k=1}^t\)
  • Supervised learning problem: $$\hat\theta_a = \arg\min_\theta \sum_{k=1}^t \mathbf 1\{a_k=a \}(\theta^\top x_k - r_k)^2$$
  • Lemma: Assume that \(\{x_k\}_{k:a_k=a}\) span \(\mathbb R^d\). Then $$\hat\theta_a ={\underbrace{ \Big(\sum_{k:a_k=a} x_k x_k^\top \Big)}_{A}}^{-1}\underbrace{\sum_{k:a_k=a} x_k r_k}_b = A^{-1} b $$

Linear Regression

  • Proof: First take the gradient of the objective
    • \(\nabla_\theta \sum_{k:a_k=a} (\theta^\top x_k - r_k)^2 = 2 \sum_{k:a_k=a} x_k(\theta^\top x_k - r_k)\)
    • Setting it equal to zero at the minimum \(\hat \theta_a\) $$ \sum_{k:a_k=a} x_k x_k^\top \hat \theta_a = \sum_{k:a_k=a} x_k r_k \iff A\hat \theta_a = b$$
    • Under the spanning assumption, \(A\) is invertible
      • To see why, define \(X\) containing stacked contexts such that \(A=X^\top X\). Then the assumption \(\implies \mathrm{rank}(X)=d \implies \mathrm{rank}(X^\top X)=d  \)
    • Therefore, \(\hat \theta_a=A^{-1} b\)
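The lemma's closed form can be computed directly from logged data. A sketch, assuming arrays xs, acts, rs of contexts, actions, and rewards, and using np.linalg.solve rather than forming the inverse explicitly:

```python
import numpy as np

def fit_theta(xs, acts, rs, a):
    """Least-squares estimate of theta_a from logged (x_k, a_k, r_k) triples.

    xs: (t, d) contexts, acts: (t,) actions, rs: (t,) rewards.
    Assumes the contexts with a_k = a span R^d so that A is invertible.
    """
    mask = (acts == a)
    X = xs[mask]                      # contexts where arm a was pulled
    r = rs[mask]
    A = X.T @ X                       # sum_k x_k x_k^T
    b = X.T @ r                       # sum_k r_k x_k
    return np.linalg.solve(A, b)      # A^{-1} b without forming the inverse

# tiny synthetic check with a known theta_a (made-up numbers)
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -1.0])
xs = rng.normal(size=(50, 2))
acts = np.zeros(50, dtype=int)
rs = xs @ theta_true + 0.1 * rng.normal(size=50)
print(fit_theta(xs, acts, rs, a=0))   # should be close to [1, -1]
```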

Linear Regression

  • In order to predict rewards, estimate \(\hat\theta_a\)
  • Observations so far make up data: \(\{x_k, a_k, r_k\}_{k=1}^t\)
  • Supervised learning problem: $$\hat\theta_a = \Big(\sum_{k:a_k=a} x_k x_k^\top \Big)^{-1} \sum_{k:a_k=a} x_k r_k = A^{-1} b $$
  • The context covariance matrix is \(\Sigma = \mathbb E_{x\sim \mathcal D}[xx^\top]\)
  • The matrix \(A\) is related to the empirical covariance
    • \(\hat \Sigma = \frac{1}{N_a} \sum_{k:a_k=a} x_k x_k^\top \) approximates the expectation with an empirical average over the \(N_a\) pulls of arm \(a\)
    • The relationship is \(A = N_a \hat \Sigma\)

Linear Regression

Figure: observed contexts and ratings plotted along tempo and lyricism axes.

Example

  • Suppose 6 observed contexts come from two users:
    • User 1 (5x): context \(x=(1,0)\), loves fast songs, indifferent to lyrics
      • positive ratings (all \(1\))
    • User 2 (1x): context \(x=(0,1)\), loves lyrical songs, indifferent to tempo
      • negative ratings (all \(-1\))
  • \(A^{-1} = \begin{bmatrix}\frac{1}{5} & \\ & 1\end{bmatrix}\)
  • \(b=\begin{bmatrix} 5 \\ -1\end{bmatrix}\)
  • \(\hat\theta = A^{-1}b = \begin{bmatrix}1\\ -1\end{bmatrix}\)
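These numbers can be checked directly. The sketch below assumes User 1's context is \((1,0)\) and User 2's is \((0,1)\) on the (tempo, lyricism) axes:

```python
import numpy as np

# contexts: 5 copies of (1, 0) with reward +1, one (0, 1) with reward -1
X = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]])
r = np.array([1.0] * 5 + [-1.0])

A = X.T @ X                 # = diag(5, 1)
b = X.T @ r                 # = [5, -1]
theta_hat = np.linalg.solve(A, b)

print(np.linalg.inv(A))     # [[0.2, 0], [0, 1]]
print(b)                    # [ 5. -1.]
print(theta_hat)            # [ 1. -1.]
```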


Agenda

1. Recap: MAB

2. Contextual Bandits

3. Linear Model

4. LinUCB

LinUCB

  • An algorithm that adapts to linear confidence intervals
  • Need to keep track of:
    • \(A_{a,t} = \sum_{k=1}^t \mathbf 1\{a_k=a\}x_k x_k^\top \) and \(b_{a,t} = \sum_{k=1}^t \mathbf 1\{a_k=a\}r_k x_k \)
    • \(\hat\theta_{a,t} = A_{a,t}^{-1} b_{a,t} \)

LinUCB

  • Initialize 0 mean and infinite confidence intervals
  • for \(t=1,2,...,T\)
    • \(a_t=\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}\) # largest UCB
    • update \(A_{a,t+1}\), \(b_{a,t+1}\), and \(\hat \theta_{a,t+1}\)
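A minimal sketch of LinUCB run on a toy linear environment. The small ridge term lam (standing in for the "infinite confidence interval" initialization), the choice \(\alpha=1\), the Gaussian contexts, and the made-up \(\theta_a\) are all assumptions:

```python
import numpy as np

def linucb(theta_true, T=2000, alpha=1.0, lam=1e-3, noise=0.1, seed=0):
    """LinUCB sketch: K arms with unknown theta_a, contexts x_t ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    K, d = theta_true.shape
    A = np.stack([lam * np.eye(d) for _ in range(K)])   # A_a; small ridge ~ "infinite" CI
    b = np.zeros((K, d))                                 # b_a
    regret = 0.0
    for t in range(T):
        x = rng.normal(size=d)                           # observe context x_t
        ucb = np.empty(K)
        for a in range(K):
            A_inv = np.linalg.inv(A[a])
            theta_hat = A_inv @ b[a]
            ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
        a_t = int(np.argmax(ucb))                        # largest upper confidence bound
        r = theta_true[a_t] @ x + noise * rng.normal()   # noisy reward
        A[a_t] += np.outer(x, x)                         # update A_{a,t+1}
        b[a_t] += r * x                                  # update b_{a,t+1}
        regret += np.max(theta_true @ x) - theta_true[a_t] @ x
    return regret

theta_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # made-up parameters
print(linucb(theta_true))
```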

LinUCB

  • An algorithm that adapts to linear confidence intervals
  • Upper confidence bound is: $$\hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
  • Similar to UCB but with a mean and CI width that depend on the context \(x_t\)
  • Geometric intuition:
    • First term is large if \(x\) and \(\hat \theta\) are aligned
    • Second term is large if \(x\) is not aligned with much historical data $$x^\top A^{-1} x = x^\top (N \hat \Sigma)^{-1} x = \frac{1}{N} x^\top \hat \Sigma^{-1} x$$

Figure: the tempo/lyricism plot from before, now including a new user context \(x=(m,\ell)\).

Example

  • Suppose the observed contexts come from two users:
    • User 1 (5x): loves fast songs, indifferent to lyrics
      • positive ratings (all \(1\))
    • User 2 (1x): loves lyrical songs, indifferent to tempo
      • negative ratings (all \(-1\))
  • \(\hat\theta=\begin{bmatrix}\frac{1}{5} & \\ & 1\end{bmatrix}\begin{bmatrix} 5 \\ -1\end{bmatrix}= \begin{bmatrix}1\\ -1\end{bmatrix}\)
  • For a new user \(x=(m, \ell )\), the UCB (with \(\alpha=1\)) is:
    • \(m-\ell + \sqrt{m^2/5 + \ell^2}\)
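A quick numeric check of this expression, reusing \(A^{-1}\) and \(\hat\theta\) from the example (with \(\alpha=1\)):

```python
import numpy as np

A_inv = np.diag([1 / 5, 1.0])
theta_hat = np.array([1.0, -1.0])

def ucb(m, l, alpha=1.0):
    x = np.array([m, l])
    return theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)

# matches m - l + sqrt(m^2/5 + l^2)
for m, l in [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]:
    print(ucb(m, l), m - l + np.sqrt(m**2 / 5 + l**2))
```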


Statistical Derivation

  • We can derive the form of the confidence intervals more formally using statistics
  • Claim: With high probability (over noisy rewards) $$\theta_a^\top x \leq \hat \theta_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x} $$ where \(\alpha\) depends on the failure probability and the variance of the rewards
  • Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability at least}\quad 1-\frac{1}{\beta^2}$$

Statistical Derivation

  • Claim: With high probability (over noisy rewards) $$\theta_a^\top x \leq \hat \theta_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x} $$ where \(\alpha\) depends on the failure probability and the variance of the rewards
  • Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability at least}\quad 1-\frac{1}{\beta^2}$$
  • Proof of Claim: using Chebyshev's inequality, we show that w.h.p. $$|\theta_a^\top x-\hat \theta_a^\top x|\leq \alpha \sqrt{x^\top A_a^{-1} x} $$

Statistical Derivation

  • Lemma: (Chebyshev's inequality) For a random variable \(u\) with \(\mathbb E[u] = 0\), $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability at least}\quad 1-\frac{1}{\beta^2}$$
  • Proof of Claim: using Chebyshev's inequality, we show that w.h.p. $$|\underbrace{\theta_a^\top x-\hat \theta_a^\top x}_{u}|\leq \underbrace{\alpha \sqrt{x^\top A_a^{-1} x}}_{\beta\sqrt{\mathbb E[u^2]}\ \text{for}\ \alpha=\beta\sigma} $$
    1. Show that \(\mathbb E[u] = 0\)
    2. Compute variance  \(\mathbb E[u^2]\)

Statistical Derivation

  • Proof of Claim:
    1. Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
      • Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
      • \(\hat \theta_a = A_a^{-1} \sum_{k:a_k=a} (\theta_a^\top x_k + w_k) x_k \)
        • \( = A_a^{-1} \sum_{k:a_k=a} x_k x_k^\top \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
        • \( =  \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
      • \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x]=-\big(A_a^{-1} \sum_{k:a_k=a} \mathbb E[w_k] x_k\big)^\top x  =0\)
    2. Compute variance  \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]\)

Statistical Derivation

  • Proof of Claim:
    1. ✔ Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
      • Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
      • \(\hat \theta_a =  \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
    2. Compute variance  \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]\)
      • \( = \mathbb E[((\hat\theta_a-\theta_a)^\top x)^2] = \mathbb E[((A_a^{-1} \sum_k w_k x_k)^\top x)^2]\)
      • \(= x^\top A_a^{-1} \sum_k \sum_\ell \mathbb E[ w_k  w_\ell ] x_k x_\ell^\top A_a^{-1} x\)
        • \(\mathbb E[ w_k  w_\ell ]=0\) if \(k\neq \ell\), otherwise variance \(\sigma^2\)
      • \(= x^\top A_a^{-1} \sum_k\sigma^2 x_k x_k^\top A_a^{-1} x\)
      • \(=\sigma^2 x^\top A_a^{-1} x\)

Statistical Derivation

  • Proof of Claim:
    1. ✔ Show that \(\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0\)
      • Define \(w_k = r_k - \mathbb E[r_k]\) so \(r_k = \theta_{a_k}^\top x_k + w_k\)
      • \(\hat \theta_a =  \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k \)
    2. ✔ Compute variance  \(\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]=\sigma^2 x^\top A_a^{-1} x\)
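Both steps can be sanity-checked numerically: for a fixed set of contexts, the empirical mean of \(u=\theta_a^\top x-\hat\theta_a^\top x\) over noise draws should be near \(0\) and its variance near \(\sigma^2 x^\top A_a^{-1}x\). The contexts, \(\sigma\), and query \(x\) below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 2, 20, 0.5
theta = np.array([1.0, -1.0])
X = rng.normal(size=(n, d))           # fixed design: contexts where arm a was pulled
A_inv = np.linalg.inv(X.T @ X)
x = np.array([0.3, 0.7])              # query context

us = []
for _ in range(20000):
    w = sigma * rng.normal(size=n)    # zero-mean reward noise
    r = X @ theta + w
    theta_hat = A_inv @ (X.T @ r)     # least-squares estimate for this noise draw
    us.append(theta @ x - theta_hat @ x)
us = np.array(us)

print(us.mean())                              # ~ 0 (step 1)
print(us.var(), sigma**2 * x @ A_inv @ x)     # should be close (step 2)
```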

LinUCB

  • for \(t=1,2,...,T\)
    • \(a_t=\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}\) # largest UCB
    • update \(A_{a,t+1}\), \(b_{a,t+1}\), and \(\hat \theta_{a,t+1}\)

Recap

  • PSet released/due tonight

 

  • Contextual Bandits
  • Linear Model
  • LinUCB

 

  • Next lecture: Exploration in MDPs
