Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
• 5789 Paper Reviews due weekly on Mondays
• PSet 6 due tonight
• PSet 7 released tonight
• PA 4 due May 3
• Final exam is Saturday 5/13 at 2pm
• WICCxURMC Survey

## Agenda

1. Recap: MAB

2. Contextual Bandits

3. Linear Model

4. LinUCB

## Recap: Multi-Armed Bandit

A simplified setting for studying exploration

## Recap: MAB Setting

• Simplified RL setting with no state and no transitions
• $$K$$ discrete actions ("arms"): $$\mathcal A=\{1,\dots,K\}$$
• Stochastic rewards $$r_t\sim r(a_t)$$ with expectation $$\mathbb E[r(a)] = \mu_a$$
• Finite time horizon $$T\in\mathbb Z_+$$

Multi-Armed Bandits

• for $$t=1,2,...,T$$
• take action $$a_t\in\{1,\dots, K\}$$
• receive reward $$r_t$$
• $$\mathbb E[r_t] = \mu_{a_t}$$

Explore-then-Commit

1. Pull each arm $$N$$ times and compute empirical mean $$\widehat \mu_a$$
2. For $$t=NK+1,...,T$$:
Pull $$\widehat a^* = \arg\max_a \widehat \mu_a$$

Upper Confidence Bound

For $$t=1,...,T$$:

• Pull $$a_t = \arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$
• Update empirical means $$\widehat \mu_{a,t}$$ and counts $$N_{a,t}$$

Explore-then-Commit with $$N \approx T^{2/3}$$ achieves $$R(T) \lesssim T^{2/3}$$

UCB achieves $$R(T) \lesssim \sqrt{T}$$
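As a concrete check on the recap, here is a minimal UCB simulation. The Gaussian reward model, bonus constant $$C$$, and horizon are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_ucb(mu, T, C=2.0):
    """Run UCB on a Gaussian bandit with true means mu; return total reward."""
    K = len(mu)
    counts = np.zeros(K)
    means = np.zeros(K)
    total = 0.0
    for t in range(T):
        if t < K:
            a = t  # pull each arm once to initialize counts
        else:
            a = int(np.argmax(means + np.sqrt(C / counts)))  # largest UCB
        r = rng.normal(mu[a], 1.0)               # stochastic reward, mean mu[a]
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # running empirical mean
        total += r
    return total

total = run_ucb(np.array([0.2, 0.5, 0.9]), T=5000)
```

After an initial exploration phase, nearly all pulls go to the best arm (mean 0.9), so the average reward approaches the optimal mean.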

## Agenda

1. Recap: MAB

2. Contextual Bandits

3. Linear Model

4. LinUCB

## Motivation: Contextual Bandits

Example: recommending content on two topics, journalism and programming

But consider different users: a CS major and an English major will likely prefer different arms
## Motivation: Contextual Bandits

Example: online shopping

"Arms" are various products

But what about search queries, browsing history, items in cart?

## Motivation: Contextual Bandits

Example: social media feeds

"Arms" are various posts: images, videos

Personalized to each user based on demographics, behavioral data, etc

## Contextual Bandits Setting

• Simplified RL setting with context instead of state or transitions
• Contexts $$x_t\in\mathcal X$$ drawn i.i.d. from distribution $$\mathcal D\in\Delta(\mathcal X)$$
• $$K$$ discrete actions ("arms"): $$\mathcal A=\{1,\dots,K\}$$
• Stochastic rewards $$r_t\sim r(x_t, a_t)$$ with expectation $$\mathbb E[r(x, a)] = \mu_a(x)$$
• Finite time horizon $$T\in\mathbb Z_+$$

Contextual Bandits

• for $$t=1,2,...,T$$
• observe context $$x_t$$
• take action $$a_t\in\{1,\dots, K\}$$
• receive reward $$r_t$$ with $$\mathbb E[r_t] = \mu_{a_t}(x_t)$$

## Comparison

• What is the difference between contextual bandits and an MDP?
• State $$s$$ vs. context $$x$$
• Transition distribution $$P$$ and initial distribution $$\mu_0$$ vs. context distribution $$\mathcal D$$
• Contexts are memoryless: independent of previous contexts and unaffected by actions

Contextual Bandits

• for $$t=1,2,...,T$$
• observe context $$x_t$$
• take action $$a_t\in\{1,\dots, K\}$$
• receive reward $$r_t$$ with $$\mathbb E[r_t] = \mu_{a_t}(x_t)$$

## Optimal Policy and Regret

• Goal: maximize cumulative reward $$\mathbb E\left[\sum_{t=1}^T r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu_{a_t}(x_t)]$$
• Optimal policy $$\pi_\star(x) = \arg\max_{a\in\mathcal A} \mu_a(x)$$
• Definition: The regret of an algorithm which chooses actions $$a_1,\dots,a_T$$ is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t) \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)]$$
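The regret definition can be made concrete with a toy problem. The two-context, two-arm setup below is an illustrative assumption:

```python
import numpy as np

# Toy contextual bandit: 2 contexts, 2 arms; mu[a, x] = expected reward
mu = np.array([[1.0, 0.0],    # arm 0 is best in context 0
               [0.0, 1.0]])   # arm 1 is best in context 1

rng = np.random.default_rng(1)
contexts = rng.integers(0, 2, size=1000)   # x_t drawn i.i.d.

# Regret of the context-blind policy "always pull arm 0" vs. pi_star,
# which picks the best arm for each observed context
regret = sum(mu[:, x].max() - mu[0, x] for x in contexts)
```

Every round with context 1 contributes a gap of 1, so this context-blind policy suffers regret linear in $$T$$; any fixed arm does, which is why the comparator must be the context-dependent optimal policy.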

## Agenda

1. Recap: MAB

2. Contextual Bandits

3. Linear Model

4. LinUCB

• If contexts $$x_1$$ and $$x_2$$ are similar, expect similar actions to achieve high reward
• Linear assumption: context $$x\in\mathbb R^d$$ and $$\mathbb E[r(x,a)] = \mu_a(x) = \theta_a^\top x$$
• Unknown parameters $$\theta_a\in\mathbb R^d$$ for $$a\in[K]$$
• Example: music artist recommendation
• $$\theta_a\in\mathbb R^d$$ represents attributes of artists
• $$x \in\mathbb R^d$$ represents the user's affinities

## Linear Reward Model

(Figure: contexts and parameters plotted in 2D, with axes "tempo" and "lyricism".)

• In order to predict rewards, estimate $$\hat\theta_a$$
• Observations so far make up data: $$\{x_k, a_k, r_k\}_{k=1}^t$$
• Supervised learning problem: $$\hat\theta_a = \arg\min_\theta \sum_{k=1}^t \mathbf 1\{a_k=a \}(\theta^\top x_k - r_k)^2$$
• Lemma: Assume that $$\{x_k\}_{k:a_k=a}$$ span $$\mathbb R^d$$. Then $$\hat\theta_a ={\underbrace{ \Big(\sum_{k:a_k=a} x_k x_k^\top \Big)}_{A}}^{-1}\underbrace{\sum_{k:a_k=a} x_k r_k}_b = A^{-1} b$$
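The lemma's closed form can be checked against an off-the-shelf least-squares solver. The dimensions, true parameter, and noise level below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(50, 3))                    # contexts where arm a was pulled
r = X @ theta_true + 0.1 * rng.normal(size=50)  # noisy observed rewards

A = X.T @ X                         # sum of outer products x_k x_k^T
b = X.T @ r                         # sum of x_k r_k
theta_hat = np.linalg.solve(A, b)   # A^{-1} b, via a solve rather than an explicit inverse
```

`np.linalg.lstsq(X, r)` returns the same minimizer, and with 50 samples and small noise the estimate lands close to `theta_true`.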

## Linear Regression

• Proof: First take the gradient of the objective
• $$\nabla_\theta \sum_{k:a_k=a} (\theta^\top x_k - r_k)^2 = 2 \sum_{k:a_k=a} x_k(\theta^\top x_k - r_k)$$
• Setting it equal to zero at the minimum $$\hat \theta_a$$ $$\sum_{k:a_k=a} x_k x_k^\top \hat \theta_a = \sum_{k:a_k=a} x_k r_k \iff A\hat \theta_a = b$$
• Under the spanning assumption, $$A$$ is invertible
• To see why, define $$X$$ containing stacked contexts such that $$A=X^\top X$$. Then the assumption $$\implies \mathrm{rank}(X)=d \implies \mathrm{rank}(X^\top X)=d$$
• Therefore, $$\hat \theta_a=A^{-1} b$$

## Linear Regression

• In order to predict rewards, estimate $$\hat\theta_a$$
• Observations so far make up data: $$\{x_k, a_k, r_k\}_{k=1}^t$$
• Supervised learning problem: $$\hat\theta_a = \Big(\sum_{k:a_k=a} x_k x_k^\top \Big)^{-1} \sum_{k:a_k=a} x_k r_k = A^{-1} b$$
• The context covariance matrix is $$\Sigma = \mathbb E_{x\sim \mathcal D}[xx^\top]$$
• The matrix $$A$$ is related to the empirical covariance
• $$\hat \Sigma = \frac{1}{N_a} \sum_{k:a_k=a} x_k x_k^\top$$ approximates expectation with sum
• The relationship is $$A = N_a \hat \Sigma$$


## Example

• Suppose 6 observed contexts come from two users:
• User 1 (5x): loves fast songs, indifferent to lyrics: context $$x=(1,0)$$
• positive ratings (all $$1$$)
• User 2 (1x): loves lyrical songs, indifferent to tempo: context $$x=(0,1)$$
• negative ratings (all $$-1$$)
• $$A^{-1} = \begin{bmatrix}\frac{1}{5} & \\ & 1\end{bmatrix}$$
• $$b=\begin{bmatrix} 5 \\ -1\end{bmatrix}$$
• $$\hat\theta = \begin{bmatrix}1\\ -1\end{bmatrix}$$
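The numbers in this example can be reproduced directly. Taking User 1's context as $$(1,0)$$ and User 2's as $$(0,1)$$, consistent with the $$A$$ and $$b$$ above:

```python
import numpy as np

# 5 observations of User 1 at x=(1,0) rated +1; one of User 2 at x=(0,1) rated -1
X = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]])
r = np.array([1.0] * 5 + [-1.0])

A = X.T @ X                         # diag(5, 1)
b = X.T @ r                         # (5, -1)
theta_hat = np.linalg.solve(A, b)   # (1, -1)
```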


## Agenda

1. Recap: MAB

2. Contextual Bandits

3. Linear Model

4. LinUCB

## LinUCB

• An adaptation of UCB using confidence intervals for the linear reward model
• Need to keep track of:
• $$A_{a,t} = \sum_{k=1}^t \mathbf 1\{a_k=a\}x_k x_k^\top$$ and $$b_{a,t} = \sum_{k=1}^t \mathbf 1\{a_k=a\}r_k x_k$$
• $$\hat\theta_{a,t} = A_{a,t}^{-1} b_{a,t}$$

LinUCB

• Initialize 0 mean and infinite confidence intervals
• for $$t=1,2,...,T$$
• $$a_t=\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$ # largest UCB
• update $$A_{a,t+1}$$, $$b_{a,t+1}$$, and $$\hat \theta_{a,t+1}$$
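The pseudocode above can be sketched as a small class. The ridge term `lam` is an assumption standing in for the "infinite confidence intervals" initialization (it keeps each $$A_a$$ invertible before the contexts span $$\mathbb R^d$$), and the two-arm toy run at the bottom is purely illustrative:

```python
import numpy as np

class LinUCB:
    """Sketch of LinUCB with per-arm least-squares estimates."""

    def __init__(self, K, d, alpha=1.0, lam=1e-3):
        self.alpha = alpha
        self.A = [lam * np.eye(d) for _ in range(K)]  # sum of x x^T per arm
        self.b = [np.zeros(d) for _ in range(K)]      # sum of r x per arm

    def act(self, x):
        ucbs = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                       # theta_hat = A^{-1} b
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))                   # largest UCB

    def update(self, x, a, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x

# Toy run: two arms whose true parameters point along opposite axes
rng = np.random.default_rng(0)
theta_true = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
alg = LinUCB(K=2, d=2)
for t in range(500):
    x = rng.normal(size=2)
    a = alg.act(x)
    r = theta_true[a] @ x + 0.1 * rng.normal()
    alg.update(x, a, r)
```

After 500 rounds, both arms have been pulled on diverse contexts and each per-arm estimate $$A_a^{-1} b_a$$ is close to its true parameter.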

## LinUCB

• An adaptation of UCB using confidence intervals for the linear reward model
• Upper confidence bound is: $$\hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
• Similar to UCB but with a mean and CI width that depend on the context $$x_t$$
• Geometric intuition:
• First term is large if $$x$$ and $$\hat \theta$$ are aligned
• Second term is large if $$x$$ is not aligned with much historical data: $$x^\top A^{-1} x = x^\top (N \hat \Sigma)^{-1} x = \frac{1}{N} x^\top \hat \Sigma^{-1} x$$


## Example

• Suppose the observed contexts come from two users:
• User 1 (5x): loves fast songs, indifferent to lyrics: context $$x=(1,0)$$
• positive ratings (all $$1$$)
• User 2 (1x): loves lyrical songs, indifferent to tempo: context $$x=(0,1)$$
• negative ratings (all $$-1$$)
• $$\hat\theta=\begin{bmatrix}\frac{1}{5} & \\ & 1\end{bmatrix}\begin{bmatrix} 5 \\ -1\end{bmatrix}= \begin{bmatrix}1\\ -1\end{bmatrix}$$
• For new user $$x=(m, \ell )$$, UCB:
• $$m-\ell + \sqrt{m^2/5 + \ell^2}$$
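Plugging the example's numbers into the UCB (taking $$\alpha = 1$$) shows how the width trades off against the mean estimate:

```python
import numpy as np

A_inv = np.diag([1 / 5, 1.0])        # from the example above
theta_hat = np.array([1.0, -1.0])

def ucb(x, alpha=1.0):
    """UCB = mean estimate + alpha * confidence width, m - l + sqrt(m^2/5 + l^2)."""
    return theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)

u_tempo = ucb(np.array([1.0, 0.0]))  # 1 + sqrt(1/5)
u_lyric = ucb(np.array([0.0, 1.0]))  # -1 + 1 = 0
```

For the pure lyrics lover, the single past observation leaves a wide interval (width 1) that exactly offsets the negative mean estimate here, so the algorithm is not yet ready to write off lyrical songs.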


## Statistical Derivation

• We can derive the form of the confidence intervals more formally using statistics
• Claim: With high probability (over noisy rewards) $$\theta_a^\top x \leq \hat \theta_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x}$$ where $$\alpha$$ depends on the failure probability and the variance of the rewards
• Lemma (Chebyshev's inequality): For a random variable $$u$$ with $$\mathbb E[u] = 0$$, $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability at least}\quad 1-\frac{1}{\beta^2}$$

## Statistical Derivation

• Claim: With high probability (over noisy rewards) $$\theta_a^\top x \leq \hat \theta_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x}$$ where $$\alpha$$ depends on the failure probability and the variance of the rewards
• Lemma (Chebyshev's inequality): For a random variable $$u$$ with $$\mathbb E[u] = 0$$, $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability at least}\quad 1-\frac{1}{\beta^2}$$
• Proof of Claim: using Chebyshev's inequality we show that w.h.p. $$|\theta_a^\top x-\hat \theta_a^\top x|\leq \alpha \sqrt{x^\top A_a^{-1} x}$$

## Statistical Derivation

• Lemma (Chebyshev's inequality): For a random variable $$u$$ with $$\mathbb E[u] = 0$$, $$|u|\leq \beta \sqrt{\mathbb E[u^2]}\quad\text{with probability at least}\quad 1-\frac{1}{\beta^2}$$
• Proof of Claim: using Chebyshev's inequality we show that w.h.p. $$|\underbrace{\theta_a^\top x-\hat \theta_a^\top x}_{u}|\leq \alpha\underbrace{ \sqrt{x^\top A_a^{-1} x}}_{\propto\sqrt{\mathbb E[u^2]}}$$
1. Show that $$\mathbb E[u] = 0$$
2. Compute variance  $$\mathbb E[u^2]$$

## Statistical Derivation

• Proof of Claim:
1. Show that $$\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0$$
• Define $$w_k = r_k - \mathbb E[r_k]$$ so $$r_k = \theta_{a_k}^\top x_k + w_k$$
• $$\hat \theta_a = A_a^{-1} \sum_{k:a_k=a} (\theta_a^\top x_k + w_k) x_k$$
• $$= A_a^{-1} \sum_{k:a_k=a} x_k x_k^\top \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k$$
• $$= \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k$$
• $$\mathbb E[\theta_a^\top x-\hat \theta_a^\top x]=(A_a^{-1} \sum_{k:a_k=a} \mathbb E[w_k] x_k)^\top x =0$$
2. Compute variance  $$\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]$$

## Statistical Derivation

• Proof of Claim:
1. ✔ Show that $$\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0$$
• Define $$w_k = r_k - \mathbb E[r_k]$$ so $$r_k = \theta_{a_k}^\top x_k + w_k$$
• $$\hat \theta_a = \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k$$
2. Compute variance  $$\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]$$
• $$= \mathbb E[((\hat\theta_a-\theta_a)^\top x)^2] = \mathbb E[((A_a^{-1} \sum_k w_k x_k)^\top x)^2]$$
• $$= x^\top A_a^{-1} \sum_k \sum_\ell \mathbb E[ w_k w_\ell ] x_k x_\ell^\top A_a^{-1} x$$
• $$\mathbb E[ w_k w_\ell ]=0$$ if $$k\neq \ell$$, otherwise variance $$\sigma^2$$
• $$= x^\top A_a^{-1} \sum_k\sigma^2 x_k x_k^\top A_a^{-1} x$$
• $$=\sigma^2 x^\top A_a^{-1} x$$
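The variance formula can be verified by Monte Carlo over the reward noise, holding the design (the past contexts for arm $$a$$) fixed. The dimensions, query context, and $$\sigma$$ below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
theta = np.array([1.0, -1.0])
X = rng.normal(size=(20, 2))      # fixed past contexts for arm a
A_inv = np.linalg.inv(X.T @ X)
x = np.array([0.5, 2.0])          # query context

# Sample theta_hat many times under fresh zero-mean reward noise w
errs = []
for _ in range(20000):
    w = sigma * rng.normal(size=20)
    theta_hat = A_inv @ X.T @ (X @ theta + w)   # least-squares estimate
    errs.append((theta - theta_hat) @ x)        # the random variable u
errs = np.array(errs)

predicted_var = sigma**2 * (x @ A_inv @ x)      # sigma^2 x^T A^{-1} x
```

The empirical mean of the errors is near zero (step 1) and their empirical variance matches `predicted_var` to within Monte Carlo error (step 2).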

## Statistical Derivation

• Proof of Claim:
1. ✔ Show that $$\mathbb E[\theta_a^\top x-\hat \theta_a^\top x] = 0$$
• Define $$w_k = r_k - \mathbb E[r_k]$$ so $$r_k = \theta_{a_k}^\top x_k + w_k$$
• $$\hat \theta_a = \theta_a + A_a^{-1} \sum_{k:a_k=a} w_k x_k$$
2. ✔ Compute variance  $$\mathbb E[(\theta_a^\top x-\hat \theta_a^\top x)^2]=\sigma^2 x^\top A_a^{-1} x$$

LinUCB

• for $$t=1,2,...,T$$
• $$a_t=\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \alpha \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$ # largest UCB
• update $$A_{a,t+1}$$, $$b_{a,t+1}$$, and $$\hat \theta_{a,t+1}$$

## Recap

• PSet released/due tonight

• Contextual Bandits
• Linear Model
• LinUCB

• Next lecture: Exploration in MDPs

By Sarah Dean
