Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
  • 5789 Paper Reviews due weekly on Mondays
  • PSet 6 due next Monday
  • Final PA released later this week
• Final exam is Saturday 5/13 at 2pm

## Agenda

1. Multi-Armed Bandits

2. Explore-then-Commit

3. UCB Algorithm

4. UCB Analysis

## Multi-Armed Bandit

A simplified setting for studying exploration

Multi-Armed Bandits

• for $$t=1,2,...,T$$:
  • take action $$a_t\in\{1,\dots, K\}$$
  • receive reward $$r_t$$ with $$\mathbb E[r_t] = \mu_{a_t}$$

## MAB Setting

• Simplified RL setting with no state and no transitions
• $$\mathcal A=\{1,\dots,K\}$$: $$K$$ discrete actions ("arms")
• Stochastic rewards $$r_t\sim r(a_t)$$ with expectation $$\mathbb E[r(a)] = \mu_a$$
• Finite time horizon $$T\in\mathbb Z_+$$

Multi-Armed Bandits

• for $$t=1,2,...,T$$:
  • take action $$a_t\in\{1,\dots, K\}$$
  • receive reward $$r_t$$ with $$\mathbb E[r_t] = \mu_{a_t}$$
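As a concrete picture of this loop, here is a minimal sketch in Python, assuming Bernoulli rewards and a placeholder uniformly random action rule; the means in `mu` are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])    # hypothetical arm means (K = 3)
T = 1000

rewards = []
for t in range(T):
    a = rng.integers(len(mu))     # placeholder policy: uniformly random arm
    r = rng.binomial(1, mu[a])    # stochastic reward with E[r] = mu[a]
    rewards.append(r)

print(np.mean(rewards))           # ~ average of mu under random pulls (~0.5)
```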

## Optimal Action and Regret

• Goal: maximize cumulative reward $$\mathbb E\left[\sum_{t=1}^T r(a_t) \right] = \sum_{t=1}^T \mu_{a_t}$$
• Optimal action $$a_\star = \arg\max_{a\in\mathcal A} \mu_a$$
• Definition: The regret of an algorithm which chooses actions $$a_1,\dots,a_T$$ is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a_\star)-r(a_t) \right] = \sum_{t=1}^T \mu_\star - \mu_{a_t}$$
• Good algorithms have sublinear regret: $$R(T)\lesssim T^p\quad\text{for some}\quad p<1 \qquad \left(\implies \lim_{T\to\infty} \frac{1}{T} R(T) = 0\right)$$
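Because the expectation removes the reward noise, regret depends only on the means of the pulled arms. A small sketch with hypothetical means and an arbitrary action sequence:

```python
import numpy as np

mu = np.array([0.3, 0.5, 0.7])        # hypothetical arm means
actions = np.array([0, 2, 1, 2, 2])   # actions chosen by some algorithm
mu_star = mu.max()                    # mu_star = 0.7 (arm 2 is optimal)

# R(T) = sum_t (mu_star - mu_{a_t})
regret = np.sum(mu_star - mu[actions])
print(regret)                         # 0.4 + 0.2 = 0.6 (up to float error)
```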

## Agenda

1. Multi-Armed Bandits

2. Explore-then-Commit

3. UCB Algorithm

4. UCB Analysis

## Explore-then-Commit

• First attempt: a simple algorithm that balances exploration and exploitation into two phases

Explore-then-Commit

• for $$t=1,2,...,N\cdot K$$:
  • $$a_t=((t-1)\bmod K)+1$$, store $$r_t$$       # try each arm $$N$$ times
• $$\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{K(i-1)+a}$$                   # average reward per arm
• for $$t=NK+1,\dots,T$$:
  • $$a_t=\arg\max_{a\in[K]} \hat \mu_a = \hat a_\star$$ # commit to best
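A runnable sketch of ETC under the same Bernoulli-reward assumption (0-indexed arms; `explore_then_commit` is a name chosen here, not from the slides):

```python
import numpy as np

def explore_then_commit(mu, T, N, rng):
    """Sketch of ETC with Bernoulli rewards and 0-indexed arms."""
    K = len(mu)
    # Exploration phase: pull each arm N times, round-robin.
    explore = np.arange(N * K) % K
    rewards = rng.binomial(1, mu[explore]).astype(float)
    # Empirical mean of each arm over its N exploration pulls.
    mu_hat = np.array([rewards[explore == a].mean() for a in range(K)])
    # Commit phase: play the empirically best arm for the rest.
    a_hat = int(np.argmax(mu_hat))
    return np.concatenate([explore, np.full(T - N * K, a_hat)]), mu_hat

rng = np.random.default_rng(0)
pulls, mu_hat = explore_then_commit(np.array([0.3, 0.5, 0.7]), 1000, 20, rng)
```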

## Explore-then-Commit

• How to set $$N$$?
• The regret decomposes: $$R(T) =\sum_{t=1}^T \mu_\star - \mu_{a_t} = \underbrace{\sum_{t=1}^{NK} \mu_\star - \mu_{a_t}}_{R_1} + \underbrace{\sum_{t=NK+1}^T \mu_\star - \mu_{\hat a_\star}}_{R_2}$$
• Assuming that rewards are bounded, $$r_t\in[0,1]$$, each exploration step incurs regret at most 1, so $$R_1+R_2 \leq NK + T (\mu_\star - \mu_{\hat a_\star})$$

## Confidence intervals

• We derived confidence intervals using Hoeffding's bound ($$c$$ is a constant): with probability at least $$1-\delta$$, for every arm $$a$$, $$\mu_{a} \in\left[ \hat \mu_{a} \pm c\sqrt{\frac{\log(K/\delta)}{N}}\right]$$
• Sub-optimality is then bounded with high probability by the width of the confidence intervals: $$\mu_\star - \mu_{\hat a_\star}\lesssim \sqrt{\frac{\log(K/\delta)}{N}}$$
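For concreteness, here is the explicit Hoeffding width for rewards in $$[0,1]$$ with a union bound over the $$K$$ arms (the factor 2 inside the log and the 1/2 are absorbed into the constant $$c$$ on the slide):

```python
import numpy as np

def ci_width(N, K, delta):
    """Hoeffding: P(|mu_hat_a - mu_a| > eps) <= 2 exp(-2 N eps^2); a union
    bound over K arms makes all intervals of this width hold w.p. 1 - delta."""
    return np.sqrt(np.log(2 * K / delta) / (2 * N))

print(ci_width(N=100, K=10, delta=0.05))   # ~0.17
```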

## Explore-then-Commit

• How to set $$N$$?
• $$R(T) =R_1+R_2 \leq NK + T (\mu^\star - \mu_{\hat a_\star})$$
• $$\lesssim NK + T \sqrt{\frac{\log(K/\delta)}{N}}$$ with probability $$1-\delta$$
• Minimizing with respect to $$N$$
• set derivative to zero: $$K - T c\sqrt{\frac{\log(K/\delta)}{4N^3}}=0$$
• Regret-minimizing choice: $$N=\left (\frac{cT}{2K}\sqrt{\log(K/\delta)}\right)^{2/3}$$
• Results in $$R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$$
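A quick numeric check of this optimization, taking $$c=1$$ for illustration: brute-force minimization of the bound over $$N$$ recovers the closed-form choice.

```python
import numpy as np

T, K, delta, c = 10_000, 10, 0.05, 1.0   # c = 1 is an arbitrary choice here

N = np.arange(1, T // K)
bound = N * K + c * T * np.sqrt(np.log(K / delta) / N)    # NK + T*sqrt(log/N)
N_formula = (c * T / (2 * K) * np.sqrt(np.log(K / delta))) ** (2 / 3)
print(N[np.argmin(bound)], N_formula)    # ~110 vs ~109.8: they agree
```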

## Explore-then-Commit

• Theorem: For $$N\propto \left((T/K)\sqrt{\log(K/\delta)}\right)^{2/3}$$, the regret of ETC is bounded with probability $$1-\delta$$: $$R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$$

Explore-then-Commit

• for $$t=1,2,...,N\cdot K$$:
  • $$a_t=((t-1)\bmod K)+1$$, store $$r_t$$       # try each arm $$N$$ times
• $$\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{K(i-1)+a}$$                   # average reward per arm
• for $$t=NK+1,\dots,T$$:
  • $$a_t=\arg\max_{a\in[K]} \hat \mu_a = \hat a_\star$$ # commit to best

## Agenda

1. Multi-Armed Bandits

2. Explore-then-Commit

3. UCB Algorithm

4. UCB Analysis

## Upper Confidence Bound

• An algorithm that adapts to confidence intervals
• Idea: Pull the arm with the highest upper confidence bound
• Principle of optimism in the face of uncertainty

UCB

• Initialize $$\hat \mu_{a,1}$$ and $$N_{a,1}$$ for $$a\in[K]$$
• for $$t=1,2,...,T$$:
  • $$a_t=\arg\max_{a\in[K]} \hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}$$ # pull arm with largest UCB
  • update $$N_{a_t,t+1}$$ and $$\hat\mu_{a_t,t+1}$$
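A minimal sketch of UCB in Python, again assuming Bernoulli rewards; arms with $$N_{a,t}=0$$ get an infinite bonus, so every arm is pulled once before any bonus is finite.

```python
import numpy as np

def ucb(mu, T, delta, rng):
    """Sketch of UCB with Bernoulli rewards and 0-indexed arms."""
    K = len(mu)
    counts = np.zeros(K)   # N_{a,t}: number of pulls of each arm so far
    sums = np.zeros(K)     # running sum of rewards per arm
    pulls = []
    for t in range(T):
        with np.errstate(divide="ignore"):
            bonus = np.sqrt(np.log(K * T / delta) / counts)  # inf if unpulled
        mu_hat = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        a = int(np.argmax(mu_hat + bonus))   # largest upper confidence bound
        r = rng.binomial(1, mu[a])
        counts[a] += 1
        sums[a] += r
        pulls.append(a)
    return np.array(pulls)

rng = np.random.default_rng(0)
pulls = ucb(np.array([0.3, 0.5, 0.7]), T=1000, delta=0.05, rng=rng)
```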

## Upper Confidence Bound

UCB

• Initialize $$\hat \mu_{a,1}$$ and $$N_{a,1}$$ for $$a\in[K]$$
• for $$t=1,2,...,T$$:
  • $$a_t=\arg\max_{a\in[K]} \hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}$$ # pull arm with largest UCB
  • update $$N_{a_t,t+1}$$ and $$\hat\mu_{a_t,t+1}$$
• number of pulls per arm:
$$N_{a,t} = \sum_{k=1}^{t-1} \mathbf 1\{a_k=a\}$$
• average reward per arm:
$$\hat \mu_{a,t} = \frac{1}{N_{a,t}} \sum_{k=1}^{t-1} \mathbf 1\{a_k=a\} r_k$$
• upper confidence bound:
$$u_{a,t} =\hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}$$

## UCB Intuition

• Why does it work?
• Principle of optimism in the face of uncertainty
• Two reasons to pull an arm:
1. large confidence interval (explore)
2. a good arm (exploit)
• Two outcomes from acting optimistically:
1. we were correct $$\rightarrow$$ high reward
2. we were wrong $$\rightarrow$$ adjust estimates

## Agenda

1. Multi-Armed Bandits

2. Explore-then-Commit

3. UCB Algorithm

4. UCB Analysis

## Sub-optimality at $$t$$

$$\mu_\star - \mu_{a_t}$$

• $$\leq u_{a_\star, t} - \mu_{a_t}$$ (confidence bounds valid w.h.p., so $$\mu_\star \leq u_{a_\star,t}$$)
• $$\leq u_{a_t, t} - \mu_{a_t}$$ (since $$a_t$$ maximizes the upper confidence bound)
• $$= \hat \mu_{a_t, t} + \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} - \mu_{a_t}$$ (definition of $$u_{a_t,t}$$)
• $$\leq \hat \mu_{a_t, t} + \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} - \left(\hat \mu_{a_t, t} - \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} \right)$$ (lower confidence bound, w.h.p.)
• $$= 2\sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}}$$

(Figure: confidence intervals around $$\hat\mu_{a,t}$$ for each arm, highlighting $$a_t$$ and $$a_\star$$.)

Claim: the sub-optimality at $$t$$ is bounded by the width of $$a_t$$'s confidence interval

## Sublinear Regret

• Regret is cumulative sub-optimality
• $$R(T) = \sum_{t=1}^T \mu_\star - \mu_{a_t}$$
• $$\leq \sum_{t=1}^T 2\sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}}$$
• $$= 2\sqrt {\log(KT/\delta)} \sum_{t=1}^T \sqrt{1/{N_{a_t,t}}}$$
• Claim: Since we only pull one arm per round, $$\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2K\sqrt{T}$$
• Putting it all together, $$R(T) \lesssim K\sqrt {T \log(KT/\delta) }$$

## Proof of Claim

• Claim: since one arm per round, $$\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2 K\sqrt{T}$$
• Proof: $$\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}}$$
• $$=\sum_{t=1}^T \sum_{a=1}^K \mathbf 1\{a_t=a\} \sqrt{1/{N_{a,t}}}$$
• $$=\sum_{a=1}^K \sum_{t=1}^T \mathbf 1\{a_t=a\} \sqrt{1/{N_{a,t}}}$$ switching order
• $$=\sum_{a=1}^K \sum_{n=1}^{N_{a,T}} \sqrt{1/n}$$ since $$N_{a,t}$$ increments by 1
• $$\leq \sum_{a=1}^K \sum_{n=1}^{T} \sqrt{1/n}$$ since at most $$T$$ pulls
• $$= K \sum_{n=1}^{T} \sqrt{1/n}$$
• $$\leq K\left(1+\int_{x=1}^T\sqrt{1/x} dx\right )$$ integral bounds sum
• $$= K\left(1+2\sqrt{T} - 2\sqrt{1} \right )$$
• $$\leq 2K\sqrt{T}$$
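A quick numeric sanity check of the integral bound used in the last steps:

```python
import numpy as np

# Check: sum_{n=1}^T sqrt(1/n) <= 1 + (2 sqrt(T) - 2) <= 2 sqrt(T).
for T in [1, 10, 100, 10_000]:
    lhs = np.sum(1 / np.sqrt(np.arange(1, T + 1)))
    print(T, round(lhs, 2), round(2 * np.sqrt(T), 2), lhs <= 2 * np.sqrt(T))
```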

## Tighter bound

• Claim: since one arm per round, $$\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2\sqrt{KT}$$
• Proof: $$\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}}$$
• $$=\sum_{a=1}^K \sum_{n=1}^{N_{a,T}} \sqrt{1/n}$$  same as previous
• $$\leq \sum_{a=1}^K 2\sqrt{N_{a,T}}$$ summation trick: $$\sum_{n=1}^M \sqrt{1/n}\leq 2\sqrt M$$
• $$= 2K \cdot \frac{1}{K} \sum_{a=1}^K \sqrt{N_{a,T}}$$
• $$\leq 2K \cdot \sqrt{\frac{1}{K} \sum_{a=1}^K N_{a,T}}$$ Jensen's inequality
• $$= 2K \cdot \sqrt{\frac{T}{K} } = 2\sqrt{KT}$$ at most $$T$$ total pulls
• This sharpens the regret bound to $$R(T) \lesssim \sqrt{KT\log(KT/\delta)}$$

Explore-then-Commit

1. Pull each arm $$N$$ times and compute empirical mean $$\widehat \mu_a$$
2. For $$t=NK+1,...,T$$:
Pull $$\hat a_\star = \arg\max_a \widehat \mu_a$$

Upper Confidence Bound

For $$t=1,...,T$$:

• Pull $$a_t = \arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$
• Update empirical means $$\widehat \mu_{a,t}$$ and counts $$N_{a,t}$$

• Explore-then-Commit: explore for $$N \approx T^{2/3}$$, giving $$R(T) \lesssim T^{2/3}$$
• Upper Confidence Bound: $$R(T) \lesssim \sqrt{T}$$
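To see the two rates side by side, one can reuse the `explore_then_commit` and `ucb` sketches from earlier (hypothetical helpers, Bernoulli rewards) and compare cumulative regret:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])
T, K, delta = 10_000, 3, 0.05

N = int(((T / K) * np.sqrt(np.log(K / delta))) ** (2 / 3))  # ETC: N ~ T^(2/3)
etc_pulls, _ = explore_then_commit(mu, T, N, rng)
ucb_pulls = ucb(mu, T, delta, rng)

for name, pulls in [("ETC", etc_pulls), ("UCB", ucb_pulls)]:
    print(name, np.sum(mu.max() - mu[pulls]))   # cumulative (pseudo-)regret
```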

## Preview: Contextual Bandits

(Figure: two arms, "Journalism" and "Programming".)

But consider different users: a CS Major and an English Major may prefer different arms.

## Preview: Contextual Bandits

Example: online shopping

"Arms" are various products

But what about search queries, browsing history, items in cart?

## Preview: Contextual Bandits

Example: social media feeds

"Arms" are various posts: images, videos

Personalized to each user based on demographics, behavioral data, etc.

## Preview: Contextual Bandits

• The best action will depend on the context
• e.g. major, browsing history, demographics
• Thus we need a policy for mapping context to action

## Recap

• PSet released tonight

• Explore-then-Commit
• Upper Confidence Bound

• Next lecture: Policies & contextual bandits
