Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap Units 1&2
2. Motivation and Demo
3. Multi-Armed Bandit Setting
4. Exploration & Exploitation
5. Explore-then-Commit
In Unit 2, discussed algorithms for:
[Diagram: agent-environment interaction loop. The policy \(\pi\) selects action \(a_t\) given state \(s_t\) and receives reward \(r_t\); experience data \((s_t,a_t,r_t)\) is collected; the transitions \(P,f\) are unknown in Unit 2.]
1. Recap Units 1&2
2. Motivation and Demo
3. Multi-Armed Bandit Setting
4. Exploration & Exploitation
5. Explore-then-Commit
Example: MountainCar is rewarded only at the flag
Multi-Armed Bandits: a simplified setting for studying exploration
Online advertising
NYT Caption Contest
Medical trials
Interactive Coding Demo and PollEV
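The demo code itself does not appear on the slides; below is a minimal sketch of the kind of simulator such a demo might use, assuming a \(K\)-armed Bernoulli bandit in which arm \(a\) pays reward 1 with unknown probability \(\mu_a\) and 0 otherwise. The class name BernoulliBandit and its interface are illustrative assumptions, not the actual demo code.

```python
import numpy as np

class BernoulliBandit:
    """Simulated K-armed bandit with Bernoulli rewards (illustrative sketch)."""

    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu, dtype=float)   # true (unknown) mean rewards
        self.K = len(self.mu)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # Reward r_t in {0, 1} with mean mu[a].
        return float(self.rng.random() < self.mu[a])
```

The strategy sketches that follow interact with the bandit only through pull(a).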
1. Recap Units 1&2
2. Motivation and Demo
3. Multi-Armed Bandit Setting
4. Exploration & Exploitation
5. Explore-then-Commit
Uniform: \(a_t\sim \mathrm{Unif}([K])\) (pure exploration)
Greedy: \(a_t=\arg\max_{a\in[K]} \hat\mu_{a}\), the arm with the highest empirical mean reward so far (pure exploitation)
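Below is a sketch of these two baselines, assuming the pull(a) interface from the bandit sketch above and reading Greedy as playing the arm with the largest empirical mean \(\hat\mu_a\) after each arm has been tried once; the helper name run_baseline is hypothetical.

```python
import numpy as np

def run_baseline(pull, K, T, strategy="uniform", seed=0):
    """Run the uniform or greedy baseline for T rounds on a K-armed bandit."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)      # number of pulls of each arm
    sums = np.zeros(K)        # cumulative reward from each arm
    rewards = []
    for _ in range(T):
        if strategy == "uniform":
            a = int(rng.integers(K))            # a_t ~ Unif([K])
        elif np.any(counts == 0):
            a = int(np.argmin(counts))          # try each arm once first
        else:
            a = int(np.argmax(sums / counts))   # a_t = argmax_a hat_mu_a
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        rewards.append(r)
    return np.array(rewards)
```

For example, run_baseline(BernoulliBandit([0.3, 0.7]).pull, K=2, T=500, strategy="greedy") plays greedily on a two-armed instance; uniform never stops exploring, while greedy can lock onto the worse arm after an unlucky first pull of the better one.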
1. Recap Units 1&2
2. Motivation and Demo
3. Multi-Armed Bandit Setting
4. Exploration & Exploitation
5. Explore-then-Commit
Explore-then-Commit
With probability at least \(1-\delta\), for every arm \(a\in[K]\): \( \mu_{a} \in\left[ \hat \mu_{a} \pm c\sqrt{\frac{\log(K/\delta)}{N}}\right]\), where \(\hat\mu_a\) is the empirical mean of the \(N\) exploration pulls of arm \(a\).
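A sketch of Explore-then-Commit under the usual reading of these slides: pull each of the \(K\) arms \(N\) times, form the empirical means \(\hat\mu_a\), then commit to \(\hat a=\arg\max_a \hat\mu_a\) for the remaining rounds. The horizon parameter T and the function signature are assumptions for illustration.

```python
import numpy as np

def explore_then_commit(pull, K, T, N):
    """Pull each of K arms N times, then commit to the empirically best arm.

    Assumes K * N <= T; pull(a) returns a stochastic reward for arm a.
    """
    sums = np.zeros(K)
    rewards = []
    # Explore phase: N pulls of every arm.
    for a in range(K):
        for _ in range(N):
            r = pull(a)
            sums[a] += r
            rewards.append(r)
    mu_hat = sums / N                  # empirical means hat_mu_a
    a_hat = int(np.argmax(mu_hat))     # committed arm
    # Commit phase: play a_hat for the remaining T - K*N rounds.
    for _ in range(T - K * N):
        rewards.append(pull(a_hat))
    return np.array(rewards), a_hat
```

If every confidence interval above holds, the committed arm satisfies \(\mu_{a^\star}-\mu_{\hat a}\le 2c\sqrt{\log(K/\delta)/N}\), since \(\hat\mu_{\hat a}\ge\hat\mu_{a^\star}\); larger \(N\) tightens this guarantee but spends more rounds exploring.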