CS 4/5789: Introduction to Reinforcement Learning

Lecture 19: Exploration in Multi-Armed Bandits

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • 5789 Paper Reviews due weekly on Mondays
    • PSet 6 released tonight, due next Monday
    • Final PA released later this week

Agenda

1. Recap Units 1&2

2. Motivation and Demo

3. Multi-Armed Bandit Setting

4. Exploration & Exploitation

In Unit 2, discussed algorithms for:

  1. Constructing labels for supervised learning
  2. Updating the policy based on learned quantities

Recap: Control/Data Feedback

[Diagram: the control/data feedback loop. The policy \(\pi\) chooses action \(a_t\); the environment (transitions \(P,f\)) produces state \(s_t\) and reward \(r_t\); the experience data \((s_t,a_t,r_t)\) feeds back to update the policy. The transitions were unknown in Unit 2.]

In Unit 2, discussed algorithms for:

  1. Constructing labels for supervised learning
    • Model-based RL: transitions/rewards
    • Value-based RL: (Q) Value function
    • Policy optimization: gradient (of Value)
  2. Updating the policy based on learned quantities
    • So far: greedy, incremental, and \(\epsilon\)-greedy
    • Unit 3: Exploration

Agenda

1. Recap Units 1&2

2. Motivation and Demo

3. Multi-Armed Bandit Setting

4. Exploration & Exploitation

Exploration in RL is hard!

Example: MountainCar, where reward is given only at the flag

Multi-Armed Bandit

A simplified setting for studying exploration

Multi-Armed Bandits

  • for \(t=1,2,...\)
    • take action \(a_t\in\{1,\dots, K\}\)
    • receive reward \(r_t\)
      • \(\mathbb E[r_t] = \mu_{a_t}\)
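The interaction loop above can be instantiated as a small simulator. A minimal sketch, assuming Bernoulli rewards so that \(\mathbb E[r_t]=\mu_{a_t}\); the class name and arm means are illustrative:

```python
import random

class BernoulliBandit:
    """K-armed bandit: pulling arm a yields reward 1 w.p. mus[a], else 0."""
    def __init__(self, mus, seed=0):
        self.mus = list(mus)            # mean reward of each arm
        self.rng = random.Random(seed)

    @property
    def K(self):
        return len(self.mus)

    def pull(self, a):
        # Bernoulli reward, so E[r_t] = mu_{a_t}
        return 1.0 if self.rng.random() < self.mus[a] else 0.0

# example interaction loop: repeatedly pull arm 2
bandit = BernoulliBandit([0.2, 0.5, 0.8])
rewards = [bandit.pull(2) for _ in range(1000)]
```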

Online advertising

Applications of MAB

NYT Caption Contest

Medical Trials

Interactive Coding Demo and PollEV

Agenda

1. Recap Units 1&2

2. Motivation and Demo

3. Multi-Armed Bandit Setting

4. Exploration & Exploitation

MAB Setting

  • Simplified RL setting with no state and no transitions
  • \(K\) discrete actions ("arms") $$\mathcal A=\{1,\dots,K\}$$
  • Stochastic rewards \(r_t\sim r(a_t)\), where \(r:\mathcal A\to\Delta(\mathbb R)\)
    • Expected reward per action \(\mathbb E[r(a)] = \mu_a\)
  • Finite time horizon $$T\in\mathbb Z_+$$
  • Goal: maximize cumulative reward $$  \mathbb E\left[\sum_{t=1}^T r(a_t)  \right] = \sum_{t=1}^T \mu_{a_t}$$

Optimal Action

  • Goal: maximize cumulative reward $$  \mathbb E\left[\sum_{t=1}^T r(a_t)  \right] = \sum_{t=1}^T \mu_{a_t}$$
  • What is the optimal action?
    • \(a_\star = \arg\max_{a\in\mathcal A} \mu_a\)
  • When the setting is known, it is trivial to find the optimal action (unlike a general MDP)
  • When setting (i.e. reward function) is unknown, we must devise a strategy for balancing exploration (trying new actions) and exploitation (selecting high reward actions)

Regret

  • We measure the performance of an algorithm (i.e. strategy for selecting actions) by comparing its performance to the optimal action
  • Definition: The regret of an algorithm which chooses actions \(a_1,\dots,a_T\) is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t)  \right] = \sum_{t=1}^T \mu^* - \mu_{a_t} $$
  • Good algorithms have sublinear regret, so that the average sub-optimality converges to 0 $$\lim_{T\to\infty} \frac{1}{T} R(T) \to 0\quad \text{if}\quad R(T)\lesssim T^p\quad\text{for}\quad p<1$$
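The regret definition translates directly into a few lines of code. A sketch computing the pseudo-regret \(\sum_t (\mu^* - \mu_{a_t})\) of a fixed action sequence; the arm means are illustrative:

```python
def regret(mus, actions):
    """Pseudo-regret R(T) = sum_t (mu_star - mu_{a_t})."""
    mu_star = max(mus)
    return sum(mu_star - mus[a] for a in actions)

# an algorithm stuck on a suboptimal arm has regret growing linearly in T
mus = [0.2, 0.5, 0.8]
linear = regret(mus, [0] * 100)    # approximately 100 * (0.8 - 0.2)
optimal = regret(mus, [2] * 100)   # always playing a* gives zero regret
```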

Agenda

1. Recap Units 1&2

2. Motivation and Demo

3. Multi-Armed Bandit Setting

4. Exploration & Exploitation

Exploration & Exploitation

  • Consider the two algorithms: pure exploration and pure exploitation
  • Claim: Both suffer linear regret

Uniform

  • for \(t=1,2,...,T\)
    • \(a_t\sim \mathrm{Unif}(K)\)

Greedy

  • for \(t=1,2,...,K\)
    • \(a_t=t\), store \(r_t\)
  • for \(t=K+1,\dots,T\)
    • \(a_t=\arg\max_{a\in[K]} r_a\)
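The two pure strategies above can be sketched as follows, again assuming Bernoulli rewards with means `mus`; the function names and 0-indexed arms are assumptions of this sketch:

```python
import random

def run_uniform(mus, T, rng):
    """Pure exploration: a_t ~ Unif({0, ..., K-1}) at every round."""
    K = len(mus)
    return [rng.randrange(K) for _ in range(T)]

def run_greedy(mus, T, rng):
    """Pure exploitation: pull each arm once, then commit forever to
    the arm whose single observed (Bernoulli) reward was highest."""
    K = len(mus)
    actions, first_rewards = [], []
    for a in range(K):                                    # rounds 1..K
        actions.append(a)
        first_rewards.append(1.0 if rng.random() < mus[a] else 0.0)
    best = max(range(K), key=lambda a: first_rewards[a])  # argmax_a r_a
    actions.extend([best] * (T - K))                      # rounds K+1..T
    return actions
```

Note that greedy locks onto a suboptimal arm whenever the best arm's single observed reward happens to be low, which is why \(\mathbb P\{a_t\neq a^*\}\) stays bounded away from zero.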

Exploration & Exploitation

  • Claim: Both suffer linear regret
  • Proof sketch: Regret is
    • \(R(T)= \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t)  \right] \)
    • \(= \mathbb E\left[\sum_{t=1}^T \mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))  \right]\)
    • \(=\sum_{t=1}^T \mathbb E\left[ \mathbb 1\{a_t\neq a^* \} (\mu^*-\mu_{a_t})\right]\)
    • \(\geq \sum_{t=1}^T \mathbb P\{a_t\neq a^* \}\min_{a\neq a^*} (\mu^*-\mu_{a} ) = c\cdot T\)
  • Exercise: what is \( \mathbb P\{a_t\neq a^* \} \) for the two algorithms?

Uniform \(a_t\sim \mathrm{Unif}(K)\)

Greedy \(a_t=\arg\max_{a\in[K]} r_a\)

  • Expectation is over both randomness of reward function \(r\) and over randomness of \(a_t\) chosen by algorithm
  •  \(\mathbb E\left[\sum_{t=1}^T \mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))  \right]\)
  •  \(=\sum_{t=1}^T \mathbb E\left[\mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))  \right]\) linearity of expectation
  •  \(=\sum_{t=1}^T\mathbb E\left[\mathbb E\left[\mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))|a_t\right]  \right]\) tower rule
  •  \(=\sum_{t=1}^T\mathbb E\left[\mathbb 1\{a_t\neq a^* \}  \mathbb E\left[ r(a^*)-r(a_t)|a_t\right]  \right]\) independence
  •  \(=\sum_{t=1}^T\mathbb E\left[ \mathbb 1\{a_t\neq a^* \} (\mu^*-\mu_{a_t} )  \right]\) linearity of expectation
  •  \(\geq \sum_{t=1}^T\mathbb E\big[ \mathbb 1\{a_t\neq a^* \} {\min_{a\neq a^*} (\mu^*-\mu_{a} )} \big]\)
  •  \(= \sum_{t=1}^T\mathbb E\big[ \mathbb 1\{a_t\neq a^* \} \big] {\min_{a\neq a^*} (\mu^*-\mu_{a} )}\)
  • \(=\sum_{t=1}^T \mathbb P\{a_t\neq a^* \}{\min_{a\neq a^*} (\mu^*-\mu_{a} )} = c\cdot T\)
  • as long as \(\mathbb P\{a_t\neq a^* \}{\min_{a\neq a^*} (\mu^*-\mu_{a} )}\geq c\) for all \(t\)

Explore-then-Commit

  • First attempt: a simple algorithm that balances exploration and exploitation into two phases

Explore-then-Commit

  • for \(t=1,2,...,N\cdot K\)
    • \(a_t=((t-1)\bmod K)+1\), store \(r_t\)       # try each \(N\) times
  • \(\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{K(i-1)+a}\)                   # average reward
  • for \(t=NK+1,\dots,T\)
    • \(a_t=\arg\max_{a\in[K]} \hat \mu_a  = \hat a_\star\) # commit to best
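The two phases can be sketched directly, assuming Bernoulli rewards and 0-indexed arms (both assumptions of this sketch, not part of the algorithm's statement):

```python
import random

def explore_then_commit(mus, T, N, rng):
    """ETC sketch: pull each of the K arms N times round-robin, average
    the (Bernoulli) rewards, then commit to the empirically best arm."""
    K = len(mus)
    pull = lambda a: 1.0 if rng.random() < mus[a] else 0.0
    actions, totals = [], [0.0] * K
    for t in range(N * K):                       # exploration phase
        a = t % K                                # try each arm N times
        actions.append(a)
        totals[a] += pull(a)
    mu_hat = [totals[a] / N for a in range(K)]   # average reward per arm
    a_hat = max(range(K), key=lambda a: mu_hat[a])
    actions.extend([a_hat] * (T - N * K))        # commit phase
    return actions
```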

Explore-then-Commit

  • How to set \(N\)?
  • The regret decomposes $$R(T) =\sum_{t=1}^T \mu^\star - \mu_{a_t} = \underbrace{\sum_{t=1}^{NK} \mu^\star - \mu_{a_t}}_{R_1} + \underbrace{\sum_{t=NK+1}^T \mu^\star - \mu_{\hat a_\star}}_{R_2}$$
  • Assume that rewards are bounded so \(r_t\in[0,1]\)
    • \(R_1 \leq \sum_{t=1}^{NK} 1 = NK\)
  • \(R_2\) depends on the suboptimality
    • \(R_2 = (T-NK) (\mu^\star - \mu_{\hat a_\star}) \leq T (\mu^\star - \mu_{\hat a_\star})\)
  • This depends on the quality of the estimates \(\hat\mu_a\)

Explore-then-Commit

  • Lemma (Hoeffding's): Suppose \(r_i\in[0,1]\) and \(\mathbb E[r_i] = \mu\). Then for \(r_1,\dots, r_N\) i.i.d., with probability \(1-\delta\), $$\left|\frac{1}{N} \sum_{i=1}^N r_i - \mu \right| \lesssim \sqrt{\frac{\log 1/\delta}{N}} $$
  • (Proof out of scope)
  • Lemma (Explore): After the exploration phase, for all actions \(a\in\{1,\dots, K\}\), $$|\hat \mu_a -\mu_a|\lesssim \sqrt{\frac{\log(K/\delta)}{N}}\quad\text{with probability}~~1-\delta$$
  • Proof: Hoeffding & union bound
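The Hoeffding bound can be checked empirically. A sketch that estimates how often the confidence interval fails over repeated trials; the exact constant inside the radius (here the two-sided form \(\sqrt{\log(2/\delta)/2N}\)) is an assumption, since the slide suppresses constants with \(\lesssim\):

```python
import math
import random

def hoeffding_radius(N, delta):
    # two-sided Hoeffding width for [0,1]-bounded i.i.d. rewards (assumed constants)
    return math.sqrt(math.log(2.0 / delta) / (2.0 * N))

rng = random.Random(0)
mu, N, delta, trials = 0.3, 400, 0.05, 2000
failures = 0
for _ in range(trials):
    mean = sum(1.0 if rng.random() < mu else 0.0 for _ in range(N)) / N
    if abs(mean - mu) > hoeffding_radius(N, delta):
        failures += 1
# empirically, the interval should fail at most about a delta fraction of the time
```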

Confidence intervals

  • Confidence intervals allow us to bound sub-optimality with high probability: \(\mu^\star - \mu_{\hat a_\star}\)
    • \(\leq \hat \mu_{a_\star} + c\sqrt{\frac{\log(K/\delta)}{N}}- \Big(\hat \mu_{\hat a_\star} - c\sqrt{\frac{\log(K/\delta)}{N}}\Big)\)
    • \(=\hat \mu_{a_\star} - \hat \mu_{\hat a_\star} + 2c\sqrt{\frac{\log(K/\delta)}{N}} \)
    • \(\leq 2c\sqrt{\frac{\log(K/\delta)}{N}} \) by definition of \(\hat a_\star\)

\( \mu_{a} \in\left[ \hat \mu_{a} \pm c\sqrt{\frac{\log(K/\delta)}{N}}\right]\)

Explore-then-Commit

  • How to set \(N\)?
  • \(R(T) =R_1+R_2\)
    • \(\leq  NK + T (\mu^\star - \mu_{\hat a_\star})\)
    • \(\leq  NK + T\cdot 2c\sqrt{\frac{\log(K/\delta)}{N}}\) with probability \(1-\delta\)
  • Minimizing with respect to \(N\):
    • set derivative to zero: \(K - cT\sqrt{\frac{\log(K/\delta)}{N^3}}=0\)
  • Regret minimizing choice \(N=\left (\frac{cT}{K}\sqrt{\log (K/\delta)}\right)^{2/3}\)
  • Results in \(R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)\)
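As a numeric sanity check, the bound \(NK + 2cT\sqrt{\log(K/\delta)/N}\) can be evaluated around the candidate minimizer \(N \propto ((T/K)\sqrt{\log(K/\delta)})^{2/3}\); the values of \(T\), \(K\), \(\delta\), and \(c=1\) below are illustrative, and constants are suppressed as on the slide:

```python
import math

def regret_bound(N, T, K, delta, c=1.0):
    # the slide's decomposition: exploration cost + commit cost
    return N * K + 2.0 * c * T * math.sqrt(math.log(K / delta) / N)

T, K, delta = 100_000, 10, 0.05
# candidate minimizer N = ((cT/K) * sqrt(log(K/delta)))^(2/3), rounded to an integer
N_star = round((T * math.sqrt(math.log(K / delta)) / K) ** (2 / 3))
```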

Explore-then-Commit

  • Theorem: For \(N\propto  ((T/K)\sqrt{\log K/\delta})^{2/3}\), the regret of ETC is bounded with probability \(1-\delta\): $$R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$$

Explore-then-Commit

  • for \(t=1,2,...,N\cdot K\)
    • \(a_t=((t-1)\bmod K)+1\), store \(r_t\)       # try each \(N\) times
  • \(\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{K(i-1)+a}\)                   # average reward
  • for \(t=NK+1,\dots,T\)
    • \(a_t=\arg\max_{a\in[K]} \hat \mu_a  = \hat a_\star\) # commit to best

Recap

  • PSet released tonight

 

  • Multi-Armed Bandits
  • Explore-then-Commit
  • Confidence Intervals

 

  • Next lecture: Using confidence intervals adaptively