CS 4/5789: Introduction to Reinforcement Learning
Lecture 19: Exploration in Multi-Armed Bandits
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due weekly on Mondays
- PSet 6 released tonight, due next Monday
- Final PA released later this week
Agenda
1. Recap Units 1&2
2. Motivation and Demo
3. Multi-Armed Bandit Setting
4. Exploration & Exploitation
In Unit 2, discussed algorithms for:
- Constructing labels for supervised learning
- Updating the policy based on learned quantities
Recap: Control/Data Feedback
[Figure: control/data feedback loop. The policy \(\pi\) selects actions \(a_t\); the environment (transitions \(P\), rewards \(f\), unknown in Unit 2) returns states \(s_t\) and rewards \(r_t\); the experience data \((s_t,a_t,r_t)\) feeds back to update the policy.]
In Unit 2, discussed algorithms for:
- Constructing labels for supervised learning
- Model-based RL: transitions/rewards
- Value-based RL: (Q) Value function
- Policy optimization: gradient (of Value)
- Updating the policy based on learned quantities
- So far: greedy, incremental, and \(\epsilon\)-greedy
- Unit 3: Exploration
Agenda
1. Recap Units 1&2
2. Motivation and Demo
3. Multi-Armed Bandit Setting
4. Exploration & Exploitation
Exploration in RL is hard!
Example: MountainCar, where reward is received only at the flag
Multi-Armed Bandit
A simplified setting for studying exploration
Multi-Armed Bandits
- for \(t=1,2,...\)
- take action \(a_t\in\{1,\dots, K\}\)
- receive reward \(r_t\)
- \(\mathbb E[r_t] = \mu_{a_t}\)
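As a concrete illustration (not from the slides), here is a minimal Python sketch of this interaction protocol, assuming Bernoulli rewards; the arm means `mu` are made-up values that the learner would not know.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                             # number of arms
mu = np.array([0.3, 0.5, 0.7])    # hypothetical expected rewards (unknown to the learner)
T = 1000                          # number of rounds

total_reward = 0.0
for t in range(T):
    a = rng.integers(K)           # placeholder strategy: uniformly random arm
    r = rng.binomial(1, mu[a])    # stochastic reward with E[r_t] = mu[a]
    total_reward += r

print(total_reward / T)           # average reward; approaches mu.mean() for this strategy
```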
Applications of MAB
- Online advertising
- NYT Caption Contest
- Medical trials
Interactive Coding Demo and PollEV
Agenda
1. Recap Units 1&2
2. Motivation and Demo
3. Multi-Armed Bandit Setting
4. Exploration & Exploitation
MAB Setting
- Simplified RL setting with no state and no transitions
- \(K\) discrete actions ("arms") $$\mathcal A=\{1,\dots,K\}$$
- Stochastic rewards \(r_t\sim r(a_t)\), where $$r:\mathcal A\to\Delta(\mathbb R)$$
- Expected reward per action \(\mathbb E[r(a)] = \mu_a\)
- Finite time horizon $$T\in\mathbb Z_+$$
- Goal: maximize cumulative reward $$ \mathbb E\left[\sum_{t=1}^T r(a_t) \right] = \sum_{t=1}^T \mu_{a_t}$$
Optimal Action
- Goal: maximize cumulative reward $$ \mathbb E\left[\sum_{t=1}^T r(a_t) \right] = \sum_{t=1}^T \mu_{a_t}$$
- What is the optimal action?
- \(a_\star = \arg\max_{a\in\mathcal A} \mu_a\)
- When the setting is known, it is trivial to find the optimal action (unlike a general MDP)
- When the setting (i.e. the reward function) is unknown, we must devise a strategy that balances exploration (trying new actions) and exploitation (selecting high-reward actions)
Regret
- We measure the performance of an algorithm (i.e. strategy for selecting actions) by comparing its performance to the optimal action
- Definition: The regret of an algorithm which chooses actions \(a_1,\dots,a_T\) is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \sum_{t=1}^T \mu^* - \mu_{a_t} $$
- Good algorithms have sublinear regret, so that the average sub-optimality converges to zero: $$\lim_{T\to\infty} \frac{1}{T} R(T) = 0,\quad \text{which holds whenever}\quad R(T)\lesssim T^p\quad\text{for}\quad p<1$$
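To make the definition concrete, a small sketch (illustrative, not from the lecture) that computes the pseudo-regret of a recorded action sequence, when the true means are known to the simulator:

```python
import numpy as np

def regret(mu, actions):
    """Pseudo-regret: sum_t (mu_star - mu_{a_t}) for a recorded action sequence."""
    mu = np.asarray(mu)
    return np.sum(mu.max() - mu[np.asarray(actions)])

# Illustrative: with arms mu = [0.3, 0.5, 0.7], always pulling arm 0 gives
# regret 100 * (0.7 - 0.3) = 40, which grows linearly in the horizon.
print(regret([0.3, 0.5, 0.7], [0] * 100))
```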
Agenda
1. Recap Units 1&2
2. Motivation and Demo
3. Multi-Armed Bandit Setting
4. Exploration & Exploitation
Exploration & Exploitation
- Consider two pure strategies, sketched in code below: pure exploration (Uniform) and pure exploitation (Greedy)
- Claim: Both suffer linear regret
Uniform
- for \(t=1,2,...,T\)
- \(a_t\sim \mathrm{Unif}(K)\)
Greedy
- for \(t=1,2,...,K\)
- \(a_t=t\), store \(r_t\)
- for \(t=K+1,\dots,T\)
- \(a_t=\arg\max_{a\in[K]} r_a\)
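A hedged Python sketch of both strategies, assuming Bernoulli arms (0-based arm indices); `mu` holds the true means and is used only by the simulator:

```python
import numpy as np

def run_uniform(mu, T, rng):
    """Pure exploration: sample an arm uniformly at random every round
    (received rewards are ignored by this strategy)."""
    return rng.integers(len(mu), size=T)

def run_greedy(mu, T, rng):
    """Pure exploitation: pull each arm once, then commit to the best single sample."""
    K = len(mu)
    first_rewards = rng.binomial(1, np.asarray(mu))  # one Bernoulli sample per arm
    best = int(np.argmax(first_rewards))             # ties broken by lowest index
    return np.concatenate([np.arange(K), np.full(T - K, best)])
```

Plugging either action sequence into the `regret` helper sketched earlier shows the cumulative regret growing linearly in \(T\).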
Exploration & Exploitation
- Claim: Both suffer linear regret
Proof sketch: Regret is
- \(R(T)= \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] \)
- \(= \mathbb E\left[\sum_{t=1}^T \mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t)) \right]\)
- \(=\sum_{t=1}^T \mathbb E\left[ \mathbb 1\{a_t\neq a^* \} (\mu^*-\mu_{a_t})\right]\)
- \(\geq \sum_{t=1}^T \mathbb P\{a_t\neq a^* \}\min_{a\neq a^*} (\mu^*-\mu_{a} ) \geq c\cdot T\)
- Exercise: what is \( \mathbb P\{a_t\neq a^* \} \) for the two algorithms?
Uniform \(a_t\sim \mathrm{Unif}(K)\)
Greedy \(a_t=\arg\max_{a\in[K]} r_a\)
- The expectation is over both the randomness of the rewards \(r\) and the randomness of the actions \(a_t\) chosen by the algorithm
- \(\mathbb E\left[\sum_{t=1}^T \mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t)) \right]\)
- \(=\sum_{t=1}^T \mathbb E\left[\mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t)) \right]\) linearity of expectation
- \(=\sum_{t=1}^T\mathbb E\left[\mathbb E\left[\mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))|a_t\right] \right]\) tower rule
- \(=\sum_{t=1}^T\mathbb E\left[\mathbb 1\{a_t\neq a^* \} \mathbb E\left[ r(a^*)-r(a_t)|a_t\right] \right]\) independence
- \(=\sum_{t=1}^T\mathbb E\left[ \mathbb 1\{a_t\neq a^* \} (\mu^*-\mu_{a_t} ) \right]\) linearity of expectation
- \(\geq \sum_{t=1}^T\mathbb E\big[ \mathbb 1\{a_t\neq a^* \} {\min_{a\neq a^*} (\mu^*-\mu_{a} )} \big]\)
- \(= \sum_{t=1}^T\mathbb E\big[ \mathbb 1\{a_t\neq a^* \} \big] {\min_{a\neq a^*} (\mu^*-\mu_{a} )}\)
- \(=\sum_{t=1}^T \mathbb P\{a_t\neq a^* \}{\min_{a\neq a^*} (\mu^*-\mu_{a} )} \geq c\cdot T\)
- as long as \(\mathbb P\{a_t\neq a^* \}{\min_{a\neq a^*} (\mu^*-\mu_{a} )}\geq c\) for all \(t\)
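For intuition, an illustrative instance (not from the slides) of why these probabilities stay bounded away from zero:
- Uniform: if the best arm is unique, \(\mathbb P\{a_t\neq a^*\}=\frac{K-1}{K}\) for every \(t\), so the condition holds with \(c = \frac{K-1}{K}\min_{a\neq a^*}(\mu^*-\mu_a)\)
- Greedy with two Bernoulli arms, \(\mu_1=0.9\) and \(\mu_2=0.1\): after one pull of each arm, Greedy commits to arm 2 whenever \(r_2=1\) and \(r_1=0\), which happens with probability \(0.1\times 0.1=0.01\); hence \(\mathbb P\{a_t\neq a^*\}\geq 0.01\) for all \(t>2\), and \(R(T)\geq 0.01\cdot 0.8\cdot(T-2)\), linear in \(T\)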
Explore-then-Commit
- First attempt: a simple algorithm that separates exploration and exploitation into two phases (a Python sketch follows the pseudocode below)
Explore-then-Commit
- for \(t=1,2,...,N\cdot K\)
- \(a_t=((t-1)\bmod K)+1\), store \(r_t\) # try each arm \(N\) times
- \(\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{(i-1)K+a}\) # average reward per arm
- for \(t=NK+1,\dots,T\)
- \(a_t=\arg\max_{a\in[K]} \hat \mu_a = \hat a_\star\) # commit to best
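A runnable sketch of ETC under the same Bernoulli-arm assumption as earlier (0-based arm indices; `mu` is known only to the simulator):

```python
import numpy as np

def explore_then_commit(mu, T, N, rng):
    """ETC: pull each of the K arms N times, then commit to the empirical best."""
    K = len(mu)
    assert N * K <= T, "exploration phase must fit in the horizon"
    # Exploration phase: N full passes over the arms, one (N, K) reward matrix.
    explore_rewards = rng.binomial(1, np.asarray(mu), size=(N, K))
    mu_hat = explore_rewards.mean(axis=0)          # empirical mean per arm
    a_hat = int(np.argmax(mu_hat))                 # committed arm
    actions = np.concatenate([np.tile(np.arange(K), N),   # round-robin exploration
                              np.full(T - N * K, a_hat)]) # commit phase
    return actions, mu_hat

rng = np.random.default_rng(0)
actions, mu_hat = explore_then_commit([0.3, 0.5, 0.7], T=10_000, N=50, rng=rng)
```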
Explore-then-Commit
- How to set \(N\)?
- The regret decomposes $$R(T) =\sum_{t=1}^T \mu^\star - \mu_{a_t} = \underbrace{\sum_{t=1}^{NK} \mu^\star - \mu_{a_t}}_{R_1} + \underbrace{\sum_{t=NK+1}^T \mu^\star - \mu_{\hat a_\star}}_{R_2}$$
- Assume that rewards are bounded so \(r_t\in[0,1]\)
- \(R_1 \leq \sum_{t=1}^{NK} 1 = NK\)
- \(R_2\) depends on the suboptimality
- \(R_2 = (T-NK) (\mu^\star - \mu_{\hat a_\star}) \leq T (\mu^\star - \mu_{\hat a_\star})\)
- This depends on the quality of the estimates \(\hat\mu_a\)
Explore-then-Commit
- Lemma (Hoeffding's): Suppose \(r_i\in[0,1]\) and \(\mathbb E[r_i] = \mu\). Then for \(r_1,\dots, r_N\) i.i.d., with probability \(1-\delta\), $$\left|\frac{1}{N} \sum_{i=1}^N r_i - \mu \right| \lesssim \sqrt{\frac{\log 1/\delta}{N}} $$
- (Proof out of scope)
- Lemma (Explore): After the exploration phase, for all actions \(a\in\{1,\dots, K\}\), $$|\hat \mu_a -\mu_a|\lesssim \sqrt{\frac{\log(K/\delta)}{N}}\quad\text{with probability}~~1-\delta$$
- Proof: apply Hoeffding with failure probability \(\delta/K\) to each arm, then union bound over the \(K\) arms
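A quick Monte Carlo sanity check of the Explore lemma (illustrative; the explicit width \(\sqrt{\log(2K/\delta)/(2N)}\) is one standard Hoeffding constant for rewards in \([0,1]\)):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, trials, delta = 10, 200, 2000, 0.05
mu = rng.uniform(size=K)                        # arbitrary illustrative arm means

# Worst-case deviation max_a |mu_hat_a - mu_a| over many exploration phases.
devs = np.empty(trials)
for s in range(trials):
    samples = rng.binomial(1, mu, size=(N, K))  # N Bernoulli pulls per arm
    devs[s] = np.abs(samples.mean(axis=0) - mu).max()

# Hoeffding width with failure probability delta/K per arm, union-bounded over K arms.
width = np.sqrt(np.log(2 * K / delta) / (2 * N))
print((devs <= width).mean())                   # empirically at least ~1 - delta
```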
Confidence intervals
- With probability \(1-\delta\), every arm's mean lies in its confidence interval: \( \mu_{a} \in\left[ \hat \mu_{a} \pm c\sqrt{\frac{\log(K/\delta)}{N}}\right]\)
- This lets us bound the sub-optimality \(\mu^\star - \mu_{\hat a_\star}\) with high probability:
- \(\mu^\star - \mu_{\hat a_\star}\leq \hat \mu_{a_\star} + c\sqrt{\frac{\log(K/\delta)}{N}}- \Big(\hat \mu_{\hat a_\star} - c\sqrt{\frac{\log(K/\delta)}{N}}\Big)\)
- \(=\hat \mu_{a_\star} - \hat \mu_{\hat a_\star} + 2c\sqrt{\frac{\log(K/\delta)}{N}} \)
- \(\leq 2c\sqrt{\frac{\log(K/\delta)}{N}} \) by definition of \(\hat a_\star\) (it maximizes \(\hat\mu_a\))
Explore-then-Commit
- How to set \(N\)?
- \(R(T) =R_1+R_2\)
- \(\leq NK + T (\mu^\star - \mu_{\hat a_\star})\)
- \(\leq NK + T\cdot 2c\sqrt{\frac{\log(K/\delta)}{N}}\) with probability \(1-\delta\)
- Minimizing with respect to \(N\):
- set derivative to zero: \(K - cT\sqrt{\frac{\log(K/\delta)}{N^3}}=0\)
- Regret-minimizing choice: \(N=\left (\frac{cT}{K}\sqrt{\log (K/\delta)}\right)^{2/3}\)
- Results in \(R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)\)
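A small numeric check (illustrative values; the Hoeffding constant \(c\) is set to 1) that the closed-form \(N\) roughly matches a brute-force minimization of the bound \(NK + 2cT\sqrt{\log(K/\delta)/N}\):

```python
import numpy as np

T, K, delta, c = 100_000, 10, 0.05, 1.0
L = np.log(K / delta)

def bound(N):
    return N * K + 2 * c * T * np.sqrt(L / N)       # R_1 + R_2 upper bound

N_closed = ((c * T / K) * np.sqrt(L)) ** (2 / 3)    # closed-form minimizer
N_grid = np.arange(1, 20_000)
N_best = N_grid[np.argmin(bound(N_grid))]           # brute-force minimizer
print(round(N_closed), N_best)                      # should roughly agree
```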
Explore-then-Commit
- Theorem: For \(N\propto ((T/K)\sqrt{\log (K/\delta)})^{2/3}\), the regret of ETC is bounded with probability \(1-\delta\): $$R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$$
Explore-then-Commit
- for \(t=1,2,...,N\cdot K\)
- \(a_t=((t-1)\bmod K)+1\), store \(r_t\) # try each arm \(N\) times
- \(\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{(i-1)K+a}\) # average reward per arm
- for \(t=NK+1,\dots,T\)
- \(a_t=\arg\max_{a\in[K]} \hat \mu_a = \hat a_\star\) # commit to best
Recap
- PSet released tonight
- Multi-Armed Bandits
- Explore-then-Commit
- Confidence Intervals
- Next lecture: Using confidence intervals adaptively