CS 4/5789: Introduction to Reinforcement Learning

Lecture 19: Exploration in Multi-Armed Bandits

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • 5789 Paper Reviews due weekly on Mondays
    • PSet 6 released tonight, due next Monday
    • Final PA released later this week

Agenda

1. Recap Units 1&2

2. Motivation and Demo

3. Multi-Armed Bandit Setting

4. Exploration & Exploitation

In Unit 2, discussed algorithms for:

  1. Constructing labels for supervised learning
  2. Updating the policy based on learned quantities

Recap: Control/Data Feedback

[Diagram: the control/data feedback loop. The policy \(\pi\) chooses action \(a_t\); the environment (transitions \(P,f\)) produces state \(s_t\) and reward \(r_t\); the experience data \((s_t,a_t,r_t)\) feeds back to update the policy. The transitions were unknown in Unit 2.]

In Unit 2, discussed algorithms for:

  1. Constructing labels for supervised learning
    • Model-based RL: transitions/rewards
    • Value-based RL: (Q) Value function
    • Policy optimization: gradient (of Value)
  2. Updating the policy based on learned quantities
    • So far: greedy, incremental, and \(\epsilon\)-greedy
    • Unit 3: Exploration

Agenda

1. Recap Units 1&2

2. Motivation and Demo

3. Multi-Armed Bandit Setting

4. Exploration & Exploitation

Exploration in RL is hard!

Example: MountainCar, where reward is given only at the flag

Multi-Armed Bandit

A simplified setting for studying exploration

Multi-Armed Bandits

  • for \(t=1,2,...\)
    • take action \(a_t\in\{1,\dots, K\}\)
    • receive reward \(r_t\)
      • \(\mathbb E[r_t] = \mu_{a_t}\)
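The interaction loop above can be instantiated as a small simulator. A minimal sketch, assuming Bernoulli rewards so that \(\mathbb E[r_t]=\mu_{a_t}\); the class name and arm means are illustrative:

```python
import random

class BernoulliBandit:
    """K-armed bandit: pulling arm a yields reward 1 w.p. mus[a], else 0."""
    def __init__(self, mus, seed=0):
        self.mus = list(mus)            # mean reward of each arm
        self.rng = random.Random(seed)

    @property
    def K(self):
        return len(self.mus)

    def pull(self, a):
        # Bernoulli reward, so E[r_t] = mu_{a_t}
        return 1.0 if self.rng.random() < self.mus[a] else 0.0

# example interaction loop: repeatedly pull arm 2
bandit = BernoulliBandit([0.2, 0.5, 0.8])
rewards = [bandit.pull(2) for _ in range(1000)]
```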

Online advertising

Applications of MAB

NYT Caption Contest

Medical Trials

Interactive Coding Demo and PollEV

Agenda

1. Recap Units 1&2

2. Motivation and Demo

3. Multi-Armed Bandit Setting

4. Exploration & Exploitation

MAB Setting

  • Simplified RL setting with no state and no transitions
  • \(K\) discrete actions ("arms") $$\mathcal A=\{1,\dots,K\}$$
  • Stochastic rewards \(r_t\sim r(a_t)\), where \(r:\mathcal A\to\Delta(\mathbb R)\)
    • Expected reward per action \(\mathbb E[r(a)] = \mu_a\)
  • Finite time horizon $$T\in\mathbb Z_+$$
  • Goal: maximize cumulative reward $$  \mathbb E\left[\sum_{t=1}^T r(a_t)  \right] = \sum_{t=1}^T \mu_{a_t}$$

Optimal Action

  • Goal: maximize cumulative reward $$  \mathbb E\left[\sum_{t=1}^T r(a_t)  \right] = \sum_{t=1}^T \mu_{a_t}$$
  • What is the optimal action?
    • \(a_\star = \arg\max_{a\in\mathcal A} \mu_a\)
  • When the setting is known, it is trivial to find the optimal action (unlike a general MDP)
  • When setting (i.e. reward function) is unknown, we must devise a strategy for balancing exploration (trying new actions) and exploitation (selecting high reward actions)

Regret

  • We measure the performance of an algorithm (i.e. strategy for selecting actions) by comparing its performance to the optimal action
  • Definition: The regret of an algorithm which chooses actions \(a_1,\dots,a_T\) is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t)  \right] = \sum_{t=1}^T \mu^* - \mu_{a_t} $$
  • Good algorithms have sublinear regret, so that the average sub-optimality converges to 0 $$\lim_{T\to\infty} \frac{1}{T} R(T) \to 0\quad \text{if}\quad R(T)\lesssim T^p\quad\text{for}\quad p<1$$
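The regret definition translates directly into a few lines of code. A sketch computing the pseudo-regret \(\sum_t (\mu^* - \mu_{a_t})\) of a fixed action sequence; the arm means are illustrative:

```python
def regret(mus, actions):
    """Pseudo-regret R(T) = sum_t (mu_star - mu_{a_t})."""
    mu_star = max(mus)
    return sum(mu_star - mus[a] for a in actions)

# an algorithm stuck on a suboptimal arm has regret growing linearly in T
mus = [0.2, 0.5, 0.8]
linear = regret(mus, [0] * 100)    # approximately 100 * (0.8 - 0.2)
optimal = regret(mus, [2] * 100)   # always playing a* gives zero regret
```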

Agenda

1. Recap Units 1&2

2. Motivation and Demo

3. Multi-Armed Bandit Setting

4. Exploration & Exploitation

Exploration & Exploitation

  • Consider the two algorithms: pure exploration and pure exploitation
  • Claim: Both suffer linear regret

Uniform

  • for \(t=1,2,...,T\)
    • \(a_t\sim \mathrm{Unif}(K)\)

Greedy

  • for \(t=1,2,...,K\)
    • \(a_t=t\), store \(r_t\)
  • for \(t=K+1,\dots,T\)
    • \(a_t=\arg\max_{a\in[K]} r_a\)
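The two pure strategies above can be sketched as follows, again assuming Bernoulli rewards with means `mus`; the function names and 0-indexed arms are assumptions of this sketch:

```python
import random

def run_uniform(mus, T, rng):
    """Pure exploration: a_t ~ Unif({0, ..., K-1}) at every round."""
    K = len(mus)
    return [rng.randrange(K) for _ in range(T)]

def run_greedy(mus, T, rng):
    """Pure exploitation: pull each arm once, then commit forever to
    the arm whose single observed (Bernoulli) reward was highest."""
    K = len(mus)
    actions, first_rewards = [], []
    for a in range(K):                                    # rounds 1..K
        actions.append(a)
        first_rewards.append(1.0 if rng.random() < mus[a] else 0.0)
    best = max(range(K), key=lambda a: first_rewards[a])  # argmax_a r_a
    actions.extend([best] * (T - K))                      # rounds K+1..T
    return actions
```

Note that greedy locks onto a suboptimal arm whenever the best arm's single observed reward happens to be low, which is why \(\mathbb P\{a_t\neq a^*\}\) stays bounded away from zero.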

Exploration & Exploitation

  • Claim: Both suffer linear regret
  • Proof sketch: Regret is
    • \(R(T)= \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t)  \right] \)
    • \(= \mathbb E\left[\sum_{t=1}^T \mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))  \right]\)
    • \(=\sum_{t=1}^T \mathbb E\left[ \mathbb 1\{a_t\neq a^* \} (\mu^*-\mu_{a_t})\right]\)
    • \(\geq \sum_{t=1}^T \mathbb P\{a_t\neq a^* \}\min_{a\neq a^*} (\mu^*-\mu_{a} ) = c\cdot T\)
  • Exercise: what is \( \mathbb P\{a_t\neq a^* \} \) for the two algorithms?

Uniform \(a_t\sim \mathrm{Unif}(K)\)

Greedy \(a_t=\arg\max_{a\in[K]} r_a\)

  • Expectation is over both randomness of reward function \(r\) and over randomness of \(a_t\) chosen by algorithm
  •  \(\mathbb E\left[\sum_{t=1}^T \mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))  \right]\)
  •  \(=\sum_{t=1}^T \mathbb E\left[\mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))  \right]\) linearity of expectation
  •  \(=\sum_{t=1}^T\mathbb E\left[\mathbb E\left[\mathbb 1\{a_t\neq a^* \} (r(a^*)-r(a_t))|a_t\right]  \right]\) tower rule
  •  \(=\sum_{t=1}^T\mathbb E\left[\mathbb 1\{a_t\neq a^* \}  \mathbb E\left[ r(a^*)-r(a_t)|a_t\right]  \right]\) independence
  •  \(=\sum_{t=1}^T\mathbb E\left[ \mathbb 1\{a_t\neq a^* \} (\mu^*-\mu_{a_t} )  \right]\) linearity of expectation
  •  \(\geq \sum_{t=1}^T\mathbb E\big[ \mathbb 1\{a_t\neq a^* \} {\min_{a\neq a^*} (\mu^*-\mu_{a} )} \big]\)
  •  \(= \sum_{t=1}^T\mathbb E\big[ \mathbb 1\{a_t\neq a^* \} \big] {\min_{a\neq a^*} (\mu^*-\mu_{a} )}\)
  • \(=\sum_{t=1}^T \mathbb P\{a_t\neq a^* \}{\min_{a\neq a^*} (\mu^*-\mu_{a} )} = c\cdot T\)
  • as long as \(\mathbb P\{a_t\neq a^* \}{\min_{a\neq a^*} (\mu^*-\mu_{a} )}\geq c\) for all \(t\)

Explore-then-Commit

  • First attempt: a simple algorithm that balances exploration and exploitation into two phases

Explore-then-Commit

  • for \(t=1,2,...,N\cdot K\)
    • \(a_t=((t-1)\bmod K)+1\), store \(r_t\)       # try each \(N\) times
  • \(\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{K(i-1)+a}\)                   # average reward
  • for \(t=NK+1,\dots,T\)
    • \(a_t=\arg\max_{a\in[K]} \hat \mu_a  = \hat a_\star\) # commit to best
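The two phases can be sketched directly, assuming Bernoulli rewards and 0-indexed arms (both assumptions of this sketch, not part of the algorithm's statement):

```python
import random

def explore_then_commit(mus, T, N, rng):
    """ETC sketch: pull each of the K arms N times round-robin, average
    the (Bernoulli) rewards, then commit to the empirically best arm."""
    K = len(mus)
    pull = lambda a: 1.0 if rng.random() < mus[a] else 0.0
    actions, totals = [], [0.0] * K
    for t in range(N * K):                       # exploration phase
        a = t % K                                # try each arm N times
        actions.append(a)
        totals[a] += pull(a)
    mu_hat = [totals[a] / N for a in range(K)]   # average reward per arm
    a_hat = max(range(K), key=lambda a: mu_hat[a])
    actions.extend([a_hat] * (T - N * K))        # commit phase
    return actions
```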

Explore-then-Commit

  • How to set \(N\)?
  • The regret decomposes $$R(T) =\sum_{t=1}^T \mu^\star - \mu_{a_t} = \underbrace{\sum_{t=1}^{NK} \mu^\star - \mu_{a_t}}_{R_1} + \underbrace{\sum_{t=NK+1}^T \mu^\star - \mu_{\hat a_\star}}_{R_2}$$
  • Assume that rewards are bounded so \(r_t\in[0,1]\)
    • \(R_1 \leq \sum_{t=1}^{NK} 1 = NK\)
  • \(R_2\) depends on the suboptimality
    • \(R_2 = (T-NK) (\mu^\star - \mu_{\hat a_\star}) \leq T (\mu^\star - \mu_{\hat a_\star})\)
  • This depends on the quality of the estimates \(\hat\mu_a\)

Explore-then-Commit

  • Lemma (Hoeffding's): Suppose \(r_i\in[0,1]\) and \(\mathbb E[r_i] = \mu\). Then for \(r_1,\dots, r_N\) i.i.d., with probability \(1-\delta\), $$\left|\frac{1}{N} \sum_{i=1}^N r_i - \mu \right| \lesssim \sqrt{\frac{\log 1/\delta}{N}} $$
  • (Proof out of scope)
  • Lemma (Explore): After the exploration phase, for all actions \(a\in\{1,\dots, K\}\), $$|\hat \mu_a -\mu_a|\lesssim \sqrt{\frac{\log(K/\delta)}{N}}\quad\text{with probability}~~1-\delta$$
  • Proof: Hoeffding & union bound
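The Hoeffding bound can be checked empirically. A sketch that estimates how often the confidence interval fails over repeated trials; the exact constant inside the radius (here the two-sided form \(\sqrt{\log(2/\delta)/2N}\)) is an assumption, since the slide suppresses constants with \(\lesssim\):

```python
import math
import random

def hoeffding_radius(N, delta):
    # two-sided Hoeffding width for [0,1]-bounded i.i.d. rewards (assumed constants)
    return math.sqrt(math.log(2.0 / delta) / (2.0 * N))

rng = random.Random(0)
mu, N, delta, trials = 0.3, 400, 0.05, 2000
failures = 0
for _ in range(trials):
    mean = sum(1.0 if rng.random() < mu else 0.0 for _ in range(N)) / N
    if abs(mean - mu) > hoeffding_radius(N, delta):
        failures += 1
# empirically, the interval should fail at most about a delta fraction of the time
```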

Confidence intervals

  • Confidence intervals allow us to bound sub-optimality with high probability: \(\mu^\star - \mu_{\hat a_\star}\)
    • \(\leq \hat \mu_{a_\star} + c\sqrt{\frac{\log(K/\delta)}{N}}- \Big(\hat \mu_{\hat a_\star} - c\sqrt{\frac{\log(K/\delta)}{N}}\Big)\)
    • \(=\hat \mu_{a_\star} - \hat \mu_{\hat a_\star} + 2c\sqrt{\frac{\log(K/\delta)}{N}} \)
    • \(\leq 2c\sqrt{\frac{\log(K/\delta)}{N}} \) by definition of \(\hat a_\star\)

\( \mu_{a} \in\left[ \hat \mu_{a} \pm c\sqrt{\frac{\log(K/\delta)}{N}}\right]\)

Explore-then-Commit

  • How to set \(N\)?
  • \(R(T) =R_1+R_2\)
    • \(\leq  NK + T (\mu^\star - \mu_{\hat a_\star})\)
    • \(\leq  NK + T\cdot 2c\sqrt{\frac{\log(K/\delta)}{N}}\) with probability \(1-\delta\)
  • Minimizing with respect to \(N\):
    • set derivative to zero: \(K - cT\sqrt{\frac{\log(K/\delta)}{N^3}}=0\)
  • Regret minimizing choice \(N=\left (\frac{cT}{K}\sqrt{\log (K/\delta)}\right)^{2/3}\)
  • Results in \(R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)\)
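As a numeric sanity check, the bound \(NK + 2cT\sqrt{\log(K/\delta)/N}\) can be evaluated around the candidate minimizer \(N \propto ((T/K)\sqrt{\log(K/\delta)})^{2/3}\); the values of \(T\), \(K\), \(\delta\), and \(c=1\) below are illustrative, and constants are suppressed as on the slide:

```python
import math

def regret_bound(N, T, K, delta, c=1.0):
    # the slide's decomposition: exploration cost + commit cost
    return N * K + 2.0 * c * T * math.sqrt(math.log(K / delta) / N)

T, K, delta = 100_000, 10, 0.05
# candidate minimizer N = ((cT/K) * sqrt(log(K/delta)))^(2/3), rounded to an integer
N_star = round((T * math.sqrt(math.log(K / delta)) / K) ** (2 / 3))
```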

Explore-then-Commit

  • Theorem: For \(N\propto  ((T/K)\sqrt{\log K/\delta})^{2/3}\), the regret of ETC is bounded with probability \(1-\delta\): $$R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$$

Explore-then-Commit

  • for \(t=1,2,...,N\cdot K\)
    • \(a_t=((t-1)\bmod K)+1\), store \(r_t\)       # try each \(N\) times
  • \(\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{K(i-1)+a}\)                   # average reward
  • for \(t=NK+1,\dots,T\)
    • \(a_t=\arg\max_{a\in[K]} \hat \mu_a  = \hat a_\star\) # commit to best

Recap

  • PSet released tonight

 

  • Multi-Armed Bandits
  • Explore-then-Commit
  • Confidence Intervals

 

  • Next lecture: Using confidence intervals adaptively