1. Recap: Multi-Armed Bandits

2. Explore-then-Commit

3. UCB Algorithm

4. UCB Analysis

Multi-Armed Bandit

A simplified setting for studying exploration

Multi-Armed Bandits

  • for \(t=1,2,...,T\)
    • take action \(a_t\in\{1,\dots, K\}\)
    • receive reward \(r_t\)
      • \(\mathbb E[r_t] = \mu_{a_t}\)

MAB Setting

  • Simplified RL setting with no state and no transitions
  • Action set \(\mathcal A=\{1,\dots,K\}\): \(K\) discrete actions ("arms")
  • Stochastic rewards \(r_t\sim r(a_t)\) with expectation \(\mathbb E[r(a)] = \mu_a\)
  • Finite time horizon \(T\in\mathbb Z_+\)
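
The setting above can be sketched as a small simulator. This is a minimal illustration, assuming Bernoulli rewards (so \(r_t\in\{0,1\}\)) and 0-indexed arms; the class name `BernoulliBandit` and its parameters are hypothetical and not part of the lecture.

```python
import random


class BernoulliBandit:
    """K-armed bandit: arm a pays 1 with probability means[a], else 0.

    Assumption for illustration: Bernoulli rewards, arms indexed 0..K-1.
    """

    def __init__(self, means, seed=0):
        self.means = list(means)
        self.rng = random.Random(seed)

    @property
    def num_arms(self):
        return len(self.means)

    def pull(self, arm):
        """Return a stochastic reward whose expectation is means[arm]."""
        return 1.0 if self.rng.random() < self.means[arm] else 0.0


bandit = BernoulliBandit([0.2, 0.5, 0.8])
rewards = [bandit.pull(2) for _ in range(1000)]
average = sum(rewards) / len(rewards)  # empirical estimate of mu for arm 2
```

There is no state and no transition: each call to `pull` depends only on the chosen arm.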


Optimal Action and Regret

  • Goal: maximize cumulative reward $$  \mathbb E\left[\sum_{t=1}^T r(a_t)  \right] = \sum_{t=1}^T \mu_{a_t}$$
  • Optimal action \(a_\star = \arg\max_{a\in\mathcal A} \mu_a\)
  • Definition: The regret of an algorithm which chooses actions \(a_1,\dots,a_T\) is $$R(T) = \mathbb E\left[\sum_{t=1}^T \big(r(a_\star)-r(a_t)\big)  \right] = \sum_{t=1}^T \left(\mu_\star - \mu_{a_t}\right), \quad \text{where } \mu_\star = \mu_{a_\star} $$
  • Good algorithms have sublinear regret \(R(T)\lesssim T^p\) for \(p<1\)
  • Notation \(f(T)\lesssim g(T)\) means \(f(T)\leq c\cdot g(T)\) for constant \(c\)
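
As a small worked example of the definition, the regret of a fixed action sequence can be computed directly from the arm means. The helper name `pseudo_regret` and the 0-indexed arms are assumptions made for this sketch.

```python
def pseudo_regret(means, actions):
    """Sum of gaps (mu_star - mu_{a_t}) over the chosen actions."""
    best = max(means)
    return sum(best - means[a] for a in actions)


means = [0.2, 0.5, 0.8]
always_best = pseudo_regret(means, [2] * 100)   # never pays a gap
always_worst = pseudo_regret(means, [0] * 100)  # pays a gap of 0.6 every round
```

Always playing a fixed suboptimal arm gives regret linear in \(T\), which is what a good algorithm must avoid.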


Explore-then-Commit

  • First attempt: a simple algorithm that separates exploration and exploitation into two phases


  • for \(t=1,2,...,N\cdot K\)
    • \(a_t=((t-1) \bmod K)+1\), store \(r_t\)       # try each arm \(N\) times
  • \(\hat \mu_a = \frac{1}{N} \sum_{i=0}^{N-1} r_{K\cdot i+a}\)                   # average reward
  • for \(t=NK+1,\dots,T\)
    • \(a_t=\arg\max_{a\in[K]} \hat \mu_a  = \hat a_\star\) # commit to best
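
The two phases can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: arms are 0-indexed, `pull` is a hypothetical callback that returns a reward for the chosen arm, and the Bernoulli rewards in the usage lines are an assumption.

```python
import random


def explore_then_commit(pull, num_arms, horizon, n_explore):
    """Run ETC for `horizon` rounds; returns (actions, committed_arm)."""
    assert 1 <= n_explore and n_explore * num_arms <= horizon
    actions = []
    totals = [0.0] * num_arms
    # Exploration phase: round-robin, so each arm is tried n_explore times.
    for t in range(n_explore * num_arms):
        arm = t % num_arms
        totals[arm] += pull(arm)
        actions.append(arm)
    # Empirical mean reward of each arm.
    means_hat = [s / n_explore for s in totals]
    # Commit phase: play the empirically best arm for the remaining rounds.
    committed = max(range(num_arms), key=lambda a: means_hat[a])
    for _ in range(n_explore * num_arms, horizon):
        pull(committed)
        actions.append(committed)
    return actions, committed


rng = random.Random(0)
true_means = [0.1, 0.9]  # hypothetical arm means


def pull(arm):
    return 1.0 if rng.random() < true_means[arm] else 0.0


actions, committed = explore_then_commit(pull, num_arms=2, horizon=500, n_explore=50)
```

After the first \(NK\) rounds the algorithm never revisits its choice, so a bad estimate in the exploration phase is paid for in every remaining round.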

Regret of ETC

  • How to set \(N\)?
  • The regret decomposes $$R(T) =\sum_{t=1}^T \left(\mu_\star - \mu_{a_t}\right) = \underbrace{\sum_{t=1}^{NK} \left(\mu_\star - \mu_{a_t}\right)}_{R_1} + \underbrace{\sum_{t=NK+1}^T \left(\mu_\star - \mu_{\hat a_\star}\right)}_{R_2}$$
  • Assuming that rewards are bounded, \(r_t\in[0,1]\), $$R_1+R_2 \leq NK + T (\mu_\star - \mu_{\hat a_\star})$$

Confidence intervals

  • Sub-optimality is bounded with high probability by the width of the confidence intervals: $$ \mu_\star - \mu_{\hat a_\star}\lesssim \sqrt{\frac{\log(K/\delta)}{N}}  $$

\( \mu_{a} \in\left[ \hat \mu_{a} \pm c\sqrt{\frac{\log(K/\delta)}{N}}\right]\)

  • We derived confidence intervals using Hoeffding's bound (\(c\) is a constant)
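
For concreteness, one standard instantiation of the interval width can be computed as below. This sketch assumes rewards in \([0,1]\), the two-sided Hoeffding inequality \(\Pr(|\hat\mu_a-\mu_a|\geq\epsilon)\leq 2e^{-2N\epsilon^2}\), and a union bound over the \(K\) arms; the resulting constants (the factors of 2) are what the slide absorbs into \(c\). The function name is hypothetical.

```python
import math


def hoeffding_radius(n_pulls, num_arms, delta):
    """Half-width eps such that all K intervals hold with probability >= 1 - delta.

    Solves 2 * K * exp(-2 * N * eps**2) = delta for eps.
    """
    return math.sqrt(math.log(2 * num_arms / delta) / (2 * n_pulls))


radius_small_n = hoeffding_radius(10, num_arms=5, delta=0.05)
radius_large_n = hoeffding_radius(1000, num_arms=5, delta=0.05)
```

The width shrinks like \(1/\sqrt{N}\): 100 times more samples give an interval 10 times narrower.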

Regret of ETC

  • How to set \(N\)?
  • \(R(T) =R_1+R_2 \leq  NK + T (\mu_\star - \mu_{\hat a_\star})\)
    • \(\lesssim  NK + T \sqrt{\frac{\log(K/\delta)}{N}}\) with probability \(1-\delta\)
  • Minimizing with respect to \(N\)
    • set derivative to zero: \(K - T c\sqrt{\frac{\log(K/\delta)}{4N^3}}=0\)
  • Regret-minimizing choice \(N=\left (\frac{cT}{2K}\sqrt{\log(K/\delta)}\right)^{2/3}\)
  • Results in \(R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)\)


  • Theorem: For \(N\propto  \left((T/K)\sqrt{\log(K/\delta)}\right)^{2/3}\), the regret of ETC is bounded with probability \(1-\delta\): $$R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$$
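
The choice of \(N\) can be checked numerically. The sketch below evaluates the upper bound \(NK + Tc\sqrt{\log(K/\delta)/N}\) and the stationary point from the derivation; the function names and the default \(c=1\) are assumptions for illustration. In practice \(N\) would be rounded to an integer.

```python
import math


def etc_bound(n, horizon, num_arms, delta, c=1.0):
    """Upper bound N*K + T*c*sqrt(log(K/delta)/N) on the regret of ETC."""
    return n * num_arms + horizon * c * math.sqrt(math.log(num_arms / delta) / n)


def etc_explore_length(horizon, num_arms, delta, c=1.0):
    """Stationary point N = (c*T/(2K) * sqrt(log(K/delta)))**(2/3)."""
    return (c * horizon / (2 * num_arms) * math.sqrt(math.log(num_arms / delta))) ** (2 / 3)


T, K, delta = 10_000, 5, 0.05
n_star = etc_explore_length(T, K, delta)
bound_at_n_star = etc_bound(n_star, T, K, delta)
```

Because the bound is convex in \(N\), the stationary point is its minimizer, so moving \(N\) in either direction can only increase the bound.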



Upper Confidence Bound

  • An algorithm that adapts to confidence intervals
  • Idea: Pull the arm with the highest upper confidence bound
  • Principle of optimism in the face of uncertainty


  • Initialize \(\hat \mu_{a,1}\) and \(N_{a,1}\) for \(a\in[K]\)
  • for \(t=1,2,...,T\)
    • \(a_t=\arg\max_{a\in[K]} \hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}\) # largest UCB
    • update \(N_{a_t,t+1}\) and \(\hat\mu_{a_t,t+1}\)

  • number of pulls per arm:
    \(N_{a,t} = \sum_{k=1}^{t-1} \mathbf 1\{a_k=a\}\)
  • average reward per arm:
    \(\hat \mu_{a,t} = \frac{1}{N_{a,t}} \sum_{k=1}^{t-1} \mathbf 1\{a_k=a\} r_k\)
  • upper confidence bound:
    \(u_{a,t} =\hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}\)
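
A minimal Python sketch of the algorithm follows. The slides leave the initialization unspecified; this sketch assumes each arm is pulled once at the start so that \(N_{a,t}\geq 1\) before any bonus is computed. Arms are 0-indexed, and `pull` and the Bernoulli rewards in the usage lines are hypothetical.

```python
import math
import random


def ucb(pull, num_arms, horizon, delta=0.05):
    """Run UCB for `horizon` rounds; returns (actions, pull counts per arm)."""
    log_term = math.log(num_arms * horizon / delta)
    counts = [0] * num_arms       # N_{a,t}: number of pulls of arm a so far
    means_hat = [0.0] * num_arms  # running average reward of arm a
    actions = []

    def upper_bound(a):
        return means_hat[a] + math.sqrt(log_term / counts[a])

    for t in range(horizon):
        if t < num_arms:
            arm = t  # assumption: initialize by pulling every arm once
        else:
            arm = max(range(num_arms), key=upper_bound)  # largest UCB
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the empirical mean.
        means_hat[arm] += (reward - means_hat[arm]) / counts[arm]
        actions.append(arm)
    return actions, counts


rng = random.Random(0)
true_means = [0.1, 0.9]  # hypothetical arm means


def pull(arm):
    return 1.0 if rng.random() < true_means[arm] else 0.0


actions, counts = ucb(pull, num_arms=2, horizon=2000)
```

Unlike ETC, there is no fixed exploration phase: an arm keeps being sampled only while its upper confidence bound is competitive.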

UCB Intuition

  • Why does it work?
  • Principle of optimism in the face of uncertainty
  • Two reasons to pull an arm:
    1. large confidence interval (explore)
    2. a good arm (exploit)
  • Two outcomes from acting optimistically:
    1. we were correct \(\rightarrow\) high reward
    2. we were wrong \(\rightarrow\) adjust estimates


Sub-optimality at \(t\)

\(\mu_\star - \mu_{a_t} \)

  • \(\leq u_{a_\star, t} - \mu_{a_t}\) confidence interval holds for \(a_\star\)
  • \(\leq u_{a_t, t} - \mu_{a_t}\) since \(a_t\) has the largest UCB
  • \(= \hat \mu_{a_t, t} + \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} - \mu_{a_t}\) definition of \(u_{a_t,t}\)
  • \(\leq \hat \mu_{a_t, t} + \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} - \left(\hat \mu_{a_t, t} - \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} \right)\) confidence interval holds for \(a_t\)
  • \(= 2\sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} \)



Claim: sub-optimality at \(t\) is bounded by the width of \(a_t\)'s confidence interval

Sublinear Regret

  • Regret is cumulative sub-optimality
  • \(R(T) = \sum_{t=1}^T \left(\mu_\star - \mu_{a_t}\right) \)
    • \(\leq \sum_{t=1}^T 2\sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} \)
    • \(= 2\sqrt {\log(KT/\delta)} \sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \)
  • Claim: Since we only pull one arm per round, $$\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2K\sqrt{T}$$
  • Putting it all together, $$R(T) \lesssim K\sqrt {T \log(KT/\delta) }$$

Proof of Claim

  • Claim: since one arm per round, \(\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2 K\sqrt{T}\)
  • Proof: \(\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \)
    • \(=\sum_{t=1}^T \sum_{a=1}^K \mathbf 1\{a_t=a\} \sqrt{1/{N_{a,t}}} \)
    • \(=\sum_{a=1}^K \sum_{t=1}^T  \mathbf 1\{a_t=a\} \sqrt{1/{N_{a,t}}} \) switching order
    • \(=\sum_{a=1}^K \sum_{n=1}^{N_{a,T}}  \sqrt{1/n} \) since \(N_{a,t}\) increments by 1
    • \(\leq \sum_{a=1}^K \sum_{n=1}^{T}  \sqrt{1/n} \) since at most \(T\) pulls
    • \(= K \sum_{n=1}^{T}  \sqrt{1/n} \)
    • \(\leq K\left(1+\int_{x=1}^T\sqrt{1/x} dx\right ) \) integral bounds sum
    • \(= K\left(1+2\sqrt{T} - 2\sqrt{1} \right ) \)
    • \(\leq 2K\sqrt{T}\)
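
The claim can be sanity-checked numerically on an arbitrary pull sequence. In this sketch the count used in round \(t\) includes the current pull, matching the sum over \(n=1,\dots,N_{a,T}\) in the proof; the function name and the random sequence are assumptions for illustration.

```python
import math
import random


def inverse_sqrt_count_sum(actions, num_arms):
    """Sum over rounds of 1/sqrt(n), where n is the pull count of the chosen arm."""
    counts = [0] * num_arms
    total = 0.0
    for arm in actions:
        counts[arm] += 1  # n runs over 1, 2, ..., N_{a,T} for each arm
        total += 1.0 / math.sqrt(counts[arm])
    return total


T, K = 1000, 4
rng = random.Random(0)
random_actions = [rng.randrange(K) for _ in range(T)]
total_random = inverse_sqrt_count_sum(random_actions, K)

# Worst case for a single arm: every pull goes to the same arm.
total_single = inverse_sqrt_count_sum([0] * T, K)
```

The single-arm case corresponds to the inner sum \(\sum_{n=1}^T \sqrt{1/n}\), which the integral bounds by \(2\sqrt{T}-1\).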

Tighter bound

  • Claim: since one arm per round, \(\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2\sqrt{KT}\)
  • Proof: \(\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \)
    • \(=\sum_{a=1}^K \sum_{n=1}^{N_{a,T}}  \sqrt{1/n} \)  same as previous
    • \(\leq 2\sum_{a=1}^K \sqrt{N_{a,T}} \) integral bounds sum, as before
    • \(= 2K \cdot \frac{1}{K} \sum_{a=1}^K \sqrt{N_{a,T}} \)
    • \(\leq  2K \cdot \sqrt{\frac{1}{K}  \sum_{a=1}^K  N_{a,T}} \) Jensen's
    • \(\leq  2K \cdot \sqrt{\frac{T}{K} } = 2\sqrt{KT} \) at most \(T\) pulls
  • Consequently, \(R(T) \lesssim \sqrt{KT\log(KT/\delta)}\)


Explore-then-Commit

  1. Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
  2. For \(t=NK+1,...,T\):
        Pull \(\widehat a_\star = \arg\max_a \widehat \mu_a\)

Upper Confidence Bound

For \(t=1,...,T\):

  • Pull \( a_t = \arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
  • Update empirical means \(\widehat \mu_{a,t}\) and counts \(N_{a,t}\)

Explore-then-Commit: explore for \(N \approx T^{2/3}\), giving \(R(T) \lesssim T^{2/3}\)

Upper Confidence Bound: \(R(T) \lesssim \sqrt{T}\)


Preview: Contextual Bandits

Example: online advertising



"Arms" are different job ads:

But consider different users, e.g. a CS major vs. an English major

Preview: Contextual Bandits

Example: online shopping

"Arms" are various products

But what about search queries, browsing history?

Preview: Contextual Bandits

Example: social media feeds

"Arms" are various posts: images, videos

Personalized to each user based on demographics, behavioral data, etc

Preview: Contextual Bandits

  • The best action will depend on the context
    • e.g. major, browsing history, demographics
  • Thus we need a policy for mapping context to action


  • PSet, PA released


  • Explore-then-Commit
  • Upper Confidence Bound


  • Next lecture: Policies & contextual bandits

