CS 4/5789: Introduction to Reinforcement Learning
Lecture 20: Upper Confidence Bound
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due weekly on Mondays
- PSet 6 due next Monday
- Final PA released later this week
- Final exam is Saturday 5/13 at 2pm

Agenda
1. Multi-Armed Bandits
2. Explore-then-Commit
3. UCB Algorithm
4. UCB Analysis

Multi-Armed Bandit
A simplified setting for studying exploration
Multi-Armed Bandits
- for \(t=1,2,...,T\)
- take action \(a_t\in\{1,\dots, K\}\)
- receive reward \(r_t\)
- \(\mathbb E[r_t] = \mu_{a_t}\)
MAB Setting
- Simplified RL setting with no state and no transitions
- \(\mathcal A=\{1,\dots,K\}\) \(K\) discrete actions ("arms")
- Stochastic rewards \(r_t\sim r(a_t)\) with expectation \(\mathbb E[r(a)] = \mu_a\)
- Finite time horizon \(T\in\mathbb Z_+\)
Optimal Action and Regret
- Goal: maximize cumulative reward $$ \mathbb E\left[\sum_{t=1}^T r(a_t) \right] = \sum_{t=1}^T \mu_{a_t}$$
- Optimal action \(a_\star = \arg\max_{a\in\mathcal A} \mu_a\)
- Definition: The regret of an algorithm which chooses actions \(a_1,\dots,a_T\) is $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a_\star)-r(a_t) \right] = \sum_{t=1}^T \mu_\star - \mu_{a_t} $$
- Good algorithms have sublinear regret$$ R(T)\lesssim T^p\quad\text{for}\quad p<1 \qquad \left(\implies \lim_{T\to\infty} \frac{1}{T} R(T) \to 0\right)$$
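The interaction loop and pseudo-regret above can be sketched in a few lines of Python. Bernoulli rewards and a uniformly random baseline policy are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

def run_bandit(mu, policy, T, rng):
    """Simulate T rounds of a multi-armed bandit and return the
    pseudo-regret sum_t (mu_star - mu[a_t])."""
    mu = np.asarray(mu)
    regret = 0.0
    history = []
    for t in range(T):
        a = policy(t, history)                 # choose an arm
        r = rng.binomial(1, mu[a])             # stochastic reward, E[r] = mu[a]
        history.append((a, r))
        regret += mu.max() - mu[a]             # accumulate the optimality gap
    return regret

rng = np.random.default_rng(0)
mu = [0.2, 0.5, 0.8]                           # hypothetical arm means
uniform = lambda t, h: int(rng.integers(len(mu)))  # explore-only baseline
R = run_bandit(mu, uniform, T=1000, rng=rng)
# A non-adaptive policy like this suffers linear regret (about 0.3 per round here)
```

Playing uniformly at random never stops exploring, so its average regret per round does not vanish; the algorithms that follow aim to do better.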
Agenda
1. Multi-Armed Bandits
2. Explore-then-Commit
3. UCB Algorithm
4. UCB Analysis
Explore-then-Commit
- First attempt: a simple algorithm that splits exploration and exploitation into two separate phases
Explore-then-Commit
- for \(t=1,2,...,N\cdot K\)
- \(a_t=t\mod K\), store \(r_t\) # try each arm \(N\) times
- \(\hat \mu_a = \frac{1}{N} \sum_{i=1}^N r_{K(i-1)+a}\) # average reward
- for \(t=NK+1,\dots,T\)
- \(a_t=\arg\max_{a\in[K]} \hat \mu_a = \hat a_\star\) # commit to best
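The two phases can be sketched as follows; Bernoulli rewards and the particular arm means are illustrative assumptions:

```python
import numpy as np

def explore_then_commit(mu, N, T, rng):
    """ETC: pull each of the K arms N times, then commit to the arm
    with the best empirical mean. Returns the pseudo-regret."""
    mu = np.asarray(mu)
    K = len(mu)
    sums = np.zeros(K)
    regret = 0.0
    # Exploration phase: rounds 1..NK, cycling through the arms
    for t in range(N * K):
        a = t % K
        sums[a] += rng.binomial(1, mu[a])      # Bernoulli reward (illustrative)
        regret += mu.max() - mu[a]
    a_hat = int(np.argmax(sums / N))           # best empirical mean
    # Commit phase: rounds NK+1..T all pull a_hat
    regret += (T - N * K) * (mu.max() - mu[a_hat])
    return regret

rng = np.random.default_rng(1)
R = explore_then_commit([0.2, 0.5, 0.8], N=50, T=5000, rng=rng)
```

The exploration phase contributes regret at most \(NK\), while the commit phase pays \((T-NK)\) times the sub-optimality of the committed arm; this is exactly the decomposition analyzed next.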
Explore-then-Commit
- How to set \(N\)?
- The regret decomposes $$R(T) =\sum_{t=1}^T \mu_\star - \mu_{a_t} = \underbrace{\sum_{t=1}^{NK} \mu_\star - \mu_{a_t}}_{R_1} + \underbrace{\sum_{t=NK+1}^T \mu_\star - \mu_{\hat a_\star}}_{R_2}$$
- Assuming that rewards are bounded, \(r_t\in[0,1]\), $$R_1+R_2 \leq NK + T (\mu_\star - \mu_{\hat a_\star})$$
Confidence intervals
- We derived confidence intervals using Hoeffding's bound (\(c\) is a constant): with probability \(1-\delta\), for every arm \(a\), $$ \mu_{a} \in\left[ \hat \mu_{a} \pm c\sqrt{\frac{\log(K/\delta)}{N}}\right]$$
- Sub-optimality of the committed arm is then bounded with high probability by the width of the confidence intervals: $$ \mu_\star - \mu_{\hat a_\star}\lesssim \sqrt{\frac{\log(K/\delta)}{N}} $$
Explore-then-Commit
- How to set \(N\)?
- \(R(T) =R_1+R_2 \leq NK + T (\mu_\star - \mu_{\hat a_\star})\)
- \(\lesssim NK + cT \sqrt{\frac{\log(K/\delta)}{N}}\) with probability \(1-\delta\)
- Minimizing with respect to \(N\):
- set derivative to zero: \(K - cT\sqrt{\frac{\log(K/\delta)}{4N^3}}=0\)
- Regret-minimizing choice \(N=\left (\frac{cT}{2K}\sqrt{\log (K/\delta)}\right)^{2/3}\)
- Results in \(R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)\)
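The stationary-point formula for \(N\) can be sanity-checked numerically against a grid search; the constants below are illustrative:

```python
import numpy as np

# Sanity check of the regret-minimizing N for the ETC bound
# g(N) = N*K + c*T*sqrt(log(K/delta)/N), with illustrative constants.
K, T, c, delta = 10, 100_000, 1.0, 0.1
L = np.log(K / delta)

def g(N):
    return N * K + c * T * np.sqrt(L / N)

# Closed-form stationary point from setting dg/dN = 0
N_star = (c * T / (2 * K) * np.sqrt(L)) ** (2 / 3)

# Grid search over integers confirms the formula is (near) optimal
grid = np.arange(1, 5000)
N_grid = int(grid[np.argmin(g(grid))])
# Since g is convex in N, the integer minimizer sits next to N_star
```

Plugging the optimal \(N\) back into \(g\) gives the \(T^{2/3}\) scaling stated in the theorem.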
Explore-then-Commit
- Theorem: For \(N\propto ((T/K)\sqrt{\log (K/\delta)})^{2/3}\), the regret of ETC is bounded with probability \(1-\delta\): $$R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$$
Agenda
1. Multi-Armed Bandits
2. Explore-then-Commit
3. UCB Algorithm
4. UCB Analysis
Upper Confidence Bound
- An algorithm that adapts to confidence intervals
- Idea: Pull the arm with the highest upper confidence bound
- Principle of optimism in the face of uncertainty
UCB
- Initialize \(\hat \mu_{a,1}\) and \(N_{a,1}\) for \(a\in[K]\) (e.g. by pulling each arm once, so \(N_{a,1}\geq 1\))
- for \(t=1,2,...,T\)
- \(a_t=\arg\max_{a\in[K]} \hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}\) # largest UCB
- update \(N_{a_t,t+1}\) and \(\hat\mu_{a_t,t+1}\)
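A sketch of the loop above, initializing by pulling each arm once (so every \(N_{a,t}\geq 1\)) and assuming Bernoulli rewards for illustration:

```python
import numpy as np

def ucb(mu, T, delta, rng):
    """UCB: pull the arm with the largest upper confidence bound
    u_{a,t} = mu_hat_{a,t} + sqrt(log(K*T/delta)/N_{a,t})."""
    mu = np.asarray(mu)
    K = len(mu)
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(T):
        if t < K:
            a = t                              # initialization: each arm once
        else:
            ucbs = sums / counts + np.sqrt(np.log(K * T / delta) / counts)
            a = int(np.argmax(ucbs))           # optimism: largest UCB
        r = rng.binomial(1, mu[a])             # Bernoulli reward (illustrative)
        counts[a] += 1
        sums[a] += r
        regret += mu.max() - mu[a]
    return regret

rng = np.random.default_rng(2)
R = ucb([0.2, 0.5, 0.8], T=5000, delta=0.1, rng=rng)
```

Unlike ETC, there is no tuning of an exploration phase: arms with few pulls get a large bonus and keep getting tried until their confidence intervals shrink.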
Upper Confidence Bound
- number of pulls per arm: \(N_{a,t} = \sum_{k=1}^{t-1} \mathbf 1\{a_k=a\}\)
- average reward per arm: \(\hat \mu_{a,t} = \frac{1}{N_{a,t}} \sum_{k=1}^{t-1} \mathbf 1\{a_k=a\} r_k\)
- upper confidence bound: \(u_{a,t} =\hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}\)
UCB Intuition
- Why does it work?
- Principle of optimism in the face of uncertainty
- Two reasons to pull an arm:
- large confidence interval (explore)
- a good arm (exploit)
- Two outcomes from acting optimistically:
- we were correct \(\rightarrow\) high reward
- we were wrong \(\rightarrow\) adjust estimates
Agenda
1. Multi-Armed Bandits
2. Explore-then-Commit
3. UCB Algorithm
4. UCB Analysis
Sub-optimality at \(t\)
- Claim: the sub-optimality at time \(t\) is bounded by the width of \(a_t\)'s confidence interval
- \(\mu_\star - \mu_{a_t} \)
- \(\leq u_{a_\star, t} - \mu_{a_t}\) since \(\mu_\star \leq u_{a_\star, t}\) with high probability
- \(\leq u_{a_t, t} - \mu_{a_t}\) since \(a_t\) maximizes the upper confidence bound
- \(= \hat \mu_{a_t, t} + \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} - \mu_{a_t}\)
- \(\leq \hat \mu_{a_t, t} + \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} - \left(\hat \mu_{a_t, t} - \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} \right)\) since \(\mu_{a_t}\geq \hat\mu_{a_t,t} - \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}}\) with high probability
- \(= 2\sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} \)
Sublinear Regret
- Regret is cumulative sub-optimality
- \(R(T) = \sum_{t=1}^T \mu_\star - \mu_{a_t} \)
- \(\leq \sum_{t=1}^T 2\sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} \)
- \(= 2\sqrt {\log(KT/\delta)} \sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \)
- Claim: Since we only pull one arm per round, $$\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2K\sqrt{T}$$
- Putting it all together, $$R(T) \lesssim K\sqrt {T \log(KT/\delta) }$$
Proof of Claim
- Claim: since one arm per round, \(\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2 K\sqrt{T}\)
- Proof: \(\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \)
- \(=\sum_{t=1}^T \sum_{a=1}^K \mathbf 1\{a_t=a\} \sqrt{1/{N_{a,t}}} \)
- \(=\sum_{a=1}^K \sum_{t=1}^T \mathbf 1\{a_t=a\} \sqrt{1/{N_{a,t}}} \) switching order
- \(=\sum_{a=1}^K \sum_{n=1}^{N_{a,T}} \sqrt{1/n} \) since \(N_{a,t}\) increments by 1
- \(\leq \sum_{a=1}^K \sum_{n=1}^{T} \sqrt{1/n} \) since at most \(T\) pulls
- \(= K \sum_{n=1}^{T} \sqrt{1/n} \)
- \(\leq K\left(1+\int_{x=1}^T\sqrt{1/x} dx\right ) \) integral bounds sum
- \(= K\left(1+2\sqrt{T} - 2\sqrt{1} \right ) \)
- \(\leq 2K\sqrt{T}\)
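The claim can be checked numerically for random pull sequences; here \(N_{a_t,t}\) counts the current pull as well, which matches the proof's reindexing over \(n=1,\dots,N_{a,T}\):

```python
import math, random

def lhs(seq, K):
    """Compute sum_t sqrt(1/N_{a_t,t}) for a pull sequence: per arm the
    successive terms are 1/sqrt(1), 1/sqrt(2), ..., as in the proof."""
    counts = [0] * K
    total = 0.0
    for a in seq:
        counts[a] += 1
        total += math.sqrt(1 / counts[a])
    return total

random.seed(0)
K, T = 5, 2000
for _ in range(100):                           # random pull sequences
    seq = [random.randrange(K) for _ in range(T)]
    assert lhs(seq, K) <= 2 * K * math.sqrt(T)   # claim on this slide
    assert lhs(seq, K) <= 2 * math.sqrt(K * T)   # even the tighter 2*sqrt(KT) holds
```

The worst case for this sum is spreading the \(T\) pulls evenly across arms, which is exactly the case Jensen's inequality handles in the tighter bound.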
Tighter bound
- Claim: since one arm per round, \(\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2\sqrt{KT}\)
- Proof: \(\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \)
- \(=\sum_{a=1}^K \sum_{n=1}^{N_{a,T}} \sqrt{1/n} \) same as previous
- \(\leq \sum_{a=1}^K 2\sqrt{N_{a,T}} \) integral bound on the inner sum
- \(= 2K \cdot \frac{1}{K} \sum_{a=1}^K \sqrt{N_{a,T}} \)
- \(\leq 2K \cdot \sqrt{\frac{1}{K} \sum_{a=1}^K N_{a,T}} \) Jensen's inequality (concavity of \(\sqrt{\cdot}\))
- \(= 2K \cdot \sqrt{\frac{T}{K} } = 2\sqrt{KT} \) since there are \(T\) pulls in total
- Together with the regret decomposition, this gives \(R(T)\lesssim \sqrt{KT\log(KT/\delta)}\)
Comparison
Explore-then-Commit
- Pull each arm \(N\) times and compute empirical means \(\widehat \mu_a\)
- For \(t=NK+1,\dots,T\): pull \(\widehat a_\star = \arg\max_a \widehat \mu_a\)
- Exploring for \(N \approx T^{2/3}\) gives \(R(T) \lesssim T^{2/3}\)
Upper Confidence Bound
- For \(t=1,\dots,T\):
- Pull \(a_t = \arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
- Update empirical means \(\widehat \mu_{a,t}\) and counts \(N_{a,t}\)
- Adapting to confidence widths gives \(R(T) \lesssim \sqrt{T}\)
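The two algorithms can be compared empirically on the same instance; the Bernoulli arm means, horizon, and constants below are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.2, 0.5, 0.8])                 # illustrative arm means
K, T, delta = len(mu), 20_000, 0.1

def etc_regret(N):
    """ETC pseudo-regret on Bernoulli arms."""
    sums = np.array([rng.binomial(N, m) for m in mu], dtype=float)
    explore = N * (mu.max() - mu).sum()        # regret of the NK explore rounds
    a_hat = int(np.argmax(sums / N))           # committed arm
    return explore + (T - N * K) * (mu.max() - mu[a_hat])

def ucb_regret():
    """UCB pseudo-regret with bonus sqrt(log(KT/delta)/N_a)."""
    counts = np.ones(K)
    sums = rng.binomial(1, mu).astype(float)   # pull each arm once to start
    reg = (mu.max() - mu).sum()
    for _ in range(T - K):
        a = int(np.argmax(sums / counts + np.sqrt(np.log(K * T / delta) / counts)))
        counts[a] += 1
        sums[a] += rng.binomial(1, mu[a])
        reg += mu.max() - mu[a]
    return reg

R_etc = etc_regret(N=int((T / K) ** (2 / 3)))  # N ~ (T/K)^{2/3} per the theorem
R_ucb = ucb_regret()
# Both are far below the linear baseline; UCB typically ends up lower at large T
```

ETC's regret is dominated by its fixed exploration budget, while UCB spends exploration only where the confidence intervals are still wide.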
Preview: Contextual Bandits
Example: online advertising
- "Arms" are different job ads, e.g. journalism vs. programming
- But consider different users, e.g. a CS major vs. an English major: the best ad depends on who is viewing it
Preview: Contextual Bandits
Example: online shopping
"Arms" are various products
But what about search queries, browsing history, items in cart?

Preview: Contextual Bandits
Example: social media feeds
"Arms" are various posts: images, videos
Personalized to each user based on demographics, behavioral data, etc

Preview: Contextual Bandits
- The best action will depend on the context
- e.g. major, browsing history, demographics
- Thus we need a policy for mapping context to action
Recap
- PSet released tonight
- Explore-then-Commit
- Upper Confidence Bound
- Next lecture: Policies & contextual bandits
Sp23 CS 4/5789: Lecture 20
By Sarah Dean