CS 4/5789: Introduction to Reinforcement Learning

Lecture 20: Upper Confidence Bound

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

Homework
- 5789 Paper Assignments
- PSet 7 due Monday
- Final PA released
Final exam is Tuesday 5/14 at 2pm

Agenda

1. Recap: Multi-Armed Bandits

2. Explore-then-Commit

3. UCB Algorithm

4. UCB Analysis

Multi-Armed Bandit

A simplified setting for studying exploration

Multi-Armed Bandits

for $t=1,2,...,T$
- take action $a_t\in\{1,\dots, K\}$
- receive reward $r_t$
  - $\mathbb E[r_t] = \mu_{a_t}$

MAB Setting

Simplified RL setting with no state and no transitions
$\mathcal A=\{1,\dots,K\}$ $K$ discrete actions ("arms")
Stochastic rewards $r_t\sim r(a_t)$ with expectation $\mathbb E[r(a)] = \mu_a$
Finite time horizon $T\in\mathbb Z_+$

Optimal Action and Regret

Goal: maximize cumulative reward $$ \mathbb E\left[\sum_{t=1}^T r(a_t) \right] = \sum_{t=1}^T \mu_{a_t}$$
Optimal action $a_\star = \arg\max_{a\in\mathcal A} \mu_a$
Definition: The regret of an algorithm which chooses actions $a_1,\dots,a_T$ is $$R(T) = \mathbb E\left[\sum_{t=1}^T \left(r(a_\star)-r(a_t)\right) \right] = \sum_{t=1}^T \left(\mu_\star - \mu_{a_t}\right), \qquad \mu_\star = \mu_{a_\star} = \max_{a\in\mathcal A} \mu_a$$
Good algorithms have sublinear regret $R(T)\lesssim T^p$ for $p<1$
Notation $f(T)\lesssim g(T)$ means $f(T)\leq c\cdot g(T)$ for some constant $c>0$ that does not depend on $T$
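
As a concrete illustration of the regret definition, the following Python sketch evaluates the expected regret of a fixed action sequence. The arm means and the two action sequences are made-up values for illustration, and arms are 0-indexed here rather than $1,\dots,K$.

```python
def regret(means, actions):
    """Expected regret sum_t (mu_star - mu_{a_t}) of a fixed action sequence."""
    mu_star = max(means)  # mean reward of the optimal arm
    return sum(mu_star - means[a] for a in actions)


# Hypothetical 3-armed bandit; arm 2 is optimal.
means = [0.2, 0.5, 0.7]
T = 100

always_best = [2] * T                  # zero regret
round_robin = [t % 3 for t in range(T)]  # regret grows linearly in T

print(regret(means, always_best))
print(regret(means, round_robin))
```

Always pulling the optimal arm incurs no regret, while cycling through the arms pays a constant average gap per round, which is the linear regret that good algorithms avoid.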

Explore-then-Commit

First attempt: a simple algorithm that separates exploration and exploitation into two phases

Explore-then-Commit

for $t=1,2,...,N\cdot K$
- $a_t=((t-1) \bmod K)+1$, store $r_t$ # try each arm $N$ times
$\hat \mu_a = \frac{1}{N} \sum_{i=0}^{N-1} r_{K\cdot i+a}$ # average reward
for $t=NK+1,\dots,T$
- $a_t=\arg\max_{a\in[K]} \hat \mu_a = \hat a_\star$ # commit to best
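
The two phases of Explore-then-Commit can be sketched in Python as follows. This is a minimal simulation under stated assumptions, not part of the original slides: rewards are Bernoulli with the given means, arms are 0-indexed, and the function name `explore_then_commit` and its `seed` parameter are hypothetical.

```python
import random


def explore_then_commit(means, N, T, seed=0):
    """Simulate ETC on Bernoulli arms; return (committed arm, expected regret)."""
    rng = random.Random(seed)
    K = len(means)
    if N * K > T:
        raise ValueError("exploration phase N*K must fit within the horizon T")

    totals = [0.0] * K
    actions = []

    # Exploration phase: round-robin, so each arm is pulled exactly N times.
    for t in range(N * K):
        a = t % K
        reward = 1.0 if rng.random() < means[a] else 0.0
        totals[a] += reward
        actions.append(a)

    # Empirical mean of each arm and the arm we commit to.
    mu_hat = [s / N for s in totals]
    a_hat = max(range(K), key=lambda a: mu_hat[a])

    # Commit phase: rewards are no longer needed to choose actions,
    # so only the chosen arm is recorded.
    actions.extend([a_hat] * (T - N * K))

    mu_star = max(means)
    expected_regret = sum(mu_star - means[a] for a in actions)
    return a_hat, expected_regret


a_hat, reg = explore_then_commit([0.1, 0.9], N=50, T=1000)
print(a_hat, reg)
```

When the committed arm is optimal, all regret comes from the exploration phase, which is the $R_1 \leq NK$ term analyzed next.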

Regret of ETC

How to set $N$?
The regret decomposes $$R(T) =\sum_{t=1}^T \left(\mu_\star - \mu_{a_t}\right) = \underbrace{\sum_{t=1}^{NK} \left(\mu_\star - \mu_{a_t}\right)}_{R_1} + \underbrace{\sum_{t=NK+1}^T \left(\mu_\star - \mu_{\hat a_\star}\right)}_{R_2}$$
Assuming that rewards are bounded, $r_t\in[0,1]$, each term of $R_1$ is at most 1 and $R_2$ has at most $T$ terms, so $$R_1+R_2 \leq NK + T (\mu_\star - \mu_{\hat a_\star})$$

Confidence intervals

Sub-optimality is bounded with high probability by the width of the confidence intervals: $$ \mu_\star - \mu_{\hat a_\star}\lesssim \sqrt{\frac{\log(K/\delta)}{N}} $$

$ \mu_{a} \in\left[ \hat \mu_{a} \pm c\sqrt{\frac{\log(K/\delta)}{N}}\right]$

We derived confidence intervals using Hoeffding's bound ($c$ is a constant)
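
To make the interval width concrete, here is a small Python sketch of one standard instantiation: the two-sided Hoeffding bound for rewards in $[0,1]$ combined with a union bound over the $K$ arms, which gives half-width $\sqrt{\log(2K/\delta)/(2N)}$. This matches the form above up to the constant $c$; the function name `hoeffding_halfwidth` is a hypothetical choice.

```python
import math


def hoeffding_halfwidth(N, K, delta):
    """Half-width eps such that, for rewards in [0, 1] and N samples per arm,
    all K empirical means lie within eps of the true means with
    probability at least 1 - delta (Hoeffding plus a union bound)."""
    return math.sqrt(math.log(2 * K / delta) / (2 * N))


for N in (10, 100, 1000):
    print(N, hoeffding_halfwidth(N, K=5, delta=0.05))
```

The width shrinks like $1/\sqrt{N}$, so quadrupling the number of pulls per arm halves the interval.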

Regret of ETC

How to set $N$?
$R(T) =R_1+R_2 \leq NK + T (\mu_\star - \mu_{\hat a_\star})$
- $\leq NK + cT \sqrt{\frac{\log(K/\delta)}{N}}$ with probability at least $1-\delta$
Minimizing with respect to $N$
- set derivative to zero: $K - c T\sqrt{\frac{\log(K/\delta)}{4N^3}}=0$
Regret minimizing choice $N=\left (\frac{cT}{2K}\sqrt{\log (K/\delta)}\right)^{2/3}$
Results in $R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$
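
The choice of $N$ can be sanity-checked numerically. The sketch below evaluates the upper bound $NK + cT\sqrt{\log(K/\delta)/N}$ and the closed-form minimizer; the constant $c=1$ and the values of $K$, $T$, $\delta$ are arbitrary assumptions for illustration.

```python
import math


def etc_bound(N, K, T, delta, c=1.0):
    """Upper bound N*K + c*T*sqrt(log(K/delta)/N) on the regret of ETC."""
    return N * K + c * T * math.sqrt(math.log(K / delta) / N)


def best_N(K, T, delta, c=1.0):
    """Stationary point of etc_bound in N (treated as a real number)."""
    return (c * T / (2 * K) * math.sqrt(math.log(K / delta))) ** (2 / 3)


K, T, delta = 10, 100_000, 0.05
N_star = best_N(K, T, delta)
print("N* =", N_star)
for scale in (0.5, 1.0, 2.0):
    print(scale, etc_bound(scale * N_star, K, T, delta))
```

Because the bound is convex in $N$ for $N>0$, the stationary point is its minimizer; in practice $N$ would be rounded to an integer.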

Explore-then-Commit

Theorem: For $N\propto \left((T/K)\sqrt{\log (K/\delta)}\right)^{2/3}$, the regret of ETC is bounded with probability at least $1-\delta$: $$R(T)\lesssim T^{2/3} K^{1/3} \log^{1/3}(K/\delta)$$

Upper Confidence Bound

An algorithm that adapts to confidence intervals
Idea: Pull the arm with the highest upper confidence bound
Principle of optimism in the face of uncertainty

UCB

Initialize $\hat \mu_{a,1}$ and $N_{a,1}$ for $a\in[K]$
for $t=1,2,...,T$
- $a_t=\arg\max_{a\in[K]} \hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}$ # largest UCB
- update $N_{a_t,t+1}$ and $\hat\mu_{a_t,t+1}$

number of pulls per arm:
$N_{a,t} = \sum_{k=1}^{t-1} \mathbf 1\{a_k=a\}$
average reward per arm:
$\hat \mu_{a,t} = \frac{1}{N_{a,t}} \sum_{k=1}^{t-1} \mathbf 1\{a_k=a\} r_k$
upper confidence bound:
$u_{a,t} =\hat \mu_{a,t} + \sqrt{\frac{\log(KT/\delta)}{N_{a,t}}}$
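
The quantities above translate fairly directly into code. The following Python sketch simulates UCB under several assumptions that are not in the slides: rewards are Bernoulli, arms are 0-indexed, and an arm with $N_{a,t}=0$ is given an infinite upper confidence bound so that every arm is tried once (the slides leave the initialization unspecified). The function name `ucb` is hypothetical.

```python
import math
import random


def ucb(means, T, delta=0.05, seed=0):
    """Simulate UCB on Bernoulli arms; return (pull counts, expected regret)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # N_{a,t}: number of pulls of each arm so far
    sums = [0.0] * K      # running sum of rewards of each arm
    log_term = math.log(K * T / delta)

    def upper_bound(a):
        if counts[a] == 0:
            return float("inf")  # assumption: unpulled arms are tried first
        mu_hat = sums[a] / counts[a]
        return mu_hat + math.sqrt(log_term / counts[a])

    regret = 0.0
    mu_star = max(means)
    for _ in range(T):
        a = max(range(K), key=upper_bound)  # arm with the largest UCB
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        regret += mu_star - means[a]
    return counts, regret


counts, reg = ucb([0.2, 0.8], T=2000)
print(counts, reg)
```

In a run like this the suboptimal arm should be pulled only until its confidence interval is narrow enough to rule it out, after which nearly all pulls go to the better arm.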

UCB Intuition

Why does it work?
Principle of optimism in the face of uncertainty
Two reasons to pull an arm:
1. large confidence interval (explore)
2. a good arm (exploit)
Two outcomes from acting optimistically:
1. we were correct $\rightarrow$ high reward
2. we were wrong $\rightarrow$ adjust estimates

Interactive Coding Demo

Sub-optimality at $t$

On the event that every confidence interval holds:

$\mu_\star - \mu_{a_t} $
$\leq u_{a_\star, t} - \mu_{a_t}$ since $\mu_\star$ lies below the upper confidence bound of $a_\star$
$\leq u_{a_t, t} - \mu_{a_t}$ since $a_t$ has the largest upper confidence bound
$= \hat \mu_{a_t, t} + \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} - \mu_{a_t}$
$\leq \hat \mu_{a_t, t} + \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} - \left(\hat \mu_{a_t, t} - \sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} \right)$ since $\mu_{a_t}$ lies above the lower confidence bound of $a_t$
$= 2\sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} $

Claim: sub-optimality at $t$ is bounded by the width of $a_t$'s confidence interval

Sublinear Regret

Regret is cumulative sub-optimality
$R(T) = \sum_{t=1}^T \left(\mu_\star - \mu_{a_t}\right) $
- $\leq \sum_{t=1}^T 2\sqrt{\frac{\log(KT/\delta)}{N_{a_t,t}}} $
- $= 2\sqrt {\log(KT/\delta)} \sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} $
Claim: Since we only pull one arm per round, $$\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2K\sqrt{T}$$
Putting it all together, $$R(T) \lesssim K\sqrt {T \log(KT/\delta) }$$

Proof of Claim

Claim: since one arm per round, $\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2 K\sqrt{T}$
Proof: $\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} $
- $=\sum_{t=1}^T \sum_{a=1}^K \mathbf 1\{a_t=a\} \sqrt{1/{N_{a,t}}} $
- $=\sum_{a=1}^K \sum_{t=1}^T \mathbf 1\{a_t=a\} \sqrt{1/{N_{a,t}}} $ switching order
- $=\sum_{a=1}^K \sum_{n=1}^{N_{a,T}} \sqrt{1/n} $ since $N_{a,t}$ increments by 1
- $\leq \sum_{a=1}^K \sum_{n=1}^{T} \sqrt{1/n} $ since at most $T$ pulls
- $= K \sum_{n=1}^{T} \sqrt{1/n} $
- $\leq K\left(1+\int_{x=1}^T\sqrt{1/x} dx\right ) $ integral bounds sum
- $= K\left(1+2\sqrt{T} - 2\sqrt{1} \right ) $
- $\leq 2K\sqrt{T}$
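
The integral step in the proof can be checked numerically. This short Python sketch compares $\sum_{n=1}^{T} \sqrt{1/n}$ with the bound $1 + 2\sqrt{T} - 2$ for a few horizons; the helper name `inv_sqrt_sum` and the chosen values of $T$ are arbitrary.

```python
import math


def inv_sqrt_sum(T):
    """Compute sum_{n=1}^{T} 1/sqrt(n)."""
    return sum(1.0 / math.sqrt(n) for n in range(1, T + 1))


def integral_bound(T):
    """1 + integral_1^T x^(-1/2) dx = 1 + 2*sqrt(T) - 2."""
    return 1 + 2 * math.sqrt(T) - 2


for T in (1, 10, 1000, 100_000):
    print(T, inv_sqrt_sum(T), integral_bound(T), 2 * math.sqrt(T))
```

The sum grows like $2\sqrt{T}$, so the final inequality in the proof loses only a lower-order term.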

Tighter bound

Claim: since one arm per round, $\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} \leq 2\sqrt{KT}$
Proof: $\sum_{t=1}^T \sqrt{1/{N_{a_t,t}}} $
- $=\sum_{a=1}^K \sum_{n=1}^{N_{a,T}} \sqrt{1/n} $ same as previous
- $\leq \sum_{a=1}^K 2\sqrt{N_{a,T}} $ summation trick: $\sum_{n=1}^{N} \sqrt{1/n} \leq 2\sqrt{N}$
- $= 2K \cdot \frac{1}{K} \sum_{a=1}^K \sqrt{N_{a,T}} $
- $\leq 2K \cdot \sqrt{\frac{1}{K} \sum_{a=1}^K N_{a,T}} $ Jensen's
- $\leq 2K \cdot \sqrt{\frac{T}{K} } = 2\sqrt{KT} $ at most $T$ pulls
This improves the regret bound to $R(T) \lesssim \sqrt{KT \log(KT/\delta)}$
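
The Jensen step says that, for a fixed total number of pulls, $\sum_a \sqrt{N_a}$ is largest when the pulls are spread evenly. The Python sketch below checks this on two hypothetical count vectors and compares the per-arm sum against $2\sqrt{KT}$ and the looser $2K\sqrt{T}$; the helper names are illustrative.

```python
import math


def per_arm_sum(counts):
    """Compute sum_a sum_{n=1}^{N_a} 1/sqrt(n) for the given pull counts."""
    return sum(
        sum(1.0 / math.sqrt(n) for n in range(1, N + 1)) for N in counts
    )


def sqrt_count_sum(counts):
    """Compute sum_a sqrt(N_a), the quantity bounded via Jensen's inequality."""
    return sum(math.sqrt(N) for N in counts)


# Hypothetical pull counts, each totalling T = 1000 over K = 4 arms.
for counts in ([900, 50, 30, 20], [250, 250, 250, 250]):
    K, T = len(counts), sum(counts)
    print(
        counts,
        sqrt_count_sum(counts),   # at most sqrt(K * T)
        math.sqrt(K * T),
        per_arm_sum(counts),      # at most 2 * sqrt(K * T)
        2 * math.sqrt(K * T),
        2 * K * math.sqrt(T),
    )
```

The even split attains the Jensen bound with equality, while a skewed split, which is what a well-behaved bandit algorithm produces, falls below it.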

Comparison

Explore-then-Commit
Pull each arm $N$ times and compute empirical mean $\widehat \mu_a$
For $t=NK+1,...,T$:
- Pull $\widehat a_\star = \arg\max_a \widehat \mu_a$
Explore for $N \approx T^{2/3}$, $R(T) \lesssim T^{2/3}$

Upper Confidence Bound
For $t=1,...,T$:
- Pull $ a_t = \arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$
- Update empirical means $\widehat \mu_{a,t}$ and counts $N_{a,t}$
$R(T) \lesssim \sqrt{T}$
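
To get a feel for the gap between the two rates, the following Python sketch tabulates $T^{2/3}K^{1/3}$ against $\sqrt{KT}$. Constants and logarithmic factors are dropped, so the numbers indicate growth rates only and are not regret predictions; $K=10$ is an arbitrary choice.

```python
import math


def etc_rate(T, K):
    """Growth rate of the ETC regret bound, ignoring constants and logs."""
    return T ** (2 / 3) * K ** (1 / 3)


def ucb_rate(T, K):
    """Growth rate of the tighter UCB regret bound, ignoring constants and logs."""
    return math.sqrt(K * T)


K = 10
for T in (10**3, 10**4, 10**5, 10**6):
    ratio = etc_rate(T, K) / ucb_rate(T, K)  # equals (T / K) ** (1 / 6)
    print(T, round(etc_rate(T, K)), round(ucb_rate(T, K)), round(ratio, 2))
```

The ratio grows like $(T/K)^{1/6}$, so the advantage of UCB widens slowly as the horizon increases.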

Preview: Contextual Bandits

Example: online advertising

"Arms" are different job ads: Journalism, Programming

But consider different users: CS Major, English Major

Preview: Contextual Bandits

Example: online shopping

"Arms" are various products

But what about search queries, browsing history, items in cart?

Preview: Contextual Bandits

Example: social media feeds

"Arms" are various posts: images, videos

Personalized to each user based on demographics, behavioral data, etc

Preview: Contextual Bandits

The best action will depend on the context
- e.g. major, browsing history, demographics
Thus we need a policy for mapping context to action

Recap

PSet, PA released

Explore-then-Commit
Upper Confidence Bound

Next lecture: Policies & contextual bandits

Sp24 CS 4/5789: Lecture 20

By Sarah Dean
