CS 4/5789: Introduction to Reinforcement Learning
Lecture 20
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Linear Contextual Bandits
2. Interactive Demo
3. LinUCB Algorithm
Announcements
My office hours today are cancelled
Prelim corrections due tomorrow - please list collaborators
5789 Paper Review Assignment (weekly pace suggested)
HW 3 released tonight, due in 2 weeks
Final exam Monday 5/16 at 7pm


Multi-Armed Bandit
A simplified setting for studying exploration
- Pull "arms" \(a\in \mathcal A =\{1,\dots,K\}\), get noisy reward \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
- Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^* - \mu_{a_t}]$$
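As a sanity check on the definition, here is a minimal sketch (not from the lecture; the Gaussian arms, means, and names are all illustrative) that simulates a \(K\)-armed bandit and accumulates the regret of uniformly random play:

```python
import numpy as np

# Minimal sketch: a K-armed Gaussian bandit and the regret of random play.
# The means mu and the noise level are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
K, T = 5, 1000
mu = rng.uniform(0, 1, size=K)      # true means mu_a
mu_star = mu.max()                  # mu* = max_a mu_a

regret = 0.0
for t in range(T):
    a = rng.integers(K)             # uniformly random arm a_t
    r = rng.normal(mu[a], 0.1)      # noisy reward r_t ~ r(a_t)
    regret += mu_star - mu[a]       # per-step expected regret mu* - mu_{a_t}

print(f"regret of random play over T={T}: {regret:.1f}")
```

Random play incurs regret linear in \(T\); the algorithms recapped below aim for sublinear growth.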
MAB Recap
Explore-then-Commit
- Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
- For \(t=NK+1,...,T\):
Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
Upper Confidence Bound
For \(t=1,...,T\):
- Pull \( a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}\)
- Update empirical means \(\widehat \mu_t^a\) and counts \(N_t^a\)
Explore-then-Commit: setting exploration \(N \approx T^{2/3}\) gives \(R(T) \lesssim T^{2/3}\)
UCB: \(R(T) \lesssim \sqrt{T}\)
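A sketch of the UCB rule above, assuming bounded reward noise so the bonus \(\sqrt{C/N_t^a}\) is valid; the choice \(C = 2\log T\) and the names here are illustrative, the slide keeps a generic constant \(C\):

```python
import numpy as np

def ucb(pull, K, T, C):
    """Run UCB for T rounds; `pull(a)` returns a noisy reward for arm a."""
    means = np.zeros(K)    # empirical means  mu_hat_t^a
    counts = np.zeros(K)   # pull counts      N_t^a
    for t in range(T):
        if t < K:
            a = t          # pull each arm once so every N_t^a > 0
        else:
            a = int(np.argmax(means + np.sqrt(C / counts)))  # optimism
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # running-mean update
    return means, counts

# Illustrative usage with Gaussian arms (hypothetical means):
rng = np.random.default_rng(0)
true_mu = np.array([0.2, 0.5, 0.7])
T = 5000
means, counts = ucb(lambda a: rng.normal(true_mu[a], 0.1),
                    K=3, T=T, C=2 * np.log(T))
print(counts)  # pulls should concentrate on the best arm
```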

Contextual Bandit
A (less) simplified setting for studying exploration
- ex: a machine's make and model affect rewards, so the context encodes these features, \(x = (\text{make}, \text{model}, \dots)\)
- See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\), and get noisy reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\)
- Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
Explore-then-Commit
- Pull each arm \(N\) times and use supervised learning to estimate $$\widehat \mu_a(x) = \arg\min_{\mu\in\mathcal M} \sum_{i=1}^N (\mu(x_i^a) - r_i^a)^2$$
- For \(t=NK+1,...,T\):
Observe context \(x_t\)
Pull \(a_t = \widehat \pi(x_t) = \arg\max_a \widehat \mu_a(x_t)\)
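A sketch of this contextual explore-then-commit scheme, assuming the model class \(\mathcal M\) is linear, \(\mu_a(x)=\theta_a^\top x\), so the regression step is ordinary least squares; the dimensions, noise level, and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N, T = 3, 4, 200, 5000
theta = rng.normal(size=(K, d))               # hypothetical true parameters

def reward(x, a):
    return theta[a] @ x + 0.1 * rng.normal()  # E[r(x, a)] = theta_a^T x

# Explore: pull each arm N times on fresh contexts, then fit mu_hat_a by
# least squares over the linear model class M = {x -> w^T x}.
theta_hat = np.zeros((K, d))
for a in range(K):
    X = rng.normal(size=(N, d))               # contexts x_i^a ~ D
    r = np.array([reward(x, a) for x in X])   # rewards r_i^a
    theta_hat[a], *_ = np.linalg.lstsq(X, r, rcond=None)

# Commit: play the greedy policy a_t = pi_hat(x_t) = argmax_a mu_hat_a(x_t).
for t in range(N * K, T):
    x_t = rng.normal(size=d)
    a_t = int(np.argmax(theta_hat @ x_t))
```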
Contextual Bandit
Set exploration \(N \approx T^{2/3}\),
we showed \(R(T) \lesssim T^{2/3}\) using guarantees on the average prediction error \(\mathbb E_{x\sim \mathcal D}[|\widehat \mu_a(x) - \mu_a(x)|]\)
For context-dependent confidence bounds, we need to understand
\(\mathbb E[|\widehat \mu_a(x) - \mu_a(x)|\mid x]\)
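To preview where this leads (the LinUCB construction later in the lecture): when \(\mathcal M\) is linear, the least-squares error at a specific context \(x\) scales with the ellipsoidal norm \(\sqrt{x^\top A^{-1} x}\), where \(A=\sum_i x_i x_i^\top\) is the Gram matrix of observed contexts. A sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(n, d))      # contexts observed for one arm
X[:, 2] *= 0.01                  # third coordinate almost never varies
A = X.T @ X + 1e-6 * np.eye(d)   # (regularized) Gram matrix sum_i x_i x_i^T

def width(x):
    # context-dependent confidence width  sqrt(x^T A^{-1} x)
    return float(np.sqrt(x @ np.linalg.solve(A, x)))

e1 = np.array([1.0, 0.0, 0.0])   # well-explored direction: small width
e3 = np.array([0.0, 0.0, 1.0])   # barely explored direction: large width
print(width(e1), width(e3))
```

Directions well covered by past contexts get a small confidence width, while a context pointing in a poorly explored direction gets a large one; this is exactly the context-dependent bound \(\mathbb E[|\widehat \mu_a(x) - \mu_a(x)|\mid x]\) we are after.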
Agenda
0. Announcements & Recap
1. Linear Contextual Bandits
2. Interactive Demo
3. LinUCB Algorithm