## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 19

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Motivation: Context

2. Setting: Contextual Bandits

3. Naive Approach

4. Function Approximation

## Announcements

Prelim corrections due 4/12

5789 Paper Review Assignment (weekly pace suggested)

## Recap: Multi-Armed Bandit

A simplified setting for studying exploration

• Pull "arm" $$a_t$$ and receive reward $$r_t\sim r(a_t)$$ with $$\mathbb E[r(a)] = \mu_a$$
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \sum_{t=1}^T \left(\mu^* - \mathbb E[\mu_{a_t}]\right)$$

Explore-then-Commit

1. Pull each of the $$K$$ arms $$N$$ times and compute empirical means $$\widehat \mu_a$$
2. For $$t=NK+1,...,T$$:
Pull $$\widehat a^* = \arg\max_a \widehat \mu_a$$
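The two steps above can be sketched in Python (a minimal sketch; the `arms` interface, where `arms[a]()` samples a reward for arm `a`, is an assumption for illustration):

```python
def explore_then_commit(arms, N, T):
    """Explore-then-Commit: pull each of the K arms N times,
    then commit to the arm with the highest empirical mean."""
    K = len(arms)
    pulls, rewards = [], []
    # Exploration phase: N pulls of each arm, in order
    for a in range(K):
        for _ in range(N):
            pulls.append(a)
            rewards.append(arms[a]())
    # Empirical mean mu_hat_a of each arm from its N exploration pulls
    mu_hat = [sum(rewards[a * N:(a + 1) * N]) / N for a in range(K)]
    a_star = mu_hat.index(max(mu_hat))  # \hat a^* = argmax_a \hat mu_a
    # Commit phase: pull \hat a^* for the remaining T - NK steps
    for _ in range(N * K, T):
        pulls.append(a_star)
        rewards.append(arms[a_star]())
    return pulls, rewards
```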

Upper Confidence Bound

For $$t=1,...,T$$:

• Pull $$a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}$$
• Update empirical means $$\widehat \mu_t^a$$ and counts $$N_t^a$$
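UCB can be sketched similarly (a sketch, not the lecture's reference implementation; it uses the slide's bonus $$\sqrt{C/N_t^a}$$ and an incremental mean update, and pulls each arm once first so the bonus is well-defined):

```python
import math

def ucb(arms, T, C=2.0):
    """Upper Confidence Bound: pull the arm maximizing
    empirical mean + sqrt(C / N_t^a), then update mean and count."""
    K = len(arms)
    counts = [0] * K    # N_t^a: number of pulls of each arm
    means = [0.0] * K   # \hat mu_t^a: empirical mean reward of each arm
    history = []
    for t in range(T):
        if t < K:
            a = t  # initialize: pull each arm once
        else:
            # a_t = argmax_a mean + exploration bonus
            a = max(range(K), key=lambda i: means[i] + math.sqrt(C / counts[i]))
        r = arms[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        history.append(a)
    return history, means
```

(The constant-bonus form follows the slide; the classical UCB1 bonus is $$\sqrt{C\log t / N_t^a}$$.)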

## Recap

Explore-then-Commit: with exploration budget $$N \approx T^{2/3}$$ per arm,

$$R(T) \lesssim T^{2/3}$$

Upper Confidence Bound:

$$R(T) \lesssim \sqrt{T}$$
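A back-of-envelope check on the choice $$N \approx T^{2/3}$$ (a sketch, assuming rewards bounded in $$[0,1]$$ and suppressing constants and log factors):

```latex
% Exploration phase: NK pulls, each costing at most 1 in regret.
% Commit phase: Hoeffding gives |\hat\mu_a - \mu_a| \lesssim \sqrt{1/N},
% so each pull of \hat a^* costs at most O(\sqrt{1/N}) in regret.
R(T) \;\lesssim\; NK \;+\; T\sqrt{\tfrac{1}{N}}
% Balancing the two terms: NK = T N^{-1/2} \;\Longrightarrow\; N = (T/K)^{2/3}
\implies R(T) \;\lesssim\; K^{1/3}\, T^{2/3}
```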

## Motivation: Context

"Arms" could be articles on different topics, e.g. Journalism vs. Programming.

But consider different users: a CS Major and an English Major may prefer different arms.

Example: online shopping


"Arms" are various products

But what about search queries, browsing history, items in cart?

Example: social media feeds


"Arms" are various posts: images, videos.

Personalized to each user based on demographics, behavioral data, etc.
