## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 19

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Motivation: Context

2. Setting: Contextual Bandits

3. Naive Approach

4. Function Approximation

## Announcements

Prelim corrections due 4/12

5789 Paper Review Assignment (weekly pace suggested)

## Recap: Multi-Armed Bandit

A simplified setting for studying exploration

• Pull "arm" $$a_t$$ and receive reward $$r_t\sim r(a_t)$$ with $$\mathbb E[r(a)] = \mu_a$$
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \sum_{t=1}^T \left(\mu^* - \mathbb E[\mu_{a_t}]\right)$$

Explore-then-Commit

1. Pull each of the $$K$$ arms $$N$$ times and compute empirical means $$\widehat \mu_a$$
2. For $$t=NK+1,...,T$$:
Pull $$\widehat a^* = \arg\max_a \widehat \mu_a$$
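The two steps above can be sketched in Python (a minimal sketch; the `arms` interface, where `arms[a]()` samples a reward for arm `a`, is an assumption for illustration):

```python
def explore_then_commit(arms, N, T):
    """Explore-then-Commit: pull each of the K arms N times,
    then commit to the arm with the highest empirical mean."""
    K = len(arms)
    pulls, rewards = [], []
    # Exploration phase: N pulls of each arm, in order
    for a in range(K):
        for _ in range(N):
            pulls.append(a)
            rewards.append(arms[a]())
    # Empirical mean mu_hat_a of each arm from its N exploration pulls
    mu_hat = [sum(rewards[a * N:(a + 1) * N]) / N for a in range(K)]
    a_star = mu_hat.index(max(mu_hat))  # \hat a^* = argmax_a \hat mu_a
    # Commit phase: pull \hat a^* for the remaining T - NK steps
    for _ in range(N * K, T):
        pulls.append(a_star)
        rewards.append(arms[a_star]())
    return pulls, rewards
```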

Upper Confidence Bound

For $$t=1,...,T$$:

• Pull $$a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}$$
• Update empirical means $$\widehat \mu_t^a$$ and counts $$N_t^a$$
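UCB can be sketched similarly (a sketch, not the lecture's reference implementation; it uses the slide's bonus $$\sqrt{C/N_t^a}$$ and an incremental mean update, and pulls each arm once first so the bonus is well-defined):

```python
import math

def ucb(arms, T, C=2.0):
    """Upper Confidence Bound: pull the arm maximizing
    empirical mean + sqrt(C / N_t^a), then update mean and count."""
    K = len(arms)
    counts = [0] * K    # N_t^a: number of pulls of each arm
    means = [0.0] * K   # \hat mu_t^a: empirical mean reward of each arm
    history = []
    for t in range(T):
        if t < K:
            a = t  # initialize: pull each arm once
        else:
            # a_t = argmax_a mean + exploration bonus
            a = max(range(K), key=lambda i: means[i] + math.sqrt(C / counts[i]))
        r = arms[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        history.append(a)
    return history, means
```

(The constant-bonus form follows the slide; the classical UCB1 bonus is $$\sqrt{C\log t / N_t^a}$$.)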

## Recap

Explore-then-Commit: with exploration budget $$N \approx T^{2/3}$$ per arm,

$$R(T) \lesssim T^{2/3}$$

Upper Confidence Bound:

$$R(T) \lesssim \sqrt{T}$$
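A back-of-envelope check on the choice $$N \approx T^{2/3}$$ (a sketch, assuming rewards bounded in $$[0,1]$$ and suppressing constants and log factors):

```latex
% Exploration phase: NK pulls, each costing at most 1 in regret.
% Commit phase: Hoeffding gives |\hat\mu_a - \mu_a| \lesssim \sqrt{1/N},
% so each pull of \hat a^* costs at most O(\sqrt{1/N}) in regret.
R(T) \;\lesssim\; NK \;+\; T\sqrt{\tfrac{1}{N}}
% Balancing the two terms: NK = T N^{-1/2} \;\Longrightarrow\; N = (T/K)^{2/3}
\implies R(T) \;\lesssim\; K^{1/3}\, T^{2/3}
```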

## Motivation: Context

"Arms" could be articles on different topics, e.g. Journalism vs. Programming.

But consider different users: a CS Major and an English Major may prefer different arms.

Example: online shopping


"Arms" are various products

But what about search queries, browsing history, items in cart?

Example: social media feeds


"Arms" are various posts: images, videos.

Personalized to each user based on demographics, behavioral data, etc.
