CS 4/5789: Introduction to Reinforcement Learning

Lecture 19

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements & Recap

1. Motivation: Context

2. Setting: Contextual Bandits

3. Naive Approach

4. Function Approximation

Announcements

 

Prelim corrections due 4/12

5789 Paper Review Assignment (weekly pace suggested)

Recap: Multi-Armed Bandit

A simplified setting for studying exploration

  • Pull "arm" \(a\) and get reward \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
  • Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t)  \right] = \sum_{t=1}^T \mu^* - \mu_{a_t}$$
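As a concrete reference point, here is a minimal simulation sketch, assuming Bernoulli-reward arms with hypothetical means `mu` (the lecture does not fix a particular reward distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical arm means mu_a; Bernoulli rewards are an assumption for illustration.
mu = np.array([0.3, 0.5, 0.7])

def pull(a):
    """Pull arm a: sample r_t ~ r(a) with E[r(a)] = mu[a]."""
    return float(rng.random() < mu[a])

def regret(arms_pulled):
    """Regret of a pull sequence: sum_t (mu^* - mu_{a_t})."""
    return float(np.sum(mu.max() - mu[arms_pulled]))
```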

Explore-then-Commit

  1. Pull each arm \(N\) times and compute empirical means \(\widehat \mu_a\)
  2. For \(t=NK+1,...,T\):
        Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
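A minimal sketch of Explore-then-Commit, reusing the hypothetical `pull` helper above (`K` is the number of arms):

```python
def explore_then_commit(N, T, K=3):
    """Pull each arm N times, then commit to the empirically best arm."""
    mu_hat = np.zeros(K)
    pulls = []
    # Exploration phase: N pulls per arm, accumulating empirical means.
    for a in range(K):
        for _ in range(N):
            mu_hat[a] += pull(a) / N
            pulls.append(a)
    # Commit phase: for t = NK+1, ..., T pull a*_hat = argmax_a mu_hat_a.
    a_star_hat = int(np.argmax(mu_hat))
    pulls += [a_star_hat] * (T - N * K)
    return pulls
```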

Upper Confidence Bound

For \(t=1,...,T\):

  • Pull \( a_t = \arg\max_a \left(\widehat \mu_t^a + \sqrt{C/N_t^a}\right)\)
  • Update empirical means \(\widehat \mu_t^a\) and counts \(N_t^a\)
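And a matching sketch of UCB; the constant `C` is left as a parameter (in the analysis it typically grows like \(\log T\)):

```python
def ucb(T, K=3, C=2.0):
    """Pull a_t = argmax_a mu_hat_t^a + sqrt(C / N_t^a)."""
    counts = np.zeros(K)  # pull counts N_t^a
    means = np.zeros(K)   # empirical means mu_hat_t^a
    pulls = []
    for t in range(T):
        if t < K:
            a = t  # initialization: pull each arm once
        else:
            a = int(np.argmax(means + np.sqrt(C / counts)))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        pulls.append(a)
    return pulls
```

Comparing `regret(explore_then_commit(N, T))` with `regret(ucb(T))` for large \(T\) illustrates the gap between the two regret bounds below.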


With exploration \(N \approx T^{2/3}\), Explore-then-Commit achieves \(R(T) \lesssim T^{2/3}\), while UCB achieves \(R(T) \lesssim \sqrt{T}\).
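These rates follow from a standard balancing argument (a sketch, suppressing constants and logarithmic factors): after \(N\) pulls per arm, Hoeffding's inequality bounds the error of each \(\widehat \mu_a\) by roughly \(\sqrt{1/N}\), so for Explore-then-Commit

$$ R(T) \lesssim \underbrace{NK}_{\text{exploration cost}} + \underbrace{T\sqrt{1/N}}_{\text{commit error}}, $$

and the two terms balance at \(N \approx T^{2/3}\), giving \(R(T) \lesssim T^{2/3}\).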

Motivation: Context

Example: online advertising

"Arms" are different job ads, e.g. Journalism or Programming. But consider different users: a CS major and an English major will likely respond to these ads very differently.

Example: online shopping

"Arms" are various products. But what about search queries, browsing history, and items in the cart?

Example: social media feeds

"Arms" are various posts: images, videos. These are personalized to each user based on demographics, behavioral data, etc.
