## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 19

Prof. Sarah Dean

MW 2:45-4pm

110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Motivation: Context

2. Setting: Contextual Bandits

3. Naive Approach

4. Function Approximation

## Announcements

Prelim corrections due 4/12

5789 Paper Review Assignment (weekly pace *suggested*)

## Multi-Armed Bandit

A simplified setting for studying exploration

## Recap

- Pull "arm" \(a_t\) and get reward \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
- Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \mathbb E\left[\sum_{t=1}^T \mu^* - \mu_{a_t}\right]$$

**Explore-then-Commit**

- Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
- For \(t=NK+1,...,T\):

Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
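The Explore-then-Commit steps above can be sketched in Python. This is a minimal illustration, not the lecture's code: the Bernoulli arms and their means are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit for illustration: K arms with unknown means mu_a
true_means = np.array([0.3, 0.5, 0.7])
K = len(true_means)

def pull(a):
    """Sample a reward r ~ r(a) with E[r(a)] = mu_a."""
    return rng.binomial(1, true_means[a])

def explore_then_commit(N, T):
    rewards = []
    mu_hat = np.zeros(K)
    # Explore: pull each arm N times and compute empirical means
    for a in range(K):
        samples = [pull(a) for _ in range(N)]
        mu_hat[a] = np.mean(samples)
        rewards.extend(samples)
    # Commit: pull the empirically best arm for t = NK+1, ..., T
    a_star_hat = int(np.argmax(mu_hat))
    for _ in range(T - N * K):
        rewards.append(pull(a_star_hat))
    return a_star_hat, rewards

a_hat, rs = explore_then_commit(N=50, T=1000)
```

The single hyperparameter \(N\) controls the explore/exploit trade-off, which is what the regret analysis later tunes.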

**Upper Confidence Bound**

For \(t=1,...,T\):

- Pull \( a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}\)
- Update empirical means \(\widehat \mu_t^a\) and counts \(N_t^a\)
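The UCB update loop can be sketched similarly. Again a hedged illustration with assumed Bernoulli arms; the bonus \(\sqrt{C/N_t^a}\) follows the slide's form, with \(C\) left as a tunable constant.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Bernoulli arms (assumption for illustration)
true_means = np.array([0.3, 0.5, 0.7])
K = len(true_means)

def ucb(T, C=2.0):
    counts = np.zeros(K)   # N_t^a: number of pulls of each arm
    sums = np.zeros(K)     # running reward totals, for empirical means
    for t in range(T):
        if t < K:
            a = t  # initialize by pulling each arm once
        else:
            mu_hat = sums / counts
            bonus = np.sqrt(C / counts)          # optimism bonus sqrt(C / N_t^a)
            a = int(np.argmax(mu_hat + bonus))   # pull the optimistic arm
        r = rng.binomial(1, true_means[a])
        counts[a] += 1       # update counts N_t^a
        sums[a] += r         # update empirical means via running sums
    return counts

counts = ucb(T=2000)
```

Unlike Explore-then-Commit, exploration here is adaptive: the bonus shrinks for frequently pulled arms, so bad arms stop being pulled without a fixed exploration phase.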

- Explore-then-Commit with exploration \(N \approx T^{2/3}\) achieves \(R(T) \lesssim T^{2/3}\)
- Upper Confidence Bound achieves \(R(T) \lesssim \sqrt{T}\)
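The \(T^{2/3}\) rate for Explore-then-Commit comes from balancing exploration cost against commit error; a rough sketch, with constants and log factors suppressed:

$$R(T) \;\lesssim\; \underbrace{NK}_{\text{exploration}} \;+\; \underbrace{T\sqrt{\tfrac{1}{N}}}_{\text{error after commit}}, \qquad N \approx T^{2/3} \;\Rightarrow\; R(T) \lesssim T^{2/3}.$$

The first term counts the pulls of suboptimal arms during exploration; the second uses the \(O(\sqrt{1/N})\) deviation of each empirical mean \(\widehat\mu_a\) to bound the per-step loss of committing to \(\widehat a^*\).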

## Motivation: Context

Example: online advertising

- "Arms" are different job ads: Journalism, Programming
- But consider different users: a CS Major and an English Major may prefer different ads

## Motivation: Context

Example: online shopping

- "Arms" are various products
- But what about search queries, browsing history, items in cart?

## Motivation: Context

Example: social media feeds

- "Arms" are various posts: images, videos
- Personalized to each user based on demographics, behavioral data, etc.
