## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 19

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Motivation: Context

2. Setting: Contextual Bandits

3. Naive Approach

4. Function Approximation

## Announcements

Prelim corrections due 4/12

5789 Paper Review Assignment (weekly pace suggested)

## Recap: Multi-Armed Bandit

A simplified setting for studying exploration

• Pull "arm" $$a_t$$ and receive reward $$r_t\sim r(a_t)$$ with $$\mathbb E[r(a)] = \mu_a$$
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \sum_{t=1}^T \left(\mu^* - \mathbb E[\mu_{a_t}]\right)$$

Explore-then-Commit

1. Pull each of the $$K$$ arms $$N$$ times and compute empirical means $$\widehat \mu_a$$
2. For $$t=NK+1,...,T$$:
Pull $$\widehat a^* = \arg\max_a \widehat \mu_a$$
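The two steps above can be sketched in Python (a minimal sketch; the `arms` interface, where `arms[a]()` samples a reward for arm `a`, is an assumption for illustration):

```python
def explore_then_commit(arms, N, T):
    """Explore-then-Commit: pull each of the K arms N times,
    then commit to the arm with the highest empirical mean."""
    K = len(arms)
    pulls, rewards = [], []
    # Exploration phase: N pulls of each arm, in order
    for a in range(K):
        for _ in range(N):
            pulls.append(a)
            rewards.append(arms[a]())
    # Empirical mean mu_hat_a of each arm from its N exploration pulls
    mu_hat = [sum(rewards[a * N:(a + 1) * N]) / N for a in range(K)]
    a_star = mu_hat.index(max(mu_hat))  # \hat a^* = argmax_a \hat mu_a
    # Commit phase: pull \hat a^* for the remaining T - NK steps
    for _ in range(N * K, T):
        pulls.append(a_star)
        rewards.append(arms[a_star]())
    return pulls, rewards
```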

Upper Confidence Bound

For $$t=1,...,T$$:

• Pull $$a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}$$
• Update empirical means $$\widehat \mu_t^a$$ and counts $$N_t^a$$
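UCB can be sketched similarly (a sketch, not the lecture's reference implementation; it uses the slide's bonus $$\sqrt{C/N_t^a}$$ and an incremental mean update, and pulls each arm once first so the bonus is well-defined):

```python
import math

def ucb(arms, T, C=2.0):
    """Upper Confidence Bound: pull the arm maximizing
    empirical mean + sqrt(C / N_t^a), then update mean and count."""
    K = len(arms)
    counts = [0] * K    # N_t^a: number of pulls of each arm
    means = [0.0] * K   # \hat mu_t^a: empirical mean reward of each arm
    history = []
    for t in range(T):
        if t < K:
            a = t  # initialize: pull each arm once
        else:
            # a_t = argmax_a mean + exploration bonus
            a = max(range(K), key=lambda i: means[i] + math.sqrt(C / counts[i]))
        r = arms[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        history.append(a)
    return history, means
```

(The constant-bonus form follows the slide; the classical UCB1 bonus is $$\sqrt{C\log t / N_t^a}$$.)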

## Recap

Explore-then-Commit: with exploration budget $$N \approx T^{2/3}$$ per arm,

$$R(T) \lesssim T^{2/3}$$

Upper Confidence Bound:

$$R(T) \lesssim \sqrt{T}$$
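A back-of-envelope check on the choice $$N \approx T^{2/3}$$ (a sketch, assuming rewards bounded in $$[0,1]$$ and suppressing constants and log factors):

```latex
% Exploration phase: NK pulls, each costing at most 1 in regret.
% Commit phase: Hoeffding gives |\hat\mu_a - \mu_a| \lesssim \sqrt{1/N},
% so each pull of \hat a^* costs at most O(\sqrt{1/N}) in regret.
R(T) \;\lesssim\; NK \;+\; T\sqrt{\tfrac{1}{N}}
% Balancing the two terms: NK = T N^{-1/2} \;\Longrightarrow\; N = (T/K)^{2/3}
\implies R(T) \;\lesssim\; K^{1/3}\, T^{2/3}
```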

## Motivation: Context

"Arms" could be articles on different topics, e.g. Journalism vs. Programming.

But consider different users: a CS Major and an English Major may prefer different arms.

Example: online shopping


"Arms" are various products

But what about search queries, browsing history, items in cart?

Example: social media feeds


"Arms" are various posts: images, videos.

Personalized to each user based on demographics, behavioral data, etc.
