## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 20

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Linear Contextual Bandits

2. Interactive Demo

3. LinUCB Algorithm

## Announcements

My office hours today are cancelled

Prelim corrections due tomorrow - please list collaborators

5789 Paper Review Assignment (weekly pace suggested)

HW 3 released tonight, due in 2 weeks

Final exam Monday 5/16 at 7pm

## Multi-Armed Bandit

A simplified setting for studying exploration

• Pull "arms" $$a\in \mathcal A =\{1,\dots,K\}$$, get noisy reward $$r_t\sim r(a_t)$$ with $$\mathbb E[r(a)] = \mu_a$$
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^* - \mu_{a_t}]$$
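The setting above can be simulated in a few lines. This is a minimal sketch: the arm means `mu`, the noise scale, and the function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: arm a gives reward mu[a] plus Gaussian noise.
mu = np.array([0.2, 0.5, 0.8])  # unknown true means (illustrative values)

def pull(a):
    """Noisy reward r ~ r(a) with E[r(a)] = mu[a]."""
    return mu[a] + rng.normal(0.0, 0.1)

def regret(actions):
    """R(T) = sum over t of (mu* - mu_{a_t})."""
    return float(np.sum(mu.max() - mu[actions]))
```

Pulling the best arm every round gives zero regret; any suboptimal pull adds its gap $$\mu^* - \mu_a$$.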

## MAB Recap

Explore-then-Commit

1. Pull each arm $$N$$ times and compute empirical mean $$\widehat \mu_a$$
2. For $$t=NK+1,...,T$$:
Pull $$\widehat a^* = \arg\max_a \widehat \mu_a$$
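The two phases above can be sketched as follows; the function name and the `pull(a)` reward interface are illustrative assumptions.

```python
import numpy as np

def explore_then_commit(pull, K, N, T):
    """Explore-then-Commit: pull each arm N times, compute empirical
    means, then commit to the empirical best arm for rounds NK+1..T."""
    means = np.zeros(K)
    actions = []
    # Phase 1: pull each arm N times and record its empirical mean.
    for a in range(K):
        means[a] = np.mean([pull(a) for _ in range(N)])
        actions += [a] * N
    # Phase 2: commit to the empirical best arm for the rest.
    a_hat = int(np.argmax(means))
    actions += [a_hat] * (T - N * K)
    return actions
```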

Upper Confidence Bound

For $$t=1,...,T$$:

• Pull $$a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}$$
• Update empirical means $$\widehat \mu_t^a$$ and counts $$N_t^a$$
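A minimal sketch of the UCB loop above, using the bonus $$\sqrt{C/N_t^a}$$ as written on the slide; the value of $$C$$ and the initialization (one pull per arm, so every count is positive) are illustrative choices.

```python
import numpy as np

def ucb(pull, K, T, C=2.0):
    """UCB: pull the arm maximizing empirical mean + sqrt(C / N_a)."""
    counts = np.zeros(K)
    means = np.zeros(K)
    actions = []
    for t in range(T):
        if t < K:
            a = t  # initialization: pull each arm once
        else:
            a = int(np.argmax(means + np.sqrt(C / counts)))
        r = pull(a)
        counts[a] += 1
        # Incremental update of the empirical mean of arm a.
        means[a] += (r - means[a]) / counts[a]
        actions.append(a)
    return actions
```

As the count $$N_t^a$$ of an arm grows, its bonus shrinks, so under-explored arms are periodically revisited.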

Set exploration $$N \approx T^{2/3}$$; then Explore-then-Commit achieves

$$R(T) \lesssim T^{2/3}$$

while UCB achieves

$$R(T) \lesssim \sqrt{T}$$

## Contextual Bandit

A (less) simplified setting for studying exploration

ex - a machine's make and model affect rewards, so context $$x = (\text{make}, \text{model}, \dots)$$

• See context $$x\sim \mathcal D$$, pull "arm" $$a$$ and get reward $$r_t\sim r(x_t, a_t)$$ with $$\mathbb E[r(x, a)] = \mu_a(x)$$
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$

Explore-then-Commit

1. Pull each arm $$N$$ times and use supervised learning to estimate $$\widehat \mu_a(x) = \arg\min_{\mu\in\mathcal M} \sum_{i=1}^N (\mu(x_i^a) - r_i^a)^2$$
2. For $$t=NK+1,...,T$$:
Observe context $$x_t$$
Pull $$a_t = \widehat \pi(x_t) = \arg\max_a \widehat \mu_a(x_t)$$
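A sketch of the contextual version above, assuming a linear model class $$\mathcal M$$ so the supervised-learning step reduces to per-arm least squares; the interface (`contexts` as a stream of vectors, `pull(x, a)` as the reward) is an illustrative assumption.

```python
import numpy as np

def contextual_etc(contexts, pull, K, N):
    """Contextual Explore-then-Commit: collect N samples per arm, fit a
    least-squares model mu_hat_a(x) = x @ theta_a for each arm, then
    return the greedy policy pi_hat(x) = argmax_a x @ theta_a."""
    X = {a: [] for a in range(K)}
    R = {a: [] for a in range(K)}
    # Exploration: pull each arm N times on fresh contexts.
    for a in range(K):
        for _ in range(N):
            x = next(contexts)
            X[a].append(x)
            R[a].append(pull(x, a))
    # Supervised-learning step: least squares separately per arm.
    theta = np.stack([
        np.linalg.lstsq(np.array(X[a]), np.array(R[a]), rcond=None)[0]
        for a in range(K)
    ])
    def pi_hat(x):
        return int(np.argmax(theta @ x))
    return pi_hat
```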

## Contextual Bandit

Set exploration $$N \approx T^{2/3}$$,

we showed $$R(T) \lesssim T^{2/3}$$ using prediction error guarantees $$\mathbb E_{x\sim \mathcal D}[|\widehat \mu_a(x) - \mu_a(x)|]$$


For context-dependent confidence bounds, we need to understand

$$\mathbb E[|\widehat \mu_a(x) - \mu_a(x)|\mid x]$$
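As a preview of where the agenda's LinUCB item goes: if rewards are linear in the context, $$\mu_a(x) = x^\top \theta_a$$, and $$\widehat\theta_a$$ is the least-squares estimate, then a Cauchy–Schwarz argument in the norm induced by the data gives a context-dependent error bound (a sketch under this linear assumption):

$$|\widehat \mu_a(x) - \mu_a(x)| = |x^\top(\widehat\theta_a - \theta_a)| \le \|x\|_{A_a^{-1}}\,\|\widehat\theta_a - \theta_a\|_{A_a}, \qquad A_a = \sum_i x_i^a (x_i^a)^\top$$

so the confidence width depends on $$x$$ through $$\|x\|_{A_a^{-1}}$$: it is small in directions where arm $$a$$ has already seen many contexts.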

## Agenda

0. Announcements & Recap

1. Linear Contextual Bandits

2. Interactive Demo

3. LinUCB Algorithm
