CS 4/5789: Introduction to Reinforcement Learning

Lecture 20

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements & Recap

1. Linear Contextual Bandits

2. Interactive Demo

3. LinUCB Algorithm

Announcements

 

My office hours today are cancelled

 

Prelim corrections due tomorrow - please list collaborators

5789 Paper Review Assignment (weekly pace suggested)

HW 3 released tonight, due in 2 weeks

 

Final exam Monday 5/16 at 7pm

Multi-Armed Bandit

A simplified setting for studying exploration

  • Pull "arms" \(a\in \mathcal A =\{1,\dots,K\}\), get noisy reward \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
  • Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t)  \right] = \sum_{t=1}^T \mathbb E[\mu^* - \mu_{a_t}]$$
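A minimal NumPy sketch of this protocol; the arm means `mu`, the Gaussian noise scale, and the uniformly random arm sequence are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])            # unknown mean rewards mu_a (assumed)
K, T = len(mu), 1000

def pull(a):
    return mu[a] + rng.normal(scale=0.1)  # noisy reward r_t ~ r(a_t)

# Expected regret of a (here: uniformly random) arm sequence
arms = rng.integers(K, size=T)
regret = np.sum(mu.max() - mu[arms])      # sum_t  mu* - mu_{a_t}
print(regret)
```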

MAB Recap

Explore-then-Commit

  1. Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
  2. For \(t=NK+1,...,T\):
        Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
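A sketch of Explore-then-Commit under the same assumed Gaussian-noise setup (the means and \(N\) are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])            # unknown mean rewards (assumed)
K, T, N = len(mu), 1000, 100              # N pulls per arm in the explore phase

def pull(a):
    return mu[a] + rng.normal(scale=0.1)

# 1. Explore: pull each arm N times, compute empirical means mu_hat_a
mu_hat = np.array([np.mean([pull(a) for _ in range(N)]) for a in range(K)])

# 2. Commit: pull the empirically best arm for the remaining rounds
a_star_hat = np.argmax(mu_hat)
for t in range(N * K, T):
    r_t = pull(a_star_hat)
```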

Upper Confidence Bound

For \(t=1,...,T\):

  • Pull \( a_t = \arg\max_a \left(\widehat \mu_t^a + \sqrt{C/N_t^a}\right)\)
  • Update empirical means \(\widehat \mu_t^a\) and counts \(N_t^a\)
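A sketch of UCB in the same assumed setup; the constant \(C\) sets the confidence width and would come from a concentration bound:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])        # unknown mean rewards (assumed)
K, T, C = len(mu), 1000, 2.0          # C sets the confidence width

def pull(a):
    return mu[a] + rng.normal(scale=0.1)

sums = np.zeros(K)                    # running reward sums
counts = np.ones(K)                   # N_t^a; initialize with one pull of each arm
for a in range(K):
    sums[a] = pull(a)

for t in range(K, T):
    ucb = sums / counts + np.sqrt(C / counts)  # mu_hat_t^a + sqrt(C / N_t^a)
    a_t = np.argmax(ucb)                       # optimism in the face of uncertainty
    sums[a_t] += pull(a_t)                     # update empirical means and counts
    counts[a_t] += 1
```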

Explore-then-Commit: with exploration \(N \approx T^{2/3}\), \(R(T) \lesssim T^{2/3}\)

Upper Confidence Bound: \(R(T) \lesssim \sqrt{T}\)

Contextual Bandit

A (less) simplified setting for studying exploration

ex - a machine's make and model affect rewards, so context \(x = (\text{make}, \text{model}, \dots)\)

  • See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\) and get noisy reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\)
  • Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t)  \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
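A sketch of this interaction protocol, assuming the linear rewards \(\mu_a(x) = \theta_a^\top x\) studied in this lecture; the dimensions, context distribution, and uniform-random policy are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 3, 2, 1000
theta = rng.normal(size=(K, d))       # unknown per-arm parameters (assumed linear model)

regret = 0.0
for t in range(T):
    x_t = rng.normal(size=d)                          # context x_t ~ D
    a_t = rng.integers(K)                             # placeholder policy: uniform at random
    r_t = theta[a_t] @ x_t + rng.normal(scale=0.1)    # noisy reward r_t ~ r(x_t, a_t)
    regret += np.max(theta @ x_t) - theta[a_t] @ x_t  # mu*(x_t) - mu_{a_t}(x_t)
print(regret)
```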

Explore-then-Commit

  1. Pull each arm \(N\) times and use supervised learning to estimate $$\widehat \mu_a(x) = \arg\min_{\mu\in\mathcal M} \sum_{i=1}^N (\mu(x_i^a) - r_i^a)^2$$
  2. For \(t=NK+1,...,T\):
        Observe context \(x_t\)
        Pull \(a_t = \widehat \pi(x_t) = \arg\max_a \widehat \mu_a(x_t)\)
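A sketch of contextual Explore-then-Commit with a linear model class \(\mathcal M\), so the squared-loss fit reduces to least squares; dimensions and noise are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, N = 3, 2, 1000, 100
theta = rng.normal(size=(K, d))           # unknown parameters (assumed linear model)

def reward(x, a):
    return theta[a] @ x + rng.normal(scale=0.1)

# 1. Explore: pull each arm N times, fit mu_hat_a by least squares
theta_hat = np.zeros((K, d))
for a in range(K):
    X = rng.normal(size=(N, d))           # contexts x_i^a ~ D
    r = np.array([reward(x, a) for x in X])
    theta_hat[a], *_ = np.linalg.lstsq(X, r, rcond=None)

# 2. Commit: act greedily with respect to the fitted means
for t in range(N * K, T):
    x_t = rng.normal(size=d)
    a_t = np.argmax(theta_hat @ x_t)      # pi_hat(x_t) = argmax_a mu_hat_a(x_t)
    r_t = reward(x_t, a_t)
```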

Contextual Bandit

Setting exploration \(N \approx T^{2/3}\), we showed \(R(T) \lesssim T^{2/3}\) using prediction error guarantees on \(\mathbb E_{x\sim \mathcal D}[|\widehat \mu_a(x) - \mu_a(x)|]\)


For context-dependent confidence bounds, we need to understand

\(\mathbb E[|\widehat \mu_a(x) - \mu_a(x)|\mid x]\)
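Previewing LinUCB: for a linear model fit by ridge regression, the error at a particular \(x\) is controlled by the width \(\sqrt{x^\top A^{-1} x}\) with \(A = X^\top X + \lambda I\), which is large for contexts unlike those seen so far. A minimal sketch (dimensions, noise level, and \(\lambda\) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 3, 50, 1.0
theta = rng.normal(size=d)                 # true parameter (assumed linear model)

X = rng.normal(size=(n, d))                # contexts observed for this arm
r = X @ theta + rng.normal(scale=0.1, size=n)

A = X.T @ X + lam * np.eye(d)              # regularized design matrix
theta_hat = np.linalg.solve(A, X.T @ r)    # ridge estimate

x = rng.normal(size=d)                     # a new context
width = np.sqrt(x @ np.linalg.solve(A, x)) # ||x||_{A^{-1}}: context-dependent width
print(abs(x @ (theta_hat - theta)), width) # error |mu_hat(x) - mu(x)| vs. its scale (up to constants)
```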

Agenda

 

0. Announcements & Recap

1. Linear Contextual Bandits

2. Interactive Demo

3. LinUCB Algorithm
