CS 4/5789: Introduction to Reinforcement Learning

Lecture 21

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

0. Announcements & Recap

1. Model-Based RL with Exploration

2. UCB Value Iteration Algorithm

3. UCB-VI Analysis

Announcements

5789 Paper Review Assignment (weekly pace suggested)

HW 3 released Monday, due 4/25

Final exam Monday 5/16 at 7pm

Recap: Contextual Bandit

A (less) simplified setting for studying exploration

ex - machine make and model affect rewards, so the context $x$ is a vector of machine features

At each round $t$: see context $x_t\sim \mathcal D$, pull "arm" $a_t$, and get reward $r_t\sim r(x_t, a_t)$ with $\mathbb E[r(x, a)] = \mu_a(x)$; in the linear case, $\mu_a(x)=\theta_a^\top x$
Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$

LinUCB Algorithm

For $t=1,...,T$:

Observe context $x_t$
Pull $a_t = \arg\max_a (\widehat \theta^a_t)^\top x_t + \alpha \sqrt{x_t^\top (A_t^a)^{-1} x_t}$
Update $\widehat \theta^{a_t}$ and $A^{a_t}$
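
The loop above can be sketched in NumPy on a simulated linear bandit. This is a minimal sketch under assumptions not stated on the slide: the update is taken to be the usual ridge-regression one ($A^a \leftarrow A^a + x x^\top$, $b^a \leftarrow b^a + r x$, $\widehat\theta^a = (A^a)^{-1} b^a$), and `lam`, `noise_std`, `true_thetas`, and the Gaussian reward noise are hypothetical choices used only to simulate rewards and measure regret.

```python
import numpy as np

def linucb(contexts, true_thetas, alpha=1.0, lam=1.0, noise_std=0.1, seed=0):
    """Run LinUCB on a simulated linear contextual bandit.

    contexts:    (T, d) array of contexts x_t.
    true_thetas: (K, d) array, hypothetical ground truth used only to
                 simulate rewards and to compute regret.
    Returns the cumulative regret after each round, shape (T,).
    """
    rng = np.random.default_rng(seed)
    T, d = contexts.shape
    K = true_thetas.shape[0]
    # Assumed statistics per arm: A^a = lam*I + sum x x^T, b^a = sum r x.
    A = np.stack([lam * np.eye(d) for _ in range(K)])
    b = np.zeros((K, d))
    regret = np.zeros(T)
    for t in range(T):
        x = contexts[t]
        ucb = np.empty(K)
        for a in range(K):
            theta_hat = np.linalg.solve(A[a], b[a])   # ridge estimate of theta_a
            A_inv_x = np.linalg.solve(A[a], x)        # (A^a)^{-1} x
            ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv_x)
        a_t = int(np.argmax(ucb))                      # optimistic arm choice
        means = true_thetas @ x                        # mu_a(x_t) for every arm
        r = means[a_t] + noise_std * rng.standard_normal()
        A[a_t] += np.outer(x, x)                       # update only the pulled arm
        b[a_t] += r * x
        regret[t] = means.max() - means[a_t]
    return np.cumsum(regret)
```

For example, `linucb(X, thetas)` with `X` of shape `(T, d)` and `thetas` of shape `(K, d)` returns the running sum of $\mu^*(x_t) - \mu_{a_t}(x_t)$, which can be plotted against $t$ to inspect how regret grows.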

Contextual Bandit

Claim: with high probability, $\theta_a^\top x \leq (\widehat \theta^a_t)^\top x + \alpha \sqrt{x^\top (A_t^a)^{-1} x}$
Chebyshev's inequality: if $\mathbb E[u]=0$, then with probability at least $1-1/\beta^2$, $$|u| \leq \beta \sqrt{\mathbb E[u^2]}$$
Proof of Claim:
- Strategy: apply Chebyshev's inequality with $u=\theta_a^\top x - (\widehat \theta^a_t)^\top x$
- Last lecture: showed $\mathbb E[(\theta_a - \widehat \theta^a_t)^\top x]=0$
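
As a sanity check on the inequality used in the proof, the sketch below measures how often $|u| \leq \beta\sqrt{\mathbb E[u^2]}$ holds on samples and compares that with the guarantee $1-1/\beta^2$. The helper name `chebyshev_coverage` and the choice of Gaussian samples are illustrative assumptions; the second moment is replaced by its empirical average.

```python
import numpy as np

def chebyshev_coverage(samples, beta):
    """Fraction of samples with |u| <= beta * sqrt(mean(u^2))."""
    u = np.asarray(samples, dtype=float)
    rms = np.sqrt(np.mean(u ** 2))          # empirical sqrt(E[u^2])
    return np.mean(np.abs(u) <= beta * rms)

rng = np.random.default_rng(0)
u = rng.standard_normal(100_000)            # zero-mean; Gaussian only for illustration
for beta in (1.5, 2.0, 3.0):
    cov = chebyshev_coverage(u, beta)
    print(f"beta={beta}: coverage={cov:.3f}, guaranteed at least {1 - 1 / beta**2:.3f}")
```

The observed coverage is typically well above the guarantee, since Chebyshev's inequality makes no distributional assumption beyond a finite second moment.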

Recap: Tabular MBRL

Algorithm:

Query each $(s,a)$ pair $\frac{N}{SA}$ times, record sample $s'\sim P(s,a)$
Fit transition model by counting: $$\widehat P(s'\mid s,a) = \frac{\sum_{i=1}^N \mathbb 1\{(s_i, a_i, s_i') = (s, a, s')\}}{\sum_{i=1}^N \mathbb 1\{(s_i, a_i) = (s, a)\}}$$
Design $\widehat \pi$ as if $\widehat P$ is true
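
The three steps above can be sketched as follows. This is a minimal sketch under assumptions the slide does not state: a discounted infinite-horizon setting with discount `gamma`, a known reward table `r`, and value iteration as the planning method for "design $\widehat\pi$ as if $\widehat P$ is true". The true kernel `P` stands in for the sampling oracle, and all names are hypothetical.

```python
import numpy as np

def fit_transitions(P, n_per_pair, rng):
    """Estimate P_hat by sampling each (s, a) pair n_per_pair times and counting."""
    S, A, _ = P.shape
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            # Draw s' ~ P(. | s, a); here the true P plays the role of the simulator.
            next_states = rng.choice(S, size=n_per_pair, p=P[s, a])
            P_hat[s, a] = np.bincount(next_states, minlength=S) / n_per_pair
    return P_hat

def value_iteration(P, r, gamma, iters=500):
    """Greedy policy and value estimate for the model (P, r) with discount gamma."""
    S = P.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V      # Q[s, a] = r[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))     # random true kernel, shape (S, A, S)
r = rng.uniform(size=(S, A))                   # assumed known rewards in [0, 1]

P_hat = fit_transitions(P, n_per_pair=2000, rng=rng)   # N = 2000 * S * A samples
pi_hat, V_hat = value_iteration(P_hat, r, gamma)       # plan in the estimated model
pi_star, V_star = value_iteration(P, r, gamma)         # plan in the true model
print("max |P_hat - P| =", np.abs(P_hat - P).max())
print("max |V_hat - V*| =", np.abs(V_hat - V_star).max())
```

Sampling every pair the same number of times mirrors the $\frac{N}{SA}$ queries per pair on the slide; it requires query access to any $(s,a)$, which is what exploration has to replace.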

Analysis: $\widehat \pi$ vs. $\pi^*$

Compare $\widehat P$ and $P$ (Hoeffding's)
Compare $\widehat V^\pi$ and $V^\pi$ (Simulation Lemma)
Compare $\widehat V^{\widehat \pi}$ and $V^{\pi^*}$
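
The second comparison can be checked numerically. The sketch below evaluates a fixed policy exactly under $P$ and under a perturbed $\widehat P$, then compares the gap with one standard form of the simulation-lemma bound for discounted problems with rewards in $[0,1]$: $\|\widehat V^\pi - V^\pi\|_\infty \leq \frac{\gamma}{(1-\gamma)^2}\max_{s,a}\|\widehat P(\cdot\mid s,a) - P(\cdot\mid s,a)\|_1$. The discounted setting, this particular form of the bound, and the way `P_hat` is generated are assumptions; the constants used in lecture may differ.

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Exact V^pi for a deterministic policy by solving (I - gamma P_pi) V = r_pi."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, pi]              # (S, S): row s is P(. | s, pi(s))
    r_pi = r[idx, pi]              # (S,): reward of the action pi takes in s
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))        # true kernel, shape (S, A, S)
r = rng.uniform(size=(S, A))                      # rewards in [0, 1]
# Hypothetical estimated model: the true kernel mixed with a random one.
P_hat = 0.95 * P + 0.05 * rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.integers(A, size=S)                      # arbitrary deterministic policy

gap = np.max(np.abs(policy_value(P_hat, r, pi, gamma) - policy_value(P, r, pi, gamma)))
eps = np.max(np.abs(P_hat - P).sum(axis=-1))      # worst-case L1 model error
bound = gamma * eps / (1 - gamma) ** 2
print(f"value gap = {gap:.4f}, bound = {bound:.4f}")
```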
