CS 4/5789: Introduction to Reinforcement Learning
Lecture 22: Exploration in MDPs
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due weekly on Mondays
- PSet 7 due Monday
- PA 4 due May 3
- Final exam is Saturday 5/13 at 2pm
- WICCxURMC Survey
Agenda
1. Recap: Bandits & MBRL
2. MBRL with Exploration
3. UCB Value Iteration
4. UCB-VI Analysis

Recap: Bandits
A simplified setting for studying exploration
ex - machine make and model affect rewards, so the context \(x\) is a feature vector encoding them
- See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\) and get reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\), which in the linear case equals \(\theta_a^\top x\)
- Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t) \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)] $$
Recap: UCB
UCB-type Algorithms
- for \(t=1,2,...,T\)
- Observe context \(x_t\)
- Pull arm \(a_t\) optimistically
- MAB: \(\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
- Linear: \(\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}\)
- update estimates and confidence intervals
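The MAB rule above can be written in a few lines; here is a minimal sketch, where the function name and the constant \(C\) are placeholders rather than anything from the lecture:

```python
# Minimal UCB arm selection for the multi-armed bandit recap above.
import numpy as np

def ucb_action(mu_hat, counts, C=2.0):
    """Pick the arm maximizing estimated mean plus the confidence width sqrt(C / N_a)."""
    counts = np.maximum(counts, 1e-12)  # arms never pulled get an enormous bonus, so they are tried first
    return int(np.argmax(mu_hat + np.sqrt(C / counts)))
```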
Recap: Tabular MBRL
Algorithm:
- Query each \((s,a)\) pair \(\frac{N}{SA}\) times, record sample \(s'\sim P(s,a)\)
- Fit transition model by counting: $$\widehat P(s'\mid s,a) = \frac{\sum_{i=1}^N \mathbb 1\{(s_i, a_i, s_i') = (s, a, s')\}}{\sum_{i=1}^N \mathbb 1\{(s_i, a_i) = (s, a)\}}$$
- Design \(\widehat \pi\) as if \(\widehat P\) is true
Analysis: \(\widehat \pi\) vs. \(\pi^*\)
- Compare \(\widehat P\) and \(P\) (Hoeffding's)
- Compare \(\widehat V^\pi\) and \(V^\pi\) (Simulation Lemma)
- Compare \(\widehat V^{\widehat \pi}\) and \(V^{\pi^*}\)
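As a reminder of how the last comparison typically combines the first two (a standard decomposition, sketched here rather than the lecture's exact derivation): $$ V^{\pi^\star} - V^{\widehat\pi} = \underbrace{V^{\pi^\star} - \widehat V^{\pi^\star}}_{\text{Simulation Lemma}} + \underbrace{\widehat V^{\pi^\star} - \widehat V^{\widehat\pi}}_{\leq 0 \text{ since } \widehat\pi \text{ is optimal for } \widehat P} + \underbrace{\widehat V^{\widehat\pi} - V^{\widehat\pi}}_{\text{Simulation Lemma}} $$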
Agenda
1. Recap: Bandits & MBRL
2. MBRL with Exploration
3. UCB Value Iteration
4. UCB-VI Analysis
MBRL with Exploration
- Finite horizon tabular MDP with given initial state $$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, s_0\}$$
- Finite states \(|\mathcal S| = S\) and actions \(|\mathcal A|=A\)
- Transition probabilities \(P\) unknown
- for simplicity, assume the reward function is noiseless, so it is perfectly known once \((s,a)\) is visited
- Unlike in the previous unit, the initial state is fixed!
- No longer possible to query \(s'\sim P(s,a)\) for any \(s,a\)
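A minimal container for this tuple might look as follows; the field names and array layout (tabular reward of shape \((S,A)\), transitions of shape \((S,A,S)\)) are illustrative assumptions, not notation from the slides:

```python
# Container for a finite-horizon tabular MDP M = {S, A, r, P, H, s0}.
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    S: int           # number of states
    A: int           # number of actions
    r: np.ndarray    # reward table, shape (S, A); known once (s, a) is visited
    P: np.ndarray    # transition probabilities, shape (S, A, S); unknown to the learner
    H: int           # horizon
    s0: int          # fixed initial state
```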
Example
- Consider the deterministic chain MDP
- \(\mathcal S = \{0,\dots,H-1\}\) and \(\mathcal A = \{1, \dots, A\}\)
- \(s_{t+1} = s_t + \mathbf 1\{a_t=1\} - \mathbf 1\{a_t\neq 1\}\) except at endpoints
- Suppose \(s_0=0\), finite horizon \(H\), and reward function $$r(s,a) = \mathbf 1\{s=H-1\}$$
- Uniformly random policy for exploration, observe \((r_t)_{0\leq t\leq H-1}\)
- \(\mathbb P\{r_t = 0 ~\forall t\} = 1-1/A^{H-1}\), since reaching \(s=H-1\) requires \(a_0=\dots=a_{H-2}=1\)
[Figure: chain MDP with states \(0, 1, 2, \dots, H-1\); action \(1\) moves right, actions \(\neq 1\) move left (staying in place at the endpoints).]
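A quick Monte Carlo check of the probability claim above, using the chain dynamics from this slide (the function and its defaults are illustrative):

```python
# Estimate the probability that a uniformly random policy sees no reward in the chain MDP.
import numpy as np

def prob_no_reward(H=5, A=2, trials=200_000, seed=0):
    rng = np.random.default_rng(seed)
    no_reward = 0
    for _ in range(trials):
        s = 0
        seen_goal = False
        for t in range(H):                         # rewards r_0, ..., r_{H-1}
            if s == H - 1:                         # r(s, a) = 1{s = H-1}
                seen_goal = True
                break
            a = rng.integers(1, A + 1)             # uniformly random action in {1, ..., A}
            s = max(s + (1 if a == 1 else -1), 0)  # chain dynamics, clipped at the left endpoint
        no_reward += not seen_goal
    return no_reward / trials

print(prob_no_reward())  # roughly 1 - 1/2**4 = 0.9375 for H=5, A=2
```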
Agenda
1. Recap: Bandits & MBRL
2. MBRL with Exploration
3. UCB Value Iteration
4. UCB-VI Analysis
UCB-VI
- Initialize \(\hat P_0\), \(\hat r_0=0\), and \(b_0\)
- for \(i=1,2,...,T\)
- Design optimistic policy \(\hat \pi^i = VI(\hat P_i, \hat r_i+b_i)\)
- Rollout \(\hat\pi^i\) and observe trajectory
- Update \(\hat P_{i+1}\), \(\hat r_{i+1}\), and \(b_{i+1}\)
Optimistic MBRL
- Design a reward bonus to incentivize exploration of unseen states and actions
Model Estimation
- Using the dataset at iteration \(i\): $$\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} $$
- Number of times we took action \(a\) in state \(s\) $$N_i(s,a) = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} $$
- Number of times we transitioned to \(s'\) after \(s,a\) $$N_i(s,a, s') = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a, s^k_{t+1}=s'\} $$
- Estimate transition probability \(\hat P_i(s'|s,a) = \frac{N_i(s,a,s')}{N_i(s,a)}\)
- Reward of \(s,a\): \(\hat r_i(s,a) = \frac{1}{N_i(s,a) } \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} r_t^k\)
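A minimal sketch of these counting estimators, assuming the dataset is stored as a list of trajectories, each a list of 0-indexed \((s, a, r, s')\) tuples (this data layout is an assumption, not something from the slides):

```python
# Count-based estimates of the transition model and reward from logged trajectories.
import numpy as np

def estimate_model(trajectories, S, A):
    N_sa = np.zeros((S, A))          # N_i(s, a)
    N_sas = np.zeros((S, A, S))      # N_i(s, a, s')
    r_sum = np.zeros((S, A))         # running sum of rewards observed at (s, a)
    for traj in trajectories:
        for (s, a, r, s_next) in traj:
            N_sa[s, a] += 1
            N_sas[s, a, s_next] += 1
            r_sum[s, a] += r
    visited = N_sa > 0
    P_hat = np.zeros((S, A, S))      # unvisited (s, a) are left at zero; handling them is a design choice
    P_hat[visited] = N_sas[visited] / N_sa[visited][:, None]
    r_hat = np.zeros((S, A))
    r_hat[visited] = r_sum[visited] / N_sa[visited]
    return P_hat, r_hat, N_sa
```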
Example
[Figure: chain MDP with states \(0, 1, 2, \dots, H-1\); action \(1\) moves right, actions \(\neq 1\) move left.]
- Uniformly random policy:
- e.g. \(a_{0:8} = (1,3,2,2,3,1,1,3,2)\)
- then \(s_{0:9} = (0,1,0,0,0,0,1,2,1,0)\)
- \(N(s,a) = ?\)
- \(\hat P(s'|s,a)=\) PollEV
- \(\hat r(s,a)=0\)
Reward Bonus
- Using the dataset at iteration \(i\): \(\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} \)
- Number of times \(s,a\): \(N_i(s,a)\)
- Number of times \(s,a\to s'\): \(N_i(s,a, s')\)
- Reward of \(s,a\): \(\hat r_i(s,a)\)
- Reward bonus $$ b_i(s,a) = H\sqrt{\frac{\alpha}{N_i(s,a)}} $$
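As a helper, the bonus above is one line; \(\alpha\) is left as a parameter since the slide only says it depends on \(S, A, H\) and the failure probability:

```python
# Count-based exploration bonus b_i(s, a) = H * sqrt(alpha / N_i(s, a)).
import numpy as np

def bonus(N_sa, H, alpha):
    # Treating unvisited pairs as N = 1 (so they receive the largest bonus) is an
    # implementation choice, not something specified on the slide.
    return H * np.sqrt(alpha / np.maximum(N_sa, 1))
```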
Optimistic DP
- In the finite-horizon setting, VI is just DP
- DP:
- Initialize \(\hat V^i_H(s)=0\)
- for \(t=H-1, ..., 0\)
- \(\hat Q^i_t(s,a)=\hat r_i(s,a)+b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]\)
- \(\hat\pi^i_t(s) = \arg\max_a \hat Q^i_t(s,a)\)
- \(\hat V^i_{t}(s)=\hat Q^i_t(s,\hat\pi^i_t(s) )\)
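A sketch of the backward recursion above on the estimated model, with the bonus added to the estimated reward; clipping \(\hat V\) at \(H\) is an implementation detail that the slide does not show:

```python
# Optimistic DP: finite-horizon value iteration on (P_hat, r_hat + b).
import numpy as np

def optimistic_dp(P_hat, r_hat, b, H):
    """P_hat: (S, A, S); r_hat, b: (S, A). Returns a time-indexed greedy policy of shape (H, S)."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)                       # \hat V^i_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):
        Q = r_hat + b + P_hat @ V         # (S, A); P_hat @ V is E_{s' ~ P_hat(s,a)}[V(s')]
        pi[t] = np.argmax(Q, axis=1)      # greedy action \hat\pi^i_t(s)
        V = np.minimum(Q.max(axis=1), H)  # \hat V^i_t(s), clipped at the maximum possible return H
    return pi
```

Putting the pieces together, the UCB-VI loop from the earlier slide might look like the following, where `env_step(s, a) -> (r, s_next)` is a hypothetical sampler for the unknown MDP and `estimate_model` and `bonus` are the sketches above:

```python
def ucbvi(env_step, s0, S, A, H, T, alpha):
    data = []                                      # all trajectories collected so far
    for i in range(1, T + 1):
        P_hat, r_hat, N_sa = estimate_model(data, S, A)
        b = bonus(N_sa, H, alpha)
        pi = optimistic_dp(P_hat, r_hat, b, H)     # optimistic policy \hat\pi^i
        s, traj = s0, []
        for t in range(H):                         # roll out \hat\pi^i for one episode
            a = pi[t, s]
            r, s_next = env_step(s, a)
            traj.append((s, a, r, s_next))
            s = s_next
        data.append(traj)
    return pi
```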
Example
[Figure: chain MDP with states \(0, 1, 2, \dots, H-1\); action \(1\) moves right, actions \(\neq 1\) move left.]
- Iteration 1: uniformly random policy for exploration
- trajectory contains \(s,a\) with probability \(\propto 1/A^s\)
- Iteration 2: reward bonus incentivizes upward exploration
- Eventually reach \(s=H-1\) and converge to \(\pi^\star_t(s) = 1\)
Agenda
1. Recap: Bandits & MBRL
2. MBRL with Exploration
3. UCB Value Iteration
4. UCB-VI Analysis
Analysis: Two Key Facts
- The exploration bonus bounds, with high probability, the difference $$ |\mathbb E_{s'\sim\hat P(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \quad\forall ~~s,a $$
- similar to confidence intervals bounding \(|\mu_a-\hat\mu_a|\)
- The exploration bonus leads to optimism $$ \hat V_t^i(s) \geq V_t^\star(s) $$
- similar to showing that \(\hat\mu_{\hat a_\star} \geq \mu_\star\)
Regret in RL Setting
- These two properties are key to proving a regret bound $$R(T) = \mathbb E\left[\sum_{i=1}^T V_0^\star (s_0)-V_0^{\pi^i}(s_0) \right] $$
- Above, we define regret as cumulative sub-optimality (over episodes)
- now sub-optimality is itself a cumulative reward over time (i.e. value)
- Argument follows the UCB proof structure:
- By optimism, \(V^\star_0(s_0) - V_0^{\pi^i}(s_0) \leq \hat V_0^i(s_0) - V^{\pi^i}_0(s_0)\)
- Simulation Lemma to compare \(\hat V_0^i(s_0)\) & \(V^{\pi^i}_0(s_0)\)
- Regret proof is out of scope for this class (see 6789)
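The first step of this argument is just the optimism inequality above summed over episodes (a sketch; the simulation-lemma step is the part deferred to 6789): $$ R(T) = \sum_{i=1}^T \mathbb E\left[V_0^\star(s_0) - V_0^{\pi^i}(s_0)\right] \leq \sum_{i=1}^T \mathbb E\left[\hat V_0^i(s_0) - V_0^{\pi^i}(s_0)\right] $$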
Example
[Figure: chain MDP with states \(0, 1, \dots, H-1\); action \(1\) moves right, actions \(\neq 1\) move left.]
- Reward bonus enables the "right amount" of exploration

Analysis: Exploration Bonus
- Lemma: For any fixed function \(V:\mathcal S\to [0,H]\), with high probability, $$|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \leq H\sqrt{\frac{\alpha}{N_i(s,a)}}=b_i(s,a)$$ where \(\alpha\) depends on \(S,A,H\) and the failure probability
- Proof: \(|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]|\)
- \(=\left|\sum_{s'\in\mathcal S} (\hat P_i(s'|s,a) - P(s'|s,a) )V(s')\right|\)
- \(\leq \sum_{s'\in\mathcal S} |\hat P_i(s'|s,a) - P(s'|s,a) | |V(s')|\)
- \(\leq \left(\sum_{s'\in\mathcal S} |\hat P_i(s'|s,a) - P(s'|s,a) |\right)\max_{s'} |V(s')| \)
- \(\leq \sqrt{\frac{\alpha}{N_i(s,a)}} \underbrace{\max_{s'} |V(s')|}_{\leq H}\) using result from Lecture 11
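For reference, the Lecture 11 result invoked in the last step is an \(\ell_1\) concentration bound of roughly the following form (stated here as an assumption, with \(\alpha\) absorbing logarithmic factors): with high probability, for all \(s,a\), $$ \sum_{s'\in\mathcal S} \left|\hat P_i(s'\mid s,a) - P(s'\mid s,a)\right| \leq \sqrt{\frac{\alpha}{N_i(s,a)}} $$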
Analysis: Optimism
- Lemma: as long as \(r(s,a)\in[0,1]\), \( \hat V_t^i(s) \geq V_t^\star(s)\) for all \(t,i,s\)
- Proof: By induction.
- Base case \(\hat V_H^i(s)=0=V_H^\star(s)\)
- Assume that \( \hat V_{t+1}^i(s) \geq V_{t+1}^\star(s)\)
- Then \(\hat Q_t^i(s,a) - Q_t^\star (s,a)\)
- \( = \hat r_i(s,a) + b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]- r(s,a) - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \)
- \( \geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \) (noiseless \(r_t\))
- \( \geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[V^\star_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \) (inductive hypothesis)
- \( \geq b_i(s,a)-b_i(s,a) = 0 \) (previous Lemma)
- Thus \(\hat Q_t^i(s,a) \geq Q_t^\star (s,a)\) which implies \(\hat V_t^i(s) \geq V_t^\star (s)\) (exercise)
Recap
- PSet due Monday
- Exploration problem
- UCB-VI Algorithm
- Next lecture: Revisiting imitation learning
Sp23 CS 4/5789: Lecture 22
By Sarah Dean