CS 4/5789: Introduction to Reinforcement Learning

Lecture 22: Exploration in MDPs

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

My OH rescheduled to today after lecture
Homework
- 5789 Paper Assignments
- PSet 8 due Monday
- Final PA due next week
Final exam is Tuesday 5/14 at 2pm in Ives 305
Prelim grades released. Out of 48 points,
- A range: 42+ points
- B range: 33+ points
- C range: 28+ points

Agenda

1. Recap: Bandits

2. MBRL and Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

Recap: Bandits

A simplified setting for studying exploration

ex - machine make and model affect rewards, so the context $x$ is a feature vector (eight entries in the slide's illustration)

See context $x_t\sim \mathcal D$, pull "arm" $a_t$, and get reward $r_t\sim r(x_t, a_t)$ with $\mathbb E[r(x, a)] = \mu_a(x)$; in the linear case, $\mu_a(x)=\theta_a^\top x$
Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t) \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)] $$

Recap: UCB

UCB-type Algorithms

for $t=1,2,...,T$
- Observe context $x_t$
- Pull arm $a_t$ optimistically
  - MAB: $\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$
  - Linear: $\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}$
- update estimates and confidence intervals
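
The MAB case of the loop above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: the Bernoulli arms, the initial round-robin pulls, the choice $C = 2\log T$, and the function name `ucb_bandit` are all assumptions made for the example.

```python
import math
import random

def ucb_bandit(means, T, C, seed=0):
    """Run a UCB-type rule on Bernoulli arms; `means` are the true (hidden) success rates."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # N_{a,t}: number of pulls of each arm so far
    sums = [0.0] * K      # running sum of rewards per arm
    for t in range(T):
        if t < K:
            a = t         # assumption: pull each arm once so that every N_a > 0
        else:
            # optimistic index: empirical mean plus confidence width sqrt(C / N_a)
            a = max(range(K),
                    key=lambda k: sums[k] / counts[k] + math.sqrt(C / counts[k]))
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
    return counts

counts = ucb_bandit(means=[0.2, 0.5, 0.9], T=2000, C=2 * math.log(2000))
print(counts)  # pull counts per arm; the best arm should dominate
```

With well-separated means, suboptimal arms stop being pulled once their confidence width falls below roughly half their gap, so most pulls go to the best arm.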

Model Based RL

Finite horizon tabular MDP with given initial state $$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, s_0\}$$
Finite states $|\mathcal S| = S$ and actions $|\mathcal A|=A$
Transition probabilities $P$ unknown
- for simplicity assume reward function is noiseless, thus perfectly known once $s,a$ is visited
Model-based RL has two simple steps:
1. Use data to estimate $\hat P$
2. Use $\hat P$ with algorithms from Unit 1

Transition Estimation

Consider a dataset $\mathcal D=\{s_0,a_0,s_1,a_1,...\}$
How to estimate $\hat P(\cdot|s,a)$?
- Collect all time steps $t_i$ such that $(s_{t_i},a_{t_i})=(s,a)$
- Define the dataset: $$\mathcal D(s,a) = \{(s_{t_i}, a_{t_i}, s_{t_{i}+1})\}_{i=1}^{N(s,a)}$$
- $N(s,a)$ is the number of times $s,a$ is visited

Warmup: Coin & Dice

Consider a biased coin which is heads with probability $p$ for an unknown value of $p\in[0,1]$
How to estimate from trials?
- Flip the coin $N$ times $$\hat p =\frac{\mathsf{\# heads}}{N} $$
Consider an $S$-sided die which lands on side $s$ with probability $p_s$ for $s\in\{1,\dots,S\}=[S]$, where the $p_s$ are unknown
How to estimate from trials?
- Roll the die $N$ times $$\hat p_s =\frac{\mathsf{\# times~land~on}~s}{N} $$

Estimation Errors

For the weighted coin,
- The head count $N\hat p$ is Binomial$(N, p)$
- $|p-\hat p| \leq\sqrt{\frac{\log(2/\delta)}{N}}$ with probability $1-\delta$
For the $S$-sided die, with probability $1-\delta$, $$\max_{s\in[S]} |p_s-\hat p_s| \leq \sqrt{\frac{\log(2S/\delta)}{N}} $$
Alternatively, bounding the total variation ($\ell_1$) distance, w.p. $1-\delta$ $$ \sum_{s\in[S]} |p_s-\hat p_s| \leq \sqrt{\frac{S\log(2/\delta)}{N}} $$
Why? Unbiased $\mathbb E[\hat p]=p$ & concentration (Hoeffding's)
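
As a sanity check on the coin bound above, one can simulate many batches of $N$ flips and count how often $|p-\hat p|$ stays within $\sqrt{\log(2/\delta)/N}$. This is a hedged sketch: the values $p=0.3$, $N=100$, $\delta=0.05$ and the helper name `coin_coverage` are illustrative assumptions.

```python
import math
import random

def coin_coverage(p, N, delta, trials=2000, seed=0):
    """Fraction of trials in which |p - p_hat| <= sqrt(log(2/delta) / N)."""
    rng = random.Random(seed)
    width = math.sqrt(math.log(2 / delta) / N)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < p for _ in range(N))  # number of heads in N flips
        p_hat = heads / N
        hits += abs(p - p_hat) <= width
    return hits / trials, width

coverage, width = coin_coverage(p=0.3, N=100, delta=0.05)
print(f"width={width:.3f}, empirical coverage={coverage:.3f}")
```

The empirical coverage should be at least $1-\delta$; in practice it is usually much higher, because Hoeffding's bound holds for every $p$ and is therefore conservative.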

Hoeffding's Inequality

This slide is not in scope.
Hoeffding's inequality states that for independent random variables $X_1,\dots,X_n$ with $a_i\leq X_i\leq b_i$ and $S_n=\sum_{i=1}^n X_i$, $$ \mathbb P\{|S_n-\mathbb E[S_n]|\geq t\} \leq 2\exp\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right)$$
Rearranging, this is equivalent to $$ \left|\frac{1}{n}S_n-\mathbb E\left[\frac{1}{n} S_n\right]\right|\leq \frac{1}{n}\sqrt{\frac{1}{2}\log(2/\delta)\sum_{i=1}^n (b_i-a_i)^2} \quad\text{w.p. at least}\quad 1-\delta$$
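
Worked specialization to the coin (a sketch of the step connecting the inequality above to the warmup bound): take $X_i\in\{0,1\}$ to be the indicator of heads on flip $i$, so $a_i=0$, $b_i=1$, $n=N$, $\frac{1}{N}S_N=\hat p$, and $\mathbb E[\frac{1}{N}S_N]=p$. Then, w.p. at least $1-\delta$, $$ |\hat p-p| \leq \frac{1}{N}\sqrt{\frac{1}{2}\log(2/\delta)\cdot N} = \sqrt{\frac{\log(2/\delta)}{2N}} \leq \sqrt{\frac{\log(2/\delta)}{N}} $$
For the die, applying the same bound to each side $s$ with failure probability $\delta/S$ and taking a union bound over the $S$ sides gives $$ \max_{s\in[S]}|\hat p_s-p_s| \leq \sqrt{\frac{\log(2S/\delta)}{2N}} \leq \sqrt{\frac{\log(2S/\delta)}{N}} \quad\text{w.p. at least}\quad 1-\delta$$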

Transition Estimation

How to estimate $P(\cdot|s,a)$? $$ \mathcal D(s,a) = \{(s_{t_i}, a_{t_i}, s_{t_{i}+1})\}_{i=1}^{N(s,a)} $$
Estimate by counting! $$\hat P(s'|s,a) =\frac{\mathsf{\# times~}~s_{t_i+1}=s'}{N(s,a)} \quad\forall~~s,a$$
Lemma: Estimation error of above, with probability $1-\delta$ $$\sum_{s'\in\mathcal S} |P(s'|s,a)-\hat P(s'|s,a)| \leq \sqrt{\frac{\alpha}{N(s,a)}} $$ for $\alpha=S^2A \log(2/\delta)$
- Proof out of scope, but details in slides below
How can we make sure $N(s,a)$ is large enough?

Exploration Example

Consider the deterministic chain MDP
- $\mathcal S = \{0,\dots,H-1\}$ and $\mathcal A = \{1, \dots, A\}$
- $s_{t+1} = s_t + \mathbf 1\{a_t=1\} - \mathbf 1\{a_t\neq 1\}$ except at endpoints
Suppose $s_0=0$, finite horizon $H$, and reward function $$r(s,a) = \mathbf 1\{s=H-1\}$$
Uniformly random policy for exploration, observe $(r_t)_{0\leq t\leq H-1}$
- $\mathbb P\{r_t= 0~\forall t\} = 1-1/A^{H-1}$, since reaching $s=H-1$ by time $H-1$ requires $a_0=\dots=a_{H-2}=1$

[Figure: chain MDP over states $0, 1, 2, \dots, H-1$; action $1$ moves one state up the chain, any action $\neq 1$ moves one state down]
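
The difficulty of random exploration on the chain can be checked exactly for small $H$ and $A$ by enumerating all action sequences. This is a sketch under stated assumptions: "except at endpoints" is read as clamping the state to $[0, H-1]$, rewards $r_0,\dots,r_{H-1}$ are observed, and the helper names are hypothetical. Under this indexing, reaching $s=H-1$ takes $H-1$ consecutive pulls of action $1$, so the enumeration returns $1-1/A^{H-1}$; the chance of ever seeing a reward is exponentially small in $H$.

```python
from itertools import product

def chain_step(s, a, H):
    """Deterministic chain: action 1 moves up, any other action moves down (clamped)."""
    return min(s + 1, H - 1) if a == 1 else max(s - 1, 0)

def prob_zero_reward(H, A):
    """Exact P{r_t = 0 for all t in 0..H-1} under a uniformly random policy."""
    zero, total = 0, 0
    for actions in product(range(1, A + 1), repeat=H):   # all A^H action sequences
        s, rewards = 0, []
        for a in actions:
            rewards.append(1 if s == H - 1 else 0)        # r(s, a) = 1{s = H-1}
            s = chain_step(s, a, H)
        zero += all(r == 0 for r in rewards)
        total += 1
    return zero / total

for H, A in [(3, 2), (4, 2), (4, 3)]:
    print(H, A, prob_zero_reward(H, A), 1 - 1 / A ** (H - 1))
```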

Optimistic MBRL

Design a reward bonus to incentivize exploration of unseen states and actions

UCB-VI

Initialize $\hat P_1$, $\hat r_1=0$, and $b_1$
for $i=1,2,...,T$
- Design optimistic policy $\hat \pi^i = VI(\hat P_i, \hat r_i+b_i)$
- Roll out $\hat\pi^i$ and observe trajectory
- Update $\hat P_{i+1}$, $\hat r_{i+1}$, and $b_{i+1}$

Model Estimation

Using the dataset at iteration $i$: $$\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} $$
Number of times we took action $a$ in state $s$ $$N_i(s,a) = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} $$
Number of times we transitioned to $s'$ after $s,a$ $$N_i(s,a, s') = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a, s^k_{t+1}=s'\} $$
Estimate transition probability $\hat P_i(s'|s,a) = \frac{N_i(s,a,s')}{N_i(s,a)}$
Reward of $s,a$: $\hat r_i(s,a) = \frac{1}{N_i(s,a) } \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} r_t^k$

Example

[Figure: chain MDP over states $0, 1, 2, \dots, H-1$; action $1$ moves one state up the chain, any action $\neq 1$ moves one state down]

Uniformly random policy:
- e.g. $a_{0:8} = (1,3,2,2,3,1,1,3,2)$
- then $s_{0:9} = (0,1,0,0,0,0,1,2,1,0)$
$N(s,a) = ?$
$\hat P(s'|s,a)= ?$ (PollEV question)
$\hat r(s,a)=0$
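
The counts for the example trajectory can be tallied mechanically. The sketch below uses `collections.Counter`; the variable names are illustrative, and $\hat P$ is only defined for pairs $(s,a)$ that were actually visited.

```python
from collections import Counter

actions = (1, 3, 2, 2, 3, 1, 1, 3, 2)        # a_{0:8} from the example
states = (0, 1, 0, 0, 0, 0, 1, 2, 1, 0)      # s_{0:9} from the example

N_sa = Counter()    # N(s, a)
N_sas = Counter()   # N(s, a, s')
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    N_sa[(s, a)] += 1
    N_sas[(s, a, s_next)] += 1

# P_hat(s' | s, a) = N(s, a, s') / N(s, a), for visited (s, a) only
P_hat = {(s, a, s2): n / N_sa[(s, a)] for (s, a, s2), n in N_sas.items()}

for (s, a), n in sorted(N_sa.items()):
    print(f"N({s},{a}) = {n}")
```

Because the chain is deterministic, every visited pair has $\hat P(s'|s,a)=1$ for its single observed successor; pairs such as $(2,1)$ were never visited and have no estimate yet.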

Reward Bonus

Using the dataset at iteration $i$: $\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} $
Number of times $s,a$: $N_i(s,a)$
Number of times $s,a\to s'$: $N_i(s,a, s')$
Reward of $s,a$: $\hat r_i(s,a)$
Reward bonus $$ b_i(s,a) = H\sqrt{\frac{\alpha}{N_i(s,a)}} $$
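
A small sketch of how the bonus shrinks with visits. The slide does not say what to do when $N_i(s,a)=0$; returning infinity there is an assumption of this example, as are the values of `H` and `alpha` and the function name `reward_bonus`.

```python
import math

def reward_bonus(n, H, alpha):
    """b(s, a) = H * sqrt(alpha / N(s, a)) for a pair visited n times."""
    if n == 0:
        return float("inf")   # assumption: never-visited pairs are maximally attractive
    return H * math.sqrt(alpha / n)

H, alpha = 10, 1.0            # illustrative values, not from the lecture
for n in [0, 1, 4, 16, 100]:
    print(n, reward_bonus(n, H, alpha))
```

Quadrupling the visit count halves the bonus, which is the same $1/\sqrt{N}$ rate as the confidence widths in the bandit setting.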

Optimistic DP

In finite horizon setting, VI is just DP

Initialize $\hat V^i_H(s)=0$
for $t=H-1, ..., 0$
- $\hat Q^i_t(s,a)=\hat r_i(s,a)+b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]$
- $\hat\pi^i_t(s) = \arg\max_a \hat Q^i_t(s,a)$
- $\hat V^i_{t}(s)=\hat Q^i_t(s,\hat\pi^i_t(s) )$
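
The backward recursion above can be sketched directly on nested lists. Assumptions of this example: actions are indexed from $0$, ties are broken toward the lowest index, the bonus is passed in as a precomputed table, and the two-state model used in the demo is invented for illustration.

```python
def optimistic_dp(P_hat, r_hat, bonus, H):
    """Backward DP on the estimated model with reward r_hat + bonus.

    P_hat[s][a][s2], r_hat[s][a], and bonus[s][a] are nested lists.
    Returns (V_0, policy), where policy[t][s] is the greedy action at step t.
    """
    S, A = len(r_hat), len(r_hat[0])
    V = [0.0] * S                                  # V_H(s) = 0
    policy = [None] * H
    for t in range(H - 1, -1, -1):
        Q = [[r_hat[s][a] + bonus[s][a]
              + sum(P_hat[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
        policy[t] = [max(range(A), key=Q[s].__getitem__) for s in range(S)]
        V = [Q[s][policy[t][s]] for s in range(S)]
    return V, policy

# Toy model (hypothetical): action 0 leads to state 1, action 1 leads to state 0;
# reward 1 is collected in state 1.
P_hat = [[[0, 1], [1, 0]], [[0, 1], [1, 0]]]
r_hat = [[0, 0], [1, 1]]
print(optimistic_dp(P_hat, r_hat, [[0, 0], [0, 0]], H=2))   # no bonus
print(optimistic_dp(P_hat, r_hat, [[0, 5], [0, 0]], H=2))   # bonus on (s=0, a=1)
```

In the second call the bonus on $(s,a)=(0,1)$ outweighs the known reward, so the greedy policy switches to the less-explored action.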

Example

[Figure: chain MDP over states $0, 1, 2, \dots, H-1$; action $1$ moves one state up the chain, any action $\neq 1$ moves one state down]

Iteration 1: uniformly random policy for exploration
- trajectory contains $s,a$ with probability $\propto 1/A^s$
Iteration 2: reward bonus incentivizes upward exploration
Eventually reach $s=H-1$ and converge to $\pi^\star_t(s) = 1$

Analysis: Two Key Facts

The exploration bonus bounds, with high probability, the difference $$ |\mathbb E_{s'\sim\hat P(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \quad\forall ~~s,a $$
- similar to confidence intervals bounding $|\mu_a-\hat\mu_a|$
The exploration bonus leads to optimism $$ \hat V_t^i(s) \geq V_t^\star(s) $$
- similar to showing that $\hat\mu_{\hat a_\star} \geq \mu_\star$

Regret in RL Setting

These two properties are key to proving a regret bound $$R(T) = \mathbb E\left[\sum_{i=1}^T V_0^\star (s_0)-V_0^{\pi^i}(s_0) \right] \lesssim H^2\sqrt{SA\cdot T}$$
Above, define regret as cumulative sub-optimality (over episodes)
- Note: sub-optimality in terms of Value is difference in cumulative reward (over time)
Argument follows UCB proof structure:
1. By optimism, $V^\star_0(s_0) - V_0^{\pi^i}(s_0) \leq \hat V_0^i(s_0) - V^{\pi^i}_0(s_0)$
2. Simulation Lemma to compare $\hat V_0^i(s_0)$ & $V^{\pi^i}_0(s_0)$
Regret proof is out of scope for this class (see 6789)
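
One way to read the bound (a short worked consequence, taking the stated rate as given): dividing by the number of episodes, the average per-episode sub-optimality vanishes, $$ \frac{R(T)}{T} \lesssim H^2\sqrt{\frac{SA}{T}} \to 0 \quad\text{as}\quad T\to\infty $$
Equivalently, average sub-optimality of at most $\epsilon$ needs on the order of $$ T \gtrsim \frac{H^4 SA}{\epsilon^2} $$ episodes, which is polynomial in $S$, $A$, and $H$, in contrast with the exponentially small success probability of uniformly random exploration on the chain.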

Example

[Figure: chain MDP; action $1$ moves one state up the chain, any action $\neq 1$ moves one state down]

Reward bonus enables the "right amount" of exploration

Analysis: Exploration Bonus

Lemma: For any fixed function $V:\mathcal S\to [0,H]$, whp, $$|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \leq H\sqrt{\frac{\alpha}{N_i(s,a)}}=b_i(s,a)$$ where $\alpha$ depends on $S,A,H$ and failure probability
Proof: $|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]|$
- $=\left|\sum_{s'\in\mathcal S} (\hat P_i(s'|s,a) - P(s'|s,a) )V(s')\right|$
- $\leq \sum_{s'\in\mathcal S} |\hat P_i(s'|s,a) - P(s'|s,a) | |V(s')|$
- $\leq \left(\sum_{s'\in\mathcal S} |\hat P_i(s'|s,a) - P(s'|s,a) |\right)\max_{s'} |V(s')| $
- $\leq \sqrt{\frac{\alpha}{N_i(s,a)}} \underbrace{\max_{s'} |V(s')|}_{\leq H}$ using the estimation error Lemma (Transition Estimation)
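
The middle steps of the proof (triangle inequality, then pulling out $\max_{s'}|V(s')|$) can be checked numerically on a sampled distribution. This is only an illustration: the distribution `p`, the sample size, and the random `V` are made-up values.

```python
import random

def expectation_gap_and_bound(p, p_hat, V):
    """Return |E_{p_hat}[V] - E_p[V]| and the bound ||p_hat - p||_1 * max|V|."""
    gap = abs(sum((ph - q) * v for ph, q, v in zip(p_hat, p, V)))
    l1 = sum(abs(ph - q) for ph, q in zip(p_hat, p))
    return gap, l1 * max(abs(v) for v in V)

rng = random.Random(0)
H, S, N = 5, 4, 50
p = [0.1, 0.2, 0.3, 0.4]                             # true next-state distribution
samples = rng.choices(range(S), weights=p, k=N)      # N observed transitions
p_hat = [samples.count(s) / N for s in range(S)]     # estimate by counting
V = [rng.uniform(0, H) for _ in range(S)]            # any fixed V with values in [0, H]

gap, bound = expectation_gap_and_bound(p, p_hat, V)
print(f"gap={gap:.4f} <= bound={bound:.4f}")
```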

Analysis: Optimism

Lemma: as long as $r(s,a)\in[0,1]$, $ \hat V_t^i(s) \geq V_t^\star(s)$ for all $t,i,s$
Proof: By induction.
1. Base case $\hat V_H^i(s)=0=V_H^\star(s)$
2. Assume that $ \hat V_{t+1}^i(s) \geq V_{t+1}^\star(s)$ for all $s$
Then $\hat Q_t^i(s,a) - Q_t^\star (s,a)$
$ = \hat r_i(s,a) + b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]- r(s,a) - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] $
$ = b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] $ (noiseless $r_t$)
$ \geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[V^\star_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] $ (assumption)
$ \geq b_i(s,a)-b_i(s,a) = 0$ (previous Lemma)
Thus $\hat Q_t^i(s,a) \geq Q_t^\star (s,a)$ which implies $\hat V_t^i(s) \geq V_t^\star (s)$ (exercise)
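
A sketch of the final step, writing $\pi^\star_t(s)$ for an optimal action at $(t,s)$: since $\hat V^i_t$ maximizes $\hat Q^i_t$ over actions, $$ \hat V_t^i(s) = \max_a \hat Q_t^i(s,a) \geq \hat Q_t^i(s,\pi^\star_t(s)) \geq Q_t^\star(s,\pi^\star_t(s)) = V_t^\star(s) $$
The first inequality holds because a maximum is at least the value at any particular action, and the second is the bound $\hat Q_t^i(s,a) \geq Q_t^\star(s,a)$ applied at $a=\pi^\star_t(s)$.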

Recap

PSet due Monday

Exploration problem
UCB-VI Algorithm

Next lecture: Imitation learning

Sp24 CS 4/5789: Lecture 22

By Sarah Dean


asst prof in CS at Cornell

sdean.website
