Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap: MDPs and Control
2. MBRL with Query Model
3. Sub-Optimality
4. Model Error
M={S,A,r,P,γ}
Infinite horizon discounted MDP with finite states and actions
maximize_π  E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) ]
s.t.  s_{t+1} ∼ P(s_t, a_t),  a_t ∼ π(s_t)
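As a concrete reference for this recap, here is a minimal sketch of exact policy evaluation for the tabular discounted MDP above, solving the Bellman equation V^π = (I − γP^π)^{-1} r^π; the function and argument names are illustrative, not from the lecture.

```python
import numpy as np

def policy_value(P, r, gamma, pi):
    """Exact V^pi for a tabular discounted MDP.

    P:  (S, A, S) array, P[s, a, s'] = P(s' | s, a)
    r:  (S, A) reward table
    pi: (S,) deterministic policy, pi[s] = action taken in state s
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]   # (S, S): row s is P(. | s, pi(s))
    r_pi = r[np.arange(S), pi]   # (S,):   r(s, pi(s))
    # Bellman equation V = r_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = r_pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```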
minimize_π  ∑_{t=0}^{H−1} c(s_t, a_t)
s.t.  s_{t+1} = f(s_t, a_t),  a_t = π_t(s_t)
M={S,A,c,f,H}
Finite horizon deterministic MDP with continuous states/actions
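For the finite-horizon deterministic formulation, the objective is simply a rollout sum; a minimal sketch, assuming the cost c, dynamics f, and the time-varying policies are supplied as Python callables (illustrative names):

```python
def rollout_cost(f, c, policies, s0, H):
    """Total cost of playing pi_0, ..., pi_{H-1} from s0 under s_{t+1} = f(s_t, a_t)."""
    s, total = s0, 0.0
    for t in range(H):
        a = policies[t](s)   # a_t = pi_t(s_t)
        total += c(s, a)     # accumulate c(s_t, a_t)
        s = f(s, a)          # deterministic transition
    return total
```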
[Figure: agent-environment interaction loop: at each step the agent takes action a_t in state s_t and receives reward r_t.]
d^π_{μ_0,t} denotes the state distribution at time t under policy π with initial distribution μ_0. When the initial state is fixed to a known s_0, i.e. μ_0 = e_{s_0}, we write d^π_{s_0,t}.
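The simulation lemma below takes an expectation over d^π_{s_0}; a common convention, assumed here, is the discounted aggregate d^π_{s_0} = (1−γ) ∑_{t≥0} γ^t d^π_{s_0,t}, which has the closed form sketched below (illustrative function name).

```python
import numpy as np

def discounted_visitation(P_pi, s0, gamma):
    """d^pi_{s0} = (1 - gamma) * sum_t gamma^t d^pi_{s0,t}.

    P_pi: (S, S) transition matrix of the Markov chain induced by pi.
    Uses the identity d^T = (1 - gamma) * e_{s0}^T (I - gamma * P_pi)^{-1}.
    """
    S = P_pi.shape[0]
    e_s0 = np.zeros(S)
    e_s0[s0] = 1.0
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e_s0)
```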
2. MBRL with Query Model
Algorithm: MBRL with Queries
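A minimal sketch of the generative-model (query) approach that this algorithm name suggests: query each (s, a) pair N times, fit the empirical transition model P̂, and plan in the estimated MDP. The planner choice (value iteration), the function names, and the assumption that the reward table is known are illustrative, not the lecture's exact pseudocode.

```python
import numpy as np

def mbrl_with_queries(query, S, A, r, gamma, N, n_vi_iters=1000):
    """Generative-model MBRL sketch.

    query(s, a) -> a sampled next state s' ~ P(. | s, a)
    r: (S, A) known reward table; N: number of queries per (s, a) pair.
    Returns the empirical model P_hat and a greedy policy pi_hat.
    """
    # 1. Estimate the model from N queries per (s, a).
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, query(s, a)] += 1.0
    P_hat /= N

    # 2. Plan in the estimated MDP M_hat = {S, A, r, P_hat, gamma} with value iteration.
    V = np.zeros(S)
    for _ in range(n_vi_iters):
        Q = r + gamma * P_hat @ V   # (S, A) estimated Q-values
        V = Q.max(axis=1)
    return P_hat, Q.argmax(axis=1)  # pi_hat^*: greedy in the estimated model
```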
3. Sub-Optimality
[Figure: two-state MDP with states 0 and 1 and actions stay/switch. From one state the transitions are deterministic (stay: 1, switch: 1); from the other, stay succeeds with probability p1 (else 1−p1) and switch succeeds with probability p2 (else 1−p2).]
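Assuming the two-state figure is meant as a concrete instance for model estimation, here is one way to encode it; the assignment of the probabilistic transitions to state 1 (and the deterministic ones to state 0) is an assumption made for illustration.

```python
import numpy as np

STAY, SWITCH = 0, 1

def two_state_mdp(p1, p2):
    """Transition tensor P[s, a, s'] for the two-state stay/switch example (assumed layout)."""
    P = np.zeros((2, 2, 2))
    P[0, STAY]   = [1.0, 0.0]       # stay: 1  (deterministic)
    P[0, SWITCH] = [0.0, 1.0]       # switch: 1 (deterministic)
    P[1, STAY]   = [1 - p1, p1]     # stay succeeds w.p. p1, otherwise switches
    P[1, SWITCH] = [p2, 1 - p2]     # switch succeeds w.p. p2, otherwise stays
    return P
```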
Simulation Lemma: For a deterministic policy π,
|V̂^π(s_0) − V^π(s_0)| ≤ (γ/(1−γ)²) · E_{s∼d^π_{s_0}}[ ‖P̂(·|s,π(s)) − P(·|s,π(s))‖₁ ]
For a fixed policy, what is the difference in value when computed using P vs. when using P̂?
Here ‖P̂(·|s,π(s)) − P(·|s,π(s))‖₁ = ∑_{s'∈S} |P̂(s'|s,π(s)) − P(s'|s,π(s))|, the total variation (ℓ1) distance between the two distributions over s′.
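One standard vector-notation derivation consistent with the stated bound, sketched here under the assumption that rewards lie in [0, 1] (so ‖V̂^π‖_∞ ≤ 1/(1−γ)), writing P^π for the matrix whose row s is P(·|s,π(s)):

```latex
\begin{align*}
\widehat V^\pi - V^\pi
  &= (I - \gamma P^\pi)^{-1}\big[(I - \gamma P^\pi)\widehat V^\pi - r^\pi\big] \\
  &= \gamma\,(I - \gamma P^\pi)^{-1}\big(\widehat P^\pi - P^\pi\big)\widehat V^\pi,
  && \text{since } \widehat V^\pi = r^\pi + \gamma \widehat P^\pi \widehat V^\pi .
\end{align*}
Taking the $s_0$ coordinate and using $e_{s_0}^\top (I - \gamma P^\pi)^{-1} = \tfrac{1}{1-\gamma}\,(d^\pi_{s_0})^\top$,
\[
\big|\widehat V^\pi(s_0) - V^\pi(s_0)\big|
  \le \frac{\gamma}{1-\gamma}\,
      \mathbb{E}_{s \sim d^\pi_{s_0}}\!\Big[\big\|\widehat P(\cdot\mid s,\pi(s)) - P(\cdot\mid s,\pi(s))\big\|_1\Big]
      \big\|\widehat V^\pi\big\|_\infty
  \le \frac{\gamma}{(1-\gamma)^2}\,
      \mathbb{E}_{s \sim d^\pi_{s_0}}\Big[\big\|\widehat P(\cdot\mid s,\pi(s)) - P(\cdot\mid s,\pi(s))\big\|_1\Big].
\]
```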
An alternative proof is possible without vector notation.
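As an illustrative numerical sanity check (not from the slides), the bound can be verified on a small random MDP with exact policy evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, s0 = 5, 3, 0.9, 0

# Random true/estimated models, rewards in [0, 1], and a fixed deterministic policy.
P     = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']
P_hat = rng.dirichlet(np.ones(S), size=(S, A))
r     = rng.uniform(size=(S, A))
pi    = rng.integers(A, size=S)

idx = np.arange(S)
P_pi, P_hat_pi, r_pi = P[idx, pi], P_hat[idx, pi], r[idx, pi]

# Exact policy evaluation under each model: V = (I - gamma * P_pi)^{-1} r_pi.
V     = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
V_hat = np.linalg.solve(np.eye(S) - gamma * P_hat_pi, r_pi)

# Discounted visitation under the true model and the lemma's right-hand side.
e_s0 = np.zeros(S)
e_s0[s0] = 1.0
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e_s0)
rhs = gamma / (1 - gamma) ** 2 * (d @ np.abs(P_hat_pi - P_pi).sum(axis=1))

assert abs(V_hat[s0] - V[s0]) <= rhs + 1e-12, "simulation lemma bound violated"
print(abs(V_hat[s0] - V[s0]), "<=", rhs)
```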
4. Model Error
Theorem: For 0 < δ < 1, run the MBRL Algorithm with N ≥ 4γ²S²A·log(2SA/δ) / (ϵ²(1−γ)⁴). Then with probability at least 1−δ, V⋆(s) − V^{π̂⋆}(s) ≤ ϵ for all s ∈ S.
Algorithm: Tabular MBRL with Queries
Proof Outline:
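A hedged reconstruction of the standard three-step argument; the constants and the particular concentration inequality are assumptions, not necessarily the ones used in the lecture:

```latex
\begin{enumerate}
  \item \textbf{Model error.} Hoeffding-type concentration plus a union bound over all $(s,a)$
        pairs: with probability at least $1-\delta$, every row of the empirical model satisfies
        $\|\widehat P(\cdot\mid s,a) - P(\cdot\mid s,a)\|_1 \le \varepsilon_P$, where
        $\varepsilon_P$ shrinks roughly like $\sqrt{S \log(SA/\delta)/N_{s,a}}$ in the number of
        queries $N_{s,a}$ spent on that pair.
  \item \textbf{Simulation lemma, applied twice.} For any deterministic policy $\pi$ and state $s$,
        $|\widehat V^\pi(s) - V^\pi(s)| \le \frac{\gamma}{(1-\gamma)^2}\,\varepsilon_P$.
  \item \textbf{Sub-optimality decomposition.}
        \[
          V^\star(s) - V^{\widehat\pi^\star}(s)
          = \underbrace{\big(V^{\pi^\star}(s) - \widehat V^{\pi^\star}(s)\big)}_{\le\, \gamma\varepsilon_P/(1-\gamma)^2}
          + \underbrace{\big(\widehat V^{\pi^\star}(s) - \widehat V^{\widehat\pi^\star}(s)\big)}_{\le\, 0
            \text{ (}\widehat\pi^\star\text{ optimal in }\widehat M\text{)}}
          + \underbrace{\big(\widehat V^{\widehat\pi^\star}(s) - V^{\widehat\pi^\star}(s)\big)}_{\le\, \gamma\varepsilon_P/(1-\gamma)^2}
          \le \frac{2\gamma}{(1-\gamma)^2}\,\varepsilon_P .
        \]
        Choosing $N$ as in the theorem makes the right-hand side at most $\epsilon$.
\end{enumerate}
```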