Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
• PSet 2 regrade requests today-Friday
• PA 2 due Friday
• PSet 4 released next week
• 5789 Paper Review Assignment
• Midterm 3/15 during lecture
• Let us know conflicts/accommodations ASAP! (EdStem)
• Review Lecture on 3/12 (last year's slides/recording)
• Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4
• also: equation sheet (next week), 2023 notes, PAs

## Agenda

1. Recap: MDPs and Control

2. MBRL with Query Model

3. Sub-Optimality

4. Model Error

## Recap: MDPs and Control

• So far, we have algorithms (VI, PI, DP, LQR) for when transitions $$P$$ or $$f$$ are known and tractable to optimize
• In Unit 2, we develop algorithms for unknown (or intractable) transitions

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$

Infinite horizon discounted MDP with finite states and actions

$$\displaystyle\max_{\pi}~~ \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right] \quad\text{s.t.}\quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

$$\displaystyle\min_{\pi}~~ \sum_{t=0}^{H-1} c(s_t, a_t) \quad\text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$$

Finite horizon deterministic MDP with continuous states/actions

## Recap: MDPs and Control

(Figure: agent-environment interaction loop; the agent takes action $$a_t$$ and receives state $$s_t$$ and reward $$r_t$$.)

• In Unit 2, we develop algorithms for unknown (or intractable) transitions

## Recap: Distributions

• Recall the state distribution for a policy $$\pi$$ (Lecture 2)
• $$d^{\pi}_{\mu_0,t}(s) = \mathbb{P}\{s_t=s\mid s_0\sim \mu_0,s_{k+1}\sim P(s_k, \pi(s_k))\}$$
• We showed that it can be written as $$d^{\pi}_{\mu_0,t} = P_\pi^\top d^{\pi}_{\mu_0,t-1} = (P_\pi^t)^\top \mu_0$$
• The discounted distribution (PSet) $$d_{\mu_0}^{\pi} = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \underbrace{(P_\pi ^t)^\top \mu_0}_{d_{\mu_0,t}^{\pi}}$$
• In Unit 2 we no longer know distributions
• e.g. a Binomial $$p(x;n,p) = \binom{n}{x}p^x(1-p)^{n-x}$$
• e.g. we observe samples $$x_1, x_2, x_3,\dots$$ drawn from a Binomial with unknown parameters

When the initial state is fixed to a known $$s_0$$, i.e. $$\mu_0=e_{s_0}$$, we write $$d_{s_0,t}^{\pi}$$
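A minimal numerical sketch of these definitions (the 2-state chain here is illustrative, not from the lecture): the recursion $$d_t = P_\pi^\top d_{t-1}$$ and the discounted distribution, checked against the closed form $$(1-\gamma)(I-\gamma P_\pi^\top)^{-1}\mu_0$$.

```python
import numpy as np

# Illustrative 2-state chain induced by a fixed policy: P_pi[s, s'] = P(s'|s, pi(s))
P_pi = np.array([[0.9, 0.1],
                 [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])  # start deterministically in state 0
gamma = 0.5

# d_{mu0,t} = (P_pi^t)^T mu0, computed by the recursion d_t = P_pi^T d_{t-1}
d = mu0.copy()
for t in range(100):
    d = P_pi.T @ d

# Discounted distribution: (1-gamma) * sum_t gamma^t (P_pi^t)^T mu0 (truncated)
d_disc = np.zeros(2)
dt = mu0.copy()
for t in range(200):
    d_disc += (1 - gamma) * gamma**t * dt
    dt = P_pi.T @ dt

# Closed form: d_disc = (1-gamma) (I - gamma P_pi^T)^{-1} mu0
d_closed = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu0)
assert np.allclose(d_disc, d_closed)
```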

## Agenda

1. Recap: MDPs and Control

2. MBRL with Query Model

3. Sub-Optimality

4. Model Error

## Query Model

• Query model (also called generative model)
• setting where we can query, for any $$s$$ and $$a$$, the transition/dynamics to sample $$s'\sim P(s,a)$$
• This is black-box sampling access
• Directly applicable to games and physics simulators
• Simple starting point to understand sample complexity: how many samples are required for good performance?

## Query Model

Algorithm: MBRL with Queries

• Inputs: sample points $$\{s_i,a_i\}_{i=1}^N$$
• For $$i=1,\dots, N$$:
• sample $$s'_i \sim P(s_i, a_i)$$ and record $$(s'_i,s_i,a_i)$$
• Fit transition model $$\hat P$$ from data $$\{(s'_i,s_i,a_i)\}_{i=1}^N$$
• Design $$\hat \pi$$ using $$\hat P$$
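A minimal sketch of this loop in Python, assuming a tabular MDP; the `query` function, the random MDP, and the value-iteration planner are illustrative choices, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # true transitions (unknown to learner)
r = rng.uniform(size=(S, A))                      # known reward in [0, 1]

def query(s, a):
    """Black-box query model: sample s' ~ P(s, a)."""
    return rng.choice(S, p=P_true[s, a])

# Query n samples per (s, a) and fit P_hat by empirical frequencies
n = 1000
counts = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        for _ in range(n):
            counts[s, a, query(s, a)] += 1
P_hat = counts / n

# Design pi_hat on the model, here by value iteration on (P_hat, r)
V = np.zeros(S)
for _ in range(500):
    V = np.max(r + gamma * P_hat @ V, axis=1)     # (P_hat @ V)[s, a] = E[V(s')]
pi_hat = np.argmax(r + gamma * P_hat @ V, axis=1)
```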

## Agenda

1. Recap: MDPs and Control

2. MBRL with Query Model

3. Sub-Optimality

4. Model Error

## Example: Sub-Optimality

(Figure: the two-state MDP from Lecture 4, with states $$0$$ and $$1$$ and actions stay/switch; edges are labeled with transition probabilities $$1$$, $$p_1$$, $$1-p_1$$, $$p_2$$, $$1-p_2$$.)

• The reward is $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• Let $$\gamma=\frac{1}{2}$$
• Recall from Lecture 4 that the policy $$\pi(s)=$$stay is optimal if $$p_2\leq \frac{2p_1}{2- p_1}+\frac{1}{4}$$
• If we mis-estimate $$\hat p_1,\hat p_2$$, may choose the wrong policy
• Model error in $$\hat P$$ leads to sub-optimal policies
• Notation:
• $$P$$ is the true transition function
• $$V^{\pi}$$ is the true value of a policy $$\pi$$
• i.e., $$\mathsf{PolicyEval}(P,r,\pi)$$ "value on $$P$$"
• $$\hat V^{\pi}$$ is the estimated value of a policy $$\pi$$
• i.e., $$\mathsf{PolicyEval}(\hat P,r,\pi)$$ "value on $$\hat P$$"
• $$V^\star$$ and $$\pi^\star$$ are the true optimal value and policy
• Assumption for today's lecture: $$0\leq r(s,a) \leq 1$$ for all states and actions
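To make the "wrong policy" failure mode concrete, here is a sketch on a generic 2-state MDP (random, not necessarily the lecture's example): the policy that looks best on an estimate $$\hat P$$ can be sub-optimal on the true $$P$$, and the true optimum is never worse.

```python
import numpy as np
from itertools import product

gamma = 0.5
S, A = 2, 2

def policy_value(P, r, pi, gamma):
    """Exact policy evaluation: V = (I - gamma P_pi)^{-1} R_pi."""
    P_pi = P[np.arange(S), pi]          # P_pi[s, s'] = P(s'|s, pi(s))
    R_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))       # true transitions
r = rng.uniform(size=(S, A))

# Best deterministic policy under true P vs. under a (bad) estimate P_hat
policies = [np.array(pi) for pi in product(range(A), repeat=S)]
pi_star = max(policies, key=lambda pi: policy_value(P, r, pi, gamma)[0])

P_hat = rng.dirichlet(5 * np.ones(S), size=(S, A))
pi_hat = max(policies, key=lambda pi: policy_value(P_hat, r, pi, gamma)[0])

# pi_hat maximizes value on P_hat but may be sub-optimal on the true P
gap = policy_value(P, r, pi_star, gamma)[0] - policy_value(P, r, pi_hat, gamma)[0]
assert gap >= -1e-9                              # true optimum is never worse
```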

## Sub-Optimality

• Suppose Policy Iteration on $$\hat P, r$$ converges to a fixed policy $$\hat \pi^\star$$
• What do we know about $$\hat\pi^\star$$?
• The sub-optimality is
• $$V^\star(s_0) - V^{\hat\pi^\star}(s_0)$$
• $$=V^\star(s_0) - \hat V^{\hat\pi^\star}(s_0) + \hat V^{\hat\pi^\star}(s_0) - V^{\hat\pi^\star}(s_0)$$
• $$\leq V^\star(s_0) - \hat V^{\pi^\star}(s_0) + \hat V^{\hat\pi^\star}(s_0) - V^{\hat\pi^\star}(s_0)$$
• ($$\hat\pi^\star$$ optimal for $$\hat P$$)
• Two terms: value of $$\pi^\star$$ and value of $$\hat \pi^\star$$ on $$P$$ vs. $$\hat P$$

## Error in Policies

Simulation Lemma: For a deterministic policy $$\pi$$, $$|\hat V^\pi(s_0) - V^\pi(s_0)| \leq \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d_{\pi}^{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$

## Difference in Value

For a fixed policy, what is the difference in value when computed using $$P$$ vs. when using $$\hat P$$?

• error in predicted next state
• averaged over discounted state distribution


$$\sum_{s'\in\mathcal S} |\hat P(s'|s,\pi(s)) - P(s'|s,\pi(s))|$$: the total variation distance on the distribution over $$s'$$

Simulation Lemma: For a deterministic policy, $$|\hat V^\pi(s_0) - V^\pi(s_0)| \leq \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$
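A numerical sanity check of the Simulation Lemma (a sketch on a random MDP with rewards in $$[0,1]$$; the perturbation scheme is illustrative): the value gap at $$s_0$$ stays below $$\frac{\gamma}{(1-\gamma)^2}\,\mathbb E_{s\sim d^\pi_{s_0}}[\|\hat P - P\|_1]$$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma, s0 = 4, 0.9, 0
# P_pi[s, s'] = P(s'|s, pi(s)) for a fixed deterministic policy; rewards in [0, 1]
P_pi = rng.dirichlet(np.ones(S), size=S)
R_pi = rng.uniform(size=S)
# Perturbed model: mix the true rows with another stochastic matrix
Q = rng.dirichlet(np.ones(S), size=S)
P_hat = 0.95 * P_pi + 0.05 * Q

def value(P, R):
    """V^pi = (I - gamma P_pi)^{-1} R_pi."""
    return np.linalg.solve(np.eye(S) - gamma * P, R)

lhs = abs(value(P_hat, R_pi)[s0] - value(P_pi, R_pi)[s0])

# Discounted state distribution d_{s0}^pi = (1-gamma)(I - gamma P_pi^T)^{-1} e_{s0}
e0 = np.eye(S)[s0]
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e0)
tv = np.abs(P_hat - P_pi).sum(axis=1)    # ||P_hat(.|s) - P(.|s)||_1 per state
rhs = gamma / (1 - gamma) ** 2 * d @ tv
assert lhs <= rhs + 1e-9                 # the Simulation Lemma bound holds
```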

## Simulation Lemma Proof

• First, derive a recursion:
• $$\hat V^\pi-V^\pi = R^\pi + \gamma \hat P_\pi \hat V^\pi - R^\pi - \gamma P_\pi V^\pi$$ (Bellman Eq)
• $$=\gamma (\hat P_\pi \hat V^\pi - P_\pi \hat V^\pi + P_\pi \hat V^\pi - P_\pi V^\pi)$$ (simplify and add zero)
• $$=\gamma (\hat P_\pi - P_\pi) \hat V^\pi + \gamma P_\pi (\hat V^\pi - V^\pi)$$
• Iterating expression $$k$$ times:
• $$\hat V^\pi-V^\pi =\sum_{\ell=1}^k \gamma^{\ell} P_\pi^{\ell-1} (\hat P_\pi - P_\pi) \hat V^\pi + \gamma^k P_\pi^k (\hat V^\pi - V^\pi)$$
• $$\hat V^\pi-V^\pi =\sum_{\ell=1}^\infty \gamma^\ell P_\pi^{\ell-1} (\hat P_\pi - P_\pi) \hat V^\pi$$    (limit $$k\to\infty$$)

See the 2023 notes for an alternative proof without vector notation.

Simulation Lemma: For a deterministic policy, $$|\hat V^\pi(s_0) - V^\pi(s_0)| \leq \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$

## Simulation Lemma Proof

• Recursive argument: $$\hat V^\pi-V^\pi =\gamma \sum_{\ell=0}^\infty \gamma^\ell P_\pi^\ell (\hat P_\pi - P_\pi) \hat V^\pi$$
• $$\hat V^\pi(s_0) - V^\pi(s_0) = (\hat V^\pi-V^\pi)^\top e_{s_0}$$
• $$=\gamma\sum_{\ell=0}^\infty \gamma^\ell [(\hat P_\pi - P_\pi) \hat V^\pi]^\top (P_\pi^\ell)^\top e_{s_0}$$
• $$= \gamma [(\hat P_\pi - P_\pi) \hat V^\pi]^\top \sum_{\ell=0}^\infty \gamma^\ell (P_\pi^\ell)^\top e_{s_0}$$
• $$=\frac{\gamma}{1-\gamma} [(\hat P_\pi - P_\pi) \hat V^\pi]^\top d_\pi^{s_0}$$ (definition of discounted distribution)
• $$= \frac{\gamma}{1-\gamma}\sum_{s\in\mathcal S} \left((\hat P_\pi - P_\pi) \hat V^\pi\right) [s] d_\pi^{s_0}[s]$$
• $$=\frac{\gamma}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left [\left((\hat P_\pi - P_\pi) \hat V^\pi\right) [s] \right ]$$ (definition of expectation)

Simulation Lemma: For a deterministic policy, $$|\hat V^\pi(s_0) - V^\pi(s_0)| \leq \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$

## Simulation Lemma Proof

• Recursive argument: $$\hat V^\pi-V^\pi =\gamma \sum_{\ell=0}^\infty \gamma^\ell P_\pi^\ell (\hat P_\pi - P_\pi) \hat V^\pi$$
• $$\hat V^\pi(s_0) - V^\pi(s_0) = \frac{\gamma}{1-\gamma}\mathbb E_{s\sim d_\pi^{s_0}}\left [\left((\hat P_\pi - P_\pi) \hat V^\pi\right) [s] \right ]$$
• $$= \frac{\gamma}{1-\gamma}\mathbb E_{s\sim d_\pi^{s_0}}\left [(\hat P_\pi - P_\pi) [s]^\top \hat V^\pi \right ]$$ (defn of matrix multiplication)
• $$= \frac{\gamma}{1-\gamma}\mathbb E_{s\sim d_\pi^{s_0}}\left [\sum_{s'\in\mathcal S}(\hat P(s'|s,\pi(s)) - P(s'|s,\pi(s)))\hat V^\pi(s') \right ]$$
• $$|\hat V^\pi(s_0) - V^\pi(s_0)|\leq\frac{\gamma}{1-\gamma} \mathbb E_{s\sim d_\pi^{s_0}}\left [\sum_{s'\in\mathcal S}|\hat P(s'|s,\pi(s)) - P(s'|s,\pi(s))|\,|\hat V^\pi(s')| \right ]$$
• $$\leq\frac{\gamma}{1-\gamma} \mathbb E_{s\sim d_\pi^{s_0}}\left [\sum_{s'\in\mathcal S}|\hat P(s'|s,\pi(s)) - P(s'|s,\pi(s))|\frac{1}{1-\gamma} \right ]$$ (bounded reward: $$0\leq \hat V^\pi(s')\leq \frac{1}{1-\gamma}$$)
• $$= \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d_\pi^{s_0}}\left [\|\hat P(\cdot|s,\pi(s)) - P(\cdot|s,\pi(s))\|_1 \right ]$$, which proves the lemma

## Agenda

1. Recap: MDPs and Control

2. MBRL with Query Model

3. Sub-Optimality

4. Model Error

## Easy Case: Deterministic Transitions

• If transitions are deterministic in a finite MDP,
• sample every state and every action once
• exactly reconstruct deterministic $$P$$
• compute optimal policy
• Sample complexity?
• $$SA$$
• If transitions are stochastic, need to estimate probabilities
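The finite deterministic case can be sketched in a few lines (the query function `next_state` and the random transition table are illustrative): one query per $$(s,a)$$ pair, $$SA$$ samples total, reconstructs $$P$$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3
f_true = rng.integers(S, size=(S, A))  # unknown deterministic transition table

def next_state(s, a):
    """Black-box query: deterministic next state."""
    return f_true[s, a]

# One query per (s, a) pair: SA samples total, exact reconstruction
f_hat = np.array([[next_state(s, a) for a in range(A)] for s in range(S)])
assert np.array_equal(f_hat, f_true)
```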

## Easy Case: Deterministic Transitions

• If transitions are deterministic in linear continuous MDP,
• sample $$s'=f(s,a)$$ for every standard basis vector $$\{(e_i,0)\}_{i=1}^{n_s}\cup \{(0,e_j)\}_{j=1}^{n_a}$$
• exactly reconstruct $$A$$ and $$B$$
• $$s'_i = Ae_i$$ is a column of $$A$$
• $$s'_j = Be_j$$ is a column of $$B$$
• compute optimal LQR policy
• Sample complexity?
• $$n_s+n_a$$
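The linear deterministic case works the same way; a sketch (with illustrative random $$A$$, $$B$$): querying $$f(s,a)=As+Ba$$ at the standard basis vectors reads off the columns of $$A$$ and $$B$$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 4, 2
A_true = rng.normal(size=(n_s, n_s))   # unknown dynamics matrices
B_true = rng.normal(size=(n_s, n_a))

def f(s, a):
    """Black-box deterministic linear dynamics s' = A s + B a."""
    return A_true @ s + B_true @ a

# Columns of A: f(e_i, 0); columns of B: f(0, e_j); n_s + n_a queries total
A_hat = np.column_stack([f(np.eye(n_s)[:, i], np.zeros(n_a)) for i in range(n_s)])
B_hat = np.column_stack([f(np.zeros(n_s), np.eye(n_a)[:, j]) for j in range(n_a)])
assert np.allclose(A_hat, A_true) and np.allclose(B_hat, B_true)
```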

## Warmup: Coin & Dice

• Consider a biased coin which is heads with probability $$p$$ for an unknown value of $$p\in[0,1]$$
• How to estimate from trials?
• Flip the coin $$N$$ times: $$\hat p =\frac{\mathsf{\#~heads}}{N}$$
• Consider an $$S$$-sided die which lands on side $$s$$ with probability $$p_s$$ for $$s\in\{1,\dots,S\}=[S]$$, where the $$p_s$$ are unknown
• How to estimate from trials?
• Roll the die $$N$$ times: $$\hat p_s =\frac{\mathsf{\#~times~land~on}~s}{N}$$

## Warmup: Coin & Dice

• For the weighted coin,
• The count $$N\hat p$$ is distributed Binomial$$(N, p)$$
• $$|p-\hat p| \leq\sqrt{\frac{\log(2/\delta)}{N}}$$ with probability $$1-\delta$$
• For the $$S$$-sided die, with probability $$1-\delta$$, $$\max_{s\in[S]} |p_s-\hat p_s| \leq \sqrt{\frac{\log(2S/\delta)}{N}}$$
• Alternatively, the total variation distance w.p. $$1-\delta$$ $$\sum_{s\in[S]} |p_s-\hat p_s| \leq \sqrt{\frac{S\log(2/\delta)}{N}}$$
• Why? Unbiased $$\mathbb E[\hat p]=p$$ & concentration (Hoeffding's)
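A quick Monte Carlo sketch of the die estimate (illustrative check, not a proof): empirical frequencies concentrate, and the observed error can be compared to the $$\sqrt{\log(2S/\delta)/N}$$ high-probability bound.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N, delta = 6, 10_000, 0.05
p = rng.dirichlet(np.ones(S))   # unknown die probabilities

# Roll N times and estimate by empirical frequencies
rolls = rng.choice(S, size=N, p=p)
p_hat = np.bincount(rolls, minlength=S) / N

max_err = np.max(np.abs(p - p_hat))
bound = np.sqrt(np.log(2 * S / delta) / N)
print(max_err, bound)           # compare observed error to the bound
```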

## Estimation Errors

• This slide is not in scope. Hoeffding's inequality states that for independent random variables $$X_1,\dots,X_n$$ with $$a_i\leq X_i\leq b_i$$, let $$S_n=\sum_{i=1}^n X_i$$, then $$\mathbb P\{|S_n-\mathbb E[S_n]|\geq t\} \leq 2\exp\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right)$$
• Rearranging, this is equivalent to $$|\frac{1}{n}S_n-\mathbb E[\frac{1}{n} S_n]|\leq \frac{1}{n}\sqrt{\frac{1}{2}\log(2/\delta)\sum_{i=1}^n (b_i-a_i)^2} \quad\text{w.p. at least}\quad 1-\delta$$

## Transition Estimation

• How to estimate $$P(s,a)$$ from queries?
• Query $$s_i'\sim P(s,a)$$ for $$i=1,\dots, n$$ samples: $$\hat P(s'|s,a) =\frac{\mathsf{\#~times~}s'_i=s'}{n}$$
• Repeat procedure for all $$s,a$$
• Total number of samples $$N=SA\cdot n$$
• Lemma: Estimation error of above, with probability $$1-\delta$$ $$\max_{s,a }\sum_{s'\in\mathcal S} |P(s'|s,a)-\hat P(s'|s,a)| \leq \sqrt{\frac{S^2A \log(2SA/\delta)}{N}}$$
• The sub-optimality of $$\hat \pi^\star=\mathsf{PolicyIteration}(\hat P, r)$$
• $$V^\star(s_0) - V^{\hat\pi^\star}(s_0)$$
• $$\leq V^\star(s_0) - \hat V^{\pi^\star}(s_0) + \hat V^{\hat\pi^\star}(s_0) - V^{\hat\pi^\star}(s_0)$$
• $$\leq \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d_{\pi^\star}^{s_0}}\left[ \|\hat P(\cdot |s,\pi^\star(s)) - P(\cdot|s,\pi^\star(s))\|_1\right]$$
$$+\frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d_{\hat\pi^\star}^{s_0}}\left[ \|\hat P(\cdot |s,\hat\pi^\star(s)) - P(\cdot|s,\hat\pi^\star(s))\|_1\right]$$ (Simulation Lemma x2)
• $$\leq \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d_{\pi^\star}^{s_0}}\left[ \max_a \|\hat P(\cdot |s,a) - P(\cdot|s,a)\|_1\right]$$
$$+\frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d_{\hat\pi^\star}^{s_0}}\left[ \max_a\|\hat P(\cdot |s,a) - P(\cdot|s,a)\|_1\right]$$
• $$\leq \frac{2\gamma}{(1-\gamma)^2} \max_{s,a} \|\hat P(\cdot |s,a) - P(\cdot|s,a)\|_1$$

## Sub-Optimality

Theorem: For $$\delta\in(0,1)$$, run the MBRL Algorithm with $$N \geq \frac{4\gamma^2 S^2 A\log(2SA/\delta)}{\epsilon^2(1-\gamma)^4}$$. Then with probability at least $$1-\delta$$, $$V^\star(s)-V^{\hat \pi^\star}(s) \leq \epsilon\quad\forall~~s\in\mathcal S$$

## Sample Complexity

Algorithm: Tabular MBRL with Queries

• Inputs: total number of samples $$N$$
• For all $$(s,a)\in\mathcal S\times \mathcal A$$:
• For $$i=1,\dots, \frac{N}{SA}$$:
• sample $$s'_i \sim P(s, a)$$ and record $$(s'_i,s,a)$$
• Fit transition model $$\hat P$$ and design $$\hat \pi^\star$$ with PI

Proof Outline:

• From Suboptimality slide and Model Error Lemma, with probability $$1-\delta$$ $$V^\star(s_0) - V^{\hat\pi^\star}(s_0) \leq \frac{2\gamma}{(1-\gamma)^2} \sqrt{\frac{S^2A \log(2SA/\delta)}{N}}$$
• Set RHS equal to $$\epsilon$$, solve for $$N$$, and use the fact that $$\gamma<1$$.
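Spelled out, the last step of the outline is the algebra:

```latex
\frac{2\gamma}{(1-\gamma)^2}\sqrt{\frac{S^2 A \log(2SA/\delta)}{N}} \leq \epsilon
\quad\Longleftrightarrow\quad
N \geq \frac{4\gamma^2 S^2 A \log(2SA/\delta)}{\epsilon^2(1-\gamma)^4},
```

which matches the sample size in the Theorem; since $$\gamma<1$$, the same $$N$$ also suffices after dropping the $$\gamma^2$$ factor.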

## Recap

• PA due Friday
• Prelim in class 3/15

• Query Setting of MBRL
• Simulation Lemma
• Estimation

• Next lecture: Approximate Policy Iteration
