Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap: MDPs and Control
2. MBRL with Query Model
3. Sub-Optimality
4. Model Error
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
Infinite horizon discounted MDP with finite states and actions
maximize over \(\pi\): \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
minimize over \(\pi\): \(\displaystyle\sum_{t=0}^{H-1} c(s_t, a_t)\)
s.t. \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)
Finite horizon deterministic MDP with continuous states/actions
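For the discounted MDP, the objective evaluated from a fixed initial state defines the value function, which is the quantity compared under the true and estimated models later in the lecture:
$$V^\pi(s_0) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \,\Big|\, s_0\right], \qquad a_t\sim\pi(s_t),\; s_{t+1}\sim P(s_t,a_t).$$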
[Figure: agent-environment interaction loop; the policy selects action \(a_t\), and the environment returns state \(s_t\) and reward \(r_t\).]
When the initial state is fixed to a known \(s_0\), i.e. \(\mu_0=e_{s_0}\), we write \(d_{s_0,t}^{\pi}\).
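The simulation lemma below is stated with an expectation over the discounted state visitation distribution; under the standard normalization (an assumption consistent with the \(\frac{\gamma}{(1-\gamma)^2}\) constant appearing there),
$$d^{\pi}_{s_0}(s) = (1-\gamma)\sum_{t=0}^\infty \gamma^t\, d^{\pi}_{s_0,t}(s).$$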
1. Recap: MDPs and Control
2. MBRL with Query Model
3. Sub-Optimality
4. Model Error
Algorithm: MBRL with Queries
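Below is a minimal sketch of the query-based procedure, assuming a generative model `query(s, a)` that returns one sample \(s'\sim P(\cdot|s,a)\); the function names and the choice of value iteration as the planner are illustrative, but the structure (query each state-action pair \(N\) times, form the empirical model \(\hat P\), then plan in \(\hat{\mathcal M}\)) follows the algorithm named above.

```python
import numpy as np

def mbrl_with_queries(query, S, A, r, gamma, N, vi_iters=1000):
    """Model-based RL with a query (generative) model.

    query(s, a) returns one sample s' ~ P(.|s, a); r is an S x A reward array.
    """
    # 1. Model estimation: query each (s, a) pair N times and count outcomes.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, query(s, a)] += 1
    P_hat /= N  # empirical transition probabilities

    # 2. Planning in the estimated MDP (value iteration on M_hat).
    V = np.zeros(S)
    for _ in range(vi_iters):
        Q = r + gamma * P_hat @ V   # Q[s, a] = r(s, a) + gamma * sum_s' P_hat(s'|s, a) V(s')
        V = Q.max(axis=1)

    # 3. Greedy policy with respect to the estimated Q-values.
    pi_hat = Q.argmax(axis=1)
    return pi_hat, P_hat
```

In the analysis, \(\hat \pi^\star\) denotes the exact optimal policy of \(\hat{\mathcal M}\); here value iteration is simply run long enough that the returned policy approximates it well.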
1. Recap: MDPs and Control
2. MBRL with Query Model
3. Sub-Optimality
4. Model Error
[Figure: two-state MDP example with states \(0\) and \(1\) and actions stay/switch; edges labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).]
Simulation Lemma: For a deterministic policy \(\pi\), $$|\hat V^\pi(s_0) - V^\pi(s_0)| \leq \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$
For a fixed policy, what is the difference in value when computed using \(P\) vs. when using \(\hat P\)?
Here \(\|\hat P(\cdot|s,\pi(s)) - P(\cdot|s,\pi(s))\|_1 = \sum_{s'\in\mathcal S} |\hat P(s'|s,\pi(s)) - P(s'|s,\pi(s))|\) is the total variation distance on the distribution over \(s'\).
Simulation Lemma: For a deterministic policy, $$|\hat V^\pi(s_0) - V^\pi(s_0)| \leq \frac{\gamma}{(1-\gamma)^2} \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$
An alternative proof that avoids vector notation is also available.
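A sketch of the vector-notation argument, assuming rewards in \([0,1]\) and the normalized visitation distribution defined earlier: writing \(P_\pi\) for the transition matrix of \(\pi\) and using the Bellman equations \(V^\pi = r_\pi + \gamma P_\pi V^\pi\) and \(\hat V^\pi = r_\pi + \gamma \hat P_\pi \hat V^\pi\),
$$V^\pi - \hat V^\pi = \gamma (I - \gamma P_\pi)^{-1}(P_\pi - \hat P_\pi)\hat V^\pi.$$
Evaluating at \(s_0\) and using \(e_{s_0}^\top (I-\gamma P_\pi)^{-1} = \frac{1}{1-\gamma}(d^{\pi}_{s_0})^\top\), \(\big|[(P_\pi - \hat P_\pi)\hat V^\pi](s)\big| \leq \|\hat P(\cdot|s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\,\|\hat V^\pi\|_\infty\), and \(\|\hat V^\pi\|_\infty \leq \frac{1}{1-\gamma}\) gives the stated bound.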
1. Recap: MDPs and Control
2. MBRL with Query Model
3. Sub-Optimality
4. Model Error
Theorem: For \(0< \delta< 1\), run the MBRL Algorithm with \(N \geq \frac{4\gamma^2 S^2 A\log(2SA/\delta)}{\epsilon^2(1-\gamma)^4}\). Then with probability at least \(1-\delta\), $$V^\star(s)-V^{\hat \pi^\star}(s) \leq \epsilon\quad\forall~~s\in\mathcal S$$
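For a sense of scale, the sample-size requirement can be evaluated directly; in the sketch below the numerical values of \(S, A, \gamma, \epsilon, \delta\) are arbitrary, and \(N\) is read as queries per state-action pair (so roughly \(N\cdot S\cdot A\) queries in total).

```python
import numpy as np

def required_queries_per_pair(S, A, gamma, eps, delta):
    """N from the theorem: 4 gamma^2 S^2 A log(2SA/delta) / (eps^2 (1 - gamma)^4)."""
    return 4 * gamma**2 * S**2 * A * np.log(2 * S * A / delta) / (eps**2 * (1 - gamma)**4)

# Example with arbitrary values: even a small MDP needs many queries when gamma is near 1.
N = required_queries_per_pair(S=10, A=5, gamma=0.9, eps=0.1, delta=0.05)
print(f"queries per (s,a): {N:.2e}, total: {N * 10 * 5:.2e}")
```

The \((1-\gamma)^{-4}\) factor dominates: halving \(1-\gamma\) multiplies the requirement by \(16\).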
Algorithm: Tabular MBRL with Queries
Proof Outline:
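(A sketch of the standard argument, stated without tracking precise constants.)
1. Concentration: for each \((s,a)\), the empirical model built from \(N\) queries satisfies, after a union bound over all \(SA\) pairs, with probability at least \(1-\delta\), $$\|\hat P(\cdot|s,a) - P(\cdot|s,a)\|_1 \lesssim \sqrt{\frac{S\log(SA/\delta)}{N}}\quad\forall~(s,a).$$
2. Simulation Lemma: on this event, once \(N\) is as large as the theorem requires, \(|\hat V^\pi(s) - V^\pi(s)| \leq \epsilon/2\) for all states \(s\), in particular for both \(\pi^\star\) and \(\hat\pi^\star\).
3. Decompose the sub-optimality and use that \(\hat\pi^\star\) is optimal in \(\hat{\mathcal M}\), so \(\hat V^{\pi^\star}(s) - \hat V^{\hat\pi^\star}(s) \leq 0\): $$V^\star(s) - V^{\hat\pi^\star}(s) = \big(V^{\pi^\star}(s) - \hat V^{\pi^\star}(s)\big) + \big(\hat V^{\pi^\star}(s) - \hat V^{\hat\pi^\star}(s)\big) + \big(\hat V^{\hat\pi^\star}(s) - V^{\hat\pi^\star}(s)\big) \leq \epsilon.$$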