Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Model-based RL with Queries
Tabular MBRL
Theorem: With \(N \gtrsim \frac{S^2 A}{\epsilon^2}\) samples, tabular MBRL returns an \(\epsilon\)-suboptimal policy with high probability
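A minimal sketch of the tabular MBRL procedure behind this guarantee, assuming access to a generative model `sample(s, a)`; the interface and the per-\((s,a)\) sampling budget `n` are illustrative assumptions, not the lecture's exact setup:

```python
import numpy as np

def tabular_mbrl(sample, S, A, n, gamma, vi_iters=1000):
    """Model-based RL with a generative model (sketch).

    sample(s, a) -> (s_next, reward): one draw from P(.|s,a) and r(s,a).
    Builds the empirical model from n samples per (s, a), then plans
    in it with value iteration and returns the greedy policy.
    """
    P_hat = np.zeros((S, A, S))
    r_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(n):
                s_next, reward = sample(s, a)
                P_hat[s, a, s_next] += 1.0 / n
                r_hat[s, a] += reward / n

    Q = np.zeros((S, A))
    for _ in range(vi_iters):
        Q = r_hat + gamma * P_hat @ Q.max(axis=1)   # Bellman optimality update
    return Q.argmax(axis=1)                         # greedy policy in the learned model
```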
Policy Iteration
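For reference, a compact sketch of exact tabular policy iteration (evaluation by a linear solve, then greedy improvement); the `(P, r)` array interface is an illustrative assumption:

```python
import numpy as np

def policy_iteration(P, r, gamma, iters=100):
    """Exact tabular policy iteration.

    P: (S, A, S) transition probabilities, r: (S, A) rewards.
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi
        P_pi = P[np.arange(S), pi]          # (S, S)
        r_pi = r[np.arange(S), pi]          # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to Q^pi
        Q = r + gamma * P @ V               # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi, V
```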
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
[Figure: Control feedback — the policy \(\pi\) selects action \(a_t\); the transitions \(P,f\) return state \(s_t\) and reward \(r_t\).]
[Figure: Data feedback — the same loop, with experience data \((s_t,a_t,r_t)\) fed back to update the policy; the transitions \(P,f\) are unknown, as in Unit 2. Annotations: MBRL (Lec 11), this week, after prelim.]
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Rollout: \(s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
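In code, one rollout under \(\pi\) might be sketched as below; the gym-style `env.reset()` / `env.step(a)` interface and the `pi(s)` callable are hypothetical, not specified in the lecture:

```python
def rollout(env, pi, horizon):
    """Roll out policy pi for `horizon` steps and return the trajectory."""
    traj = []
    s = env.reset()                 # s_0 ~ mu_0
    for _ in range(horizon):
        a = pi(s)                   # a_t ~ pi(s_t)
        s_next, r = env.step(a)     # r_t ~ r(s_t, a_t), s_{t+1} ~ P(s_t, a_t)
        traj.append((s, a, r))
        s = s_next
    return traj
```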
\(0\)
\(1\)
stay: \(1\)
switch: \(1\)
stay: \(p_1\)
switch: \(1-p_2\)
stay: \(1-p_1\)
switch: \(p_2\)
Algorithm: Data collection
Rollout: \(s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
Timestep sampling with discount/geometric distribution: set \(h_i=h\geq 0\) with probability \((1-\gamma)\gamma^h\)
Proposition: The resulting dataset \(\{((s_i,a_i),\, y_i)\}_{i=1}^N\) satisfies \(s_i,a_i\sim d_{\mu_0}^\pi\), and each label \(y_i\) is an unbiased estimate of \(Q^\pi(s_i,a_i)\).
Notation note: \(s,a\sim d_{\mu_0}^\pi\) is a compact way of writing \(s\sim d_{\mu_0}^\pi\) and \(a\sim\pi\).
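A minimal sketch of this data-collection procedure, reusing the hypothetical `env`/`pi` interface from the rollout sketch above. The label construction (undiscounted rewards summed over a second, independent geometric horizon) is one standard way to obtain an unbiased estimate of \(Q^\pi(s_i,a_i)\); it is an assumption here, not necessarily the lecture's exact choice:

```python
import numpy as np

def collect_dataset(env, pi, gamma, N, rng):
    """Collect {((s_i, a_i), y_i)} with (s_i, a_i) ~ d^pi_{mu_0}."""
    data = []
    for _ in range(N):
        # Timestep sampling: h = k with probability (1 - gamma) * gamma^k, k >= 0
        # (numpy's geometric is supported on {1, 2, ...}, hence the -1).
        h = rng.geometric(1.0 - gamma) - 1
        s = env.reset()                        # s_0 ~ mu_0
        for _ in range(h):                     # roll forward to s_h under pi
            s, _ = env.step(pi(s))
        a = pi(s)                              # a_h ~ pi(s_h), so (s, a) ~ d^pi_{mu_0}

        # Label: sum rewards over an independent geometric horizon T.
        # Since P(T >= t) = gamma^t, E[y | s, a] = sum_t gamma^t E[r_t] = Q^pi(s, a).
        T = rng.geometric(1.0 - gamma) - 1
        y, a_cur = 0.0, a
        for _ in range(T + 1):
            s_next, reward = env.step(a_cur)
            y += reward
            a_cur = pi(s_next)
        data.append(((s, a), y))
    return data
```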
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Approximate Policy Iteration
\(4\times 4\) gridworld example, states numbered 0-15:
| 0 | 1 | 2 | 3 |
| 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 |
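One round of approximate policy iteration can then be sketched as: fit \(\hat Q \approx Q^\pi\) by regression on the rollout dataset \(\{((s_i,a_i), y_i)\}\), and return the greedy policy with respect to \(\hat Q\). The per-cell sample-mean "regressor" below is a tabular stand-in for a general function class; the names are illustrative:

```python
import numpy as np

def approx_policy_iteration_step(data, S, A):
    """One approximate PI step from a dataset of ((s, a), y) pairs.

    Fits Q_hat(s, a) as the sample mean of the labels in each (s, a)
    cell, then improves greedily: pi'(s) = argmax_a Q_hat(s, a).
    """
    sums = np.zeros((S, A))
    counts = np.zeros((S, A))
    for (s, a), y in data:
        sums[s, a] += y
        counts[s, a] += 1
    Q_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return Q_hat.argmax(axis=1)        # greedy improved policy pi'
```

Alternating `collect_dataset` (evaluation) and `approx_policy_iteration_step` (improvement) gives the approximate policy iteration loop.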
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)
The advantage function \(A^{\pi'}(s,a)\) measures how much better it is to take action \(a\) in state \(s\) and then follow \(\pi'\), compared to following \(\pi'\) from \(s\) onward.
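The lemma can be sanity-checked numerically on a small random MDP; the sketch below computes \(V\), \(Q\), the advantage, and the discounted occupancy \(d^\pi_{s_0}\) exactly via linear solves and compares the two sides (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random MDP and two random deterministic policies pi and pi'
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi = rng.integers(A, size=S)
pi_prime = rng.integers(A, size=S)

def evaluate(policy):
    """Exact V and Q for a deterministic policy via a linear solve."""
    P_pol = P[np.arange(S), policy]                      # (S, S)
    r_pol = r[np.arange(S), policy]                      # (S,)
    V = np.linalg.solve(np.eye(S) - gamma * P_pol, r_pol)
    Q = r + gamma * P @ V                                # (S, A)
    return V, Q

V_pi, _ = evaluate(pi)
V_pp, Q_pp = evaluate(pi_prime)
A_pp = Q_pp - V_pp[:, None]                              # advantage of pi'

# Discounted occupancy d^pi_{s0}(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s)
s0 = 0
P_pi = P[np.arange(S), pi]
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, np.eye(S)[s0])

lhs = V_pi[s0] - V_pp[s0]
rhs = np.sum(d * A_pp[np.arange(S), pi]) / (1 - gamma)
print(np.isclose(lhs, rhs))   # expect True
```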