Prof. Sarah Dean
Assistant Professor of Computer Science, Cornell University
MW 2:55-4:10pm
255 Olin Hall
1. Recap: Fitted VI
2. Fitted Policy Iteration
3. Fitted Approx. Policy Evaluation
4. Fitted Direct Policy Evaluation
5. Performance Difference Lemma
Fitted Q-Value Iteration
Fixed point iteration using a fixed dataset and supervised learning
features x | labels y
---|---
\(s_0,a_0\) | \(r(s_0,a_0)+\gamma \max _{a} Q^k (s_{1}, a)\)
\(s_1,a_1\) | \(r(s_1,a_1)+\gamma \max _{a} Q^k (s_{2}, a)\)
\(s_2,a_2\) | \(r(s_2,a_2)+\gamma \max _{a} Q^k (s_{3}, a)\)
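The table above can be turned into code. A minimal sketch, assuming a tabular "regressor" (a per-\((s,a)\) mean over labels) stands in for the supervised learning step, and a synthetic dataset of \((s, a, r, s')\) transitions; the function name and the tiny MDP below are illustrative, not from the lecture:

```python
import numpy as np

def fitted_q_iteration(transitions, n_states, n_actions, gamma=0.9, n_iters=200):
    """Fitted Q-value iteration on a fixed dataset of (s, a, r, s') tuples.

    Each iteration builds regression labels r + gamma * max_a' Q^k(s', a')
    and fits Q^{k+1} to them; here the "fit" is a per-(s, a) average,
    standing in for a general supervised learner.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        labels = {}
        for (s, a, r, s_next) in transitions:
            y = r + gamma * Q[s_next].max()     # label built from current Q^k
            labels.setdefault((s, a), []).append(y)
        Q_new = Q.copy()
        for (s, a), ys in labels.items():
            Q_new[s, a] = np.mean(ys)           # tabular least-squares fit
        Q = Q_new
    return Q
```

On a two-state example where state 1 is absorbing with reward 1 and \(\gamma = 0.9\), the iterates converge to \(Q(1,\cdot) = 1/(1-\gamma) = 10\) and \(Q(0,1) = \gamma \cdot 10 = 9\).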
Policy Iteration
Approximate Policy Evaluation:
Exact Policy Evaluation:
Q-Policy Iteration
Policy Evaluation:
Brainstorm!
Fitted Policy Iteration
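One way the pieces can fit together: a sketch of fitted policy iteration that alternates approximate policy evaluation of \(Q^\pi\) on a fixed dataset with greedy policy improvement. The per-\((s,a)\) mean fit and the function name are assumptions standing in for a general supervised learner, not the lecture's specific setup:

```python
import numpy as np

def fitted_policy_iteration(transitions, n_states, n_actions, gamma=0.9, n_rounds=20):
    """Alternate fitted policy evaluation and greedy improvement
    on a fixed dataset of (s, a, r, s') tuples."""
    policy = np.zeros(n_states, dtype=int)       # start with action 0 everywhere
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_rounds):
        # Approximate policy evaluation: iterate the Bellman operator
        # for the *current* policy over the dataset.
        for _ in range(100):
            labels = {}
            for (s, a, r, s_next) in transitions:
                y = r + gamma * Q[s_next, policy[s_next]]
                labels.setdefault((s, a), []).append(y)
            for (s, a), ys in labels.items():
                Q[s, a] = np.mean(ys)
        # Policy improvement: act greedily with respect to the fitted Q^pi.
        policy = Q.argmax(axis=1)
    return policy, Q
```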
1. Recap: Fitted VI
2. Fitted Policy Iteration
3. Fitted Approx. Policy Evaluation
4. Fitted Direct Policy Evaluation
5. Performance Difference Lemma
Approximate Policy Evaluation:
\(s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1}),\quad \dots\)
Fitted Approximate Policy Evaluation (TD):
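The TD-style fitted evaluation above can be sketched as follows: from a rollout of \(\pi\), regress \(Q^{k+1}(s_t, a_t)\) onto the bootstrapped label \(r_t + \gamma Q^k(s_{t+1}, a_{t+1})\). As before, a per-\((s,a)\) mean stands in for the supervised learning step, and the function name is illustrative:

```python
import numpy as np

def fitted_td_evaluation(trajectory, n_states, n_actions, gamma=0.9, n_iters=200):
    """Fitted approximate policy evaluation (TD): the trajectory is a list of
    (s_t, a_t, r_t, s_{t+1}, a_{t+1}) tuples collected by rolling out pi."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        labels = {}
        for (s, a, r, s_next, a_next) in trajectory:
            y = r + gamma * Q[s_next, a_next]   # bootstrap with current Q^k
            labels.setdefault((s, a), []).append(y)
        for (s, a), ys in labels.items():
            Q[s, a] = np.mean(ys)               # tabular fit of Q^{k+1}
    return Q
```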
Fitted Direct Policy Evaluation (MC):
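The Monte Carlo variant regresses directly onto observed returns instead of bootstrapped labels: each visited \((s_t, a_t)\) is labeled with the discounted return-to-go \(\sum_{k\ge t} \gamma^{k-t} r_k\). A minimal sketch under the same tabular-fit assumption:

```python
import numpy as np

def fitted_mc_evaluation(episodes, n_states, n_actions, gamma=0.9):
    """Fitted direct policy evaluation (MC): each episode is a list of
    (s_t, a_t, r_t) tuples from rolling out pi; labels are empirical
    discounted returns-to-go, with no bootstrapping."""
    labels = {}
    for episode in episodes:
        G = 0.0
        for (s, a, r) in reversed(episode):     # return-to-go, computed backward
            G = r + gamma * G
            labels.setdefault((s, a), []).append(G)
    Q = np.zeros((n_states, n_actions))
    for (s, a), ys in labels.items():
        Q[s, a] = np.mean(ys)                   # regression fit of Q^pi
    return Q
```

Unlike the TD version, a single pass suffices: the labels do not depend on the current estimate of \(Q\), at the cost of higher-variance labels from full returns.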
Fitted Policy Iteration
Example: a 4×4 gridworld with states numbered 0–15:

0 | 1 | 2 | 3
4 | 5 | 6 | 7
8 | 9 | 10 | 11
12 | 13 | 14 | 15
Why incremental updates in Fitted PI?
Define: the advantage function \(A^{\pi}(s,a) =Q^{\pi}(s,a) - V^{\pi}(s)\)
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)
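A short proof sketch (the standard telescoping argument, not spelled out on the slides): write \(V^\pi(s_0)\) as the expected discounted reward along a trajectory \(\tau\) from rolling out \(\pi\), then absorb \(V^{\pi'}(s_0)\) into the sum, since the added value terms telescope:

$$V^\pi(s_0) - V^{\pi'}(s_0) = \mathbb E_{\tau\sim\pi}\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\right] - V^{\pi'}(s_0) = \mathbb E_{\tau\sim\pi}\left[\sum_{t=0}^\infty \gamma^t \left(r(s_t,a_t) + \gamma V^{\pi'}(s_{t+1}) - V^{\pi'}(s_t)\right)\right]$$

Taking the expectation over \(s_{t+1}\sim P(s_t,a_t)\) turns \(r(s_t,a_t)+\gamma V^{\pi'}(s_{t+1})\) into \(Q^{\pi'}(s_t,a_t)\), so each summand becomes \(\gamma^t A^{\pi'}(s_t,a_t)\); rewriting the discounted sum over time as an expectation over the discounted state distribution \(d^\pi_{s_0}\) produces the \(\frac{1}{1-\gamma}\) factor.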