Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
[Figure: agent-environment interaction loop. The policy \(\pi\) maps experience, i.e. data \((s_t,a_t,r_t)\), to actions \(a_t\); the environment returns states \(s_t\) and rewards \(r_t\) according to transitions \(P, f\), which are unknown in Unit 2.]
Algorithm: Data collection
Proposition: The resulting dataset \(\{(s_i,a_i), y_i\}_{i=1}^N\)
Proof in future PSet
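A minimal sketch of one way such a dataset can be collected, assuming (as the summary slide's "Labels via rollouts of \(\pi\)" suggests) that each label \(y_i\) is a discounted Monte Carlo return; `env.reset`, `env.step`, and the horizon `H` are illustrative stand-ins rather than notation from the slides.

```python
def collect_dataset(env, pi, N, H, gamma):
    """Build {((s_i, a_i), y_i)}_{i=1..N}: each y_i is a truncated discounted return
    from a length-H rollout of pi, serving as an estimate of Q^pi(s_i, a_i)."""
    data = []
    for _ in range(N):
        s0 = env.reset()                # illustrative interface: reset() -> state
        a0 = pi(s0)
        s, a, rewards = s0, a0, []
        for t in range(H):
            s_next, r = env.step(s, a)  # illustrative interface: step(s, a) -> (s', r)
            rewards.append(r)
            s, a = s_next, pi(s_next)
        y = sum(gamma**t * r for t, r in enumerate(rewards))
        data.append(((s0, a0), y))
    return data
```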
Approximate Policy Iteration
[Figure: \(4\times 4\) gridworld with states numbered 0-15, shown twice.]
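For reference, a sketch of the approximate policy iteration loop this example illustrates: fit \(\hat Q^{\pi^i}\) from rollout data, then act greedily on the estimate. The routines `collect` and `fit_Q` are placeholders (e.g. the rollout sketch above and any supervised regression); only the greedy step is the algorithm itself.

```python
def approximate_policy_iteration(env, pi0, actions, iters, collect, fit_Q):
    """Approximate PI: estimate Q^{pi_i} from rollout data, then take the greedy policy.
    `collect(env, pi)` gathers a dataset {((s, a), y)} under the current policy;
    `fit_Q(data)` is any regression routine returning an estimate Q_hat(s, a)."""
    pi = pi0
    for i in range(iters):
        data = collect(env, pi)
        Q_hat = fit_Q(data)
        # greedy improvement with respect to the *estimated* action-value function
        pi = lambda s, Q=Q_hat: max(actions, key=lambda a: Q(s, a))
    return pi
```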
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
Performance Difference Lemma: For two policies, $$V^{\pi_{i+1}}(s_0) - V^{\pi_{i}}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{s_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi}(s,a) =Q^{\pi}(s,a) - V^{\pi}(s)\)
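One way to read the lemma (an immediate corollary, stated here for completeness): if the new policy has nonnegative advantage at every state, it cannot be worse, $$\mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a)\right]\geq 0 \;\;\forall s \quad\Longrightarrow\quad V^{\pi_{i+1}}(s_0)\geq V^{\pi_{i}}(s_0).$$ The difficulty is that the expectation is over \(d^{\pi_{i+1}}_{s_0}\), the visitation distribution of the *new* policy, which is not known before committing to it; a conservative update keeps \(\pi_{i+1}\) close to \(\pi_i\) so that \(d^{\pi_{i+1}}_{s_0}\approx d^{\pi_i}_{s_0}\).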
Conservative Policy Iteration
[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/switch; edges labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).]
[Figure: policies across iterations \(i=0,1,2\) alongside the greedy policy \(\bar\pi\).]
Case: \(p_1\) and \(p_2\) both close to \(1\)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
[Figure: \(\pi^i(\cdot\mid 1)\) on \([0,1]\), with the estimates \(\hat Q^{\pi^i}(1,\text{switch})\), \(\hat Q^{\pi^i}(1,\text{stay})\) and the policies \(\pi^{i}\), \(\bar\pi\), \(\pi^{i+1}\).]
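A sketch of this two-state example as explicit arrays, so the plotted quantities can be recomputed; the edge directions assumed here (stay tends to keep the current state, switch tends to move to the other state) are a reading of the diagram, not stated on the slide, and the numerical values of \(\gamma\), \(p_1\), \(p_2\) are illustrative.

```python
import numpy as np

gamma, p1, p2 = 0.9, 0.9, 0.9      # "p1 and p2 both close to 1"; values illustrative
STAY, SWITCH = 0, 1

# P[a, s, s']: assumed dynamics -- from state 0, stay remains at 0 and switch moves
# to 1 with probability 1; from state 1, stay remains at 1 with probability p1 and
# switch moves to 0 with probability p2.
P = np.zeros((2, 2, 2))
P[STAY, 0]   = [1.0, 0.0]
P[SWITCH, 0] = [0.0, 1.0]
P[STAY, 1]   = [1.0 - p1, p1]
P[SWITCH, 1] = [p2, 1.0 - p2]

# reward: +1 if s = 0, plus -1/2 whenever a = switch (rows: s; columns: stay, switch)
R = np.array([[1.0,  0.5],
              [0.0, -0.5]])

def policy_eval(pi):
    """Exact V^pi and Q^pi for a policy table pi[s, a], via the linear Bellman equations."""
    P_pi = np.einsum('sa,ast->st', pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    R_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return V, Q
```

With these in hand, quantities such as \(Q^{\pi^i}(1,\text{stay})\), \(Q^{\pi^i}(1,\text{switch})\), and \(V^{\pi^i}(1)\) can be computed exactly for any policy.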
[Figure: the same two-state MDP, with \(\pi^i(\cdot\mid 1)\) and the state-visitation probabilities \(d^{\pi^i}_1(0)\), \(d^{\pi^i}_1(1)\) plotted across iterations \(i=0,1,2\).]
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
Assumptions: the greedy policy \(\bar\pi\) is computed from exact advantages (no estimation error \(\epsilon\)), and rewards are bounded.
Both assumptions can be relaxed; the improvement guarantee would then depend on \(\epsilon\) and the reward bound.
[Figure: the same two-state MDP, with \(\pi^i(\cdot\mid 1)\), the true values \(Q^{\pi^i}(1,\text{switch})\), \(Q^{\pi^i}(1,\text{stay})\), \(V^{\pi^i}(1)\), and the greedy policy \(\bar\pi\).]
\(\mathbb A_i = \mathbb E_{s\sim d^{\pi^i}_{\mu_0}}\left[\mathbb E_{a\sim \bar\pi(s)}\left[A^{\pi^i}(s,a)\right]\right]\)
\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)
CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$
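A sketch of the conservative update this guarantee refers to, in the standard mixture form \(\pi^{i+1} = (1-\alpha)\pi^i + \alpha\bar\pi\); the particular step size \(\alpha\) used to obtain the bound above is not shown on these slides, so it is left as a parameter here.

```python
import numpy as np

def greedy_policy(Q):
    """Deterministic greedy policy bar_pi with respect to an action-value table Q[s, a]."""
    pi_bar = np.zeros_like(Q)
    pi_bar[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi_bar

def cpi_step(pi, Q, alpha):
    """One conservative update: pi_next(a|s) = (1 - alpha) pi(a|s) + alpha bar_pi(a|s)."""
    return (1.0 - alpha) * pi + alpha * greedy_policy(Q)
```

On the two-state example above, alternating `policy_eval` and `cpi_step` with a small \(\alpha\) moves \(\pi^i(\cdot\mid 1)\) toward \(\bar\pi\) only gradually, which is the behavior the figures appear to contrast with the greedy jumps of approximate policy iteration.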
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
[Figure (repeated from the recap): agent-environment interaction loop, with transitions \(P, f\) unknown in Unit 2.]
Labels via rollouts of \(\pi\):
...
...
...
Other key equations can inspire labels:
...
...
...
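As one illustration (a standard choice, not necessarily the equations intended in the elided bullets above), the Bellman consistency equation \(Q^{\pi}(s,a) = r(s,a) + \gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}\big[\mathbb E_{a'\sim \pi(s')}[Q^{\pi}(s',a')]\big]\) suggests bootstrapped labels built from single transitions: $$y_i = r_i + \gamma\, \mathbb E_{a'\sim \pi(s'_i)}\left[\hat Q^{\pi}(s'_i, a')\right].$$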