Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap: VI
2. Policy Iteration
3. PI Convergence
4. VI/PI Comparison
Value Iteration
Q Value Iteration
We can think of the Q function as an \(S\times A\) array or an \(SA\) vector
Define the Bellman Operator \(\mathcal J_\star:\mathbb R^S\to\mathbb R^S\) as $$\mathcal J_\star[V](s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} [V(s')] \right]~~\forall~s$$
VI: For \(i=0,\dots,N-1\): \(V_{i+1} = \mathcal J_\star[V_i]\)
Convergence: For iterates \(V_i\) of VI, \(\|V_i - V^\star\|_\infty \leq \gamma^i \|V_0-V^\star\|_\infty\)
Theorem (Suboptimality): For policy \(\hat\pi\) from \(N\) steps of VI, \(\forall s\) $$ V^\star(s) - V^{\hat\pi}(s) \leq \frac{2\gamma}{1-\gamma} \cdot \gamma^N \|V_0-V^\star\|_\infty$$
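The recap above can be sketched in code; a minimal tabular VI, assuming a reward array `R[s, a]` and transition tensor `P[s, a, s']` (array names are illustrative, not from the lecture):

```python
import numpy as np

def value_iteration(R, P, gamma, N):
    """N steps of VI on a tabular MDP (a sketch, not the lecture's code).

    R: (S, A) rewards r(s, a); P: (S, A, S) transitions with
    P[s, a, s'] = Pr(s' | s, a).
    """
    S, A = R.shape
    V = np.zeros(S)                          # V_0 = 0
    for _ in range(N):
        # Bellman operator: J*[V](s) = max_a [ r(s,a) + gamma E_{s'}[V(s')] ]
        V = (R + gamma * P @ V).max(axis=1)  # P @ V contracts over s'
    pi = (R + gamma * P @ V).argmax(axis=1)  # greedy policy from V_N
    return V, pi
```

By the convergence bound above, the error \(\|V_N - V^\star\|_\infty\) shrinks by a factor \(\gamma\) per iteration.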
(Figure: a two-state MDP with states \(0\) and \(1\) and actions stay/move; the diagram labels transition probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\) on the stay/move edges.)
Four possible policies: one choice of stay or move per state. The plot compares their values \(V(0)\) and \(V(1)\), each bounded above by \(\frac{1}{1-\gamma}\).
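Each of the four policies in the two-state example can be evaluated in closed form via \(V^\pi = (I - \gamma P^\pi)^{-1} r^\pi\); a sketch, with illustrative values for \(p_1, p_2\) and an assumed reward of \(1\) in state \(1\) (none of these numbers come from the lecture):

```python
import numpy as np
from itertools import product

gamma, p1, p2 = 0.9, 0.8, 0.7   # illustrative values, not from the lecture
# Illustrative dynamics: stay in s=1 succeeds w.p. p1; move from s=0
# reaches s=1 w.p. p2; stay in s=0 and move from s=1 are deterministic.
P = {  # P[a][s, s'] transition matrix for each action
    "stay": np.array([[1.0, 0.0], [1 - p1, p1]]),
    "move": np.array([[1 - p2, p2], [1.0, 0.0]]),
}
r = np.array([0.0, 1.0])  # assumed reward: 1 in state 1, 0 in state 0

# Enumerate the four deterministic policies (one action per state)
for a0, a1 in product(["stay", "move"], repeat=2):
    P_pi = np.vstack([P[a0][0], P[a1][1]])            # row s follows pi(s)
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r)  # V = (I - g P)^-1 r
    print(f"pi=({a0},{a1}): V(0)={V[0]:.2f}, V(1)={V[1]:.2f}")
```

All four values stay below the \(\frac{1}{1-\gamma}\) bound, since rewards are at most \(1\).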
Policy Iteration
(Gridworld example: states \(0\)–\(15\) arranged in a \(4\times 4\) grid, plus an additional state \(16\).)
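A minimal tabular PI sketch, alternating exact policy evaluation with greedy improvement (array names `R`, `P` are illustrative, not from the slides):

```python
import numpy as np

def policy_iteration(R, P, gamma):
    """Exact PI: evaluate pi, improve greedily, stop when pi is stable.

    R: (S, A) rewards; P: (S, A, S) transitions (illustrative layout).
    """
    S, A = R.shape
    pi = np.zeros(S, dtype=int)  # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = r_pi + gamma * P_pi V exactly
        r_pi = R[np.arange(S), pi]
        P_pi = P[np.arange(S), pi]           # (S, S) matrix under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to Q^pi
        pi_new = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```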
Lemma (Monotonic Improvement): For iterates \(\pi^i\) of PI, the value monotonically improves, i.e. \( V^{\pi^{i+1}} \geq V^{\pi^{i}}\)
Proof:
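One standard sketch (the slide leaves the proof for lecture), writing \(\mathcal J^\pi[V] = r^\pi + \gamma P^\pi V\) for the fixed-policy Bellman operator:

```latex
% pi^{i+1} is greedy with respect to V^{pi^i}, so
\mathcal J^{\pi^{i+1}}[V^{\pi^i}]
  = \mathcal J_\star[V^{\pi^i}]
  \geq \mathcal J^{\pi^i}[V^{\pi^i}]
  = V^{\pi^i}.
% J^{pi^{i+1}} is monotone because P^{pi^{i+1}} has nonnegative entries; iterating,
V^{\pi^i}
  \leq \mathcal J^{\pi^{i+1}}[V^{\pi^i}]
  \leq \big(\mathcal J^{\pi^{i+1}}\big)^k[V^{\pi^i}]
  \xrightarrow{\,k\to\infty\,} V^{\pi^{i+1}}.
```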
What about VI? Do the iterates \(V_i\) of value iteration also improve monotonically? (in-class poll via PollEv)
Consider vectors \(V, V'\) and a matrix \(P\) with nonnegative entries. In homework, you will show that if \(V\leq V'\) then \(PV\leq PV'\) (inequalities hold entrywise).
You will also show that \((P^\pi)^k\) is bounded when \(P^\pi\) is a stochastic matrix.
Theorem (PI Convergence): For \(\pi^i\) from PI, $$ \|V^{\pi^{i}}-V^\star\|_\infty \leq \gamma^i \|V^{\pi^{0}}-V^\star\|_\infty$$
Proof:
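A sketch of the standard contraction argument:

```latex
% From the improvement lemma's argument, V^{pi^{i+1}} >= J_star[V^{pi^i}],
% and V^star >= V^{pi^{i+1}} always, so entrywise:
0 \leq V^\star - V^{\pi^{i+1}}
  \leq \mathcal J_\star[V^\star] - \mathcal J_\star[V^{\pi^i}]
  \leq \gamma \, \|V^\star - V^{\pi^i}\|_\infty \mathbf{1}.
% Taking the sup norm and inducting over i gives
\|V^{\pi^{i}} - V^\star\|_\infty \leq \gamma^{i}\, \|V^{\pi^{0}} - V^\star\|_\infty.
```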
Policy Iteration
Value Iteration
PI finds an exactly optimal policy in a finite number of iterations: the value improves monotonically, so no policy repeats, and there are at most \(|A|^{|S|}\) deterministic policies.
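A small check of this claim on a random tabular MDP (all names and numbers here are illustrative): PI typically stabilizes after far fewer iterations than the \(|A|^{|S|}\) worst-case bound.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
R = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize to a valid transition tensor

# Policy iteration: count iterations until the greedy policy is stable
pi = np.zeros(S, dtype=int)
for it in range(1, 100):
    r_pi = R[np.arange(S), pi]
    P_pi = P[np.arange(S), pi]
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # evaluate pi
    pi_new = (R + gamma * P @ V).argmax(axis=1)           # improve greedily
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(f"PI converged exactly in {it} iterations")  # vs |A|^|S| = 243 policies
```

At termination \(V\) satisfies the Bellman optimality equation exactly, whereas VI only approaches \(V^\star\) at rate \(\gamma^i\).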
Finite Horizon
Infinite Horizon