Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Bellman Optimality
3. Value Iteration
4. VI Convergence
5. Proof of BOE
Accumulate discounted reward on infinite horizon: $$V^\pi(s) = \mathbb E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \mid s_0=s, s_{t+1}\sim P(s_t,a_t), a_t\sim \pi(s_t) \right]$$
Bellman Consistency Equation leads to Exact & Approximate Policy Evaluation (PE) algorithms
Approximate Policy Evaluation is a fixed point iteration of the Bellman Operator, which is a contraction $$\mathcal J_\pi[V] = R^\pi + \gamma P^\pi V$$
assuming deterministic reward function and stationary, state-dependent policy (possibly stochastic)
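As a concrete illustration of these two PE variants, here is a minimal NumPy sketch for a tabular MDP, assuming the policy's reward vector \(R^\pi\) and transition matrix \(P^\pi\) are given as arrays (the function and variable names are placeholders for illustration):

```python
import numpy as np

def exact_pe(R_pi, P_pi, gamma):
    """Exact PE: solve the Bellman consistency equation V = R^pi + gamma * P^pi V as a linear system."""
    S = len(R_pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def approx_pe(R_pi, P_pi, gamma, num_iters=100):
    """Approximate PE: fixed-point iteration of the Bellman operator J_pi[V] = R^pi + gamma * P^pi V."""
    V = np.zeros(len(R_pi))
    for _ in range(num_iters):
        V = R_pi + gamma * P_pi @ V   # contraction, so the error shrinks by a factor gamma per step
    return V
```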
Optimal policies uniformly dominate in value
i.e. they have the highest value \(V^\star(s)\) for all \(s\)
The finite horizon Bellman Optimality Equation (BOE) enables efficient policy optimization with dynamic programming
The optimal policy is greedy with respect to the optimal value \(V^\star\)
2. Bellman Optimality
How can we efficiently find a policy \(\pi\in\Pi\) that maximizes the expected discounted reward?
Interaction: \(a_t=\pi_t(s_t)\), \(r_t= r(s_t, a_t)\), \(s_{t}\sim P(s_{t-1}, a_{t-1})\)
\(\Pi\) is the set of all possible policies (including stochastic and history-dependent ones)
Theorem (Bellman Optimality): A policy \(\pi\) is optimal if and only if, for all \(s\), $$V^\pi(s) = \max_{a} \Big[ r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}\big[V^\pi(s')\big] \Big]$$
In particular, the optimal value satisfies $$V^\star(s) = \max_a \underbrace{\Big[ r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}\big[V^\star(s')\big] \Big]}_{\text{shorthand } Q^\star(s,a)}$$
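Equivalently, the shorthand satisfies its own optimality equation (a standard restatement, included here for reference): $$Q^\star(s,a) = r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}\Big[\max_{a'} Q^\star(s',a')\Big], \qquad V^\star(s) = \max_a Q^\star(s,a)$$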
[Figure: example two-state MDP with states \(0\) and \(1\) and actions stay/move, with transition probabilities labeled \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\).]
Value Iteration
[Figure: gridworld example with 16 states, numbered 0–15 in a 4×4 grid.]
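Value Iteration repeatedly applies the Bellman optimality operator \(\mathcal J_\star V(s) = \max_a \big[r(s,a) + \gamma\,\mathbb E_{s'\sim P(s,a)}[V(s')]\big]\) starting from some \(V_0\), then returns the policy that is greedy with respect to the final iterate. A minimal NumPy sketch for the tabular setting (array names, shapes, and the zero initialization are illustrative assumptions):

```python
import numpy as np

def value_iteration(r, P, gamma, num_iters):
    """Tabular VI: r has shape (S, A); P has shape (S, A, S) with P[s, a, s_next] = P(s_next | s, a)."""
    S, A = r.shape
    V = np.zeros(S)                 # initial guess V_0 (here zero)
    for _ in range(num_iters):
        Q = r + gamma * P @ V       # Q(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')]
        V = Q.max(axis=1)           # V_{i+1} = J_star V_i
    Q = r + gamma * P @ V
    return V, Q.argmax(axis=1)      # value estimate and greedy policy
```

By the convergence result in the next section, the error \(\|V_i - V^\star\|_\infty\) shrinks by a factor of \(\gamma\) per iteration.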
4. VI Convergence
To show that Value Iteration converges, we use a contraction argument.
Lemma (Contraction): For any \(V, V'\) $$\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
Theorem (Convergence): For iterates \(V_i\) of VI, $$\|V_i - V^\star\|_\infty \leq \gamma^i \|V_0-V^\star\|_\infty$$
Lemma (Contraction): For any \(V, V'\) $$\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
Proof
[Figure: plots of \(f_1\), \(f_2\), and \(f_1-f_2\) illustrating that \(|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x |f_1(x)-f_2(x)|\) and \(|\mathbb E[f_1(x)] - \mathbb E[f_2(x)]| \leq \max_x |f_1(x)-f_2(x)|\).]
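In symbols, a sketch of how these two facts combine: for each \(s\),
$$|\mathcal J_\star V(s) - \mathcal J_\star V'(s)| \;\leq\; \max_a \Big| \gamma\, \mathbb E_{s'\sim P(s,a)}\big[V(s') - V'(s')\big] \Big| \;\leq\; \gamma \max_{s'} |V(s') - V'(s')| \;=\; \gamma \|V - V'\|_\infty,$$
where the first inequality uses the max fact and the second uses the expectation fact; taking the maximum over \(s\) gives the lemma.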
Theorem (Convergence): For iterates \(V_i\) of VI, $$\|V_i - V^\star\|_\infty \leq \gamma^i \|V_0-V^\star\|_\infty$$
Proof
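In brief: since \(V^\star\) is a fixed point of \(\mathcal J_\star\) (by Bellman Optimality) and the VI iterates satisfy \(V_i = \mathcal J_\star V_{i-1}\), the contraction lemma gives
$$\|V_i - V^\star\|_\infty = \|\mathcal J_\star V_{i-1} - \mathcal J_\star V^\star\|_\infty \leq \gamma\, \|V_{i-1} - V^\star\|_\infty \leq \cdots \leq \gamma^i\, \|V_0 - V^\star\|_\infty.$$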
Theorem (Suboptimality): For the policy \(\hat\pi\) returned by VI after \(N\) iterations, \(\forall s\) $$ V^\star(s) - V^{\hat\pi}(s) \leq \frac{2\gamma}{1-\gamma} \cdot \gamma^N \|V_0-V^\star\|_\infty$$
This is optional material. In the proof we use \(t\) in place of \(i\).
Proof of Claim:
\(V^{\pi_t}(s) - V^\star(s) =\)
5. Proof of BOE
Consider vectors \(V,V'\) and matrix \(P\).
If \(V\leq V'\) then \(PV\leq PV'\) when \(P\) has non-negative entries, where inequalities hold entrywise.
To see why this is true, consider each entry:
\([PV]_i = \sum_{j=1}^S P_{ij} V_j \leq \sum_{j=1}^S P_{ij} V'_j = [PV']_i\)
The middle inequality holds because \(V_j\leq V'_j\) for every \(j\) and all \(P_{ij}\) are non-negative.
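For example, one place this fact is useful: applied with \(P = P^\pi\), it shows the Bellman operator from the recap is monotone,
$$V \leq V' \implies \mathcal J_\pi[V] = R^\pi + \gamma P^\pi V \;\leq\; R^\pi + \gamma P^\pi V' = \mathcal J_\pi[V'].$$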
\((P^\pi)^k\) is bounded because \(P^\pi\) is a stochastic matrix (its rows are probability distributions), so its spectral radius is 1 and every power \((P^\pi)^k\) is itself a stochastic matrix with entries in \([0,1]\).
2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy