Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Bellman Optimality
3. Value Iteration
[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/switch; transitions labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
Bellman Optimality Equation (BOE): \(\forall s\), $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
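As a concrete illustration (a sketch, not from the slides), the right-hand side of the BOE can be computed as a backup operation on a tabular MDP; the arrays `R` (\(S\times A\) rewards), `P` (\(S\times A\times S\) transition probabilities), and `gamma` are assumed, hypothetical inputs.

```python
import numpy as np

def bellman_optimality_backup(V, R, P, gamma):
    """Apply the right-hand side of the BOE to a value estimate V.

    V: (S,) current value estimates
    R: (S, A) rewards r(s, a)               # assumed tabular representation
    P: (S, A, S) transition probabilities   # P[s, a, s'] = P(s' | s, a)
    gamma: discount factor in [0, 1)
    """
    Q = R + gamma * P @ V   # (S, A): r(s,a) + gamma * E_{s' ~ P(s,a)}[V(s')]
    return Q.max(axis=1)    # max over actions, one entry per state
```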
Bellman Expectation Equation: \(\forall s\),
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
The value of a state \(s\) under a policy \(\pi\), denoted \(V^\pi(s)\), is the expected cumulative discounted reward starting from that state.
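Because the Bellman Expectation Equation is linear in \(V^\pi\), in the tabular setting it can be solved exactly by writing it as \(V^\pi = R^\pi + \gamma P^\pi V^\pi\). A minimal numpy sketch (the inputs `r_pi` and `P_pi`, the policy-induced reward vector and transition matrix, are assumptions for illustration):

```python
import numpy as np

def policy_evaluation(r_pi, P_pi, gamma):
    """Solve the Bellman Expectation Equation V = r_pi + gamma * P_pi @ V.

    r_pi: (S,) expected reward under the policy, E_{a ~ pi(s)}[r(s, a)]
    P_pi: (S, S) state-to-state transition matrix induced by the policy
    gamma: discount factor in [0, 1)
    """
    S = len(r_pi)
    # (I - gamma * P_pi) V = r_pi  =>  V = (I - gamma * P_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```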
1. Recap
2. Bellman Optimality
3. Value Iteration
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$
Consider vectors \(V,V'\) and matrix \(P\).
If \(V\leq V'\) then \(PV\leq PV'\) when \(P\) has non-negative entries, where inequalities hold entrywise.
To see why this is true, consider each entry:
\([PV]_i = \sum_{j=1}^S P_{ij} V_j \leq \sum_{j=1}^S P_{ij} V'_j = [PV']_i\)
The middle inequality holds because \(V_j \leq V'_j\) for every \(j\) and all entries \(P_{ij}\) are non-negative.
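A quick numerical illustration of this entrywise monotonicity (made-up numbers, not from the lecture):

```python
import numpy as np

# A row-stochastic P preserves entrywise order of value vectors.
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
V  = np.array([1.0, 2.0])
Vp = np.array([1.5, 2.0])        # V <= Vp entrywise

print(P @ V, P @ Vp)             # [1.3 1.8] and [1.65 1.9]
print(np.all(P @ V <= P @ Vp))   # True
```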
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
\((P^\pi)^k\) is bounded because \(P^\pi\) is a stochastic matrix (its entries are transition probabilities), so every power \((P^\pi)^k\) is itself stochastic with entries in \([0,1]\); equivalently, its largest eigenvalue is \(1\).
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s\)
[Figure: plots of \(f_1\), \(f_2\), and \(f_1-f_2\), comparing \(\max_x f_1(x)\), \(\max_x f_2(x)\), and \(\max_x |f_1(x)-f_2(x)|\), as well as \(\mathbb E[f_1(x)]\) and \(\mathbb E[f_2(x)]\)]
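The key fact behind the figure can be written out (a step not spelled out on the slide): for any \(x\), \(f_1(x) \leq f_2(x) + \max_{x'} |f_1(x') - f_2(x')|\), so taking maxima over \(x\) on both sides and then swapping the roles of \(f_1\) and \(f_2\) gives
$$\Big|\max_x f_1(x) - \max_x f_2(x)\Big| \leq \max_x |f_1(x) - f_2(x)|.$$
Similarly, by linearity and the triangle inequality, \(|\mathbb E[f_1(x)] - \mathbb E[f_2(x)]| \leq \mathbb E\big[|f_1(x) - f_2(x)|\big] \leq \max_x |f_1(x) - f_2(x)|\).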
1. Recap
2. Bellman Optimality
3. Value Iteration
Value Iteration
Example: \(4\times 4\) gridworld (16 states):
| 0 | 1 | 2 | 3 |
| 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 |
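A minimal numpy sketch of Value Iteration (not the lecture's code; the tabular inputs `R`, `P`, `gamma`, and the iteration count `T` are assumed):

```python
import numpy as np

def value_iteration(R, P, gamma, T, V0=None):
    """Iterate V_{t+1} = T V_t, where T is the Bellman optimality backup.

    R: (S, A) rewards r(s, a); P: (S, A, S) transitions; gamma in [0, 1).
    Returns the final value estimate V_T and the greedy policy pi_T.
    """
    S, A = R.shape
    V = np.zeros(S) if V0 is None else V0
    for _ in range(T):
        Q = R + gamma * P @ V   # (S, A) action values under the current V
        V = Q.max(axis=1)       # Bellman optimality backup
    Q = R + gamma * P @ V       # action values at the final iterate
    return V, Q.argmax(axis=1)  # greedy policy with respect to V_T
```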
Lemma (Contraction): For any \(V, V'\) $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
To show that Value Iteration converges, we use a contraction argument
Lemma (Convergence): For iterates \(V_t\) of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
Lemma (Contraction): For any \(V, V'\) $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
Proof
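One way to fill in the proof (the steps are not on the slide): for each state \(s\), using the facts from the earlier figure (a difference of maxima is bounded by the maximum difference, and an expectation is bounded by the maximum),
$$\big|[\mathcal T V](s) - [\mathcal T V'](s)\big| \leq \gamma \max_{a\in\mathcal A} \Big|\mathbb{E}_{s' \sim P(s, a)}\big[V(s') - V'(s')\big]\Big| \leq \gamma \|V - V'\|_\infty,$$
and taking the maximum over \(s\) gives the lemma.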
Lemma (Convergence): For iterates \(V_t\) of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
Proof
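One way to fill in this proof (not on the slide): since \(V^\star\) satisfies the BOE, it is a fixed point of the backup, \(\mathcal T V^\star = V^\star\), so applying the contraction lemma at each iteration,
$$\|V_t - V^\star\|_\infty = \|\mathcal T V_{t-1} - \mathcal T V^\star\|_\infty \leq \gamma \|V_{t-1} - V^\star\|_\infty \leq \cdots \leq \gamma^t \|V_0 - V^\star\|_\infty.$$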
Theorem (Suboptimality): For policy \(\pi_T\) from VI, \(\forall s\) $$ V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$
Proof of Claim:
\(V^{\pi_t}(s) - V^\star(s) =\)
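Assuming the claim being completed here is the standard greedy-policy bound \(V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma}{1-\gamma}\|V_T - V^\star\|_\infty\) (with \(\pi_T\) greedy with respect to \(V_T\)), the theorem follows by chaining it with the convergence lemma:
$$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma}{1-\gamma}\|V_T - V^\star\|_\infty \leq \frac{2\gamma}{1-\gamma}\,\gamma^T \|V_0 - V^\star\|_\infty = \frac{2\gamma^{T+1}}{1-\gamma}\|V_0 - V^\star\|_\infty.$$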
Policy Iteration