CS 4/5789: Introduction to Reinforcement Learning
Lecture 5: Value Iteration
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Announcements
- Questions about waitlist/enrollment?
- Homework this week
- Problem Set 1 due TONIGHT
- Problem Set 2 released tonight due 2/13
- Programming Assignment 1 due 2/15
- My office hours: Tuesdays 10:30-11:30am in Gates 416A, Wednesday 4-4:50pm in Olin 255 (right after lecture)
Agenda
1. Recap
2. Bellman Optimality
3. Value Iteration
Recap: Optimal Policy
- The value of a state \(s\) under a policy \(\pi\) denoted \(V^\pi(s)\) is the expected cumulative discounted reward starting from that state
- An optimal policy \(\pi_\star\) is one with a dominating value,
- i.e. \(V^{\pi_\star} \geq V^{\pi}\) for all policies \(\pi\)
- All optimal policies achieve the same value \(V^\star\)
Recap: Example

[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/switch; transition arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
- Reward: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Consider the policy \(\pi(s)=\)stay for all \(s\)
- Optimal if \(p_2\leq \frac{p_1}{1-\gamma p_1}+\frac{1-\gamma}{2}\), i.e. effectiveness of "switch" small compared with "stickiness" of 1 and discount factor

Recap: Bellman Equations
Bellman Optimality Equation (BOE): \(\forall s\), $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
Theorem (Bellman Optimality):
- If \(\pi_\star\) is an optimal policy, then \(V^{\pi_\star}\) satisfies the BOE
- If \(V^\pi\) satisfies the BOE, then \(\pi\) is an optimal policy
Bellman Expectation Equation: \(\forall s\),
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
The value of a state \(s\) under a policy \(\pi\) denoted \(V^\pi(s)\) is the expected cumulative discounted reward starting from that state
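For a finite MDP, the Bellman Expectation Equation can be solved exactly in vector form, \(V^\pi = (I-\gamma P^{\pi})^{-1}R^{\pi}\). Below is a minimal numpy sketch of this computation; the array layout (`r[s, a]`, `P[s, a, s']`) and the deterministic-policy representation are illustrative assumptions, not the course's code interface.

```python
import numpy as np

def exact_policy_evaluation(P, r, pi, gamma):
    """Solve V^pi = R^pi + gamma * P^pi V^pi for a deterministic policy pi.

    P:  (S, A, S) array of transition probabilities P[s, a, s']
    r:  (S, A) array of rewards r(s, a)
    pi: (S,) integer array, pi[s] is the action taken in state s
    """
    S = r.shape[0]
    R_pi = r[np.arange(S), pi]          # reward vector induced by the policy, (S,)
    P_pi = P[np.arange(S), pi, :]       # transition matrix induced by the policy, (S, S)
    # Bellman Expectation Equation in vector form: (I - gamma * P^pi) V^pi = R^pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```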
Agenda
1. Recap
2. Bellman Optimality
3. Value Iteration
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \) (Bellman Expectation Eq)
- \(\leq \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right]\)
- \(= r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]\) (by definition of \(\hat\pi\) as the argmax)
- Writing the above expression in vector form:
- \(V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}\)
- \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}\) (subtract \(V^{\hat\pi}\) from both sides)
- \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}\) (Bellman Expectation Eq)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})\) (\(\star\))
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) to RHS)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
Consider vectors \(V,V'\) and matrix \(P\).
If \(V\leq V'\) then \(PV\leq PV'\) when \(P\) has non-negative entries, where inequalities hold entrywise.
To see why this is true, consider each entry:
\([PV]_i = \sum_{j=1}^S P_{ij} V_j \leq \sum_{j=1}^S P_{ij} V'_j = [PV']_i\)
The middle inequality holds because \(V_j\leq V'_j\) for every \(j\) and all entries \(P_{ij}\) are non-negative.
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})\) (\(\star\))
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2(V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) to RHS)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat\pi})^k(V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) \(k\) times)
- \(V^{\pi_\star} - V^{\hat\pi} \leq 0\) (limit \(k\to\infty\): \(\gamma^k\to 0\) while \((P^{\hat\pi})^k(V^{\pi_\star}-V^{\hat\pi})\) stays bounded)
- Therefore, \(V^{\pi_\star} = V^{\hat\pi}\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
\((P^{\hat\pi})^k\) stays bounded because \(P^{\hat\pi}\) is a stochastic matrix (its rows are probability distributions), so \(\|(P^{\hat\pi})^k v\|_\infty \leq \|v\|_\infty\) for any vector \(v\).
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
- We showed that \(V^{\pi_\star} = V^{\hat\pi}\)
- this means \(\hat \pi(s)\) is an optimal policy!
- By definition of \(\hat\pi\) and the Bellman Expectation Equation, \(V^{\hat \pi}\) satisfies the Bellman Optimality Equation
- Therefore, \(V^{\pi_\star}\) must also satisfy it.
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$
- If we know the optimal value \(V^\star\) then we can write down optimal policies! $$\pi^\star(s) \in \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\star}(s')] \right]$$
- Recall the definition of the Q function: $$Q^\star(s,a)= r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] $$
- \(\pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)\)
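If an estimate of \(V^\star\) is available as an array, the greedy policy above can be computed directly. A minimal sketch, assuming the same illustrative array layout (`r[s, a]`, `P[s, a, s']`) as before, not the course's code interface:

```python
import numpy as np

def greedy_policy(P, r, V, gamma):
    """Return pi(s) in argmax_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]."""
    # Q[s, a] = r(s, a) + gamma * sum_{s'} P[s, a, s'] * V[s']
    Q = r + gamma * P @ V            # (S, A); the matmul contracts the last axis of P with V
    return np.argmax(Q, axis=1)      # one maximizing action per state (ties broken arbitrarily)
```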
Bellman Optimality
Bellman Optimality Proof
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s\)
- Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
- By part 1, we know that \(V^{\pi_\star}\) satisfies BOE
- We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
- \(=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\) (BOE by assumption and part 1)
- \(\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\)
- PollEV basic inequality from PSet 1: $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$
- \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
Basic Inequalities
[Figure: plots of \(f_1\), \(f_2\), and \(f_1-f_2\) illustrating \(\max_x f_1(x)\), \(\max_x f_2(x)\), \(\max_x |f_1(x)-f_2(x)|\), \(\mathbb E[f_1(x)]\), and \(\mathbb E[f_2(x)]\)]
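One way to see the max inequality (a standard argument; the PSet solution may differ): for every \(x\), \(f_1(x) = f_2(x) + (f_1(x)-f_2(x)) \leq \max_{x'} f_2(x') + \max_{x'}|f_1(x')-f_2(x')|\). Taking the max over \(x\) gives $$\max_x f_1(x) - \max_x f_2(x) \leq \max_x|f_1(x)-f_2(x)|,$$ and swapping the roles of \(f_1\) and \(f_2\) gives the other direction, so the absolute difference of the maxima is bounded by \(\max_x|f_1(x)-f_2(x)|\).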
Bellman Optimality Proof
- Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
- We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
- \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
- \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]\) (basic inequality PSet 1)
- \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\) (apply the same bound at \(s'\))
- \(\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
- \(\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]\) (apply \(k\) times)
- \(\to 0\) as \(k\to\infty\), since value functions are bounded and \(\gamma<1\)
- Therefore, \(|V^\pi(s)-V^{\pi_\star}(s)|=0\) for all \(s\), i.e. \(V^\pi = V^{\pi_\star}\), so \(\pi\) must be optimal
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s\)
Agenda
1. Recap
2. Bellman Optimality
3. Value Iteration
Value Iteration
- The Bellman Optimality Equation is a fixed point equation! $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
- If \(V^\star\) satisfies the BOE then $$\pi_\star(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$ is an optimal policy
- Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
Value Iteration
Value Iteration
- Initialize \(V_0\)
- For \(t=0,\dots,T-1\):
- \(V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]\) for all \(s\)
- Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]\) \(\forall s\)
- Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
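Below is a minimal numpy sketch of the algorithm above, again under the illustrative `r[s, a]`, `P[s, a, s']` layout (not the PA 1 interface). Each iteration is one Bellman optimality backup over all states, costing \(O(S^2 A)\) for a dense transition array.

```python
import numpy as np

def value_iteration(P, r, gamma, T):
    """Run T iterations of Value Iteration; return V_T and the greedy policy w.r.t. V_T."""
    S, A = r.shape
    V = np.zeros(S)                      # V_0: an arbitrary initialization
    for _ in range(T):
        # V_{t+1}(s) = max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V_t(s')] ]
        Q = r + gamma * P @ V            # (S, A)
        V = Q.max(axis=1)
    Q = r + gamma * P @ V                # greedy policy with respect to the final iterate
    return V, Q.argmax(axis=1)
```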
Example: PA 1
0 | 1 | 2 | 3 |
4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 |
12 | 13 | 14 | 15 |
Bellman Operator
- Define the Bellman Operator \(\mathcal T:\mathbb R^S\to \mathbb R^S\) as, \(\forall s\) $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
- Nonlinear map
- Value Iteration is repeated application of the Bellman Operator
- Compare with Bellman Expectation Equation we used in Approximate Policy Evaluation
Convergence of VI
To show that Value Iteration converges, we use a contraction argument.
Lemma (Contraction): For any \(V, V'\) $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
Lemma (Convergence): For iterates \(V_t\) of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
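The contraction property can be sanity-checked numerically on a small random MDP. A sketch under the same illustrative array conventions, with a randomly generated \(P\), \(r\), and two arbitrary value vectors:

```python
import numpy as np

def bellman_operator(P, r, V, gamma):
    """(T V)(s) = max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]."""
    return (r + gamma * P @ V).max(axis=1)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # normalize each row P[s, a, :] into a distribution
r = rng.random((S, A))

V1, V2 = rng.standard_normal(S), rng.standard_normal(S)
lhs = np.abs(bellman_operator(P, r, V1, gamma) - bellman_operator(P, r, V2, gamma)).max()
rhs = gamma * np.abs(V1 - V2).max()
assert lhs <= rhs + 1e-12                # ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf
```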
Convergence of VI
Lemma (Contraction): For any \(V, V'\) $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
Proof
- \(|\mathcal T V(s) - \mathcal T V'(s)| = |\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]|\)
- \(\leq \max_{a\in\mathcal A} | r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]|\) (Basic Inequality PSet 1)
- \(= \max_{a\in\mathcal A}\gamma | \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \mathbb{E}_{s' \sim P( s, a)} [V'(s')]|\)
- \(\leq \max_{a\in\mathcal A}\gamma \mathbb{E}_{s' \sim P( s, a)} [|V(s') - V'(s')|]\) (Basic Inequality PSet 1)
- \(\leq \max_{s'\in\mathcal S}\gamma |V(s') - V'(s')|\) (expectation bounded by max)
- \(= \gamma \|V - V'\|_\infty\) (definition of \(\|\cdot\|_\infty\))
- The above holds for all \(s\) so \(\max_s|\mathcal T V(s) - \mathcal T V'(s)| =\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty\)
Convergence of VI
Proof
- \(\|V_t - V^\star\|_\infty = \|\mathcal T V_{t-1} -\mathcal T V^\star\|_\infty\) (Definition of VI and BOE)
- \(\leq\gamma\|V_{t-1} - V^\star\|_\infty\) (Contraction Lemma)
- We prove the Lemma by induction using the above inequality
- Base case \((t=0)\):\( \|V_0-V^\star\|_\infty = \|V_0-V^\star\|_\infty\)
- Induction step: Assume \(\|V_k - V^\star\|_\infty \leq \gamma^{k}\|V_0-V^\star\|_\infty\). By above inequality, we have that $$\|V_{k+1} - V^\star\|_\infty \leq \gamma \|V_k-V^\star\|_\infty \leq \gamma \cdot \gamma^k\|V_0-V^\star\|_\infty$$ thus \(\|V_{k+1} - V^\star\|_\infty \leq \gamma^{k+1}\|V_0-V^\star\|_\infty\).
Lemma (Convergence): For iterates \(V_t\) of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
Performance of VI Policy
Proof
- Claim: \(V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
- Recursing once: \(V^{\pi_t}(s) - V^\star(s) \)
- \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\right]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
- \(= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')\right]-2\gamma^{t+2} \|V_0-V^{\star}\|_\infty-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
- Recursing \(k\) times, \(V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^{k-1}\gamma^{\ell} \|V_0-V^{\star}\|_\infty\)
- Letting \(k\to\infty\), \(V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V_0-V^{\star}\|_\infty\)
Theorem (Suboptimality): For policy \(\pi_T\) from VI, \(\forall s\) $$ V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$
Proof of Claim:
\(V^{\pi_t}(s) - V^\star(s) =\)
- \(= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]\) (Bellman Expectation, add and subtract)
- \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')]\) (Grouping terms, add and subtract)
- \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]\) (Definition of \(\pi_t\) as argmax)
- \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]- \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]\) (Bellman Expectation on \(V^\star\) and cancelling reward terms)
- \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+\gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')]\) (Linearity of Expectation)
- \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V_{t}-V^{\star}\|_\infty\) (Basic Inequality)
- \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\) (Convergence Lemma)
Preview: Policy Iteration
Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(t=0,\dots,T-1\):
- Compute \(V^{\pi_t}\) with Policy Evaluation
- Policy Improvement: \(\forall s\), $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]$$
- VI only generates a policy at the very end
- Policy Iteration is another iterative algorithm that updates a policy at every iteration step
Preview: Policy Iteration
Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(t=0,\dots,T-1\):
- Policy Evaluation \(V^{\pi_t}\)
- Policy Improvement \(\pi_{t+1}\)
- Two key properties:
- Monotonic Improvement: \(V^{\pi_{t+1}} \geq V^{\pi_t}\)
- Convergence: \(\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty\)
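For comparison with Value Iteration, here is a minimal numpy sketch of Policy Iteration under the same illustrative array layout (`r[s, a]`, `P[s, a, s']`); next lecture covers the algorithm and its guarantees in detail.

```python
import numpy as np

def policy_iteration(P, r, gamma, T):
    """Alternate exact policy evaluation and greedy policy improvement for T rounds."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                          # arbitrary initial policy
    for _ in range(T):
        # Policy Evaluation: solve V^pi = R^pi + gamma * P^pi V^pi
        R_pi = r[np.arange(S), pi]
        P_pi = P[np.arange(S), pi, :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy Improvement: greedy with respect to V^pi
        pi = (r + gamma * P @ V).argmax(axis=1)
    return pi
```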
Recap
- PSet 1 due TONIGHT
- PSet 2 due next Monday
- PA 1 due next Wednesday
- Optimal Policies
- Value Iteration
- Next lecture: Policy Iteration, Dynamic Programming
Sp23 CS 4/5789: Lecture 5
By Sarah Dean