Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Announcements

• Homework this week
  • Problem Set 2 due Monday 2/13
  • Programming Assignment 1 due 2/15
  • Next PSet and PA released on 2/15
• My office hours:
  • Tuesdays 10:30-11:30am in Gates 416A
  • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

## Agenda

1. Recap

2. Policy Iteration

3. Finite Horizon MDP

4. Dynamic Programming

## Recap: Bellman Equations

Bellman Optimality Equation (BOE): The optimal value satisfies, $$\forall s$$, $$V^\star(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] \right]$$

Bellman Expectation Equation: For a given policy $$\pi$$, the value is, $$\forall s$$,

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$

## Recap: Value Iteration

Value Iteration

• Initialize $$V_0$$
• For $$t=0,\dots,T-1$$:
  • $$V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]$$
• Return $$\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]$$
• Idea: find an approximately optimal $$\hat V$$ with fixed-point iteration, then extract an approximately optimal policy $$\hat\pi(s)=\arg\max_{a\in\mathcal A} \hat Q(s,a)$$ (see the sketch below)
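A minimal numpy sketch of the loop above, assuming tabular arrays `P[s, a, s']` for transitions and `r[s, a]` for rewards (the names and shapes are illustrative, not taken from the programming assignment):

```python
import numpy as np

def value_iteration(P, r, gamma, T):
    """Run T Bellman backups, then extract the greedy policy.

    P: (S, A, S) array, P[s, a, s'] = probability of s' given (s, a)
    r: (S, A) array of rewards
    """
    S, A = r.shape
    V = np.zeros(S)                          # V_0 initialized to zero
    for _ in range(T):
        V = (r + gamma * P @ V).max(axis=1)  # V_{t+1}(s) = max_a [r + gamma E V_t]
    Q_T = r + gamma * P @ V                  # one more backup to read off the policy
    return V, Q_T.argmax(axis=1)             # hat pi(s) = argmax_a Q_T(s, a)
```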

## Q Value Iteration

Q Value Iteration

• Initialize $$Q_0$$
• For $$t=0,\dots,T-1$$:
  • $$Q_{t+1}(s, a) = r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t}(s', a') \right]$$
• Return $$\displaystyle \hat\pi(s) =\arg\max_{a\in\mathcal A} Q_T (s,a)$$

We can think of the Q function as an $$S\times A$$ array or an $$SA$$ vector
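Treating the Q function as an $$S\times A$$ array, the loop above acts on it directly; a short sketch under the same assumed array layout as before:

```python
import numpy as np

def q_value_iteration(P, r, gamma, T):
    """Q-value iteration on a tabular MDP with P: (S, A, S) and r: (S, A)."""
    S, A = r.shape
    Q = np.zeros((S, A))                      # Q_0
    for _ in range(T):
        Q = r + gamma * P @ Q.max(axis=1)     # backup uses max_{a'} Q_t(s', a')
    return Q, Q.argmax(axis=1)                # greedy policy from Q_T
```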

## Recap: Convergence of VI

Define the Bellman Operator $$\mathcal T:\mathbb R^S\to \mathbb R^S$$ as $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$

Lemma (Contraction): For any $$V, V'$$, $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

Lemma (Convergence): For iterates $$V_t$$ of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
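The contraction property is easy to check numerically: apply $$\mathcal T$$ to two arbitrary value vectors on a randomly generated MDP and compare sup-norm distances. A sketch, with all sizes and seeds chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)            # normalize into a transition tensor
r = rng.random((S, A))

def bellman(V):
    return (r + gamma * P @ V).max(axis=1)   # (T V)(s) = max_a [r(s,a) + gamma E V(s')]

V1, V2 = rng.random(S), rng.random(S)
lhs = np.abs(bellman(V1) - bellman(V2)).max()
rhs = gamma * np.abs(V1 - V2).max()
assert lhs <= rhs + 1e-12                    # ||T V1 - T V2||_inf <= gamma ||V1 - V2||_inf
```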

## Performance of VI Policy

Theorem (Suboptimality): For policy $$\pi_T$$ from VI, $$\forall s$$ $$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$

• So far we know that $$V_t$$ converges to $$V^\star$$
• But is $$\pi_t$$ near optimal?
• $$V_t$$ is not exactly equal to $$V^{\pi_t}$$
• $$V^{\pi_t} = (I-\gamma P_{\pi_t})^{-1}R^{\pi_t}$$
• $$V_t$$ may not correspond to the value of any policy

## Performance of VI Policy

Proof

• Claim: $$V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• Recursing once: $$V^{\pi_t}(s) - V^\star(s)$$
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\right]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• $$= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')\right]-2\gamma^{t+2} \|V_0-V^{\star}\|_\infty-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• Recursing $$k$$ times,
$$V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^{k-1}\gamma^{\ell} \|V_0-V^{\star}\|_\infty$$
• Letting $$k\to\infty$$, $$V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V_0-V^{\star}\|_\infty$$

Theorem (Suboptimality): For policy $$\pi_T$$ from VI, $$\forall s$$ $$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$

Proof of Claim:

$$V^{\pi_t}(s) - V^\star(s) =$$

• $$= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]$$ (Bellman Expectation, add and subtract)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')]$$ (Grouping terms, add and subtract)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]$$ (Definition of $$\pi_t$$ as argmax)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]- \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]$$ (Bellman Expectation on $$V^\star$$ and cancelling reward terms)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+\gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')]$$ (Linearity of Expectation)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V_t-V^{\star}\|_\infty$$ (Basic Inequality)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$ (Convergence Lemma)

## Agenda

1. Recap

2. Policy Iteration

3. Finite Horizon MDP

4. Dynamic Programming

## Policy Iteration

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
  • Compute $$V^{\pi_t}$$ with Policy Evaluation
  • Policy Improvement: $$\forall s$$, $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]$$
• Policy Iteration updates a policy at every iteration step (see the sketch below)
  • contrast with VI, which generates a policy only at the end
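A sketch of policy iteration in which Policy Evaluation solves the linear system $$V^{\pi} = (I-\gamma P_{\pi})^{-1}R^{\pi}$$ exactly, using the same assumed `P[s, a, s']` and `r[s, a]` arrays as in the earlier sketches:

```python
import numpy as np

def policy_iteration(P, r, gamma, T):
    """T rounds of exact policy evaluation + greedy improvement (T >= 1)."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                  # pi_0: arbitrary initial policy
    for _ in range(T):
        # Policy evaluation: V^pi = (I - gamma P_pi)^{-1} r_pi
        P_pi = P[np.arange(S), pi]               # (S, S) transition matrix under pi
        r_pi = r[np.arange(S), pi]               # (S,)  reward vector under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to V^pi
        pi = (r + gamma * P @ V).argmax(axis=1)
    return pi, V
```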

## Example: PA 1

(Figure: state diagram from Programming Assignment 1, with states numbered $$0$$ through $$16$$.)

## Policy Iteration

• Two key properties:
1. Monotonic Improvement: $$V^{\pi_{t+1}} \geq V^{\pi_t}$$
2. Convergence: $$\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty$$

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
  • Compute $$V^{\pi_t}$$ with Policy Evaluation
  • Policy Improvement: $$\forall s$$, $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]$$

## Monotonic Improvement

Lemma (Monotonic Improvement): For iterates $$\pi_t$$ of PI, the value monotonically improves, i.e. $$V^{\pi_{t+1}} \geq V^{\pi_{t}}$$

Proof:

• $$V^{\pi_{t+1}}(s) - V^{\pi_t}(s) =$$
• $$=r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')] - (r(s,\pi_t(s))+\gamma\mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_{t}}(s')])$$ (Bellman Expectation Eq)
• $$\geq r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t}}(s')])$$ (definition of $$\pi_{t+1}$$ in Policy Improvement step)
• $$= \gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')-V^{\pi_{t}}(s')]$$
• In vector form, $$V^{\pi_{t+1}} - V^{\pi_t} \geq \gamma P_{\pi_{t+1}} (V^{\pi_{t+1}} - V^{\pi_t})$$
• $$V^{\pi_{t+1}} - V^{\pi_t} \geq \gamma^k P^k_{\pi_{t+1}} (V^{\pi_{t+1}} - V^{\pi_t})$$ (iterate $$k$$ times)
• letting $$k\to\infty$$, $$V^{\pi_{t+1}} - V^{\pi_t} \geq 0$$

Consider vectors $$V,V'$$ and a matrix $$P$$ with nonnegative entries.

In homework, you will show that if $$V\leq V'$$ then $$PV\leq PV'$$ (inequalities hold entrywise).

You will also show that $$(P^\pi)^k$$ is bounded when $$P^\pi$$ is a stochastic matrix.

## Convergence of PI

Theorem (PI Convergence): For $$\pi_t$$ from PI,  $$\|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$

Proof:

• $$V^\star(s) - V^{\pi_{t+1}}(s) =$$
• $$=\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')])$$ (Bellman Optimality and Expectation Eq)
• $$\leq \max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t}}(s')])$$ (Monotonic Improvement)
• $$= \max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right]$$ (Definition of $$\pi_{t+1}$$ in Policy Improvement)
• $$|V^\star(s) - V^{\pi_{t+1}}(s) |$$
• $$\leq|\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right]|$$

## Convergence of PI

Theorem (PI Convergence): For $$\pi_t$$ from PI,  $$\|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$

Proof:

• $$|V^\star(s) - V^{\pi_{t+1}}(s) |$$
• $$\leq|\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right]|$$
• $$\leq \max_a\left| r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]-(r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')])\right|$$ (Basic Inequality PSet 1)
• $$= \gamma\max_a\left|\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]-\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right|$$
• $$\leq \gamma\max_{a,s'}\left|V^{\star}(s')-V^{\pi_{t}}(s')\right|$$ (Basic Inequality PSet 1)
• $$= \gamma\|V^{\star}-V^{\pi_{t}}\|_\infty$$
• By induction, this implies that $$\|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$

## VI and PI Comparison

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
  • Policy Evaluation: $$V^{\pi_t}$$
  • Policy Improvement: $$\pi_{t+1}$$

Value Iteration

• Initialize $$V_0$$
• For $$t=0,\dots,T-1$$:
  • Bellman Operator: $$V_{t+1}$$
• Return $$\displaystyle \hat\pi$$
• Both converge geometrically: for any finite $$T$$, the error is small but not exactly zero
• Policy Iteration is guaranteed to converge to the exactly optimal policy in finite time (PSet 3)
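One way to see the contrast is to run both methods on the same randomly generated MDP: the value-iteration error shrinks geometrically but typically remains nonzero after any finite number of backups, while the policy-iteration policy usually stops changing after a handful of improvement steps. A self-contained sketch (sizes, seed, and iteration counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)                # random transition tensor
r = rng.random((S, A))

# Reference V* from a large number of Bellman backups
V_star = np.zeros(S)
for _ in range(2000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

# Value iteration: geometric but nonzero error after finitely many steps
V = np.zeros(S)
for _ in range(20):
    V = (r + gamma * P @ V).max(axis=1)
print("VI error after 20 steps:", np.abs(V - V_star).max())

# Policy iteration: the greedy policy typically becomes fixed after a few steps
pi = np.zeros(S, dtype=int)
for t in range(20):
    P_pi, r_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    new_pi = (r + gamma * P @ V_pi).argmax(axis=1)
    if np.array_equal(new_pi, pi):
        print(f"PI policy fixed after {t} steps, error:", np.abs(V_pi - V_star).max())
        break
    pi = new_pi
```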

## Finite Horizon MDP

• $$\mathcal{S}, \mathcal{A}$$ state and action space
• $$r$$ reward function, $$P$$ transition function
• $$H$$ is horizon (positive integer)

Goal: achieve high cumulative reward $$\sum_{t=0}^{H-1} r_t$$:

$$\max_{\pi} ~~ \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right] \quad \text{s.t.} \quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}$$

Lasts exactly $$H$$ steps, no discounting

## Example

(Diagram: two states $$0$$ and $$1$$. From state $$0$$, "stay" remains in $$0$$ and "switch" moves to $$1$$, each with probability $$1$$. From state $$1$$, "stay" moves to $$0$$ with probability $$p_1$$ and stays with probability $$1-p_1$$; "switch" moves to $$0$$ with probability $$p_2$$ and stays with probability $$1-p_2$$.)

• Reward:  $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• Finite horizon $$H$$
• How might optimal policy differ when $$t$$ close to $$H$$?


## Time Varying Policies and Value

We consider time-varying policies $$\pi = (\pi_0,\dots,\pi_{H-1})$$

The value of a state also depends on time

$$V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k, a_k) \mid s_t=s,s_{k+1}\sim P(s_k, a_k),a_k\sim \pi_k(s_k)\right]$$

## Example

(Diagram: the same two-state MDP as above, with "stay" keeping state $$0$$, "switch" moving $$0\to 1$$, and state $$1$$ reaching $$0$$ with probability $$p_1$$ under "stay" or $$p_2$$ under "switch".)

• Reward:  $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• Consider a policy that "stays" for all states and time
• $$V^\pi_t(0) =$$ PollEv answer: $$H-t$$ (verified in the sketch below)
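The poll answer can be checked by backward policy evaluation on this two-state MDP: under the always-stay policy, state $$0$$ collects $$+1$$ at every remaining step, so $$V^\pi_t(0)=H-t$$. A sketch with illustrative parameter values ($$H$$, $$p_1$$, $$p_2$$ are arbitrary choices, not from the slides):

```python
import numpy as np

H, p1, p2 = 10, 0.3, 0.4                # illustrative values
# States {0, 1}; actions {0: stay, 1: switch}
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]                    # (s=0, stay) stays in 0
P[0, 1] = [0.0, 1.0]                    # (s=0, switch) moves to 1
P[1, 0] = [p1, 1 - p1]                  # (s=1, stay) reaches 0 with prob p1
P[1, 1] = [p2, 1 - p2]                  # (s=1, switch) reaches 0 with prob p2
r = np.array([[1.0, 0.5],               # +1 for s=0, extra -1/2 for switching
              [0.0, -0.5]])

# Evaluate the "always stay" policy backward in time, starting from V_H = 0
V = np.zeros(2)
for t in reversed(range(H)):
    V = r[:, 0] + P[:, 0] @ V           # V_t(s) = r(s, stay) + E[V_{t+1}(s')]
    if t in (0, H - 1):
        print(t, V[0])                  # prints H-1 -> 1.0, then 0 -> 10.0
```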

## Finite Horizon Bellman Eqns

Bellman Expectation Equation: $$\forall s$$,

$$V_t^{\pi}(s) = \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]$$

Q function

$$Q_t^{\pi}(s, a) = \ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')]$$

Rather than a self-referential recursion, in the finite-horizon setting we have an iteration: $$V_t^\pi$$ is expressed in terms of $$V_{t+1}^\pi$$, so it can be computed exactly by stepping backward in time

## Dynamic Programming

• Initialize $$V^\star_H = 0$$
• For $$t=H-1, H-2, ..., 0$$:
  • $$Q_t^\star(s,a) = r(s,a)+\mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$
  • $$\pi_t^\star(s) = \arg\max_a Q_t^\star(s,a)$$
  • $$V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )$$

Bellman optimality is also an iterative rather than a recursive equation: $$V^\star_t(s)=\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$

• We can solve this iteration directly and exactly (rather than approximately like VI and PI)
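A minimal sketch of this backward recursion for a tabular finite-horizon MDP, again assuming `P[s, a, s']` and `r[s, a]` arrays (names and shapes are illustrative):

```python
import numpy as np

def finite_horizon_dp(P, r, H):
    """Exact backward DP: optimal time-varying values and policies.

    P: (S, A, S) transitions, r: (S, A) rewards, H: horizon.
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))              # V[H] = 0 (terminal values)
    pi = np.zeros((H, S), dtype=int)      # time-varying policy pi_t(s)
    for t in range(H - 1, -1, -1):
        Q = r + P @ V[t + 1]              # Q_t(s,a) = r(s,a) + E[V_{t+1}(s')]
        pi[t] = Q.argmax(axis=1)          # pi_t(s) = argmax_a Q_t(s,a)
        V[t] = Q.max(axis=1)              # V_t(s) = Q_t(s, pi_t(s))
    return V, pi
```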

## Example

• Reward:  $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• $$Q^{\star}_{H-1}(s,a)=r(s,a)$$ for all $$s,a$$
• $$\pi^\star_{H-1}(s)=$$stay for all $$s$$
• $$V^\star_{H-1}=\begin{bmatrix}1\\0\end{bmatrix}$$, $$Q^\star_{H-2}=\begin{bmatrix}2&\frac{1}{2}\\p & -\frac{1}{2}+2p\end{bmatrix}$$
• $$\pi^\star_{H-2}(s)=$$stay for all $$s$$ (using $$p\leq \tfrac{1}{2}$$, so $$p\geq -\tfrac{1}{2}+2p$$)
• $$V^\star_{H-2}=\begin{bmatrix}2\\p\end{bmatrix}$$, $$Q^\star_{H-3}=\begin{bmatrix}3&\frac{1}{2}+p\\(1-p)p+2p & -\frac{1}{2}+(1-2p)p+4p\end{bmatrix}$$
• $$\pi^\star_{H-3}(0)=$$stay and $$\pi^\star_{H-3}(1)=$$switch if $$p\geq 1-\frac{1}{\sqrt{2}}$$

(Diagram: the two-state MDP with $$p_1=p$$ and $$p_2=2p$$. From state $$0$$, "stay" remains in $$0$$ and "switch" moves to $$1$$, each with probability $$1$$. From state $$1$$, "stay" reaches $$0$$ with probability $$p$$ and stays with probability $$1-p$$; "switch" reaches $$0$$ with probability $$2p$$ and stays with probability $$1-2p$$.)
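The calculations above can be reproduced with backward DP; the self-contained sketch below checks $$Q^\star_{H-2}$$, $$Q^\star_{H-3}$$, and the threshold $$p\geq 1-\frac{1}{\sqrt 2}\approx 0.293$$ at which switching becomes optimal in state $$1$$ three steps from the end (the two values of $$p$$ are illustrative):

```python
import numpy as np

def dp(P, r, H):
    """Backward DP returning all time-indexed values V_t and Q_t."""
    S, A = r.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for t in range(H - 1, -1, -1):
        Q[t] = r + P @ V[t + 1]
        V[t] = Q[t].max(axis=1)
    return V, Q

H = 5
for p in (0.25, 0.35):                                  # below / above 1 - 1/sqrt(2)
    P = np.zeros((2, 2, 2))
    P[0, 0], P[0, 1] = [1, 0], [0, 1]                   # state 0: stay stays, switch leaves
    P[1, 0], P[1, 1] = [p, 1 - p], [2 * p, 1 - 2 * p]   # state 1 (requires p <= 1/2)
    r = np.array([[1.0, 0.5], [0.0, -0.5]])             # +1 for s=0, -1/2 for switching
    V, Q = dp(P, r, H)
    # Q[H-2] matches [[2, 1/2], [p, -1/2 + 2p]]; the last entry is pi*_{H-3}(1)
    print(p, Q[H - 2], Q[H - 3].round(3), ["stay", "switch"][Q[H - 3, 1].argmax()])
```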

## Recap

• PSet 2 due Monday
• PA 1 due next Wednesday

• Value & Policy Iteration
• Finite Horizon MDP
• Dynamic Programming

• Next lecture: continuous control
