CS 4/5789: Introduction to Reinforcement Learning

Lecture 6: Policy Iteration and Dynamic Programming

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Announcements

  • Homework this week
    • Problem Set 2 due Monday 2/13
    • Programming Assignment 1 due 2/15
    • Next PSet and PA released on 2/15
  • My office hours:
    • Tuesdays 10:30-11:30am in Gates 416A
    • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

Agenda

1. Recap

2. Policy Iteration

3. Finite Horizon MDP

4. Dynamic Programming

Recap: Bellman Equations

Bellman Optimality Equation (BOE): The optimal value satisfies, \(\forall s\), $$V^\star(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] \right]$$

Bellman Expectation Equation: For a given policy \(\pi\), the value is, \(\forall s\),

 \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)

Recap: Value Iteration

Value Iteration

  • Initialize \(V_0\)
  • For \(t=0,\dots,T-1\):
    • \(V_{t+1}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]\right]\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]\)
  • Idea: find approximately optimal \(\hat V\) with fixed point iteration, then get an approximately optimal policy $$\hat\pi(s)=\arg\max_{a\in\mathcal A} \hat Q(s,a), \qquad \hat Q(s,a) = r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[\hat V(s')]$$ (see the sketch below)
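
For concreteness, here is a minimal NumPy sketch of tabular Value Iteration; the array layout (rewards `R` of shape `(S, A)`, transitions `P` of shape `(S, A, S)`) and the function name are illustrative assumptions, not the PA 1 interface.

```python
import numpy as np

def value_iteration(R, P, gamma, T):
    """Tabular VI sketch. R: (S, A) rewards, P: (S, A, S) transition probabilities."""
    S, A = R.shape
    V = np.zeros(S)                        # V_0 initialized to zero
    for _ in range(T):
        Q = R + gamma * P @ V              # Q_{t+1}(s,a) = r(s,a) + gamma E_{s'~P(s,a)}[V_t(s')]
        V = Q.max(axis=1)                  # V_{t+1}(s) = max_a Q_{t+1}(s,a)
    Q = R + gamma * P @ V                  # one more backup to extract the greedy policy
    return V, Q.argmax(axis=1)             # hat{pi}(s) = argmax_a [r(s,a) + gamma E[V_T(s')]]
```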

Q Value Iteration

Q Value Iteration

  • Initialize \(Q_0\)
  • For \(t=0,\dots,T-1\):
    • \(Q_{t+1}(s, a) =   r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t}(s', a') \right]\)
  • Return \(\displaystyle \hat\pi(s) =\arg\max_{a\in\mathcal A} Q_T (s,a)\)

We can think of the Q function as an \(S\times A\) array or an \(SA\) vector
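
A matching sketch of Q Value Iteration, storing Q as an \(S\times A\) array as noted above (same assumed layout as the VI sketch; `q_value_iteration` is an illustrative name).

```python
import numpy as np

def q_value_iteration(R, P, gamma, T):
    """Tabular Q-VI sketch. R: (S, A) rewards, P: (S, A, S) transition probabilities."""
    Q = np.zeros_like(R)                           # Q_0 initialized to zero
    for _ in range(T):
        Q = R + gamma * P @ Q.max(axis=1)          # backup uses max_{a'} Q_t(s', a')
    return Q.argmax(axis=1)                        # hat{pi}(s) = argmax_a Q_T(s, a)
```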

Recap: Convergence of VI

Define the Bellman Operator \(\mathcal T:\mathbb R^S\to \mathbb R^S\) as $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$

Lemma (Contraction): For any \(V, V'\), $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

Lemma (Convergence): For iterates \(V_t\) of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
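
For intuition, the Bellman operator is a one-line function, and the contraction lemma can be sanity-checked numerically. The random-MDP construction below is purely illustrative.

```python
import numpy as np

def bellman_operator(V, R, P, gamma):
    """(T V)(s) = max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V(s')] ]."""
    return (R + gamma * P @ V).max(axis=1)

# Numerical check of the contraction lemma on a random MDP (illustrative).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
R = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))     # each P[s, a] is a distribution over s'
V, Vp = rng.normal(size=S), rng.normal(size=S)
gap = np.abs(bellman_operator(V, R, P, gamma) - bellman_operator(Vp, R, P, gamma)).max()
assert gap <= gamma * np.abs(V - Vp).max() + 1e-12   # ||TV - TV'||_inf <= gamma ||V - V'||_inf
```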

Performance of VI Policy

Theorem (Suboptimality): For policy \(\pi_T\) from VI, \(\forall s\) $$ V^\star(s) - V^{\pi_T}(s)  \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$

  • So far we know that \(V_t\) converges to \(V^\star\)
  • But is \(\pi_t\) near optimal?
  • \(V_t\) is not exactly equal to \(V^{\pi_t}\)
    • \(V^{\pi_t} = (I-\gamma P_{\pi_t})^{-1}R^{\pi_t}\)
    • \(V_t\) may not correspond to the value of any policy
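
The closed form \(V^{\pi_t} = (I-\gamma P_{\pi_t})^{-1}R^{\pi_t}\) gives exact policy evaluation via a linear solve; a minimal sketch under the same assumed tabular layout, with `pi` a length-\(S\) array of actions (illustrative names).

```python
import numpy as np

def policy_evaluation(R, P, pi, gamma):
    """Exact evaluation of a deterministic policy pi: solve (I - gamma P_pi) V = R^pi."""
    S = len(pi)
    idx = np.arange(S)
    R_pi = R[idx, pi]                      # r(s, pi(s)), shape (S,)
    P_pi = P[idx, pi]                      # P(s' | s, pi(s)), shape (S, S)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```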

Performance of VI Policy

Proof

  • Claim: \(V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
  • Recursing once: \(V^{\pi_t}(s) - V^\star(s) \)
    • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\right]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
    • \(= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')\right]-2\gamma^{t+2} \|V_0-V^{\star}\|_\infty-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
  • Recursing \(k\) times,
    \(V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^k\gamma^{\ell} \|V_0-V^{\star}\|_\infty\)
  • Letting \(k\to\infty\), \(V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V_0-V^{\star}\|_\infty\)

Theorem (Suboptimality): For policy \(\pi_T\) from VI, \(\forall s\) $$ V^\star(s) - V^{\pi_T}(s)  \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$

Proof of Claim:

\(V^{\pi_t}(s) - V^\star(s) =\)

  • \(= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]\) (Bellman Expectation, add and subtract)
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')]\) (Grouping terms, add and subtract)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]\) (Definition of \(\pi_t\) as the argmax with respect to \(V_t\))
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]- \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]\) (Bellman Expectation on \(V^\star\) and cancelling reward terms)
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+\gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')]\) (Linearity of Expectation)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V^{\star}-V_{t}\|_\infty\) (Basic Inequality)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\) (Convergence Lemma)

Agenda

1. Recap

2. Policy Iteration

3. Finite Horizon MDP

4. Dynamic Programming

Policy Iteration

Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(t=0,\dots,T-1\):
    • Compute \(V^{\pi_t}\) with Policy Evaluation
    • Policy Improvement: \(\forall s\), $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]$$
  • Policy Iteration updates a policy at every iteration step
    • contrast with VI, which generates a policy only at the end
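
A minimal sketch of Policy Iteration under the same assumed tabular layout; stopping early once the policy stops changing is an implementation choice, not part of the pseudocode above.

```python
import numpy as np

def policy_iteration(R, P, gamma, T):
    """Tabular PI sketch. R: (S, A) rewards, P: (S, A, S) transition probabilities."""
    S, A = R.shape
    pi = np.zeros(S, dtype=int)                        # pi_0: arbitrary initial policy
    idx = np.arange(S)
    for _ in range(T):
        # Policy Evaluation: V^{pi_t} = (I - gamma P_{pi_t})^{-1} R^{pi_t}
        V = np.linalg.solve(np.eye(S) - gamma * P[idx, pi], R[idx, pi])
        # Policy Improvement: greedy one-step lookahead on V^{pi_t}
        pi_new = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):                 # policy stopped changing
            break
        pi = pi_new
    return pi
```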

Example: PA 1

[Figure: 4×4 gridworld from PA 1 with states numbered 0-15; state 16 shown separately.]

Policy Iteration

  • Two key properties:
    1. Monotonic Improvement: \(V^{\pi_{t+1}} \geq V^{\pi_t}\)
    2. Convergence: \(\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty\)

Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(t=0,\dots,T-1\):
    • Compute \(V^{\pi_t}\) with Policy Evaluation
    • Policy Improvement: \(\forall s\), $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]$$

Monotonic Improvement

Lemma (Monotonic Improvement): For iterates \(\pi_t\) of PI, the value monotonically improves, i.e. $$ V^{\pi_{t+1}} \geq V^{\pi_{t}}$$

Proof:

  • \(V^{\pi_{t+1}}(s) - V^{\pi_t}(s) = \)
    • \(=r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')] - (r(s,\pi_t(s))+\gamma\mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_{t}}(s')]) \) (Bellman Expectation Eq)
    • \(\geq r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t}}(s')]) \) (definition of \(\pi_{t+1}\) in Policy Improvement step)
    • \(= \gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')-V^{\pi_{t}}(s')] \)
  • In vector form, \(V^{\pi_{t+1}} - V^{\pi_t} \geq \gamma P_{\pi_{t+1}} (V^{\pi_{t+1}} - V^{\pi_t})\)
    • \(V^{\pi_{t+1}} - V^{\pi_t} \geq \gamma^k P^k_{\pi_{t+1}} (V^{\pi_{t+1}} - V^{\pi_t})\) (iterate \(k\) times)
    • letting \(k\to\infty\), \(V^{\pi_{t+1}} - V^{\pi_t} \geq 0\)

Consider vectors \(V,V'\) and a matrix \(P\) with nonnegative entries. In homework, you will show that if \(V\leq V'\) then \(PV\leq PV'\) (inequalities hold entrywise). You will also show that \(P_\pi^k\) is bounded when \(P_\pi\) is a stochastic matrix.

Convergence of PI

Theorem (PI Convergence): For \(\pi_t\) from PI,  $$ \|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$

Proof:

  • \(V^\star(s) - V^{\pi_{t+1}}(s) = \)
    • \(=\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')]) \) (Bellman Optimality and Expectation Eq)
    • \(\leq \max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t}}(s')]) \) (Monotonic Improvement)
    • \(= \max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right] \) (Definition of \(\pi_{t+1}\) in Policy Improvement)
  • \(|V^\star(s) - V^{\pi_{t+1}}(s) |\) (the left side is nonnegative since \(V^\star\geq V^{\pi_{t+1}}\))
    • \(\leq\left|\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right]\right|\)

Convergence of PI

Theorem (PI Convergence): For \(\pi_t\) from PI,  $$ \|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$

Proof:

  • \(|V^\star(s) - V^{\pi_{t+1}}(s) |\)
    • \(\leq\left|\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right]\right|\)
    • \(\leq \max_a\left| r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]-(r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')])\right| \) (Basic Inequality PSet 1)
    • \(= \gamma\max_a\left|\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]-\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right| \)
    • \(\leq  \gamma\max_{a,s'}\left|V^{\star}(s')-V^{\pi_{t}}(s')\right| \) (Basic Inequality PSet 1)
    • \(=  \gamma\|V^{\star}-V^{\pi_{t}}\|_\infty \)
  • By induction, this implies that \(\|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty\)

VI and PI Comparison

Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(t=0,\dots,T-1\):
    • Policy Evaluation: \(V^{\pi_t}\)
    • Policy Improvement: \(\pi_{t+1}\)

Value Iteration

  • Initialize \(V_0\)
  • For \(t=0,\dots,T-1\):
    • Bellman Operator: \(V_{t+1}\)
  • Return \(\displaystyle \hat\pi\)
  • Both converge geometrically: for any finite \(T\), the error is not exactly zero
  • Policy Iteration is guaranteed to converge to the exactly optimal policy in finite time (PSet 3)
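
As a usage sketch, assuming the `value_iteration` and `policy_iteration` helpers above and a random MDP like the one in the contraction check, the two methods can be compared directly.

```python
# Illustrative comparison; with enough iterations both recover an optimal policy.
V_vi, pi_vi = value_iteration(R, P, gamma, T=200)
pi_pi = policy_iteration(R, P, gamma, T=50)
print(np.array_equal(pi_vi, pi_pi))   # expected True when the optimal policy is unique
```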

Finite Horizon MDP

  • \(\mathcal{S}, \mathcal{A}\) state and action space
  • \(r\) reward function, \(P\) transition function
  • \(H\) is horizon (positive integer)

Goal: achieve high cumulative reward $$\sum_{t=0}^{H-1}  r_t$$

$$\max_{\pi}~~ \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right] \quad \text{s.t.}\quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\)

Lasts exactly \(H\) steps, no discounting

Example

[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/switch; transitions labeled stay: \(1\), switch: \(1\), stay: \(p_1\), \(1-p_1\), switch: \(p_2\), \(1-p_2\).]

  • Reward:  \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
  • Finite horizon \(H\)
  • How might the optimal policy differ when \(t\) is close to \(H\)?


Time Varying Policies and Value

We consider time-varying policies $$\pi = (\pi_0,\dots,\pi_{H-1})$$

The value of a state also depends on time

$$V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1}  r(s_k, a_k) \mid s_t=s,\,s_{k+1}\sim P(s_k, a_k),\,a_k\sim \pi_k(s_k)\right]$$

Example

[Figure: the same two-state MDP with states \(0\) and \(1\); transitions labeled stay: \(1\), switch: \(1\), stay: \(p_1\), \(1-p_1\), switch: \(p_2\), \(1-p_2\).]

  • Reward:  \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
  • Consider a policy that "stays" for all states and time
  • \(V^\pi_t(0) =\) ? (PollEv)
    • Answer: \(V^\pi_t(0) = H-t\)
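
A quick backward-evaluation check of this answer for the always-"stay" policy; the rewards and the stay-transition probability `p1` below encode the example above, and the numerical values are illustrative.

```python
import numpy as np

H, p1 = 5, 0.3                           # horizon and stay-transition probability (illustrative)
r_stay = np.array([1.0, 0.0])            # under "stay": reward +1 in state 0, 0 in state 1
P_stay = np.array([[1.0, 0.0],           # state 0 stays in state 0
                   [p1, 1.0 - p1]])      # state 1 moves to state 0 with probability p1
V = np.zeros(2)                          # V_H^pi = 0
for t in range(H - 1, -1, -1):
    V = r_stay + P_stay @ V              # V_t^pi = r^pi + P_pi V_{t+1}^pi (no discount)
    print(t, V[0])                       # prints V_t^pi(0) = H - t
```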

Finite Horizon Bellman Eqns

Bellman Expectation Equation: \(\forall s\),

 \(V_t^{\pi}(s) = \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) +  \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]\)

Q function

 \(Q_t^{\pi}(s, a) = \ r(s, a) +  \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')]\)

Rather than a fixed-point recursion, in the finite-horizon setting we get an iterative equation: \(V_t^\pi\) is defined in terms of \(V_{t+1}^\pi\)

Dynamic Programming

  • Initialize \(V^\star_H = 0\)
  • For \(t=H-1, H-2, ..., 0\):
    • \(Q_t^\star(s,a) = r(s,a)+\mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]\)
    • \(\pi_t^\star(s) = \arg\max_a Q_t^\star(s,a)\)
    • \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)

Bellman optimality is also an iterative rather than a recursive equation: \(V^\star_t(s)=\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]\)

  • We can solve this iteration directly and exactly (rather than approximately like VI and PI)
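
A minimal sketch of this backward dynamic program for a tabular finite-horizon MDP, under the same assumed array layout as before, with the optimal policy stored as an \(H\times S\) array.

```python
import numpy as np

def finite_horizon_dp(R, P, H):
    """Backward DP. R: (S, A) rewards, P: (S, A, S) transitions. Returns V* and pi*."""
    S, A = R.shape
    V = np.zeros((H + 1, S))                 # V*_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = R + P @ V[t + 1]                 # Q*_t(s,a) = r(s,a) + E_{s'~P(s,a)}[V*_{t+1}(s')]
        pi[t] = Q.argmax(axis=1)             # pi*_t(s) = argmax_a Q*_t(s,a)
        V[t] = Q.max(axis=1)                 # V*_t(s) = Q*_t(s, pi*_t(s))
    return V, pi
```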

Example

  • Reward:  \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
  • \(Q^{\star}_{H-1}(s,a)=r(s,a)\) for all \(s,a\) (since \(V^\star_H=0\))
  • \(\pi^\star_{H-1}(s)=\)stay for all \( s\)
  • \(V^\star_{H-1}=\begin{bmatrix}1\\0\end{bmatrix}\), \(Q^\star_{H-2}=\begin{bmatrix}2&\frac{1}{2}\\p & -\frac{1}{2}+2p\end{bmatrix}\) (rows: \(s=0,1\); columns: stay, switch)
  • \(\pi^\star_{H-2}(s)=\)stay for all \(s\)
  • \(V^\star_{H-2}=\begin{bmatrix}2\\p\end{bmatrix}\), \(Q^\star_{H-3}=\begin{bmatrix}3&\frac{1}{2}+p\\(1-p)p+2p & -\frac{1}{2}+(1-2p)p+4p\end{bmatrix}\)
  • \(\pi^\star_{H-3}(0)=\)stay and \(\pi^\star_{H-3}(1)=\)switch if \(p\geq 1-\frac{1}{\sqrt{2}}\)

[Figure: the two-state MDP with \(p_1=p\) and \(p_2=2p\); transitions labeled stay: \(1\), switch: \(1\), stay: \(p\), \(1-p\), switch: \(2p\), \(1-2p\).]
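
As a check on the hand computation, the example can be run through the `finite_horizon_dp` sketch above, encoding \(p_1=p\) and \(p_2=2p\); the specific value of \(p\) is illustrative.

```python
import numpy as np

p, H = 0.35, 6                              # illustrative; need 2p <= 1
R = np.array([[1.0, 0.5],                   # r(0, stay), r(0, switch)
              [0.0, -0.5]])                 # r(1, stay), r(1, switch)
P = np.array([[[1.0, 0.0], [0.0, 1.0]],     # from state 0: stay, switch
              [[p, 1 - p], [2 * p, 1 - 2 * p]]])   # from state 1: stay, switch
V, pi = finite_horizon_dp(R, P, H)
print(V[H - 2], pi[H - 3])                  # compare with V*_{H-2} = [2, p] and pi*_{H-3} above
```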

Recap

  • PSet 2 due Monday
  • PA 1 due next Wednesday

 

  • Value & Policy Iteration
  • Finite Horizon MDP
  • Dynamic Programming

 

  • Next lecture: continuous control
