CS 4/5789: Introduction to Reinforcement Learning

Lecture 6: Policy Iteration

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Announcements

  • Auditing (unofficial)
  • Homework
    • Problem Set 2 due Monday
    • Programming Assignment 1 due next Wednesday
    • PSet 3, PA 2 released next Wednesday
  • First exam is Monday 3/4 during lecture
    • If you have a conflict, post on Ed ASAP!

Agenda

1. Recap: VI

2. Policy Iteration

3. PI Convergence

4. VI/PI Comparison

Recap: Value Iteration

Value Iteration

  • Initialize \(V_0\)
  • For \(i=0,\dots,N-1\):
    • \(V_{i+1}(s) = \max_{a\in\mathcal A}  r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{i}(s') \right]\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}\underbrace{r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_N(s')]}_{Q_N(s,a)}\)
  • Idea: find approximately optimal \(V_N\) with fixed point iteration, then get approximately optimal policy
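
For concreteness, here is a minimal NumPy sketch of this loop, assuming a tabular MDP stored as a reward array `r` of shape \((S, A)\) and a transition tensor `P` of shape \((S, A, S)\); the names and layout are illustrative, not taken from the course assignments.

```python
import numpy as np

def value_iteration(r, P, gamma, N, V0=None):
    """N steps of value iteration on a tabular MDP.

    r: rewards, shape (S, A); P: transitions, shape (S, A, S);
    gamma: discount factor; V0: optional initial value vector.
    """
    S, A = r.shape
    V = np.zeros(S) if V0 is None else np.array(V0, dtype=float)
    for _ in range(N):
        # Q_i(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)}[V_i(s')]
        Q = r + gamma * P @ V          # shape (S, A)
        V = Q.max(axis=1)              # V_{i+1}(s) = max_a Q_i(s, a)
    pi_hat = (r + gamma * P @ V).argmax(axis=1)   # greedy policy from Q_N
    return V, pi_hat
```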

Q Value Iteration

  • Initialize \(Q_0\)
  • For \(t=0,\dots,T-1\):
    • \(Q_{t+1}(s, a) =   r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t}(s', a') \right]\)
  • Return \(\displaystyle \hat\pi(s) =\arg\max_{a\in\mathcal A} Q_T (s,a)\)

We can think of the Q function as an \(S\times A\) array or an \(SA\) vector
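
The same idea written in terms of Q values, again as a sketch under the `r`, `P` conventions assumed above.

```python
import numpy as np

def q_value_iteration(r, P, gamma, T, Q0=None):
    """T steps of Q value iteration; Q is stored as an (S, A) array."""
    S, A = r.shape
    Q = np.zeros((S, A)) if Q0 is None else np.array(Q0, dtype=float)
    for _ in range(T):
        # Q_{t+1}(s, a) = r(s, a) + gamma * E_{s'}[ max_{a'} Q_t(s', a') ]
        Q = r + gamma * P @ Q.max(axis=1)
    return Q, Q.argmax(axis=1)         # hat{pi}(s) = argmax_a Q_T(s, a)
```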

  • Definition as (discounted) cumulative reward
    • \(V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s,P,\pi]\)
    • \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s, a_0=a,P,\pi]\)
  • Bellman Consistency: translate between \(V^\pi\) and \(Q^\pi\)
    •  \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ Q^{\pi}(s, a) \right]\)
    • \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]\)
  • Bellman Optimality: translate between \(V^\star\) and \(Q^\star\)
    • \(V^*(s) = \max_{a\in\mathcal A} Q^*(s, a)\)
    • \( Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ V^*(s') \right]\)

Recap: Convergence of VI

Define the Bellman Operator \(\mathcal J_\star:\mathbb R^S\to\mathbb R^S\) as $$\mathcal J_\star[V](s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]~~\forall~s$$

VI: For \(i=0,\dots,N-1\):

  • \(V_{i+1} = \mathcal J_\star(V_i)\)
  • Contraction: For any \(V, V'\), \(\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty\)
  • Convergence: For iterates \(V_i\) of VI, \(\|V_i - V^\star\|_\infty \leq \gamma^i \|V_0-V^\star\|_\infty\)
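
A quick numerical sanity check of the contraction property on a small randomly generated MDP; the generator and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
r = rng.uniform(size=(S, A))                  # arbitrary rewards
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over s'

def bellman_opt(V):
    # J_*[V](s) = max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]
    return (r + gamma * P @ V).max(axis=1)

V, Vp = rng.normal(size=S), rng.normal(size=S)
lhs = np.abs(bellman_opt(V) - bellman_opt(Vp)).max()
rhs = gamma * np.abs(V - Vp).max()
print(lhs <= rhs + 1e-12)   # the contraction property guarantees True
```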

Recap: Performance

Theorem (Suboptimality): For policy \(\hat\pi\) from \(N\) steps of VI, \(\forall s\) $$ V^\star(s) - V^{\hat\pi}(s)  \leq \frac{2\gamma}{1-\gamma} \cdot \gamma^N \|V_0-V^\star\|_\infty$$

  • Performance of \(\hat \pi(s)=\arg\max_a Q_N(s,a)\) quantified by \(V^{\hat\pi}\)
  • \(V_N\) may not be exactly equal to \(V^{\hat\pi}\)
    • \(V^{\hat\pi} = (I-\gamma P_{\hat\pi})^{-1}R^{\hat\pi}\)
    • \(V_N\) may not correspond to the value of any policy
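
Since the suboptimality bound is about \(V^{\hat\pi}\), it helps to evaluate \(\hat\pi\) exactly; a small sketch of the matrix-inverse formula above, with the same assumed `r`, `P` layout.

```python
import numpy as np

def evaluate_policy(r, P, gamma, pi):
    """Exact policy evaluation: V^pi = (I - gamma * P_pi)^{-1} R^pi."""
    S = r.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, pi]     # shape (S, S): row s is P(. | s, pi(s))
    R_pi = r[idx, pi]     # shape (S,):  r(s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```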

Example

[Figure: two-state MDP with states \(0\) and \(1\). In state \(0\), stay remains in \(0\) with probability \(1\) and move transitions to \(1\) with probability \(1\). In state \(1\), stay remains in \(1\) with probability \(p_1\) and transitions to \(0\) with probability \(1-p_1\); move remains in \(1\) with probability \(1-p_2\) and transitions to \(0\) with probability \(p_2\).]
  • Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) move
  • We know \(V^\star(0)=\frac{1}{1-\gamma}\)
  • Suppose VI with \(V_0=[10,10]\)
    • \(V_i(0)=1+\gamma V_{i-1}(0) \)
    • \(V_i(0)=\sum_{k=0}^{i-1} \gamma^k + \gamma^i\cdot 10\)
  • \(V_i\) converges towards \(V^\star\), but there is no policy achieving value \(V_i\)!

Four possible policies:

  1. \(\pi(0)=S,\pi(1)=S\)
    • \(V(0)=\frac{1}{1-\gamma}, V(1)=\gamma p_1 V(1) + \gamma (1-p_1) V(0)\)
  2. \(\pi(0)=S,\pi(1)=M\)
    • \(V(0)=\frac{1}{1-\gamma}, V(1)=-0.5+\gamma (1-p_2) V(1) + \gamma p_2 V(0)\)
  3. \(\pi(0)=M,\pi(1)=M\)
    • \(V(0) = 0.5+\gamma V(1)\)
    • \(V(1)=-0.5+\gamma (1-p_2) V(1) + \gamma p_2 V(0)\)
  4. \(\pi(0)=M,\pi(1)=S\)
    • \(V(0) = 0.5+\gamma V(1)\)
    • \(V(1)=-0.5+\gamma p_1 V(1) + \gamma(1-p_1) V(0)\)

[Plot: the four policies' values in the \((V(0), V(1))\) plane, with \(\frac{1}{1-\gamma}\) marked.]
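
A sketch of this example with illustrative values for \(p_1\), \(p_2\), and \(\gamma\) (the slides leave them unspecified): running VI from \(V_0=[10,10]\) shows \(V_i(0)\) approaching \(\frac{1}{1-\gamma}\) from above, so the intermediate iterates exceed the best achievable value.

```python
import numpy as np

# Illustrative parameters; p1, p2, gamma are assumptions, not from the slides.
p1, p2, gamma = 0.8, 0.7, 0.8
r = np.array([[1.0, 0.5],        # r(0, stay) = 1,  r(0, move) = 1 - 1/2
              [0.0, -0.5]])      # r(1, stay) = 0,  r(1, move) = -1/2
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]             # stay in state 0
P[0, 1] = [0.0, 1.0]             # move: 0 -> 1
P[1, 0] = [1 - p1, p1]           # stay: remain in 1 w.p. p1
P[1, 1] = [p2, 1 - p2]           # move: 1 -> 0 w.p. p2

V = np.array([10.0, 10.0])       # V_0 = [10, 10]
for _ in range(30):
    V = (r + gamma * P @ V).max(axis=1)
# After i updates, V(0) = 5 + 5 * gamma**i > V*(0) = 1/(1-gamma) = 5,
# so no policy achieves the intermediate value V_i.
print(V[0], 1 / (1 - gamma))     # V_i(0) approaches 1/(1-gamma) from above
```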

Agenda

1. Recap: VI

2. Policy Iteration

3. PI Convergence

4. VI/PI Comparison

Policy Iteration

Policy Iteration

  • Initialize \(\pi^0:\mathcal S\to\mathcal A\)
  • For \(i=0,\dots,N-1\):
    • Compute \(V^{\pi^i}\) with Policy Evaluation
    • Policy Improvement: \(\forall s\), $$\pi^{i+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi^i}(s')]$$
  • Idea: compute "greedy" (i.e. argmax) policy at every step
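
A sketch of tabular policy iteration with exact policy evaluation by matrix inversion, reusing the assumed `r` (shape \((S,A)\)) and `P` (shape \((S,A,S)\)) layout; the early stop anticipates the finite-termination property discussed later.

```python
import numpy as np

def policy_iteration(r, P, gamma, N):
    """Tabular PI: exact evaluation (matrix inversion) + greedy improvement."""
    S, A = r.shape
    idx = np.arange(S)
    pi = np.zeros(S, dtype=int)                      # arbitrary initial policy pi^0
    V = np.zeros(S)
    for _ in range(N):
        # Policy Evaluation: V^pi = (I - gamma * P_pi)^{-1} R^pi
        V = np.linalg.solve(np.eye(S) - gamma * P[idx, pi], r[idx, pi])
        # Policy Improvement: greedy with respect to V^pi
        pi_next = (r + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_next, pi):              # policy unchanged => optimal
            break
        pi = pi_next
    return pi, V
```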

Example: PA 1

[Figure: the \(4\times 4\) gridworld from PA 1, with states numbered 0–15 and an additional state 16.]

Policy Iteration

  • Two key properties:
    1. Monotonic Improvement: \(V^{\pi^{i+1}} \geq V^{\pi^i}\)
    2. Convergence: \(\|V^{\pi^i} - V^\star\|_\infty \leq\gamma^i \|V^{\pi^0}-V^\star\|_\infty\)

Agenda

1. Recap: VI

2. Policy Iteration

3. PI Convergence

4. VI/PI Comparison

Monotonic Improvement

Lemma (Monotonic Improvement): For iterates \(\pi^i\) of PI, the value monotonically improves, i.e. \( V^{\pi^{i+1}} \geq V^{\pi^{i}}\)

Proof:

  • \(V^{\pi^{i+1}}(s) - V^{\pi^i}(s) = \)
    • \(=r(s,\pi^{i+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi^{i+1}(s))}[V^{\pi^{i+1}}(s')]\)
      \(- (r(s,\pi^i(s))+\gamma\mathbb E_{s'\sim P(s,\pi^i(s))}[V^{\pi^{i}}(s')]) \) (Bellman Expectation Eq)
    • \(\geq r(s,\pi^{i+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi^{i+1}(s))}[V^{\pi^{i+1}}(s')]\)
      \(- (r(s,\pi^{i+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi^{i+1}(s))}[V^{\pi^{i}}(s')]) \) (Policy Improvement step)
    • \(= \gamma\mathbb E_{s'\sim P(s,\pi^{i+1}(s))}[V^{\pi^{i+1}}(s')-V^{\pi^{i}}(s')] \)
  • In vector form, \(V^{\pi^{i+1}} - V^{\pi^i} \geq \gamma P_{\pi^{i+1}} (V^{\pi^{i+1}} - V^{\pi^i})\)
    • \(V^{\pi^{i+1}} - V^{\pi^i} \geq \gamma^k P^k_{\pi^{i+1}} (V^{\pi^{i+1}} - V^{\pi^i})\) (iterate \(k\) times)
    • letting \(k\to\infty\), since \(\gamma^k\to 0\) and \(P^k_{\pi^{i+1}}\) stays bounded, \(V^{\pi^{i+1}} - V^{\pi^i} \geq 0\)

What about VI? PollEv

Consider vectors \(V, V'\) and a matrix \(P\) with nonnegative entries.

In homework, you will show that if \(V\leq V'\) then \(PV\leq PV'\) (inequalities hold entrywise).

You will also show that \(P_\pi^k\) stays bounded when \(P_\pi\) is a stochastic matrix.

Convergence of PI

Theorem (PI Convergence): For \(\pi^i\) from PI,  $$ \|V^{\pi^{i}}-V^\star\|_\infty \leq \gamma^i \|V^{\pi^{0}}-V^\star\|_\infty$$

Proof:

  • \(V^\star(s) - V^{\pi^{i+1}}(s) = \)
    • \(=\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - (r(s,\pi^{i+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi^{i+1}(s))}[V^{\pi^{i+1}}(s')]) \) (Bellman Optimality and Consistency Eq)
    • \(\leq \max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - (r(s,\pi^{i+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi^{i+1}(s))}[V^{\pi^{i}}(s')]) \) (Monotonic Improvement)
    • \(= \max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi^{i}}(s')]\right] \) (Definition of \(\pi^{i+1}\) in Policy Improvement)
  • \(|V^\star(s) - V^{\pi^{i+1}}(s) |\)
    • \(\leq|\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi^{i}}(s')]\right]|\)
    • \(\leq \max_a\left| r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]-(r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi^{i}}(s')])\right| \)
      (Basic Inequality PSet 1)
    • \(= \gamma\max_a\left|\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]-\mathbb E_{s'\sim P(s,a)}[V^{\pi^{i}}(s')]\right| \)
    • \(\leq  \gamma\max_{a,s'}\left|V^{\star}(s')-V^{\pi^{i}}(s')\right| \) (Basic Inequality PSet 1)
    • \(=  \gamma\|V^{\star}-V^{\pi^{i}}\|_\infty \)
    • By induction, this implies that \(\|V^{\pi^{i}}-V^\star\|_\infty \leq \gamma^i \|V^{\pi^{0}}-V^\star\|_\infty\)

Agenda

1. Recap: VI

2. Policy Iteration

3. PI Convergence

4. VI/PI Comparison

VI and PI Comparison

Policy Iteration

  • Initialize \(\pi^0:\mathcal S\to\mathcal A\)
  • For \(i=0,\dots,N-1\):
    • Policy Evaluation: \(V^{\pi^i}\)
    • Policy Improvement: \(\pi^{i+1}\)

Value Iteration

  • Initialize \(V_0\)
  • For \(i=0,\dots,N-1\):
    • Bellman Operator: \(V_{i+1}\)
  • Return \(\displaystyle \hat\pi\)
  • VI only tracks Value function, PI tracks Value and Policy
  • Both have geometric convergence
    • The error shrinks geometrically, but it is never exactly zero for finite \(N\)
  • Only PI guarantees monotone improvement

PI Converges in Finite Time

PI finds an exactly optimal policy in a finite number of iterations

  1. If PI doesn't make progress, it has found an optimal policy
    • i.e. \(V^{\pi_{i+1}} = V^{\pi_{i}} \implies \pi_{i+1}\) (and \(\pi_{i}\)) are optimal
    • Proof: \(V^{\pi_{i+1}}(s) = r(s,\pi_{i+1}(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_{i+1}(s))}[V^{\pi_{i+1}}(s')] \)
      • \(= r(s,\pi_{i+1}(s)) + \gamma\mathbb E_{s'\sim P(s,\pi_{i+1}(s))}[V^{\pi_{i}}(s')] \) (assumption)
      • \(=\max_a  r(s,a) + \gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{i}}(s')] \) (PI definition)
      • \(=\max_a  r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_{i+1}}(s')] \) (assumption)
      • Thus satisfies BOE

PI Converges in Finite Time

PI finds an exactly optimal policy in a finite number of iterations

  1. If PI doesn't make progress, it has found an optimal policy
    • \(V^{\pi_{i+1}} = V^{\pi_{i}} \implies \pi_{i+1}\) (and \(\pi_{i}\)) are optimal
  2. Non-repeating iterates of PI (unless optimal)
    • Suppose there exists \(0 \leq i_1<i_2<A^S\) such that \(\pi_{i_1}=\pi_{i_2}\).
    • By monotonic improvement, \(V^{\pi_{i_1}}\leq V^{\pi_{i_1+1}} \leq V^{\pi_{i_2}} = V^{\pi_{i_1}}\)
    • It must be that \(V^{\pi_{i_1}} = V^{\pi_{i_1+1}}\), so \(\pi_{i_1}\) is optimal by item 1.
  3. Since non-optimal iterates don't repeat and there are only \(A^S\) deterministic policies, by the pigeonhole principle there must exist \(0\leq i<A^S\) such that \(\pi_i\) is optimal.
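
A small illustration of this argument on a randomly generated MDP (an assumed setup, not from the course): run PI until the policy stops changing and compare the iteration count to the crude \(A^S\) bound.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 3, 0.9
r = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition tensor

idx = np.arange(S)
pi = np.zeros(S, dtype=int)
steps = 0
while True:
    V = np.linalg.solve(np.eye(S) - gamma * P[idx, pi], r[idx, pi])  # evaluate
    pi_next = (r + gamma * P @ V).argmax(axis=1)                     # improve
    if np.array_equal(pi_next, pi):   # no progress: pi is optimal, stop
        break
    pi, steps = pi_next, steps + 1
print(steps, "improvement steps; crude upper bound A**S =", A ** S)
```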

Finite vs. Infinite Time MDPs

Finite Horizon

  • Policy Evaluation
    • Solve Bellman Consistency Eq exactly with \(H\) steps of backwards iteration
  • Policy Optimization
    • Solve Bellman Optimality Eq exactly with \(H\) steps of backward iteration (Dynamic Programming)

Infinite Horizon

  • Policy Evaluation
    • Solve Bellman Consistency Eq exactly with matrix inversion or approx w/ fixed point iteration
  • Policy Optimization
    • Solve Bellman Optimality Eq approx w/ fixed point iteration on value (VI) or policy (PI)
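
For contrast with the infinite-horizon methods above, a sketch of exact finite-horizon dynamic programming (backward induction over \(H\) steps); the function name and the undiscounted convention are assumptions for illustration.

```python
import numpy as np

def backward_induction(r, P, H):
    """Exact finite-horizon DP: H steps of backward induction (no discounting)."""
    S, A = r.shape
    V = np.zeros((H + 1, S))                # V[H] = 0 at the terminal time
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):          # sweep backwards in time
        Q = r + P @ V[h + 1]                # Q_h(s, a) = r(s, a) + E[V_{h+1}(s')]
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return V, pi
```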

Recap

  • PSet 2 due Monday
  • PA 1 due next Wednesday
  • Exam on Monday 3/4

 

  • Policy Iteration
  • VI/PI Comparison

 

  • Next lecture: continuous control