Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Announcements

• Homework this week
• Problem Set 1 due TONIGHT
• Problem Set 2 released tonight due 2/13
• Programming Assignment 1 due 2/15
• My office hours: Tuesdays 10:30-11:30am in Gates 416A, Wednesday 4-4:50pm in Olin 255 (right after lecture)

## Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

## Recap: Optimal Policy

• The value of a state $$s$$ under a policy $$\pi$$ denoted $$V^\pi(s)$$ is the expected cumulative discounted reward starting from that state
• An optimal policy $$\pi_\star$$ is one with a dominating value,
• i.e. $$V^{\pi_\star} \geq V^{\pi}$$ for all policies $$\pi$$
• All optimal policies achieve the same value $$V^\star$$

## Recap: Example

(Figure: two-state transition diagram over states $$0$$ and $$1$$, with edges labeled stay: $$1$$, switch: $$1$$, stay: $$p_1$$, stay: $$1-p_1$$, switch: $$p_2$$, switch: $$1-p_2$$.)

• Reward:  $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• Consider the policy $$\pi(s)=$$stay for all $$s$$
• Optimal if $$p_2\leq \frac{p_1}{1-\gamma p_1}+\frac{1-\gamma}{2}$$, i.e. effectiveness of "switch" small compared with "stickiness" of 1 and discount factor


## Recap: Bellman Equations

Bellman Optimality Equation (BOE): $$\forall s$$, $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$

Theorem (Bellman Optimality):

1. If $$\pi_\star$$ is an optimal policy, then $$V^{\pi_\star}$$ satisfies the BOE
2. If $$V^\pi$$ satisfies the BOE, then $$\pi$$ is an optimal policy

Bellman Expectation Equation: $$\forall s$$,

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$

The value of a state $$s$$ under a policy $$\pi$$ denoted $$V^\pi(s)$$ is the expected cumulative discounted reward starting from that state
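Since the Bellman Expectation Equation is linear in $$V^\pi$$, in a finite MDP it can be solved exactly in vector form as $$V^\pi = (I-\gamma P^\pi)^{-1}R^\pi$$. A minimal numpy sketch on a hypothetical random MDP (the sizes and array layouts here are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# hypothetical random MDP: P[a, s, s'] = P(s' | s, a), r[s, a] = r(s, a)
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)     # normalize rows into distributions
r = rng.random((S, A))

pi = rng.integers(A, size=S)          # a deterministic policy pi(s)
P_pi = P[pi, np.arange(S), :]         # P^pi: row s is P(. | s, pi(s))
R_pi = r[np.arange(S), pi]            # R^pi: reward under pi

# Bellman Expectation Eq in vector form: V^pi = R^pi + gamma P^pi V^pi
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# check: V^pi satisfies the Bellman Expectation Equation
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
```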

## Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration


## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star}(s) = r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s, \pi_\star(s))}[V^{\pi_\star}(s')]$$ (Bellman Expectation Eq)
• $$\leq \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')] \right]$$
• $$= r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]$$ (definition of $$\hat\pi$$ as the argmax)
• Writing the above expression in vector form:
• $$V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}$$

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}$$
• $$V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}$$ (subtract from both sides)
• $$V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}$$ (Bellman Expectation Eq)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})$$       ($$\star$$)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) to RHS)

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

Consider vectors $$V,V'$$ and matrix $$P$$.

If $$V\leq V'$$ then $$PV\leq PV'$$ when $$P$$ has non-negative entries, where inequalities hold entrywise.

To see why this is true, consider each entry:

$$[PV]_i = \sum_{j=1}^S P_{ij} V_j \leq \sum_{j=1}^S P_{ij} V'_j = [PV']_i$$

The middle inequality holds because $$V_j\leq V'_j$$ for every $$j$$ and all $$P_{ij}$$ are non-negative.
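This entrywise monotonicity is easy to confirm numerically; a quick check with a hypothetical non-negative (here stochastic, as for $$P^\pi$$) matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 4))                 # matrix with non-negative entries
P /= P.sum(axis=1, keepdims=True)      # rows sum to 1 (stochastic)

V = rng.standard_normal(4)
V_prime = V + rng.random(4)            # guarantees V <= V' entrywise

# [P V]_i = sum_j P_ij V_j <= sum_j P_ij V'_j = [P V']_i
assert np.all(P @ V <= P @ V_prime)
```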

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})$$       ($$\star$$)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2(V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) to RHS)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat\pi})^k(V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) $$k$$ times)
• $$V^{\pi_\star} - V^{\hat\pi} \leq 0$$ (limit $$k\to\infty$$: $$\gamma^k\to 0$$ while $$(P^{\hat\pi})^k$$ stays bounded)
• Therefore, $$V^{\pi_\star} = V^{\hat\pi}$$

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

$$(P^{\hat\pi})^k$$ is bounded because $$P^{\hat\pi}$$ is a stochastic matrix (its rows are probability distributions), so every power is also stochastic with entries in $$[0,1]$$.

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
• We showed that $$V^{\pi_\star} = V^{\hat\pi}$$
• this means $$\hat \pi(s)$$ is an optimal policy!
• By definition of $$\hat\pi$$ and the Bellman Expectation Equation, $$V^{\hat \pi}$$ satisfies the Bellman Optimality Equation
• Therefore, $$V^{\pi_\star}$$ must also satisfy it.

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$

• If we know the optimal value $$V^\star$$ then we can write down optimal policies! $$\pi^\star(s) \in \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\star}(s')] \right]$$
• Recall the definition of the Q function: $$Q^\star(s,a)= r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')]$$
• $$\pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)$$
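Extracting an optimal policy from $$Q^\star$$ is a single argmax per state; a small illustrative sketch (the table values are made up):

```python
import numpy as np

def greedy_policy(Q):
    """pi(s) in argmax_a Q(s, a): one action per state (ties -> lowest index)."""
    return np.argmax(Q, axis=1)

# made-up Q* table with 3 states and 2 actions
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [0.0, 3.0]])
pi_star = greedy_policy(Q_star)   # -> array([1, 0, 1])
```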

## Bellman Optimality Proof

Theorem (Bellman Optimality) 2: $$\pi$$ is an optimal policy if $$V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s$$

• Consider an optimal policy $$\pi_\star$$ and the value $$V^{\pi_\star}$$
• By part 1, we know that $$V^{\pi_\star}$$ satisfies BOE
• We bound $$|V^{\pi}(s)-V^{\pi_\star}(s)|$$
• $$=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|$$ (BOE by assumption and part 1)
• $$\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|$$
• PollEV basic inequality from PSet 1: $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$
• $$\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|$$ (linearity of expectation)

## Basic Inequalities

(Figure: plots of $$f_1$$, $$f_2$$, and $$f_1-f_2$$ illustrating $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$ and $$|\mathbb E[f_1(x)] - \mathbb E[f_2(x)]| \leq \mathbb E[|f_1(x)-f_2(x)|]$$.)

## Bellman Optimality Proof

• Consider an optimal policy $$\pi_\star$$ and the value $$V^{\pi_\star}$$
• We bound $$|V^{\pi}(s)-V^{\pi_\star}(s)|$$
• $$\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|$$ (linearity of expectation)
• $$\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]$$ (basic inequality PSet 1)
• $$\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|] \right]$$ (apply the same bound at $$s'$$)
• $$\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|] \right]$$
• $$\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]$$ (repeat $$k$$ times)
• $$= 0$$ (letting $$k\to\infty$$: the value difference is bounded and $$\gamma^k\to 0$$)
• Therefore, $$V^\pi = V^{\pi_\star}$$ so $$\pi$$ must be optimal

Theorem (Bellman Optimality) 2: $$\pi$$ is an optimal policy if $$V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s$$

## Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

## Value Iteration

• The Bellman Optimality Equation is a fixed point equation! $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
• If $$V^\star$$ satisfies the BOE then $$\pi_\star(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$ is an optimal policy
• Idea: find $$\hat V$$ with fixed point iteration, then get approximately optimal policy $$\hat\pi$$.

## Value Iteration

Value Iteration

• Initialize $$V_0$$
• For $$t=0,\dots,T-1$$:
• $$V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]$$ for all $$s$$
• Return $$\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]$$ $$\forall s$$
• Idea: find $$\hat V$$ with fixed point iteration, then get approximately optimal policy $$\hat\pi$$.
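The loop above can be sketched in a few lines of numpy; a minimal illustration on a hypothetical random MDP, where `P[a, s, s']` and `r[s, a]` are assumed array layouts (not from the course assignments):

```python
import numpy as np

def value_iteration(P, r, gamma, T):
    """Run T Bellman optimality updates from V_0 = 0, then act greedily.

    Assumed layout: P[a, s, s'] = P(s' | s, a), r[s, a] = r(s, a).
    """
    V = np.zeros(r.shape[0])                      # initialize V_0
    for _ in range(T):
        # V_{t+1}(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s' | s, a) V_t(s') ]
        V = (r + gamma * (P @ V).T).max(axis=1)
    pi_hat = (r + gamma * (P @ V).T).argmax(axis=1)  # greedy w.r.t. V_T
    return V, pi_hat

# hypothetical random MDP to exercise the routine
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
V_T, pi_hat = value_iteration(P, r, gamma, T=300)

# after many iterations, V_T approximately satisfies the BOE
Q = r + gamma * (P @ V_T).T
assert np.allclose(V_T, Q.max(axis=1), atol=1e-8)
```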

## Example: PA 1

(Figure: the gridworld environment from Programming Assignment 1, with states numbered 0 through 16.)

## Bellman Operator

• Define the Bellman Operator $$\mathcal T:\mathbb R^S\to \mathbb R^S$$ as, $$\forall s$$ $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
• Nonlinear map
• Value Iteration is repeated application of the Bellman Operator
• Compare with Bellman Expectation Equation we used in Approximate Policy Evaluation

## Convergence of VI

Lemma (Contraction): For any $$V, V'$$ $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

To show that Value Iteration converges, we use a contraction argument

Lemma (Convergence): For iterates $$V_t$$ of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$

## Convergence of VI

Lemma (Contraction): For any $$V, V'$$ $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

Proof

• $$|\mathcal T V(s) - \mathcal T V'(s)| = |\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]|$$
• $$\leq \max_{a\in\mathcal A} | r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]|$$ (Basic Inequality PSet 1)
• $$= \max_{a\in\mathcal A}\gamma | \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \mathbb{E}_{s' \sim P( s, a)} [V'(s')]|$$
• $$\leq \max_{a\in\mathcal A}\gamma \mathbb{E}_{s' \sim P( s, a)} [|V(s') - V'(s')|]$$ (Basic Inequality PSet 1)
• $$\leq \max_{s'\in\mathcal S}\gamma |V(s') - V'(s')|$$ (expectation bounded by max)
• $$= \gamma \|V - V'\|_\infty$$ (definition of $$\|\cdot\|_\infty$$)
• The above holds for all $$s$$ so $$\max_s|\mathcal T V(s) - \mathcal T V'(s)| =\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
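The contraction property can be spot-checked numerically; a sketch assuming the same hypothetical `P[a, s, s']`, `r[s, a]` layout as before:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 6, 3, 0.9
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def bellman_op(V):
    # (T V)(s) = max_a [ r(s, a) + gamma E_{s' ~ P(s,a)} V(s') ]
    return (r + gamma * (P @ V).T).max(axis=1)

V1, V2 = rng.standard_normal(S), rng.standard_normal(S)
lhs = np.abs(bellman_op(V1) - bellman_op(V2)).max()  # ||T V - T V'||_inf
rhs = gamma * np.abs(V1 - V2).max()                  # gamma ||V - V'||_inf
assert lhs <= rhs + 1e-12
```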

## Convergence of VI

Proof

• $$\|V_t - V^\star\|_\infty = \|\mathcal T V_{t-1} -\mathcal T V^\star\|_\infty$$ (Definition of VI and BOE)
• $$\leq\gamma\|V_{t-1} - V^\star\|_\infty$$ (Contraction Lemma)
• We prove the Lemma by induction using the above inequality
• Base case ($$t=0$$): $$\|V_0-V^\star\|_\infty = \gamma^0\|V_0-V^\star\|_\infty$$
• Induction step: Assume $$\|V_k - V^\star\|_\infty \leq \gamma^{k}\|V_0-V^\star\|_\infty$$. By above inequality, we have that $$\|V_{k+1} - V^\star\|_\infty \leq \gamma \|V_k-V^\star\|_\infty \leq \gamma \cdot \gamma^k\|V_0-V^\star\|_\infty$$ thus $$\|V_{k+1} - V^\star\|_\infty \leq \gamma^{k+1}\|V_0-V^\star\|_\infty$$.

Lemma (Convergence): For iterates $$V_t$$ of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
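The $$\gamma^t$$ rate is visible empirically; a sketch that compares VI iterates against a near-exact $$V^\star$$ (obtained by running many extra Bellman updates on a hypothetical random MDP):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 5, 2, 0.8
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def bellman_op(V):
    return (r + gamma * (P @ V).T).max(axis=1)

# near-exact V* via many extra Bellman updates (gamma^2000 is negligible)
V_star = np.zeros(S)
for _ in range(2000):
    V_star = bellman_op(V_star)

V = np.zeros(S)
err0 = np.abs(V - V_star).max()       # ||V_0 - V*||_inf
for t in range(1, 30):
    V = bellman_op(V)
    # Convergence Lemma: ||V_t - V*||_inf <= gamma^t ||V_0 - V*||_inf
    assert np.abs(V - V_star).max() <= gamma**t * err0 + 1e-10
```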

## Performance of VI Policy

Proof

• Claim: $$V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• Recursing once: $$V^{\pi_t}(s) - V^\star(s)$$
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\right]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• $$= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')\right]-2\gamma^{t+2} \|V_0-V^{\star}\|_\infty-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• Recursing $$k$$ times,
$$V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^{k-1}\gamma^{\ell} \|V_0-V^{\star}\|_\infty$$
• Letting $$k\to\infty$$, $$V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V_0-V^{\star}\|_\infty$$

Theorem (Suboptimality): For policy $$\pi_T$$ from VI, $$\forall s$$ $$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$

Proof of Claim: (here $$\pi_t$$ is greedy with respect to the VI iterate $$V_t$$)

$$V^{\pi_t}(s) - V^\star(s)$$

• $$= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]$$ (Bellman Expectation, add and subtract)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')] - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')]$$ (grouping terms, add and subtract)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')] - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]$$ (definition of $$\pi_t$$ as argmax w.r.t. $$V_t$$)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')-V^{\star}(s')]$$ (Bellman Expectation on $$V^\star$$ cancels the reward terms)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V_t-V^{\star}\|_\infty$$ (Basic Inequality)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$ (Convergence Lemma)
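The suboptimality bound can also be checked numerically: run a few VI steps, extract the greedy policy, evaluate it exactly, and compare against a near-exact $$V^\star$$. A sketch on a hypothetical random MDP:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 5, 3, 0.9
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def bellman_op(V):
    return (r + gamma * (P @ V).T).max(axis=1)

V_star = np.zeros(S)                   # near-exact V* for reference
for _ in range(3000):
    V_star = bellman_op(V_star)

T = 5
V = np.zeros(S)
err0 = np.abs(V - V_star).max()        # ||V_0 - V*||_inf
for _ in range(T):
    V = bellman_op(V)
pi_T = (r + gamma * (P @ V).T).argmax(axis=1)   # greedy w.r.t. V_T

# evaluate pi_T exactly: V^{pi_T} = (I - gamma P^{pi_T})^{-1} R^{pi_T}
P_pi = P[pi_T, np.arange(S), :]
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r[np.arange(S), pi_T])

# Suboptimality theorem: V*(s) - V^{pi_T}(s) <= 2 gamma^{T+1}/(1-gamma) * ||V_0 - V*||_inf
bound = 2 * gamma ** (T + 1) / (1 - gamma) * err0
assert np.all(V_star - V_pi <= bound + 1e-8)
```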

## Preview: Policy Iteration

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
• Compute $$V^{\pi_t}$$ with Policy Evaluation
• Policy Improvement: $$\forall s$$, $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A}\left[ r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]\right]$$
• VI only generates a policy at the very end
• Policy Iteration is another iterative algorithm that updates a policy at every iteration step

## Preview: Policy Iteration

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
• Policy Evaluation $$V^{\pi_t}$$
• Policy Improvement $$\pi_{t+1}$$
• Two key properties:
1. Monotonic Improvement: $$V^{\pi_{t+1}} \geq V^{\pi_t}$$
2. Convergence: $$\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty$$
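The two steps of Policy Iteration can be sketched with the same hypothetical `P[a, s, s']`, `r[s, a]` arrays, using an exact linear solve for Policy Evaluation:

```python
import numpy as np

def policy_iteration(P, r, gamma, T):
    """PI sketch: exact evaluation then greedy improvement, T rounds.

    Assumed layout: P[a, s, s'] = P(s' | s, a), r[s, a] = r(s, a).
    """
    S = r.shape[0]
    pi = np.zeros(S, dtype=int)                     # initialize pi_0
    for _ in range(T):
        # Policy Evaluation: V^{pi_t} = (I - gamma P^{pi_t})^{-1} R^{pi_t}
        P_pi = P[pi, np.arange(S), :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r[np.arange(S), pi])
        # Policy Improvement: greedy one-step lookahead on V^{pi_t}
        pi = (r + gamma * (P @ V).T).argmax(axis=1)
    return pi, V

# hypothetical random MDP; on small problems PI converges in a few rounds
rng = np.random.default_rng(5)
S, A, gamma = 5, 3, 0.9
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi, V = policy_iteration(P, r, gamma, T=20)
```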

## Recap

• PSet 1 due TONIGHT
• PSet 2 due next Monday
• PA 1 due next Wednesday

• Optimal Policies
• Value Iteration

• Next lecture: Policy Iteration, Dynamic Programming

By Sarah Dean
