CS 4/5789: Introduction to Reinforcement Learning

Lecture 5: Value Iteration

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Announcements

  • Add period ends today!
  • Homework this week
    • Problem Set 1 due tonight
    • Problem Set 2 released tonight, due 2/12
    • Programming Assignment 1 due 2/14
  • Office hours today after lecture (4:10-5:10 in Olin 255)
    • I prioritize lecture/conceptual questions over HW

Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

Recap: Infinite Horizon

  • Accumulate discounted reward on infinite horizon: $$V^\pi(s) = \mathbb E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \mid s_0=s, s_{t+1}\sim P(s_t,a_t), a_t\sim \pi(s_t) \right]$$

  • Bellman Consistency Equation leads to Exact & Approximate Policy Evaluation (PE) algorithms

  • Approximate Policy Evaluation is a fixed point iteration of the Bellman Operator, which is a contraction (see the sketch below) $$\mathcal J_\pi[V] = R^\pi + \gamma P_\pi V$$

assuming deterministic reward function and stationary, state-dependent policy (possibly stochastic)
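
As a concrete reminder, here is a minimal sketch of approximate PE as fixed-point iteration of \(\mathcal J_\pi\) on a tabular MDP. The array layout and function name are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

def approx_policy_evaluation(R_pi, P_pi, gamma, num_iters=100):
    """Fixed-point iteration of J_pi[V] = R^pi + gamma * P_pi @ V.

    R_pi: (S,) vector with R_pi[s] = r(s, pi(s))
    P_pi: (S, S) matrix with P_pi[s, s'] = P(s' | s, pi(s))
    """
    V = np.zeros(len(R_pi))          # any initialization works; J_pi is a contraction
    for _ in range(num_iters):
        V = R_pi + gamma * P_pi @ V  # V_{i+1} = J_pi[V_i]
    return V
```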

Recap: Optimal Policy

  • Optimal policies uniformly dominate in value

    • i.e. they have the highest value \(V^\star(s)\) for all \(s\)

  • The finite horizon Bellman Optimality Equation (BOE) enables efficient policy optimization with dynamic programming

  • The optimal policy is greedy with respect to the optimal value \(V^\star\)

Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

How can we efficiently find a policy that maximizes expected discounted reward?

This is the big question for today's lecture.

[Diagram: the agent–environment interaction loop, with \(a_t=\pi_t(s_t)\), \(r_t= r(s_t, a_t)\), and \(s_{t}\sim P(s_{t-1}, a_{t-1})\)]

Optimal Policy

  • Define: An optimal policy \(\pi_\star\) is one where \(V^{\pi_\star}(s) \geq V^{\pi}(s)\) for all \(s\in\mathcal S\), and policies \(\pi\in\Pi\)
    • i.e. the policy dominates other policies for all states
    • vector notation: \(V^{\pi_\star} \geq V^{\pi}\)
  • Thus we can write \(V^\star(s) = V^{\pi_\star}(s)\)
  • Naive algorithm: enumeration compares the value of all possible policies (using PE)
    • Even only considering deterministic state-dependent policies, complexity would be \(\mathcal O(A^S S^3)\)!

\(\Pi\) is all possible policies (including stochastic, history-dependent)

Bellman Optimality Equation

  • Bellman Optimality Equation (BOE): A value function
    \(V\) satisfies the BOE if for all \(s\), $$V(s)=\max_a~~ r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}[V(s')]$$
  • Theorem (Bellman Optimality):

    1. \(\pi\) is an optimal policy if and only if \(V^{\pi}\) satisfies the BOE
    2. The optimal policy is greedy with respect to the optimal value function $$\pi^\star(s) \in \arg\max_a r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$

The bracketed quantity \(r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}[V^\star(s')]\) has the shorthand \(Q^\star(s,a)\).
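
Written out, this shorthand gives an equivalent (standard) restatement of the theorem in terms of \(Q^\star\): $$Q^\star(s,a) = r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}[V^\star(s')], \qquad V^\star(s) = \max_{a\in\mathcal A} Q^\star(s,a), \qquad \pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)$$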

Example

[Diagram: two-state MDP with states \(0\) and \(1\). From state \(0\): "stay" remains at \(0\) with probability \(1\), and "move" transitions to \(1\) with probability \(1\). From state \(1\): "stay" remains at \(1\) with probability \(p_1\) (otherwise transitions to \(0\)), and "move" transitions to \(0\) with probability \(p_2\) (otherwise remains at \(1\)).]

  • Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) move
  • Consider the policy \(\pi(s)=\)stay for all \(s\)
  • \(V^\pi(0) =\frac{1}{1-\gamma}\), \(V^\pi(1) =\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}\) (see the short derivation after this list)
  • When is this optimal?
    • \(p_2\leq \frac{p_1}{1-\gamma p_1}+\frac{1-\gamma}{2}\)
  • i.e. staying everywhere is optimal when the effectiveness of "move" (\(p_2\)) is small compared with the "stickiness" of state \(1\) (\(p_1\)) and the discount factor \(\gamma\)
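
The first value follows from a one-line application of the Bellman Consistency Equation at state \(0\), using that "stay" keeps the state at \(0\) and earns reward \(+1\): $$V^\pi(0) = r(0,\text{stay}) + \gamma\,\mathbb E_{s'\sim P(0,\text{stay})}[V^\pi(s')] = 1 + \gamma V^\pi(0) \quad\implies\quad V^\pi(0) = \frac{1}{1-\gamma}$$ The value \(V^\pi(1)\) comes from solving the analogous consistency equation at state \(1\).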


Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

Value Iteration

  • The Bellman Optimality Equation is a fixed point equation!
  • Define the Bellman Operator \(\mathcal J_\star:\mathbb R^S\to\mathbb R^S\) as $$\mathcal J_\star[V](s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]~~\forall~s$$
  • Then BOE can be written as \(V = \mathcal J_\star[V]\)
  • Idea: find \(\hat V\) with fixed point iteration of \(\mathcal J_\star\), then use argmax policy \(\hat\pi\)

Value Iteration

  • Initialize \(V_0\)
  • For \(i=0,\dots,N-1\):
    • \(V_{i+1}=\mathcal J_\star[V_i]\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}~~ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_N(s')]\) \(\forall s\)
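
A minimal sketch of this procedure for a tabular MDP (the array shapes and names below are illustrative assumptions, not the PA 1 interface):

```python
import numpy as np

def value_iteration(r, P, gamma, num_iters=1000):
    """Tabular Value Iteration.

    r: (S, A) array with r[s, a] = r(s, a)
    P: (S, A, S) array with P[s, a, s'] = P(s' | s, a)
    Returns the final iterate V_N and the greedy policy w.r.t. V_N.
    """
    S, A = r.shape
    V = np.zeros(S)                              # V_0 (any initialization)
    for _ in range(num_iters):
        V = (r + gamma * P @ V).max(axis=1)      # V_{i+1} = J_star[V_i]
    Q_N = r + gamma * P @ V                      # Q_N[s, a] = r(s,a) + gamma * E_{s'~P(s,a)}[V_N(s')]
    return V, Q_N.argmax(axis=1)                 # hat{pi}(s) in argmax_a Q_N(s, a)
```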

Example: PA 1

[Figure: the gridworld from Programming Assignment 1 — a 4×4 grid of states numbered 0–15, plus an additional state 16.]

Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

Convergence of VI

Lemma (Contraction): For any \(V, V'\) $$\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

To show that Value Iteration converges, we use a contraction argument

Theorem (Convergence): For iterates \(V_i\) of VI, $$\|V_i - V^\star\|_\infty \leq \gamma^i \|V_0-V^\star\|_\infty$$

Convergence of VI

Lemma (Contraction): For any \(V, V'\) $$\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

Proof

  • \(|\mathcal J_\star V(s) - \mathcal J_\star V'(s)| = \left|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]\right|\)
    • \(\leq \max_{a\in\mathcal A} \left| r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]\right|\) (Basic Inequality, PSet 1)
    • \(= \max_{a\in\mathcal A}\gamma \left|  \mathbb{E}_{s' \sim P( s, a)} [V(s')] -  \mathbb{E}_{s' \sim P( s, a)} [V'(s')]\right|\) (cancelling the reward terms)
    • \(\leq \max_{a\in\mathcal A}\gamma  \mathbb{E}_{s' \sim P( s, a)} [|V(s') -  V'(s')|]\) (Basic Inequality, PSet 1)
    • \(\leq \max_{s'\in\mathcal S}\gamma |V(s') -  V'(s')|\) (expectation bounded by maximum; Basic Inequality, PSet 1)
    • \(= \gamma \|V -  V'\|_\infty\) (definition of \(\|\cdot\|_\infty\))
  • The above holds for all \(s\) so \(\max_s|\mathcal J_\star V(s) - \mathcal J_\star V'(s)| =\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty\)

[Figure: "Basic Inequalities" — plots of \(f_1\), \(f_2\), and \(f_1-f_2\) illustrating \(|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x |f_1(x)-f_2(x)|\) and the analogous bound for the expectations \(\mathbb E[f_1(x)]\), \(\mathbb E[f_2(x)]\).]
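
As a quick numerical sanity check of the lemma (an illustrative sketch on a randomly generated MDP; all names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

r = rng.uniform(size=(S, A))                # arbitrary rewards r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states

def bellman_opt(V):
    # J_star[V](s) = max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]
    return (r + gamma * P @ V).max(axis=1)

V1, V2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.abs(bellman_opt(V1) - bellman_opt(V2)).max()   # ||J*V1 - J*V2||_inf
rhs = gamma * np.abs(V1 - V2).max()                     # gamma * ||V1 - V2||_inf
assert lhs <= rhs + 1e-12                               # the contraction inequality holds
```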

Convergence of VI

Proof

  • \(\|V_i - V^\star\|_\infty = \|\mathcal J_\star V_{i-1} -\mathcal J_\star  V^\star\|_\infty\) (Definition of VI and BOE)
    • \(\leq\gamma\|V_{i-1} -  V^\star\|_\infty\) (Contraction Lemma)
  • Proof by induction using the above inequality
    • Base case \((i=0)\):\( \|V_0-V^\star\|_\infty = \|V_0-V^\star\|_\infty\)
    • Induction step: Assume \(\|V_k - V^\star\|_\infty \leq \gamma^{k}\|V_0-V^\star\|_\infty\). By above inequality, we have that $$\|V_{k+1} - V^\star\|_\infty \leq \gamma \|V_k-V^\star\|_\infty \leq \gamma \cdot \gamma^k\|V_0-V^\star\|_\infty$$ thus \(\|V_{k+1} - V^\star\|_\infty \leq \gamma^{k+1}\|V_0-V^\star\|_\infty\).

Theorem (Convergence): For iterates \(V_i\) of VI, $$\|V_i - V^\star\|_\infty \leq \gamma^i \|V_0-V^\star\|_\infty$$
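
A standard consequence of the theorem (a quick calculation, not stated on the slides): to guarantee \(\|V_N - V^\star\|_\infty \leq \epsilon\), it suffices to run $$N \geq \frac{\log\left(\|V_0 - V^\star\|_\infty / \epsilon\right)}{\log(1/\gamma)}$$ iterations, since this rearranges to \(\gamma^N \|V_0-V^\star\|_\infty \leq \epsilon\).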

Performance of VI Policy

Theorem (Suboptimality): For policy \(\hat\pi\) from VI, \(\forall s\) $$ V^\star(s) - V^{\hat\pi}(s)  \leq \frac{2\gamma}{1-\gamma} \cdot \gamma^N \|V_0-V^\star\|_\infty$$

  • The iterate \(V_N\) in VI is not necessarily equal to the value of the policy \(\hat \pi\) after \(N\) iterations $$V^{\hat\pi} = (I-\gamma P_{\hat\pi})^{-1}R^{\hat\pi}$$
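
To compute \(V^{\hat\pi}\) (e.g. to check this suboptimality bound empirically), one can solve the linear system given by the Bellman Consistency Equation. A minimal sketch, reusing the tabular arrays assumed in the VI sketch above:

```python
import numpy as np

def evaluate_policy_exact(r, P, policy, gamma):
    """Exact PE for a deterministic policy: V^pi = (I - gamma * P_pi)^{-1} R^pi."""
    S = r.shape[0]
    R_pi = r[np.arange(S), policy]   # R^pi[s] = r(s, pi(s))
    P_pi = P[np.arange(S), policy]   # P_pi[s, s'] = P(s' | s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```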

Proof of VI Performance

This is optional material. In the proof we use \(t\) in place of \(i\), write \(V^t\) for the VI iterate \(V_t\), and let \(\pi_t\) denote the greedy policy with respect to \(V^t\) (so \(\hat\pi=\pi_N\)).

  • Claim: \(V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\)
  • Recursing once: \(V^{\pi_t}(s) - V^\star(s) \)
    • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\right]-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\)
    • \(= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')]\right]-2\gamma^{t+2} \|V^{\pi_\star}-V^{0}\|_\infty-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\)
  • Recursing \(k\) times,
    \(V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^k\gamma^{\ell} \|V^{\pi_\star}-V^{0}\|_\infty\)
  • Letting \(k\to\infty\), \(V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V^{\pi_\star}-V^{0}\|_\infty\)

Proof of Claim:

\(V^{\pi_t}(s) - V^\star(s)\)

  • \(= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]\) (Bellman Expectation, add and subtract)
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{t}(s')]\) (Grouping terms, add and subtract)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{t}(s')]\) (Definition of \(\pi_t\) as argmax)
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]- \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{t}(s')]\) (Bellman Expectation on \(V^\star\) and cancelling reward terms)
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+\gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V^{t}(s')]\) (Linearity of Expectation)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V^{\pi_\star}-V^{t}\|_\infty\) (Basic Inequality)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\) (Convergence Theorem)

Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

Bellman Optimality Proof

  • Theorem (Bellman Optimality): \(\pi\) is an optimal policy if and only if  (\(\iff\)) \(V^{\pi}\) satisfies the BOE $$ V^{\pi}=\mathcal J_\star[V^{\pi}]$$
  • Proof Outline
    1. (\(\implies\)) If \(\pi^\star\) is an optimal policy, then \(V^{\pi^\star}\) satisfies BOE
    2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy
  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
    • \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \) (BCE)
      • \(\leq \max_{a\in\mathcal A} r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\) (PSet 1)
      • \(= r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]\) (Defn of \(\hat\pi\))
    • In vector form, \(V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}\)

Bellman Optimality Proof

  1. (\(\implies\)) If \(\pi^\star\) is an optimal policy, then \(V^{\pi^\star}\) satisfies BOE
  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
    • \(V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}\) i.e. \(V^{\pi_\star} \leq \mathcal J_{\hat\pi}[V^{\pi_\star}]\)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}\) (subtract from both sides)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}\) (BCE)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star}  -V^{\hat \pi})\)       (\(\star\))
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}-  V^{\hat \pi})\) (apply (\(\star\)) to RHS)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat\pi})^k(V^{\pi_\star}-  V^{\hat \pi})\) (apply (\(\star\)) \(k\) times)
    • \(V^{\pi_\star} - V^{\hat\pi} \leq 0\) (limit \(k\to\infty\))

Consider vectors \(V,V'\) and matrix \(P\).

If \(V\leq V'\) then \(PV\leq PV'\) when \(P\) has non-negative entries, where inequalities hold entrywise.

To see why this is true, consider each entry:

\([PV]_i = \sum_{j=1}^S P_{ij} V_j \leq \sum_{j=1}^S P_{ij} V'_j = [PV']_i\)

The middle inequality holds because \(V_j\leq V'_j\) for each \(j\) and all of the \(P_{ij}\) are non-negative.

\((P^{\hat\pi})^k\) remains bounded because \(P^{\hat\pi}\) is a stochastic matrix (it represents probabilities, so its entries lie in \([0,1]\) and its maximal eigenvalue is \(1\)); hence \(\gamma^k (P^{\hat\pi})^k(V^{\pi_\star}-V^{\hat\pi})\to 0\) as \(k\to\infty\).

Bellman Optimality Proof

  1. (\(\implies\)) If \(\pi^\star\) is an optimal policy, then \(V^{\pi^\star}\) satisfies BOE
  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We showed that \( V^{\pi_\star}\leq V^{\hat \pi}\)
  • Therefore, it must be that \(V^{\pi_\star} = V^{\hat\pi}\)
    • this means \(\hat \pi(s)\) is an optimal policy!
  • By definition of \(\hat\pi\) and the BCE, \(V^{\hat \pi}\) satisfies the BOE
  • Therefore, \(V^{\pi_\star}(=V^{\hat \pi})\) must also satisfy it.

Bellman Optimality Proof

  • Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
  • By part 1, we know that \(V^{\pi_\star}\) satisfies BOE
  • We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
    • \(=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\) (BOE by assumption and part 1)
    • \(\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\)
      (basic inequality PSet 1)
    • \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)

2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy

Bellman Optimality Proof

  • Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
  • We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
    • \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
    • \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]\) (basic inequality PSet 1)
    • \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\) (applying the same bound at \(s'\))
    • \(\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
    • \(\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]\)
    • \(\leq 0\) (letting \(k\to\infty\); \(\gamma^k\) times a bounded quantity vanishes)
  • Therefore, \(V^\pi = V^{\pi_\star}\) so \(\pi\) must be optimal

2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy

Recap: Proof

  • Proof Outline
    1. (\(\implies\)) If \(\pi^\star\) is an optimal policy, then \(V^{\pi^\star}\) satisfies BOE
      • On the way, showed the following was optimal $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
    2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy

Recap

  • PSet 1 due TONIGHT
  • PSet 2 due next Monday
  • PA 1 due next Wednesday

 

  • Optimal Policies
  • Value Iteration

 

  • Next lecture: Policy Iteration
