CS 4/5789: Introduction to Reinforcement Learning

Lecture 5: Value Iteration

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Announcements

  • Add period ends today!
  • Homework this week
    • Problem Set 1 due tonight
    • Problem Set 2 released tonight, due 2/12
    • Programming Assignment 1 due 2/14
  • Office hours today after lecture (4:10-5:10 in Olin 255)
    • I prioritize lecture/conceptual questions over HW

Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

Recap: Infinite Horizon

  • Accumulate discounted reward on infinite horizon: $$V^\pi(s) = \mathbb E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \mid s_0=s, s_{t+1}\sim P(s_t,a_t), a_t\sim \pi(s_t) \right]$$

  • Bellman Consistency Equation leads to Exact & Approximate Policy Evaluation (PE) algorithms

  • Approximate Policy Evaluation is a fixed point iteration of the Bellman Operator, which is a contraction (see the sketch below) $$\mathcal J_\pi[V] = R^\pi + \gamma P_\pi V$$

assuming deterministic reward function and stationary, state-dependent policy (possibly stochastic)
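
As a concrete reminder, here is a minimal sketch of approximate PE as fixed-point iteration of \(\mathcal J_\pi\) on a tabular MDP. The array layout and function name are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

def approx_policy_evaluation(R_pi, P_pi, gamma, num_iters=100):
    """Fixed-point iteration of J_pi[V] = R^pi + gamma * P_pi @ V.

    R_pi: (S,) vector with R_pi[s] = r(s, pi(s))
    P_pi: (S, S) matrix with P_pi[s, s'] = P(s' | s, pi(s))
    """
    V = np.zeros(len(R_pi))          # any initialization works; J_pi is a contraction
    for _ in range(num_iters):
        V = R_pi + gamma * P_pi @ V  # V_{i+1} = J_pi[V_i]
    return V
```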

Recap: Optimal Policy

  • Optimal policies uniformly dominate in value

    • i.e. they have the highest value \(V^\star(s)\) for all \(s\)

  • The finite horizon Bellman Optimality Equation (BOE) enables efficient policy optimization with dynamic programming

  • The optimal policy is greedy with respect to the optimal value \(V^\star\)

Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

How can we efficiently find a policy that maximizes expected discounted reward?

This is the big question for today's lecture.

[Diagram: the agent–environment interaction loop, with \(a_t=\pi_t(s_t)\), \(r_t= r(s_t, a_t)\), and \(s_{t}\sim P(s_{t-1}, a_{t-1})\)]

Optimal Policy

  • Define: An optimal policy \(\pi_\star\) is one where \(V^{\pi_\star}(s) \geq V^{\pi}(s)\) for all \(s\in\mathcal S\), and policies \(\pi\in\Pi\)
    • i.e. the policy dominates other policies for all states
    • vector notation: \(V^{\pi_\star} \geq V^{\pi}\)
  • Thus we can write \(V^\star(s) = V^{\pi_\star}(s)\)
  • Naive algorithm: enumeration compares the value of all possible policies (using PE)
    • Even only considering deterministic state-dependent policies, complexity would be \(\mathcal O(A^S S^3)\)!

\(\Pi\) is all possible policies (including stochastic, history-dependent)

Bellman Optimality Equation

  • Bellman Optimality Equation (BOE): A value function
    \(V\) satisfies the BOE if for all \(s\), $$V(s)=\max_a~~ r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}[V(s')]$$
  • Theorem (Bellman Optimality):

    1. \(\pi\) is an optimal policy if and only if \(V^{\pi}\) satisfies the BOE
    2. The optimal policy is greedy with respect to the optimal value function $$\pi^\star(s) \in \arg\max_a r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$

The bracketed quantity \(r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}[V^\star(s')]\) has the shorthand \(Q^\star(s,a)\).
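
Written out, this shorthand gives an equivalent (standard) restatement of the theorem in terms of \(Q^\star\): $$Q^\star(s,a) = r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}[V^\star(s')], \qquad V^\star(s) = \max_{a\in\mathcal A} Q^\star(s,a), \qquad \pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)$$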

Example

[Diagram: two-state MDP with states \(0\) and \(1\). From state \(0\): "stay" remains at \(0\) with probability \(1\), and "move" transitions to \(1\) with probability \(1\). From state \(1\): "stay" remains at \(1\) with probability \(p_1\) (otherwise transitions to \(0\)), and "move" transitions to \(0\) with probability \(p_2\) (otherwise remains at \(1\)).]

  • Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) move
  • Consider the policy \(\pi(s)=\)stay for all \(s\)
  • \(V^\pi(0) =\frac{1}{1-\gamma}\), \(V^\pi(1) =\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}\) (see the short derivation after this list)
  • When is this optimal?
    • \(p_2\leq \frac{p_1}{1-\gamma p_1}+\frac{1-\gamma}{2}\)
  • i.e. staying everywhere is optimal when the effectiveness of "move" (\(p_2\)) is small compared with the "stickiness" of state \(1\) (\(p_1\)) and the discount factor \(\gamma\)
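
The first value follows from a one-line application of the Bellman Consistency Equation at state \(0\), using that "stay" keeps the state at \(0\) and earns reward \(+1\): $$V^\pi(0) = r(0,\text{stay}) + \gamma\,\mathbb E_{s'\sim P(0,\text{stay})}[V^\pi(s')] = 1 + \gamma V^\pi(0) \quad\implies\quad V^\pi(0) = \frac{1}{1-\gamma}$$ The value \(V^\pi(1)\) comes from solving the analogous consistency equation at state \(1\).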


Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

Value Iteration

  • The Bellman Optimality Equation is a fixed point equation!
  • Define the Bellman Operator \(\mathcal J_\star:\mathbb R^S\to\mathbb R^S\) as $$\mathcal J_\star[V](s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]~~\forall~s$$
  • Then BOE can be written as \(V = \mathcal J_\star[V]\)
  • Idea: find \(\hat V\) with fixed point iteration of \(\mathcal J_\star\), then use argmax policy \(\hat\pi\)

Value Iteration

  • Initialize \(V_0\)
  • For \(i=0,\dots,N-1\):
    • \(V_{i+1}=\mathcal J_\star[V_i]\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}~~ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_N(s')]\) \(\forall s\)
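
A minimal sketch of this procedure for a tabular MDP (the array shapes and names below are illustrative assumptions, not the PA 1 interface):

```python
import numpy as np

def value_iteration(r, P, gamma, num_iters=1000):
    """Tabular Value Iteration.

    r: (S, A) array with r[s, a] = r(s, a)
    P: (S, A, S) array with P[s, a, s'] = P(s' | s, a)
    Returns the final iterate V_N and the greedy policy w.r.t. V_N.
    """
    S, A = r.shape
    V = np.zeros(S)                              # V_0 (any initialization)
    for _ in range(num_iters):
        V = (r + gamma * P @ V).max(axis=1)      # V_{i+1} = J_star[V_i]
    Q_N = r + gamma * P @ V                      # Q_N[s, a] = r(s,a) + gamma * E_{s'~P(s,a)}[V_N(s')]
    return V, Q_N.argmax(axis=1)                 # hat{pi}(s) in argmax_a Q_N(s, a)
```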

Example: PA 1

[Figure: the gridworld from Programming Assignment 1 — a 4×4 grid of states numbered 0–15, plus an additional state 16.]

Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

Convergence of VI

Lemma (Contraction): For any \(V, V'\) $$\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

To show that Value Iteration converges, we use a contraction argument

Theorem (Convergence): For iterates \(V_i\) of VI, $$\|V_i - V^\star\|_\infty \leq \gamma^i \|V_0-V^\star\|_\infty$$

Convergence of VI

Lemma (Contraction): For any \(V, V'\) $$\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

Proof

  • \(|\mathcal J_\star V(s) - \mathcal J_\star V'(s)| = \left|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]\right|\)
    • \(\leq \max_{a\in\mathcal A} \left| r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]\right|\) (Basic Inequality, PSet 1)
    • \(= \max_{a\in\mathcal A}\gamma \left|  \mathbb{E}_{s' \sim P( s, a)} [V(s')] -  \mathbb{E}_{s' \sim P( s, a)} [V'(s')]\right|\) (cancelling the reward terms)
    • \(\leq \max_{a\in\mathcal A}\gamma  \mathbb{E}_{s' \sim P( s, a)} [|V(s') -  V'(s')|]\) (Basic Inequality, PSet 1)
    • \(\leq \max_{s'\in\mathcal S}\gamma |V(s') -  V'(s')|\) (expectation bounded by maximum; Basic Inequality, PSet 1)
    • \(= \gamma \|V -  V'\|_\infty\) (definition of \(\|\cdot\|_\infty\))
  • The above holds for all \(s\) so \(\max_s|\mathcal J_\star V(s) - \mathcal J_\star V'(s)| =\|\mathcal J_\star V - \mathcal J_\star V'\|_\infty \leq \gamma \|V-V'\|_\infty\)

[Figure: "Basic Inequalities" — plots of \(f_1\), \(f_2\), and \(f_1-f_2\) illustrating \(|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x |f_1(x)-f_2(x)|\) and the analogous bound for the expectations \(\mathbb E[f_1(x)]\), \(\mathbb E[f_2(x)]\).]
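
As a quick numerical sanity check of the lemma (an illustrative sketch on a randomly generated MDP; all names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

r = rng.uniform(size=(S, A))                # arbitrary rewards r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states

def bellman_opt(V):
    # J_star[V](s) = max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]
    return (r + gamma * P @ V).max(axis=1)

V1, V2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.abs(bellman_opt(V1) - bellman_opt(V2)).max()   # ||J*V1 - J*V2||_inf
rhs = gamma * np.abs(V1 - V2).max()                     # gamma * ||V1 - V2||_inf
assert lhs <= rhs + 1e-12                               # the contraction inequality holds
```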

Convergence of VI

Proof

  • \(\|V_i - V^\star\|_\infty = \|\mathcal J_\star V_{i-1} -\mathcal J_\star  V^\star\|_\infty\) (Definition of VI and BOE)
    • \(\leq\gamma\|V_{i-1} -  V^\star\|_\infty\) (Contraction Lemma)
  • Proof by induction using the above inequality
    • Base case \((i=0)\):\( \|V_0-V^\star\|_\infty = \|V_0-V^\star\|_\infty\)
    • Induction step: Assume \(\|V_k - V^\star\|_\infty \leq \gamma^{k}\|V_0-V^\star\|_\infty\). By above inequality, we have that $$\|V_{k+1} - V^\star\|_\infty \leq \gamma \|V_k-V^\star\|_\infty \leq \gamma \cdot \gamma^k\|V_0-V^\star\|_\infty$$ thus \(\|V_{k+1} - V^\star\|_\infty \leq \gamma^{k+1}\|V_0-V^\star\|_\infty\).

Theorem (Convergence): For iterates \(V_i\) of VI, $$\|V_i - V^\star\|_\infty \leq \gamma^i \|V_0-V^\star\|_\infty$$
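
A standard consequence of the theorem (a quick calculation, not stated on the slides): to guarantee \(\|V_N - V^\star\|_\infty \leq \epsilon\), it suffices to run $$N \geq \frac{\log\left(\|V_0 - V^\star\|_\infty / \epsilon\right)}{\log(1/\gamma)}$$ iterations, since this rearranges to \(\gamma^N \|V_0-V^\star\|_\infty \leq \epsilon\).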

Performance of VI Policy

Theorem (Suboptimality): For policy \(\hat\pi\) from VI, \(\forall s\) $$ V^\star(s) - V^{\hat\pi}(s)  \leq \frac{2\gamma}{1-\gamma} \cdot \gamma^N \|V_0-V^\star\|_\infty$$

  • The iterate \(V_N\) in VI is not necessarily equal to the value of the policy \(\hat \pi\) after \(N\) iterations $$V^{\hat\pi} = (I-\gamma P_{\hat\pi})^{-1}R^{\hat\pi}$$
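
To compute \(V^{\hat\pi}\) (e.g. to check this suboptimality bound empirically), one can solve the linear system given by the Bellman Consistency Equation. A minimal sketch, reusing the tabular arrays assumed in the VI sketch above:

```python
import numpy as np

def evaluate_policy_exact(r, P, policy, gamma):
    """Exact PE for a deterministic policy: V^pi = (I - gamma * P_pi)^{-1} R^pi."""
    S = r.shape[0]
    R_pi = r[np.arange(S), policy]   # R^pi[s] = r(s, pi(s))
    P_pi = P[np.arange(S), policy]   # P_pi[s, s'] = P(s' | s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```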

Proof of VI Performance

This is optional material. In the proof we use \(t\) in place of \(i\), write \(V^t\) for the VI iterate \(V_t\), and let \(\pi_t\) denote the greedy policy with respect to \(V^t\) (so \(\hat\pi=\pi_N\)).

  • Claim: \(V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\)
  • Recursing once: \(V^{\pi_t}(s) - V^\star(s) \)
    • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\right]-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\)
    • \(= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')]\right]-2\gamma^{t+2} \|V^{\pi_\star}-V^{0}\|_\infty-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\)
  • Recursing \(k\) times,
    \(V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^k\gamma^{\ell} \|V^{\pi_\star}-V^{0}\|_\infty\)
  • Letting \(k\to\infty\), \(V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V^{\pi_\star}-V^{0}\|_\infty\)

Proof of Claim:

\(V^{\pi_t}(s) - V^\star(s)\)

  • \(= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]\) (Bellman Expectation, add and subtract)
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{t}(s')]\) (Grouping terms, add and subtract)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{t}(s')]\) (Definition of \(\pi_t\) as argmax)
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]- \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{t}(s')]\) (Bellman Expectation on \(V^\star\) and cancelling reward terms)
  • \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+\gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V^{t}(s')]\) (Linearity of Expectation)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V^{\pi_\star}-V^{t}\|_\infty\) (Basic Inequality)
  • \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V^{\pi_\star}-V^{0}\|_\infty\) (Convergence Theorem)

Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

4. VI Convergence

5. Proof of BOE

Bellman Optimality Proof

  • Theorem (Bellman Optimality): \(\pi\) is an optimal policy if and only if  (\(\iff\)) \(V^{\pi}\) satisfies the BOE $$ V^{\pi}=\mathcal J_\star[V^{\pi}]$$
  • Proof Outline
    1. (\(\implies\)) If \(\pi^\star\) is an optimal policy, then \(V^{\pi^\star}\) satisfies BOE
    2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy
  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
    • \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \) (BCE)
      • \(\leq \max_{a\in\mathcal A} r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\) (PSet 1)
      • \(= r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]\) (Defn of \(\hat\pi\))
    • In vector form, \(V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}\)

Bellman Optimality Proof

  1. (\(\implies\)) If \(\pi^\star\) is an optimal policy, then \(V^{\pi^\star}\) satisfies BOE
  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
    • \(V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}\) i.e. \(V^{\pi_\star} \leq \mathcal J_{\hat\pi}[V^{\pi_\star}]\)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}\) (subtract from both sides)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}\) (BCE)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star}  -V^{\hat \pi})\)       (\(\star\))
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}-  V^{\hat \pi})\) (apply (\(\star\)) to RHS)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat\pi})^k(V^{\pi_\star}-  V^{\hat \pi})\) (apply (\(\star\)) \(k\) times)
    • \(V^{\pi_\star} - V^{\hat\pi} \leq 0\) (limit \(k\to\infty\))

Consider vectors \(V,V'\) and matrix \(P\).

If \(V\leq V'\) then \(PV\leq PV'\) when \(P\) has non-negative entries, where inequalities hold entrywise.

To see why this is true, consider each entry:

\([PV]_i = \sum_{j=1}^S P_{ij} V_j \leq \sum_{j=1}^S P_{ij} V'_j = [PV']_i\)

The middle inequality holds because \(V_j\leq V'_j\) for each \(j\) and all of the \(P_{ij}\) are non-negative.

\((P^{\hat\pi})^k\) remains bounded because \(P^{\hat\pi}\) is a stochastic matrix (it represents probabilities, so its entries lie in \([0,1]\) and its maximal eigenvalue is \(1\)); hence \(\gamma^k (P^{\hat\pi})^k(V^{\pi_\star}-V^{\hat\pi})\to 0\) as \(k\to\infty\).

Bellman Optimality Proof

  1. (\(\implies\)) If \(\pi^\star\) is an optimal policy, then \(V^{\pi^\star}\) satisfies BOE
  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We showed that \( V^{\pi_\star}\leq V^{\hat \pi}\)
  • Therefore, it must be that \(V^{\pi_\star} = V^{\hat\pi}\)
    • this means \(\hat \pi(s)\) is an optimal policy!
  • By definition of \(\hat\pi\) and the BCE, \(V^{\hat \pi}\) satisfies the BOE
  • Therefore, \(V^{\pi_\star}(=V^{\hat \pi})\) must also satisfy it.

Bellman Optimality Proof

  • Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
  • By part 1, we know that \(V^{\pi_\star}\) satisfies BOE
  • We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
    • \(=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\) (BOE by assumption and part 1)
    • \(\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\)
      (basic inequality PSet 1)
    • \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)

2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy

Bellman Optimality Proof

  • Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
  • We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
    • \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
    • \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]\) (basic inequality PSet 1)
    • \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\) (applying the same bound at \(s'\))
    • \(\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
    • \(\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]\)
    • \(\leq 0\) (letting \(k\to\infty\); \(\gamma^k\) times a bounded quantity vanishes)
  • Therefore, \(V^\pi = V^{\pi_\star}\) so \(\pi\) must be optimal

2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy

Recap: Proof

  • Proof Outline
    1. (\(\implies\)) If \(\pi^\star\) is an optimal policy, then \(V^{\pi^\star}\) satisfies BOE
      • On the way, showed the following was optimal $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
    2. (\(\impliedby\)) If \(V^\pi\) satisfies BOE, then \(\pi\) is an optimal policy

Recap

  • PSet 1 due TONIGHT
  • PSet 2 due next Monday
  • PA 1 due next Wednesday

 

  • Optimal Policies
  • Value Iteration

 

  • Next lecture: Policy Iteration
