CS 4/5789: Introduction to Reinforcement Learning
Lecture 5: Value Iteration
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Announcements
- Questions about waitlist/enrollment?
- Homework this week
- Problem Set 1 due TONIGHT
- Problem Set 2 released tonight due 2/13
- Programming Assignment 1 due 2/15
- My office hours: Tuesdays 10:30-11:30am in Gates 416A, Wednesday 4-4:50pm in Olin 255 (right after lecture)
Agenda
1. Recap
2. Bellman Optimality
3. Value Iteration
Recap: Optimal Policy
- The value of a state \(s\) under a policy \(\pi\) denoted \(V^\pi(s)\) is the expected cumulative discounted reward starting from that state
- An optimal policy \(\pi_\star\) is one with a dominating value,
- i.e. \(V^{\pi_\star} \geq V^{\pi}\) for all policies \(\pi\)
- All optimal policies achieve the same value \(V^\star\)
Recap: Example

[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/switch; transition arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
- Reward: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Consider the policy \(\pi(s)=\)stay for all \(s\)
- Optimal if \(p_2\leq \frac{p_1}{1-\gamma p_1}+\frac{1-\gamma}{2}\), i.e. effectiveness of "switch" small compared with "stickiness" of 1 and discount factor

Recap: Bellman Equations
Bellman Optimality Equation (BOE): \(\forall s\), $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
Theorem (Bellman Optimality):
- If \(\pi_\star\) is an optimal policy, then \(V^{\pi_\star}\) satisfies the BOE
- If \(V^\pi\) satisfies the BOE, then \(\pi\) is an optimal policy
Bellman Expectation Equation: \(\forall s\),
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
The value of a state \(s\) under a policy \(\pi\) denoted \(V^\pi(s)\) is the expected cumulative discounted reward starting from that state
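For a finite MDP, the Bellman Expectation Equation can be solved exactly in vector form, \(V^\pi = (I-\gamma P^{\pi})^{-1}R^{\pi}\). Below is a minimal numpy sketch of this computation; the array layout (`r[s, a]`, `P[s, a, s']`) and the deterministic-policy representation are illustrative assumptions, not the course's code interface.

```python
import numpy as np

def exact_policy_evaluation(P, r, pi, gamma):
    """Solve V^pi = R^pi + gamma * P^pi V^pi for a deterministic policy pi.

    P:  (S, A, S) array of transition probabilities P[s, a, s']
    r:  (S, A) array of rewards r(s, a)
    pi: (S,) integer array, pi[s] is the action taken in state s
    """
    S = r.shape[0]
    R_pi = r[np.arange(S), pi]          # reward vector induced by the policy, (S,)
    P_pi = P[np.arange(S), pi, :]       # transition matrix induced by the policy, (S, S)
    # Bellman Expectation Equation in vector form: (I - gamma * P^pi) V^pi = R^pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```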
Agenda
1. Recap
2. Bellman Optimality
3. Value Iteration
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \) (Bellman Expectation Eq)
- \(\leq \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right]\)
- \(= r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]\) (by definition of \(\hat\pi\) as the argmax)
- Writing the above expression in vector form:
- \(V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}\)
- \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}\) (subtract \(V^{\hat\pi}\) from both sides)
- \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}\) (Bellman Expectation Eq)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})\) (\(\star\))
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) to RHS)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
Consider vectors \(V,V'\) and matrix \(P\).
If \(V\leq V'\) then \(PV\leq PV'\) when \(P\) has non-negative entries, where inequalities hold entrywise.
To see why this is true, consider each entry:
\([PV]_i = \sum_{j=1}^S P_{ij} V_j \leq \sum_{j=1}^S P_{ij} V'_j = [PV']_i\)
The middle inequality holds because \(V_j\leq V'_j\) for every \(j\) and all entries \(P_{ij}\) are non-negative.
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})\) (\(\star\))
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2(V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) to RHS)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat\pi})^k(V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) \(k\) times)
- \(V^{\pi_\star} - V^{\hat\pi} \leq 0\) (limit \(k\to\infty\): \(\gamma^k\to 0\) while \((P^{\hat\pi})^k(V^{\pi_\star}-V^{\hat\pi})\) stays bounded)
- Therefore, \(V^{\pi_\star} = V^{\hat\pi}\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
\((P^{\hat\pi})^k\) stays bounded because \(P^{\hat\pi}\) is a stochastic matrix (its rows are probability distributions), so \(\|(P^{\hat\pi})^k v\|_\infty \leq \|v\|_\infty\) for any vector \(v\).
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
- We showed that \(V^{\pi_\star} = V^{\hat\pi}\)
- this means \(\hat \pi(s)\) is an optimal policy!
- By definition of \(\hat\pi\) and the Bellman Expectation Equation, \(V^{\hat \pi}\) satisfies the Bellman Optimality Equation
- Therefore, \(V^{\pi_\star}\) must also satisfy it.
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$
- If we know the optimal value \(V^\star\) then we can write down optimal policies! $$\pi^\star(s) \in \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\star}(s')] \right]$$
- Recall the definition of the Q function: $$Q^\star(s,a)= r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] $$
- \(\pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)\)
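If an estimate of \(V^\star\) is available as an array, the greedy policy above can be computed directly. A minimal sketch, assuming the same illustrative array layout (`r[s, a]`, `P[s, a, s']`) as before, not the course's code interface:

```python
import numpy as np

def greedy_policy(P, r, V, gamma):
    """Return pi(s) in argmax_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]."""
    # Q[s, a] = r(s, a) + gamma * sum_{s'} P[s, a, s'] * V[s']
    Q = r + gamma * P @ V            # (S, A); the matmul contracts the last axis of P with V
    return np.argmax(Q, axis=1)      # one maximizing action per state (ties broken arbitrarily)
```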
Bellman Optimality
Bellman Optimality Proof
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s\)
- Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
- By part 1, we know that \(V^{\pi_\star}\) satisfies BOE
- We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
- \(=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\) (BOE by assumption and part 1)
- \(\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\)
- PollEV basic inequality from PSet 1: $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$
- \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
Basic Inequalities
[Figure: plots of \(f_1\), \(f_2\), and \(f_1-f_2\) illustrating \(\max_x f_1(x)\), \(\max_x f_2(x)\), \(\max_x |f_1(x)-f_2(x)|\), \(\mathbb E[f_1(x)]\), and \(\mathbb E[f_2(x)]\)]
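One way to see the max inequality (a standard argument; the PSet solution may differ): for every \(x\), \(f_1(x) = f_2(x) + (f_1(x)-f_2(x)) \leq \max_{x'} f_2(x') + \max_{x'}|f_1(x')-f_2(x')|\). Taking the max over \(x\) gives $$\max_x f_1(x) - \max_x f_2(x) \leq \max_x|f_1(x)-f_2(x)|,$$ and swapping the roles of \(f_1\) and \(f_2\) gives the other direction, so the absolute difference of the maxima is bounded by \(\max_x|f_1(x)-f_2(x)|\).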
Bellman Optimality Proof
- Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
- We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
- \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
- \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]\) (basic inequality PSet 1)
- \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\) (apply the same bound at \(s'\))
- \(\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
- \(\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]\) (apply \(k\) times)
- \(\to 0\) as \(k\to\infty\), since value functions are bounded and \(\gamma<1\)
- Therefore, \(|V^\pi(s)-V^{\pi_\star}(s)|=0\) for all \(s\), i.e. \(V^\pi = V^{\pi_\star}\), so \(\pi\) must be optimal
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s\)
Agenda
1. Recap
2. Bellman Optimality
3. Value Iteration
Value Iteration
- The Bellman Optimality Equation is a fixed point equation! $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
- If \(V^\star\) satisfies the BOE then $$\pi_\star(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$ is an optimal policy
- Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
Value Iteration
Value Iteration
- Initialize \(V_0\)
- For \(t=0,\dots,T-1\):
- \(V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]\) for all \(s\)
- Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]\) \(\forall s\)
- Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
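Below is a minimal numpy sketch of the algorithm above, again under the illustrative `r[s, a]`, `P[s, a, s']` layout (not the PA 1 interface). Each iteration is one Bellman optimality backup over all states, costing \(O(S^2 A)\) for a dense transition array.

```python
import numpy as np

def value_iteration(P, r, gamma, T):
    """Run T iterations of Value Iteration; return V_T and the greedy policy w.r.t. V_T."""
    S, A = r.shape
    V = np.zeros(S)                      # V_0: an arbitrary initialization
    for _ in range(T):
        # V_{t+1}(s) = max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V_t(s')] ]
        Q = r + gamma * P @ V            # (S, A)
        V = Q.max(axis=1)
    Q = r + gamma * P @ V                # greedy policy with respect to the final iterate
    return V, Q.argmax(axis=1)
```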
Example: PA 1
0 | 1 | 2 | 3 |
4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 |
12 | 13 | 14 | 15 |
Bellman Operator
- Define the Bellman Operator \(\mathcal T:\mathbb R^S\to \mathbb R^S\) as, \(\forall s\) $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
- Nonlinear map
- Value Iteration is repeated application of the Bellman Operator
- Compare with Bellman Expectation Equation we used in Approximate Policy Evaluation
Convergence of VI
To show that Value Iteration converges, we use a contraction argument.
Lemma (Contraction): For any \(V, V'\) $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
Lemma (Convergence): For iterates \(V_t\) of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
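The contraction property can be sanity-checked numerically on a small random MDP. A sketch under the same illustrative array conventions, with a randomly generated \(P\), \(r\), and two arbitrary value vectors:

```python
import numpy as np

def bellman_operator(P, r, V, gamma):
    """(T V)(s) = max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]."""
    return (r + gamma * P @ V).max(axis=1)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # normalize each row P[s, a, :] into a distribution
r = rng.random((S, A))

V1, V2 = rng.standard_normal(S), rng.standard_normal(S)
lhs = np.abs(bellman_operator(P, r, V1, gamma) - bellman_operator(P, r, V2, gamma)).max()
rhs = gamma * np.abs(V1 - V2).max()
assert lhs <= rhs + 1e-12                # ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf
```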
Convergence of VI
Lemma (Contraction): For any \(V, V'\) $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
Proof
- \(|\mathcal T V(s) - \mathcal T V'(s)| = |\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]|\)
- \(\leq \max_{a\in\mathcal A} | r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]|\) (Basic Inequality PSet 1)
- \(= \max_{a\in\mathcal A}\gamma | \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \mathbb{E}_{s' \sim P( s, a)} [V'(s')]|\)
- \(\leq \max_{a\in\mathcal A}\gamma \mathbb{E}_{s' \sim P( s, a)} [|V(s') - V'(s')|]\) (Basic Inequality PSet 1)
- \(\leq \max_{s'\in\mathcal S}\gamma |V(s') - V'(s')|\) (expectation bounded by max)
- \(= \gamma \|V - V'\|_\infty\) (definition of \(\|\cdot\|_\infty\))
- The above holds for all \(s\) so \(\max_s|\mathcal T V(s) - \mathcal T V'(s)| =\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty\)
Convergence of VI
Proof
- \(\|V_t - V^\star\|_\infty = \|\mathcal T V_{t-1} -\mathcal T V^\star\|_\infty\) (Definition of VI and BOE)
- \(\leq\gamma\|V_{t-1} - V^\star\|_\infty\) (Contraction Lemma)
- We prove the Lemma by induction using the above inequality
- Base case \((t=0)\):\( \|V_0-V^\star\|_\infty = \|V_0-V^\star\|_\infty\)
- Induction step: Assume \(\|V_k - V^\star\|_\infty \leq \gamma^{k}\|V_0-V^\star\|_\infty\). By above inequality, we have that $$\|V_{k+1} - V^\star\|_\infty \leq \gamma \|V_k-V^\star\|_\infty \leq \gamma \cdot \gamma^k\|V_0-V^\star\|_\infty$$ thus \(\|V_{k+1} - V^\star\|_\infty \leq \gamma^{k+1}\|V_0-V^\star\|_\infty\).
Lemma (Convergence): For iterates \(V_t\) of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
Performance of VI Policy
Proof
- Claim: \(V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
- Recursing once: \(V^{\pi_t}(s) - V^\star(s) \)
- \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\right]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
- \(= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')\right]-2\gamma^{t+2} \|V_0-V^{\star}\|_\infty-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\)
- Recursing \(k\) times, \(V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^{k-1}\gamma^{\ell} \|V_0-V^{\star}\|_\infty\)
- Letting \(k\to\infty\), \(V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V_0-V^{\star}\|_\infty\)
Theorem (Suboptimality): For policy \(\pi_T\) from VI, \(\forall s\) $$ V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$
Proof of Claim:
\(V^{\pi_t}(s) - V^\star(s) =\)
- \(= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]\) (Bellman Expectation, add and subtract)
- \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')]\) (Grouping terms, add and subtract)
- \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]\) (Definition of \(\pi_t\) as argmax)
- \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]- \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]\) (Bellman Expectation on \(V^\star\) and cancelling reward terms)
- \(= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+\gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')]\) (Linearity of Expectation)
- \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V_{t}-V^{\star}\|_\infty\) (Basic Inequality)
- \(\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\) (Convergence Lemma)
Preview: Policy Iteration
Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(t=0,\dots,T-1\):
- Compute \(V^{\pi_t}\) with Policy Evaluation
- Policy Improvement: \(\forall s\), $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]$$
- VI only generates a policy at the very end
- Policy Iteration is another iterative algorithm that updates a policy at every iteration step
Preview: Policy Iteration
Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(t=0,\dots,T-1\):
- Policy Evaluation \(V^{\pi_t}\)
- Policy Improvement \(\pi_{t+1}\)
- Two key properties:
- Monotonic Improvement: \(V^{\pi_{t+1}} \geq V^{\pi_t}\)
- Convergence: \(\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty\)
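For comparison with Value Iteration, here is a minimal numpy sketch of Policy Iteration under the same illustrative array layout (`r[s, a]`, `P[s, a, s']`); next lecture covers the algorithm and its guarantees in detail.

```python
import numpy as np

def policy_iteration(P, r, gamma, T):
    """Alternate exact policy evaluation and greedy policy improvement for T rounds."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                          # arbitrary initial policy
    for _ in range(T):
        # Policy Evaluation: solve V^pi = R^pi + gamma * P^pi V^pi
        R_pi = r[np.arange(S), pi]
        P_pi = P[np.arange(S), pi, :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy Improvement: greedy with respect to V^pi
        pi = (r + gamma * P @ V).argmax(axis=1)
    return pi
```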
Recap
- PSet 1 due TONIGHT
- PSet 2 due next Monday
- PA 1 due next Wednesday
- Optimal Policies
- Value Iteration
- Next lecture: Policy Iteration, Dynamic Programming
Sp23 CS 4/5789: Lecture 5
By Sarah Dean