Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Announcements

• Homework this week
• Problem Set 1 due TONIGHT
• Problem Set 2 released tonight due 2/13
• Programming Assignment 1 due 2/15
• My office hours: Tuesdays 10:30-11:30am in Gates 416A, Wednesday 4-4:50pm in Olin 255 (right after lecture)

## Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

## Recap: Optimal Policy

• The value of a state $$s$$ under a policy $$\pi$$ denoted $$V^\pi(s)$$ is the expected cumulative discounted reward starting from that state
• An optimal policy $$\pi_\star$$ is one with a dominating value,
• i.e. $$V^{\pi_\star} \geq V^{\pi}$$ for all policies $$\pi$$
• All optimal policies achieve the same value $$V^\star$$

## Recap: Example

(Figure: two-state transition diagram over states $$0$$ and $$1$$, with edges labeled stay: $$1$$, switch: $$1$$, stay: $$p_1$$, stay: $$1-p_1$$, switch: $$p_2$$, switch: $$1-p_2$$.)

• Reward:  $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• Consider the policy $$\pi(s)=$$stay for all $$s$$
• Optimal if $$p_2\leq \frac{p_1}{1-\gamma p_1}+\frac{1-\gamma}{2}$$, i.e. effectiveness of "switch" small compared with "stickiness" of 1 and discount factor


## Recap: Bellman Equations

Bellman Optimality Equation (BOE): $$\forall s$$, $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$

Theorem (Bellman Optimality):

1. If $$\pi_\star$$ is an optimal policy, then $$V^{\pi_\star}$$ satisfies the BOE
2. If $$V^\pi$$ satisfies the BOE, then $$\pi$$ is an optimal policy

Bellman Expectation Equation: $$\forall s$$,

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$

The value of a state $$s$$ under a policy $$\pi$$ denoted $$V^\pi(s)$$ is the expected cumulative discounted reward starting from that state
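Since the Bellman Expectation Equation is linear in $$V^\pi$$, in a finite MDP it can be solved exactly in vector form as $$V^\pi = (I-\gamma P^\pi)^{-1}R^\pi$$. A minimal numpy sketch on a hypothetical random MDP (the sizes and array layouts here are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# hypothetical random MDP: P[a, s, s'] = P(s' | s, a), r[s, a] = r(s, a)
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)     # normalize rows into distributions
r = rng.random((S, A))

pi = rng.integers(A, size=S)          # a deterministic policy pi(s)
P_pi = P[pi, np.arange(S), :]         # P^pi: row s is P(. | s, pi(s))
R_pi = r[np.arange(S), pi]            # R^pi: reward under pi

# Bellman Expectation Eq in vector form: V^pi = R^pi + gamma P^pi V^pi
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# check: V^pi satisfies the Bellman Expectation Equation
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
```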

## Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration


## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star}(s) = r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s, \pi_\star(s))}[V^{\pi_\star}(s')]$$ (Bellman Expectation Eq)
• $$\leq \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')] \right]$$
• $$= r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]$$ (definition of $$\hat\pi$$ as the argmax)
• Writing the above expression in vector form:
• $$V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}$$

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}$$
• $$V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}$$ (subtract from both sides)
• $$V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}$$ (Bellman Expectation Eq)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})$$       ($$\star$$)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) to RHS)

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

Consider vectors $$V,V'$$ and matrix $$P$$.

If $$V\leq V'$$ then $$PV\leq PV'$$ when $$P$$ has non-negative entries, where inequalities hold entrywise.

To see why this is true, consider each entry:

$$[PV]_i = \sum_{j=1}^S P_{ij} V_j \leq \sum_{j=1}^S P_{ij} V'_j = [PV']_i$$

The middle inequality holds because $$V_j\leq V'_j$$ for every $$j$$ and all $$P_{ij}$$ are non-negative.
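This entrywise monotonicity is easy to confirm numerically; a quick check with a hypothetical non-negative (here stochastic, as for $$P^\pi$$) matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 4))                 # matrix with non-negative entries
P /= P.sum(axis=1, keepdims=True)      # rows sum to 1 (stochastic)

V = rng.standard_normal(4)
V_prime = V + rng.random(4)            # guarantees V <= V' entrywise

# [P V]_i = sum_j P_ij V_j <= sum_j P_ij V'_j = [P V']_i
assert np.all(P @ V <= P @ V_prime)
```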

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})$$       ($$\star$$)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2(V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) to RHS)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat\pi})^k(V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) $$k$$ times)
• $$V^{\pi_\star} - V^{\hat\pi} \leq 0$$ (limit $$k\to\infty$$: $$\gamma^k\to 0$$ while $$(P^{\hat\pi})^k$$ stays bounded)
• Therefore, $$V^{\pi_\star} = V^{\hat\pi}$$

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

$$(P^{\hat\pi})^k$$ is bounded because $$P^{\hat\pi}$$ is a stochastic matrix (its rows are probability distributions), so every power is also stochastic with entries in $$[0,1]$$.

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{ \pi^\star}(s')] \right]$$
• We showed that $$V^{\pi_\star} = V^{\hat\pi}$$
• this means $$\hat \pi(s)$$ is an optimal policy!
• By definition of $$\hat\pi$$ and the Bellman Expectation Equation, $$V^{\hat \pi}$$ satisfies the Bellman Optimality Equation
• Therefore, $$V^{\pi_\star}$$ must also satisfy it.

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right],~~\forall s$$

• If we know the optimal value $$V^\star$$ then we can write down optimal policies! $$\pi^\star(s) \in \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\star}(s')] \right]$$
• Recall the definition of the Q function: $$Q^\star(s,a)= r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')]$$
• $$\pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)$$
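Extracting an optimal policy from $$Q^\star$$ is a single argmax per state; a small illustrative sketch (the table values are made up):

```python
import numpy as np

def greedy_policy(Q):
    """pi(s) in argmax_a Q(s, a): one action per state (ties -> lowest index)."""
    return np.argmax(Q, axis=1)

# made-up Q* table with 3 states and 2 actions
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [0.0, 3.0]])
pi_star = greedy_policy(Q_star)   # -> array([1, 0, 1])
```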

## Bellman Optimality Proof

Theorem (Bellman Optimality) 2: $$\pi$$ is an optimal policy if $$V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s$$

• Consider an optimal policy $$\pi_\star$$ and the value $$V^{\pi_\star}$$
• By part 1, we know that $$V^{\pi_\star}$$ satisfies BOE
• We bound $$|V^{\pi}(s)-V^{\pi_\star}(s)|$$
• $$=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|$$ (BOE by assumption and part 1)
• $$\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|$$
• PollEV basic inequality from PSet 1: $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$
• $$\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|$$ (linearity of expectation)

## Basic Inequalities

(Figure: plots of $$f_1$$, $$f_2$$, and $$f_1-f_2$$ illustrating $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$ and $$|\mathbb E[f_1(x)] - \mathbb E[f_2(x)]| \leq \mathbb E[|f_1(x)-f_2(x)|]$$.)

## Bellman Optimality Proof

• Consider an optimal policy $$\pi_\star$$ and the value $$V^{\pi_\star}$$
• We bound $$|V^{\pi}(s)-V^{\pi_\star}(s)|$$
• $$\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|$$ (linearity of expectation)
• $$\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]$$ (basic inequality PSet 1)
• $$\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|] \right]$$ (apply the same bound at $$s'$$)
• $$\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|] \right]$$
• $$\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]$$ (repeat $$k$$ times)
• $$= 0$$ (letting $$k\to\infty$$: the value difference is bounded and $$\gamma^k\to 0$$)
• Therefore, $$V^\pi = V^{\pi_\star}$$ so $$\pi$$ must be optimal

Theorem (Bellman Optimality) 2: $$\pi$$ is an optimal policy if $$V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right],~~\forall s$$

## Agenda

1. Recap

2. Bellman Optimality

3. Value Iteration

## Value Iteration

• The Bellman Optimality Equation is a fixed point equation! $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
• If $$V^\star$$ satisfies the BOE then $$\pi_\star(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$ is an optimal policy
• Idea: find $$\hat V$$ with fixed point iteration, then get approximately optimal policy $$\hat\pi$$.

## Value Iteration

Value Iteration

• Initialize $$V_0$$
• For $$t=0,\dots,T-1$$:
• $$V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]$$ for all $$s$$
• Return $$\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]$$ $$\forall s$$
• Idea: find $$\hat V$$ with fixed point iteration, then get approximately optimal policy $$\hat\pi$$.
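The loop above can be sketched in a few lines of numpy; a minimal illustration on a hypothetical random MDP, where `P[a, s, s']` and `r[s, a]` are assumed array layouts (not from the course assignments):

```python
import numpy as np

def value_iteration(P, r, gamma, T):
    """Run T Bellman optimality updates from V_0 = 0, then act greedily.

    Assumed layout: P[a, s, s'] = P(s' | s, a), r[s, a] = r(s, a).
    """
    V = np.zeros(r.shape[0])                      # initialize V_0
    for _ in range(T):
        # V_{t+1}(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s' | s, a) V_t(s') ]
        V = (r + gamma * (P @ V).T).max(axis=1)
    pi_hat = (r + gamma * (P @ V).T).argmax(axis=1)  # greedy w.r.t. V_T
    return V, pi_hat

# hypothetical random MDP to exercise the routine
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
V_T, pi_hat = value_iteration(P, r, gamma, T=300)

# after many iterations, V_T approximately satisfies the BOE
Q = r + gamma * (P @ V_T).T
assert np.allclose(V_T, Q.max(axis=1), atol=1e-8)
```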

## Example: PA 1

(Figure: the gridworld environment from Programming Assignment 1, with states numbered 0 through 16.)

## Bellman Operator

• Define the Bellman Operator $$\mathcal T:\mathbb R^S\to \mathbb R^S$$ as, $$\forall s$$ $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
• Nonlinear map
• Value Iteration is repeated application of the Bellman Operator
• Compare with Bellman Expectation Equation we used in Approximate Policy Evaluation

## Convergence of VI

Lemma (Contraction): For any $$V, V'$$ $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

To show that Value Iteration converges, we use a contraction argument

Lemma (Convergence): For iterates $$V_t$$ of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$

## Convergence of VI

Lemma (Contraction): For any $$V, V'$$ $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

Proof

• $$|\mathcal T V(s) - \mathcal T V'(s)| = |\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]|$$
• $$\leq \max_{a\in\mathcal A} | r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V'(s')] \right]|$$ (Basic Inequality PSet 1)
• $$= \max_{a\in\mathcal A}\gamma | \mathbb{E}_{s' \sim P( s, a)} [V(s')] - \mathbb{E}_{s' \sim P( s, a)} [V'(s')]|$$
• $$\leq \max_{a\in\mathcal A}\gamma \mathbb{E}_{s' \sim P( s, a)} [|V(s') - V'(s')|]$$ (Basic Inequality PSet 1)
• $$\leq \max_{s'\in\mathcal S}\gamma |V(s') - V'(s')|$$ (expectation bounded by max)
• $$= \gamma \|V - V'\|_\infty$$ (definition of $$\|\cdot\|_\infty$$)
• The above holds for all $$s$$ so $$\max_s|\mathcal T V(s) - \mathcal T V'(s)| =\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
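The contraction property can be spot-checked numerically; a sketch assuming the same hypothetical `P[a, s, s']`, `r[s, a]` layout as before:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 6, 3, 0.9
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def bellman_op(V):
    # (T V)(s) = max_a [ r(s, a) + gamma E_{s' ~ P(s,a)} V(s') ]
    return (r + gamma * (P @ V).T).max(axis=1)

V1, V2 = rng.standard_normal(S), rng.standard_normal(S)
lhs = np.abs(bellman_op(V1) - bellman_op(V2)).max()  # ||T V - T V'||_inf
rhs = gamma * np.abs(V1 - V2).max()                  # gamma ||V - V'||_inf
assert lhs <= rhs + 1e-12
```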

## Convergence of VI

Proof

• $$\|V_t - V^\star\|_\infty = \|\mathcal T V_{t-1} -\mathcal T V^\star\|_\infty$$ (Definition of VI and BOE)
• $$\leq\gamma\|V_{t-1} - V^\star\|_\infty$$ (Contraction Lemma)
• We prove the Lemma by induction using the above inequality
• Base case ($$t=0$$): $$\|V_0-V^\star\|_\infty = \gamma^0\|V_0-V^\star\|_\infty$$
• Induction step: Assume $$\|V_k - V^\star\|_\infty \leq \gamma^{k}\|V_0-V^\star\|_\infty$$. By above inequality, we have that $$\|V_{k+1} - V^\star\|_\infty \leq \gamma \|V_k-V^\star\|_\infty \leq \gamma \cdot \gamma^k\|V_0-V^\star\|_\infty$$ thus $$\|V_{k+1} - V^\star\|_\infty \leq \gamma^{k+1}\|V_0-V^\star\|_\infty$$.

Lemma (Convergence): For iterates $$V_t$$ of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
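The $$\gamma^t$$ rate is visible empirically; a sketch that compares VI iterates against a near-exact $$V^\star$$ (obtained by running many extra Bellman updates on a hypothetical random MDP):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 5, 2, 0.8
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def bellman_op(V):
    return (r + gamma * (P @ V).T).max(axis=1)

# near-exact V* via many extra Bellman updates (gamma^2000 is negligible)
V_star = np.zeros(S)
for _ in range(2000):
    V_star = bellman_op(V_star)

V = np.zeros(S)
err0 = np.abs(V - V_star).max()       # ||V_0 - V*||_inf
for t in range(1, 30):
    V = bellman_op(V)
    # Convergence Lemma: ||V_t - V*||_inf <= gamma^t ||V_0 - V*||_inf
    assert np.abs(V - V_star).max() <= gamma**t * err0 + 1e-10
```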

## Performance of VI Policy

Proof

• Claim: $$V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• Recursing once: $$V^{\pi_t}(s) - V^\star(s)$$
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\right]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• $$= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')\right]-2\gamma^{t+2} \|V_0-V^{\star}\|_\infty-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• Recursing $$k$$ times,
$$V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^{k-1}\gamma^{\ell} \|V_0-V^{\star}\|_\infty$$
• Letting $$k\to\infty$$, $$V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V_0-V^{\star}\|_\infty$$

Theorem (Suboptimality): For policy $$\pi_T$$ from VI, $$\forall s$$ $$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$

Proof of Claim: (here $$\pi_t$$ is greedy with respect to the VI iterate $$V_t$$)

$$V^{\pi_t}(s) - V^\star(s)$$

• $$= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]$$ (Bellman Expectation, add and subtract)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')] - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')]$$ (grouping terms, add and subtract)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')] - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]$$ (definition of $$\pi_t$$ as argmax w.r.t. $$V_t$$)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')-V^{\star}(s')]$$ (Bellman Expectation on $$V^\star$$ cancels the reward terms)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V_t-V^{\star}\|_\infty$$ (Basic Inequality)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$ (Convergence Lemma)
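The suboptimality bound can also be checked numerically: run a few VI steps, extract the greedy policy, evaluate it exactly, and compare against a near-exact $$V^\star$$. A sketch on a hypothetical random MDP:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 5, 3, 0.9
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def bellman_op(V):
    return (r + gamma * (P @ V).T).max(axis=1)

V_star = np.zeros(S)                   # near-exact V* for reference
for _ in range(3000):
    V_star = bellman_op(V_star)

T = 5
V = np.zeros(S)
err0 = np.abs(V - V_star).max()        # ||V_0 - V*||_inf
for _ in range(T):
    V = bellman_op(V)
pi_T = (r + gamma * (P @ V).T).argmax(axis=1)   # greedy w.r.t. V_T

# evaluate pi_T exactly: V^{pi_T} = (I - gamma P^{pi_T})^{-1} R^{pi_T}
P_pi = P[pi_T, np.arange(S), :]
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r[np.arange(S), pi_T])

# Suboptimality theorem: V*(s) - V^{pi_T}(s) <= 2 gamma^{T+1}/(1-gamma) * ||V_0 - V*||_inf
bound = 2 * gamma ** (T + 1) / (1 - gamma) * err0
assert np.all(V_star - V_pi <= bound + 1e-8)
```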

## Preview: Policy Iteration

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
• Compute $$V^{\pi_t}$$ with Policy Evaluation
• Policy Improvement: $$\forall s$$, $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A}\left[ r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]\right]$$
• VI only generates a policy at the very end
• Policy Iteration is another iterative algorithm that updates a policy at every iteration step

## Preview: Policy Iteration

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
• Policy Evaluation $$V^{\pi_t}$$
• Policy Improvement $$\pi_{t+1}$$
• Two key properties:
1. Monotonic Improvement: $$V^{\pi_{t+1}} \geq V^{\pi_t}$$
2. Convergence: $$\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty$$
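The two steps of Policy Iteration can be sketched with the same hypothetical `P[a, s, s']`, `r[s, a]` arrays, using an exact linear solve for Policy Evaluation:

```python
import numpy as np

def policy_iteration(P, r, gamma, T):
    """PI sketch: exact evaluation then greedy improvement, T rounds.

    Assumed layout: P[a, s, s'] = P(s' | s, a), r[s, a] = r(s, a).
    """
    S = r.shape[0]
    pi = np.zeros(S, dtype=int)                     # initialize pi_0
    for _ in range(T):
        # Policy Evaluation: V^{pi_t} = (I - gamma P^{pi_t})^{-1} R^{pi_t}
        P_pi = P[pi, np.arange(S), :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r[np.arange(S), pi])
        # Policy Improvement: greedy one-step lookahead on V^{pi_t}
        pi = (r + gamma * (P @ V).T).argmax(axis=1)
    return pi, V

# hypothetical random MDP; on small problems PI converges in a few rounds
rng = np.random.default_rng(5)
S, A, gamma = 5, 3, 0.9
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi, V = policy_iteration(P, r, gamma, T=20)
```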

## Recap

• PSet 1 due TONIGHT
• PSet 2 due next Monday
• PA 1 due next Wednesday

• Optimal Policies
• Value Iteration

• Next lecture: Policy Iteration, Dynamic Programming

By Sarah Dean
