Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Announcements

• Homework released this week
• Problem Set 1 due Monday 2/6
• Programming Assignment 1 released tonight, due 2 weeks later
• CIS Partner Finding Social
• Come to Duffield Atrium to find a partner or study buddy for any CIS classes you are taking this semester! February 2nd from 4:30 to 6:30pm

## Agenda

1. Policy Evaluation

2. Optimal Policies

3. Value Iteration

## Example

(Diagram: two-state MDP with states $$0$$ and $$1$$. From state $$0$$, "stay" self-loops with probability $$1$$ and "switch" moves to $$1$$ with probability $$1$$. From state $$1$$, "stay" self-loops with probability $$p_1$$ and moves to $$0$$ with probability $$1-p_1$$; "switch" self-loops with probability $$1-p_2$$ and moves to $$0$$ with probability $$p_2$$.)

• Recall ongoing example
• Suppose the reward is:
• $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$switch
• Notation review: what is $$\{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$ for this example?


## Notation Review


• $$\mathcal S = \{0,1\}$$ and $$\mathcal A=\{$$stay,switch$$\}$$
• $$r(0,$$stay$$)=1$$, $$r(0,$$switch$$)=\frac{1}{2}$$
• $$r(1,$$stay$$)=0$$, $$r(1,$$switch$$)=-\frac{1}{2}$$
• $$P(0,$$stay$$)=\mathbf{1}_{0}=\mathsf{Bernoulli}(0)$$: $$P(0\mid 0,$$stay$$)=1$$, $$P(1\mid 0,$$stay$$)=0$$
• $$P(1,$$stay$$)=\mathsf{Bernoulli}(p_1)$$: $$P(0\mid 1,$$stay$$)=1-p_1$$, $$P(1\mid 1,$$stay$$)=p_1$$
• $$P(0,$$switch$$)=\mathbf{1}_{1}=\mathsf{Bernoulli}(1)$$: $$P(0\mid 0,$$switch$$)=0$$, $$P(1\mid 0,$$switch$$)=1$$
• $$P(1,$$switch$$)=\mathsf{Bernoulli}(1-p_2)$$: $$P(0\mid 1,$$switch$$)=p_2$$, $$P(1\mid 1,$$switch$$)=1-p_2$$
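For concreteness, the tuple $$\{\mathcal S, \mathcal A, r, P, \gamma\}$$ can be written out numerically. A minimal sketch; the parameter values $$p_1=0.7$$ and $$p_2=0.4$$ are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Two-state MDP from the running example; p1 and p2 are assumed values.
# Actions are encoded as 0 = stay, 1 = switch.
p1, p2 = 0.7, 0.4

# r(s, a) = +1 if s == 0, minus 1/2 if a == switch
R = np.array([[1.0, 0.5],     # r(0, stay), r(0, switch)
              [0.0, -0.5]])   # r(1, stay), r(1, switch)

# P[a][s, s'] = P(s' | s, a)
P = {"stay":   np.array([[1.0, 0.0],
                         [1 - p1, p1]]),
     "switch": np.array([[0.0, 1.0],
                         [p2, 1 - p2]])}

# sanity check: each row is a probability distribution over next states
for a, Pa in P.items():
    assert np.allclose(Pa.sum(axis=1), 1.0)
print("transition matrices are row-stochastic")
```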

## Value Function

The value of a state $$s$$ under a policy $$\pi$$ is the expected cumulative discounted reward starting from that state:

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t)\right]$$

Bellman Expectation Equation: $$\forall s$$,

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$


Q function: $$Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')]$$

Proof of BE

• $$V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]$$
• $$= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]$$
(linearity of expectation)
• $$= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]$$
(simplifying conditional expectation, re-indexing sum)
• $$= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]$$ (tower property of conditional expectation)
• $$= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]$$
(definition of value function and linearity of expectation)
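The Bellman Expectation Equation can also be checked empirically on the running example: estimate $$V^\pi(1)$$ by Monte Carlo rollouts and compare it with the one-step Bellman recursion. A sketch under assumed parameter values ($$p_1=0.7$$, $$\gamma=0.9$$), with $$\pi=$$stay:

```python
import numpy as np

# Monte Carlo estimate of V^pi(1) for pi = stay, compared against the
# Bellman Expectation Equation. Parameter values are assumptions.
rng = np.random.default_rng(0)
p1, gamma = 0.7, 0.9

def rollout(s, horizon=100):
    # truncated discounted return of pi = stay starting from state s
    total = 0.0
    for t in range(horizon):
        total += gamma**t * (1.0 if s == 0 else 0.0)   # r(s, stay)
        if s == 1 and rng.random() >= p1:              # leave state 1 w.p. 1-p1
            s = 0
    return total

V0 = 1 / (1 - gamma)                                   # closed form for V(0)
V1_mc = np.mean([rollout(1) for _ in range(5000)])
# Bellman at s = 1: V(1) = 0 + gamma * ((1-p1) V(0) + p1 V(1)), solved for V(1)
V1_be = gamma * (1 - p1) * V0 / (1 - gamma * p1)
print(V1_mc, V1_be)                                    # the two should agree
```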

## Example

(Diagram: two-state MDP, as above.)

• Suppose the reward is:
• $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$switch
• Consider the policy $$\pi(s)=$$stay for all $$s$$
• $$V^\pi(0) =\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}$$
• $$V^\pi(1) =\sum_{T=1}^\infty p_1^{T-1}(1-p_1) \sum_{t=T}^\infty \gamma^t =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}$$

## Policy Evaluation (PE)

• $$V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s')$$
• The matrix vector form of the Bellman Equation is

$$V^{\pi} = R^{\pi} + \gamma P^{\pi} V^\pi$$

(Diagram: the Bellman equation in matrix-vector form, with $$V^\pi(s)$$ and $$r(s,\pi(s))$$ vectors indexed by $$s$$, and $$P(s'\mid s,\pi(s))$$ an $$S\times S$$ matrix scaled by $$\gamma$$.)
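The matrix-vector form yields an exact algorithm: solve the linear system $$(I-\gamma P^{\pi})V^\pi = R^{\pi}$$, at cost $$\mathcal O(S^3)$$. A sketch for the running example under $$\pi=$$stay (parameter values assumed):

```python
import numpy as np

# Exact policy evaluation via the linear system (I - gamma P^pi) V = R^pi.
# Running example with pi = stay; p1 and gamma are assumed values.
p1, gamma = 0.7, 0.9
R_pi = np.array([1.0, 0.0])            # R^pi(s) = r(s, stay)
P_pi = np.array([[1.0, 0.0],           # P^pi[s, s'] = P(s' | s, stay)
                 [1 - p1, p1]])

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V_pi[0])   # matches the closed form 1/(1 - gamma) = 10
```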

Approximate Policy Evaluation:

• Initialize $$V_0$$
• For $$t=0,1,\dots, T$$:
• $$V_{t+1} = R^{\pi} + \gamma P^{\pi} V_t$$

Complexity of each iteration is $$\mathcal O(S^2)$$
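A sketch of the iteration on the same example (assumed parameters), compared against the exact linear-system solution:

```python
import numpy as np

# Approximate policy evaluation: V_{t+1} = R^pi + gamma P^pi V_t.
# Same illustrative two-state example with pi = stay; assumed parameters.
p1, gamma, T = 0.7, 0.9, 200
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[1.0, 0.0], [1 - p1, p1]])

V = np.zeros(2)                       # initialize V_0
for t in range(T):
    V = R_pi + gamma * P_pi @ V       # each update costs O(S^2)

V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(np.max(np.abs(V - V_exact)))    # error decays like gamma^T
```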

## Approximate Policy Evaluation

To trade off accuracy for computation time, we can use a fixed point iteration algorithm.

To show that Approx PE works, we first prove a contraction lemma.

## Convergence of Approx PE

Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$

Proof

• $$\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P^\pi V_t-V^\pi\|_\infty$$ by algorithm definition
• $$= \|R^\pi + \gamma P^\pi V_t-(R^\pi + \gamma P^\pi V^\pi)\|_\infty$$ by Bellman eq
• $$= \| \gamma P^\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P^\pi(s), V_t-V^\pi\rangle|$$ norm definition
• $$=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|$$ expectation definition
• $$\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]$$ basic inequality (PSet 1)
• $$\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty$$ basic inequality (PSet 1)

## Convergence of Approx PE

Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$

so an $$\epsilon$$ correct solution requires

$$T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}$$

Proof

• First statement follows by induction using the Lemma
• For the second statement,
• $$\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon$$
• Taking $$\log$$ of both sides,
• $$T\log \gamma + \log \|V_0-V^\pi\|_\infty \leq \log \epsilon$$, then rearrange
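Plugging in numbers shows what the bound means in practice; a sketch where $$\gamma$$, $$\epsilon$$, and the initial error are illustrative assumptions:

```python
import numpy as np

# Iterations sufficient for an eps-accurate value, per the theorem.
# gamma, eps, and the initial error err0 are illustrative assumptions.
gamma, eps, err0 = 0.9, 1e-6, 10.0

T = int(np.ceil(np.log(err0 / eps) / np.log(1 / gamma)))
print(T)                       # with these numbers, T = 153 iterations
assert err0 * gamma**T <= eps  # the geometric bound indeed holds at this T
```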

## Agenda

1. Policy Evaluation

2. Optimal Policies

3. Value Iteration

## Example

(Diagram: two-state MDP, as above.)

• Suppose the reward is:
• $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$switch
• Consider the policy $$\pi(s)=$$stay for all $$s$$
• $$V^\pi(0) =\frac{1}{1-\gamma}$$
• $$V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}$$
• Is this optimal? PollEV

## Optimal Policy

maximize over $$\pi$$   $$\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

s.t.   $$s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

• An optimal policy $$\pi_\star$$ is one where $$V^{\pi_\star}(s) \geq V^{\pi}(s)$$ for all $$s$$ and policies $$\pi$$
• i.e. the policy dominates other policies for all states
• vector notation: $$V^{\pi_\star}(s) \geq V^{\pi}(s)~\forall~s\iff V^{\pi_\star} \geq V^{\pi}$$
• All optimal policies achieve the same value $$V^\star$$, i.e. at every state $$s$$, $$V^\star(s) = V^{\pi_\star}(s)$$

## Finding and Verifying Optimal Policies

• How can we find an optimal policy? How can we verify whether a policy is optimal?
• Naive approach: enumeration

Enumeration:

• Initialize $$V^\star=-\infty$$ and $$\pi_\star$$ arbitrarily
• For all $$\pi:\mathcal S\to\mathcal A$$:
• compute $$V^\pi$$ with PE
• if $$V^\pi\geq V^\star$$: set $$V^\star =V^\pi$$ and $$\pi_\star=\pi$$
• return $$\pi_\star$$
• For $$S=|\mathcal S|$$ states and $$A=|\mathcal A|$$ actions, the complexity is $$\mathcal O(A^S S^3)$$!
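A sketch of the enumeration baseline on the two-state example (parameter values assumed), where each candidate policy is evaluated exactly with the linear-system form of PE:

```python
import numpy as np
from itertools import product

# Enumerate all A^S deterministic policies; p1, p2, gamma are assumed values.
p1, p2, gamma = 0.7, 0.4, 0.9
R = np.array([[1.0, 0.5], [0.0, -0.5]])        # R[s, a]; 0 = stay, 1 = switch
P = [np.array([[1.0, 0.0], [1 - p1, p1]]),     # P[a][s, s']
     np.array([[0.0, 1.0], [p2, 1 - p2]])]

def policy_eval(pi):
    # exact PE for a deterministic policy pi: S -> A, cost O(S^3)
    P_pi = np.array([P[pi[s]][s] for s in range(2)])
    R_pi = np.array([R[s, pi[s]] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

best_pi, best_V = None, None
for pi in product(range(2), repeat=2):          # A^S = 4 candidate policies
    V = policy_eval(pi)
    if best_V is None or np.all(V >= best_V):
        best_pi, best_V = pi, V

print(best_pi)   # at these parameter values, "stay" in both states wins
```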

## Bellman Optimality Equation

• Just like the Bellman Expectation Equation made it easier to compute the Value for a given policy,
• the Bellman Optimality Equation will make it easier to verify and compute the optimal policy/value function

Bellman Optimality Equation (BOE): $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$

Theorem (Bellman Optimality):

1. If $$\pi_\star$$ is an optimal policy, then $$V^{\pi_\star}$$ satisfies the BOE
2. If $$V^\pi$$ satisfies the BOE, then $$\pi$$ is an optimal policy

Theorem (Bellman Optimality) 2: $$\pi$$ is an optimal policy if $$V^\pi$$ satisfies $$V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$

## Example

(Diagram: two-state MDP, as above.)

• Suppose the reward is:
• $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$switch
• Consider the policy $$\pi(s)=$$stay for all $$s$$
• $$V^\pi(0) =\frac{1}{1-\gamma}$$
• $$V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}$$
• Is this optimal?
• $$V^\pi(0) =\frac{1}{1-\gamma}$$
• $$\max_{a\in\mathcal A} \left[ r(0, a) + \gamma \mathbb{E}_{s' \sim P( 0, a)} [V(s')] \right]$$
• for $$a=$$stay, $$\frac{1}{1-\gamma}$$
• for $$a=$$switch,
• $$\frac{1}{2} + \gamma V(1) = \frac{\gamma^2 (1-p_1)}{(1-\gamma p_1)(1-\gamma)} +\frac{1}{2} \leq 1 + \frac{\gamma}{1-\gamma} = \frac{1}{1-\gamma}$$
• Thus BOE satisfied for $$s=0$$
• $$V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}$$
• $$\max_{a\in\mathcal A} \left[ r(1, a) + \gamma \mathbb{E}_{s' \sim P( 1, a)} [V(s')] \right]$$
• for $$a=$$stay, $$\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}$$
• for $$a=$$switch,
• $$-\frac{1}{2} + \gamma ((1-p_2)V(1)+p_2V(0)) = \frac{\gamma^2 (1-p_2)(1-p_1)}{(1-\gamma p_1)(1-\gamma)} + \frac{\gamma p_2}{1-\gamma} -\frac{1}{2}$$
• Thus BOE satisfied for $$s=1$$ if $$p_2\leq (1-p_1)+\frac{1-\gamma p_1}{2\gamma}$$
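The algebra above can be double-checked numerically: evaluate $$\pi=$$stay exactly, then test whether its value attains the Bellman max in every state. A sketch at assumed parameter values:

```python
import numpy as np

# Numerical BOE check for pi = stay; p1, p2, gamma are assumed values.
p1, p2, gamma = 0.7, 0.4, 0.9
R = np.array([[1.0, 0.5], [0.0, -0.5]])          # R[s, a]; 0 = stay, 1 = switch
P = np.array([[[1.0, 0.0], [1 - p1, p1]],        # P[a, s, s']
              [[0.0, 1.0], [p2, 1 - p2]]])

V = np.linalg.solve(np.eye(2) - gamma * P[0], R[:, 0])   # exact V^pi, pi = stay
Q = R.T + gamma * P @ V                                  # Q[a, s]
print(np.allclose(V, Q.max(axis=0)))   # True here: stay satisfies the BOE
```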

(Plot: heatmap over the discount factor $$\gamma$$ and $$p_1$$, the probability of remaining in state $$1$$ under "stay".)

• Color: maximum value that $$p_2$$ can have for "stay" to be optimal
• ranging from 0 (dark) to 1.5 (light)
• If $$\gamma$$ is small, cost of "switch" action is not worth it
• If $$p_1$$ is small, likely to transition without "switch" action


## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right]$$
• $$\leq \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right]$$
• $$\leq r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]$$ (by definition of $$\hat\pi$$)
• Writing the above expression in vector form:
• $$V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}$$

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}$$
• $$V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}$$ (subtract from both sides)
• $$V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}$$ (Bellman Expectation Eq)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})$$       ($$\star$$)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) to RHS)

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
• Then by definition of optimality, $$V^{\pi_\star}\geq V^{\hat \pi}$$
• We now show that $$V^{\pi_\star}\leq V^{\hat \pi}$$
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star} -V^{\hat \pi})$$       ($$\star$$)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) to RHS)
• $$V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat\pi})^k (V^{\pi_\star}- V^{\hat \pi})$$ (apply ($$\star$$) $$k$$ times)
• $$V^{\pi_\star} - V^{\hat\pi} \leq 0$$ (limit $$k\to\infty$$)
• Therefore, $$V^{\pi_\star} = V^{\hat\pi}$$

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

• Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
• We showed that $$V^{\pi_\star} = V^{\hat\pi}$$
• this means $$\hat \pi(s)$$ is an optimal policy!
• By definition of $$\hat\pi$$ and the Bellman Expectation Equation, $$V^{\hat \pi}$$ satisfies the Bellman Optimality Equation
• Therefore, $$V^{\pi_\star}$$ must also satisfy it.

## Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If $$\pi^\star$$ is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

• If we know the optimal value $$V^\star$$ then we can write down optimal policies! $$\pi^\star(s) \in \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\star}(s')] \right]$$
• Recall the definition of the Q function: $$Q^\star(s,a)= r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')]$$
• $$\pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)$$

## Bellman Optimality Proof

Theorem (Bellman Optimality) 2: $$\pi$$ is an optimal policy if $$V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$

• Consider an optimal policy $$\pi_\star$$ and the value $$V^{\pi_\star}$$
• By part 1, we know that $$V^{\pi_\star}$$ satisfies BOE
• We bound $$|V^{\pi}(s)-V^{\pi_\star}(s)|$$
• $$=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|$$ (BOE by assumption and part 1)
• $$\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|$$
• basic inequality from PSet 1: $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$
• $$\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|$$ (linearity of expectation)

## Bellman Optimality Proof

Theorem (Bellman Optimality) 2: $$\pi$$ is an optimal policy if $$V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$

• Consider an optimal policy $$\pi_\star$$ and the value $$V^{\pi_\star}$$
• We bound $$|V^{\pi}(s)-V^{\pi_\star}(s)|$$
• $$\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|$$ (linearity of expectation)
• $$\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]$$ (basic inequality PSet 1)
• $$\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]$$ (apply the same bounds to $$|V^\pi(s')-V^{\pi_\star}(s')|$$)
• $$\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]$$
• $$\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|] \leq \gamma^k \|V^\pi-V^{\pi_\star}\|_\infty$$ (repeating $$k$$ times)
• $$\to 0$$ as $$k\to\infty$$, so $$|V^{\pi}(s)-V^{\pi_\star}(s)|=0$$ for every $$s$$
• Therefore, $$V^\pi = V^{\pi_\star}$$ so $$\pi$$ must be optimal

## Agenda

1. Policy Evaluation

2. Optimal Policies

3. Value Iteration

## Value Iteration

• The Bellman Optimality Equation is a fixed point equation! $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
• If $$V^\star$$ satisfies the BOE then $$\pi_\star(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$ is an optimal policy
• Idea: find $$\hat V$$ with fixed point iteration, then get approximately optimal policy $$\hat\pi$$.

## Value Iteration

Value Iteration

• Initialize $$V_1$$
• For $$t=1,\dots,T$$:
• $$V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]$$
• Return $$\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]$$
• Idea: find $$\hat V$$ with fixed point iteration, then get approximately optimal policy $$\hat\pi$$.
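A sketch of Value Iteration on the running example (parameter values assumed), returning the greedy policy with respect to $$V_T$$:

```python
import numpy as np

# Value Iteration on the two-state example; p1, p2, gamma are assumed values.
p1, p2, gamma, T = 0.7, 0.4, 0.9, 200
R = np.array([[1.0, 0.5], [0.0, -0.5]])        # R[s, a]; 0 = stay, 1 = switch
P = np.array([[[1.0, 0.0], [1 - p1, p1]],      # P[a, s, s']
              [[0.0, 1.0], [p2, 1 - p2]]])

V = np.zeros(2)                                # initialize V_1
for t in range(T):
    V = (R.T + gamma * P @ V).max(axis=0)      # Bellman optimality backup

pi_hat = (R.T + gamma * P @ V).argmax(axis=0)  # greedy policy w.r.t. V_T
print(pi_hat)   # the approximately optimal policy, one action per state
```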

## Bellman Operator

• Define the Bellman Operator $$\mathcal T:\mathbb R^S\to \mathbb R^S$$ as $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
• Nonlinear map
• Value Iteration is repeated application of the Bellman Operator
• Compare with Bellman Expectation Equation we used in Approximate Policy Evaluation
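The Bellman Operator can be written as a function, and its $$\gamma$$-contraction property (which underlies Value Iteration's convergence) probed numerically; a sketch on the same illustrative example:

```python
import numpy as np

# The Bellman operator T and a numerical check that it is a gamma-contraction
# in the infinity norm. Same illustrative two-state example; assumed params.
p1, p2, gamma = 0.7, 0.4, 0.9
R = np.array([[1.0, 0.5], [0.0, -0.5]])
P = np.array([[[1.0, 0.0], [1 - p1, p1]],
              [[0.0, 1.0], [p2, 1 - p2]]])

def bellman_op(V):
    # (T V)(s) = max_a [ r(s, a) + gamma E_{s' ~ P(s, a)}[V(s')] ]
    return (R.T + gamma * P @ V).max(axis=0)

rng = np.random.default_rng(0)
for _ in range(1000):
    V1, V2 = rng.normal(size=2), rng.normal(size=2)
    gap = np.max(np.abs(bellman_op(V1) - bellman_op(V2)))
    assert gap <= gamma * np.max(np.abs(V1 - V2)) + 1e-12
print("gamma-contraction held on all random pairs")
```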

## Recap

• PSet 1 due Monday
• PA 1 released today

• Policy Evaluation
• Optimal Policies
• Value Iteration

• Next lecture: Value Iteration, Policy Iteration
