Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Announcements

• Homework this week
  • Problem Set 2 due Monday 2/13
  • Programming Assignment 1 due 2/15
  • Next PSet and PA released on 2/15
• My office hours:
  • Tuesdays 10:30-11:30am in Gates 416A
  • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

## Agenda

1. Recap

2. Policy Iteration

3. Finite Horizon MDP

4. Dynamic Programming

## Recap: Bellman Equations

Bellman Optimality Equation (BOE): The optimal value satisfies, $$\forall s$$, $$V^\star(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] \right]$$

Bellman Expectation Equation: For a given policy $$\pi$$, the value is, $$\forall s$$,

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$

## Recap: Value Iteration

Value Iteration

• Initialize $$V_0$$
• For $$t=0,\dots,T-1$$:
  • $$V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]$$
• Return $$\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]$$
• Idea: find an approximately optimal $$\hat V$$ with fixed-point iteration, then extract an approximately optimal policy $$\hat\pi(s)=\arg\max_{a\in\mathcal A} \hat Q(s,a)$$ (see the sketch below)
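A minimal numpy sketch of the loop above, assuming tabular arrays `P[s, a, s']` for transitions and `r[s, a]` for rewards (the names and shapes are illustrative, not taken from the programming assignment):

```python
import numpy as np

def value_iteration(P, r, gamma, T):
    """Run T Bellman backups, then extract the greedy policy.

    P: (S, A, S) array, P[s, a, s'] = probability of s' given (s, a)
    r: (S, A) array of rewards
    """
    S, A = r.shape
    V = np.zeros(S)                          # V_0 initialized to zero
    for _ in range(T):
        V = (r + gamma * P @ V).max(axis=1)  # V_{t+1}(s) = max_a [r + gamma E V_t]
    Q_T = r + gamma * P @ V                  # one more backup to read off the policy
    return V, Q_T.argmax(axis=1)             # hat pi(s) = argmax_a Q_T(s, a)
```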

## Q Value Iteration

Q Value Iteration

• Initialize $$Q_0$$
• For $$t=0,\dots,T-1$$:
  • $$Q_{t+1}(s, a) = r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t}(s', a') \right]$$
• Return $$\displaystyle \hat\pi(s) =\arg\max_{a\in\mathcal A} Q_T (s,a)$$

We can think of the Q function as an $$S\times A$$ array or an $$SA$$ vector
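Treating the Q function as an $$S\times A$$ array, the loop above acts on it directly; a short sketch under the same assumed array layout as before:

```python
import numpy as np

def q_value_iteration(P, r, gamma, T):
    """Q-value iteration on a tabular MDP with P: (S, A, S) and r: (S, A)."""
    S, A = r.shape
    Q = np.zeros((S, A))                      # Q_0
    for _ in range(T):
        Q = r + gamma * P @ Q.max(axis=1)     # backup uses max_{a'} Q_t(s', a')
    return Q, Q.argmax(axis=1)                # greedy policy from Q_T
```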

## Recap: Convergence of VI

Define the Bellman Operator $$\mathcal T:\mathbb R^S\to \mathbb R^S$$ as $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$

Lemma (Contraction): For any $$V, V'$$, $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$

Lemma (Convergence): For iterates $$V_t$$ of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
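The contraction property is easy to check numerically: apply $$\mathcal T$$ to two arbitrary value vectors on a randomly generated MDP and compare sup-norm distances. A sketch, with all sizes and seeds chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)            # normalize into a transition tensor
r = rng.random((S, A))

def bellman(V):
    return (r + gamma * P @ V).max(axis=1)   # (T V)(s) = max_a [r(s,a) + gamma E V(s')]

V1, V2 = rng.random(S), rng.random(S)
lhs = np.abs(bellman(V1) - bellman(V2)).max()
rhs = gamma * np.abs(V1 - V2).max()
assert lhs <= rhs + 1e-12                    # ||T V1 - T V2||_inf <= gamma ||V1 - V2||_inf
```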

## Performance of VI Policy

Theorem (Suboptimality): For policy $$\pi_T$$ from VI, $$\forall s$$ $$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$

• So far we know that $$V_t$$ converges to $$V^\star$$
• But is $$\pi_t$$ near optimal?
• $$V_t$$ is not exactly equal to $$V^{\pi_t}$$
• $$V^{\pi_t} = (I-\gamma P_{\pi_t})^{-1}R^{\pi_t}$$
• $$V_t$$ may not correspond to the value of any policy

## Performance of VI Policy

Proof

• Claim: $$V^{\pi_t}(s) - V^\star(s) \geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• Recursing once: $$V^{\pi_t}(s) - V^\star(s)$$
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}\left[\gamma \mathbb E_{s''\sim P(s',\pi_t(s'))}[V^{\pi_t}(s'')-V^{\star}(s'')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty\right]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• $$= \gamma^2 \mathbb E_{s''}\left[V^{\pi_t}(s'')-V^{\star}(s'')\right]-2\gamma^{t+2} \|V_0-V^{\star}\|_\infty-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$
• Recursing $$k$$ times,
$$V^{\pi_t}(s) - V^\star(s) \geq \gamma^k \mathbb E_{s_k}[V^{\pi_t}(s_k)-V^{\star}(s_k)]-2\gamma^{t+1}\sum_{\ell=0}^{k-1}\gamma^{\ell} \|V_0-V^{\star}\|_\infty$$
• Letting $$k\to\infty$$, $$V^{\pi_t}(s) - V^\star(s) \geq \frac{-2\gamma^{t+1}}{1-\gamma} \|V_0-V^{\star}\|_\infty$$

Theorem (Suboptimality): For policy $$\pi_T$$ from VI, $$\forall s$$ $$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$

Proof of Claim:

$$V^{\pi_t}(s) - V^\star(s) =$$

• $$= r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')] - V^\star(s) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')]$$ (Bellman Expectation, add and subtract)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_t(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')]$$ (Grouping terms, add and subtract)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+r(s, \pi_t(s)) - V^\star(s) +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - r(s, \pi_t(s)) - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + r(s, \pi_\star(s)) + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]$$ (Definition of $$\pi_t$$ as argmax)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]- \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')] - \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V_{t}(s')] + \gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')]$$ (Bellman Expectation on $$V^\star$$ and cancelling reward terms)
• $$= \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]+\gamma \mathbb E_{s'\sim P(s,\pi_\star(s))}[V_{t}(s')-V^{\star}(s')] +\gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\star}(s')-V_{t}(s')]$$ (Linearity of Expectation)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma \|V_t-V^{\star}\|_\infty$$ (Basic Inequality)
• $$\geq \gamma \mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_t}(s')-V^{\star}(s')]-2\gamma^{t+1} \|V_0-V^{\star}\|_\infty$$ (Convergence Lemma)

## Agenda

1. Recap

2. Policy Iteration

3. Finite Horizon MDP

4. Dynamic Programming

## Policy Iteration

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
  • Compute $$V^{\pi_t}$$ with Policy Evaluation
  • Policy Improvement: $$\forall s$$, $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]$$
• Policy Iteration updates a policy at every iteration step (see the sketch below)
  • contrast with VI, which generates a policy only at the end
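A sketch of policy iteration in which Policy Evaluation solves the linear system $$V^{\pi} = (I-\gamma P_{\pi})^{-1}R^{\pi}$$ exactly, using the same assumed `P[s, a, s']` and `r[s, a]` arrays as in the earlier sketches:

```python
import numpy as np

def policy_iteration(P, r, gamma, T):
    """T rounds of exact policy evaluation + greedy improvement (T >= 1)."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                  # pi_0: arbitrary initial policy
    for _ in range(T):
        # Policy evaluation: V^pi = (I - gamma P_pi)^{-1} r_pi
        P_pi = P[np.arange(S), pi]               # (S, S) transition matrix under pi
        r_pi = r[np.arange(S), pi]               # (S,)  reward vector under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to V^pi
        pi = (r + gamma * P @ V).argmax(axis=1)
    return pi, V
```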

## Example: PA 1

(Figure: state diagram from Programming Assignment 1, with states numbered $$0$$ through $$16$$.)

## Policy Iteration

• Two key properties:
1. Monotonic Improvement: $$V^{\pi_{t+1}} \geq V^{\pi_t}$$
2. Convergence: $$\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty$$

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
  • Compute $$V^{\pi_t}$$ with Policy Evaluation
  • Policy Improvement: $$\forall s$$, $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi_t}(s')]$$

## Monotonic Improvement

Lemma (Monotonic Improvement): For iterates $$\pi_t$$ of PI, the value monotonically improves, i.e. $$V^{\pi_{t+1}} \geq V^{\pi_{t}}$$

Proof:

• $$V^{\pi_{t+1}}(s) - V^{\pi_t}(s) =$$
• $$=r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')] - (r(s,\pi_t(s))+\gamma\mathbb E_{s'\sim P(s,\pi_t(s))}[V^{\pi_{t}}(s')])$$ (Bellman Expectation Eq)
• $$\geq r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t}}(s')])$$ (definition of $$\pi_{t+1}$$ in Policy Improvement step)
• $$= \gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')-V^{\pi_{t}}(s')]$$
• In vector form, $$V^{\pi_{t+1}} - V^{\pi_t} \geq \gamma P_{\pi_{t+1}} (V^{\pi_{t+1}} - V^{\pi_t})$$
• $$V^{\pi_{t+1}} - V^{\pi_t} \geq \gamma^k P^k_{\pi_{t+1}} (V^{\pi_{t+1}} - V^{\pi_t})$$ (iterate $$k$$ times)
• letting $$k\to\infty$$, $$V^{\pi_{t+1}} - V^{\pi_t} \geq 0$$

Consider vectors $$V,V'$$ and a matrix $$P$$ with nonnegative entries.

In homework, you will show that if $$V\leq V'$$ then $$PV\leq PV'$$ (inequalities hold entrywise).

You will also show that $$(P^\pi)^k$$ is bounded when $$P^\pi$$ is a stochastic matrix.

## Convergence of PI

Theorem (PI Convergence): For $$\pi_t$$ from PI,  $$\|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$

Proof:

• $$V^\star(s) - V^{\pi_{t+1}}(s) =$$
• $$=\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t+1}}(s')])$$ (Bellman Optimality and Expectation Eq)
• $$\leq \max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - (r(s,\pi_{t+1}(s))+\gamma\mathbb E_{s'\sim P(s,\pi_{t+1}(s))}[V^{\pi_{t}}(s')])$$ (Monotonic Improvement)
• $$= \max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right]$$ (Definition of $$\pi_{t+1}$$ in Policy Improvement)
• $$|V^\star(s) - V^{\pi_{t+1}}(s) |$$
• $$\leq|\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right]|$$

## Convergence of PI

Theorem (PI Convergence): For $$\pi_t$$ from PI,  $$\|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$

Proof:

• $$|V^\star(s) - V^{\pi_{t+1}}(s) |$$
• $$\leq|\max_a\left[ r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]\right] - \max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right]|$$
• $$\leq \max_a\left| r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]-(r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')])\right|$$ (Basic Inequality PSet 1)
• $$= \gamma\max_a\left|\mathbb E_{s'\sim P(s,a)}[V^{\star}(s')]-\mathbb E_{s'\sim P(s,a)}[V^{\pi_{t}}(s')]\right|$$
• $$\leq \gamma\max_{a,s'}\left|V^{\star}(s')-V^{\pi_{t}}(s')\right|$$ (Basic Inequality PSet 1)
• $$= \gamma\|V^{\star}-V^{\pi_{t}}\|_\infty$$
• By induction, this implies that $$\|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$

## VI and PI Comparison

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
  • Policy Evaluation: $$V^{\pi_t}$$
  • Policy Improvement: $$\pi_{t+1}$$

Value Iteration

• Initialize $$V_0$$
• For $$t=0,\dots,T-1$$:
  • Bellman Operator: $$V_{t+1}$$
• Return $$\displaystyle \hat\pi$$
• Both converge geometrically: for any finite $$T$$, the error is small but not exactly zero
• Policy Iteration is guaranteed to converge to the exactly optimal policy in finite time (PSet 3)
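One way to see the contrast is to run both methods on the same randomly generated MDP: the value-iteration error shrinks geometrically but typically remains nonzero after any finite number of backups, while the policy-iteration policy usually stops changing after a handful of improvement steps. A self-contained sketch (sizes, seed, and iteration counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)                # random transition tensor
r = rng.random((S, A))

# Reference V* from a large number of Bellman backups
V_star = np.zeros(S)
for _ in range(2000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

# Value iteration: geometric but nonzero error after finitely many steps
V = np.zeros(S)
for _ in range(20):
    V = (r + gamma * P @ V).max(axis=1)
print("VI error after 20 steps:", np.abs(V - V_star).max())

# Policy iteration: the greedy policy typically becomes fixed after a few steps
pi = np.zeros(S, dtype=int)
for t in range(20):
    P_pi, r_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    new_pi = (r + gamma * P @ V_pi).argmax(axis=1)
    if np.array_equal(new_pi, pi):
        print(f"PI policy fixed after {t} steps, error:", np.abs(V_pi - V_star).max())
        break
    pi = new_pi
```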

## Finite Horizon MDP

• $$\mathcal{S}, \mathcal{A}$$ state and action space
• $$r$$ reward function, $$P$$ transition function
• $$H$$ is horizon (positive integer)

Goal: achieve high cumulative reward $$\sum_{t=0}^{H-1} r_t$$:

$$\max_{\pi} ~~ \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right] \quad \text{s.t.} \quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}$$

Lasts exactly $$H$$ steps, no discounting

## Example

(Diagram: two states $$0$$ and $$1$$. From state $$0$$, "stay" remains in $$0$$ and "switch" moves to $$1$$, each with probability $$1$$. From state $$1$$, "stay" moves to $$0$$ with probability $$p_1$$ and stays with probability $$1-p_1$$; "switch" moves to $$0$$ with probability $$p_2$$ and stays with probability $$1-p_2$$.)

• Reward:  $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• Finite horizon $$H$$
• How might optimal policy differ when $$t$$ close to $$H$$?


## Time Varying Policies and Value

We consider time-varying policies $$\pi = (\pi_0,\dots,\pi_{H-1})$$

The value of a state also depends on time

$$V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k, a_k) \mid s_t=s,s_{k+1}\sim P(s_k, a_k),a_k\sim \pi_k(s_k)\right]$$

## Example

(Diagram: the same two-state MDP as above, with "stay" keeping state $$0$$, "switch" moving $$0\to 1$$, and state $$1$$ reaching $$0$$ with probability $$p_1$$ under "stay" or $$p_2$$ under "switch".)

• Reward:  $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• Consider a policy that "stays" for all states and time
• $$V^\pi_t(0) =$$ PollEv answer: $$H-t$$ (verified in the sketch below)
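The poll answer can be checked by backward policy evaluation on this two-state MDP: under the always-stay policy, state $$0$$ collects $$+1$$ at every remaining step, so $$V^\pi_t(0)=H-t$$. A sketch with illustrative parameter values ($$H$$, $$p_1$$, $$p_2$$ are arbitrary choices, not from the slides):

```python
import numpy as np

H, p1, p2 = 10, 0.3, 0.4                # illustrative values
# States {0, 1}; actions {0: stay, 1: switch}
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]                    # (s=0, stay) stays in 0
P[0, 1] = [0.0, 1.0]                    # (s=0, switch) moves to 1
P[1, 0] = [p1, 1 - p1]                  # (s=1, stay) reaches 0 with prob p1
P[1, 1] = [p2, 1 - p2]                  # (s=1, switch) reaches 0 with prob p2
r = np.array([[1.0, 0.5],               # +1 for s=0, extra -1/2 for switching
              [0.0, -0.5]])

# Evaluate the "always stay" policy backward in time, starting from V_H = 0
V = np.zeros(2)
for t in reversed(range(H)):
    V = r[:, 0] + P[:, 0] @ V           # V_t(s) = r(s, stay) + E[V_{t+1}(s')]
    if t in (0, H - 1):
        print(t, V[0])                  # prints H-1 -> 1.0, then 0 -> 10.0
```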

## Finite Horizon Bellman Eqns

Bellman Expectation Equation: $$\forall s$$,

$$V_t^{\pi}(s) = \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]$$

Q function

$$Q_t^{\pi}(s, a) = \ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')]$$

Rather than a self-referential recursion, in the finite-horizon setting we have an iteration: $$V_t^\pi$$ is expressed in terms of $$V_{t+1}^\pi$$, so it can be computed exactly by stepping backward in time

## Dynamic Programming

• Initialize $$V^\star_H = 0$$
• For $$t=H-1, H-2, ..., 0$$:
  • $$Q_t^\star(s,a) = r(s,a)+\mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$
  • $$\pi_t^\star(s) = \arg\max_a Q_t^\star(s,a)$$
  • $$V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )$$

Bellman optimality is also an iterative rather than a recursive equation: $$V^\star_t(s)=\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$

• We can solve this iteration directly and exactly (rather than approximately like VI and PI)
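A minimal sketch of this backward recursion for a tabular finite-horizon MDP, again assuming `P[s, a, s']` and `r[s, a]` arrays (names and shapes are illustrative):

```python
import numpy as np

def finite_horizon_dp(P, r, H):
    """Exact backward DP: optimal time-varying values and policies.

    P: (S, A, S) transitions, r: (S, A) rewards, H: horizon.
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))              # V[H] = 0 (terminal values)
    pi = np.zeros((H, S), dtype=int)      # time-varying policy pi_t(s)
    for t in range(H - 1, -1, -1):
        Q = r + P @ V[t + 1]              # Q_t(s,a) = r(s,a) + E[V_{t+1}(s')]
        pi[t] = Q.argmax(axis=1)          # pi_t(s) = argmax_a Q_t(s,a)
        V[t] = Q.max(axis=1)              # V_t(s) = Q_t(s, pi_t(s))
    return V, pi
```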

## Example

• Reward:  $$+1$$ for $$s=0$$ and $$-\frac{1}{2}$$ for $$a=$$ switch
• $$Q^{\star}_{H-1}(s,a)=r(s,a)$$ for all $$s,a$$
• $$\pi^\star_{H-1}(s)=$$stay for all $$s$$
• $$V^\star_{H-1}=\begin{bmatrix}1\\0\end{bmatrix}$$, $$Q^\star_{H-2}=\begin{bmatrix}2&\frac{1}{2}\\p & -\frac{1}{2}+2p\end{bmatrix}$$
• $$\pi^\star_{H-2}(s)=$$stay for all $$s$$ (using $$p\leq \tfrac{1}{2}$$, so $$p\geq -\tfrac{1}{2}+2p$$)
• $$V^\star_{H-2}=\begin{bmatrix}2\\p\end{bmatrix}$$, $$Q^\star_{H-3}=\begin{bmatrix}3&\frac{1}{2}+p\\(1-p)p+2p & -\frac{1}{2}+(1-2p)p+4p\end{bmatrix}$$
• $$\pi^\star_{H-3}(0)=$$stay and $$\pi^\star_{H-3}(1)=$$switch if $$p\geq 1-\frac{1}{\sqrt{2}}$$

(Diagram: the two-state MDP with $$p_1=p$$ and $$p_2=2p$$. From state $$0$$, "stay" remains in $$0$$ and "switch" moves to $$1$$, each with probability $$1$$. From state $$1$$, "stay" reaches $$0$$ with probability $$p$$ and stays with probability $$1-p$$; "switch" reaches $$0$$ with probability $$2p$$ and stays with probability $$1-2p$$.)
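The calculations above can be reproduced with backward DP; the self-contained sketch below checks $$Q^\star_{H-2}$$, $$Q^\star_{H-3}$$, and the threshold $$p\geq 1-\frac{1}{\sqrt 2}\approx 0.293$$ at which switching becomes optimal in state $$1$$ three steps from the end (the two values of $$p$$ are illustrative):

```python
import numpy as np

def dp(P, r, H):
    """Backward DP returning all time-indexed values V_t and Q_t."""
    S, A = r.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for t in range(H - 1, -1, -1):
        Q[t] = r + P @ V[t + 1]
        V[t] = Q[t].max(axis=1)
    return V, Q

H = 5
for p in (0.25, 0.35):                                  # below / above 1 - 1/sqrt(2)
    P = np.zeros((2, 2, 2))
    P[0, 0], P[0, 1] = [1, 0], [0, 1]                   # state 0: stay stays, switch leaves
    P[1, 0], P[1, 1] = [p, 1 - p], [2 * p, 1 - 2 * p]   # state 1 (requires p <= 1/2)
    r = np.array([[1.0, 0.5], [0.0, -0.5]])             # +1 for s=0, -1/2 for switching
    V, Q = dp(P, r, H)
    # Q[H-2] matches [[2, 1/2], [p, -1/2 + 2p]]; the last entry is pi*_{H-3}(1)
    print(p, Q[H - 2], Q[H - 3].round(3), ["stay", "switch"][Q[H - 3, 1].argmax()])
```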

## Recap

• PSet 2 due Monday
• PA 1 due next Wednesday

• Value & Policy Iteration
• Finite Horizon MDP
• Dynamic Programming

• Next lecture: continuous control
