Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Announcements

• Homework released this week
• Problem Set 1 released tonight, due 2/6
• Programming Assignment 1 released in a day or two, due 2 weeks later
• Office Hours posted on Ed
• Materials on Canvas

## Agenda

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

## Recap: Infinite Horizon MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$ defined by states, actions, reward, transition, discount factor

(Figure: agent–environment loop with action $$a_t\in\mathcal A$$, state $$s_t\in\mathcal S$$, reward $$r_t\sim r(s_t, a_t)$$, and transition $$s_{t+1}\sim P(s_t, a_t)$$)

## Recap: Infinite Horizon MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$ defined by states, actions, reward, transition, discount factor

Goal: find the policy $$\pi$$ to

maximize   $$\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

s.t.   $$s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

## Recap: Behavioral Cloning

Dataset from expert policy $$\pi_\star$$: $$\{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star$$

minimize over $$\pi$$:   $$\sum_{i=1}^N \ell(\pi(s_i), a_i)$$

(Figure: expert trajectory vs. learned policy)

No training data of "recovery" behavior!

### Recap: Trajectory and State Distributions

• Probability of trajectory $$\tau =(s_0, a_0, s_1, ... s_t, a_t)$$ under policy $$\pi$$ starting from initial distribution $$\mu_0$$: $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
• State distribution $$d_t$$ as $${|\mathcal S|}$$ dimensional vector with $$d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\}$$

(Figure: trajectory $$s_0, a_0, s_1, a_1, s_2, a_2, \dots$$ unrolling under $$\pi$$ and $$P$$)
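The trajectory probability formula above can be sketched in code; the MDP below (transition tensor `P`, stochastic policy `pi`, initial distribution `mu0`) is a hypothetical two-state example, not the one from the slides.

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustrative numbers, not the slides' figure).
# P[s, a, s2] = transition probability, pi[s, a] = probability of action a in state s.
P = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.3, 0.7], [0.0, 1.0]]])
pi = np.array([[0.9, 0.1],
               [0.5, 0.5]])
mu0 = np.array([0.6, 0.4])  # initial state distribution

def trajectory_prob(tau):
    """P(tau) = mu0(s0) pi(a0|s0) prod_i P(s_i|s_{i-1}, a_{i-1}) pi(a_i|s_i)."""
    (s, a) = tau[0]
    p = mu0[s] * pi[s, a]
    for s2, a2 in tau[1:]:
        p *= P[s, a, s2] * pi[s2, a2]
        s, a = s2, a2
    return p

# 0.6*0.9 * 1.0*0.1 * 0.8*0.5 = 0.0216
print(trajectory_prob([(0, 0), (0, 1), (1, 1)]))
```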

### Recap: Trajectory and State Distributions

• Probability of trajectory $$\tau =(s_0, a_0, s_1, ... s_t, a_t)$$: $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
• State distribution $$d_t$$ as $${|\mathcal S|}$$ dimensional vector with $$d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\}$$

Proposition: The state distribution evolves according to $$d_t = (P_\pi^t)^\top d_0$$

(Figure: the matrix $$P_\pi^\top$$ with entry $$P(s'\mid s,\pi(s))$$ in row $$s'$$, column $$s$$)
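As a sanity check on the proposition, a short numpy sketch (with an illustrative two-state chain, not the slides' example) propagates the state distribution via $$d_t = (P_\pi^t)^\top d_0$$:

```python
import numpy as np

# Illustrative policy-induced chain: P_pi[s, s2] = P(s2 | s, pi(s)).
P_pi = np.array([[0.9, 0.1],
                 [0.4, 0.6]])
d0 = np.array([1.0, 0.0])  # start deterministically in state 0

def state_dist(t):
    """d_t = (P_pi^t)^T d_0, per the proposition."""
    return np.linalg.matrix_power(P_pi.T, t) @ d0

# one-step check: d_1[s2] = sum_s d_0[s] P_pi[s, s2]
assert np.allclose(state_dist(1), d0 @ P_pi)
print(state_dist(5))  # still a valid distribution: entries sum to 1
```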

## Agenda

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

## Value Function

The value of a state $$s$$ under a policy $$\pi$$ is the expected cumulative discounted reward starting from that state:

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0=s,\, s_{t+1}\sim P(s_t, a_t),\, a_t\sim \pi(s_t),\, r_t\sim r(s_t, a_t)\right]$$

simplification for the rest of lecture: $$r(s,a)$$ is deterministic

## Example

(Figure: two-state MDP with actions stay and switch; from state $$0$$ both actions are deterministic; from state $$1$$, stay remains w.p. $$p_1$$ and switch reaches state $$0$$ w.p. $$p_2$$)

• Recall simple MDP example
• Suppose the reward is:
• $$r(0,a)=1$$ and $$r(1,a)=0$$ for all $$a$$
• Consider the policy
• $$\pi(s)=$$stay for all $$s$$
• Simulate reward sequences

## Rollout vs. Expectation

(Figure: several sampled rollouts from the same initial state)

The cumulative reward of a given trajectory is $$\sum_{t=0}^\infty \gamma^t r(s_t, a_t)$$

The expected cumulative reward averages over all possible trajectories

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s,P,\pi\right]$$

• If $$s_0=0$$ then $$s_t=0$$ for all $$t$$
• PollEV: $$V^\pi(0) = \sum_{t=0}^\infty \gamma^t r(0,\pi(0))$$
• $$=\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}$$
• If $$s_0=1$$ then at some time $$T$$ state transitions and then $$s_t=0$$ for all $$t\geq T$$
• $$V^\pi(1) = \mathbb E_{T}[ \sum_{t=0}^{T-1} \gamma^t r(1,\pi(1)) + \sum_{t=T}^\infty \gamma^t r(0,\pi(0))]$$
• $$\displaystyle =\sum_{k=0}^\infty \mathbb P\{T=k\} \sum_{t=k}^\infty \gamma^t$$
• $$\displaystyle=\sum_{k=0}^\infty p_1^k(1-p_1) \frac{\gamma^k}{1-\gamma}=\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}$$

## Example

(Figure: Markov chain induced by the stay policy; state $$0$$ is absorbing, state $$1$$ remains w.p. $$p_1$$ and transitions to $$0$$ w.p. $$1-p_1$$)

## Example

(Figure: two-state MDP with actions stay and switch; from state $$0$$ both actions are deterministic; from state $$1$$, stay remains w.p. $$p_1$$ and switch reaches state $$0$$ w.p. $$p_2$$)

• Recall the reward is:
• $$r(0,a)=1$$ and $$r(1,a)=0$$ for all $$a$$
• Consider the policy
• $$\pi(1)=$$stay
• $$\pi(0)=\begin{cases}\textsf{stay} & \text{w.p. } p_1\\ \textsf{switch}& \text{w.p. } 1-p_1\end{cases}$$

## Example

(Figure: Markov chain induced by the policy; each state remains w.p. $$p_1$$ and transitions w.p. $$1-p_1$$)

• Recall the reward is:
• $$r(0,a)=1$$ and $$r(1,a)=0$$ for all $$a$$
• What happens?
• Is $$V^\pi(0)$$ or $$V^\pi(1)$$ larger?

Food for thought: what distribution determines the value function?

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi\right]$$

• Discounted "steady-state" distribution $$d_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_t$$
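The geometric series defining $$d_\gamma$$ has a closed form, $$d_\gamma = (1-\gamma)(I - \gamma P_\pi^\top)^{-1} d_0$$, by the Neumann series. A short numpy sketch (illustrative chain) can verify this against the truncated sum:

```python
import numpy as np

# d_gamma = (1-gamma) sum_t gamma^t d_t has the closed form
# (1-gamma) (I - gamma P_pi^T)^{-1} d_0, by the geometric (Neumann) series.
gamma = 0.9
P_pi = np.array([[0.9, 0.1],   # illustrative chain
                 [0.4, 0.6]])
d0 = np.array([1.0, 0.0])

d_gamma = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, d0)

# compare against a truncated version of the defining series
truncated = (1 - gamma) * sum(
    gamma**t * (np.linalg.matrix_power(P_pi.T, t) @ d0) for t in range(600))
print(d_gamma, truncated)  # d_gamma sums to 1 and matches the series
```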

## Agenda

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

## Bellman Expectation Equation

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$

Exercise: review proof (below)


The cumulative reward expression is almost recursive:

$$\sum_{t=0}^\infty \gamma^t r(s_t, a_t) = r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1})$$

## Bellman Expectation Equation

Proof

• $$V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]$$
• $$= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]$$
(linearity of expectation)
• $$= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]$$
(simplifying conditional expectation, re-indexing sum)
• $$= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]$$ (tower property of conditional expectation)
• $$= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]$$
(definition of value function and linearity of expectation)

## Example

(Figure: Markov chain induced by the policy; each state remains w.p. $$p_1$$ and transitions w.p. $$1-p_1$$)

Recall $$r(0,a)=1$$ and $$r(1,a)=0$$

• $$V^\pi(s)=\mathbb E_a[r(s,a)+\gamma \mathbb E_{s'}[V^\pi(s')]]$$
• $$V^\pi(0)=p_1(1+\gamma V^\pi(0))$$
$$\qquad \qquad + (1-p_1)(1+\gamma V^\pi(1))$$
• $$V^\pi(0)=\frac{1-\gamma p_1}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2}$$
• $$V^\pi(1)=0+\gamma(p_1V^\pi(1) + (1-p_1)V^\pi(0))$$
• $$V^\pi(1)=\frac{\gamma( 1-p_1)}{1-\gamma p_1}V^\pi(0)$$
• $$V^\pi(1)=\frac{\gamma( 1-p_1)}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2}$$
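The closed forms above can be checked numerically against the Bellman equations; the values of $$\gamma$$ and $$p_1$$ below are illustrative.

```python
import numpy as np

# Check that the closed forms satisfy the Bellman equations (illustrative values).
gamma, p1 = 0.9, 0.7
denom = (1 - gamma * p1)**2 - gamma**2 * (1 - p1)**2
V0 = (1 - gamma * p1) / denom
V1 = gamma * (1 - p1) / denom

# Bellman: V(0) = 1 + gamma(p1 V(0) + (1-p1) V(1)),  V(1) = gamma(p1 V(1) + (1-p1) V(0))
assert np.isclose(V0, 1 + gamma * (p1 * V0 + (1 - p1) * V1))
assert np.isclose(V1, gamma * (p1 * V1 + (1 - p1) * V0))
print(V0, V1)  # 5.78125, 4.21875 for these parameters
```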

## Agenda

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

## Policy Evaluation

• Consider a deterministic policy $$\pi$$. The Bellman equation:
•  $$V^{\pi}(s) = r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim P( s, \pi(s))} [V^\pi(s')]$$
•  $$V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s')$$
• As with the state distribution, consider $$V^\pi$$ as an $$S=|\mathcal S|$$ dimensional vector, and likewise $$R^\pi$$ with entries $$R^\pi(s) = r(s,\pi(s))$$
•  $$V^{\pi}(s) = R^\pi(s) + \gamma \langle P_\pi(s) , V^\pi \rangle$$

(Figure: the matrix $$P_\pi$$ with row $$s$$ given by $$P(\cdot\mid s,\pi(s))$$)

## Policy Evaluation

• $$V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s')$$
• The matrix vector form of the Bellman Equation is

$$V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi$$

(Figure: matrix-vector Bellman equation; row $$s$$ reads $$V^\pi(s) = r(s,\pi(s)) + \gamma \sum_{s'} P(s'\mid s,\pi(s))V^\pi(s')$$)

## Exact Policy Evaluation

$$V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi$$

To exactly compute the value function, we just need to solve the $$S\times S$$ system of linear equations:

• $$V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}$$ (PSet 1)

Matrix inversion is slow! $$\mathcal O(S^3)$$

## Approximate Policy Evaluation

To trade off accuracy for computation time, we can use a fixed point iteration algorithm:

• Initialize $$V_0$$
• For $$t=0,1,\dots, T$$: $$V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t$$

Complexity of each iteration? $$\mathcal O(S^2)$$

## Example

(Figure: Markov chain induced by the policy; each state remains w.p. $$p_1$$ and transitions w.p. $$1-p_1$$)

Recall $$r(0,a)=1$$ and $$r(1,a)=0$$

• $$R^\pi = \begin{bmatrix} 1\\ 0\end{bmatrix}$$
• $$P_\pi = \begin{bmatrix} p_1 & 1-p_1 \\ 1-p_1 &p_1 \end{bmatrix}$$
• Exercises:
• Approx PE with $$V_0 = [1,1]$$
• Confirm $$V^\pi=\begin{bmatrix}\frac{1-\gamma p_1}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2} \\ \frac{\gamma( 1-p_1)}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2} \end{bmatrix}$$ is a fixed point
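A numpy sketch of both exercises, with illustrative $$\gamma$$ and $$p_1$$: run Approx PE from $$V_0 = [1,1]$$ and compare against the exact solution of the linear system.

```python
import numpy as np

# Approx PE on the example: V_{t+1} = R + gamma P V_t from V_0 = [1, 1],
# compared to the exact solution (illustrative gamma, p1).
gamma, p1 = 0.9, 0.7
R = np.array([1.0, 0.0])
P = np.array([[p1, 1 - p1],
              [1 - p1, p1]])

V_exact = np.linalg.solve(np.eye(2) - gamma * P, R)

V = np.array([1.0, 1.0])
errs = []
for t in range(100):
    errs.append(np.abs(V - V_exact).max())
    V = R + gamma * P @ V

print(V, V_exact)         # iterates approach the fixed point
print(errs[1] / errs[0])  # per-step contraction factor, at most gamma
```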

## Convergence of Approx PE

To show that Approx PE works, we first prove a contraction lemma.

Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$

Proof

• $$\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P_\pi V_t-V^\pi\|_\infty$$ by algorithm definition
• $$= \|R^\pi + \gamma P_\pi V_t-(R^\pi + \gamma P_\pi V^\pi)\|_\infty$$ by Bellman eq
• $$= \| \gamma P_\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P_\pi(s), V_t-V^\pi\rangle|$$ norm definition
• $$=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|$$ expectation definition
• $$\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]$$ basic inequality (PSet 1)
• $$\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty$$ basic inequality (PSet 1)

## Convergence of Approx PE

Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$

so an $$\epsilon$$ correct solution requires

$$T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}$$

Proof

• The first statement follows by induction using the Lemma
• For the second statement,
• $$\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon$$
• Taking $$\log$$ of both sides,
• $$T\log \gamma + \log \|V_0-V^\pi\|_\infty \leq \log \epsilon$$, then rearrange
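Plugging illustrative numbers into the bound (hypothetical initial error and tolerance):

```python
import numpy as np

# Iterations sufficient for an epsilon-accurate value, per the theorem
# (hypothetical initial error and tolerance).
gamma, eps = 0.9, 1e-3
err0 = 5.0  # ||V_0 - V^pi||_inf, assumed known for illustration

T = int(np.ceil(np.log(err0 / eps) / np.log(1 / gamma)))
print(T)
assert gamma**T * err0 <= eps  # the guarantee holds
```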

## Recap

• PSet 1 released today
• Office Hours on Ed

• Value Function
• Bellman Equation
• Policy Evaluation

• Next lecture: Optimality
