Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
The value function \(V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k,a_k) \mid s_t=s \right]\)
Bellman Consistency Equation enables efficient policy evaluation
Optimal policies have optimal (highest) value \(V^\star_t(s)\) for all \(t,s\)
Bellman Optimality Equation (BOE) enables efficient policy optimization
The optimal policy is greedy with respect to the optimal value $$\pi_t^\star(s) = \arg\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$
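As an illustration of the recap (a minimal sketch, not from the slides), backward induction computes \(V_t^\star\) and the greedy \(\pi_t^\star\) for a tabular MDP; the array names `r[s, a]` and `P[s, a, s']` are assumptions:

```python
import numpy as np

def backward_induction(r, P, H):
    """Optimal values V_t^*(s) and greedy policies pi_t^*(s) for a
    tabular MDP with rewards r[s, a] and transitions P[s, a, s']."""
    S, A = r.shape
    V = np.zeros((H + 1, S))        # V_H = 0: no reward after the horizon
    pi = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):
        Q = r + P @ V[t + 1]        # Q[s, a] = r(s, a) + E_{s' ~ P(s, a)}[V_{t+1}(s')]
        V[t] = Q.max(axis=1)        # optimal value at time t
        pi[t] = Q.argmax(axis=1)    # greedy w.r.t. the optimal value
    return V, pi
```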
1. Recap
2. Infinite Horizon & Value Function
3. Bellman Equation
4. Policy Evaluation
Goal: achieve high cumulative reward:
$$\sum_{t=0}^\infty \gamma^t r_t$$
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
action \(a_t\in\mathcal A\)
state \(s_t\in\mathcal S\)
reward
\(r_t= r(s_t, a_t)\)
\(s_{t+1}\sim P(s_t, a_t)\)
The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s,\ s_{t+1}\sim P(s_t, a_t),\ a_t\sim \pi(s_t) \right]$$
In this lecture, we take \(r(s,a)\) to be deterministic and \(\pi\) to be stationary and state-dependent.
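One way to make this definition concrete (a sketch assuming helper functions `policy(s)`, `reward(s, a)`, and `sample_next(s, a)`, which are not part of the slides) is to estimate \(V^\pi(s)\) by averaging truncated discounted returns over sampled rollouts:

```python
import numpy as np

def mc_value_estimate(s0, policy, reward, sample_next, gamma,
                      n_rollouts=1000, horizon=500):
    """Monte Carlo estimate of V^pi(s0): average of truncated discounted returns.
    Truncation is justified because gamma^horizon is negligible for large horizon."""
    returns = []
    for _ in range(n_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                     # a_t ~ pi(s_t)
            total += discount * reward(s, a)  # accumulate gamma^t r(s_t, a_t)
            discount *= gamma
            s = sample_next(s, a)             # s_{t+1} ~ P(s_t, a_t)
        returns.append(total)
    return float(np.mean(returns))
```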
[Figure: two-state MDP with states \(0\) and \(1\), actions stay and move, and transition probabilities \(p_1,\ 1-p_1,\ p_2,\ 1-p_2\).]
PSet preview: what distribution determines the value function?
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, \pi(s_t)) \mid s_0=s, P, \pi\right]$$
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Exercise: review proof (below)
Bellman Consistency Equation:
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
...
...
...
The cumulative reward expression is almost recursive:
$$\sum_{t=0}^\infty \gamma^t r(s_t, a_t) = r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) $$
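A sketch of the key step (under the assumptions above: deterministic rewards and stationary \(\pi\)): taking expectations of both sides of this recursion and using the Markov property gives the consistency equation,
$$\begin{aligned} V^\pi(s) &= \mathbb E\Big[ r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \,\Big|\, s_0 = s\Big] \\ &= \mathbb E_{a\sim\pi(s)}\Big[ r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}\Big[\mathbb E\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \,\Big|\, s_0=s'\Big]\Big]\Big] \\ &= \mathbb E_{a\sim\pi(s)}\big[ r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}[V^\pi(s')]\big]. \end{aligned}$$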
Proof
[Figure: two-state MDP example under the policy \(\pi(s)=\) stay.]
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
In vector notation, \(P_\pi\) is the \(S\times S\) matrix whose row \(s\) is \(P(\cdot\mid s,\pi(s))\), i.e. \([P_\pi]_{s,s'} = P(s'\mid s,\pi(s))\), and \(V^\pi, R^\pi\) are the vectors with entries \(V^\pi(s)\) and \(r(s,\pi(s))\).
Matrix inversion is slow! \(\mathcal O(S^3)\)
To exactly compute the value function, we just need to solve the \(S\times S\) system of linear equations:
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
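A minimal sketch of exact policy evaluation in NumPy, solving \((I - \gamma P_\pi)V^\pi = R^\pi\); the array names `r`, `P`, `pi` are assumptions, not from the slides:

```python
import numpy as np

def exact_policy_evaluation(r, P, pi, gamma):
    """Solve V = R_pi + gamma P_pi V for a deterministic policy pi[s],
    rewards r[s, a], and transitions P[s, a, s']."""
    S = r.shape[0]
    R_pi = r[np.arange(S), pi]   # R_pi[s] = r(s, pi(s))
    P_pi = P[np.arange(S), pi]   # P_pi[s, s'] = P(s' | s, pi(s))
    # Solving the linear system costs O(S^3), the same order as explicit inversion
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```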
Approximate Policy Evaluation:
Complexity of each iteration?
To trade off exactness for computation time, we can use a fixed point iteration: initialize \(V_0\) and iterate \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)
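A sketch of this iteration under the same assumed array representation (hypothetical names `R_pi`, `P_pi`):

```python
import numpy as np

def approx_policy_evaluation(R_pi, P_pi, gamma, tol=1e-6, max_iters=10_000):
    """Fixed point iteration V_{t+1} = R_pi + gamma P_pi V_t, starting from V_0 = 0."""
    V = np.zeros_like(R_pi)
    for _ in range(max_iters):
        V_next = R_pi + gamma * P_pi @ V       # one matrix-vector multiply: O(S^2)
        if np.max(np.abs(V_next - V)) < tol:   # stop when successive iterates are close
            return V_next
        V = V_next
    return V
```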
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
[Figure: two-state MDP example under the policy \(\pi(s)=\) stay.]
To show that Approx PE works, we first prove a contraction lemma
Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$
Proof
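A sketch of the argument (using the update \(V_{t+1} = R^\pi + \gamma P_\pi V_t\), the equation \(V^\pi = R^\pi + \gamma P_\pi V^\pi\), and the fact that the rows of \(P_\pi\) sum to one):
$$\|V_{t+1} - V^\pi\|_\infty = \|\gamma P_\pi (V_t - V^\pi)\|_\infty = \gamma \max_s \Big|\sum_{s'} P(s'\mid s,\pi(s))\,(V_t - V^\pi)(s')\Big| \leq \gamma \|V_t - V^\pi\|_\infty$$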
Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$
so an \(\epsilon\)-correct solution requires
\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)
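since requiring the theorem's error bound to be at most \(\epsilon\) gives
$$\gamma^T \|V_0 - V^\pi\|_\infty \leq \epsilon \iff T \log\tfrac{1}{\gamma} \geq \log\tfrac{\|V_0-V^\pi\|_\infty}{\epsilon} \iff T \geq \log\tfrac{\|V_0-V^\pi\|_\infty}{\epsilon} \Big/ \log\tfrac{1}{\gamma}$$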