CS 4/5789: Introduction to Reinforcement Learning
Lecture 3: Bellman Equations
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Announcements
Agenda
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Recap: Infinite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) defined by states, actions, reward, transition, discount factor


[Diagram: agent-environment loop. The agent observes state \(s_t\in\mathcal S\) and takes action \(a_t\in\mathcal A\); it receives reward \(r_t\sim r(s_t, a_t)\) and the environment transitions to \(s_{t+1}\sim P(s_t, a_t)\).]
Recap: Infinite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) defined by states, actions, reward, transition, discount factor
Goal: find a policy \(\pi\) to
maximize \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
Recap: Behavioral Cloning
Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$
Learn \(\pi\) by minimizing \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)

[Figure: expert trajectory vs. learned policy]
No training data of "recovery" behavior!
Recap: Trajectory and State Distributions
- Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) under policy \(\pi\) starting from initial distribution \(\mu_0\): $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
- State distribution \(d_t\) as \({|\mathcal S|}\) dimensional vector with $$ d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\} $$
[Diagram: trajectory unrolling \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\)]
Recap: Trajectory and State Distributions
- Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\): $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
- State distribution \(d_t\) as \({|\mathcal S|}\) dimensional vector with $$ d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\} $$
Proposition: The state distribution evolves according to \( d_t = (P_\pi^t)^\top d_0\)
[Diagram: the matrix \(P_\pi^\top\), whose \((s', s)\) entry is \(P(s'\mid s,\pi(s))\)]
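As a quick sanity check of the proposition, here is a minimal numerical sketch (my own, with an arbitrary two-state chain standing in for \(P_\pi\)) comparing step-by-step propagation with the closed form:

```python
# Minimal numerical check of d_t = (P_pi^t)^T d_0 for an arbitrary two-state chain.
import numpy as np

P_pi = np.array([[0.9, 0.1],      # row s, column s': P(s' | s, pi(s))
                 [0.3, 0.7]])
d0 = np.array([1.0, 0.0])         # start deterministically in state 0

# propagate one step at a time: d_{t+1} = P_pi^T d_t
d = d0
for _ in range(5):
    d = P_pi.T @ d

# compare against the closed form d_5 = (P_pi^5)^T d_0
print(d, np.linalg.matrix_power(P_pi, 5).T @ d0)
```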
Agenda
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state
Value Function
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t), r_t\sim r(s_t, a_t)\right]$$
simplification for the rest of lecture: \(r(s,a)\) is deterministic
Example

[Diagram: two-state MDP. From state \(0\): stay remains at \(0\) w.p. \(1\), switch moves to \(1\) w.p. \(1\). From state \(1\): stay remains at \(1\) w.p. \(p_1\) and moves to \(0\) w.p. \(1-p_1\); switch moves to \(0\) w.p. \(p_2\) and remains at \(1\) w.p. \(1-p_2\).]
- Recall simple MDP example
- Suppose the reward is:
- \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
- Consider the policy
- \(\pi(s)=\)stay for all \(s\)
- Simulate reward sequences
Rollout vs. Expectation
The cumulative reward of a given trajectory $$\sum_{t=0}^\infty \gamma^t r(s_t, a_t)$$
The expected cumulative reward averages over all possible trajectories
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s,P,\pi\right]$$
- If \(s_0=0\) then \(s_t=0\) for all \(t\)
- PollEV: \(V^\pi(0) = \sum_{t=0}^\infty \gamma^t r(0,\pi(0))\)
- \(=\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\)
- If \(s_0=1\) then at some time \(T\) state transitions and then \(s_t=0\) for all \(t\geq T\)
- \(V^\pi(1) = \mathbb E_{T}\left[ \sum_{t=0}^{T-1} \gamma^t r(1,\pi(1)) + \sum_{t=T}^\infty \gamma^t r(0,\pi(0))\right]\), where \(T\geq 1\) is the (random) time of the switch to state \(0\)
- \(\displaystyle =\sum_{k=1}^\infty \mathbb P\{T=k\} \sum_{t=k}^\infty \gamma^t \)
- \(\displaystyle=\sum_{k=1}^\infty p_1^{k-1}(1-p_1) \frac{\gamma^k}{1-\gamma}=\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
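The calculation above can be sanity-checked with a Monte Carlo rollout; the sketch below (my own, with illustrative values \(p_1=0.8\), \(\gamma=0.9\) and a truncated horizon) should land close to the closed-form value of \(V^\pi(1)\).

```python
# Monte Carlo estimate of V^pi(1) under pi = stay for the two-state example.
import random

p1, gamma = 0.8, 0.9
H, N = 200, 20_000        # truncation horizon (gamma^H is negligible) and number of rollouts

def discounted_return(s0):
    """Simulate one trajectory under pi = stay and return its discounted reward."""
    s, total = s0, 0.0
    for t in range(H):
        total += gamma**t * (1.0 if s == 0 else 0.0)   # r(0,a) = 1, r(1,a) = 0
        if s == 1 and random.random() < 1 - p1:        # leave state 1 w.p. 1 - p_1
            s = 0
    return total

estimate = sum(discounted_return(1) for _ in range(N)) / N
closed_form = gamma * (1 - p1) / ((1 - gamma * p1) * (1 - gamma))
print(estimate, closed_form)   # the two should be close
```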
Example

[Diagram: Markov chain induced by \(\pi=\)stay: state \(0\) self-loops w.p. \(1\); state \(1\) self-loops w.p. \(p_1\) and transitions to \(0\) w.p. \(1-p_1\).]
Example

[Diagram: two-state MDP. From state \(0\): stay remains at \(0\) w.p. \(1\), switch moves to \(1\) w.p. \(1\). From state \(1\): stay remains at \(1\) w.p. \(p_1\) and moves to \(0\) w.p. \(1-p_1\); switch moves to \(0\) w.p. \(p_2\) and remains at \(1\) w.p. \(1-p_2\).]
- Recall the reward is:
- \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
- Consider the policy
- \(\pi(1)=\)stay
- \(\pi(0)=\begin{cases}\textsf{stay} & \text{w.p. } p_1\\ \textsf{switch}& \text{w.p. } 1-p_1\end{cases}\)
Example

[Diagram: Markov chain induced by this policy: each state self-loops w.p. \(p_1\) and transitions to the other state w.p. \(1-p_1\).]
- Recall the reward is:
- \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
- What happens?
- Is \(V^\pi(0)\) or \(V^\pi(1)\) larger?
Food for thought: what distribution determines the value function?
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi\right]$$
- Discounted "steady-state" distribution $$ d_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_t $$
Steady-state?
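For concreteness, a rough illustration (my own, with an arbitrary two-state chain) of computing \(d_\gamma\) by truncating the infinite sum:

```python
# Discounted "steady-state" distribution d_gamma = (1 - gamma) * sum_t gamma^t d_t,
# approximated by truncating the sum.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
d_t = np.array([1.0, 0.0])        # d_0 = mu_0

d_gamma = np.zeros(2)
for t in range(1000):             # gamma^1000 is negligible
    d_gamma += (1 - gamma) * gamma**t * d_t
    d_t = P_pi.T @ d_t

print(d_gamma, d_gamma.sum())     # entries sum to (approximately) 1
```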
Agenda
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Exercise: review proof (below)
Bellman Expectation Equation:
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
The cumulative reward expression is almost recursive:
$$\sum_{t=0}^\infty \gamma^t r(s_t, a_t) = r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) $$
Bellman Expectation Equation
Proof
- \(V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]\)
- \(= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]\) (linearity of expectation)
- \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]\) (simplifying conditional expectation, re-indexing the sum)
- \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]\) (tower property of conditional expectation)
- \(= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]\) (definition of value function and linearity of expectation)
Example

[Diagram: Markov chain induced by this policy: each state self-loops w.p. \(p_1\) and transitions to the other state w.p. \(1-p_1\).]
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
- \(V^\pi(s)=\mathbb E_a[r(s,a)+\gamma \mathbb E_{s'}[V^\pi(s')]]\)
- \(V^\pi(0)=p_1(1+\gamma V^\pi(0)) + (1-p_1)(1+\gamma V^\pi(1))\)
- \(V^\pi(0)=\frac{1-\gamma p_1}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2}\)
- \(V^\pi(1)=0+\gamma(p_1V^\pi(1) + (1-p_1)V^\pi(0))\)
- \(V^\pi(1)=\frac{\gamma( 1-p_1)}{1-\gamma p_1}V^\pi(0)\)
- \(V^\pi(1)=\frac{\gamma( 1-p_1)}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2}\)
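A symbolic check of this calculation (my own sketch using sympy, not part of the lecture): solve the two Bellman equations and confirm they match the closed forms above.

```python
# Verify the closed-form values for the two-state example symbolically.
import sympy as sp

gamma, p1 = sp.symbols("gamma p_1", positive=True)
V0, V1 = sp.symbols("V0 V1")

eqs = [
    sp.Eq(V0, p1 * (1 + gamma * V0) + (1 - p1) * (1 + gamma * V1)),  # Bellman at s = 0
    sp.Eq(V1, gamma * (p1 * V1 + (1 - p1) * V0)),                    # Bellman at s = 1
]
sol = sp.solve(eqs, [V0, V1], dict=True)[0]

denom = (1 - gamma * p1) ** 2 - gamma**2 * (1 - p1) ** 2
print(sp.simplify(sol[V0] - (1 - gamma * p1) / denom))   # expect 0
print(sp.simplify(sol[V1] - gamma * (1 - p1) / denom))   # expect 0
```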
Agenda
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Policy Evaluation
- Consider deterministic policy. The Bellman equation:
- \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim P( s, \pi(s))} [V^\pi(s')] \)
- \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s') \)
- As with the state distribution, consider \(V^\pi\) as an \(S=|\mathcal S|\) dimensional vector, and likewise \(R^\pi\) with entries \(R^\pi(s) = r(s,\pi(s))\)
- \(V^{\pi}(s) = R^\pi(s) + \gamma \langle P_\pi(s) , V^\pi \rangle \)
[Diagram: the matrix \(P_\pi\), whose row \(s\) is the distribution \(P(\cdot\mid s,\pi(s))\)]
Policy Evaluation
- \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s') \)
- The matrix vector form of the Bellman Equation is
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
[Diagram: row \(s\) of the matrix-vector equation, \(V^\pi(s) = r(s,\pi(s)) + \gamma \sum_{s'} P(s'\mid s,\pi(s)) V^\pi(s')\)]
Matrix inversion is slow! \(\mathcal O(S^3)\)
To exactly compute the value function, we just need to solve the \(S\times S\) system of linear equations:
- \(V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}\)
- (PSet 1)
Exact Policy Evaluation
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
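A minimal code sketch of exact policy evaluation (function and variable names are my own, not from the course): solve the linear system rather than forming the inverse explicitly.

```python
# Exact policy evaluation: solve (I - gamma P_pi) V = R_pi.
import numpy as np

def exact_policy_evaluation(R_pi, P_pi, gamma):
    """Return V^pi satisfying V = R_pi + gamma P_pi V."""
    S = R_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```

Using `np.linalg.solve` avoids explicitly forming the inverse, but the cost is still cubic in \(S\).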
Approximate Policy Evaluation:
- Initialize \(V_0\)
- For \(t=0,1,\dots, T\):
- \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)
Complexity of each iteration?
- \(\mathcal O(S^2)\)
Approximate Policy Evaluation
To trade off accuracy for computation time, we can use a fixed point iteration algorithm
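A matching sketch of the fixed-point iteration (again, names are my own):

```python
# Approximate policy evaluation: iterate V_{t+1} = R_pi + gamma P_pi V_t.
import numpy as np

def approx_policy_evaluation(R_pi, P_pi, gamma, V0, T):
    """Run T iterations of the Bellman update starting from V0 and return V_T."""
    V = V0.copy()
    for _ in range(T):
        V = R_pi + gamma * P_pi @ V   # each iteration costs O(S^2)
    return V
```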
Example

[Diagram: Markov chain induced by this policy: each state self-loops w.p. \(p_1\) and transitions to the other state w.p. \(1-p_1\).]
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
- \(R^\pi = \begin{bmatrix} 1\\ 0\end{bmatrix} \)
- \(P_\pi = \begin{bmatrix} p_1 & 1-p_1 \\ 1-p_1 &p_1 \end{bmatrix}\)
- Exercises:
- Approx PE with \(V_0 = [1,1]\)
- Confirm \(V^\pi=\begin{bmatrix}\frac{1-\gamma p_1}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2} \\ \frac{\gamma( 1-p_1)}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2} \end{bmatrix}\) is a fixed point
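One way to check both exercises numerically (my own sketch, with illustrative values \(p_1=0.8\), \(\gamma=0.9\)):

```python
# Numerical check of the two exercises for the two-state example.
import numpy as np

p1, gamma = 0.8, 0.9
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[p1, 1 - p1],
                 [1 - p1, p1]])

# candidate fixed point from the Bellman-equation example
denom = (1 - gamma * p1) ** 2 - gamma**2 * (1 - p1) ** 2
V_star = np.array([(1 - gamma * p1) / denom, gamma * (1 - p1) / denom])

# (1) V_star is a fixed point of the Bellman update
print(np.allclose(V_star, R_pi + gamma * P_pi @ V_star))   # expect True

# (2) Approx PE from V_0 = [1, 1] converges toward V_star
V = np.array([1.0, 1.0])
for t in range(200):
    V = R_pi + gamma * P_pi @ V
print(np.max(np.abs(V - V_star)))   # small after enough iterations
```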
To show that Approx PE works, we first prove a contraction lemma
Convergence of Approx PE
Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$
Proof
- \(\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P_\pi V_t-V^\pi\|_\infty\) by algorithm definition
- \(= \|R^\pi + \gamma P_\pi V_t-(R^\pi + \gamma P_\pi V^\pi)\|_\infty\) by Bellman eq
- \(= \| \gamma P_\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P_\pi(s), V_t-V^\pi\rangle| \) norm definition
- \(=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|\) expectation definition
- \(\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]\) basic inequality (PSet 1)
- \(\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty\) basic inequality (PSet 1)
Proof
- First statement follows by induction using the Lemma
- For the second statement,
- \(\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon\)
- Taking \(\log\) of both sides,
- \(T\log \gamma + \log \|V_0-V^\pi\|_\infty \leq \log \epsilon \), then rearrange
Convergence of Approx PE
Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$
so an \(\epsilon\) correct solution requires
\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)
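A quick numerical check of this bound (my own sketch, with \(p_1=0.8\), \(\gamma=0.9\), \(\epsilon=10^{-6}\)): run exactly the prescribed number of iterations and confirm the error falls below \(\epsilon\).

```python
# Check the iteration bound T >= log(||V_0 - V^pi||_inf / eps) / log(1/gamma).
import numpy as np

p1, gamma, eps = 0.8, 0.9, 1e-6
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[p1, 1 - p1],
                 [1 - p1, p1]])
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)   # exact value for reference

V = np.zeros(2)                                          # V_0 = 0
T = int(np.ceil(np.log(np.max(np.abs(V - V_pi)) / eps) / np.log(1 / gamma)))
for t in range(T):
    V = R_pi + gamma * P_pi @ V
print(T, np.max(np.abs(V - V_pi)) <= eps)                # expect True
```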
Recap
- PSet 1 released today
- Office Hours on Ed
- Value Function
- Bellman Equation
- Policy Evaluation
- Next lecture: Optimality