Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Value Function
3. Optimal Policy
4. Dynamic Programming
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\) defined by states, actions, reward, transition, horizon
[Figure: agent-environment interaction loop, with action \(a_t\in\mathcal A\), state \(s_t\in\mathcal S\), reward \(r_t= r(s_t, a_t)\), and next state \(s_{t+1}\sim P(s_t, a_t)\)]
in today's lecture, \(r(s,a)\) is deterministic
Goal: maximize expected cumulative reward
$$\max_\pi ~\mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1} r(s_k, a_k) \right]$$
probability of trajectory \(\tau=(s_0,a_0,\dots,s_{H-1},a_{H-1})\) under transitions \(P\), policy \(\pi\), and initial distribution \(\mu_0\), where \(s_0\sim\mu_0\), \(a_t=\pi_t(s_t)\), \(r_t= r(s_t, a_t)\), and \(s_{t}\sim P(s_{t-1}, a_{t-1})\)
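Concretely, for a deterministic policy this trajectory distribution factors as
$$\mathbb{P}_{\mu_0}^\pi(\tau) = \mu_0(s_0)\prod_{t=1}^{H-1} P(s_t \mid s_{t-1}, a_{t-1}), \qquad a_t = \pi_t(s_t)$$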
1. Recap
2. Value Function
3. Optimal Policy
4. Dynamic Programming
The value of a state \(s\) under a policy \(\pi\) at time \(t\) is the expected cumulative reward-to-go
$$V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k, a_k) \mid s_t=s,s_{k+1}\sim P(s_k, a_k),a_k\sim \pi_k(s_k)\right]$$
[Diagram: trajectory from time \(t\): \(s_t, a_t, s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \dots, s_{H-1}, a_{H-1}\)]
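To make the expectation concrete, \(V_t^\pi(s)\) can be approximated by averaging the reward-to-go over simulated trajectories. A minimal sketch, assuming a tabular representation (`P[s, a, s']` transitions, `r[s, a]` rewards, deterministic policy `pi[t, s]`); these names are illustrative, not the lecture's notation.

```python
import numpy as np

def monte_carlo_value(P, r, pi, H, t, s, n_rollouts=10_000, seed=0):
    """Estimate V_t^pi(s) as the average reward-to-go over sampled trajectories."""
    rng = np.random.default_rng(seed)
    S = r.shape[0]
    total = 0.0
    for _ in range(n_rollouts):
        state, reward_to_go = s, 0.0
        for k in range(t, H):                      # steps t, t+1, ..., H-1
            a = pi[k, state]                       # a_k = pi_k(s_k)
            reward_to_go += r[state, a]            # accumulate r(s_k, a_k)
            state = rng.choice(S, p=P[state, a])   # s_{k+1} ~ P(s_k, a_k)
        total += reward_to_go
    return total / n_rollouts
```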
[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/move; transition arrows labeled stay: \(1\), move: \(1\); stay: \(p_1\), move: \(1-p_2\); stay: \(1-p_1\), move: \(p_2\). Below, the Markov chains induced by \(\pi(s)=\)stay (arrows labeled \(1\), \(p_1\), \(1-p_1\)) and by \(\pi(s)=\)move (arrows labeled \(1\), \(1-p_2\), \(p_2\))]
Bellman Consistency Equation: \(\forall s\), $$V_t^{\pi}(s) = \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]$$
Exercise: review the proof below
Enables policy evaluation (i.e. computing \(V_t^\pi\)) by backwards iteration
Initialize \(V_H^\pi(s) =0\) for all \(s\in\mathcal S\)
For \(t=H-1,H-2,...,0\): $$V_t^{\pi}(s)=\mathbb{E}_{a \sim \pi_t(s)} \left[ Q_t^\pi(s,a) \right] ~~\forall ~s\in\mathcal S$$
where \(Q_t^\pi(s,a) = r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')]\) denotes the state-action value function, i.e. the bracketed term in the Bellman Consistency Equation
Total complexity to compute \(V_t^\pi\) for all \(t\): \(O(S^2AH)\), where \(S=|\mathcal S|\) and \(A=|\mathcal A|\)
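As a concrete illustration, here is a minimal sketch of this backwards iteration for a tabular MDP; the array layout (`P[s, a, s']` for transitions, `r[s, a]` for rewards, deterministic time-varying policy `pi[t, s]`) is an assumption made for this sketch.

```python
import numpy as np

def policy_evaluation(P, r, pi, H):
    """Backwards-iteration policy evaluation for a finite-horizon tabular MDP.

    P  : (S, A, S) array, P[s, a, s'] = probability of s' given (s, a)
    r  : (S, A) array, deterministic reward r(s, a)
    pi : (H, S) integer array, pi[t, s] = action taken in state s at time t
    Returns V of shape (H + 1, S), with V[H] = 0 by initialization.
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))            # V_H^pi(s) = 0 for all s
    for t in range(H - 1, -1, -1):      # t = H-1, H-2, ..., 0
        for s in range(S):
            a = pi[t, s]
            # Q_t^pi(s, a) = r(s, a) + E_{s' ~ P(s, a)}[ V_{t+1}^pi(s') ]
            V[t, s] = r[s, a] + P[s, a] @ V[t + 1]
    return V
```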
Proof
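A sketch of the standard argument, splitting off the first reward and applying the tower property of conditional expectation together with the Markov structure of the trajectory:
$$\begin{aligned}
V_t^{\pi}(s) &= \mathbb E\left[ r(s_t,a_t) + \sum_{k=t+1}^{H-1} r(s_k, a_k) \,\Big|\, s_t=s\right]\\
&= \mathbb E_{a\sim \pi_t(s)}\left[ r(s, a) + \mathbb E_{s'\sim P(s,a)}\Big[ \mathbb E\Big[\textstyle\sum_{k=t+1}^{H-1} r(s_k, a_k) \,\Big|\, s_{t+1}=s'\Big]\Big]\right]\\
&= \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]
\end{aligned}$$
where the second equality conditions on \(a_t\) and \(s_{t+1}\), and the last uses the definition of \(V_{t+1}^\pi\).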
[Figure: Markov chain induced by \(\pi(s)=\)stay on the two-state example, with arrows labeled \(1\), \(p_1\), \(1-p_1\)]
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
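As a worked instance, assuming (per the diagram) that stay keeps state \(0\) at \(0\) with probability \(1\) and moves state \(1\) to \(0\) with probability \(p_1\), the Bellman Consistency Equation gives
$$V_t^{\pi}(0) = 1 + V_{t+1}^{\pi}(0), \qquad V_t^{\pi}(1) = p_1\, V_{t+1}^{\pi}(0) + (1-p_1)\, V_{t+1}^{\pi}(1)$$
so with \(V_H^{\pi}\equiv 0\), backwards iteration yields \(V_t^{\pi}(0) = H-t\).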
1. Recap
2. Value Function
3. Optimal Policy
4. Dynamic Programming
Goal: maximize expected cumulative reward
$$\max_\pi ~\mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1} r(s_k, a_k) \right]$$
$$=\max_\pi \mathbb E_{s\sim\mu_0}\left[V_0^\pi(s)\right]$$
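The equality holds by the tower property of expectation, conditioning on the initial state:
$$\mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1} r(s_k, a_k) \right] = \mathbb E_{s\sim\mu_0}\left[\mathbb E\left[\sum_{k=0}^{H-1} r(s_k, a_k) \,\Big|\, s_0=s\right]\right] = \mathbb E_{s\sim\mu_0}\left[V_0^\pi(s)\right]$$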
Theorem (Bellman Optimality): \(\forall s\), the value function of an optimal policy satisfies $$V_t^{\star}(s) = \max_{a\in\mathcal A} \underbrace{\left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\star(s')] \right]}_{Q_t^\star(s,a)}$$
with the shorthand \(Q_t^\star(s,a)\) for the bracketed state-action value
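A minimal sketch of the corresponding dynamic program (backwards induction on the Bellman Optimality equation), using the same assumed tabular array layout as the earlier sketches; the greedy argmax recovers an optimal policy.

```python
import numpy as np

def value_iteration(P, r, H):
    """Finite-horizon dynamic programming: compute V_t^* and a greedy policy.

    P : (S, A, S) array of transition probabilities
    r : (S, A) array of deterministic rewards
    Returns V of shape (H + 1, S) and pi of shape (H, S).
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))            # V_H^*(s) = 0 for all s
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        # Q[s, a] = r(s, a) + E_{s' ~ P(s, a)}[ V_{t+1}^*(s') ]
        Q = r + P @ V[t + 1]            # shape (S, A)
        V[t] = Q.max(axis=1)            # V_t^*(s) = max_a Q_t^*(s, a)
        pi[t] = Q.argmax(axis=1)        # greedy (optimal) action at each state
    return V, pi
```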
1. Recap
2. Value Function
3. Optimal Policy
4. Dynamic Programming
[Figure: two-state example with states \(0\) and \(1\) and actions stay/switch; transition arrows labeled stay: \(1\), switch: \(1\); stay: \(1-p\), switch: \(1-2p\); stay: \(p\), switch: \(2p\)]