CS 4/5789: Introduction to Reinforcement Learning

Lecture 2: Markov Decision Processes

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Announcements

Questions about waitlist/enrollment?
- https://www.cs.cornell.edu/courseinfo/enrollment
Homework released next week
- Problem Set 1 due 1 week later
- Programming Assignment 1 due 2 weeks later

Agenda

1. Recap: Markov Decision Process

2. Trajectories and Distributions

3. State Distributions

Markov Decision Process (MDP)

Interaction loop at time $t$: in state $s_t$, the agent plays action $a_t\sim \pi_t(s_t)$, receives reward $r_t\sim r(s_t, a_t)$, and the environment moves to next state $s_{t+1}\sim P(s_t, a_t)$.

Agent observes state of environment
Agent takes action
- depending on state according to policy
Environment state updates according to transition function

Markovian Assumption: Conditioned on $s_t,a_t$, the reward $r_t$ and next state $s_{t+1}$ are independent of the past.

Notation: Stochastic Maps

When state transitions are stochastic, we will write either:

$s'\sim P(s,a)$ as a random variable
$P(s'|s,a)$ as a number between $0$ and $1$

Example:

$\mathcal S=\{0,1\}$, $\mathcal A=\{$stay,move$\}$
From state $0$, action always works
From state $1$, staying works with probability $p_1$ and moving with probability $p_2$

$P(0,$stay$)=\mathsf{Bernoulli}(0)$
$P(1,$stay$)=\mathsf{Bernoulli}(p_1)$

$P(0\mid 0,$stay$)=1$
$P(1\mid 0,$stay$)=0$
$P(0\mid 1,$stay$)=1-p_1$
$P(1\mid 1,$stay$)=p_1$

$P(0\mid 0,$move$)=0$
$P(1\mid 0,$move$)=1$
$P(0\mid 1,$move$)=p_2$
$P(1\mid 1,$move$)=1-p_2$

$P(0,$move$)=\mathsf{Bernoulli}(1)$
$P(1,$move$)=\mathsf{Bernoulli}(1-p_2)$
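The two notations can be made concrete in code. The following is a minimal sketch, not part of the lecture: the numeric values of `p1` and `p2`, the dictionary layout, and the helper names `transition_prob` and `sample_next_state` are all assumptions chosen for illustration.

```python
import random

# Hypothetical numeric values for the slide's parameters p_1 and p_2.
p1, p2 = 0.8, 0.6

# P[(s, a)][s_next] stores the number P(s_next | s, a).
P = {
    (0, "stay"): {0: 1.0, 1: 0.0},
    (0, "move"): {0: 0.0, 1: 1.0},
    (1, "stay"): {0: 1 - p1, 1: p1},
    (1, "move"): {0: p2, 1: 1 - p2},
}


def transition_prob(s_next, s, a):
    """First view: P(s' | s, a) as a number between 0 and 1."""
    return P[(s, a)][s_next]


def sample_next_state(s, a, rng):
    """Second view: draw s' ~ P(s, a) as a random variable."""
    next_states = list(P[(s, a)])
    weights = [P[(s, a)][s2] for s2 in next_states]
    return rng.choices(next_states, weights=weights)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next_state(1, "stay", rng) for _ in range(10_000)]
empirical = sum(draws) / len(draws)  # fraction of draws that stayed in state 1
print(transition_prob(1, 1, "stay"), empirical)
```

With enough draws, the empirical frequency of landing in state $1$ should be close to the stored number $P(1\mid 1,$stay$)=p_1$.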

Finite Horizon MDP

$\mathcal{S}, \mathcal{A}$ state and action space
$r$ reward function: (stochastic) map from (state, action) to scalar reward
$P$ transition function: stochastic map from current state and action to next state
$H$ is horizon (positive integer)

Goal: achieve high expected cumulative reward:

$$\mathbb E\left[\sum_{t=0}^{H-1} r_t\right]$$

$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}$

Policies

A policy $\pi$ determines how actions are taken
Policies can be:
- deterministic or stochastic
- state-dependent or history-dependent
- stationary or time-dependent
Every finite-horizon MDP has an optimal policy that is deterministic, state-dependent, and time-dependent
We will find policies using optimization & learning
Poll: How many stationary, deterministic policies are there if $|\mathcal S|=S$ and $|\mathcal A|=A$?
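One way to check an answer to the counting question is to enumerate the policies for a small example. A stationary, deterministic policy picks one action per state, so there are $A^S$ of them. The sketch below assumes the two-state, two-action example from earlier in the lecture; the variable names are illustrative.

```python
from itertools import product

states = [0, 1]              # S = 2
actions = ["stay", "move"]   # A = 2

# One action per state: every tuple in actions^S is one policy.
policies = [dict(zip(states, choice))
            for choice in product(actions, repeat=len(states))]

for pi in policies:
    print(pi)
print(len(policies), len(actions) ** len(states))
```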

Trajectory

When we deploy or roll out a policy, we will observe a sequence of states and actions
- define the initial state distribution $\mu_0\in\Delta(\mathcal S)$
This sequence is called a trajectory: $$ \tau = (s_0, a_0, s_1, a_1, ... ) $$

observe $s_0\sim \mu_0$
play $a_0\sim \pi_0(s_0)$
observe $s_1\sim P(s_0, a_0)$
play $a_1\sim \pi_1(s_1)$
...
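The observe/play loop above can be sketched as a short simulation. This is an illustrative sketch under stated assumptions: the values of `p1` and `p2`, the uniform policy in `policy`, and the helper names `draw` and `rollout` are hypothetical and not from the lecture.

```python
import random

p1, p2 = 0.8, 0.6  # hypothetical values for p_1 and p_2

P = {
    (0, "stay"): {0: 1.0, 1: 0.0},
    (0, "move"): {0: 0.0, 1: 1.0},
    (1, "stay"): {0: 1 - p1, 1: p1},
    (1, "move"): {0: p2, 1: 1 - p2},
}
mu0 = {0: 0.5, 1: 0.5}  # initial state distribution


def draw(dist, rng):
    """Sample one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def policy(t, s):
    """Hypothetical stochastic policy pi_t(. | s): uniform over actions."""
    return {"stay": 0.5, "move": 0.5}


def rollout(H, rng):
    """Return the trajectory [(s_0, a_0), ..., (s_{H-1}, a_{H-1})]."""
    s = draw(mu0, rng)                # observe s_0 ~ mu_0
    tau = []
    for t in range(H):
        a = draw(policy(t, s), rng)   # play a_t ~ pi_t(s_t)
        tau.append((s, a))
        s = draw(P[(s, a)], rng)      # observe s_{t+1} ~ P(s_t, a_t)
    return tau


print(rollout(5, random.Random(0)))
```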

Trajectory Example

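[Figure: two-state transition diagram. From state $0$: stay remains in $0$ with probability $1$; move goes to $1$ with probability $1$. From state $1$: stay remains in $1$ with probability $p_1$ and goes to $0$ with probability $1-p_1$; move goes to $0$ with probability $p_2$ and remains in $1$ with probability $1-p_2$.]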

Probability of a Trajectory

Probability of trajectory $\tau =(s_0, a_0, \dots, s_{t}, a_{t})$ under policy $\pi=(\pi_0,\pi_1,\dots,\pi_{t})$ with initial distribution $\mu_0$ is autoregressive:
- $\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi_0(a_0 \mid s_0) {P}(s_1 \mid s_{0}, a_{0})\pi_1(a_1 \mid s_1) {P}(s_2 \mid s_{1}, a_{1})\cdots$
- $\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi_0(a_0 \mid s_0)\displaystyle\prod_{i=1}^{t} {P}(s_i \mid s_{i-1}, a_{i-1}) \pi_i(a_i \mid s_i)$

Why? First recall the product rule (from which Bayes' rule follows): for events $A$ and $B$,

$$ \mathbb P\{A \cap B\} = \mathbb P\{A\}\mathbb P\{B\mid A\} = \mathbb P\{B\} \mathbb P\{A\mid B\}$$

Then we have that

$$\mathbb P_{\mu_0}^\pi(s_0, a_0) = \mathbb P\{s_0\} \mathbb P\{a_0\mid s_0\} = \mu_0(s_0)\pi_0(a_0\mid s_0)$$

then

$$\mathbb P_{\mu_0}^\pi(s_0, a_0, s_1) = \mathbb P_{\mu_0}^\pi(s_0, a_0)\underbrace{\mathbb P\{s_1\mid a_0, s_0\}}_{P(s_1\mid s_0, a_0)}$$

and so on
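The autoregressive product can be evaluated one factor at a time, mirroring the derivation. The sketch below is illustrative only: the exact fractions chosen for `p1` and `p2`, the uniform policy in `policy_prob`, and the function name `trajectory_prob` are assumptions.

```python
from fractions import Fraction

p1, p2 = Fraction(4, 5), Fraction(3, 5)  # hypothetical values for p_1, p_2

P = {
    (0, "stay"): {0: 1, 1: 0},
    (0, "move"): {0: 0, 1: 1},
    (1, "stay"): {0: 1 - p1, 1: p1},
    (1, "move"): {0: p2, 1: 1 - p2},
}
mu0 = {0: Fraction(1, 2), 1: Fraction(1, 2)}


def policy_prob(t, a, s):
    """Hypothetical pi_t(a | s): uniform over the two actions."""
    return Fraction(1, 2)


def trajectory_prob(tau):
    """Probability of tau = [(s_0, a_0), ..., (s_t, a_t)]."""
    s0, a0 = tau[0]
    prob = mu0[s0] * policy_prob(0, a0, s0)
    for i in range(1, len(tau)):
        s_prev, a_prev = tau[i - 1]
        s_i, a_i = tau[i]
        # One factor P(s_i | s_{i-1}, a_{i-1}) * pi_i(a_i | s_i) per step.
        prob *= P[(s_prev, a_prev)][s_i] * policy_prob(i, a_i, s_i)
    return prob


print(trajectory_prob([(1, "stay"), (1, "move"), (0, "stay")]))
```

Using exact fractions rather than floats makes it easy to confirm that the probabilities of all trajectories of a given length sum to exactly $1$.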

Probability of a Trajectory

For a deterministic policy (rest of lecture), actions are determined exactly by states: $a_t = \pi_t(s_t)$
Probability of state trajectory $\tau =(s_0, s_1, \dots, s_{H-1})$:
- $\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\displaystyle\prod_{i=1}^{H-1} {P}(s_i \mid s_{i-1}, \pi_{i-1}(s_{i-1})) $

Formal Problem Statement

Goal: maximize expected cumulative reward
Formally, for $\tau=(s_0,a_0,\dots, s_{H-1},a_{H-1})$, this goal is encoded in the following optimization problem: $$\max_\pi \mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1} r(s_k, a_k) \right]$$
Recall expectation: for a finite set $\mathcal X$ and distribution $p$, $$\mathbb E_{x\sim p}[f(x)] = \sum_{x\in\mathcal X}p(x)f(x)$$
Difficult to directly expand and compute this expectation!
Note: the above assumes that the reward function is deterministic
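To see why direct expansion is difficult, the expectation can be computed by brute force for a tiny instance: the sum runs over all $S^H$ state trajectories. The sketch below evaluates the objective for one fixed deterministic policy (it does not perform the maximization). The values of `p1` and `p2`, the always-stay policy, the reward that pays $1$ in state $0$, and the name `expected_return` are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

p1, p2 = Fraction(4, 5), Fraction(3, 5)  # hypothetical values for p_1, p_2

P = {
    (0, "stay"): {0: 1, 1: 0},
    (0, "move"): {0: 0, 1: 1},
    (1, "stay"): {0: 1 - p1, 1: p1},
    (1, "move"): {0: p2, 1: 1 - p2},
}
mu0 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
states = [0, 1]


def pi(t, s):
    """Deterministic policy assumed for this sketch: always stay."""
    return "stay"


def r(s, a):
    """Deterministic reward assumed for this sketch: 1 in state 0."""
    return 1 if s == 0 else 0


def expected_return(H):
    """Sum p(tau) * (cumulative reward of tau) over all S^H state trajectories."""
    total = Fraction(0)
    for traj in product(states, repeat=H):
        prob = mu0[traj[0]]
        for i in range(1, H):
            s_prev = traj[i - 1]
            prob *= P[(s_prev, pi(i - 1, s_prev))][traj[i]]
        ret = sum(r(s, pi(k, s)) for k, s in enumerate(traj))
        total += prob * ret
    return total


print(expected_return(3))
```

The loop visits $2^H$ trajectories here, which is why this approach does not scale with the horizon.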

Probability of a State

Example: $\pi_t(s)=$stay and $\mu_0$ is each state with probability $1/2$.

What are the following probabilities?
$\mathbb{P}\{s_0=1\} = 1/2$
$\mathbb{P}\{s_1=1\} = \mathbb{P}^\pi_{\mu_0} (0, 1) + \mathbb{P}^\pi_{\mu_0} (1, 1)=p_1/2$
$\mathbb{P}\{s_2=1\} = \mathbb{P}^\pi_{\mu_0} (1, 1,1)=p_1^2/2$ (every other state trajectory ending in state $1$ has probability $0$)

Probability of state $s$ at time $t$ $$ \mathbb{P}\{s_t=s\} = \displaystyle\sum_{\substack{s_{0:t-1}\in\mathcal S^t}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, s) $$

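[Figure: two-state chain under the stay policy. State $0$ remains in $0$ with probability $1$; state $1$ remains in $1$ with probability $p_1$ and moves to $0$ with probability $1-p_1$.]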

Why? First recall that

$$ \mathbb P\{A \cup B\} = \mathbb P\{A\}+\mathbb P\{B\} - \mathbb P\{A\cap B\}$$

If $A$ and $B$ are disjoint, the final term is $0$ by definition.

If all $A_i$ are disjoint events, then the probability any one of them happens is

$$ \mathbb P\{\cup_{i} A_i\} = \sum_{i} \mathbb P\{A_i\}$$
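Because distinct trajectories are disjoint events, the probability of a state is a plain sum over all trajectory prefixes. The sketch below reproduces the example values for the stay policy; the concrete value of `p1` and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

p1 = Fraction(4, 5)  # hypothetical value for p_1

# Transitions under the stay policy only: P_stay[s][s_next].
P_stay = {0: {0: 1, 1: 0}, 1: {0: 1 - p1, 1: p1}}
mu0 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
states = [0, 1]


def state_trajectory_prob(traj):
    """Probability of the state trajectory (s_0, ..., s_t) under stay."""
    prob = mu0[traj[0]]
    for s_prev, s_next in zip(traj, traj[1:]):
        prob *= P_stay[s_prev][s_next]
    return prob


def state_prob(t, s):
    """P{s_t = s}: sum over all prefixes s_{0:t-1} in S^t."""
    total = Fraction(0)
    for prefix in product(states, repeat=t):
        total += state_trajectory_prob(prefix + (s,))
    return total


print(state_prob(0, 1), state_prob(1, 1), state_prob(2, 1))
```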

State Distribution

Keep track of probability for all states $s\in\mathcal S$ with $d_t$
$d_t$ is a distribution over $\mathcal S$
Can be represented as an $S=|\mathcal S|$ dimensional vector $$ d_t[s] = \mathbb{P}\{s_t=s\} $$

$$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \quad d_1 = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\quad d_2 = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$

Example: $\pi_t(s)=$stay and $\mu_0$ is each state with probability $1/2$.

Expected Reward

$$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \quad d_1 = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\quad d_2 = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$

Example: $\pi_t(s)=$stay and $\mu_0$ is each state with probability $1/2$.

Suppose reward is $1$ when $s=0$ and $0$ otherwise
What is $\mathbb E[r_0]$? $\mathbb E[r_1]$? $\mathbb E[r_2]$?

State distribution lets us compute $\mathbb E[r_t] = \sum_{s\in\mathcal S}d_t[s]r(s,\pi_t(s))$
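The expected-reward formula is a weighted sum over states, which can be checked directly on the example. In this sketch the value of `p1` is a hypothetical choice, and the state distributions are copied from the vectors $d_0, d_1, d_2$ shown on the slide.

```python
from fractions import Fraction

p1 = Fraction(4, 5)  # hypothetical value for p_1

# State distributions d_0, d_1, d_2 from the slide, indexed as d[t][s].
d = [
    [Fraction(1, 2), Fraction(1, 2)],
    [1 - p1 / 2, p1 / 2],
    [1 - p1 ** 2 / 2, p1 ** 2 / 2],
]


def pi(t, s):
    """The example policy: always stay."""
    return "stay"


def r(s, a):
    """Reward is 1 when s = 0 and 0 otherwise."""
    return 1 if s == 0 else 0


def expected_reward(t):
    """E[r_t] = sum over s of d_t[s] * r(s, pi_t(s))."""
    return sum(d[t][s] * r(s, pi(t, s)) for s in (0, 1))


for t in range(3):
    print(t, expected_reward(t))
```

Since the reward is the indicator of state $0$, each $\mathbb E[r_t]$ equals the first entry $d_t[0]$.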

Recursion in State Distribution

How does state distribution change from $t$ to $t+1$?
- Recall, $s_{t+1}\sim P(s_t,\pi_t(s_t))$
- i.e. $s_{t+1} = s'$ with probability $P(s'|s_t, \pi_t(s_t))$
Write as a summation over possible $s_t$:
- $\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\} =\sum_{s\in\mathcal S} P(s'\mid s, \pi_t(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}$
In vector notation:
- $d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi_t(s))d_t[s]$
- $d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi_t(1)) & \dots & P(s'\mid S, \pi_t(S))\end{bmatrix},d_t\rangle $
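The recursion translates almost literally into code: each entry of the next distribution is a sum over the current states. This is a sketch under assumptions, with hypothetical values for `p1` and `p2`, the always-stay policy, and an illustrative function name `step`.

```python
from fractions import Fraction

p1, p2 = Fraction(4, 5), Fraction(3, 5)  # hypothetical values for p_1, p_2

P = {
    (0, "stay"): {0: 1, 1: 0},
    (0, "move"): {0: 0, 1: 1},
    (1, "stay"): {0: 1 - p1, 1: p1},
    (1, "move"): {0: p2, 1: 1 - p2},
}
states = [0, 1]


def pi(t, s):
    """Deterministic policy assumed for this sketch: always stay."""
    return "stay"


def step(d, t):
    """d_{t+1}[s'] = sum over s of P(s' | s, pi_t(s)) * d_t[s]."""
    return [
        sum(P[(s, pi(t, s))][s_next] * d[s] for s in states)
        for s_next in states
    ]


d0 = [Fraction(1, 2), Fraction(1, 2)]
d1 = step(d0, 0)
d2 = step(d1, 1)
print(d1, d2)
```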

Transition Matrix

Given a policy $\pi_t(\cdot)$ and a transition function $P(\cdot\mid \cdot,\cdot)$

Define the transition matrix $P_{\pi_t} \in \mathbb R^{S\times S}$
At row $s$ and column $s'$, entry is $P(s'\mid s,\pi_t(s))$

$$[P_{\pi_t}]_{s,s'} = P(s'\mid s,\pi_t(s))$$

State Evolution

Proposition: The state distribution evolves according to $$ d_t = P_{\pi_{t-1}}^\top\dots P_{\pi_1}^\top P_{\pi_0}^\top d_0\quad t\geq 1$$

Proof: (by induction)

Base case: when $t=0$, $d_0=d_0$
Induction step:
- Need to prove $d_{k+1} = P_{\pi_k}^\top d_k$
- Exercise: review proof below

$$[P_{\pi_t}^\top]_{s',s} = P(s'\mid s,\pi_t(s))$$

Proof of claim that $d_{k+1} = P_{\pi_k}^\top d_k$

$d_{k+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi_k(s))d_k[s]$
$\textcolor{orange}{d_{k+1}[s']} =\langle\begin{bmatrix} P(s'\mid 1, \pi_k(1)) & \dots & P(s'\mid S, \pi_k(S))\end{bmatrix},\textcolor{red}{d_k}\rangle $
Each entry of $d_{k+1}$ is the inner product of a column of $P_{\pi_k}$ with $d_k$
- in other words, the inner product of a row of $P_{\pi_k}^\top$ with $d_k$
By the definition of matrix multiplication, $d_{k+1} = P_{\pi_k}^\top d_k$

State Evolution Example

Example: $\pi(s)=$stay and $\mu_0$ is each state with probability $1/2$.

$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$

$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} $
$d_1 = P_\pi^\top d_0 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}$
$d_2 = P_\pi^\top d_1 =\begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix}\begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix} = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$
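The matrix form also covers time-dependent policies, where a different transition matrix is applied at each step. The sketch below uses a hypothetical two-step policy (stay at $t=0$, move at $t=1$) and hypothetical values for `p1` and `p2`; the helpers `transition_matrix`, `transpose`, and `matvec` are illustrative names, written with plain lists to stay dependency-free.

```python
from fractions import Fraction

p1, p2 = Fraction(4, 5), Fraction(3, 5)  # hypothetical values for p_1, p_2

P = {
    (0, "stay"): {0: 1, 1: 0},
    (0, "move"): {0: 0, 1: 1},
    (1, "stay"): {0: 1 - p1, 1: p1},
    (1, "move"): {0: p2, 1: 1 - p2},
}
states = [0, 1]


def transition_matrix(policy_t):
    """Row s, column s': P(s' | s, policy_t[s])."""
    return [[P[(s, policy_t[s])][s2] for s2 in states] for s in states]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical time-dependent policy: pi_0 = stay, pi_1 = move.
policy = [{0: "stay", 1: "stay"}, {0: "move", 1: "move"}]

d = [Fraction(1, 2), Fraction(1, 2)]  # d_0
history = [d]
for pi_t in policy:
    d = matvec(transpose(transition_matrix(pi_t)), d)  # d_{t+1} = P_{pi_t}^T d_t
    history.append(d)

print(history)
```

Each row of a transition matrix sums to $1$, so every $d_t$ produced this way remains a probability distribution.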

Recap

1. Markov Decision Process

2. Trajectories and Distributions

3. State Distributions

Announcements

Waitlist/enrollment: https://www.cs.cornell.edu/courseinfo/enrollment
Assignments released next week

Sp24 CS 4/5789: Lecture 2

By Sarah Dean
