CS 4/5789: Introduction to Reinforcement Learning

Lecture 2: Markov Decision Processes

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Announcements

Agenda

 

1. Recap: Markov Decision Process

2. Trajectories and Distributions

3. State Distributions

Markov Decision Process (MDP)

[Diagram: agent-environment loop. The agent observes state \(s_t\), takes action \(a_t\sim \pi_t(s_t)\), receives reward \(r_t\sim r(s_t, a_t)\), and the environment transitions to \(s_{t+1}\sim P(s_t, a_t)\).]

  1. Agent observes state of environment
  2. Agent takes action
    • depending on state according to policy
  3. Environment state updates according to transition function


Markovian Assumption: Conditioned on \(s_t,a_t\), the reward \(r_t\) and next state \(s_{t+1}\) are independent of the past.

When state transitions are stochastic, we will write either:

  • \(s'\sim P(s,a)\) as a random variable
  • \(P(s'|s,a)\) as a number between \(0\) and \(1\)

Notation: Stochastic Maps

[Diagram: two states, \(0\) and \(1\), with stay/move transitions.]

Example:

  • \(\mathcal A=\{\)stay,move\(\}\)
  • From state \(0\), action always works
  • From state \(1\), staying works with probability \(p_1\) and moving with probability \(p_2\)

Notation: Stochastic Maps


  • From state \(0\), action always works
  • From state \(1\), staying works with probability \(p_1\) and moving with probability \(p_2\)
  • \(P(0,\)stay\()=\mathsf{Bernoulli}(0)\) and \(P(1,\)stay\()=\mathsf{Bernoulli}(p_1)\)
  • \(P(0,\)move\()=\mathsf{Bernoulli}(1)\) and \(P(1,\)move\()=\mathsf{Bernoulli}(1-p_2)\)
    • here \(\mathsf{Bernoulli}(q)\) denotes landing in state \(1\) with probability \(q\) and in state \(0\) otherwise
  • As conditional probabilities:
    • \(P(0\mid 0,\)stay\()=1\), \(P(1\mid 0,\)stay\()=0\)
    • \(P(0\mid 1,\)stay\()=1-p_1\), \(P(1\mid 1,\)stay\()=p_1\)
    • \(P(0\mid 0,\)move\()=0\), \(P(1\mid 0,\)move\()=1\)
    • \(P(0\mid 1,\)move\()=p_2\), \(P(1\mid 1,\)move\()=1-p_2\)
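To make the notation concrete, here is a minimal Python sketch (not part of the course materials) that stores this transition table as a dictionary of conditional probabilities and samples \(s'\sim P(s,a)\); the numeric values of \(p_1\) and \(p_2\) are made up for illustration.

```python
import random

# Hypothetical values for the example's parameters.
p1, p2 = 0.9, 0.8

# P[(s, a)] maps next state s' to P(s' | s, a), matching the table above.
P = {
    (0, "stay"): {0: 1.0, 1: 0.0},
    (1, "stay"): {0: 1 - p1, 1: p1},
    (0, "move"): {0: 0.0, 1: 1.0},
    (1, "move"): {0: p2, 1: 1 - p2},
}

def sample_next_state(s, a):
    """Draw s' ~ P(s, a), i.e. s' with probability P(s' | s, a)."""
    dist = P[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(sample_next_state(1, "stay"))  # 1 with probability p1, else 0
```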

Finite Horizon MDP

  • \(\mathcal{S}, \mathcal{A}\) state and action space
  • \(r\) reward function: (stochastic) map from (state, action) to scalar reward
  • \(P\) transition function: stochastic map from current state and action to next state
  • \(H\) is horizon (positive integer)
  • Goal: achieve high expected cumulative reward:

$$\mathbb E\left[\sum_{t=0}^{H-1} r_t\right]$$

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\)
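As an illustration, the tuple \(\mathcal M\) can be packaged as a small container in code. This is a minimal sketch under the assumptions of this lecture (finite \(\mathcal S\) and \(\mathcal A\), deterministic reward); the names are hypothetical, not a standard API, and the reward shown is the one used later in the lecture (\(1\) when \(s=0\), \(0\) otherwise).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class FiniteHorizonMDP:
    """Tabular finite-horizon MDP M = (S, A, r, P, H)."""
    states: List[int]                                     # S
    actions: List[str]                                    # A
    reward: Callable[[int, str], float]                   # r(s, a), deterministic here
    transitions: Dict[Tuple[int, str], Dict[int, float]]  # P[(s, a)][s'] = P(s' | s, a)
    horizon: int                                          # H

# The two-state example, with hypothetical p1, p2.
p1, p2 = 0.9, 0.8
mdp = FiniteHorizonMDP(
    states=[0, 1],
    actions=["stay", "move"],
    reward=lambda s, a: 1.0 if s == 0 else 0.0,
    transitions={
        (0, "stay"): {0: 1.0, 1: 0.0},
        (1, "stay"): {0: 1 - p1, 1: p1},
        (0, "move"): {0: 0.0, 1: 1.0},
        (1, "move"): {0: p2, 1: 1 - p2},
    },
    horizon=3,
)
```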

Policies

  • A policy \(\pi\) determines how actions are taken
  • Policies can be:
    • deterministic or stochastic
    • state-dependent or history-dependent
    • stationary or time-dependent
  • Every finite-horizon MDP has an optimal policy that is deterministic, state-dependent, and time-dependent
  • We will find policies using optimization & learning
  • PollEv.com/sarahdean011 How many stationary, deterministic policies are there if \(|\mathcal S|=S\) and \(|\mathcal A|=A\)?

Agenda

 

1. Recap: Markov Decision Process

 2. Trajectories and Distributions 

3. State Distributions

Trajectory

[Diagram: a trajectory unrolled as \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\)]

  • When we deploy or roll out a policy, we will observe a sequence of states and actions
    • define the initial state distribution \(\mu_0\in\Delta(\mathcal S)\)
  • This sequence is called a trajectory: $$ \tau = (s_0, a_0, s_1, a_1, ... ) $$
  • observe \(s_0\sim \mu_0\)
  • play \(a_0\sim \pi_0(s_0)\)
  • observe \(s_1\sim P(s_0, a_0)\)
  • play \(a_1\sim \pi_1(s_1)\)
  • ...
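The roll-out procedure above translates directly into code. Below is a minimal sketch (hypothetical names, not a prescribed interface), assuming \(\mu_0\), \(\pi_t(\cdot\mid s)\), and \(P(\cdot\mid s,a)\) are stored as dictionaries mapping outcomes to probabilities.

```python
import random

def rollout(mu0, policy, P, H, seed=None):
    """Sample a trajectory tau = (s_0, a_0, s_1, a_1, ..., s_{H-1}, a_{H-1}).

    mu0:    dict s -> mu_0(s)
    policy: list over time; policy[t][s] is a dict a -> pi_t(a | s)
    P:      dict (s, a) -> dict s' -> P(s' | s, a)
    H:      horizon
    """
    rng = random.Random(seed)
    draw = lambda dist: rng.choices(list(dist), weights=list(dist.values()))[0]
    tau = []
    s = draw(mu0)                   # observe s_0 ~ mu_0
    for t in range(H):
        a = draw(policy[t][s])      # play a_t ~ pi_t(s_t)
        tau += [s, a]
        s = draw(P[(s, a)])         # observe s_{t+1} ~ P(s_t, a_t)
    return tau
```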

Trajectory Example

[Diagram: two-state transition graph. From \(0\): stay remains at \(0\) with probability \(1\); move goes to \(1\) with probability \(1\). From \(1\): stay remains at \(1\) with probability \(p_1\) and goes to \(0\) with probability \(1-p_1\); move goes to \(0\) with probability \(p_2\) and remains at \(1\) with probability \(1-p_2\).]

Probability of a Trajectory

  • Probability of trajectory \(\tau =(s_0, a_0, ..., s_{t}, a_{t})\) under policy \(\pi=(\pi_0,\pi_1,\dots,\pi_{t})\) with initial distribution \(\mu_0\):
  • \(\mathbb{P}_{\mu_0}^\pi (\tau)\) factors autoregressively:
    • \(=\mu_0(s_0)\pi_0(a_0 \mid s_0) {P}(s_1 \mid s_{0}, a_{0})\pi_1(a_1 \mid s_1) {P}(s_2 \mid s_{1}, a_{1})...\)
    • \(=\mu_0(s_0)\pi_0(a_0 \mid s_0)\displaystyle\prod_{i=1}^{t} {P}(s_i \mid s_{i-1}, a_{i-1}) \pi_i(a_i \mid s_i)\)
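The autoregressive factorization can be evaluated directly. A minimal sketch, assuming the same dictionary representations as in the rollout sketch above (policy[i][s][a] for \(\pi_i(a\mid s)\), P[(s,a)][s'] for \(P(s'\mid s,a)\), mu0[s] for \(\mu_0(s)\)):

```python
def trajectory_probability(tau, mu0, policy, P):
    """Compute P^pi_{mu0}(tau) for tau = (s_0, a_0, s_1, a_1, ..., s_t, a_t)."""
    states, actions = tau[0::2], tau[1::2]
    prob = mu0[states[0]] * policy[0][states[0]][actions[0]]   # mu_0(s_0) pi_0(a_0 | s_0)
    for i in range(1, len(states)):
        prob *= P[(states[i - 1], actions[i - 1])][states[i]]  # P(s_i | s_{i-1}, a_{i-1})
        prob *= policy[i][states[i]][actions[i]]               # pi_i(a_i | s_i)
    return prob
```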


First recall Bayes' rule: for events \(A\) and \(B\),

$$ \mathbb P\{A \cap B\} = \mathbb P\{A\}\mathbb P\{B\mid A\} = \mathbb P\{B\} \mathbb P\{A\mid B\}$$

Why?

Then we have that

$$\mathbb P_{\mu_0}^\pi(s_0, a_0) = \mathbb P\{s_0\} \mathbb P\{a_0\mid s_0\} = \mu_0(s_0)\pi_0(a_0\mid s_0)$$

then

$$\mathbb P_{\mu_0}^\pi(s_0, a_0, s_1) = \mathbb P_{\mu_0}^\pi(s_0, a_0)\underbrace{\mathbb P\{s_1\mid a_0, s_0\}}_{P(s_1\mid s_0, a_0)}$$

and so on

Probability of a Trajectory

  • For a deterministic policy (assumed for the rest of the lecture), actions are determined precisely by the states: \(a_t = \pi_t(s_t)\)
  • Probability of state trajectory \(\tau =(s_0, s_1, \dots, s_{H-1})\):
    • \(\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\displaystyle\prod_{i=1}^{H-1} {P}(s_i \mid s_{i-1}, \pi_{i-1}(s_{i-1})) \)


Formal Problem Statement

  • Goal: maximize expected cumulative reward
  • Formally, for \(\tau=(s_0,a_0,\dots, s_{H-1},a_{H-1})\) this goal is encoded in the following optimization problem $$\max_\pi \mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1}  r(s_k, a_k) \right]$$
  • Recall expectation: for a finite set \(\mathcal X\) and distribution \(p\), $$\mathbb E_{x\sim p}[f(x)] = \sum_{x\in\mathcal X}p(x)f(x)$$
  • It is difficult to directly expand and compute this expectation (see the brute-force sketch below)!

Note: the above assumes that the reward function is deterministic.
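To see why the direct expansion is unwieldy, here is a brute-force sketch (hypothetical names, deterministic policy and reward as assumed above) that sums over all \(|\mathcal S|^H\) state trajectories; its cost grows exponentially in \(H\), which motivates the state-distribution view in the next section.

```python
from itertools import product

def expected_return_bruteforce(states, mu0, policy, P, r, H):
    """E[sum_t r(s_t, pi_t(s_t))] by enumerating all |S|^H state trajectories.

    policy[t][s] is an action (deterministic policy); P[(s, a)][s'] = P(s' | s, a).
    """
    total = 0.0
    for traj in product(states, repeat=H):                  # all (s_0, ..., s_{H-1})
        prob = mu0[traj[0]]
        for i in range(1, H):
            a = policy[i - 1][traj[i - 1]]
            prob *= P[(traj[i - 1], a)].get(traj[i], 0.0)   # P(s_i | s_{i-1}, pi_{i-1}(s_{i-1}))
        total += prob * sum(r(s, policy[t][s]) for t, s in enumerate(traj))
    return total
```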

Agenda

 

1. Recap: Markov Decision Process

2. Trajectories and Distributions

3. State Distributions

Probability of a State

Example: \(\pi_t(s)=\)stay and \(\mu_0\) is uniform (each state with probability \(1/2\)).

  • What are the following probabilities?
  • \(\mathbb{P}\{s_0=1\} = 1/2\)
  • \(\mathbb{P}\{s_1=1\} = \mathbb{P}^\pi_{\mu_0} (0, 1) + \mathbb{P}^\pi_{\mu_0} (1, 1)=p_1/2\)
  • \(\mathbb{P}\{s_2=1\} = \mathbb{P}^\pi_{\mu_0} (1, 1,1)=p_1^2/2\)

Probability of state \(s\) at time \(t\) $$ \mathbb{P}\{s_t=s\} = \displaystyle\sum_{\substack{s_{0:t-1}\in\mathcal S^t}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, s) $$

[Diagram: Markov chain under the stay policy. State \(0\) remains at \(0\) with probability \(1\); state \(1\) remains at \(1\) with probability \(p_1\) and moves to \(0\) with probability \(1-p_1\).]

Why? First recall that

$$ \mathbb P\{A \cup B\} = \mathbb P\{A\}+\mathbb P\{B\} - \mathbb P\{A\cap B\}$$

If \(A\) and \(B\) are disjoint, the final term is \(0\) by definition.

If all \(A_i\) are disjoint events, then the probability any one of them happens is

$$ \mathbb P\{\cup_{i} A_i\} = \sum_{i} \mathbb P\{A_i\}$$

State Distribution

  • Keep track of probability for all states \(s\in\mathcal S\) with \(d_t\)
  • \(d_t\) is a distribution over \(\mathcal S\)
  • Can be represented as an \(S=|\mathcal S|\) dimensional vector $$ d_t[s] = \mathbb{P}\{s_t=s\} $$

$$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \quad d_1 = \begin{bmatrix} 1-\frac{p_1}{2}\\ \frac{p_1}{2}\end{bmatrix}\quad d_2 = \begin{bmatrix} 1-\frac{p_1^2}{2}\\ \frac{p_1^2}{2}\end{bmatrix}$$

Example: \(\pi_t(s)=\)stay and \(\mu_0\) is uniform (each state with probability \(1/2\)).


Expected Reward

$$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \quad d_1 = \begin{bmatrix} 1-\frac{p_1}{2}\\ \frac{p_1}{2}\end{bmatrix}\quad d_2 = \begin{bmatrix} 1-\frac{p_1^2}{2}\\ \frac{p_1^2}{2}\end{bmatrix}$$

Example: \(\pi_t(s)=\)stay and \(\mu_0\) is uniform (each state with probability \(1/2\)).


  • Suppose reward is \(1\) when \(s=0\) and \(0\) otherwise
  • What is \(\mathbb E[r_0]\)? \(\mathbb E[r_1]\)? \(\mathbb E[r_2]\)?

State distribution lets us compute \(\mathbb E[r_t] = \sum_{s\in\mathcal S}d_t[s]r(s,\pi_t(s))\)
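As a quick numeric check of this formula (a sketch with a made-up value of \(p_1\), and the reward from this slide: \(1\) in state \(0\), \(0\) otherwise):

```python
p1 = 0.9                                   # hypothetical value for illustration
r = lambda s, a: 1.0 if s == 0 else 0.0    # reward 1 in state 0, 0 otherwise

d = [
    [0.5, 0.5],                            # d_0
    [1 - p1 / 2, p1 / 2],                  # d_1
    [1 - p1**2 / 2, p1**2 / 2],            # d_2
]

for t, d_t in enumerate(d):
    # E[r_t] = sum_s d_t[s] r(s, pi_t(s)) with pi_t(s) = "stay"
    print(f"E[r_{t}] =", sum(d_t[s] * r(s, "stay") for s in (0, 1)))
# Prints 0.5, then 1 - p1/2 = 0.55, then 1 - p1^2/2 = 0.595
```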

Recursion in State Distribution

  • How does state distribution change from \(t\) to \(t+1\)?
    • Recall, \(s_{t+1}\sim P(s_t,\pi_t(s_t))\)
    • i.e. \(s_{t+1} = s'\) with probability \(P(s'|s_t, \pi_t(s_t))\)
  • Write as a summation over possible \(s_t\):
    • \(\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\}  =\sum_{s\in\mathcal S} P(s'\mid s, \pi_t(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}\)
  • In vector notation:
    • \(d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi_t(s))d_t[s]\)
    • \(d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi_t(1)) & \dots & P(s'\mid S, \pi_t(S))\end{bmatrix},d_t\rangle \)

Transition Matrix

Given a policy \(\pi_t(\cdot)\) and a transition function \(P(\cdot\mid \cdot,\cdot)\)

  • Define the transition matrix \(P_{\pi_t} \in \mathbb R^{S\times S}\)
  • At row \(s\) and column \(s'\), entry is \(P(s'\mid s,\pi_t(s))\)

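A minimal sketch of building \(P_{\pi_t}\) from a tabular transition function and a deterministic policy (hypothetical names; P[(s, a)][s'] stores \(P(s'\mid s,a)\)):

```python
import numpy as np

def transition_matrix(states, P, pi_t):
    """P_{pi_t}: the entry at row s, column s' is P(s' | s, pi_t[s])."""
    S = len(states)
    M = np.zeros((S, S))
    for i, s in enumerate(states):
        for j, s_next in enumerate(states):
            M[i, j] = P[(s, pi_t[s])].get(s_next, 0.0)
    return M

# Two-state example under the "stay" policy, with a hypothetical p1.
p1 = 0.9
P = {(0, "stay"): {0: 1.0}, (1, "stay"): {0: 1 - p1, 1: p1}}
print(transition_matrix([0, 1], P, {0: "stay", 1: "stay"}))   # [[1, 0], [1-p1, p1]]
```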

State Evolution

Proposition: The state distribution evolves according to $$ d_t = P_{\pi_{t-1}}^\top\dots P_{\pi_1}^\top P_{\pi_0}^\top d_0\quad t\geq 1$$

Proof: (by induction)

  1. Base case: when \(t=0\), \(d_0=d_0\)
  2. Induction step:
    • Need to prove \(d_{k+1} = P_{\pi_k}^\top d_k\)
    • Exercise: review proof below


Proof of claim that \(d_{k+1} = P_{\pi_k}^\top d_k\)

  • \(d_{k+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi_k(s))d_k[s]\)

  • \(d_{k+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi_k(1)) & \dots & P(s'\mid S, \pi_k(S))\end{bmatrix}, d_k\rangle \)

  • Each entry of \(d_{k+1}\) is the inner product of a column of \(P_{\pi_k}\) with \(d_k\)

    • in other words, the inner product of a row of \(P_{\pi_k}^\top\) with \(d_k\)

  • By the definition of matrix multiplication, \(d_{k+1} = P_{\pi_k}^\top d_k\)



State Evolution Example

Example: \(\pi(s)=\)stay and \(\mu_0\) is uniform (each state with probability \(1/2\)).

$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$

  • \(d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \)
  • \(d_1 = P_\pi^\top d_0 = \begin{bmatrix}1& 1-p_1\\0  & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix}   = \begin{bmatrix} 1-\frac{p_1}{2}\\ \frac{p_1}{2}\end{bmatrix}\)
  • \(d_2 = P_\pi^\top d_1 =\begin{bmatrix}1& 1-p_1\\0  & p_1\end{bmatrix}\begin{bmatrix} 1-\frac{p_1}{2}\\ \frac{p_1}{2}\end{bmatrix} = \begin{bmatrix} 1-\frac{p_1^2}{2}\\ \frac{p_1^2}{2}\end{bmatrix}\)
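A numeric sanity check of the evolution \(d_t = P_\pi^\top d_{t-1}\) for this example (a sketch with a made-up value of \(p_1\)):

```python
import numpy as np

p1 = 0.9                                  # hypothetical value for illustration
P_pi = np.array([[1.0,    0.0],           # row s, column s': P(s' | s, stay)
                 [1 - p1, p1]])
d = np.array([0.5, 0.5])                  # d_0: uniform initial distribution

for t in range(1, 3):
    d = P_pi.T @ d                        # d_t = P_pi^T d_{t-1}
    print(f"d_{t} =", d)                  # matches [1 - p1^t / 2, p1^t / 2]
```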

Recap

1. Markov Decision Process

2. Trajectories and Distributions

3. State Distributions

Announcements