CS 4/5789: Introduction to Reinforcement Learning
Lecture 2: MDPs and Imitation Learning
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Announcements
 Questions about waitlist/enrollment?
 Homework released next week
 Problem Set 1 due 1 week later
 Programming Assignment 1 due 2 weeks later
Agenda
1. Recap: Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
Markov Decision Process (MDP)
[Agent-environment loop: action \(a_t\sim \pi(s_t)\); reward \(r_t\sim r(s_t, a_t)\); next state \(s_{t+1}\sim P(s_t, a_t)\)]
 Agent observes state of environment
 Agent takes action depending on state according to policy
 Environment state updates according to transition function
Markovian Assumption: Conditioned on \(s_t,a_t\), the reward \(r_t\) and next state \(s_{t+1}\) are independent of the past.
When state transitions are stochastic, we will write either:
 \(s'\sim P(s,a)\) as a random variable
 \(P(s'\mid s,a)\) as a number between \(0\) and \(1\)
Notation: Stochastic Maps
[Diagram: two states, \(0\) and \(1\), with stay/switch transitions]
Example:
 \(\mathcal A=\{\)stay,switch\(\}\)
 From state \(0\), action always works
 From state \(1\), staying works with probability \(p_1\) and switching with probability \(p_2\)
Notation: Stochastic Maps
 From state \(0\), action always works
 From state \(1\), staying works with probability \(p_1\) and switching with probability \(p_2\)
 \(P(0,\)stay\()=\mathbf{1}_{0}\)
 \(P(1,\)stay\()=\mathsf{Bernoulli}(p_1)\)
 \(P(0,\)switch\()=\mathbf{1}_{1}\)
 \(P(1,\)switch\()=\mathsf{Bernoulli}(1-p_2)\)
 \(P(0\mid 0,\)stay\()=1\qquad P(1\mid 0,\)stay\()=0\)
 \(P(0\mid 1,\)stay\()=1-p_1\qquad P(1\mid 1,\)stay\()=p_1\)
 \(P(0\mid 0,\)switch\()=0\qquad P(1\mid 0,\)switch\()=1\)
 \(P(0\mid 1,\)switch\()=p_2\qquad P(1\mid 1,\)switch\()=1-p_2\)
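These tabulated probabilities can be sampled directly. A minimal sketch in Python, with illustrative placeholder values for \(p_1\) and \(p_2\) (the function name is hypothetical):

```python
import random

def sample_next_state(s, a, p1=0.9, p2=0.8):
    """Sample s' ~ P(s, a) for the two-state example.

    From state 0, either action always works; from state 1,
    stay succeeds w.p. p1 and switch succeeds w.p. p2.
    (The values of p1 and p2 here are illustrative placeholders.)
    """
    if s == 0:
        # action always works: stay keeps us at 0, switch moves us to 1
        return 0 if a == "stay" else 1
    if a == "stay":
        # stay succeeds (remain at 1) with probability p1
        return 1 if random.random() < p1 else 0
    # switch succeeds (move to 0) with probability p2
    return 0 if random.random() < p2 else 1
```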
Infinite Horizon Discounted MDP
 \(\mathcal{S}, \mathcal{A}\) state and action space
 \(r\) reward function: stochastic map from (state, action) to scalar reward
 \(P\) transition function: stochastic map from current state and action to next state
 \(\gamma\) discount factor between \(0\) and \(1\)
Goal: achieve high cumulative reward:
$$\sum_{t=0}^\infty \gamma^t r_t$$
maximize \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
\(\pi\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
Infinite Horizon Discounted MDP
If \(|\mathcal S|=S\) and \(|\mathcal A|=A\), then how many deterministic policies are there? PollEv.com/sarahdean011
 If \(A=1\) there is only one policy: a constant policy
 If \(S=1\) there are \(A\) policies: map the state to each action
 For general \(S\), there are \(A\) ways to map state \(1\), times \(A\) ways to map state \(2\), times \(A\) ways to map state \(3\), and so on up to state \(S\): \(A^S\) deterministic policies in total
maximize \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
\(\pi\)
We will find policies using optimization & learning
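The counting argument above can be checked by brute-force enumeration; a small sketch (function name hypothetical), identifying each deterministic policy with a tuple of one action choice per state:

```python
from itertools import product

def deterministic_policies(S, A):
    """Enumerate all deterministic policies for S states and A actions.

    A deterministic policy is a tuple (pi(1), ..., pi(S)): one action
    choice per state, so there are A**S of them in total.
    """
    return list(product(range(A), repeat=S))

policies = deterministic_policies(S=3, A=2)
# len(policies) == 2**3 == 8
```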
Agenda
1. Recap: Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
Helicopter Acrobatics (Stanford)
LittleDog Robot (LAIRLab at CMU)
An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]
Imitation Learning
Expert Demonstrations
Supervised ML Algorithm
Policy \(\pi\)
e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks
maps states to actions
Behavioral Cloning
Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$
maximize \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
\(\pi\)
rather than optimize,
imitate!
minimize \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)
\(\pi\)
Behavioral Cloning
 Policy class \(\Pi\): usually parametrized by some \(w\in\mathbb R^d\), e.g. weights of deep network, SVM, etc
 Loss function \(\ell(\cdot,\cdot)\): quantify accuracy
 Optimization Algorithm: gradient descent, interior point methods, sklearn, torch
minimize \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)
\(\pi\in\Pi\)
Supervised learning with empirical risk minimization (ERM)
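For a tabular state space with the 0-1 loss, the ERM above reduces to a majority vote over expert actions in each state. A minimal sketch (names hypothetical, not the course's reference implementation):

```python
from collections import Counter, defaultdict

def behavioral_cloning(demos):
    """ERM for behavioral cloning with tabular states and 0-1 loss.

    demos: list of (state, action) pairs sampled from the expert.
    With finitely many states and the 0-1 loss, the empirical risk
    minimizer picks the most common expert action in each state.
    """
    by_state = defaultdict(Counter)
    for s, a in demos:
        by_state[s][a] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in by_state.items()}

pi = behavioral_cloning([(0, "stay"), (0, "stay"), (1, "switch"), (0, "switch")])
# pi[0] == "stay" (the majority action), pi[1] == "switch"
```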
Behavioral Cloning
In this class, we assume that supervised learning works!
minimize \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)
\(\pi\in\Pi\)
Supervised learning with empirical risk minimization (ERM)
i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim\mathcal D_\star}[\ell(\pi(s), a)]\leq \epsilon\)
For many loss functions, this means that
\(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)
Ex: Learning to Drive
Policy \(\pi\)
Input: Camera Image
Output: Steering Angle
Ex: Learning to Drive
Supervised Learning
Policy
Dataset of expert trajectory
\((x, y)\)
...
\(\pi\)( ) =
Ex: Learning? to Drive
expert trajectory
learned policy
No training data of "recovery" behavior!
Ex: Learning? to Drive
What about assumption \(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)?
An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]
“If the network is not presented with sufficient variability in its training exemplars to cover the conditions it is likely to encounter...[it] will perform poorly”
Agenda
1. Recap: Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
Trajectory
\(s_0\)
\(a_0\)
\(s_1\)
\(a_1\)
\(s_2\)
\(a_2\)
...
 When we deploy or roll out a policy, we will observe a sequence of states and actions
 define the initial state distribution \(\mu_0\in\Delta(\mathcal S)\)
 This sequence is called a trajectory: $$ \tau = (s_0, a_0, s_1, a_1, ... ) $$
 observe \(s_0\sim \mu_0\)
 play \(a_0\sim \pi(s_0)\)
 observe \(s_1\sim P(s_0, a_0)\)
 play \(a_1\sim \pi(s_1)\)
 ...
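The roll-out procedure above can be sketched as a short loop (the names and callable interfaces here are illustrative assumptions):

```python
def rollout(mu0, pi, P, T):
    """Roll out a policy for T steps: s0 ~ mu0, a_t = pi(s_t), s_{t+1} ~ P(s_t, a_t).

    mu0: callable returning a sampled initial state
    pi:  callable state -> action (deterministic policy, for simplicity)
    P:   callable (state, action) -> sampled next state
    Returns the trajectory (s_0, a_0, s_1, a_1, ..., s_T).
    """
    traj = []
    s = mu0()                 # observe s0 ~ mu0
    for _ in range(T):
        a = pi(s)             # play a_t ~ pi(s_t)
        traj += [s, a]
        s = P(s, a)           # observe s_{t+1} ~ P(s_t, a_t)
    traj.append(s)
    return tuple(traj)
```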
Trajectory Example
[Two-state diagram: from \(0\), stay \(\to 0\) w.p. \(1\), switch \(\to 1\) w.p. \(1\); from \(1\), stay \(\to 1\) w.p. \(p_1\), stay \(\to 0\) w.p. \(1-p_1\), switch \(\to 0\) w.p. \(p_2\), switch \(\to 1\) w.p. \(1-p_2\)]
Probability of a Trajectory
 Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) under policy \(\pi\) starting from initial distribution \(\mu_0\):
 \(\mathbb{P}_{\mu_0}^\pi (\tau)=\)
 \(=\mu_0(s_0)\pi(a_0 \mid s_0) {P}(s_1 \mid s_{0}, a_{0})\pi(a_1 \mid s_1) {P}(s_2 \mid s_{1}, a_{1})...\)
 \(=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)\)
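The product formula can be evaluated directly for tabular MDPs; a sketch, assuming hypothetical dictionary representations for \(\mu_0\), \(\pi\), and \(P\):

```python
def trajectory_probability(traj, mu0, pi, P):
    """Probability of trajectory (s0, a0, s1, a1, ..., s_t, a_t) under pi.

    mu0: dict mapping state -> probability of that initial state
    pi:  dict mapping (a, s) -> probability pi(a | s)       [hypothetical format]
    P:   dict mapping (s_next, s, a) -> probability P(s' | s, a)
    """
    states, actions = traj[0::2], traj[1::2]
    # mu0(s0) * pi(a0 | s0)
    prob = mu0[states[0]] * pi[(actions[0], states[0])]
    # product over i of P(s_i | s_{i-1}, a_{i-1}) * pi(a_i | s_i)
    for i in range(1, len(states)):
        prob *= P[(states[i], states[i - 1], actions[i - 1])]
        prob *= pi[(actions[i], states[i])]
    return prob
```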
\(s_0\)
\(a_0\)
\(s_1\)
\(a_1\)
\(s_2\)
\(a_2\)
...
First recall Bayes' rule: For events \(A\) and \(B\),
$$ \mathbb P\{A \cap B\} = \mathbb P\{A\}\mathbb P\{B\mid A\} = \mathbb P\{B\} \mathbb P\{A\mid B\}$$
Why?
Then we have that
$$\mathbb P_{\mu_0}^\pi(s_0, a_0) = \mathbb P\{s_0\} \mathbb P\{a_0\mid s_0\} = \mu_0(s_0)\pi(a_0\mid s_0)$$
then
$$\mathbb P_{\mu_0}^\pi(s_0, a_0, s_1) = \mathbb P_{\mu_0}^\pi(s_0, a_0)\underbrace{\mathbb P\{s_1\mid a_0, s_0\}}_{P(s_1\mid s_0, a_0)}$$
and so on
Probability of a Trajectory
 For a deterministic policy (rest of lecture), actions are determined precisely by states: \(a_t = \pi(s_t)\)
 Probability of state trajectory \(\tau =(s_0, s_1, ... s_t)\):
 \(\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, \pi(s_{i-1})) \)
\(s_0\)
\(a_0\)
\(s_1\)
\(a_1\)
\(s_2\)
\(a_2\)
...
Probability of a State
Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).
 What is \(\mathbb{P}\{s_t=1\mid \mu_0,\pi\}\)?
 \(\mathbb{P}\{s_0=1\} = 1/2\)
 \(\mathbb{P}\{s_1=1\} = \mathbb{P}^\pi_{\mu_0} (0, 1) + \mathbb{P}^\pi_{\mu_0} (1, 1)=p_1/2\)
 \(\mathbb{P}\{s_2=1\} = \mathbb{P}^\pi_{\mu_0} (1, 1, 1)=p_1^2/2\)
Probability of state \(s\) at time \(t\) $$ \mathbb{P}\{s_t=s\mid \mu_0,\pi\} = \displaystyle\sum_{\substack{s_{0:t-1}\in\mathcal S^t}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, s) $$
[Diagram: under stay, \(0\to 0\) w.p. \(1\); \(1\to 1\) w.p. \(p_1\), \(1\to 0\) w.p. \(1-p_1\)]
Why? First recall that
$$ \mathbb P\{A \cup B\} = \mathbb P\{A\}+\mathbb P\{B\} - \mathbb P\{A\cap B\}$$
If \(A\) and \(B\) are disjoint, the final term is \(0\) by definition.
If all \(A_i\) are disjoint events, then the probability any one of them happens is
$$ \mathbb P\{\cup_{i} A_i\} = \sum_{i} \mathbb P\{A_i\}$$
State Distribution
 Keep track of probability for all states \(s\in\mathcal S\) with \(d_t\)
 \(d_t\) is a distribution over \(\mathcal S\)
 Can be represented as an \(S=|\mathcal S|\) dimensional vector $$ d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\} $$
$$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \quad d_1 = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\quad d_2 = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$
Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).
State Distribution Transition
 How does state distribution change over time?
 Recall, \(s_{t+1}\sim P(s_t,\pi(s_t))\)
 i.e. \(s_{t+1} = s'\) with probability \(P(s'\mid s_t, \pi(s_t))\)
 Write as a summation over possible \(s_t\):
 \(\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\} =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}\)
 In vector notation:
 \(d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_t[s]\)
 \(d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_t\rangle \)
Transition Matrix
Given a policy \(\pi(\cdot)\) and a transition function \(P(\cdot\mid \cdot,\cdot)\)
 Define the transition matrix \(P_\pi \in \mathbb R^{S\times S}\)
 At row \(s\) and column \(s'\), entry is \(P(s'\mid s,\pi(s))\)
State Evolution
Proposition: The state distribution evolves according to $$ d_t = (P_\pi^t)^\top d_0$$
Proof: (by induction)
 Base case: when \(t=0\), \((P_\pi^0)^\top d_0 = I d_0 = d_0\)
 Induction step:
 Need to prove \(d_{k+1} = P_\pi^\top d_k\)
 Exercise: review proof below
Proof of claim that \(d_{k+1} = P_\pi^\top d_k\)

\(d_{k+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_k[s]\)

\(\textcolor{orange}{d_{k+1}[s']} =\langle\)\(\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix}\)\(,\textcolor{red}{d_k}\rangle \)

Each entry of \(d_{k+1}\) is the inner product of a column of \(P_\pi\) with \(d_k\)

in other words, the inner product of a row of \(P_\pi^\top\) with \(d_k\)


By the definition of matrix multiplication, \(d_{k+1} = P_\pi^\top d_k\)
 \(d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \)
 \(d_1 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\)
 \(d_2 =\begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix}\begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix} = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}\)
Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).
State Evolution Example
$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$
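The example can be reproduced numerically; a sketch using NumPy, with an illustrative value for \(p_1\):

```python
import numpy as np

p1 = 0.9  # illustrative value for "stay works from state 1"

# Transition matrix P_pi for pi(s) = stay on the two-state example:
# row s, column s' holds P(s' | s, stay)
P_pi = np.array([[1.0,    0.0],
                 [1 - p1, p1 ]])

d0 = np.array([0.5, 0.5])          # uniform initial distribution
d1 = P_pi.T @ d0                   # [1 - p1/2, p1/2]
d2 = P_pi.T @ d1                   # [1 - p1**2/2, p1**2/2]

# Equivalently, via the proposition d_t = (P_pi^t)^T d_0:
d2_direct = np.linalg.matrix_power(P_pi, 2).T @ d0
```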
Recap
1. Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
Sp23 CS 4/5789: Lecture 2
By Sarah Dean