CS 4/5789: Introduction to Reinforcement Learning

Lecture 2: MDPs and Imitation Learning

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Announcements

Agenda

 

1. Recap: Markov Decision Process

2. Imitation Learning

3. Trajectories and Distributions

Markov Decision Process (MDP)

[Diagram: agent-environment loop. The agent observes state \(s_t\) and takes action \(a_t\sim \pi(s_t)\); the environment returns reward \(r_t\sim r(s_t, a_t)\) and next state \(s_{t+1}\sim P(s_t, a_t)\).]

  1. Agent observes state of environment
  2. Agent takes action
    • depending on state according to policy
  3. Environment state updates according to transition function


Markovian Assumption: Conditioned on \(s_t,a_t\), the reward \(r_t\) and next state \(s_{t+1}\) are independent of the past.

When state transitions are stochastic, we will write either:

  • \(s'\sim P(s,a)\) as a random variable
  • \(P(s'|s,a)\) as a number between \(0\) and \(1\)

Notation: Stochastic Maps

[Diagram: two states, \(0\) and \(1\)]

Example:

  • \(\mathcal A=\{\)stay,switch\(\}\)
  • From state \(0\), action always works
  • From state \(1\), staying works with probability \(p_1\) and switching with probability \(p_2\)

Notation: Stochastic Maps

[Diagram: two states, \(0\) and \(1\)]

  • From state \(0\), the action always works
  • From state \(1\), staying works with probability \(p_1\) and switching with probability \(p_2\)
  • As stochastic maps: \(P(0,\)stay\()=\mathbf{1}_{0}\), \(P(1,\)stay\()=\mathsf{Bernoulli}(p_1)\), \(P(0,\)switch\()=\mathbf{1}_{1}\), \(P(1,\)switch\()=\mathsf{Bernoulli}(1-p_2)\)
  • As numbers (see the sampling sketch below):
    • \(P(0\mid 0,\)stay\()=1\),  \(P(1\mid 0,\)stay\()=0\)
    • \(P(0\mid 1,\)stay\()=1-p_1\),  \(P(1\mid 1,\)stay\()=p_1\)
    • \(P(0\mid 0,\)switch\()=0\),  \(P(1\mid 0,\)switch\()=1\)
    • \(P(0\mid 1,\)switch\()=p_2\),  \(P(1\mid 1,\)switch\()=1-p_2\)
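To make the notation concrete, here is a minimal Python sketch of this two-state example (the numeric values of \(p_1, p_2\) and all names are illustrative, not from the slides): it stores \(P(s'\mid s,a)\) as numbers and samples \(s'\sim P(s,a)\) as a random variable.

```python
import random

p1, p2 = 0.9, 0.8  # illustrative values for the example probabilities

# P[(s, a)] is the distribution over next states: {s': P(s' | s, a)}
P = {
    (0, "stay"):   {0: 1.0,    1: 0.0},
    (0, "switch"): {0: 0.0,    1: 1.0},
    (1, "stay"):   {0: 1 - p1, 1: p1},
    (1, "switch"): {0: p2,     1: 1 - p2},
}

def sample_next_state(s, a):
    """Draw s' ~ P(s, a) using the numbers P(s' | s, a)."""
    next_states, probs = zip(*P[(s, a)].items())
    return random.choices(next_states, weights=probs, k=1)[0]

print(sample_next_state(1, "stay"))  # 1 with probability p1, otherwise 0
```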

Infinite Horizon Discounted MDP

  • \(\mathcal{S}, \mathcal{A}\) state and action space
  • \(r\) reward function: stochastic map from (state, action) to scalar reward
  • \(P\) transition function: stochastic map from current state and action to next state
  • \(\gamma\) discount factor between \(0\) and \(1\)

Goal: achieve high cumulative reward:

$$\sum_{t=0}^\infty \gamma^t r_t$$

maximize\(_{\pi}\)   \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)

s.t.   \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)

Infinite Horizon Discounted MDP

If \(|\mathcal S|=S\) and \(|\mathcal A|=A\), then how many deterministic policies are there? PollEv.com/sarahdean011

  • If \(A=1\) there is only one policy: the constant policy
  • If \(S=1\) there are \(A\) policies: one for each action the single state can map to
  • For general \(S\), there are \(A\) ways to map state \(1\), times \(A\) ways to map state \(2\), times \(A\) ways to map state \(3\), and so on through state \(S\): \(A^S\) deterministic policies in total (see the enumeration sketch below)
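As a sanity check on the \(A^S\) count, a minimal sketch (the sizes \(S=3\), \(A=2\) are arbitrary) that enumerates every deterministic policy as a tuple assigning one action to each state:

```python
from itertools import product

S, A = 3, 2  # illustrative sizes of the state and action spaces

# A deterministic policy assigns one of the A actions to each of the S states,
# so each tuple (a_1, ..., a_S) below is one policy.
policies = list(product(range(A), repeat=S))

print(len(policies))  # 8
print(A ** S)         # 8, the A^S count
```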

maximize\(_{\pi}\)   \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)

s.t.   \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)

We will find policies using optimization & learning

Agenda

 

1. Recap: Markov Decision Process

2. Imitation Learning

3. Trajectories and Distributions

Helicopter Acrobatics (Stanford)

LittleDog Robot (LAIRLab at CMU)

An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]

Imitation Learning

[Diagram: Expert Demonstrations \(\to\) Supervised ML Algorithm (e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks) \(\to\) Policy \(\pi\), which maps states to actions]

Behavioral Cloning

Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$

maximize\(_{\pi}\)   \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)

s.t.   \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)

Rather than optimize, imitate!

minimize\(_{\pi}\)   \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)

Behavioral Cloning

  1. Policy class \(\Pi\): usually parametrized by some \(w\in\mathbb R^d\), e.g. weights of deep network, SVM, etc
  2. Loss function \(\ell(\cdot,\cdot)\): quantify accuracy
  3. Optimization Algorithm: gradient descent, interior point methods, sklearn, torch

minimize\(_{\pi\in\Pi}\)   \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)

Supervised learning with empirical risk minimization (ERM)
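For concreteness, here is a minimal behavioral-cloning sketch. It assumes discrete actions, a synthetic stand-in for the expert dataset, and logistic regression from scikit-learn as the policy class; none of these choices are prescribed by the course.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the expert dataset {(s_i, a_i)}: states in R^4 and
# discrete expert actions (here a_i = pi*(s_i) is a simple threshold rule).
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))
actions = (states[:, 0] > 0).astype(int)

# Policy class Pi: logistic regression, parametrized by a weight vector.
# Fitting minimizes sum_i l(pi(s_i), a_i) for the logistic loss, i.e. ERM.
policy = LogisticRegression().fit(states, actions)

# The learned policy maps states to actions.
new_state = rng.normal(size=(1, 4))
print(policy.predict(new_state))  # imitated action for an unseen state
```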

Behavioral Cloning

In this class, we assume that supervised learning works!

minimize\(_{\pi\in\Pi}\)   \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)

Supervised learning with empirical risk minimization (ERM)

i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim\mathcal D_\star}[\ell(\pi(s), a)]\leq \epsilon\)

For many loss functions, this means that
\(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)

Ex: Learning to Drive

Policy \(\pi\): Input (Camera Image) \(\to\) Output (Steering Angle)

Ex: Learning to Drive

[Diagram: Dataset of expert trajectory, \((x, y)\) pairs of camera image and steering angle, \(\to\) Supervised Learning \(\to\) Policy \(\pi(\text{image}) = \text{steering angle}\)]

Ex: Learning? to Drive

[Diagram: expert trajectory vs. learned policy]

No training data of "recovery" behavior!

Ex: Learning? to Drive

What about the assumption \(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)?

An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]

“If the network is not presented with sufficient variability in its training exemplars to cover the conditions it is likely to encounter...[it] will perform poorly”

Agenda

 

1. Recap: Markov Decision Process

2. Imitation Learning

3. Trajectories and Distributions

Trajectory

[Diagram: trajectory \(s_0 \to a_0 \to s_1 \to a_1 \to s_2 \to a_2 \to \cdots\)]

  • When we deploy or roll out a policy, we will observe a sequence of states and actions (see the rollout sketch after this list)
    • define the initial state distribution \(\mu_0\in\Delta(\mathcal S)\)
  • This sequence is called a trajectory: $$ \tau = (s_0, a_0, s_1, a_1, ... ) $$
  • observe \(s_0\sim \mu_0\)
  • play \(a_0\sim \pi(s_0)\)
  • observe \(s_1\sim P(s_0, a_0)\)
  • play \(a_1\sim \pi(s_1)\)
  • ...
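Here is the rollout sketch referenced above: a minimal Python version for the two-state example, with an illustrative policy, initial distribution, and transition probabilities.

```python
import random

p1, p2 = 0.9, 0.8  # illustrative transition probabilities

P = {  # P[(s, a)] = {s': P(s' | s, a)}
    (0, "stay"):   {0: 1.0,    1: 0.0},
    (0, "switch"): {0: 0.0,    1: 1.0},
    (1, "stay"):   {0: 1 - p1, 1: p1},
    (1, "switch"): {0: p2,     1: 1 - p2},
}
mu0 = {0: 0.5, 1: 0.5}            # initial state distribution mu_0
policy = lambda s: "stay"         # an illustrative deterministic policy pi

def sample(dist):
    """Draw one outcome from a finite distribution {outcome: probability}."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def rollout(horizon):
    """Observe s_0 ~ mu_0, then alternate a_t ~ pi(s_t), s_{t+1} ~ P(s_t, a_t)."""
    s = sample(mu0)
    tau = [s]
    for _ in range(horizon):
        a = policy(s)
        s = sample(P[(s, a)])
        tau += [a, s]
    return tau  # trajectory (s_0, a_0, s_1, a_1, ..., s_T)

print(rollout(horizon=3))
```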

Trajectory Example

[Diagram: two-state chain. From \(0\): stay remains at \(0\) with probability \(1\); switch moves to \(1\) with probability \(1\). From \(1\): stay remains at \(1\) with probability \(p_1\) and moves to \(0\) with probability \(1-p_1\); switch moves to \(0\) with probability \(p_2\) and remains at \(1\) with probability \(1-p_2\).]

Probability of a Trajectory

  • Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) under policy \(\pi\) starting from initial distribution \(\mu_0\):
  • \(\mathbb{P}_{\mu_0}^\pi (\tau)=\)
    • \(=\mu_0(s_0)\pi(a_0 \mid s_0) {P}(s_1 \mid s_{0}, a_{0})\pi(a_1 \mid s_1) {P}(s_2 \mid s_{1}, a_{1})...\)
    • \(=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)\)


Why? First recall Bayes' rule: for events \(A\) and \(B\),

$$ \mathbb P\{A \cap B\} = \mathbb P\{A\}\mathbb P\{B\mid A\} = \mathbb P\{B\} \mathbb P\{A\mid B\}$$

Then we have that

$$\mathbb P_{\mu_0}^\pi(s_0, a_0) = \mathbb P\{s_0\} \mathbb P\{a_0\mid s_0\} = \mu_0(s_0)\pi(a_0\mid s_0)$$

then

$$\mathbb P_{\mu_0}^\pi(s_0, a_0, s_1) = \mathbb P_{\mu_0}^\pi(s_0, a_0)\underbrace{\mathbb P\{s_1\mid a_0, s_0\}}_{P(s_1\mid s_0, a_0)}$$

and so on
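The chain-rule factorization translates directly into code. A minimal sketch, assuming the policy, transition function, and initial distribution are given as dictionaries of probabilities (all numbers illustrative):

```python
# mu0[s], pi[s][a], and P[(s, a)][s'] are probabilities; all numbers illustrative.
p1, p2 = 0.9, 0.8
mu0 = {0: 0.5, 1: 0.5}
pi = {0: {"stay": 1.0, "switch": 0.0}, 1: {"stay": 1.0, "switch": 0.0}}
P = {
    (0, "stay"):   {0: 1.0,    1: 0.0},
    (0, "switch"): {0: 0.0,    1: 1.0},
    (1, "stay"):   {0: 1 - p1, 1: p1},
    (1, "switch"): {0: p2,     1: 1 - p2},
}

def trajectory_probability(tau):
    """P(tau) = mu0(s_0) pi(a_0|s_0) prod_i P(s_i|s_{i-1}, a_{i-1}) pi(a_i|s_i)."""
    prob = mu0[tau[0]] * pi[tau[0]][tau[1]]
    for i in range(2, len(tau), 2):
        s_prev, a_prev, s, a = tau[i - 2], tau[i - 1], tau[i], tau[i + 1]
        prob *= P[(s_prev, a_prev)][s] * pi[s][a]
    return prob

# Trajectory (s_0, a_0, s_1, a_1) = (1, stay, 1, stay):
print(trajectory_probability((1, "stay", 1, "stay")))  # 0.5 * 1 * p1 * 1 = 0.45
```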

Probability of a Trajectory

  • For a deterministic policy (as in the rest of the lecture), actions are determined by states: \(a_t = \pi(s_t)\)
  • Probability of state trajectory \(\tau =(s_0, s_1, ... s_t)\):
    • \(\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, \pi(s_{i-1})) \)


Probability of a State

Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).

  • What is \(\mathbb{P}\{s_t=1\mid \mu_0,\pi\}\)?
  • \(\mathbb{P}\{s_0=1\} = 1/2\)
  • \(\mathbb{P}\{s_1=1\} = \mathbb{P}^\pi_{\mu_0} (0, 1) + \mathbb{P}^\pi_{\mu_0} (1, 1)=p_1/2\)
  • \(\mathbb{P}\{s_2=1\} = \mathbb{P}^\pi_{\mu_0} (1, 1, 1)=p_1^2/2\) (only the all-\(1\) trajectory contributes, since \(1\) is unreachable from \(0\) under stay)

Probability of state \(s\) at time \(t\) $$ \mathbb{P}\{s_t=s\mid \mu_0,\pi\} = \displaystyle\sum_{\substack{s_{0:t-1}\in\mathcal S^t}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, s) $$

[Diagram: two-state chain under stay: from \(0\), remain at \(0\) with probability \(1\); from \(1\), remain at \(1\) with probability \(p_1\) and move to \(0\) with probability \(1-p_1\).]

Why? First recall that

$$ \mathbb P\{A \cup B\} = \mathbb P\{A\}+\mathbb P\{B\} - \mathbb P\{A\cap B\}$$

If \(A\) and \(B\) are disjoint, the final term is \(0\) by definition.

If all \(A_i\) are disjoint events, then the probability any one of them happens is

$$ \mathbb P\{\cup_{i} A_i\} = \sum_{i} \mathbb P\{A_i\}$$

State Distribution

  • Keep track of probability for all states \(s\in\mathcal S\) with \(d_t\)
  • \(d_t\) is a distribution over \(\mathcal S\)
  • Can be represented as an \(S=|\mathcal S|\) dimensional vector $$ d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\} $$

$$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \quad d_1 = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\quad d_2 = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$

Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).


State Distribution Transition

  • How does state distribution change over time?
    • Recall, \(s_{t+1}\sim P(s_t,\pi(s_t))\)
    • i.e. \(s_{t+1} = s'\) with probability \(P(s'|s_t, \pi(s_t))\)
  • Write as a summation over possible \(s_t\) (implemented in the sketch after this list):
    • \(\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\}  =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}\)
  • In vector notation:
    • \(d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_t[s]\)
    • \(d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_t\rangle \)
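The summation above can be implemented directly for a finite state space; here is the sketch referenced in the list, again using the two-state example with illustrative numbers.

```python
p1 = 0.9                               # illustrative value
policy = {0: "stay", 1: "stay"}        # deterministic policy pi(s) = stay
P = {                                  # P[(s, a)][s'] = P(s' | s, a)
    (0, "stay"): {0: 1.0,    1: 0.0},
    (1, "stay"): {0: 1 - p1, 1: p1},
}
states = [0, 1]

def next_distribution(d):
    """d_{t+1}[s'] = sum_s P(s' | s, pi(s)) d_t[s]."""
    return {s2: sum(P[(s, policy[s])][s2] * d[s] for s in states) for s2 in states}

d = {0: 0.5, 1: 0.5}        # d_0 = mu_0
d = next_distribution(d)    # d_1
print(d)                    # {0: 0.55, 1: 0.45}, i.e. [1 - p1/2, p1/2]
```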

Transition Matrix

Given a policy \(\pi(\cdot)\) and a transition function \(P(\cdot\mid \cdot,\cdot)\)

  • Define the transition matrix \(P_\pi \in \mathbb R^{S\times S}\)
  • At row \(s\) and column \(s'\), entry is \(P(s'\mid s,\pi(s))\)


State Evolution

Proposition: The state distribution evolves according to $$ d_t = (P_\pi^t)^\top d_0$$

Proof: (by induction)

  1. Base case: when \(t=0\), \((P_\pi^0)^\top d_0 = I d_0 = d_0\)
  2. Induction step: assume \(d_k = (P_\pi^k)^\top d_0\)
    • It suffices to show \(d_{k+1} = P_\pi^\top d_k\), since then \(d_{k+1} = P_\pi^\top (P_\pi^k)^\top d_0 = (P_\pi^{k+1})^\top d_0\)
    • Exercise: review proof below


Proof of claim that \(d_{k+1} = P_\pi^\top d_k\)

  • \(d_{k+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_k[s]\)

  • \(d_{k+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_k\rangle \)

  • Each entry of \(d_{k+1}\) is the inner product of a column of \(P_\pi\) with \(d_k\)

    • in other words, the inner product of a row of \(P_\pi^\top\) with \(d_k\)

  • By the definition of matrix multiplication, \(d_{k+1} = P_\pi^\top d_k\)

State Evolution Example

Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).

[Diagram: two-state chain under stay: from \(0\), remain at \(0\) with probability \(1\); from \(1\), remain at \(1\) with probability \(p_1\) and move to \(0\) with probability \(1-p_1\).]

$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$

  • \(d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \)
  • \(d_1 = P_\pi^\top d_0 = \begin{bmatrix}1& 1-p_1\\0  & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix}   = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\)
  • \(d_2 = P_\pi^\top d_1 = \begin{bmatrix}1& 1-p_1\\0  & p_1\end{bmatrix}\begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix} = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}\)
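The example can be checked numerically. A minimal sketch with numpy, plugging in an illustrative value of \(p_1\) and comparing \(d_t = (P_\pi^t)^\top d_0\) against the closed-form entries \(1-p_1^t/2\) and \(p_1^t/2\):

```python
import numpy as np

p1 = 0.9                                   # illustrative value
P_pi = np.array([[1.0, 0.0],               # row s, column s': P(s' | s, stay)
                 [1 - p1, p1]])
d0 = np.array([0.5, 0.5])                  # mu_0: each state with probability 1/2

for t in range(3):
    d_t = np.linalg.matrix_power(P_pi, t).T @ d0   # d_t = (P_pi^t)^T d_0
    print(t, d_t, (1 - p1**t / 2, p1**t / 2))      # matches the closed form
```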

Recap

 

1. Markov Decision Process

2. Imitation Learning

3. Trajectories and Distributions