Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Announcements

• Homework released next week
• Problem Set 1 due 1 week later
• Programming Assignment 1 due 2 weeks later

## Agenda

1. Recap: Markov Decision Process

2. Imitation Learning

3. Trajectories and Distributions

## Markov Decision Process (MDP)

[Diagram: agent and environment interaction loop. The agent plays action $$a_t\sim \pi(s_t)$$, receives reward $$r_t\sim r(s_t, a_t)$$, and the environment state updates to $$s_{t+1}\sim P(s_t, a_t)$$]

1. Agent observes state of environment
2. Agent takes action
• depending on the state, according to the policy
3. Environment state updates according to transition function


Markovian Assumption: Conditioned on $$s_t,a_t$$, the reward $$r_t$$ and next state $$s_{t+1}$$ are independent of the past.

When state transitions are stochastic, we will write either:

• $$s'\sim P(s,a)$$ as a random variable
• $$P(s'|s,a)$$ as a number between $$0$$ and $$1$$

## Notation: Stochastic Maps

[Diagram: two states, $$0$$ and $$1$$]

Example:

• $$\mathcal A=\{$$stay,switch$$\}$$
• From state $$0$$, action always works
• From state $$1$$, staying works with probability $$p_1$$ and switching with probability $$p_2$$

## Notation: Stochastic Maps

[Diagram: two states, $$0$$ and $$1$$]

• From state $$0$$, action always works
• From state $$1$$, staying works with probability $$p_1$$ and switching with probability $$p_2$$

As random variables:

• $$P(0,$$stay$$)=\mathbf{1}_{0}$$
• $$P(1,$$stay$$)=\mathsf{Bernoulli}(p_1)$$
• $$P(0,$$switch$$)=\mathbf{1}_{1}$$
• $$P(1,$$switch$$)=\mathsf{Bernoulli}(1-p_2)$$

As numbers between $$0$$ and $$1$$:

• $$P(0\mid 0,$$stay$$)=1$$,  $$P(1\mid 0,$$stay$$)=0$$
• $$P(0\mid 1,$$stay$$)=1-p_1$$,  $$P(1\mid 1,$$stay$$)=p_1$$
• $$P(0\mid 0,$$switch$$)=0$$,  $$P(1\mid 0,$$switch$$)=1$$
• $$P(0\mid 1,$$switch$$)=p_2$$,  $$P(1\mid 1,$$switch$$)=1-p_2$$
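Both notations translate directly into code. Below is a minimal sketch (states encoded as $$0,1$$, hypothetical values for $$p_1$$ and $$p_2$$, and numpy are all assumptions of the sketch) of the stochastic map as a sampler and as a table of numbers:

```python
import numpy as np

p1, p2 = 0.9, 0.8  # hypothetical values for p_1 and p_2

# P[(s, a)] is the distribution over next states, indexed by s'
P = {
    (0, "stay"):   np.array([1.0, 0.0]),    # action always works from state 0
    (0, "switch"): np.array([0.0, 1.0]),
    (1, "stay"):   np.array([1 - p1, p1]),  # staying works with probability p_1
    (1, "switch"): np.array([p2, 1 - p2]),  # switching works with probability p_2
}

rng = np.random.default_rng(0)
s_next = rng.choice(2, p=P[(1, "stay")])  # s' ~ P(s, a) as a random variable
prob = P[(1, "stay")][1]                  # P(1 | 1, stay) = p_1, a number in [0, 1]
```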

## Infinite Horizon Discounted MDP

• $$\mathcal{S}, \mathcal{A}$$ state and action space
• $$r$$ reward function: stochastic map from (state, action) to scalar reward
• $$P$$ transition function: stochastic map from current state and action to next state
• $$\gamma$$ discount factor between $$0$$ and $$1$$

Goal: achieve high cumulative reward:

$$\sum_{t=0}^\infty \gamma^t r_t$$

$$\underset{\pi}{\text{maximize}} \quad \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

$$\text{s.t.} \quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$
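As a data structure, the tuple $$\mathcal M$$ is just a container for its five ingredients. A minimal tabular sketch (the array shapes and the expected-reward simplification are assumptions of this sketch, not part of the general definition):

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class MDP:
    """The tuple M = {S, A, r, P, gamma} in tabular form."""
    n_states: int      # |S|
    n_actions: int     # |A|
    r: np.ndarray      # r[s, a]: reward (expected value; the map may be stochastic)
    P: np.ndarray      # P[s, a, s']: next-state probabilities, each row sums to 1
    gamma: float       # discount factor between 0 and 1

def discounted_return(rewards, gamma):
    """Truncated cumulative reward sum_t gamma^t r_t; the neglected tail of
    the infinite sum is at most gamma^T r_max / (1 - gamma)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))
```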

## Infinite Horizon Discounted MDP

If $$|\mathcal S|=S$$ and $$|\mathcal A|=A$$, then how many deterministic policies are there? PollEv.com/sarahdean011

• If $$A=1$$ there is only one policy: a constant policy
• If $$S=1$$ there are $$A$$ policies: map the state to each action
• For general $$S$$, there are $$A$$ ways to map state $$1$$, times $$A$$ ways to map state $$2$$, and so on up to state $$S$$, for $$A\times A\times\cdots\times A = A^S$$ policies in total (see the sketch below)
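A quick way to convince yourself of the $$A^S$$ count: enumerate every deterministic policy as a choice of one action per state (a small sketch with hypothetical sizes):

```python
from itertools import product

S, A = 3, 2  # hypothetical small state and action spaces

# A deterministic policy picks one of A actions for each of S states,
# i.e. an element of {0, ..., A-1}^S
policies = list(product(range(A), repeat=S))
assert len(policies) == A**S  # 2^3 = 8 policies
```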

$$\underset{\pi}{\text{maximize}} \quad \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

$$\text{s.t.} \quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

We will find policies using optimization & learning

## Agenda

1. Recap: Markov Decision Process

2. Imitation Learning

3. Trajectories and Distributions

Helicopter Acrobatics (Stanford)

LittleDog Robot (LAIRLab at CMU)

An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS '88]

## Imitation Learning

Expert Demonstrations → Supervised ML Algorithm → Policy $$\pi$$

• Supervised ML algorithm: e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks
• Policy $$\pi$$: maps states to actions

## Behavioral Cloning

Dataset from expert policy $$\pi_\star$$: $$\{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star$$

$$\underset{\pi}{\text{maximize}} \quad \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

$$\text{s.t.} \quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

Rather than optimize, imitate!

$$\underset{\pi}{\text{minimize}} \quad \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

## Behavioral Cloning

1. Policy class $$\Pi$$: usually parametrized by some $$w\in\mathbb R^d$$, e.g. weights of deep network, SVM, etc
2. Loss function $$\ell(\cdot,\cdot)$$: quantify accuracy
3. Optimization Algorithm: gradient descent, interior point methods, sklearn, torch

$$\underset{\pi\in\Pi}{\text{minimize}} \quad \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

Supervised learning with empirical risk minimization (ERM)
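As one concrete (assumed, not prescribed) instantiation of this ERM: discrete actions, a linear policy class fit with sklearn (one of the optimizers named above), and synthetic stand-in data for the expert demonstrations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in expert dataset {(s_i, a_i)}: states in R^4, binary expert actions
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))
actions = (states[:, 0] > 0).astype(int)  # synthetic "expert" labels

# Policy class Pi = linear policies; fit() minimizes the logistic
# (cross-entropy) surrogate of sum_i loss(pi(s_i), a_i)
pi = LogisticRegression().fit(states, actions)

empirical_risk = np.mean(pi.predict(states) != actions)  # 0/1 loss on the data
```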

## Behavioral Cloning

In this class, we assume that supervised learning works!

$$\underset{\pi\in\Pi}{\text{minimize}} \quad \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

Supervised learning with empirical risk minimization (ERM)

i.e. we successfully optimize and generalize, so that the population loss is small: $$\displaystyle \mathbb E_{s,a\sim\mathcal D_\star}[\ell(\pi(s), a)]\leq \epsilon$$

For many loss functions, this means that
$$\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon$$

## Ex: Learning to Drive

Policy $$\pi$$

Input: Camera Image

Output: Steering Angle

## Ex: Learning to Drive

[Diagram: a dataset of expert trajectory frames, each a pair $$(x, y)$$ of camera image and steering angle, is fed to a supervised learning algorithm, producing a policy $$\pi$$ that maps a camera image to a steering angle]

## Ex: Learning? to Drive

[Diagram: the learned policy drifts away from the expert trajectory]

No training data of "recovery" behavior!

## Ex: Learning? to Drive

What about assumption  $$\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon$$?

An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS '88]

“If the network is not presented with sufficient variability in its training exemplars to cover the conditions it is likely to encounter...[it] will perform poorly”

## Agenda

1. Recap: Markov Decision Process

2. Imitation Learning

3. Trajectories and Distributions

## Trajectory

$$s_0 \to a_0 \to s_1 \to a_1 \to s_2 \to a_2 \to \cdots$$

• When we deploy or roll out a policy, we observe a sequence of states and actions
• Define the initial state distribution $$\mu_0\in\Delta(\mathcal S)$$
• This sequence is called a trajectory: $$\tau = (s_0, a_0, s_1, a_1, \dots)$$, generated as follows (see the sketch below):
• observe $$s_0\sim \mu_0$$
• play $$a_0\sim \pi(s_0)$$
• observe $$s_1\sim P(s_0, a_0)$$
• play $$a_1\sim \pi(s_1)$$
• ...
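A minimal roll-out sketch of this generative process; the tabular encoding is an assumption: $$\mu_0$$ a probability vector, $$P$$ an array with $$P[s, a]$$ a distribution over next states, and a stochastic policy given as an array with $$\pi[s]$$ a distribution over actions.

```python
import numpy as np

def rollout(mu0, pi, P, horizon, rng):
    """Sample a trajectory tau = (s_0, a_0, s_1, a_1, ...) of a given length."""
    s = rng.choice(len(mu0), p=mu0)           # observe s_0 ~ mu_0
    tau = []
    for _ in range(horizon):
        a = rng.choice(pi.shape[1], p=pi[s])  # play a_t ~ pi(s_t)
        tau += [s, a]
        s = rng.choice(len(mu0), p=P[s, a])   # observe s_{t+1} ~ P(s_t, a_t)
    return tau
```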

## Trajectory Example

[Diagram: two states $$0$$ and $$1$$. From state $$0$$: stay remains at $$0$$ w.p. $$1$$; switch moves to $$1$$ w.p. $$1$$. From state $$1$$: stay remains w.p. $$p_1$$ and drops to $$0$$ w.p. $$1-p_1$$; switch moves to $$0$$ w.p. $$p_2$$ and remains w.p. $$1-p_2$$]

## Probability of a Trajectory

• Probability of trajectory $$\tau =(s_0, a_0, s_1, ... s_t, a_t)$$ under policy $$\pi$$ starting from initial distribution $$\mu_0$$:
• $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0) {P}(s_1 \mid s_{0}, a_{0})\pi(a_1 \mid s_1) {P}(s_2 \mid s_{1}, a_{1})\cdots$$
• $$=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$

$$s_0 \to a_0 \to s_1 \to a_1 \to s_2 \to a_2 \to \cdots$$

First recall Bayes' rule: for events $$A$$ and $$B$$,

$$\mathbb P\{A \cap B\} = \mathbb P\{A\}\mathbb P\{B\mid A\} = \mathbb P\{B\} \mathbb P\{A\mid B\}$$

Why?

Then we have that

$$\mathbb P_{\mu_0}^\pi(s_0, a_0) = \mathbb P\{s_0\} \mathbb P\{a_0\mid s_0\} = \mu_0(s_0)\pi(a_0\mid s_0)$$

then

$$\mathbb P_{\mu_0}^\pi(s_0, a_0, s_1) = \mathbb P_{\mu_0}^\pi(s_0, a_0)\underbrace{\mathbb P\{s_1\mid a_0, s_0\}}_{P(s_1\mid s_0, a_0)}$$

and so on
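The chain of factorizations above is exactly a running product. A sketch in the same tabular encoding (the arrays $$\mu_0[s]$$, $$\pi[s,a]$$, $$P[s,a,s']$$ are assumptions of the sketch):

```python
def trajectory_probability(tau, mu0, pi, P):
    """P_{mu0}^pi(tau) for tau = (s_0, a_0, s_1, a_1, ..., s_t, a_t)."""
    states, acts = tau[0::2], tau[1::2]
    prob = mu0[states[0]] * pi[states[0], acts[0]]        # mu_0(s_0) pi(a_0 | s_0)
    for i in range(1, len(states)):
        prob *= P[states[i - 1], acts[i - 1], states[i]]  # P(s_i | s_{i-1}, a_{i-1})
        prob *= pi[states[i], acts[i]]                    # pi(a_i | s_i)
    return prob
```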

## Probability of a Trajectory

• For a deterministic policy (assumed for the rest of the lecture), actions are determined precisely by the states: $$a_t = \pi(s_t)$$
• Probability of state trajectory $$\tau =(s_0, s_1, ... s_t)$$:
• $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, \pi(s_{i-1}))$$

$$s_0 \to a_0 \to s_1 \to a_1 \to s_2 \to a_2 \to \cdots$$

## Probability of a State

Example: $$\pi(s)=$$stay and $$\mu_0$$ is each state with probability $$1/2$$.

• What is $$\mathbb{P}\{s_t=1\mid \mu_0,\pi\}$$?
• $$\mathbb{P}\{s_0=1\} = 1/2$$
• $$\mathbb{P}\{s_1=1\} = \mathbb{P}^\pi_{\mu_0} (0, 1) + \mathbb{P}^\pi_{\mu_0} (1, 1)= 0 + p_1/2$$
• $$\mathbb{P}\{s_2=1\} = \mathbb{P}^\pi_{\mu_0} (1, 1, 1)=p_1^2/2$$, since $$(1,1,1)$$ is the only trajectory ending in $$1$$ with nonzero probability

Probability of state $$s$$ at time $$t$$: $$\mathbb{P}\{s_t=s\mid \mu_0,\pi\} = \displaystyle\sum_{\substack{s_{0:t-1}\in\mathcal S^t}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, s)$$

[Diagram: two states under $$\pi(s)=$$stay; $$P(0\mid 0)=1$$, $$P(1\mid 1)=p_1$$, $$P(0\mid 1)=1-p_1$$]

Why? First recall that

$$\mathbb P\{A \cup B\} = \mathbb P\{A\}+\mathbb P\{B\} - \mathbb P\{A\cap B\}$$

If $$A$$ and $$B$$ are disjoint, the final term is $$0$$ by definition.

If all $$A_i$$ are disjoint events, then the probability that any one of them happens is

$$\mathbb P\{\cup_{i} A_i\} = \sum_{i} \mathbb P\{A_i\}$$
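For intuition (and to check small examples), this sum over disjoint trajectories can be computed by brute force under a deterministic policy. A sketch, with $$P_\pi[s, s'] = P(s' \mid s, \pi(s))$$ as an assumed tabular encoding; the cost grows as $$S^t$$, which the transition-matrix view introduced below avoids:

```python
from itertools import product

import numpy as np

def state_probability(s, t, mu0, P_pi):
    """P{s_t = s}: sum P(s_{0:t-1}, s) over all length-t state prefixes."""
    total = 0.0
    for prefix in product(range(len(mu0)), repeat=t):
        path = list(prefix) + [s]
        p = mu0[path[0]]                                  # mu_0(s_0)
        for s_prev, s_next in zip(path[:-1], path[1:]):
            p *= P_pi[s_prev, s_next]                     # P(s_i | s_{i-1}, pi(s_{i-1}))
        total += p
    return total

# Check against the slide's example: pi = stay, mu0 uniform, hypothetical p_1
p1 = 0.9
P_pi = np.array([[1.0, 0.0], [1 - p1, p1]])
mu0 = np.array([0.5, 0.5])
assert np.isclose(state_probability(1, 1, mu0, P_pi), p1 / 2)
assert np.isclose(state_probability(1, 2, mu0, P_pi), p1**2 / 2)
```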

## State Distribution

• Keep track of probability for all states $$s\in\mathcal S$$ with $$d_t$$
• $$d_t$$ is a distribution over $$\mathcal S$$
• Can be represented as an $$S=|\mathcal S|$$ dimensional vector $$d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\}$$

$$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \quad d_1 = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\quad d_2 = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$

Example: $$\pi(s)=$$stay and $$\mu_0$$ is each state with probability $$1/2$$.


## State Distribution Transition

• How does state distribution change over time?
• Recall, $$s_{t+1}\sim P(s_t,\pi(s_t))$$
• i.e. $$s_{t+1} = s'$$ with probability $$P(s'|s_t, \pi(s_t))$$
• Write as a summation over possible $$s_t$$:
• $$\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\} =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}$$
• In vector notation:
• $$d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_t[s]$$
• $$d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_t\rangle$$
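In code, the update is one weighted pass over states. A sketch in the tabular encoding used above, with a deterministic policy given as an array $$\pi[s]$$ of actions (an assumption of the sketch):

```python
import numpy as np

def next_distribution(d_t, P, pi):
    """d_{t+1}[s'] = sum_s P(s' | s, pi(s)) d_t[s]."""
    d_next = np.zeros_like(d_t)
    for s in range(len(d_t)):
        d_next += d_t[s] * P[s, pi[s]]  # row P(. | s, pi(s)), weighted by d_t[s]
    return d_next
```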

## Transition Matrix

Given a policy $$\pi(\cdot)$$ and a transition function $$P(\cdot\mid \cdot,\cdot)$$

• Define the transition matrix $$P_\pi \in \mathbb R^{S\times S}$$
• At row $$s$$ and column $$s'$$, entry is $$P(s'\mid s,\pi(s))$$


## State Evolution

Proposition: The state distribution evolves according to $$d_t = (P_\pi^t)^\top d_0$$

Proof: (by induction)

1. Base case: when $$t=0$$, $$(P_\pi^0)^\top d_0 = I d_0 = d_0$$
2. Induction step:
• Need to prove $$d_{k+1} = P_\pi^\top d_k$$
• Exercise: review proof below


Proof of claim that $$d_{k+1} = P_\pi^\top d_k$$

• $$d_{k+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_k[s]$$

• $$d_{k+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_k\rangle$$

• Each entry of $$d_{k+1}$$ is the inner product of a column of $$P_\pi$$ with $$d_k$$

• in other words, the inner product of a row of $$P_\pi^\top$$ with $$d_k$$

• By the definition of matrix multiplication, $$d_{k+1} = P_\pi^\top d_k$$


## State Evolution Example

Example: $$\pi(s)=$$stay and $$\mu_0$$ is each state with probability $$1/2$$.

$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$

• $$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix}$$
• $$d_1 = P_\pi^\top d_0 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}$$
• $$d_2 = P_\pi^\top d_1 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix}\begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix} = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$
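The proposition is easy to sanity-check numerically on this example; a small sketch with a hypothetical value for $$p_1$$:

```python
import numpy as np

p1 = 0.9  # hypothetical value
P_pi = np.array([[1.0, 0.0],
                 [1 - p1, p1]])
d0 = np.array([0.5, 0.5])

# d_t = (P_pi^t)^T d_0; compare t = 2 against the slide's closed form
d2 = np.linalg.matrix_power(P_pi, 2).T @ d0
assert np.allclose(d2, [1 - p1**2 / 2, p1**2 / 2])
```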

## Recap

1. Markov Decision Process

2. Imitation Learning

3. Trajectories and Distributions
