Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap: Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
action \(a_t\sim \pi(s_t)\)
state \(s_t\)
reward \(r_t\sim r(s_t, a_t)\)
next state \(s_{t+1}\sim P(s_t, a_t)\)
Markovian Assumption: Conditioned on \(s_t,a_t\), the reward \(r_t\) and next state \(s_{t+1}\) are independent of the past.
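When state transitions are stochastic, we will write either \(s'\sim P(s, a)\) (the next state as a random variable) or \(P(s'\mid s, a)\) (a probability, i.e. a number between \(0\) and \(1\)).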
Example: a two-state MDP with states \(0\) and \(1\).
Goal: achieve high cumulative reward:
$$\sum_{t=0}^\infty \gamma^t r_t$$
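As a small illustration of the discounted sum above, here is a sketch that evaluates it for a finite list of rewards. The function name `discounted_return`, the reward values, and the discount factor are assumptions chosen for the example, not part of the lecture.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite list of rewards."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_0, r_1, r_2
gamma = 0.5                # hypothetical discount factor
print(discounted_return(rewards, gamma))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Truncating the infinite sum at a finite horizon is an approximation; with bounded rewards and \(\gamma<1\) the neglected tail shrinks geometrically.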
\(\displaystyle \underset{\pi}{\text{maximize}}~~ \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
If \(|\mathcal S|=S\) and \(|\mathcal A|=A\), then how many deterministic policies are there? PollEv.com/sarahdean011
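One way to explore the question is to enumerate policies for a tiny problem. A deterministic policy picks one action for each state independently, so a brute-force listing should contain \(A^S\) entries. The sketch below uses a hypothetical two-state, two-action setup; the helper name `deterministic_policies` is an assumption.

```python
from itertools import product


def deterministic_policies(states, actions):
    """Yield every deterministic policy as a dict mapping state -> action."""
    for choice in product(actions, repeat=len(states)):
        yield dict(zip(states, choice))


states = [0, 1]               # S = 2
actions = ["stay", "switch"]  # A = 2
policies = list(deterministic_policies(states, actions))
print(len(policies))          # A**S = 2**2 = 4
```

The count grows exponentially in the number of states, which is one reason to search for policies with optimization and learning rather than enumeration.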
We will find policies using optimization & learning
2. Imitation Learning
Helicopter Acrobatics (Stanford)
LittleDog Robot (LAIRLab at CMU)
An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]
Expert Demonstrations
Supervised ML Algorithm
Policy \(\pi\)
e.g., SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks
maps states to actions
Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$
\(\displaystyle \underset{\pi}{\text{maximize}}~~ \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
rather than optimize,
imitate!
\(\displaystyle \underset{\pi}{\text{minimize}}~~ \sum_{i=1}^N \ell(\pi(s_i), a_i)\)
sklearn, torch
\(\displaystyle \underset{\pi\in\Pi}{\text{minimize}}~~ \sum_{i=1}^N \ell(\pi(s_i), a_i)\)
Supervised learning with empirical risk minimization (ERM)
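To make the ERM objective concrete, here is a minimal sketch of behavior cloning in the simplest possible setting. It assumes a finite state space, the 0-1 loss \(\ell(\pi(s), a) = \mathbb 1\{\pi(s)\neq a\}\), and a policy class \(\Pi\) containing every tabular policy. Under those assumptions, picking the most frequent expert action in each state minimizes the empirical loss. The dataset and function names are hypothetical; in practice a library such as sklearn or torch would fit a parametric model instead.

```python
from collections import Counter, defaultdict


def behavior_cloning_tabular(dataset):
    """ERM with 0-1 loss over all tabular policies: majority action per state."""
    counts = defaultdict(Counter)
    for s, a in dataset:
        counts[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def empirical_risk(policy, dataset):
    """Average 0-1 loss of `policy` on the dataset."""
    return sum(policy[s] != a for s, a in dataset) / len(dataset)


# Hypothetical expert demonstrations (state, action); one label is noisy.
demos = [(0, "switch"), (0, "switch"), (1, "stay"), (1, "stay"), (1, "switch")]
pi_hat = behavior_cloning_tabular(demos)
print(pi_hat)                         # {0: 'switch', 1: 'stay'}
print(empirical_risk(pi_hat, demos))  # 0.2
```

The learned table has no entry for states that never appear in the demonstrations, so it says nothing about how to act there.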
In this class, we assume that supervised learning works!
i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim\mathcal D_\star}[\ell(\pi(s), a)]\leq \epsilon\)
For many loss functions, this means that
\(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)
Policy \(\pi\)
Input: Camera Image
Output: Steering Angle
Dataset of expert trajectory: pairs \((x, y)\) → Supervised Learning → Policy \(\pi\)
\(\pi\)(camera image) = steering angle
expert trajectory
learned policy
No training data of "recovery" behavior!
What about assumption \(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)?
An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]
“If the network is not presented with sufficient variability in its training exemplars to cover the conditions it is likely to encounter...[it] will perform poorly”
3. Trajectories and Distributions
Trajectory: \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\)
Example: two-state MDP with states \(0\) and \(1\). Transition labels (action: probability): stay: \(1\); switch: \(1\); stay: \(p_1\), switch: \(1-p_2\); stay: \(1-p_1\), switch: \(p_2\).
First recall Bayes' rule: For events \(A\) and \(B\),
$$ \mathbb P\{A \cap B\} = \mathbb P\{A\}\mathbb P\{B\mid A\} = \mathbb P\{B\} \mathbb P\{A\mid B\}$$
Why?
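One way to see this, as a sketch: the identity is the definition of conditional probability rearranged, assuming \(\mathbb P\{A\}>0\) and \(\mathbb P\{B\}>0\) so that both conditionals are defined.

$$ \mathbb P\{B\mid A\} = \frac{\mathbb P\{A\cap B\}}{\mathbb P\{A\}} \quad\Longrightarrow\quad \mathbb P\{A\cap B\} = \mathbb P\{A\}\,\mathbb P\{B\mid A\} $$

$$ \mathbb P\{A\mid B\} = \frac{\mathbb P\{A\cap B\}}{\mathbb P\{B\}} \quad\Longrightarrow\quad \mathbb P\{A\cap B\} = \mathbb P\{B\}\,\mathbb P\{A\mid B\} $$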
Then we have that
$$\mathbb P_{\mu_0}^\pi(s_0, a_0) = \mathbb P\{s_0\} \mathbb P\{a_0\mid s_0\} = \mu_0(s_0)\pi(a_0\mid s_0)$$
then
$$\mathbb P_{\mu_0}^\pi(s_0, a_0, s_1) = \mathbb P_{\mu_0}^\pi(s_0, a_0)\underbrace{\mathbb P\{s_1\mid a_0, s_0\}}_{P(s_1\mid s_0, a_0)}$$
and so on
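Continuing the factorization, the probability of a whole trajectory is a product of one initial-state term, one policy term per action, and one transition term per step. The sketch below computes that product for the two-state example under the stay policy. The value of `p1`, the dictionary layout, and the function name are assumptions made for illustration; only the stay transitions are defined, since the example policy never chooses switch.

```python
def trajectory_probability(traj, mu0, pi, P):
    """Probability of traj = [s_0, a_0, s_1, ..., s_T] (must end in a state)."""
    states, actions = traj[0::2], traj[1::2]
    prob = mu0[states[0]]                    # mu_0(s_0)
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= pi[s].get(a, 0.0)            # pi(a_t | s_t)
        prob *= P[(s, a)].get(s_next, 0.0)   # P(s_{t+1} | s_t, a_t)
    return prob


p1 = 0.8                                     # hypothetical value of p_1
mu0 = {0: 0.5, 1: 0.5}                       # each state with probability 1/2
pi = {0: {"stay": 1.0}, 1: {"stay": 1.0}}    # pi(s) = stay
P = {(0, "stay"): {0: 1.0},
     (1, "stay"): {1: p1, 0: 1 - p1}}

# mu_0(1) * p_1 * (1 - p_1), approximately 0.08 for p1 = 0.8
print(trajectory_probability([1, "stay", 1, "stay", 0], mu0, pi, P))
```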
Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).
Probability of state \(s\) at time \(t\) $$ \mathbb{P}\{s_t=s\mid \mu_0,\pi\} = \displaystyle\sum_{\substack{s_{0:t-1}\in\mathcal S^t}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, s) $$
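Under stay: state \(0\) stays at \(0\) with probability \(1\); state \(1\) stays at \(1\) with probability \(p_1\) and moves to \(0\) with probability \(1-p_1\).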
Why? First recall that
$$ \mathbb P\{A \cup B\} = \mathbb P\{A\}+\mathbb P\{B\} - \mathbb P\{A\cap B\}$$
If \(A\) and \(B\) are disjoint, the final term is \(0\) by definition.
If all \(A_i\) are disjoint events, then the probability any one of them happens is
$$ \mathbb P\{\cup_{i} A_i\} = \sum_{i} \mathbb P\{A_i\}$$
$$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \quad d_1 = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\quad d_2 = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$
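These vectors can be checked numerically by summing over every state sequence that ends in the state of interest, which is the sum over disjoint events described above. The sketch below assumes the deterministic stay policy, so the actions drop out of the sum, and uses a hypothetical value for `p1`. The function name and dictionary layout are also assumptions.

```python
from itertools import product


def state_marginal(t, s, mu0, P_stay, states=(0, 1)):
    """P{s_t = s}: sum the probability of every state sequence ending in s."""
    total = 0.0
    for prefix in product(states, repeat=t):  # all choices of s_0, ..., s_{t-1}
        seq = prefix + (s,)
        prob = mu0[seq[0]]
        for cur, nxt in zip(seq, seq[1:]):
            prob *= P_stay[cur][nxt]          # P(nxt | cur, stay)
        total += prob
    return total


p1 = 0.8                                      # hypothetical value of p_1
mu0 = {0: 0.5, 1: 0.5}
P_stay = {0: {0: 1.0, 1: 0.0}, 1: {0: 1 - p1, 1: p1}}

for t in range(3):
    print(t, [state_marginal(t, s, mu0, P_stay) for s in (0, 1)])
```

The enumeration visits \(S^t\) sequences, so it is only practical for very short horizons.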
Given a policy \(\pi(\cdot)\) and a transition function \(P(\cdot\mid \cdot,\cdot)\), define the transition matrix \(P_\pi\) whose entry in row \(s\) and column \(s'\) is \(P(s'\mid s,\pi(s))\).
Proposition: The state distribution evolves according to $$ d_t = (P_\pi^t)^\top d_0$$
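Proof (by induction): The base case \(d_0 = (P_\pi^0)^\top d_0\) holds because \(P_\pi^0 = I\). For the inductive step, if \(d_k = (P_\pi^k)^\top d_0\) and \(d_{k+1} = P_\pi^\top d_k\), then \(d_{k+1} = P_\pi^\top (P_\pi^k)^\top d_0 = (P_\pi^{k+1})^\top d_0\).
\(P_\pi^\top\) is the matrix whose entry in row \(s'\) and column \(s\) is \(P(s'\mid s,\pi(s))\).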
Proof of claim that \(d_{k+1} = P_\pi^\top d_k\)
\(d_{k+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_k[s]\)
\(\textcolor{orange}{d_{k+1}[s']} =\langle\)\(\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix}\)\(,\textcolor{red}{d_k}\rangle \)
Each entry of \(d_{k+1}\) is the inner product of a column of \(P_\pi\) with \(d_k\)
in other words, the inner product of a row of \(P_\pi^\top\) with \(d_k\)
By the definition of matrix multiplication, \(d_{k+1} = P_\pi^\top d_k\)
Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).
$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$
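With this matrix, the state distribution can be propagated by repeated multiplication with \(P_\pi^\top\). The sketch below does this in plain Python for the example, assuming a hypothetical value for `p1`; the helper names `step` and `state_distribution` are assumptions.

```python
def step(P_pi, d):
    """One step d_{k+1} = P_pi^T d_k: entry s' sums P_pi[s][s'] * d[s] over s."""
    n = len(d)
    return [sum(P_pi[s][s_next] * d[s] for s in range(n)) for s_next in range(n)]


def state_distribution(P_pi, d0, t):
    """Apply `step` t times, giving d_t = (P_pi^t)^T d_0."""
    d = list(d0)
    for _ in range(t):
        d = step(P_pi, d)
    return d


p1 = 0.8                            # hypothetical value of p_1
P_pi = [[1.0, 0.0],
        [1 - p1, p1]]               # row s, column s'
d0 = [0.5, 0.5]

for t in range(3):
    print(t, state_distribution(P_pi, d0, t))
```

Each step costs on the order of \(S^2\) operations, in contrast to summing over all \(S^t\) state sequences.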
1. Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions