Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap: Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
[Diagram: agent-environment loop. From state $s_t$ the agent takes action $a_t \sim \pi(s_t)$, receives reward $r_t \sim r(s_t, a_t)$, and the environment transitions to state $s_{t+1} \sim P(s_t, a_t)$.]
Markovian Assumption: Conditioned on $s_t, a_t$, the reward $r_t$ and next state $s_{t+1}$ are independent of the past.
When state transitions are stochastic, we will write either $s_{t+1} \sim P(s_t, a_t)$ or $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
Example: a two-state MDP with states 0 and 1 (transition diagram).
Goal: achieve high cumulative reward: $\sum_{t=0}^\infty \gamma^t r_t$
$\max_\pi \ \mathbb{E}\Big[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\Big] \quad \text{s.t.} \quad s_{t+1} \sim P(s_t, a_t),\ a_t \sim \pi(s_t)$
$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$
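As a concrete reference point, here is a minimal simulation of this interaction loop for a tabular MDP, assuming $P$, $r$, and a deterministic policy are stored as numpy arrays (all names are illustrative, not from the slides):

```python
import numpy as np

# Minimal sketch of the MDP interaction loop, assuming a tabular
# MDP M = {S, A, r, P, gamma} stored as numpy arrays:
#   P[s, a, s'] = transition probability, r[s, a] = reward,
#   pi[s] = action of a deterministic policy (names illustrative).

def rollout_return(P, r, pi, mu0, gamma, horizon, rng=np.random.default_rng(0)):
    """Simulate one trajectory and return its discounted reward sum_t gamma^t r_t."""
    S = P.shape[0]
    s = rng.choice(S, p=mu0)          # s_0 ~ mu_0
    total, discount = 0.0, 1.0
    for _ in range(horizon):          # truncate the infinite sum at a finite horizon
        a = pi[s]                     # a_t ~ pi(s_t)  (deterministic here)
        total += discount * r[s, a]   # r_t = r(s_t, a_t)
        s = rng.choice(S, p=P[s, a])  # s_{t+1} ~ P(s_t, a_t)  (Markovian)
        discount *= gamma
    return total
```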
If $|\mathcal{S}| = S$ and $|\mathcal{A}| = A$, then how many deterministic policies are there? PollEv.com/sarahdean011
$\max_\pi \ \mathbb{E}\Big[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\Big] \quad \text{s.t.} \quad s_{t+1} \sim P(s_t, a_t),\ a_t \sim \pi(s_t)$
We will find policies using optimization & learning
1. Recap: Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
Helicopter Acrobatics (Stanford)
LittleDog Robot (LAIRLab at CMU)
An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]
Expert Demonstrations
Supervised ML Algorithm
Policy π
e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks
maps states to actions
Dataset from expert policy $\pi^\star$: $\{(s_i, a_i)\}_{i=1}^N \sim \mathcal{D}^\star$
$\max_\pi \ \mathbb{E}\Big[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\Big] \quad \text{s.t.} \quad s_{t+1} \sim P(s_t, a_t),\ a_t \sim \pi(s_t)$
rather than optimize, imitate!
$\min_\pi \ \sum_{i=1}^N \ell(\pi(s_i), a_i)$
sklearn, torch
$\min_{\pi \in \Pi} \ \sum_{i=1}^N \ell(\pi(s_i), a_i)$
Supervised learning with empirical risk minimization (ERM)
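A minimal behavior-cloning sketch of this ERM step, assuming feature-vector states and discrete actions, with LogisticRegression standing in for any supervised learner; the data and names below are illustrative placeholders, not the slides' example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Expert dataset {(s_i, a_i)}_{i=1}^N ~ D*  (random placeholder data).
rng = np.random.default_rng(0)
S_train = rng.normal(size=(1000, 4))            # states s_i
a_train = (S_train[:, 0] > 0).astype(int)       # expert actions a_i = pi*(s_i)

# "minimize over pi in Pi:  sum_i loss(pi(s_i), a_i)"  via ERM
policy = LogisticRegression().fit(S_train, a_train)

# Estimate the surrogate for E_{s ~ D*}[ 1{pi(s) != pi*(s)} ] on held-out expert data.
S_test = rng.normal(size=(200, 4))
a_test = (S_test[:, 0] > 0).astype(int)
disagreement = np.mean(policy.predict(S_test) != a_test)
print(f"0-1 disagreement with expert: {disagreement:.3f}")
```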
In this class, we assume that supervised learning works!
$\min_{\pi \in \Pi} \ \sum_{i=1}^N \ell(\pi(s_i), a_i)$
Supervised learning with empirical risk minimization (ERM)
i.e. we successfully optimize and generalize, so that the population loss is small: $\mathbb{E}_{s, a \sim \mathcal{D}^\star}[\ell(\pi(s), a)] \le \epsilon$
For many loss functions, this means that
$\mathbb{E}_{s \sim \mathcal{D}^\star}[\mathbf{1}\{\pi(s) \ne \pi^\star(s)\}] \le \epsilon$
Policy π
Input: Camera Image
Output: Steering Angle
[Diagram: dataset of expert trajectory pairs $(x, y)$, fed to a supervised learning algorithm, yields a policy $\pi(\text{image}) = \text{steering angle}$.]
[Figure: expert trajectory vs. learned policy.]
No training data of "recovery" behavior!
What about the assumption $\mathbb{E}_{s \sim \mathcal{D}^\star}[\mathbf{1}\{\pi(s) \ne \pi^\star(s)\}] \le \epsilon$?
An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]
“If the network is not presented with sufficient variability in its training exemplars to cover the conditions it is likely to encounter...[it] will perform poorly”
1. Recap: Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
Trajectory: $s_0, a_0, s_1, a_1, s_2, a_2, \ldots$
[Diagram: two-state MDP. From state 0, stay remains at 0 with probability 1 and switch moves to 1 with probability 1. From state 1, stay remains at 1 with probability $p_1$ (moves to 0 with probability $1-p_1$) and switch moves to 0 with probability $p_2$ (remains at 1 with probability $1-p_2$).]
Trajectory: $s_0, a_0, s_1, a_1, s_2, a_2, \ldots$
First recall Bayes' rule: for events $A$ and $B$,
$\mathbb{P}\{A \cap B\} = \mathbb{P}\{A\}\,\mathbb{P}\{B \mid A\} = \mathbb{P}\{B\}\,\mathbb{P}\{A \mid B\}$
Why?
Then we have that
$\mathbb{P}_{\mu_0}^\pi(s_0, a_0) = \mathbb{P}\{s_0\}\,\mathbb{P}\{a_0 \mid s_0\} = \mu_0(s_0)\,\pi(a_0 \mid s_0)$
then
$\mathbb{P}_{\mu_0}^\pi(s_0, a_0, s_1) = \mathbb{P}_{\mu_0}^\pi(s_0, a_0)\,P(s_1 \mid s_0, a_0)$
and so on
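A small sketch of this chain-rule computation for a tabular MDP, assuming arrays mu0, pi, and P as described in the comments (names illustrative):

```python
import numpy as np

# Chain-rule factorization of a trajectory's probability, assuming a tabular MDP:
# mu0[s] is the initial distribution, pi[a, s] = pi(a | s), and
# P[s, a, s'] = P(s' | s, a).  All names are illustrative.

def trajectory_prob(states, actions, mu0, pi, P):
    """P^pi_{mu0}(s_0, a_0, s_1, a_1, ..., s_t) via repeated application of Bayes' rule."""
    prob = mu0[states[0]]                           # P{s_0} = mu_0(s_0)
    for k, a in enumerate(actions):
        prob *= pi[a, states[k]]                    # pi(a_k | s_k)
        prob *= P[states[k], a, states[k + 1]]      # P(s_{k+1} | s_k, a_k)
    return prob
```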
Trajectory: $s_0, a_0, s_1, a_1, s_2, a_2, \ldots$
Example: $\pi(s) = \text{stay}$ and $\mu_0$ assigns probability 1/2 to each state.
Probability of state $s$ at time $t$: $\mathbb{P}\{s_t = s \mid \mu_0, \pi\} = \sum_{s_{0:t-1} \in \mathcal{S}^t} \mathbb{P}_{\mu_0}^\pi(s_{0:t-1}, s)$
[Diagram: under $\pi(s) = \text{stay}$, state 0 self-loops with probability 1; state 1 remains at 1 with probability $p_1$ and moves to 0 with probability $1 - p_1$.]
Why? First recall that
$\mathbb{P}\{A \cup B\} = \mathbb{P}\{A\} + \mathbb{P}\{B\} - \mathbb{P}\{A \cap B\}$
If A and B are disjoint, the final term is 0 by definition.
If all $A_i$ are disjoint events, then the probability that any one of them happens is $\mathbb{P}\{\cup_i A_i\} = \sum_i \mathbb{P}\{A_i\}$
$d_0 = \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix}, \quad d_1 = \begin{bmatrix} 1 - p_1/2 \\ p_1/2 \end{bmatrix}, \quad d_2 = \begin{bmatrix} 1 - p_1^2/2 \\ p_1^2/2 \end{bmatrix}$
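These values can be checked by brute force, summing trajectory probabilities exactly as in the formula for $\mathbb{P}\{s_t = s \mid \mu_0, \pi\}$ above; the value of $p_1$ below is illustrative:

```python
import numpy as np
from itertools import product

# Brute-force check of d_1 and d_2 for the two-state example with pi = stay:
# state 0 self-loops w.p. 1; state 1 stays w.p. p1 and moves to 0 w.p. 1 - p1.
p1 = 0.7                                    # illustrative value
P_stay = np.array([[1.0, 0.0],              # P(s' | s, stay)
                   [1.0 - p1, p1]])
mu0 = np.array([0.5, 0.5])

def state_dist(t):
    """d_t[s] = sum over s_{0:t-1} of  mu0(s_0) * prod_k P(s_{k+1} | s_k, stay)."""
    d = np.zeros(2)
    for traj in product([0, 1], repeat=t + 1):   # all state sequences s_0, ..., s_t
        prob = mu0[traj[0]]
        for k in range(t):
            prob *= P_stay[traj[k], traj[k + 1]]
        d[traj[-1]] += prob
    return d

print(state_dist(1))   # expected [1 - p1/2,    p1/2   ]
print(state_dist(2))   # expected [1 - p1**2/2, p1**2/2]
```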
Example: $\pi(s) = \text{stay}$ and $\mu_0$ assigns probability 1/2 to each state (same two-state diagram as above).
Given a policy π(⋅) and a transition function P(⋅∣⋅,⋅)
$P_\pi$ is the $S \times S$ matrix whose $(s, s')$ entry is $P(s' \mid s, \pi(s))$ (row $s$, column $s'$).
Proposition: The state distribution evolves according to $d_t = (P_\pi^t)^\top d_0$
Proof: (by induction)
$P_\pi^\top$ has entries $[P_\pi^\top]_{s', s} = P(s' \mid s, \pi(s))$.
Proof of claim that $d_{k+1} = P_\pi^\top d_k$:
$d_{k+1}[s'] = \sum_{s \in \mathcal{S}} P(s' \mid s, \pi(s))\, d_k[s]$
$d_{k+1}[s'] = \left\langle \begin{bmatrix} P(s' \mid 1, \pi(1)) & \ldots & P(s' \mid S, \pi(S)) \end{bmatrix}, \ d_k \right\rangle$
Each entry of $d_{k+1}$ is the inner product of a column of $P_\pi$ with $d_k$, in other words the inner product of a row of $P_\pi^\top$ with $d_k$.
By the definition of matrix multiplication, $d_{k+1} = P_\pi^\top d_k$.
[Diagram: the two-state example under $\pi(s) = \text{stay}$, with the transition probabilities $1$, $1 - p_1$, $p_1$ filled into the entries $P(s' \mid s, \pi(s))$ of $P_\pi$.]
Example: $\pi(s) = \text{stay}$ and $\mu_0$ assigns probability 1/2 to each state.
$P_\pi = \begin{bmatrix} 1 & 0 \\ 1 - p_1 & p_1 \end{bmatrix}$
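Propagating $d_0$ with this matrix reproduces the distributions computed earlier, giving a quick check of the proposition $d_t = (P_\pi^t)^\top d_0$ (again with an illustrative value of $p_1$):

```python
import numpy as np

# Propagate the state distribution with d_t = (P_pi^t)^T d_0,
# using the example matrix P_pi for pi = stay; p1 is illustrative.
p1 = 0.7
P_pi = np.array([[1.0, 0.0],
                 [1.0 - p1, p1]])
d0 = np.array([0.5, 0.5])

d = d0
for t in range(1, 4):
    d = P_pi.T @ d                  # d_t = P_pi^T d_{t-1}
    print(t, d)                     # t=2 gives [1 - p1**2/2, p1**2/2], matching d_2
```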
1. Markov Decision Process
2. Imitation Learning
3. Trajectories and Distributions
By Sarah Dean