Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Outline:
Infinite Horizon Discounted MDP
M={S,A,r,P,γ}
Finite Horizon MDP
M={S,A,r,P,H,μ0} (both MDP tuples are sketched in code after this outline)
ex - Pac-Man as MDP
Optimal Control Problem
ex - UAV as OCP
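As a reference for the two MDP tuples in the outline, here is a minimal sketch (illustrative, not course code) of a tabular infinite-horizon discounted MDP as a Python dataclass; the field names are assumptions made for the example.

# Sketch: M = {S, A, r, P, γ} for a tabular, infinite-horizon discounted MDP.
from dataclasses import dataclass
import numpy as np

@dataclass
class DiscountedMDP:
    n_states: int     # |S|
    n_actions: int    # |A|
    r: np.ndarray     # rewards, shape (|S|, |A|)
    P: np.ndarray     # transition probabilities, shape (|S|, |A|, |S|)
    gamma: float      # discount factor in [0, 1)

# A finite-horizon MDP M = {S, A, r, P, H, μ0} would carry a horizon H and an
# initial state distribution μ0 in place of gamma.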
examples:
Policy results in a trajectory τ=(s0,a0,s1,a1,...)
Summing rewards along the trajectory with discounting gives the discounted return:
r(s0,a0) + γ r(s1,a1) + γ² r(s2,a2) + ... = Σ_{t≥0} γ^t r(st,at)
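A small sketch (illustrative, assuming tabular arrays P and r and a deterministic policy pi, none of which are defined on the slides) of estimating this return by rolling out a policy and truncating the infinite sum:

import numpy as np

def rollout_return(P, r, pi, s0, gamma, T=200, rng=None):
    # P: (S, A, S) transition probabilities, r: (S, A) rewards,
    # pi: length-S array giving a deterministic action per state.
    rng = np.random.default_rng() if rng is None else rng
    s, total, discount = s0, 0.0, 1.0
    for _ in range(T):              # truncate the infinite sum after T steps
        a = pi[s]
        total += discount * r[s, a]
        s = rng.choice(P.shape[2], p=P[s, a])
        discount *= gamma
    return total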
Food for thought:
examples:
Recursive Bellman Expectation Equation:
V^π(s) = E[ r(s,a) + γ V^π(s′) ],  where a ∼ π(s), s′ ∼ P(s,a)
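A sketch of iterative policy evaluation built directly on this recursion (tabular arrays and a stochastic policy pi are assumptions for the example):

import numpy as np

def policy_evaluation(P, r, pi, gamma, tol=1e-8):
    # P: (S, A, S), r: (S, A), pi: (S, A) with pi[s] a distribution over actions.
    V = np.zeros(r.shape[0])
    while True:
        r_pi = np.einsum('sa,sa->s', pi, r)       # E[ r(s,a) ] under pi
        P_pi = np.einsum('sa,sat->st', pi, P)     # state-to-state kernel under pi
        V_new = r_pi + gamma * P_pi @ V           # Bellman expectation backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new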
Recall: Icy navigation (PSet 2, lecture example), Prelim question
Recall: Verifying optimality in Icy Street example, Prelim
Food for thought: rigorous argument for optimal policy?
ex - UAV
Recall: PSet 4 and Prelim question about cumulative cost and stability
Model-Based RL
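One way to make the first step concrete (a sketch under the assumption of a tabular MDP; not the course's code) is to estimate the transition probabilities and mean rewards from logged transitions, after which planning runs in the estimated model:

import numpy as np

def estimate_model(transitions, n_states, n_actions):
    # transitions: iterable of (s, a, r, s_next) tuples collected in the environment.
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = np.maximum(counts.sum(axis=2), 1)    # avoid dividing by zero
    P_hat = counts / visits[:, :, None]           # empirical transition probabilities
    r_hat = reward_sum / visits                   # empirical mean rewards
    return P_hat, r_hat                           # unvisited (s,a) rows stay zero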
On-policy rollout: sample a horizon h with probability ∝ γ^h, then run
at ∼ π(st), rt ∼ r(st, at), st+1 ∼ P(st, at), at+1 ∼ π(st+1), ...
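Reading this as sampling from the discounted state-action occupancy, a sketch (the geometric stopping time and tabular arrays are assumptions for the example):

import numpy as np

def sample_discounted_state_action(P, mu0, pi, gamma, rng=None):
    # P: (S, A, S), mu0: initial state distribution, pi: (S, A) stochastic policy.
    rng = np.random.default_rng() if rng is None else rng
    h = rng.geometric(1.0 - gamma) - 1      # P(h) = (1 - gamma) * gamma^h, h = 0, 1, ...
    s = rng.choice(len(mu0), p=mu0)
    a = rng.choice(P.shape[1], p=pi[s])
    for _ in range(h):                      # roll out: s' ∼ P(s, a), a' ∼ π(s')
        s = rng.choice(P.shape[2], p=P[s, a])
        a = rng.choice(P.shape[1], p=pi[s])
    return s, a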
Food for thought: how to compute off-policy gradient estimate?
max J(θ)  s.t.  dKL(θ, θ0) ≤ δ
≈ max ∇J(θ0)⊤(θ − θ0)  s.t.  (θ − θ0)⊤ F_{θ0} (θ − θ0) ≤ δ
Natural policy gradient update: θ_{i+1} = θ_i + α F_i^{-1} g_i
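A sketch of one such update with a damped Fisher matrix (the damping term is an assumption added to keep the solve well-posed; it is not on the slide):

import numpy as np

def natural_pg_step(theta, g, F, alpha, damping=1e-3):
    # theta, g: length-d arrays; F: (d, d) Fisher information estimate at theta.
    step = np.linalg.solve(F + damping * np.eye(len(theta)), g)   # F^{-1} g
    return theta + alpha * step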
Food for thought: performance/regret of softmax policy?
Imitation Learning with BC (Behavior Cloning)
Food for thought: Expert in LQR setting? (Linear regression)
Supervised learning maps a dataset of expert demonstrations, D = {(xi, yi)}, i = 1, ..., M (observations xi, expert actions yi), to a policy π with π(xi) ≈ yi.
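In the simplest instantiation (a sketch, and the tie-in to the linear-regression food for thought above), behavior cloning with linear features reduces to least squares:

import numpy as np

def behavior_cloning(X, Y):
    # X: (M, d) expert observations, Y: (M, k) expert actions.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W    # learned linear policy: pi(x) = x @ W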
Imitation Learning with DAgger
The DAgger loop:
Supervised learning: fit the policy π to the aggregated dataset D = {(xi, yi)}, i = 1, ..., M
Execute: run π to collect states s0, s1, s2, ...
Query expert: obtain labels π∗(s0), π∗(s1), ...
Aggregate: add (xi = si, yi = π∗(si)) to D, then repeat
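A compact sketch of this loop (env_rollout and expert are hypothetical callables standing in for the environment and the queried expert; the learner is least squares as above):

import numpy as np

def dagger(env_rollout, expert, n_iters, X0, Y0):
    # env_rollout(pi) -> (n, d) array of visited states; expert(s) -> expert action.
    X, Y = X0, Y0
    for _ in range(n_iters):
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)        # supervised learning on D
        pi = lambda s, W=W: s @ W                         # current policy
        states = env_rollout(pi)                          # execute pi
        labels = np.array([expert(s) for s in states])    # query expert
        X = np.vstack([X, states])                        # aggregate into D
        Y = np.vstack([Y, labels])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W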
maximize Ent(π)
s.t. π consistent with expert data
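One common way to make "consistent with expert data" precise (an instantiation not spelled out on the slide) is feature-expectation matching: maximize Ent(π) subject to E_π[f(s,a)] = E_expert[f(s,a)], whose per-state solution takes the softmax form π(a|s) ∝ exp(θ⊤f(s,a)).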