Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap: MDPs and Control
2. MBRL with Query Model
3. Sub-Optimality
4. Model Error
M={S,A,r,P,γ}
Infinite horizon discounted MDP with finite states and actions
maximize_π  E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) ]
s.t.  s_{t+1} ∼ P(s_t, a_t),  a_t ∼ π(s_t)
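As a concrete reference for this recap, here is a minimal sketch of exact policy evaluation for the tabular discounted MDP above, solving the Bellman equation V^π = (I − γP^π)^{-1} r^π; the function and argument names are illustrative, not from the lecture.

```python
import numpy as np

def policy_value(P, r, gamma, pi):
    """Exact V^pi for a tabular discounted MDP.

    P:  (S, A, S) array, P[s, a, s'] = P(s' | s, a)
    r:  (S, A) reward table
    pi: (S,) deterministic policy, pi[s] = action taken in state s
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]   # (S, S): row s is P(. | s, pi(s))
    r_pi = r[np.arange(S), pi]   # (S,):   r(s, pi(s))
    # Bellman equation V = r_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = r_pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```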
minimize_π  ∑_{t=0}^{H−1} c(s_t, a_t)
s.t.  s_{t+1} = f(s_t, a_t),  a_t = π_t(s_t)
M={S,A,c,f,H}
Finite horizon deterministic MDP with continuous states/actions
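For the finite-horizon deterministic formulation, the objective is simply a rollout sum; a minimal sketch, assuming the cost c, dynamics f, and the time-varying policies are supplied as Python callables (illustrative names):

```python
def rollout_cost(f, c, policies, s0, H):
    """Total cost of playing pi_0, ..., pi_{H-1} from s0 under s_{t+1} = f(s_t, a_t)."""
    s, total = s0, 0.0
    for t in range(H):
        a = policies[t](s)   # a_t = pi_t(s_t)
        total += c(s, a)     # accumulate c(s_t, a_t)
        s = f(s, a)          # deterministic transition
    return total
```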
[Figure: agent-environment interaction loop: at each step the agent takes action a_t in state s_t and receives reward r_t.]
d^π_{μ_0,t} denotes the state distribution at time t under policy π with initial distribution μ_0. When the initial state is fixed to a known s_0, i.e. μ_0 = e_{s_0}, we write d^π_{s_0,t}.
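The simulation lemma below takes an expectation over d^π_{s_0}; a common convention, assumed here, is the discounted aggregate d^π_{s_0} = (1−γ) ∑_{t≥0} γ^t d^π_{s_0,t}, which has the closed form sketched below (illustrative function name).

```python
import numpy as np

def discounted_visitation(P_pi, s0, gamma):
    """d^pi_{s0} = (1 - gamma) * sum_t gamma^t d^pi_{s0,t}.

    P_pi: (S, S) transition matrix of the Markov chain induced by pi.
    Uses the identity d^T = (1 - gamma) * e_{s0}^T (I - gamma * P_pi)^{-1}.
    """
    S = P_pi.shape[0]
    e_s0 = np.zeros(S)
    e_s0[s0] = 1.0
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e_s0)
```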
2. MBRL with Query Model
Algorithm: MBRL with Queries
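A minimal sketch of the generative-model (query) approach that this algorithm name suggests: query each (s, a) pair N times, fit the empirical transition model P̂, and plan in the estimated MDP. The planner choice (value iteration), the function names, and the assumption that the reward table is known are illustrative, not the lecture's exact pseudocode.

```python
import numpy as np

def mbrl_with_queries(query, S, A, r, gamma, N, n_vi_iters=1000):
    """Generative-model MBRL sketch.

    query(s, a) -> a sampled next state s' ~ P(. | s, a)
    r: (S, A) known reward table; N: number of queries per (s, a) pair.
    Returns the empirical model P_hat and a greedy policy pi_hat.
    """
    # 1. Estimate the model from N queries per (s, a).
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, query(s, a)] += 1.0
    P_hat /= N

    # 2. Plan in the estimated MDP M_hat = {S, A, r, P_hat, gamma} with value iteration.
    V = np.zeros(S)
    for _ in range(n_vi_iters):
        Q = r + gamma * P_hat @ V   # (S, A) estimated Q-values
        V = Q.max(axis=1)
    return P_hat, Q.argmax(axis=1)  # pi_hat^*: greedy in the estimated model
```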
3. Sub-Optimality
[Figure: two-state MDP with states 0 and 1 and actions stay/switch. From one state the transitions are deterministic (stay: 1, switch: 1); from the other, stay succeeds with probability p1 (else 1−p1) and switch succeeds with probability p2 (else 1−p2).]
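Assuming the two-state figure is meant as a concrete instance for model estimation, here is one way to encode it; the assignment of the probabilistic transitions to state 1 (and the deterministic ones to state 0) is an assumption made for illustration.

```python
import numpy as np

STAY, SWITCH = 0, 1

def two_state_mdp(p1, p2):
    """Transition tensor P[s, a, s'] for the two-state stay/switch example (assumed layout)."""
    P = np.zeros((2, 2, 2))
    P[0, STAY]   = [1.0, 0.0]       # stay: 1  (deterministic)
    P[0, SWITCH] = [0.0, 1.0]       # switch: 1 (deterministic)
    P[1, STAY]   = [1 - p1, p1]     # stay succeeds w.p. p1, otherwise switches
    P[1, SWITCH] = [p2, 1 - p2]     # switch succeeds w.p. p2, otherwise stays
    return P
```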
Simulation Lemma: For a deterministic policy π,
|V̂^π(s_0) − V^π(s_0)| ≤ (γ/(1−γ)²) · E_{s∼d^π_{s_0}}[ ‖P̂(·|s,π(s)) − P(·|s,π(s))‖₁ ]
For a fixed policy, what is the difference in value when computed using P vs. when using P̂?
Here ‖P̂(·|s,π(s)) − P(·|s,π(s))‖₁ = ∑_{s'∈S} |P̂(s'|s,π(s)) − P(s'|s,π(s))|, the total variation (ℓ1) distance between the two distributions over s′.
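One standard vector-notation derivation consistent with the stated bound, sketched here under the assumption that rewards lie in [0, 1] (so ‖V̂^π‖_∞ ≤ 1/(1−γ)), writing P^π for the matrix whose row s is P(·|s,π(s)):

```latex
\begin{align*}
\widehat V^\pi - V^\pi
  &= (I - \gamma P^\pi)^{-1}\big[(I - \gamma P^\pi)\widehat V^\pi - r^\pi\big] \\
  &= \gamma\,(I - \gamma P^\pi)^{-1}\big(\widehat P^\pi - P^\pi\big)\widehat V^\pi,
  && \text{since } \widehat V^\pi = r^\pi + \gamma \widehat P^\pi \widehat V^\pi .
\end{align*}
Taking the $s_0$ coordinate and using $e_{s_0}^\top (I - \gamma P^\pi)^{-1} = \tfrac{1}{1-\gamma}\,(d^\pi_{s_0})^\top$,
\[
\big|\widehat V^\pi(s_0) - V^\pi(s_0)\big|
  \le \frac{\gamma}{1-\gamma}\,
      \mathbb{E}_{s \sim d^\pi_{s_0}}\!\Big[\big\|\widehat P(\cdot\mid s,\pi(s)) - P(\cdot\mid s,\pi(s))\big\|_1\Big]
      \big\|\widehat V^\pi\big\|_\infty
  \le \frac{\gamma}{(1-\gamma)^2}\,
      \mathbb{E}_{s \sim d^\pi_{s_0}}\Big[\big\|\widehat P(\cdot\mid s,\pi(s)) - P(\cdot\mid s,\pi(s))\big\|_1\Big].
\]
```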
An alternative proof is possible without vector notation.
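As an illustrative numerical sanity check (not from the slides), the bound can be verified on a small random MDP with exact policy evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, s0 = 5, 3, 0.9, 0

# Random true/estimated models, rewards in [0, 1], and a fixed deterministic policy.
P     = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']
P_hat = rng.dirichlet(np.ones(S), size=(S, A))
r     = rng.uniform(size=(S, A))
pi    = rng.integers(A, size=S)

idx = np.arange(S)
P_pi, P_hat_pi, r_pi = P[idx, pi], P_hat[idx, pi], r[idx, pi]

# Exact policy evaluation under each model: V = (I - gamma * P_pi)^{-1} r_pi.
V     = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
V_hat = np.linalg.solve(np.eye(S) - gamma * P_hat_pi, r_pi)

# Discounted visitation under the true model and the lemma's right-hand side.
e_s0 = np.zeros(S)
e_s0[s0] = 1.0
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e_s0)
rhs = gamma / (1 - gamma) ** 2 * (d @ np.abs(P_hat_pi - P_pi).sum(axis=1))

assert abs(V_hat[s0] - V[s0]) <= rhs + 1e-12, "simulation lemma bound violated"
print(abs(V_hat[s0] - V[s0]), "<=", rhs)
```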
4. Model Error
Theorem: For 0 < δ < 1, run the MBRL Algorithm with N ≥ 4γ²S²A·log(2SA/δ) / (ϵ²(1−γ)⁴). Then with probability at least 1−δ, V⋆(s) − V^{π̂⋆}(s) ≤ ϵ for all s ∈ S.
Algorithm: Tabular MBRL with Queries
Proof Outline:
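A hedged reconstruction of the standard three-step argument; the constants and the particular concentration inequality are assumptions, not necessarily the ones used in the lecture:

```latex
\begin{enumerate}
  \item \textbf{Model error.} Hoeffding-type concentration plus a union bound over all $(s,a)$
        pairs: with probability at least $1-\delta$, every row of the empirical model satisfies
        $\|\widehat P(\cdot\mid s,a) - P(\cdot\mid s,a)\|_1 \le \varepsilon_P$, where
        $\varepsilon_P$ shrinks roughly like $\sqrt{S \log(SA/\delta)/N_{s,a}}$ in the number of
        queries $N_{s,a}$ spent on that pair.
  \item \textbf{Simulation lemma, applied twice.} For any deterministic policy $\pi$ and state $s$,
        $|\widehat V^\pi(s) - V^\pi(s)| \le \frac{\gamma}{(1-\gamma)^2}\,\varepsilon_P$.
  \item \textbf{Sub-optimality decomposition.}
        \[
          V^\star(s) - V^{\widehat\pi^\star}(s)
          = \underbrace{\big(V^{\pi^\star}(s) - \widehat V^{\pi^\star}(s)\big)}_{\le\, \gamma\varepsilon_P/(1-\gamma)^2}
          + \underbrace{\big(\widehat V^{\pi^\star}(s) - \widehat V^{\widehat\pi^\star}(s)\big)}_{\le\, 0
            \text{ (}\widehat\pi^\star\text{ optimal in }\widehat M\text{)}}
          + \underbrace{\big(\widehat V^{\widehat\pi^\star}(s) - V^{\widehat\pi^\star}(s)\big)}_{\le\, \gamma\varepsilon_P/(1-\gamma)^2}
          \le \frac{2\gamma}{(1-\gamma)^2}\,\varepsilon_P .
        \]
        Choosing $N$ as in the theorem makes the right-hand side at most $\epsilon$.
\end{enumerate}
```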