Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}$
Goal: achieve high expected cumulative reward:
$$\max_{\pi}\ \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 \sim \mu_0,\ s_{t+1} \sim P(s_t, a_t),\ a_t \sim \pi(s_t)\Big]$$
[Figure: agent-environment interaction loop between state space $\mathcal{S}$ and action space $\mathcal{A}$]
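As a quick illustration of the role of $\gamma$ in this objective (an added example, not from the slides): if every step earned reward $1$, the discounted sum would be a geometric series,
$$\sum_{t=0}^{\infty} \gamma^t \cdot 1 = \frac{1}{1 - \gamma},$$
so with rewards bounded by $1$ and $\gamma = 0.9$, no policy can achieve expected cumulative reward above $10$.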
$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}$
Goal: achieve high expected cumulative reward:
$$\max_{\theta}\ J(\theta) = \mathbb{E}_{\tau \sim \mathbb{P}^{\pi_\theta}_{\mu_0}}\big[R(\tau)\big]$$
We can "rollout" policy πθ to observe:
a sample $\tau$ from $\mathbb{P}^{\pi_\theta}_{\mu_0}$, or $s, a \sim d^{\pi_\theta}_{\mu_0}$
the resulting cumulative reward R(τ)
Note: we do not need to know P or r!
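To make the rollout idea concrete, here is a minimal sketch (added, not from the slides) of estimating $J(\theta)$ from sampled trajectories; the env.reset/env.step interface and the policy object are hypothetical stand-ins.

```python
import numpy as np

def rollout(env, policy, gamma, horizon=200):
    """Roll out one trajectory and return its discounted cumulative reward R(tau)."""
    s = env.reset()                    # s_0 ~ mu_0 (hypothetical environment interface)
    total, discount = 0.0, 1.0
    for _ in range(horizon):           # truncate the infinite sum at a finite horizon
        a = policy.sample(s)           # a_t ~ pi_theta(. | s_t)
        s, r, done = env.step(a)       # s_{t+1} ~ P(s_t, a_t), reward r(s_t, a_t)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total

def estimate_J(env, policy, gamma, num_rollouts=100):
    """Monte Carlo estimate of J(theta) = E[R(tau)]; no knowledge of P or r is needed."""
    return float(np.mean([rollout(env, policy, gamma) for _ in range(num_rollouts)]))
```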
Meta-Algorithm: Policy Optimization
Today we will derive an alternative update: $\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i$
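A minimal sketch of this update step, assuming g is a policy gradient estimate and F an estimate of the Fisher information matrix at the current parameters; the damping term is an added assumption for numerical stability, not something stated on the slide.

```python
import numpy as np

def natural_pg_step(theta, g, F, alpha, damping=1e-3):
    """One natural policy gradient update: theta <- theta + alpha * F^{-1} g.

    g: policy gradient estimate at theta
    F: Fisher information matrix estimate at theta
    damping: small ridge term so the linear solve is well-posed (added assumption)
    """
    step = np.linalg.solve(F + damping * np.eye(len(theta)), g)
    return theta + alpha * step
```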
[Figure: running example, a two-state MDP with states 0 and 1 and actions stay and switch; transitions are parameterized by probabilities p1 and p2, and the reward is +1 if s = 0 and −1/2 if a = switch. A companion plot shows the policy's preference between stay and switch as the scalar parameter θ ranges from −∞ to +∞.]
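A small sketch of this running example in code; the sigmoid parameterization of the policy and the exact reward bookkeeping are read off the figure and should be treated as assumptions rather than a transcription of the slide.

```python
import numpy as np

def policy_stay_prob(theta):
    """Probability of choosing 'stay': theta -> +inf prefers stay, theta -> -inf prefers switch
    (assumed sigmoid parameterization)."""
    return 1.0 / (1.0 + np.exp(-theta))

def reward(s, a):
    """+1 for being in state 0, with a -1/2 penalty for choosing 'switch'."""
    return (1.0 if s == 0 else 0.0) + (-0.5 if a == "switch" else 0.0)
```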
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
[Figure: a 2D quadratic function of parameters θ1 and θ2, shown alongside its level sets.]
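One standard way to phrase the idea this figure suggests (a textbook formulation, not transcribed from the slides): maximize a first-order model of the objective only over a region where that model can be trusted,
$$\max_{\delta}\ g^\top \delta \quad \text{subject to} \quad \|\delta\|_2 \le \epsilon,$$
whose solution $\delta^\star = \epsilon\, g / \|g\|_2$ is just the gradient direction with a step size set by the trust region radius. Replacing the Euclidean ball with a different notion of distance between policies changes the direction of the step, which is where the KL divergence enters below.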
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
[Figures: the two-state running example, repeated across several slides in this section.]
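The derivation for this section did not survive extraction; as a reference point, the standard definitions (not a transcription of the slides) are
$$\mathrm{KL}\big(\pi_{\theta}(\cdot \mid s)\,\|\,\pi_{\theta'}(\cdot \mid s)\big) = \sum_{a} \pi_{\theta}(a \mid s)\,\log \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta'}(a \mid s)},$$
and for a small perturbation $\theta' = \theta + \delta$ a second-order expansion gives
$$\mathrm{KL}\big(\pi_{\theta}\,\|\,\pi_{\theta+\delta}\big) \approx \tfrac{1}{2}\,\delta^\top F(\theta)\,\delta,
\qquad F(\theta) = \mathbb{E}_{s,a \sim d^{\pi_\theta}_{\mu_0}}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s)^\top\big],$$
which is how the Fisher information matrix $F$ appearing in the natural PG update arises.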
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
[Figure: level sets of a quadratic, comparing the first-order approximation (gradient g0) with a second-order approximation.]
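Combining the two approximations gives the constrained step that natural PG solves; this is a reconstruction of the standard formulation behind the claim referenced below, not a transcription of the slide:
$$\max_{\delta}\ g^\top \delta \quad \text{subject to} \quad \tfrac{1}{2}\,\delta^\top F\,\delta \le \epsilon,$$
whose maximizer is proportional to $F^{-1} g$, recovering the update $\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i$.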
For proof of claim, refer to
[Figure: the two-state running example.]
Algorithm: Natural PG
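The algorithm body itself did not survive extraction, so below is a minimal sketch of a sample-based natural PG loop consistent with the update above; the rollout helper and the policy's theta attribute are hypothetical interfaces, not the slide's notation.

```python
import numpy as np

def natural_pg(env, policy, gamma, alpha, iterations=100, batch=50, damping=1e-3):
    """Sketch of natural PG: estimate g and F from rollouts, then step theta <- theta + alpha F^{-1} g."""
    theta = policy.theta
    for _ in range(iterations):
        traj_grads, step_grads = [], []
        for _ in range(batch):
            # hypothetical helper: per-step grad-log-probs and the discounted return of one trajectory
            grad_log_probs, R = rollout_grad_log_probs_and_return(env, policy, gamma)
            traj_grads.append(sum(grad_log_probs) * R)   # REINFORCE-style gradient term
            step_grads.extend(grad_log_probs)
        g = np.mean(traj_grads, axis=0)                  # policy gradient estimate g_i
        W = np.array(step_grads)
        F = W.T @ W / len(W)                             # Fisher estimate E[grad log pi grad log pi^T]
        theta = theta + alpha * np.linalg.solve(F + damping * np.eye(len(theta)), g)
        policy.theta = theta                             # write back the new parameters
    return theta
```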
[Figure: the two-state running example, with the plot of the policy's stay/switch preference as a function of θ.]