Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Policy Optimization
3. with Trajectories
4. with Value
Algorithm: SGA (Stochastic Gradient Ascent)
[Figure: level sets of a 2D quadratic function over (θ1, θ2)]
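To make SGA concrete, here is a minimal sketch of stochastic gradient ascent on a 2D quadratic like the one pictured; the specific objective, noise scale, step size, and starting point are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def grad_J(theta):
    # Exact gradient of J(theta) = -theta[0]^2 - theta[1]^2, maximized at the origin
    return -2 * theta

rng = np.random.default_rng(0)
theta = np.array([1.5, -1.0])   # illustrative starting point
alpha = 0.1                     # step size

for i in range(200):
    # g_i is a noisy but unbiased estimate of the gradient at theta_i
    g = grad_J(theta) + rng.normal(scale=0.5, size=2)
    theta = theta + alpha * g   # ascent step: theta_{i+1} = theta_i + alpha * g_i
print(theta)                    # iterates hover near the maximizer [0, 0]
```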
Two gradient estimates for J(θ):
Finite differences: ∇J(θ) ≈ (1/(2δ)) (J(θ+δv) − J(θ−δv)) v
Score function: for J(θ) = E_{z∼Pθ}[h(z)], ∇J(θ) ≈ ∇_θ log(Pθ(z)) h(z)
Example: Pθ = N(θ, 1), so log Pθ(z) ∝ −(1/2)(θ−z)² and ∇_θ log Pθ(z) = (z − θ)
With h(z) = −z², J(θ) = E_{z∼N(θ,1)}[−z²] = −θ² − 1
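The recap example can be checked numerically. The sketch below (sample size and δ are arbitrary choices) compares the finite-difference and score-function estimates for Pθ = N(θ, 1) and h(z) = −z², whose true gradient is ∇J(θ) = −2θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, delta, n = 0.7, 0.1, 100_000

def J_hat(t, size):
    # Monte Carlo estimate of J(t) = E_{z ~ N(t,1)}[-z^2]
    z = rng.normal(loc=t, scale=1.0, size=size)
    return np.mean(-z**2)

# Finite differences (1D, direction v = 1): (J(theta+delta) - J(theta-delta)) / (2*delta)
fd = (J_hat(theta + delta, n) - J_hat(theta - delta, n)) / (2 * delta)

# Score function: average of grad_theta log P_theta(z) * h(z) = (z - theta) * (-z^2)
z = rng.normal(loc=theta, scale=1.0, size=n)
sf = np.mean((z - theta) * (-z**2))

print(fd, sf, -2 * theta)   # both estimates should be close to the true gradient
```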
1. Recap
2. Policy Optimization
3. with Trajectories
4. with Value
M={S,A,r,P,γ,μ0}
Goal: achieve high expected cumulative reward:
max_π E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 ∼ μ_0, s_{t+1} ∼ P(s_t, a_t), a_t ∼ π(s_t) ]
For a parameterized policy πθ, equivalently: max_θ J(θ) = E_{τ∼P^{πθ}_{μ_0}}[R(τ)]
Assume that we can "rollout" policy πθ to observe:
a sample τ from P^{πθ}_{μ_0}
the resulting cumulative reward R(τ)
Note: we do not need to know P or r!
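Concretely, the rollout assumption can be treated as a black box. The sketch below assumes a hypothetical environment object with reset() and step(s, a) methods and a finite horizon cutoff; none of these interface details are from the slides.

```python
import numpy as np

def rollout(env, policy, gamma, horizon=200, rng=None):
    """Roll out a policy and return the trajectory tau and its return R(tau).

    Only samples (s_t, a_t, r_t) are observed; the transition kernel P and the
    reward function r are never needed in closed form.
    """
    rng = rng or np.random.default_rng()
    s = env.reset()                     # s_0 ~ mu_0
    tau, R = [], 0.0
    for t in range(horizon):            # truncate the infinite discounted sum
        a = policy(s, rng)              # a_t ~ pi(s_t)
        s_next, r = env.step(s, a)      # sampled from P and r (unknown to the learner)
        tau.append((s, a, r))
        R += gamma**t * r               # discounted cumulative reward R(tau)
        s = s_next
    return tau, R
```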
Meta-Algorithm: Policy Optimization
In today's lecture, we review four ways to construct the estimates g_i such that E[g_i | θ_i] = ∇J(θ_i)
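Schematically, the meta-algorithm plugs any such estimator into stochastic gradient ascent. The sketch below assumes grad_estimator(θ) returns one stochastic estimate g_i; the interface and hyperparameters are illustrative.

```python
def policy_optimization(theta0, grad_estimator, alpha=0.05, num_iters=100):
    """Meta-algorithm: theta_{i+1} = theta_i + alpha * g_i with E[g_i | theta_i] = grad J(theta_i)."""
    theta = theta0
    for i in range(num_iters):
        g = grad_estimator(theta)    # any of the four constructions of g_i
        theta = theta + alpha * g    # stochastic gradient ascent step
    return theta
```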
1. Recap
2. Policy Optimization
3. with Trajectories
4. with Value
Algorithm: Random Policy Search
We have that E[g_i | θ_i] = ∇J(θ_i) up to the accuracy of the finite-difference approximation
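A minimal sketch of one random-search gradient estimate, assuming a hypothetical helper rollout_return(θ) that rolls out πθ once and returns the cumulative reward R(τ):

```python
import numpy as np

def random_search_gradient(theta, rollout_return, delta=0.1, rng=None):
    """One estimate g = (1/(2*delta)) * (R(theta + delta*v) - R(theta - delta*v)) * v."""
    rng = rng or np.random.default_rng()
    v = rng.normal(size=theta.shape)             # random perturbation direction v
    R_plus = rollout_return(theta + delta * v)   # cumulative reward of the "+" perturbation
    R_minus = rollout_return(theta - delta * v)  # cumulative reward of the "-" perturbation
    return (R_plus - R_minus) / (2 * delta) * v
```

Averaging several such estimates (with a fresh v each time) before taking a step reduces variance; the sampling distribution of v is a design choice not specified here.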
[Diagram: two-state MDP with states 0 and 1 and actions stay, switch; edges labeled stay: 1, switch: 1, stay: p1, switch: 1−p2, stay: 1−p1, switch: p2]
Initialize θ_0(1) = θ_0(2) = 1/2
try a perturbation in favor of "switch", then one in favor of "stay"
update in the direction of the policy which receives more cumulative reward
reward: +1 if s = 0 and −1/2 if a = switch
[Plot: iterates in the (θ(1), θ(2)) plane at i = 0 and i = 1]
Claim: The gradient estimate is unbiased: E[g_i | θ_i] = ∇J(θ_i)
Algorithm: REINFORCE
[Diagram: the same two-state MDP as above]
Initialize θ_0(1) = θ_0(2) = 1/2
rollout, then sum the score over the trajectory: g_0 ∝ [ # times s=1, a=stay ; # times s=1, a=switch ]
The direction of the update depends on the empirical action frequencies; its size depends on R(τ)
reward: +1 if s = 0 and −1/2 if a = switch
Claim: The gradient estimate g_i = ∑_{t=0}^∞ ∇_θ[log π_θ(a_t | s_t)]|_{θ=θ_i} · R(τ) is unbiased:
We have that E[g_i | θ_i] = ∇J(θ_i)
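A sketch of the REINFORCE estimate from a single rollout, for a generic differentiable policy. The trajectory format and the grad_log_pi helper (returning ∇_θ log π_θ(a|s)) are assumptions for illustration.

```python
import numpy as np

def reinforce_gradient(theta, tau, R, grad_log_pi):
    """g = sum_t grad_theta log pi_theta(a_t | s_t) * R(tau), evaluated at the current theta."""
    g = np.zeros_like(theta, dtype=float)
    for (s, a, r) in tau:               # tau is a list of (s_t, a_t, r_t) from one rollout
        g += grad_log_pi(theta, s, a)   # accumulate the score along the trajectory
    return g * R                        # scale by the cumulative reward R(tau)
```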
1. Recap
2. Policy Optimization
3. with Trajectories
4. with Value
Algorithm: Idealized Actor Critic
Claim: The gradient estimate is unbiased: E[g_i | θ_i] = ∇J(θ_i)
The Advantage function is A^{π_{θ_i}}(s, a) = Q^{π_{θ_i}}(s, a) − V^{π_{θ_i}}(s)
Algorithm: Idealized Actor Critic with Advantage
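A common form of this estimator replaces the trajectory return with the advantage evaluated by an idealized critic. The sketch below assumes oracle access to A^{π_{θ_i}} via an advantage(s, a) function and uses a γ^t weighting; both are illustrative conventions rather than details taken from the slides.

```python
import numpy as np

def actor_critic_advantage_gradient(theta, tau, gamma, grad_log_pi, advantage):
    """Sketch: g = sum_t gamma^t * grad_theta log pi_theta(a_t | s_t) * A^{pi_theta}(s_t, a_t)."""
    g = np.zeros_like(theta, dtype=float)
    for t, (s, a, r) in enumerate(tau):
        # advantage(s, a) is an idealized critic returning Q^{pi}(s, a) - V^{pi}(s)
        g += gamma**t * grad_log_pi(theta, s, a) * advantage(s, a)
    return g
```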