Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Natural PG
3. Proximal Policy Opt
4. Review
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]=\mathop{\mathbb E}_{s\sim \mu_0}\left[V^{\pi_\theta}(s)\right]$$
Meta-Algorithm: Policy Optimization
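A minimal Python sketch of this meta-algorithm (not the slide's exact pseudocode; `rollout`, `estimate_gradient`, and the step size `eta` are hypothetical placeholders):

```python
# Generic policy-optimization loop: roll out the current policy,
# estimate a gradient of J(theta), and take an ascent step.
def policy_optimization(theta, rollout, estimate_gradient, eta=0.01, num_iters=100):
    for _ in range(num_iters):
        trajectories = rollout(theta)                    # tau ~ P^{pi_theta}_{mu_0}
        g_hat = estimate_gradient(theta, trajectories)   # estimate of grad_theta J(theta)
        theta = theta + eta * g_hat                      # gradient ascent on J(theta)
    return theta
```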
1. Recap
2. Natural PG
3. Proximal Policy Opt
4. Review
[Figure: first-order approximation of the objective (gradient \(g_0\)) vs. second-order approximation; level sets of the quadratic.]
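To make the figure's labels concrete, here is the standard natural-gradient construction (an assumption about what the quadratic refers to; \(\eta\) denotes the step size):
$$J(\theta)\approx J(\theta_0)+g_0^\top(\theta-\theta_0),\qquad d_{KL}(\theta,\theta_0)\approx \tfrac{1}{2}(\theta-\theta_0)^\top F_{\theta_0}(\theta-\theta_0)$$
$$F_{\theta_0}=\mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\mathbb E_{a\sim \pi_{\theta_0}(s)}\left[\nabla_\theta \log\pi_{\theta_0}(a|s)\,\nabla_\theta \log\pi_{\theta_0}(a|s)^\top\right],\qquad \theta \leftarrow \theta_0+\eta F_{\theta_0}^{-1}g_0$$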
For proof of claim, refer to
[Figure: example MDP with two states \(0\) and \(1\) and actions stay/switch; transitions labeled with probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\); reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch.]
Algorithm: Natural PG
In practice, it is common to use minibatches of samples from \( d^{\pi_i}_{\mu_0}\)
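A sample-based sketch of one natural PG update, assuming a minibatch of \((s, a, \hat A)\) tuples drawn as above and a function returning \(\nabla_\theta\log\pi_\theta(a|s)\); the helper names and the damping term are illustrative, not the course's reference implementation:

```python
import numpy as np

def natural_pg_step(theta, grad_log_pi, batch, eta=0.1, damping=1e-3):
    """One natural policy gradient update from a minibatch.

    theta:        policy parameters, shape (d,)
    grad_log_pi:  function (theta, s, a) -> grad of log pi_theta(a|s), shape (d,)
    batch:        list of (s, a, advantage) tuples sampled from d^{pi_i}_{mu_0}
    """
    d = theta.shape[0]
    g = np.zeros(d)        # policy gradient estimate
    F = np.zeros((d, d))   # Fisher information estimate
    for s, a, adv in batch:
        score = grad_log_pi(theta, s, a)
        g += adv * score
        F += np.outer(score, score)
    g /= len(batch)
    F /= len(batch)
    # Natural gradient direction: solve F x = g (damped for numerical stability).
    x = np.linalg.solve(F + damping * np.eye(d), g)
    return theta + eta * x
```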
[Figure: the same example MDP (states \(0\) and \(1\); actions stay/switch; transition probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\); reward \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch), alongside an axis for the policy parameter \(\theta\) from \(-\infty\) to \(+\infty\), labeled with stay and switch at the extremes.]
1. Recap
2. Natural PG
3. Proximal Policy Opt
4. Review
Local objective has the same gradient as \(J(\theta)\) when \(\theta=\theta_0\)$$\nabla_{\theta} \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a) \right] \right]= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\nabla_{\theta}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a) \right] \right]$$
Using the importance weighting trick from Lecture 16
$$= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[\nabla_{\theta}\log\pi_\theta(a|s)A^{\pi_{\theta_0}}(s,a) \right] \right]$$
If \(\theta=\theta_0\), this is the gradient expression from Actor-Critic with Advantage (Lecture 17)
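For completeness, the per-state step behind the equality above uses the score-function identity \(\nabla_\theta \pi_\theta(a|s)=\pi_\theta(a|s)\nabla_\theta\log\pi_\theta(a|s)\) (written for a discrete action space):
$$\nabla_\theta \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a)\right]=\sum_a \nabla_\theta\pi_\theta(a|s)\,A^{\pi_{\theta_0}}(s,a)=\sum_a \pi_\theta(a|s)\,\nabla_\theta\log\pi_\theta(a|s)\,A^{\pi_{\theta_0}}(s,a)$$
which is the inner expectation in the display above.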
$$\max_\theta\quad J(\theta;\theta_0)-\lambda d_{KL}(\theta, \theta_0)$$
$$\max_\theta\quad \mathbb E_{s,a\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \frac{\pi_{\theta}(a|s) }{\pi_{\theta_0}(a|s) }A^{\pi_{\theta_0}}(s,a) \right]+\lambda \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log{\pi_\theta(a|s)}\right]$$
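The second term comes from expanding the KL penalty, assuming \(d_{KL}(\theta,\theta_0)\) denotes the state-averaged KL divergence from \(\pi_{\theta_0}\) to \(\pi_\theta\) (consistent with the sampled form above), and dropping the part that is constant in \(\theta\):
$$d_{KL}(\theta,\theta_0)=\mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[\mathrm{KL}\big(\pi_{\theta_0}(\cdot|s)\,\big\|\,\pi_\theta(\cdot|s)\big)\right]=\mathbb E_{s,a\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[\log\pi_{\theta_0}(a|s)-\log\pi_\theta(a|s)\right]$$
so \(-\lambda\, d_{KL}(\theta,\theta_0)\) contributes \(+\lambda\,\mathbb E_{s,a\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[\log\pi_\theta(a|s)\right]\) up to an additive constant in \(\theta\).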
Algorithm: Idealized PPO
In practice, estimate \(\hat A^{\pi_{\theta_i}}\) and use minibatches of samples from \( d^{\pi_i}_{\mu_0}\)
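A minimal PyTorch-style sketch of this objective on one minibatch; it assumes `policy(states)` returns a `torch.distributions.Categorical` over actions and that `states`, `actions`, `advantages`, and `old_log_probs` were collected under \(\pi_{\theta_i}\) (all names are illustrative):

```python
import torch

def idealized_ppo_loss(policy, states, actions, advantages, old_log_probs, lam=0.01):
    """KL-regularized surrogate: importance-weighted advantage plus lambda * E[log pi_theta]."""
    dist = policy(states)
    log_probs = dist.log_prob(actions)            # log pi_theta(a|s)
    ratio = torch.exp(log_probs - old_log_probs)  # pi_theta(a|s) / pi_{theta_i}(a|s)
    surrogate = (ratio * advantages).mean()       # importance-weighted advantage term
    kl_term = lam * log_probs.mean()              # KL penalty, up to a theta-independent constant
    return -(surrogate + kl_term)                 # negated so an optimizer can minimize it
```

In a training loop one would step, e.g., `torch.optim.Adam` on this loss over a few passes through the collected batch before refreshing the data with the updated policy.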
1. Recap
2. Natural PG
3. Proximal Policy Opt
4. Review
Food for thought: how to compute an off-policy gradient estimate?
Food for thought: compare the bias and variance of different gradient estimates or supervised learning labels.