Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap: Policy Optimization
2. Gradients with Q/Value
3. Trust Regions
4. KL Divergence
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
Policy \(\pi_\theta\) parametrized by \(\theta\) (e.g. deep network)
Assume that we can "rollout" policy \(\pi_\theta\) to observe:
a sample \(\tau = (s_0, a_0, s_1, a_1, \dots)\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)
the resulting cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)
Note: we do not need to know \(P\)! (It is also easy to extend to the case where we don't know \(r\).)
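To make the rollout assumption concrete, here is a minimal sketch of sampling \(\tau\) and computing its (truncated) discounted return. The environment interface `reset()`/`step(a)` and the sampling policy `pi(s)` are hypothetical, assumed only for illustration:

```python
def rollout_return(env, pi, gamma=0.99, horizon=1000):
    """Sample a trajectory by rolling out pi and return the truncated
    discounted cumulative reward sum_{t < horizon} gamma^t r(s_t, a_t).

    env is a hypothetical interface with reset() -> s and step(a) -> (s', r, done);
    pi(s) samples an action. Note that neither P nor r needs to be known.
    """
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)                   # a_t ~ pi(. | s_t)
        s, r, done = env.step(a)    # r_t = r(s_t, a_t), s_{t+1} ~ P(. | s_t, a_t)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total
```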
Meta-Algorithm: Policy Optimization
Last time, we discussed two algorithms (Random Search and REINFORCE) for estimating gradients using a trajectory
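As a reminder of the shape of the REINFORCE estimator, here is a minimal sketch. The trajectory container and the `grad_log_pi` helper are hypothetical (e.g. obtained via autodiff), not objects defined in the slides:

```python
import numpy as np

def reinforce_gradient(trajectory, grad_log_pi, gamma=0.99):
    """REINFORCE (score-function) estimate of grad J(theta) from one trajectory.

    trajectory:   list of (s_t, a_t, r_t) tuples from rolling out pi_theta,
                  truncated at a finite horizon
    grad_log_pi:  hypothetical helper (s, a) -> np.ndarray, the gradient of
                  log pi_theta(a|s) with respect to theta
    """
    # R(tau) = sum_t gamma^t r_t
    R = sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
    # likelihood-ratio term: sum_t grad log pi_theta(a_t | s_t)
    score = sum(grad_log_pi(s, a) for s, a, _ in trajectory)
    return R * score
```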
[Figure: policy network mapping a state \(s\) to action probabilities \(\pi(a_1|s), \dots, \pi(a_A|s)\)]
[Figure: two-state example MDP with states \(0\) and \(1\); transition probabilities labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\); reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch]
[Figure: policy parametrized by \(\theta\in(-\infty,+\infty)\), interpolating between the actions stay and switch]
1. Recap: Policy Optimization
2. Gradients with Q/Value
3. Trust Regions
4. KL Divergence
Rollout: \(s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
Algorithm: Idealized Actor Critic
Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
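As a sketch (not necessarily the exact \(g_i\) of the algorithm box), one standard Q-weighted score-function estimator consistent with the policy gradient theorem is below; `grad_log_pi` and the `q_fn` oracle are hypothetical helpers assumed for illustration:

```python
def actor_critic_gradient(trajectory, grad_log_pi, q_fn, gamma=0.99):
    """Q-weighted ("critic") policy gradient estimate along one rollout:
        sum_t gamma^t * Q^{pi_theta}(s_t, a_t) * grad log pi_theta(a_t | s_t)

    grad_log_pi: hypothetical helper (s, a) -> gradient of log pi_theta(a|s)
    q_fn:        oracle for Q^{pi_theta}(s, a); "idealized" means the true
                 value is assumed to be available rather than estimated
    """
    g = 0.0
    for t, (s, a, _) in enumerate(trajectory):
        g = g + (gamma ** t) * q_fn(s, a) * grad_log_pi(s, a)
    return g
```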
Rollout: \(s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
The Advantage function is \(A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)\)
Algorithm: Idealized Actor Critic with Advantage
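Replacing \(Q^{\pi_{\theta_i}}\) with the advantage \(A^{\pi_{\theta_i}}\) amounts to subtracting the baseline \(V^{\pi_{\theta_i}}(s)\), which leaves the gradient estimate unbiased: for any fixed state \(s\),
$$\mathop{\mathbb E}_{a\sim \pi_\theta(\cdot|s)}\left[V^{\pi_\theta}(s)\,\nabla_\theta \log \pi_\theta(a|s)\right] = V^{\pi_\theta}(s)\,\nabla_\theta \sum_{a} \pi_\theta(a|s) = V^{\pi_\theta}(s)\,\nabla_\theta (1) = 0,$$
so the baseline term contributes zero in expectation while typically reducing variance.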
1. Recap: Policy Optimization
2. Gradients with Q/Value
3. Trust Regions
4. KL Divergence
[Figure: a 2D quadratic function over \((\theta_1, \theta_2)\) and its level sets]
1. Recap: Policy Optimization
2. Gradients with Q/Value
3. Trust Regions
4. KL Divergence
[Figure: the two-state example MDP (states \(0\) and \(1\); stay/switch transition probabilities \(p_1, p_2\); reward \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch), repeated across several slides in the KL divergence discussion]
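For reference, an illustrative specialization of the general definition \(KL(P\,\|\,Q) = \mathbb E_{x\sim P}\left[\log \frac{P(x)}{Q(x)}\right]\): the KL divergence between the action distributions of two policies at a state \(s\) in the two-action example, writing \(p = \pi_\theta(\text{stay}\mid s)\) and \(q = \pi_{\theta'}(\text{stay}\mid s)\), is
$$KL\big(\pi_\theta(\cdot\mid s)\,\|\,\pi_{\theta'}(\cdot\mid s)\big) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.$$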