Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
For a parameterized policy \(\pi_\theta\), the goal becomes:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
where \(R(\tau)=\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\) is the cumulative reward of trajectory \(\tau\) and \(\mathbb P^{\pi_\theta}_{\mu_0}\) is the trajectory distribution induced by \(\pi_\theta\) from initial distribution \(\mu_0\).
We can "rollout" policy \(\pi_\theta\) to observe:
a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\) or \(s,a\sim d^{\pi_\theta}_{\mu_0}\)
the resulting cumulative reward \(R(\tau)\)
Note: we do not need to know \(P\) or \(r\)!
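To make this access model concrete, here is a minimal Monte Carlo sketch of estimating \(J(\theta)\) from rollouts. The `env.reset`/`env.step` interface and the `policy` callable are illustrative assumptions; note that only sampled states, actions, and rewards are used, never \(P\) or \(r\) in closed form.

```python
import numpy as np

def estimate_J(env, policy, gamma=0.99, horizon=500, n_rollouts=100):
    """Monte Carlo estimate of J(theta) from rollouts.

    Assumed (hypothetical) interface: env.reset() -> s,
    env.step(a) -> (s, r, done), policy(s) -> sampled action.
    """
    returns = []
    for _ in range(n_rollouts):
        s = env.reset()
        R, discount = 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite horizon
            a = policy(s)
            s, r, done = env.step(a)
            R += discount * r             # accumulate gamma^t * r(s_t, a_t)
            discount *= gamma
            if done:
                break
        returns.append(R)
    return np.mean(returns)               # sample average approximates J(theta)
```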
Meta-Algorithm: Policy Optimization
Initialize \(\theta_0\); for \(i=0,1,2,\dots\): roll out \(\pi_{\theta_i}\), estimate the gradient \(g_i \approx \nabla J(\theta_i)\), and update \(\theta_{i+1} = \theta_i + \alpha g_i\).
Today we will derive an alternative update: \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\), where \(F_i\) is the Fisher information matrix of the policy.
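In code, the only change from the vanilla update is a linear solve against \(F_i\). A minimal sketch, assuming \(g_i\) and \(F_i\) have already been estimated; the damping term is a common practical addition, not part of the idealized update:

```python
import numpy as np

def vanilla_step(theta, g, alpha=0.1):
    # theta_{i+1} = theta_i + alpha * g_i
    return theta + alpha * g

def natural_step(theta, g, F, alpha=0.1, damping=1e-3):
    # theta_{i+1} = theta_i + alpha * F_i^{-1} g_i
    # Solve (F + damping*I) x = g rather than forming the inverse explicitly.
    x = np.linalg.solve(F + damping * np.eye(len(g)), g)
    return theta + alpha * x
```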
[Figure: two-state MDP example. States \(0\) and \(1\); actions stay and switch, with the policy's stay/switch probabilities parameterized by \(p_1\) and \(p_2\). Reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch.]
[Plot: the policy's preference for stay vs. switch as the parameter \(\theta\) ranges from \(-\infty\) to \(+\infty\).]
2. Trust Regions
[Figure: a 2D quadratic function in \((\theta_1, \theta_2)\) and the level sets of the quadratic.]
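The figure's point can be reproduced numerically: on an ill-conditioned quadratic, plain gradient steps crawl along the flat direction of the level sets, while a step preconditioned by the curvature heads straight to the minimum. A small sketch (the specific matrix \(A\) and step size are illustrative choices):

```python
import numpy as np

A = np.diag([1.0, 25.0])          # ill-conditioned 2D quadratic f(x) = 0.5 x^T A x
grad = lambda x: A @ x

x_gd = np.array([1.0, 1.0])
x_pre = np.array([1.0, 1.0])
alpha = 0.03                       # must be small to stay stable in the steep direction
for _ in range(50):
    x_gd = x_gd - alpha * grad(x_gd)                  # plain gradient step
    x_pre = x_pre - np.linalg.solve(A, grad(x_pre))   # curvature-preconditioned step

print(x_gd)    # still far from the optimum in the flat direction
print(x_pre)   # reaches the minimum after a single preconditioned step
```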
3. KL Divergence
[Figure: three copies of the two-state MDP example (states \(0\) and \(1\); actions stay/switch with probabilities \(p_1\), \(p_2\); reward \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch), joined by \(+\) signs.]
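In this example the policy's action distribution at each state is a Bernoulli over {stay, switch}, so the KL divergence between two policies at a state has a simple closed form, and per-state terms can be summed across states, which is what the repeated diagrams joined by \(+\) suggest. A minimal sketch:

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL( Bern(p) || Bern(q) ) in nats, for 0 < p, q < 1."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# KL between two policies at a single state, e.g. P(stay) = p vs. p'
print(kl_bernoulli(0.9, 0.8))   # small: the policies act similarly
print(kl_bernoulli(0.9, 0.1))   # large: the policies disagree sharply
```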
4. Natural PG
[Figure: first-order approximation (gradient \(g_0\)) and second-order approximation of the objective, drawn over the level sets of the quadratic.]
For a proof of the claim, refer to
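The claim in question is, in the standard natural-gradient development, that the KL divergence between nearby policies is locally quadratic with the Fisher information as its Hessian: \(KL(\pi_\theta \,\|\, \pi_{\theta+\delta}) \approx \frac{1}{2}\delta^\top F(\theta)\,\delta\). This can be checked numerically for a one-parameter Bernoulli policy \(p=\sigma(\theta)\), whose Fisher information is \(\sigma(\theta)(1-\sigma(\theta))\):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta, delta = 0.7, 1e-2
p, q = sigmoid(theta), sigmoid(theta + delta)
F = p * (1 - p)                   # Fisher information of Bern(sigmoid(theta))

print(kl_bernoulli(p, q))         # exact KL divergence
print(0.5 * F * delta**2)         # second-order approximation: nearly equal
```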
Algorithm: Natural PG
Initialize \(\theta_0\); for \(i=0,1,2,\dots\): roll out \(\pi_{\theta_i}\), estimate the gradient \(g_i\) and Fisher matrix \(F_i\), and update \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\).
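Putting the pieces together, here is a minimal sketch of the loop. The empirical Fisher matrix is the average outer product of score vectors \(\nabla_\theta \log \pi_\theta(a\mid s)\) over sampled state-action pairs; the rollout and estimation routines (`collect_rollouts`, `score`, `estimate_gradient`) are hypothetical interfaces, not part of the lecture's pseudocode.

```python
import numpy as np

def natural_pg(theta0, collect_rollouts, score, estimate_gradient,
               n_iters=100, alpha=0.1, damping=1e-3):
    """Natural policy gradient: theta <- theta + alpha * F^{-1} g.

    Hypothetical interfaces:
      collect_rollouts(theta) -> list of (s, a, advantage) samples
      score(theta, s, a)      -> grad_theta log pi_theta(a | s)
      estimate_gradient(theta, samples) -> policy gradient estimate g
    """
    theta = np.array(theta0, dtype=float)
    d = theta.size
    for _ in range(n_iters):
        samples = collect_rollouts(theta)
        g = estimate_gradient(theta, samples)
        # Empirical Fisher: average outer product of score vectors.
        F = np.zeros((d, d))
        for s, a, _ in samples:
            u = score(theta, s, a)
            F += np.outer(u, u)
        F /= len(samples)
        # Natural gradient step via a damped linear solve.
        theta = theta + alpha * np.linalg.solve(F + damping * np.eye(d), g)
    return theta
```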
[Figure: the two-state MDP example revisited, with the plot of the policy's stay/switch preference as \(\theta\) ranges from \(-\infty\) to \(+\infty\).]
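To see why the ordinary gradient stalls at the extremes of this plot while the natural gradient does not, consider a simplified one-step reduction of the example (an illustrative assumption, not the full MDP): the policy switches with probability \(\sigma(\theta)\) and switching costs \(\frac{1}{2}\). The ordinary gradient carries a factor \(\sigma(\theta)(1-\sigma(\theta))\) that vanishes as \(\theta\to\pm\infty\), but this factor is exactly the Fisher information, so the natural gradient stays constant:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

for theta in [-8.0, -2.0, 0.0, 2.0, 8.0]:
    p = sigmoid(theta)            # probability of switching
    grad_J = -0.5 * p * (1 - p)   # d/dtheta of J(theta) = const - 0.5 * sigmoid(theta)
    F = p * (1 - p)               # Fisher information of the Bernoulli policy
    natural = grad_J / F          # F^{-1} g: constant -0.5 at every theta
    print(f"theta={theta:+.0f}  grad={grad_J:+.6f}  natural grad={natural:+.2f}")
```

The printout shows the vanilla gradient shrinking toward zero as the policy saturates, while the natural gradient step is the same size everywhere, which is precisely the pathology the \(F_i^{-1}\) preconditioning is meant to fix.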