Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Policy Optimization
3. with Trajectories
4. with Value
Algorithm: SGA
[Figure: a 2D quadratic function and its level sets, over coordinates \(\theta_1\) and \(\theta_2\)]
Zeroth-order (finite difference) estimate:
$$\nabla J(\theta) \approx \frac{1}{2\delta} \left(J(\theta+\delta v) - J(\theta-\delta v)\right)v$$
Score function estimate, for an objective of the form \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\):
$$\nabla J(\theta) \approx \underbrace{\nabla_\theta \log P_\theta(z)}_{\text{score}}\, h(z)$$
Example: \(P_\theta = \mathcal N(\theta, 1)\) and \(h(z) = -z^2\)
\(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), so \(\nabla_\theta \log P_\theta(z)= (z-\theta)\)
\(J(\theta) = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2] = -\theta^2 - 1\)
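To make the recap concrete, here is a minimal sketch (not from the lecture; Python/NumPy, with the sample sizes and step \(\delta\) chosen arbitrarily) comparing the two estimates on the Gaussian example above, where the true gradient is \(-2\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):
    # Closed form for the example: E_{z ~ N(theta,1)}[-z^2] = -theta^2 - 1
    return -theta**2 - 1

def zeroth_order_estimate(theta, delta=0.1):
    # Two-point finite difference along a random direction v (1-D, so v = +/-1)
    v = rng.choice([-1.0, 1.0])
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v

def score_function_estimate(theta):
    # Sample z ~ N(theta, 1); score is (z - theta), h(z) = -z^2
    z = rng.normal(theta, 1.0)
    return (z - theta) * (-z**2)

theta = 2.0
print("true gradient  :", -2 * theta)
print("zeroth-order   :", np.mean([zeroth_order_estimate(theta) for _ in range(10_000)]))
print("score function :", np.mean([score_function_estimate(theta) for _ in range(10_000)]))
```

Averaged over many samples, both recover \(-2\theta\); note that the zeroth-order estimate only queries \(J\), while the score-function estimate only uses samples \(z\) and the score.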
1. Recap
2. Policy Optimization
3. with Trajectories
4. with Value
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
Equivalently, in terms of trajectories \(\tau\) drawn from \(\mathbb P^{\pi_\theta}_{\mu_0}\) with cumulative reward \(R(\tau)\):
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
Assume that we can "rollout" policy \(\pi_\theta\) to observe:
a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)
the resulting cumulative reward \(R(\tau)\)
Note: we do not need to know \(P\) or \(r\)!
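As an illustration of this rollout assumption (hypothetical, not the lecture's code), the helper below is reused in the later sketches; it assumes a simple environment interface where `reset()` returns a state and `step(a)` returns `(next_state, reward, done)`, and it truncates the infinite discounted sum at a finite `horizon`.

```python
def rollout(env, policy, gamma, horizon=500):
    """Sample a trajectory tau ~ P^{pi_theta}_{mu_0} and its cumulative reward R(tau).

    Only interaction with env is needed; the transition kernel P and the
    reward function r are never accessed directly.
    """
    tau, R, discount = [], 0.0, 1.0
    s = env.reset()
    for _ in range(horizon):
        a = policy(s)                      # a_t ~ pi_theta(s_t)
        s_next, r, done = env.step(a)      # observe s_{t+1} and r(s_t, a_t)
        tau.append((s, a, r))
        R += discount * r                  # accumulate gamma^t r(s_t, a_t)
        discount *= gamma
        s = s_next
        if done:
            break
    return tau, R
```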
Meta-Algorithm: Policy Optimization
In today's lecture, we review four ways to construct the estimates \(g_i\) such that $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$
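A minimal sketch of the meta-algorithm (step size `alpha` and iteration count `N` are illustrative hyperparameters): stochastic gradient ascent where `grad_estimate` can be any of the estimators reviewed below.

```python
import numpy as np

def policy_optimization(theta0, grad_estimate, alpha=0.01, N=1000):
    """Meta-algorithm: stochastic gradient ascent on J(theta),
    where grad_estimate(theta) returns g with E[g | theta] = grad J(theta)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(N):
        g = grad_estimate(theta)       # unbiased gradient estimate g_i
        theta = theta + alpha * g      # ascent step: theta_{i+1} = theta_i + alpha * g_i
    return theta
```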
1. Recap
2. Policy Optimization
3. with Trajectories
4. with Value
Algorithm: Random Policy Search
We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\), up to the accuracy of the finite-difference approximation
[Diagram: two-state MDP with states \(0\) and \(1\); transitions labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\); iterations \(i=0\) and \(i=1\) marked]
Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)
try a perturbation in favor of "switch", then one in favor of "stay"
update in the direction of the perturbed policy that receives more cumulative reward (sketched in code below)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
[Plot: iterates in the \((\theta^{(1)}, \theta^{(2)})\) plane]
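A hedged sketch of one Random Policy Search gradient estimate, reusing the hypothetical `rollout` helper above; `make_policy` (mapping parameters \(\theta\) to a policy sampler), the Gaussian perturbation direction, and the scale `delta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search_estimate(theta, env, make_policy, gamma, delta=0.1):
    """Two-point finite-difference estimate from rollouts:
    g = (R(tau_+) - R(tau_-)) / (2 delta) * v, where tau_+ and tau_- are
    rolled out under the perturbed policies pi_{theta + delta v}, pi_{theta - delta v}."""
    v = rng.standard_normal(theta.shape)
    _, R_plus = rollout(env, make_policy(theta + delta * v), gamma)
    _, R_minus = rollout(env, make_policy(theta - delta * v), gamma)
    return (R_plus - R_minus) / (2 * delta) * v
```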
Claim: The gradient estimate is unbiased: \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
Algorithm: REINFORCE
[Diagram: the same two-state MDP as above]
Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)
roll out, then sum the score over the trajectory: $$g_0 \propto \begin{bmatrix} \text{\# times } s=1,a=\mathsf{stay} \\ \text{\# times } s=1,a=\mathsf{switch} \end{bmatrix} $$
The direction of the update depends on the empirical action frequencies; its magnitude depends on \(R(\tau)\)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
Claim: The gradient estimate \(g_i=\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\) is unbiased
We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
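A hedged sketch of the REINFORCE estimate \(g_i=\sum_{t} \nabla_\theta\log \pi_\theta(a_t|s_t)\,R(\tau)\), again reusing the hypothetical `rollout` helper; `grad_log_pi(theta, s, a)`, assumed to return \(\nabla_\theta \log \pi_\theta(a|s)\), is an illustrative name.

```python
import numpy as np

def reinforce_estimate(theta, env, policy, grad_log_pi, gamma):
    """REINFORCE: g = (sum_t grad_theta log pi_theta(a_t | s_t)) * R(tau)."""
    tau, R = rollout(env, policy, gamma)
    score_sum = np.zeros_like(theta)
    for (s, a, _) in tau:
        score_sum += grad_log_pi(theta, s, a)
    return score_sum * R
```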
1. Recap
2. Policy Optimization
3. with Trajectories
4. with Value
...
Algorithm: Idealized Actor Critic
Claim: The gradient estimate is unbiased: \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
The Advantage function is \(A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)\)
Algorithm: Idealized Actor Critic with Advantage
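As an illustrative sketch only (the estimator itself is elided in the slides above): one standard way to write the idealized actor-critic estimate replaces \(R(\tau)\) in REINFORCE with per-step values from an oracle critic, here the advantage \(A^{\pi_{\theta_i}}(s_t,a_t)\) with a discount weighting; the helper names `rollout`, `grad_log_pi`, and `A` are assumptions carried over from the earlier sketches.

```python
import numpy as np

def actor_critic_estimate(theta, env, policy, grad_log_pi, A, gamma):
    """Idealized actor-critic with advantage (one standard discounted form):
        g = sum_t gamma^t * grad_theta log pi_theta(a_t | s_t) * A(s_t, a_t)
    where A is oracle access to the advantage of the current policy."""
    tau, _ = rollout(env, policy, gamma)
    g, discount = np.zeros_like(theta), 1.0
    for (s, a, _) in tau:
        g += discount * grad_log_pi(theta, s, a) * A(s, a)
        discount *= gamma
    return g
```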