Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Policy Optimization
3. REINFORCE
4. Value-based Gradients
Algorithm: SGA
Algorithm: One Point Random Search
\(\nabla J(\theta) \approx g = \frac{1}{2\delta} J(\theta + \delta v)\, v\)
[Figure: plot of \(J(\theta) = -\theta^2 - 1\) as a function of \(\theta\)]
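To make the recap concrete, here is a minimal Python sketch (my own illustration, not code from the lecture) of stochastic gradient ascent driven by a one-point random-search gradient estimate on the example \(J(\theta)=-\theta^2-1\). The step size, perturbation scale, and iteration count are arbitrary, and the estimate is scaled by \(1/\delta\) (for a Rademacher direction this matches the symmetric finite difference in expectation), whereas the slide shows a \(1/(2\delta)\) factor.

```python
import numpy as np

def J(theta):
    # objective from the slide's example: J(theta) = -theta^2 - 1
    return -theta**2 - 1.0

rng = np.random.default_rng(0)
theta, delta, eta = 2.0, 0.1, 0.001   # illustrative starting point and hyperparameters

for i in range(2000):
    v = rng.choice([-1.0, 1.0])           # random direction (Rademacher, since theta is scalar)
    g = J(theta + delta * v) * v / delta  # one-point random-search gradient estimate
    theta += eta * g                      # stochastic gradient ascent step

print(theta)  # noisily approaches the maximizer theta = 0
```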
1. Recap
2. Policy Optimization
3. REINFORCE
4. Value-based Gradients
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
Assume that we can "rollout" policy \(\pi_\theta\) to observe:
a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)
the resulting cumulative reward \(R(\tau)\)
Note: we do not need to know \(P\)! (Also easy to extend to the case that we don't know \(r\)!)
We consider infinite-length trajectories \(\tau\) without worrying about computational feasibility
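As a sketch of what a rollout buys us (my own illustration, not code from the lecture): sample a trajectory using only black-box access to \(\mu_0\), \(\pi_\theta\), \(r\), and \(P\), and accumulate \(R(\tau)=\sum_{t} \gamma^t r(s_t,a_t)\), truncating the infinite sum at a horizon where \(\gamma^t\) is negligible. The sampler arguments are hypothetical placeholders.

```python
def rollout(sample_initial_state, policy, reward, transition, gamma, horizon=1000):
    """Sample a trajectory tau under the policy and return (tau, R(tau)).

    The dynamics P and reward r are used only as black-box samplers, matching
    the assumption that we can roll out pi_theta without knowing the model.
    """
    s = sample_initial_state()              # s_0 ~ mu_0
    tau, R = [], 0.0
    for t in range(horizon):                # truncation of the infinite-length trajectory
        a = policy(s)                       # a_t ~ pi(s_t)
        r = reward(s, a)                    # r_t ~ r(s_t, a_t)
        tau.append((s, a, r))
        R += gamma**t * r                   # discounted cumulative reward
        s = transition(s, a)                # s_{t+1} ~ P(s_t, a_t)
    return tau, R
```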
Random Search Policy Optimization
[Figure: two-state MDP with states \(0\) and \(1\); transition/policy arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)
sample random perturbation, e.g. $$\theta^{(1)}_0+\delta,\qquad \theta^{(2)}_0-\delta$$
update \(\theta_1\) based on magnitude of reward
repeat
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
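Below is a minimal Python sketch of random-search policy optimization on a two-state MDP of this flavor. Since the figure is not fully recoverable, it assumes deterministic transitions (stay keeps the current state, switch flips it) and a direct parameterization \(\theta^{(s)} = \pi_\theta(\mathsf{stay}\mid s)\); the reward is \(+1\) when \(s=0\) and \(-\frac{1}{2}\) whenever \(a=\mathsf{switch}\), as on the slide. Batch averaging, clipping, and all hyperparameters are illustrative additions to tame the variance of the one-point estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon = 0.9, 100

def rollout_return(theta):
    """Discounted return of one rollout in the two-state example.

    Assumed parameterization (not fully recoverable from the figure):
    theta[s] = probability of action `stay` in state s; `switch` flips the state.
    Reward: +1 if s == 0, minus 1/2 whenever the action is `switch`.
    """
    s, R = int(rng.integers(2)), 0.0        # s_0 sampled uniformly (mu_0 not specified here)
    for t in range(horizon):
        stay = rng.random() < theta[s]      # a_t ~ pi_theta(s_t)
        R += gamma**t * ((1.0 if s == 0 else 0.0) - (0.0 if stay else 0.5))
        s = s if stay else 1 - s            # deterministic transitions
    return R

theta, delta, eta, batch = np.array([0.5, 0.5]), 0.1, 0.02, 64
for i in range(150):
    g = np.zeros(2)
    for _ in range(batch):                   # average one-point estimates over a batch
        v = rng.choice([-1.0, 1.0], size=2)  # random +/- delta perturbation
        g += rollout_return(np.clip(theta + delta * v, 0.01, 0.99)) * v / delta
    theta = np.clip(theta + eta * g / batch, 0.01, 0.99)   # gradient ascent step, kept in (0,1)

print(theta)  # should (noisily) move toward staying in state 0 and switching out of state 1
```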
1. Recap
2. Policy Optimization
3. REINFORCE
4. Value-based Gradients
Algorithm: Monte-Carlo DFO
\(\nabla J(\theta) \approx g = \underbrace{\nabla_\theta \log P_\theta(z)}_{\text{score}}\, h(z)\)
\(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
Example: \(P_\theta = \mathcal N(\theta, 1)\), so \(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\) and the score is \(\nabla_\theta \log P_\theta(z) = z-\theta\)
With \(h(z) = -z^2\): \(J(\theta) = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2] = -\theta^2 - 1\)
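A quick numerical sanity check of the estimator on this Gaussian example (not from the lecture; the sample size and value of \(\theta\) are arbitrary): averaging \(g = \nabla_\theta \log P_\theta(z)\, h(z) = (z-\theta)(-z^2)\) over draws \(z \sim \mathcal N(\theta,1)\) should recover \(\nabla J(\theta) = -2\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000

z = rng.normal(theta, 1.0, size=n)     # z ~ P_theta = N(theta, 1)
g = (z - theta) * (-z**2)              # score * h(z), the Monte-Carlo DFO gradient estimate

print(g.mean(), -2 * theta)            # sample mean of g vs. the true gradient -2*theta
```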
Claim: The gradient estimate is unbiased: \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
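Why the claim holds (the standard log-derivative trick, assuming differentiation and integration can be exchanged):
$$\mathbb E_{z\sim P_\theta}\big[\nabla_\theta \log P_\theta(z)\, h(z)\big] = \int \frac{\nabla_\theta P_\theta(z)}{P_\theta(z)}\, h(z)\, P_\theta(z)\, dz = \nabla_\theta \int P_\theta(z)\, h(z)\, dz = \nabla_\theta J(\theta)$$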
Algorithm: REINFORCE
[Figure: two-state MDP with states \(0\) and \(1\); transition/policy arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)
rollout, then sum score over trajectory $$g_0 \propto \begin{bmatrix} \text{\# times } s=1,a=\mathsf{stay} \\ \text{\# times } s=1,a=\mathsf{switch} \end{bmatrix} $$
Direction of the update depends on the empirical action frequencies; its size depends on \(R(\tau)\)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
Claim: The gradient estimate \(g_i=\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\) is unbiased
We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
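For concreteness, a minimal Python sketch of REINFORCE on the same two-state example, under the same assumed parameterization as in the random-search sketch above (\(\theta^{(s)} = \pi_\theta(\mathsf{stay}\mid s)\), which may differ from the lecture's figure). Batch averaging, truncation, clipping, and the hyperparameters are illustrative additions; the core update is \(g = \sum_t \nabla_\theta \log\pi_\theta(a_t|s_t)\, R(\tau)\) as in the claim above.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon = 0.9, 100

def rollout_with_score(theta):
    """One rollout; returns (sum over t of grad_theta log pi_theta(a_t|s_t), R(tau)).

    Assumed parameterization: theta[s] = pi_theta(stay | s), so the score in the
    coordinate of the visited state is +1/theta[s] for `stay`, -1/(1-theta[s]) for `switch`.
    """
    s, R, score = int(rng.integers(2)), 0.0, np.zeros(2)   # s_0 sampled uniformly (assumption)
    for t in range(horizon):
        stay = rng.random() < theta[s]
        score[s] += 1.0 / theta[s] if stay else -1.0 / (1.0 - theta[s])
        R += gamma**t * ((1.0 if s == 0 else 0.0) - (0.0 if stay else 0.5))
        s = s if stay else 1 - s                            # deterministic transitions
    return score, R

theta, eta, batch = np.array([0.5, 0.5]), 0.01, 200
for i in range(100):
    g = np.zeros(2)
    for _ in range(batch):              # average the REINFORCE estimate over a batch of rollouts
        score, R = rollout_with_score(theta)
        g += score * R                  # g = sum_t grad log pi(a_t|s_t) * R(tau)
    theta = np.clip(theta + eta * g / batch, 0.01, 0.99)

print(theta)  # pi(stay|0) rises toward 1 and pi(stay|1) falls toward 0 (noisily)
```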
1. Recap
2. Policy Optimization
3. REINFORCE
4. Value-based Gradients
Rollout:
\(\dots,\ s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
Algorithm: Idealized Actor Critic
Claim: The gradient estimate is unbiased: \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
The Advantage function is \(A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)\)
Algorithm: Idealized Actor Critic with Advantage
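The extract names these algorithms without their updates. For reference, a common form of the idealized value-based gradient estimate (this is the standard policy-gradient expression; the lecture's exact notation may differ) replaces \(R(\tau)\) in REINFORCE with the critic's evaluation of each action:
$$g_i = \sum_{t=0}^\infty \gamma^t\, \nabla_\theta\big[\log \pi_\theta(a_t\mid s_t)\big]_{\theta=\theta_i}\, A^{\pi_{\theta_i}}(s_t, a_t)$$
Using \(Q^{\pi_{\theta_i}}(s_t,a_t)\) in place of \(A^{\pi_{\theta_i}}(s_t,a_t)\) gives the plain actor-critic estimate; subtracting the baseline \(V^{\pi_{\theta_i}}(s_t)\) leaves the estimate unbiased (since \(\mathbb E_{a\sim\pi_\theta(s)}[\nabla_\theta\log\pi_\theta(a\mid s)] = 0\)) while reducing variance.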
Meta-Algorithm: Policy Optimization
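The meta-algorithm ties the section together: repeatedly estimate the gradient from rollouts (by random search, REINFORCE, or a value-based estimator) and take a gradient ascent step. A minimal sketch of this outer loop (my own illustration; `gradient_estimate` is a hypothetical callable standing in for any of the estimators above):

```python
def policy_optimization(theta_init, gradient_estimate, step_size, num_iters):
    """Meta-algorithm: iterate theta_{i+1} = theta_i + eta * g_i, where g_i is an
    (approximately) unbiased estimate of grad J(theta_i) built from rollouts of pi_theta_i."""
    theta = theta_init
    for i in range(num_iters):
        g = gradient_estimate(theta)   # e.g. one-point random search, REINFORCE, actor-critic
        theta = theta + step_size * g  # stochastic gradient ascent step
    return theta
```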