CS 4/5789: Introduction to Reinforcement Learning

Lecture 17: Policy Optimization

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • PA 3 due Friday 3/31
    • PSet 4 due Friday (moved from Wednesday)
    • 5789 Paper Reviews due weekly on Mondays
      • Hard deadline for 3 reviews by Friday
  • My Tuesday office hours moved this week to 3-4pm

Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

Recap: SGA

  • Rather than exact gradients, SGA uses unbiased estimates of the gradient \(g_i\), i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$

Algorithm: SGA

  • Initialize \(\theta_0\); For \(i=0,1,...\):
    • \(\theta_{i+1} = \theta_i + \alpha g_i\)
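For concreteness, a minimal NumPy sketch of SGA on a toy objective (my own illustration, not from the slides): the gradient estimate is the true gradient plus zero-mean noise, so it is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_J(theta):
    # true gradient of the toy concave objective J(theta) = -||theta||^2
    return -2 * theta

theta = np.array([2.0, -1.0])  # theta_0
alpha = 0.1                    # step size

for i in range(100):
    # unbiased estimate: true gradient plus zero-mean noise,
    # so E[g_i | theta_i] = grad J(theta_i)
    g_i = grad_J(theta) + rng.normal(scale=0.5, size=theta.shape)
    theta = theta + alpha * g_i  # SGA update

print(theta)  # close to the maximizer [0, 0]
```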
[Figure: gradient ascent vs. stochastic gradient ascent iterates on the level sets of a 2D quadratic function in \((\theta_1, \theta_2)\)]

Recap: DFO

  • Random Search
    • Based on finite difference approximation
    • \(g_i=\frac{1}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v\) for \(v\sim \mathcal N(0,I)\)
  • Monte Carlo / Importance Weighting
    • Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
    • \(g_i=\underbrace{\left[\nabla_{\theta}\log P_\theta(z)\right]_{\theta=\theta_i}}_{\text{score}} h(z)\) for \(z\sim P_{\theta_i}\)

Recap: Random Search Example

\(J(\theta) = -\theta^2 - 1\)

\(\nabla J(\theta) \approx \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))\,v\)

[Figure: the parabola \(J(\theta)\) plotted over \(\theta\)]

  • start with \(\theta\) positive
  • suppose \(v\) is positive
  • then \(J(\theta+\delta v)<J(\theta-\delta v)\)
  • therefore \(g\) is negative
  • indeed, \(\nabla J(\theta) = -2\theta<0\) when \(\theta>0\)
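A small sketch of this example (my own illustration): one random-search estimate for \(J(\theta) = -\theta^2 - 1\), which comes out negative when \(\theta > 0\), matching \(\nabla J(\theta) = -2\theta\).

```python
import numpy as np

rng = np.random.default_rng(1)

def J(theta):
    return -theta**2 - 1          # objective from the slide

theta, delta = 1.0, 0.1           # start with theta positive
v = abs(rng.normal())             # suppose v is positive, as in the example

# finite-difference estimate g = (J(theta + delta v) - J(theta - delta v)) v / (2 delta)
g = (J(theta + delta * v) - J(theta - delta * v)) * v / (2 * delta)
print(g, -2 * theta)              # g is negative, like grad J(theta) = -2 theta
```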

Recap: Monte Carlo Example

\(P_\theta = \mathcal N(\theta, 1)\), \(\quad h(z) = -z^2\)

\(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2]\) (\(=-\theta^2 - 1\))

\(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), so \(\nabla_\theta \log P_\theta(z)= (z-\theta)\)

\(\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(z))\, h(z)\)

[Figure: the parabola \(h(z)\) plotted over \(z\)]

  • start with \(\theta\) positive
  • suppose \(z>\theta\)
  • then the score is positive
  • therefore \(g\) is negative (since \(h(z)<0\))
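A quick numeric sanity check of this example (my own illustration): averaging the score-function estimate \((z-\theta)h(z)\) over many samples \(z\sim\mathcal N(\theta,1)\) recovers \(\nabla J(\theta) = -2\theta\).

```python
import numpy as np

rng = np.random.default_rng(2)

theta = 1.0
z = rng.normal(loc=theta, scale=1.0, size=100_000)  # z ~ N(theta, 1)

# score-function estimates: grad_theta log P_theta(z) * h(z) = (z - theta) * (-z^2)
g = (z - theta) * (-z**2)
print(g.mean(), -2 * theta)  # sample mean is close to grad J(theta) = -2 theta
```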

Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$

  • Recall notation for a trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\) and probability of a trajectory \(\mathbb P^{\pi}_{\mu_0}\)
  • Define cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)
  • For parametric (e.g. deep) policy \(\pi_\theta\), the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right] $$


Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

  • Assume that we can "rollout" policy \(\pi_\theta\) to observe:

    • a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)

    • the resulting cumulative reward \(R(\tau)\)

  • Note: we do not need to know \(P\) or \(r\)!
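To make the rollout assumption concrete, here is a sketch of the kind of access assumed (my own illustration; the `env` interface with `reset`/`step` and the finite `horizon` truncation are assumptions, not part of the lecture).

```python
import numpy as np

def rollout(env, pi, gamma, horizon=200, seed=None):
    """Roll out policy pi in env and return (trajectory, discounted return R(tau)).

    Assumes a hypothetical env with reset() -> s and step(a) -> (s_next, r),
    and pi(s) returning a probability vector over a finite action set.
    """
    rng = np.random.default_rng(seed)
    s = env.reset()
    traj, R = [], 0.0
    for t in range(horizon):                 # truncate the infinite-length trajectory
        probs = pi(s)                        # pi_theta(. | s_t)
        a = rng.choice(len(probs), p=probs)  # a_t ~ pi_theta(s_t)
        s_next, r = env.step(a)
        traj.append((s, a, r))
        R += gamma**t * r                    # accumulate discounted reward
        s = s_next
    return traj, R
```

Note that the rollout only queries the environment; it never needs \(P\) or \(r\) explicitly.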


Policy Optimization

Meta-Algorithm: Policy Optimization

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy
    • Estimate \(\nabla J(\theta_i)\) as \(g_i\) using rollouts
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)


Policy Optimization Overview

  • In today's lecture, we review four ways to construct the estimates \(g_i\) such that $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$
  • We will have
    • two estimates that use trajectories \(\tau\)
    • two estimates that also use Q/Value functions
  • We consider infinite length trajectories \(\tau\) without worrying about computational feasibility
    • note that we could use a similar trick as in Lecture 12-13 to ensure finite sampling size/time


Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

Random Policy Search

  • Successful in continuous action/control settings
  • Can improve accuracy of gradient estimate by sampling more \(v\), which is easily parallelizable in simulation


Algorithm: Random Policy Search

  • Given \(\alpha, \delta\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Sample \(v\sim \mathcal N(0, I)\)
    • Rollout policies \(\pi_{\theta_i\pm\delta v}\) and observe trajectories \(\tau_+\) and \(\tau_-\)
    • Estimate \(g_i = \frac{1}{2\delta}\left(R(\tau_+)-R(\tau_-)\right) v\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)

We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\), up to the accuracy of the finite-difference approximation

  • \(\mathbb E[g_i| \theta_i] = \mathbb E[\mathbb E[g_i|v, \theta_i]| \theta_i]\) by tower property
  • \(\mathbb E[g_i|v, \theta_i]\)
    • \(= \mathbb E_{\tau_+, \tau_-}[\frac{1}{2\delta}\left(R(\tau_+)-R(\tau_-)\right) v]\)
    • \(=\frac{1}{2\delta}\left( \mathbb E_{\tau_+\sim \mathbb P^{\pi_{\theta_i+\delta v}}_{\mu_0}}[R(\tau_+)]-\mathbb E_{\tau_-\sim \mathbb P^{\pi_{\theta_i-\delta v}}_{\mu_0}}[R(\tau_-)]\right) v\)
    • \(=\frac{1}{2\delta}\left( J(\theta_i+\delta v) - J(\theta_i-\delta v)\right) v\)
    • \(=\nabla J(\theta_i)^\top v\, v\) if the finite-difference approximation is exact
  • \(\mathbb E[g_i| \theta_i] = \mathbb E_v[\nabla J(\theta_i)^\top v\, v] = \mathbb E_v[v v^\top]\nabla J(\theta_i) = \nabla J(\theta_i)\) since \(\mathbb E_v[vv^\top]=I\)
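A minimal sketch of one Random Policy Search iteration (my own illustration), assuming a hypothetical `rollout_return(theta)` helper that rolls out \(\pi_\theta\) once and returns the observed \(R(\tau)\):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_policy_search_step(theta, rollout_return, alpha=0.01, delta=0.05):
    """One Random Policy Search update; rollout_return(theta) is assumed to
    roll out pi_theta once and return the observed cumulative reward R(tau)."""
    v = rng.normal(size=theta.shape)             # v ~ N(0, I)
    R_plus = rollout_return(theta + delta * v)   # rollout pi_{theta_i + delta v}
    R_minus = rollout_return(theta - delta * v)  # rollout pi_{theta_i - delta v}
    g = (R_plus - R_minus) * v / (2 * delta)     # finite-difference gradient estimate
    return theta + alpha * g                     # gradient ascent update
```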

Example

[Figure: two-state MDP (states \(0\) and \(1\)) with stay/switch transitions labeled with probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\), and policy iterates \(i=0,1\)]

  • Parametrized policy: \(\pi_\theta(0)=\)stay, \(\pi_\theta(\mathsf{stay}|1) = \theta^{(1)}\) and \(\pi_\theta(\mathsf{switch}|1) = \theta^{(2)}\).
  • Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)

    • try perturbation in favor of "switch", then in favor of "stay"

    • update in direction of policy which receives more cumulative reward

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

[Figure: policy parameters plotted in the \((\theta^{(1)}, \theta^{(2)})\) plane]

Policy Gradient (REINFORCE)

Claim: The gradient estimate is unbiased: \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • Recall the Monte Carlo gradient and that \(J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]\)

Algorithm: REINFORCE

  • Given \(\alpha\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy \(\pi_{\theta_i}\) and observe trajectory \(\tau\)
    • Estimate \(g_i = \sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
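A sketch of one REINFORCE update for a tabular softmax policy (my own illustration; the softmax parameterization and the `traj`/`R` inputs from a single rollout are assumptions, not from the slides):

```python
import numpy as np

def softmax_policy(theta, s):
    # pi_theta(. | s) for a tabular softmax parameterization theta[s, a]
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(theta, traj, R, alpha=0.01):
    """One REINFORCE step: g = sum_t grad_theta log pi_theta(a_t | s_t) * R(tau),
    where traj is a list of (s_t, a_t) pairs from one rollout of pi_theta."""
    g = np.zeros_like(theta)
    for s, a in traj:
        p = softmax_policy(theta, s)
        # for softmax, grad_theta log pi_theta(a|s) = onehot(a) - pi_theta(.|s) in row s
        g[s, a] += R
        g[s] -= R * p
    return theta + alpha * g  # gradient ascent update
```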

Example

[Figure: two-state MDP (states \(0\) and \(1\)) with stay/switch transitions labeled with probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\)]

  • Parametrized policy: \(\pi_\theta(0)=\)stay, \(\pi_\theta(\mathsf{stay}|1) = \theta^{(1)}\) and \(\pi_\theta(\mathsf{switch}|1) = \theta^{(2)}\).
  • Compute the score (PollEV):
    • \(\nabla_\theta \log \pi_\theta(a|s)=\begin{bmatrix} 1/\theta^{(1)} \cdot \mathbb 1\{a=\mathsf{stay}\} \\ 1/\theta^{(2)}  \cdot \mathbb 1\{a=\mathsf{switch}\}\end{bmatrix}\)
  • Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)

    • rollout, then sum score over trajectory $$g_0 \propto \begin{bmatrix}  \text{\# times } s=1,a=\mathsf{stay} \\ \text{\# times } s=1,a=\mathsf{switch} \end{bmatrix} $$

  • Direction of update depends on empirical action frequency, size depends on \(R(\tau)\)

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
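A small sketch checking the score computation for this example (my own illustration; the trajectory below is hypothetical): summing the score over a rollout gives a vector proportional to the stay/switch counts in state 1.

```python
import numpy as np

def score(theta, s, a):
    """grad_theta log pi_theta(a|s) for the two-state example:
    pi_theta(stay|1) = theta[0], pi_theta(switch|1) = theta[1];
    state 0 has a fixed action, so its score is zero."""
    g = np.zeros(2)
    if s == 1:
        idx = 0 if a == "stay" else 1
        g[idx] = 1.0 / theta[idx]
    return g

theta = np.array([0.5, 0.5])                      # theta^(1) = theta^(2) = 1/2
traj = [(1, "stay"), (1, "switch"), (0, "stay")]  # hypothetical rollout (s_t, a_t)
g0 = sum(score(theta, s, a) for s, a in traj)     # sum of scores over the trajectory
print(g0)  # proportional to [# stay in state 1, # switch in state 1]
```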

Policy Gradient (REINFORCE)

Claim: The gradient estimate \(g_i=\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\) is unbiased

  • Recall that \(J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]\)
  • by the Monte Carlo gradient identity, \(\nabla J(\theta_i) = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i} R(\tau)] \)
  • Since \(\tau \sim \mathbb P^{\pi_{\theta_i}}_{\mu_0} \), it suffices to show that $$\textstyle \nabla_\theta\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau) = \sum_{t=0}^\infty \nabla_\theta\log \pi_\theta(a_t|s_t)  $$
  • Key ideas:
    • \(\mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)\) factors into terms depending on \(\mu_0\), \(P\), and \(\pi_\theta\)
    • the logarithm of a product is the sum of the logarithms
    • only terms depending on \(\theta\) affect the gradient

We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • Using the Monte Carlo derivation from last lecture $$\nabla J(\theta_i) = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i} R(\tau)] $$
  • \(\log \mathbb P^{\pi_{\theta_i}}_{\mu_0}(\tau)\)
    • \( =\log \left(\mu_0(s_0) \pi_{\theta_i} (a_0|s_0) P(s_1|a_0,s_0) \pi_{\theta_i} (a_1|s_1) P(s_2|a_1,s_1)...\right) \)
    • \( =\log \mu_0(s_0) + \sum_{t=0}^\infty \left(\log \pi_{\theta_i} (a_t|s_t)+ \log P(s_{t+1}|a_t,s_t)\right)  \)
  • \(\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i}\)
    • \( =\cancel{\nabla_\theta \log \mu_0(s_0)} + \sum_{t=0}^\infty \left(\nabla_\theta [\log \pi_{\theta} (a_t|s_t)]_{\theta=\theta_i}+ \cancel{\nabla_\theta \log P(s_{t+1}|a_t,s_t)}\right)  \)
    • Thus \(\nabla_\theta \log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)\) ends up having no dependence on unknown \(P\)!
  • \(\mathbb E[g_i| \theta_i] = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau) ]\)
    • \(= \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i}R(\tau) ]\) by above
    • \(= \nabla J(\theta_i)\)

Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

Motivation: PG with Value

  • So far, methods depend on entire trajectory rollouts
  • This leads to high variance estimates
  • Incorporating (Q) Value function can reduce variance
  • In practice, can only use estimates of Q/Value
    • results in bias (Lecture 15)
    • we ignore this issue today



Policy Gradient with (Q) Value

Algorithm: Idealized Actor Critic

  • Given \(\alpha\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Roll "in" policy \(\pi_{\theta_i}\) to sample \(s,a\sim d_{\mu_0}^{\pi_{\theta_i}}\)
    • Estimate \(g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
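A sketch of the idealized actor-critic update (my own illustration), assuming hypothetical oracles for sampling \(s,a\sim d^{\pi_\theta}_{\mu_0}\), for \(Q^{\pi_\theta}\), and for the score; in practice these would be approximated, which introduces the bias mentioned above.

```python
def actor_critic_step(theta, sample_sa, Q, grad_log_pi, gamma=0.99, alpha=0.01):
    """One idealized actor-critic update, given hypothetical oracles:
      sample_sa(theta)         -> (s, a) with s, a ~ d^{pi_theta}_{mu_0}
      Q(theta, s, a)           -> Q^{pi_theta}(s, a)
      grad_log_pi(theta, s, a) -> grad_theta log pi_theta(a | s)
    """
    s, a = sample_sa(theta)
    g = grad_log_pi(theta, s, a) * Q(theta, s, a) / (1 - gamma)
    return theta + alpha * g  # gradient ascent update
```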


Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • Product rule on \(J(\theta) =\mathbb E_{\substack{s_0\sim \mu_0 \\ a_0\sim\pi_\theta(s_0)}}[ Q^{\pi_\theta}(s_0, a_0)] \) to derive recursion $$\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] =  \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]$$
  • Starting with a different decomposition of cumulative reward: $$\nabla J(\theta) = \nabla_{\theta} \mathbb E_{s_0\sim\mu_0}[V^{\pi_\theta}(s_0)] =\mathbb E_{s_0\sim\mu_0}[ \nabla_{\theta} V^{\pi_\theta}(s_0)]$$
  • \(\nabla_{\theta} V^{\pi_\theta}(s_0) = \nabla_{\theta} \mathbb E_{a_0\sim\pi_\theta(s_0)}[ Q^{\pi_\theta}(s_0, a_0)] \)
    • \(= \nabla_{\theta} \sum_{a_0\in\mathcal A} \pi_\theta(a_0|s_0)  Q^{\pi_\theta}(s_0, a_0) \)
    • \(=\sum_{a_0\in\mathcal A} \left( \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) +  \pi_\theta(a_0|s_0)  \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)]\right)  \)
  • Considering each term:
    • \(\sum_{a_0\in\mathcal A} \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) = \sum_{a_0\in\mathcal A}  \pi_\theta(a_0|s_0) \frac{\nabla_{\theta} [\pi_\theta(a_0|s_0) ]}{\pi_\theta(a_0|s_0) } Q^{\pi_\theta}(s_0, a_0) \)
      • \( = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] \)
    • \(\sum_{a_0\in\mathcal A}\pi_\theta(a_0|s_0)  \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)] = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} Q^{\pi_\theta}(s_0, a_0)]  \)
      • \(= \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [r(s_0,a_0) + \gamma \mathbb E_{s_1\sim P(s_0, a_0)}[V^{\pi_\theta}(s_1)]]]\)
      • \(=\gamma \mathbb E_{a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]\) since \(r\) does not depend on \(\theta\)
    • Recursion \(\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] =  \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]\)
    • Iterating this recursion leads to $$\nabla J(\theta) = \sum_{t=0}^\infty \gamma^t \mathbb E_{s_t, a_t}[\nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t)] $$ $$= \sum_{t=0}^\infty \gamma^t \sum_{s_t, a_t} d_{\mu_0, t}^{\pi_\theta}(s_t, a_t) \nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t) =\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{\mu_0}^{\pi_\theta}}[\nabla_{\theta} [\log \pi_\theta(a|s) ] Q^{\pi_\theta}(s, a)] $$

Policy Gradient with Advantage

The Advantage function is \(A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)\)

  • Claim: Same as previous slide in expectation over actions: $$ \mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]A^{\pi_{\theta}}(s,a)] = \mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]Q^{\pi_{\theta}}(s,a)]$$
  • Suffices to show that \(\mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]V^{\pi_{\theta}}(s)]=0\)


Algorithm: Idealized Actor Critic with Advantage

  • Same as previous slide, except estimation step
    • Estimate \(g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}A^{\pi_{\theta_i}}(s,a)\)
PG with "Baselines"

  • Claim: \(\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = 0\)
  • General principle: subtracting any action-independent "baseline" does not affect expected value
  • Proof of claim:
    • \(\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]\)
      • \(=\sum_{a\in\mathcal A} \pi_\theta(a|s)\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]\)
      • \(=\sum_{a\in\mathcal A} \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \cdot b(s)\)
      • \(=\nabla_\theta  \sum_{a\in\mathcal A}\pi_\theta(a|s) \cdot b(s)\)
      • \(=\nabla_\theta\, b(s) = 0\) since \(\sum_{a\in\mathcal A}\pi_\theta(a|s)=1\) and \(b(s)\) does not depend on \(\theta\)
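A quick numeric check of this claim (my own illustration) for a softmax distribution over three actions: the exact expectation of the score times a constant baseline is zero.

```python
import numpy as np

rng = np.random.default_rng(4)

logits = rng.normal(size=3)                   # parameters for one state s
pi = np.exp(logits) / np.exp(logits).sum()    # softmax policy pi_theta(.|s)
b = 7.3                                       # arbitrary action-independent baseline b(s)

# E_{a ~ pi_theta(s)}[ grad_theta log pi_theta(a|s) * b(s) ], computed exactly;
# for softmax, grad_theta log pi_theta(a|s) = onehot(a) - pi
expectation = sum(pi[a] * (np.eye(3)[a] - pi) * b for a in range(3))
print(expectation)  # ~[0, 0, 0] up to floating point error
```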


Recap

  • PSet due Fri (moved from Wed)
  • PA due Fri

 

  • PG with rollouts: random search and REINFORCE
  • PG with value: Actor-Critic and baselines

 

  • Next lecture: Trust Regions