CS 4/5789: Introduction to Reinforcement Learning

Lecture 16: Policy Optimization

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • Friday: PSet 5 due, PSet 6 released
    • PA 3 due next Friday
    • 5789 Paper assignments
  • Second prelim on Wednesday 4/10 in class

Agenda

1. Recap

2. Policy Optimization

3. REINFORCE

4. Value-based Gradients

Recap: SGA & DFO

  • Rather than exact gradients, SGA uses unbiased estimates of the gradient \(g_i\), i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$
  • SGA converges faster when \(g_i\) has lower variance
  • Derivative Free Optimization (DFO) methods construct estimates \(g_i\) with only zeroth-order access to \(J(\theta)\)

Algorithm: SGA

  • Initialize \(\theta_0\); For \(i=0,1,...\):
    • \(\theta_{i+1} = \theta_i + \alpha g_i\)

Algorithm: One Point Random Search

  • Initialize \(\theta_0\). For \(i=0,1,...\):
    • sample \(v\sim \mathcal N(0,I)\)
    • \(\theta_{i+1} = \theta_i + \frac{\alpha}{\delta}J(\theta_i+\delta v) v\)

\(\nabla J(\theta) \approx g= \frac{1}{\delta} J(\theta+\delta v)\,v\)

[Figure: parabola \(J(\theta) = -\theta^2 - 1\) plotted against \(\theta\)]

Random Search Example

  • start with \(\theta\) positive
  • suppose \(v\) is positive
  • then \(J(\theta+\delta v)<0\)
  • therefore \(g\) is negative
  • (if we sample \(v\) negative, wrong direction! but smaller magnitude)
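
A minimal numerical sketch of one-point random search on this parabola; the step size, smoothing radius, and iteration count below are illustrative choices, not values from the lecture.

```python
import numpy as np

def J(theta):
    # objective from the example: J(theta) = -theta^2 - 1, maximized at theta = 0
    return -theta ** 2 - 1

rng = np.random.default_rng(0)
theta = 2.0               # start with theta positive
alpha, delta = 1e-3, 0.1  # illustrative step size and smoothing radius

for _ in range(2000):
    v = rng.standard_normal()                      # sample v ~ N(0, 1)
    g = (1.0 / delta) * J(theta + delta * v) * v   # one-point gradient estimate
    theta += alpha * g                             # SGA step

print(theta)  # on average drifts toward the maximizer theta = 0
```

A single evaluation of \(J\) makes each estimate high variance; averaging over several sampled \(v\) per iteration (as discussed below) reduces it.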

Agenda

1. Recap

2. Policy Optimization

3. REINFORCE

4. Value-based Gradients

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$

  • Recall notation for a trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\) and probability of a trajectory \(\mathbb P^{\pi}_{\mu_0}\)
  • Define cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)
  • For parametric (e.g. deep) policy \(\pi_\theta\), the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right] $$

Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

  • Assume that we can "rollout" policy \(\pi_\theta\) to observe:

    • a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)

    • the resulting cumulative reward \(R(\tau)\)

  • Note: we do not need to know \(P\)! (Also easy to extend to the case that we don't know \(r\)!)

  • We consider infinite length trajectories \(\tau\) without worrying about computational feasibility

Policy Optimization Setting

Policy Opt. with Random Search

  • Can be successful in continuous action/control settings
  • Improve performance by sampling more \(v\), which is easily parallelizable in simulation

Random Search Policy Optimization

  • Given \(\alpha, \delta\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Sample \(v\sim \mathcal N(0, I)\)
    • Rollout policy \(\pi_{\theta_i+ \delta v}\) and observe trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\)
    • Estimate \(g_i = \frac{1}{\delta} R(\tau) v\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
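
A minimal sketch of this loop in code. Here `rollout_return` is a hypothetical stand-in for the rollout access assumed above: it should roll out \(\pi_\theta\) once and return the observed cumulative reward \(R(\tau)\). The hyperparameter defaults are illustrative.

```python
import numpy as np

def random_search_policy_opt(rollout_return, dim, alpha=1e-2, delta=0.1,
                             num_iters=1000, seed=0):
    """Random search policy optimization (sketch).

    rollout_return(theta): hypothetical function that rolls out pi_theta
    once and returns the observed cumulative reward R(tau).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(num_iters):
        v = rng.standard_normal(dim)              # sample v ~ N(0, I)
        R = rollout_return(theta + delta * v)     # rollout perturbed policy
        g = (R / delta) * v                       # g_i = (1/delta) R(tau) v
        theta = theta + alpha * g                 # update theta_{i+1}
    return theta
```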

[Figure: two-state MDP diagram with states \(0\) and \(1\); edges labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]

Example

  • Parametrized policy: \(\pi_\theta(0)=\)stay, \(\pi_\theta(\mathsf{stay}|1) = \theta^{(1)}\) and \(\pi_\theta(\mathsf{switch}|1) = \theta^{(2)}\).
  • Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)

    • sample a random perturbation, e.g. $$\theta^{(1)}_0+\delta,\qquad \theta^{(2)}_0-\delta$$

    • update \(\theta_1\) based on magnitude of reward

    • repeat

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

Agenda

1. Recap

2. Policy Optimization

3. REINFORCE

4. Value-based Gradients

DFO via Importance Weighting

  • Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
    • E.g. for RL, \(J(\theta)=V^{\pi_\theta}(s_0) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]\)
  • Fact: The gradient \(\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\nabla_\theta \left[\log P_\theta(z) \right] h(z)\right]\)
    • Proof: pick an arbitrary distribution $$\rho\in \Delta(\mathcal Z)\quad \text{s.t.} \quad \frac{P_\theta(z)}{\rho(z)}<\infty $$
    • Then \(\mathbb E_{z\sim P_\theta}[h(z)] = \sum_{z\in\mathcal Z} h(z) P_\theta(z) \cdot \frac{\rho(z)}{\rho(z)} = \mathbb E_{z\sim \rho}[h(z) \frac{P_\theta(z) }{\rho(z)}] \)
      • general principle: reweight by ratio of probability distributions (PSet)
    • The gradient \(\nabla J(\theta) = \nabla_\theta \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim \rho}[h(z) \frac{\nabla_\theta P_\theta(z) }{\rho(z)}] \)
    • Set \(\rho = P_\theta\) and notice that \(\nabla_\theta \left[\log P_\theta(z) \right]  = \frac{\nabla_\theta P_\theta(z) }{P_\theta(z)}\)
  • Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
    • E.g. in reinforcement learning, \(V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]\)
  • Fact: The gradient \(\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\underbrace{\nabla_\theta \left[\log P_\theta(z) \right]}_{\text{score}} h(z)\right]\)
  • This fact suggests an SGA-inspired algorithm:

Algorithm: Monte-Carlo DFO

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • sample \(z\sim P_{\theta_i}\)
    • \(\theta_{i+1} = \theta_i + \alpha\left[\nabla_{\theta}\log P_\theta(z)\right]_{\theta=\theta_i} h(z)\)

DFO via Importance Weighting

Monte-Carlo Example

\(\nabla J(\theta)\)\( \approx g=\nabla_\theta \log(P_\theta(z)) h(z) \)

[Figure: parabola \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\) plotted against \(z\)]

  • \(P_\theta = \mathcal N(\theta, 1)\) and \(h(z) = -z^2\)
  • \(J(\theta) = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2] = -\theta^2 - 1\)
  • \(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), so \(\nabla_\theta \log P_\theta(z) = z-\theta\)
  • start with \(\theta\) positive
  • suppose \(z>\theta\)
  • then the score is positive
  • therefore \(g\) is negative (since \(h(z)<0\))
  • (wrong direction if we sample \(z<\theta\))
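
A minimal numerical sketch of the Monte-Carlo DFO update on this example, with \(P_\theta = \mathcal N(\theta,1)\) and \(h(z)=-z^2\); the step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0    # start with theta positive
alpha = 1e-3   # illustrative step size

for _ in range(5000):
    z = rng.normal(theta, 1.0)     # sample z ~ P_theta = N(theta, 1)
    score = z - theta              # grad_theta log P_theta(z) = z - theta
    g = score * (-z ** 2)          # g = score * h(z) with h(z) = -z^2
    theta += alpha * g             # SGA step

print(theta)  # on average drifts toward 0, the maximizer of J(theta) = -theta^2 - 1
```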

Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • Recall the Monte-Carlo gradient identity and that \(J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]\)

Policy Gradient (REINFORCE)

Algorithm: REINFORCE

  • Given \(\alpha\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy \(\pi_{\theta_i}\) and observe trajectory \(\tau\)
    • Estimate \(g_i = \sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
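
A minimal sketch of one REINFORCE update for a tabular softmax policy. The environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward, done)`) is a hypothetical placeholder, the trajectory is truncated at a finite horizon rather than run forever, and the hyperparameters are illustrative.

```python
import numpy as np

def softmax_policy(theta, s):
    # theta has shape (num_states, num_actions); theta[s] are the logits for state s
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(theta, env, alpha=1e-2, gamma=0.99, horizon=200, rng=None):
    """One REINFORCE step: roll out pi_theta, then ascend the score-weighted return."""
    rng = rng if rng is not None else np.random.default_rng()
    s = env.reset()
    score_sum = np.zeros_like(theta)   # accumulates sum_t grad_theta log pi_theta(a_t|s_t)
    R = 0.0                            # discounted return R(tau)
    for t in range(horizon):
        p = softmax_policy(theta, s)
        a = rng.choice(len(p), p=p)
        # softmax score: gradient wrt theta[s] is onehot(a) - p (zero for other states)
        score_sum[s] += np.eye(len(p))[a] - p
        s, r, done = env.step(a)
        R += gamma ** t * r
        if done:
            break
    return theta + alpha * score_sum * R   # g_i = (sum of scores) * R(tau)
```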

[Figure: two-state MDP diagram with states \(0\) and \(1\); edges labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]

Example

  • Parametrized policy: \(\pi_\theta(0)=\)stay, \(\pi_\theta(\mathsf{stay}|1) = \theta^{(1)}\) and \(\pi_\theta(\mathsf{switch}|1) = \theta^{(2)}\).
  • Compute the score (see the code sketch below)
    • \(\nabla_\theta \log \pi_\theta(a|s)=\begin{bmatrix} 1/\theta^{(1)} \cdot \mathbb 1\{a=\mathsf{stay}\} \\ 1/\theta^{(2)}  \cdot \mathbb 1\{a=\mathsf{switch}\}\end{bmatrix}\) for \(s=1\); the score is zero at \(s=0\) since \(\pi_\theta(\cdot\,|0)\) does not depend on \(\theta\)
  • Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)

    • rollout, then sum score over trajectory $$g_0 \propto \begin{bmatrix}  \text{\# times } s=1,a=\mathsf{stay} \\ \text{\# times } s=1,a=\mathsf{switch} \end{bmatrix} $$

  • Direction of update depends on empirical action frequency, size depends on \(R(\tau)\)

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
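
A small code sketch of this score computation; the function name and string action encoding are illustrative, not from the lecture.

```python
import numpy as np

def score(theta, s, a):
    """grad_theta log pi_theta(a|s) for the two-state example.

    theta = (theta1, theta2) with pi_theta(stay|1) = theta1 and
    pi_theta(switch|1) = theta2; pi_theta(.|0) has no theta dependence,
    so its score is zero.
    """
    if s == 0:
        return np.zeros(2)
    return np.array([1.0 / theta[0] if a == "stay" else 0.0,
                     1.0 / theta[1] if a == "switch" else 0.0])

theta0 = np.array([0.5, 0.5])
# At this initialization each visit to (s=1, a) contributes 2 to the matching
# coordinate, so summing scores over a trajectory counts the (s=1, stay) and
# (s=1, switch) visits, consistent with the expression for g_0 above.
print(score(theta0, 1, "stay"), score(theta0, 1, "switch"))
```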

Claim: The gradient estimate \(g_i=\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\)  is unbiased

Policy Gradient (REINFORCE)

  • Recall that \(J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]\)
  • by the Monte-Carlo gradient identity, \(\nabla J(\theta_i) = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i} R(\tau)] \)
  • Since \(\tau \sim \mathbb P^{\pi_{\theta_i}}_{\mu_0} \), it suffices to show that $$\textstyle \nabla_\theta\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau) = \sum_{t=0}^\infty \nabla_\theta\log \pi_\theta(a_t|s_t)  $$
  • Key ideas:
    • \(\mathbb P^{\pi_{\theta}}_{\mu_0}\) factors into terms depending on \(P\) and \(\pi_\theta\)
    • the logarithm of a product is the sum of the logarithms
    • only terms depending on \(\theta\) affect the gradient

We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • Using the Monte-Carlo derivation $$\nabla J(\theta_i) = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i} R(\tau)] $$
  • \(\log \mathbb P^{\pi_{\theta_i}}_{\mu_0}(\tau)\)
    • \( =\log \left(\mu_0(s_0) \pi_{\theta_i} (a_0|s_0) P(s_1|a_0,s_0) \pi_{\theta_i} (a_1|s_1) P(s_2|a_1,s_1)...\right) \)
    • \( =\log \mu_0(s_0) + \sum_{t=0}^\infty \left(\log \pi_{\theta_i} (a_t|s_t)+ \log P(s_{t+1}|a_t,s_t)\right)  \)
  • \(\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i}\)
    • \( =\cancel{\nabla_\theta \log \mu_0(s_0)} + \sum_{t=0}^\infty \left(\nabla_\theta[ \log \pi_{\theta} (a_t|s_t)]_{\theta=\theta_i}+ \cancel{\nabla_\theta \log P(s_{t+1}|a_t,s_t)}\right)  \)
    • Thus \(\nabla_\theta \log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)\) ends up having no dependence on unknown \(P\)!
  • \(\mathbb E[g_i| \theta_i] = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau) ]\)
    • \(= \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i}R(\tau) ]\) by above
    • \(= \nabla J(\theta_i)\)

Agenda

1. Recap

2. Policy Optimization

3. REINFORCE

4. Value-based Gradients

  • So far, methods depend on entire trajectory rollout
  • This leads to high variance estimates
  • Incorporating (Q) Value function can reduce variance
  • In practice, use estimates of Q/Value (last week)

Motivation: PG with Value


Sampling from \(d_\gamma^{\mu_0,\pi}\)

  • Recall the discounted "steady-state" distribution $$ d^{\mu_0,\pi}_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d^{\mu_0,\pi}_t $$
  • On PSet, you showed that \(V^\pi(s)=\frac{1}{1-\gamma}\mathbb E_{s'\sim d^{e_{s},\pi}_\gamma}[r(s',\pi(s'))]\)
  • Can we sample from this distribution?
    • Sample \(s_0\sim\mu_0\) and \(h\sim\mathrm{Geom}(1-\gamma)\) (supported on \(\{0,1,2,\dots\}\), so \(\Pr[h=k]=(1-\gamma)\gamma^k\))
    • Roll out \(\pi\) for \(h\) steps
    • Claim: then \(s_{h}\sim d_\gamma^{\mu_0,\pi}\)
  • Shorthand: \(s,a\sim d_\gamma^{\mu_0,\pi}\) if \(s\sim d_\gamma^{\mu_0,\pi}\)
    and \(a\sim \pi(s)\)

Rollout: \(s_t,\;\; a_t\sim \pi(s_t),\;\; r_t\sim r(s_t, a_t),\;\; s_{t+1}\sim P(s_t, a_t),\;\; a_{t+1}\sim \pi(s_{t+1}),\;\dots\)
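
A minimal sketch of this sampling procedure. The environment interface (`env.reset()` drawing \(s_0\sim\mu_0\), `env.step(a)` returning `(next_state, reward, done)`) and the `policy` callable are hypothetical placeholders; episode termination is ignored to match the idealized infinite-horizon setting.

```python
import numpy as np

def sample_discounted_state(env, policy, gamma, rng=None):
    """Sample s ~ d_gamma^{mu_0, pi} by rolling out pi for a geometric horizon."""
    rng = rng if rng is not None else np.random.default_rng()
    # geometric horizon supported on {0, 1, 2, ...}: P(h = k) = (1 - gamma) * gamma^k
    # (numpy's geometric is supported on {1, 2, ...}, hence the -1)
    h = rng.geometric(1.0 - gamma) - 1
    s = env.reset()            # s_0 ~ mu_0
    for _ in range(h):
        a = policy(s)          # a_t ~ pi(s_t)
        s, _, _ = env.step(a)  # s_{t+1} ~ P(s_t, a_t)
    return s                   # s_h ~ d_gamma^{mu_0, pi}
```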

Algorithm: Idealized Actor Critic

  • Given \(\alpha\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Roll out \(\pi_{\theta_i}\) to sample \(s,a\sim d_\gamma^{\mu_0,\pi_{\theta_i}}\)
    • Estimate \(g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)

Policy Gradient with (Q) Value
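
A minimal sketch of the idealized actor-critic update for a tabular softmax policy. Both `sample_discounted_state_action` (returning \(s,a\sim d_\gamma^{\mu_0,\pi_{\theta}}\)) and `Q` (returning \(Q^{\pi_\theta}(s,a)\)) are hypothetical oracles standing in for the idealized access assumed in the algorithm; in practice \(Q\) would be replaced by an estimate (last week).

```python
import numpy as np

def actor_critic_update(theta, sample_discounted_state_action, Q,
                        alpha=1e-2, gamma=0.99):
    """One idealized actor-critic step for a tabular softmax policy.

    sample_discounted_state_action(theta): hypothetical oracle returning (s, a)
        with s ~ d_gamma^{mu_0, pi_theta} and a ~ pi_theta(s).
    Q(s, a): hypothetical oracle returning Q^{pi_theta}(s, a).
    """
    s, a = sample_discounted_state_action(theta)
    logits = theta[s]
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    score = np.zeros_like(theta)
    score[s] = np.eye(len(p))[a] - p         # grad_theta log pi_theta(a|s)
    g = score * Q(s, a) / (1.0 - gamma)      # g_i = (1/(1-gamma)) * score * Q(s, a)
    return theta + alpha * g                 # theta_{i+1} = theta_i + alpha g_i
```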

Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • I.e. \(\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_\gamma^{\mu_0,\pi_{\theta_i}}}[ \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)]=\nabla J(\theta_i)\)
  • Why? Product rule on \(J(\theta) =\mathbb E_{\substack{s_0\sim \mu_0 \\ a_0\sim\pi_\theta(s_0)}}[ Q^{\pi_\theta}(s_0, a_0)] \)
  • Starting with a different decomposition of cumulative reward: $$\nabla J(\theta) = \nabla_{\theta} \mathbb E_{s_0\sim\mu_0}[V^{\pi_\theta}(s_0)] =\mathbb E_{s_0\sim\mu_0}[ \nabla_{\theta} V^{\pi_\theta}(s_0)]$$
  • \(\nabla_{\theta} V^{\pi_\theta}(s_0) = \nabla_{\theta} \mathbb E_{a_0\sim\pi_\theta(s_0)}[ Q^{\pi_\theta}(s_0, a_0)] \)
    • \(= \nabla_{\theta} \sum_{a_0\in\mathcal A} \pi_\theta(a_0|s_0)  Q^{\pi_\theta}(s_0, a_0) \)
    • \(=\sum_{a_0\in\mathcal A} \left( \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) +  \pi_\theta(a_0|s_0)  \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)]\right)  \)
  • Considering each term:
    • \(\sum_{a_0\in\mathcal A} \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) = \sum_{a_0\in\mathcal A}  \pi_\theta(a_0|s_0) \frac{\nabla_{\theta} [\pi_\theta(a_0|s_0) ]}{\pi_\theta(a_0|s_0) } Q^{\pi_\theta}(s_0, a_0) \)
      • \( = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] \)
    • \(\sum_{a_0\in\mathcal A}\pi_\theta(a_0|s_0)  \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)] = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} Q^{\pi_\theta}(s_0, a_0)]  \)
      • \(= \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [r(s_0,a_0) + \gamma \mathbb E_{s_1\sim P(s_0, a_0)}V^{\pi_\theta}(s_1)]]\)
      • \(=\gamma \mathbb E_{a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]\) since \(r(s_0,a_0)\) does not depend on \(\theta\)
    • Recursion \(\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] =  \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]\)
    • Iterating this recursion leads to $$\nabla J(\theta) = \sum_{t=0}^\infty \gamma^t \mathbb E_{s_t, a_t}[\nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t)] $$ $$= \sum_{t=0}^\infty \gamma^t \sum_{s_t, a_t} d_{t}^{\mu_0,\pi_\theta}(s_t, a_t) \nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t) =\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{\gamma}^{\mu_0,\pi_\theta}}[\nabla_{\theta} [\log \pi_\theta(a|s) ] Q^{\pi_\theta}(s, a)] $$

The Advantage function is \(A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)\)

  • Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
  • Follows because we can show that \(\mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]V^{\pi_{\theta}}(s)]=0\)

Policy Gradient with Advantage

Algorithm: Idealized Actor Critic with Advantage

  • Same as previous slide, except estimation step
    • Estimate \(g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}A^{\pi_{\theta_i}}(s,a)\)
  • Claim: for any \(b(s)\), \(\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = 0\)
  • General principle: subtracting any action-independent "baseline" does not affect expected value
  • Proof of claim:
    • \(\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]\)
      • \(=\sum_{a\in\mathcal A} \pi_\theta(a|s)\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]\)
      • \(=\sum_{a\in\mathcal A} \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \cdot b(s)\)
      • \(=\nabla_\theta  \sum_{a\in\mathcal A}\pi_\theta(a|s) \cdot b(s)\)
      • \(=\nabla_\theta\, b(s) = 0\), since \(\sum_{a\in\mathcal A}\pi_\theta(a|s)=1\) and \(b(s)\) does not depend on \(\theta\)

PG with "Baselines"
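
A small numerical check of the baseline claim for a softmax policy at a single state; the policy parameters and baseline value are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(4)       # softmax policy over 4 actions at one state
p = np.exp(logits - logits.max())
p = p / p.sum()
b = 3.7                               # arbitrary action-independent baseline b(s)

# E_{a ~ pi}[ grad_logits log pi(a|s) * b(s) ], using the softmax score onehot(a) - p
expectation = sum(p[a] * (np.eye(len(p))[a] - p) * b for a in range(len(p)))
print(expectation)   # all coordinates are zero up to floating-point error
```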

  • Today we covered multiple ways (Random Search, REINFORCE, Actor-Critic) to construct estimate \(g_i\) such that $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$

Policy Optimization Summary

Meta-Algorithm: Policy Optimization

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy
    • Estimate \(\nabla J(\theta_i)\) as \(g_i\) using data
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)

Recap

  • PSet due Fri
  • PA due next Fri

 

  • PG with rollouts: random search and REINFORCE
  • PG with value: Actor-Critic and baselines

 

  • Next lecture: Trust Regions