CS 4/5789: Introduction to Reinforcement Learning

Lecture 18: Trust Regions and Natural Policy Gradient

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • PA 3 and PSet 4 due Friday
    • 5789 Paper Reviews due weekly on Mondays
  • Office hours cancelled over break

Agenda

1. Recap

2. Trust Regions

3. KL Divergence

4. Natural PG

Recap: Policy Optimization Setting

[Figure: state space \(\mathcal S\) and action space \(\mathcal A\)]

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$

  • Trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\) and distribution \(\mathbb P^{\pi}_{\mu_0}\)
  • Cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)
  • For parametric (e.g. deep) policy \(\pi_\theta\), the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right] $$

Recap: Policy Optimization Setting

[Figure: state space \(\mathcal S\) and action space \(\mathcal A\)]

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

  • We can "rollout" policy \(\pi_\theta\) to observe:

    • a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\) or \(s,a\sim d^{\pi_\theta}_{\mu_0}\)

    • the resulting cumulative reward \(R(\tau)\)

  • Note: we do not need to know \(P\) or \(r\)!
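
As a concrete illustration of the rollout-based view above, here is a minimal Monte Carlo sketch of estimating \(J(\theta)\). The `env.reset`/`env.step` interface and the `policy` sampler are placeholder assumptions for illustration, not part of the lecture.

```python
import numpy as np

def estimate_J(env, policy, gamma=0.99, n_rollouts=100, horizon=500):
    """Monte Carlo estimate of J(theta) = E[sum_t gamma^t r(s_t, a_t)] from rollouts.

    Assumes a reset/step environment interface and a `policy(s)` that samples
    an action; the infinite-horizon sum is truncated at `horizon` (small bias).
    """
    returns = []
    for _ in range(n_rollouts):
        s = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = env.step(a)  # note: P and r are never accessed directly
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```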

Recap: Policy Optimization

Meta-Algorithm: Policy Optimization

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy
    • Estimate \(\nabla J(\theta_i)\) as \(g_i\) using rollouts
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)

Today we will derive an alternative update: \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\)
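
A minimal sketch of this meta-algorithm, assuming hypothetical helpers `rollout` and `estimate_gradient` (e.g. REINFORCE) built from rollouts; the function names are placeholders, not a prescribed API.

```python
import numpy as np

def policy_optimization(theta0, rollout, estimate_gradient, alpha=0.01, iters=1000):
    """Generic policy optimization loop: rollout, estimate gradient, ascend.

    `rollout(theta)` collects trajectories with pi_theta and
    `estimate_gradient(theta, data)` returns g_i, an estimate of grad J(theta_i).
    """
    theta = np.array(theta0, dtype=float)
    for i in range(iters):
        data = rollout(theta)                 # trajectories from pi_theta
        g = estimate_gradient(theta, data)    # g_i ~ grad J(theta_i)
        theta = theta + alpha * g             # plain gradient ascent step
        # Natural PG (today): theta = theta + alpha * np.linalg.solve(F, g)
    return theta
```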

Example

[Figure: two-state MDP with states \(0\) and \(1\); arrows are labeled by action (stay/switch) and transition probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\)]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • What is optimal \(\theta_\star\)?
    • PollEV

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

[Figure: \(\pi_\theta(\cdot|1)\) as \(\theta\) ranges from \(-\infty\) (always switch) to \(+\infty\) (always stay)]
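
To make the parametrization concrete, here is a small sketch (not from the slides) that evaluates \(\pi_\theta(\cdot|1)\) for a few values of \(\theta\), confirming the limiting behavior in the figure above.

```python
import numpy as np

def pi_theta_state1(theta):
    """Action distribution at state 1 under the sigmoid parametrization."""
    p_stay = np.exp(theta) / (1.0 + np.exp(theta))
    return {"stay": p_stay, "switch": 1.0 - p_stay}

for theta in [-5.0, 0.0, 5.0]:
    print(theta, pi_theta_state1(theta))
# theta -> +inf: stay w.p. ~1; theta -> -inf: switch w.p. ~1
```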

Agenda

1. Recap

2. Trust Regions

3. KL Divergence

4. Natural PG

Motivation: Trust Regions

  • Recall: motivation of gradient ascent as first-order approximate maximization
    • \(\max_\theta J(\theta) \approx \max_{\theta} J(\theta_0) + \nabla J(\theta_0)^\top (\theta-\theta_0)\)
  • The maximum occurs when \(\theta-\theta_0\) is parallel to \(\nabla J(\theta_0)\)
    • \(\theta - \theta_0 = \alpha \nabla J(\theta_0) \)
  • Q: Why do we normally use a small step size \(\alpha\)?
    • A: The linear approximation is only locally valid, so a small \(\alpha\) ensures that \(\theta\) is close to \(\theta_0\)
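
As a quick numerical illustration of why the step size must be small (a toy example of my own, not from the slides): for a quadratic objective, the linear model matches \(J\) only near \(\theta_0\).

```python
def J(theta):           # toy concave objective, assumed only for illustration
    return -theta**2

def grad_J(theta):
    return -2.0 * theta

theta0 = 1.0
g = grad_J(theta0)
for alpha in [0.01, 0.1, 1.0]:
    theta = theta0 + alpha * g
    linear = J(theta0) + g * (theta - theta0)   # first-order model around theta0
    print(f"alpha={alpha}: J={J(theta):.3f}, linear approx={linear:.3f}")
# Small alpha: the two agree; large alpha: the linear model badly overestimates J.
```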

Motivation: Trust Regions

  • Q: Why do we normally use a small step size \(\alpha\)?
    • A: The linear approximation is only locally valid, so a small \(\alpha\) ensures that \(\theta\) is close to \(\theta_0\)

[Figure: a 1D parabola and a 2D quadratic function with its level sets in the \((\theta_1, \theta_2)\) plane]

Trust Regions

  • A trust region approach makes the intuition about step size more precise: $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
  • The trust region is described by a bounded "distance" from \(\theta_0\)
    • General notion of distance, "divergence"
  • Another motivation relevant in RL: we might estimate \(J(\theta)\) using data collected with \(\theta_0\) (i.e. with policy \(\pi_{\theta_0}\))
    • The estimate may only be good close to \(\theta_0\)
    • e.g. in Conservative PI, recall the incremental update
      • \(\pi^{i+1}(a|s) = (1-\alpha) \pi^i(a|s) + \alpha \bar\pi(a|s)\)
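
A minimal sketch of that incremental (Conservative PI) update on tabular action distributions, assuming policies are stored as `[n_states, n_actions]` arrays (my notation, not the lecture's).

```python
import numpy as np

def conservative_pi_update(pi_i, pi_bar, alpha):
    """Incremental update pi^{i+1} = (1 - alpha) pi^i + alpha pi_bar.

    Policies are arrays of shape [n_states, n_actions] whose rows sum to 1;
    the convex combination keeps the new policy close to the old one for small alpha.
    """
    assert 0.0 <= alpha <= 1.0
    return (1.0 - alpha) * np.asarray(pi_i, float) + alpha * np.asarray(pi_bar, float)
```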

Agenda

1. Recap

2. Trust Regions

3. KL Divergence

4. Natural PG

KL Divergence

  • Motivation: what is a good notion of distance?
  • The KL Divergence measures the "distance" between two distributions
  • Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$

KL Divergence

  • Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
  • Example: if \(P,Q\) are Bernoullis with mean \(p,q\)
    • then \(KL(P|Q) = p\log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\) (plot)
  • Example: if \(P=\mathcal N(\mu_1, \sigma^2I)\) and \(Q=\mathcal N(\mu_2, \sigma^2I)\)
    • then \(KL(P|Q) = \|\mu_1-\mu_2\|_2^2/(2\sigma^2)\)
  • Fact: KL is always nonnegative, and it equals zero if and only if \(P=Q\).
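
A small numerical sketch of the definition and the Bernoulli example above (the array names are mine).

```python
import numpy as np

def kl(P, Q):
    """KL(P | Q) = sum_x P(x) log(P(x)/Q(x)) for discrete distributions.

    Terms with P(x) = 0 contribute 0 by convention.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Bernoulli example: p log(p/q) + (1-p) log((1-p)/(1-q))
p, q = 0.3, 0.7
print(kl([p, 1 - p], [q, 1 - q]))                    # direct from the definition
print(p*np.log(p/q) + (1-p)*np.log((1-p)/(1-q)))     # closed form, same value
```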

KL Divergence for Policies

  • We define a measure of "distance" between \(\pi_{\theta_0}\) and \(\pi_\theta\)
    • KL divergence between action distributions
    • averaged over states \(s\) from the discounted steady state distribution of \(\pi_{\theta_0}\)
  • \(d_{KL}(\theta_0, \theta)= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[KL(\pi_{\theta_0}(\cdot |s)|\pi_\theta(\cdot |s))\right] \)
    • \(= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\mathbb E_{a\sim \pi_{\theta_0}(\cdot|s)}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right]\right] \)
    • we will use the shorthand \(s,a\sim d_{\mu_0}^{\pi_{\theta_0}}\)
    • \(= \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}] \)
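
Building on the KL helper above, a hedged sketch of \(d_{KL}(\theta_0,\theta)\): average the per-state KL over states sampled from the discounted visitation distribution of \(\pi_{\theta_0}\). The `sample_states` and `pi` callables are assumed stand-ins for rollouts and the policy class.

```python
import numpy as np

def d_kl(theta0, theta, pi, sample_states, n_samples=1000):
    """Estimate d_KL(theta0, theta) = E_{s ~ d^{pi_theta0}}[ KL(pi_theta0(.|s) | pi_theta(.|s)) ].

    `pi(theta, s)` returns the action distribution at s as a 1D array (assumed
    strictly positive, e.g. a softmax policy); `sample_states()` yields states
    s ~ d^{pi_theta0}_{mu0}, e.g. collected from rollouts of pi_theta0.
    """
    total = 0.0
    for _ in range(n_samples):
        s = sample_states()
        P, Q = pi(theta0, s), pi(theta, s)
        total += float(np.sum(P * np.log(P / Q)))   # per-state KL
    return total / n_samples
```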

Example

[Figure: two-state MDP diagram, as above]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Distance between \(\theta_0\) and \(\theta\):
    $$d_{KL}(\theta_0, \theta) = d_{\mu_0}^{\pi_{\theta_0}}(0,\mathsf{stay}) \log \frac{\pi_{\theta_0}(\mathsf{stay}|0)}{\pi_\theta(\mathsf{stay}|0)} + d_{\mu_0}^{\pi_{\theta_0}}(0,\mathsf{switch}) \log \frac{\pi_{\theta_0}(\mathsf{switch}|0)}{\pi_\theta(\mathsf{switch}|0)} + d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{stay}) \log \frac{\pi_{\theta_0}(\mathsf{stay}|1)}{\pi_\theta(\mathsf{stay}|1)} + d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{switch}) \log \frac{\pi_{\theta_0}(\mathsf{switch}|1)}{\pi_\theta(\mathsf{switch}|1)}$$

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch


Example

[Figure: two-state MDP diagram, as above]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Distance between \(\theta_0\) and \(\theta\): the \(s=0\) terms are zero (since \(\pi_\theta(\mathsf{stay}|0)=1\) for all \(\theta\)), so
    $$d_{KL}(\theta_0, \theta) = d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{stay}) \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta} + d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{switch}) \log \frac{1+\exp \theta}{1+\exp \theta_0}$$

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch


Example

[Figure: two-state MDP diagram, as above]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Distance between \(\theta_0\) and \(\theta\), using \(d_{\mu_0}^{\pi_{\theta_0}}(1,a) = d_{\mu_0}^{\pi_{\theta_0}}(1)\, \pi_{\theta_0}(a|1)\):
    $$d_{KL}(\theta_0, \theta) = d_{\mu_0}^{\pi_{\theta_0}}(1) \left(\pi_{\theta_0}(\mathsf{stay}|1) \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta} + \pi_{\theta_0}(\mathsf{switch}|1) \log \frac{1+\exp \theta}{1+\exp \theta_0}\right)$$
    $$= d_{\mu_0}^{\pi_{\theta_0}}(1) \left(\log \frac{1+\exp \theta}{1+\exp \theta_0} + \frac{\exp \theta_0}{1+\exp \theta_0}\,(\theta_0 - \theta)\right)$$

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
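
A quick numeric sanity check of this closed form (my own sketch): compare it against a direct computation of the per-state KL at \(s=1\), dropping the common \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) factor.

```python
import numpy as np

def pi_stay(theta):
    """pi_theta(stay | 1) under the sigmoid parametrization."""
    return np.exp(theta) / (1.0 + np.exp(theta))

def kl_state1_direct(theta0, theta):
    """KL(pi_theta0(.|1) | pi_theta(.|1)) computed straight from the definition."""
    p0, p = pi_stay(theta0), pi_stay(theta)
    return p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))

def kl_state1_closed_form(theta0, theta):
    """Simplified form: log((1+e^theta)/(1+e^theta0)) + sigmoid(theta0)*(theta0-theta)."""
    return np.log((1 + np.exp(theta)) / (1 + np.exp(theta0))) + pi_stay(theta0) * (theta0 - theta)

theta0, theta = 0.5, -1.2
print(kl_state1_direct(theta0, theta), kl_state1_closed_form(theta0, theta))  # should agree
```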


Agenda

1. Recap

2. Trust Regions

3. KL Divergence

4. Natural PG

Natural Policy Gradient

  • We will derive the update $$ \theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i $$
  • This is called natural policy gradient
  • Intuition: update direction \(g_i\) is "preconditioned" by a matrix \(F_i\) and adapts to geometry

  • We derive this update as approximating $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta, \theta_0)<\delta$$

[Figure: level sets of a quadratic; the first-order approximation gives the gradient direction \(g_i\), while the second-order approximation gives the natural direction \(F_i^{-1}g_i\)]

Second order Divergence Approx

  • Second order approximation of $$\ell(\theta) = d_{KL}(\theta_0,\theta) = \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right] $$
  • Given by $$\ell(\theta_0) + \nabla \ell(\theta_0)^\top (\theta-\theta_0) + \tfrac{1}{2}(\theta-\theta_0)^\top \nabla^2 \ell(\theta_0) (\theta-\theta_0)$$
  • Claim: the zeroth- and first-order terms vanish, \(\ell(\theta_0) = 0\) and \(\nabla \ell(\theta_0)=0\), while the second-order term (Hessian) is $$\nabla^2\ell(\theta_0) = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[ \log \pi_\theta(a|s)]_{\theta=\theta_0} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_0}^\top ]$$
  • The Hessian is known as the Fisher information matrix
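
A brief sketch of why the claim holds (standard score-function identities, filled in here for completeness). The zeroth-order term is \(\ell(\theta_0) = \mathbb E[\log(\pi_{\theta_0}/\pi_{\theta_0})] = 0\). Since \(\sum_a \pi_\theta(a|s) = 1\) for every \(\theta\),
$$\nabla \ell(\theta_0) = -\mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\Big[\sum_a \pi_{\theta_0}(a|s)\,\nabla_\theta \log\pi_\theta(a|s)\big|_{\theta_0}\Big] = -\mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\Big[\nabla_\theta \sum_a \pi_\theta(a|s)\big|_{\theta_0}\Big] = 0,$$
and, using \(\nabla^2 \log\pi_\theta = \frac{\nabla^2 \pi_\theta}{\pi_\theta} - \nabla\log\pi_\theta\,\nabla\log\pi_\theta^\top\) with the first term again averaging to zero under \(a\sim\pi_{\theta_0}(\cdot|s)\),
$$\nabla^2 \ell(\theta_0) = -\mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}\big[\nabla^2_\theta \log\pi_\theta(a|s)\big|_{\theta_0}\big] = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}\big[\nabla\log\pi_{\theta_0}(a|s)\,\nabla\log\pi_{\theta_0}(a|s)^\top\big].$$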

Example

[Figure: two-state MDP diagram, as above]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Fisher information matrix (a scalar here, since \(\theta\in\mathbb R\))
    • \(\nabla\log \pi_\theta(a|s) = \begin{cases}0 & s=0\\  \frac{\exp \theta}{(1+\exp \theta)^2}\cdot  \frac{1+\exp \theta}{\exp \theta} & s=1,a=\mathsf{stay}  \\  \frac{-\exp \theta}{(1+\exp \theta)^2} \cdot  \frac{1+\exp \theta}{1} & s=1,a=\mathsf{switch} \end{cases}\)
  • \(F_0 = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[\log \pi_\theta(a|s)]^2_{\theta=\theta_0} ]\)
    • \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{1+\exp \theta_0} \cdot \frac{1}{(1+\exp \theta_0)^2} + \frac{1}{1+\exp \theta_0}\cdot \frac{(-\exp \theta_0)^2}{(1+\exp \theta_0)^2}\right)\)
    • \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left(  \frac{\exp \theta_0}{(1+\exp \theta_0)^2} \right)\)

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
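
A small numeric check of this calculation (my own sketch, dropping the \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) factor): average the squared score over \(a\sim\pi_{\theta_0}(\cdot|1)\) and compare with \(\frac{\exp\theta_0}{(1+\exp\theta_0)^2}\).

```python
import numpy as np

def fisher_state1(theta0):
    """E_{a ~ pi_theta0(.|1)}[ (d/dtheta log pi_theta(a|1))^2 ] for the sigmoid policy."""
    p_stay = np.exp(theta0) / (1 + np.exp(theta0))
    score_stay = 1.0 / (1 + np.exp(theta0))                 # d/dtheta log pi(stay|1)
    score_switch = -np.exp(theta0) / (1 + np.exp(theta0))   # d/dtheta log pi(switch|1)
    return p_stay * score_stay**2 + (1 - p_stay) * score_switch**2

theta0 = 0.7
print(fisher_state1(theta0))
print(np.exp(theta0) / (1 + np.exp(theta0))**2)  # closed form from the slide; should match
```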

Constrained Optimization

  • Our approximation to \(\max_\theta J(\theta)~ \text{s.t.} ~d(\theta, \theta_0)<\delta\) is $$\max_\theta\quad g_0^\top(\theta-\theta_0) \quad \text{s.t.}\quad (\theta-\theta_0)^\top F_{0} (\theta-\theta_0)\leq\delta$$ (absorbing the factor of \(\tfrac{1}{2}\) from the second-order approximation into \(\delta\))
  • Claim: The maximum has the closed form expression $$\theta_\star =\theta_0+\alpha F_0^{-1}g_0$$ where \(\alpha = (\delta /g_0^\top F_0^{-1} g_0)^{1/2}\)
  • Proof outline:
    • Start with solving \(\max c^\top v \) s.t. \(\|v\|_2^2\leq \delta\)
    • Consider change of variables $$v=F_0^{1/2}(\theta-\theta_0),\quad c=F_0^{-1/2}g_0$$
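
Filling in the outline (added for completeness): with \(\|v\|_2^2 \leq \delta\), the linear objective \(c^\top v\) is maximized at \(v_\star = \sqrt{\delta}\, c/\|c\|_2\). Under the change of variables, \(g_0^\top(\theta-\theta_0) = c^\top v\) and \((\theta-\theta_0)^\top F_0(\theta-\theta_0) = \|v\|_2^2\), so substituting back gives
$$\theta_\star - \theta_0 = F_0^{-1/2} v_\star = \sqrt{\delta}\,\frac{F_0^{-1/2}\, F_0^{-1/2} g_0}{\|F_0^{-1/2} g_0\|_2} = \left(\frac{\delta}{g_0^\top F_0^{-1} g_0}\right)^{1/2} F_0^{-1} g_0.$$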

Natural Policy Gradient

Algorithm: Natural PG

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy
    • Estimate \(\nabla J(\theta_i)\) with \(g_i\) using rollouts (REINFORCE, value, etc)
    • Estimate the Fisher Information Matrix $$F_i = \mathbb E_{s,a\sim d^{\pi_{\theta_i}}_{\mu_0}}\left[\nabla \log \pi_{\theta_i}(a|s) \nabla \log \pi_{\theta_i}(a|s)^\top\right]$$
    • Update \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\)
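
A minimal sketch of this loop, assuming hypothetical helpers `rollout`, `estimate_gradient`, and `grad_log_pi` (none of these names come from the course code); the Fisher matrix is estimated as a sample average of score outer products.

```python
import numpy as np

def natural_policy_gradient(theta0, rollout, estimate_gradient, grad_log_pi,
                            alpha=0.1, iters=100, reg=1e-3):
    """Natural PG: precondition the gradient estimate by the inverse Fisher matrix."""
    theta = np.array(theta0, dtype=float)
    for i in range(iters):
        states, actions = rollout(theta)            # samples s, a ~ d^{pi_theta}
        g = estimate_gradient(theta, states, actions)
        # Sample-average Fisher matrix: outer products of the score function
        scores = np.stack([grad_log_pi(theta, s, a) for s, a in zip(states, actions)])
        F = scores.T @ scores / len(scores) + reg * np.eye(len(theta))  # regularize for invertibility
        theta = theta + alpha * np.linalg.solve(F, g)  # theta_{i+1} = theta_i + alpha F^{-1} g
    return theta
```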

Example

[Figure: two-state MDP diagram, as above]

  • Parametrized policy: \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • NPG: \(\theta_1=\theta_0 + \alpha \frac{1}{F_0}g_0\); GA: \(\theta_1=\theta_0 + \alpha g_0 \)
  • \(F_0 \propto  \frac{\exp \theta_0}{(1+\exp \theta_0)^2}\to 0\) as \(\theta_0\to\pm\infty\)
  • Since \(F_0\to 0\), NPG takes bigger and bigger steps (relative to GA) as \(\theta\) becomes more extreme

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
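
A tiny numeric illustration of this effect (my own sketch, using an arbitrary gradient estimate \(g_0=0.1\) and ignoring the \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) factor):

```python
import numpy as np

g0, alpha = 0.1, 0.5   # arbitrary gradient estimate and step size, for illustration only

for theta0 in [0.0, 3.0, 6.0]:
    F0 = np.exp(theta0) / (1 + np.exp(theta0))**2   # Fisher "matrix" (a scalar here)
    ga_step = alpha * g0                            # gradient ascent step
    npg_step = alpha * g0 / F0                      # natural PG step
    print(f"theta0={theta0}: F0={F0:.4f}, GA step={ga_step:.3f}, NPG step={npg_step:.3f}")
# As theta0 grows, F0 -> 0, so the NPG step grows while the GA step stays fixed.
```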

[Figure: \(\pi_\theta(\cdot|1)\) as \(\theta\) ranges from \(-\infty\) (always switch) to \(+\infty\) (always stay)]

Recap

  • PSet due Fri
  • PA due Fri

 

  • Trust Regions
  • KL Divergence
  • Natural Policy Gradient

 

  • Next lecture: Unit 3, Exploration