CS 4/5789: Introduction to Reinforcement Learning
Lecture 18: Trust Regions and Natural Policy Gradient
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- PA 3 and PSet 4 due Friday
- 5789 Paper Reviews due weekly on Mondays
- Office hours cancelled over break
Agenda
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
- Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
- Trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\) and distribution \(\mathbb P^{\pi}_{\mu_0}\)
- Cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)
- For parametric (e.g. deep) policy \(\pi_\theta\), the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right] $$
Recap: Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
- Goal: achieve high expected cumulative reward:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
- We can "rollout" policy \(\pi_\theta\) to observe:
  - a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\) or \(s,a\sim d^{\pi_\theta}_{\mu_0}\)
  - the resulting cumulative reward \(R(\tau)\)
- Note: we do not need to know \(P\) or \(r\)!
Recap: Policy Optimization Setting

Meta-Algorithm: Policy Optimization
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- Rollout policy
- Estimate \(\nabla J(\theta_i)\) as \(g_i\) using rollouts
- Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
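A minimal NumPy sketch of this meta-loop; the `estimate_gradient` callable is a hypothetical stand-in for whatever rollout-based estimator (e.g. REINFORCE) is used:

```python
import numpy as np

def policy_optimization(estimate_gradient, theta0, alpha=0.01, iters=100):
    """Generic meta-loop: rollout, estimate the gradient, take a step."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        g = estimate_gradient(theta)  # any estimator built from rollouts (e.g. REINFORCE)
        theta = theta + alpha * g     # plain gradient-ascent update
    return theta
```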
Recap: Policy Optimization
Today we will derive an alternative update: \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\)

[Diagram: two-state MDP with states \(0\) and \(1\); arrows labeled stay/switch with transition probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\)]
Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- What is optimal \(\theta_\star\)?
- PollEV
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
[Plot: \(\pi_\theta(\mathsf{stay}|1)\to 1\) as \(\theta\to+\infty\) and \(\pi_\theta(\mathsf{switch}|1)\to 1\) as \(\theta\to-\infty\)]
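For concreteness, a small sketch of this parametrized policy in code (the name `pi_theta` is just for illustration):

```python
import numpy as np

def pi_theta(theta, s):
    """Action probabilities for the two-state example policy."""
    if s == 0:
        return {"stay": 1.0, "switch": 0.0}          # state 0: always stay
    p_stay = np.exp(theta) / (1.0 + np.exp(theta))   # sigmoid(theta)
    return {"stay": p_stay, "switch": 1.0 - p_stay}

print(pi_theta(2.0, 1))   # stay w.p. ~0.88, switch w.p. ~0.12
```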
Agenda
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
Motivation: Trust Regions
- Recall: motivation of gradient ascent as first-order approximate maximization
- \(\max_\theta J(\theta) \approx \max_{\theta} J(\theta_0) + \nabla J(\theta_0)^\top (\theta-\theta_0)\)
- The maximum occurs when \(\theta-\theta_0\) is parallel to \(\nabla J(\theta_0)\)
- \(\theta - \theta_0 = \alpha \nabla J(\theta_0) \)
- Q: Why do we normally use a small step size \(\alpha\)?
- A: The linear approximation is only locally valid, so a small \(\alpha\) ensures that \(\theta\) is close to \(\theta_0\)
Motivation: Trust Regions
- Q: Why do we normally use a small step size \(\alpha\)?
- A: The linear approximation is only locally valid, so a small \(\alpha\) ensures that \(\theta\) is close to \(\theta_0\)
[Figure: a 2D quadratic function of \((\theta_1, \theta_2)\) and its level sets]
Trust Regions
- A trust region approach makes the intuition about step size more precise: $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
- The trust region is described by a bounded "distance" from \(\theta_0\)
- General notion of distance, "divergence"
- Another motivation relevant in RL: we might estimate \(J(\theta)\) using data collected with \(\theta_0\) (i.e. with policy \(\pi_{\theta_0}\))
- The estimate may only be good close to \(\theta_0\)
- e.g. in Conservative PI, recall the incremental update
- \(\pi^{i+1}(a|s) = (1-\alpha) \pi^i(a|s) + \alpha \bar\pi(a|s)\)
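A minimal sketch of that incremental mixing step, assuming policies are stored as \(|\mathcal S|\times|\mathcal A|\) probability tables:

```python
import numpy as np

def conservative_pi_update(pi_old, pi_bar, alpha):
    """Incremental mixing: pi_new(a|s) = (1 - alpha) pi_old(a|s) + alpha pi_bar(a|s)."""
    pi_new = (1.0 - alpha) * pi_old + alpha * pi_bar
    assert np.allclose(pi_new.sum(axis=1), 1.0)   # rows remain valid distributions
    return pi_new

pi_old = np.array([[1.0, 0.0], [0.5, 0.5]])   # |S| x |A| table
pi_bar = np.array([[1.0, 0.0], [1.0, 0.0]])   # greedy policy to mix toward
print(conservative_pi_update(pi_old, pi_bar, alpha=0.1))
```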
Agenda
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
KL Divergence
- Motivation: what is a good notion of distance?
- The KL Divergence measures the "distance" between two distributions
- Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
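A direct translation of this definition for finite \(\mathcal X\) (a sketch; it assumes \(Q(x)>0\) wherever \(P(x)>0\)):

```python
import numpy as np

def kl(p, q):
    """KL(P | Q) = sum_x P(x) log(P(x)/Q(x)) for distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with P(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl([0.9, 0.1], [0.5, 0.5]))  # about 0.368
```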
KL Divergence
- Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
- Example: if \(P,Q\) are Bernoullis with mean \(p,q\)
- then \(KL(P|Q) = p\log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\) (plot)
- Example: if \(P=\mathcal N(\mu_1, \sigma^2I)\) and \(Q=\mathcal N(\mu_2, \sigma^2I)\)
- then \(KL(P|Q) = \|\mu_1-\mu_2\|_2^2/(2\sigma^2)\)
- Fact: the KL divergence is always nonnegative, and it is zero if and only if \(P=Q\).
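A quick Monte Carlo sanity check of the Gaussian formula (the means, \(\sigma\), and sample size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma = np.array([1.0, 0.0]), np.array([0.0, 0.0]), 0.5

# Monte Carlo estimate of E_{x~P}[log P(x)/Q(x)] for isotropic Gaussians with equal sigma
x = rng.normal(mu1, sigma, size=(200_000, 2))
log_ratio = (np.sum((x - mu2) ** 2, axis=1) - np.sum((x - mu1) ** 2, axis=1)) / (2 * sigma ** 2)

print(log_ratio.mean())                             # roughly 2.0
print(np.sum((mu1 - mu2) ** 2) / (2 * sigma ** 2))  # closed form: 2.0
```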
KL Divergence for Policies
- We define a measure of "distance" between \(\pi_{\theta_0}\) and \(\pi_\theta\)
- KL divergence between action distributions
- averaged over states \(s\) from the discounted steady state distribution of \(\pi_{\theta_0}\)
- \(d_{KL}(\theta_0, \theta)= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[KL(\pi_{\theta_0}(\cdot |s)|\pi_\theta(\cdot |s))\right] \)
- \(= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\mathbb E_{a\sim \pi_{\theta_0}(\cdot|s)}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right]\right] \)
- we will use the shorthand \(s,a\sim d_{\mu_0}^{\pi_{\theta_0}}\)
- \(= \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}] \)
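A sketch of estimating \(d_{KL}(\theta_0,\theta)\) from such samples, using the two-state example policy; the batch of state-action pairs below is hypothetical:

```python
import numpy as np

def log_pi(theta, s, a):
    """log pi_theta(a|s) for the two-state example (a in {"stay", "switch"})."""
    if s == 0:
        return 0.0 if a == "stay" else -np.inf
    z = np.log(1.0 + np.exp(theta))
    return (theta - z) if a == "stay" else -z

def estimate_policy_kl(theta0, theta, states, actions):
    """Sample average of log pi_theta0(a|s) - log pi_theta(a|s), with s,a ~ d^{pi_theta0}."""
    return np.mean([log_pi(theta0, s, a) - log_pi(theta, s, a)
                    for s, a in zip(states, actions)])

# hypothetical batch of state-action samples from rollouts of pi_theta0
states, actions = [0, 1, 1, 1], ["stay", "stay", "switch", "stay"]
print(estimate_policy_kl(0.0, 1.0, states, actions))
```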

Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- Distance between \(\theta_0\) and \(\theta\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(0,\)stay\()\cdot \log \frac{\pi_{\theta_0}(stay|0)}{\pi_\theta(stay|0)}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(0,\)switch\()\cdot \log \frac{\pi_{\theta_0}(switch|0)}{\pi_\theta(switch|0)}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1,\)stay\()\cdot \log \frac{\pi_{\theta_0}(stay|1)}{\pi_\theta(stay|1)}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1,\)switch\()\cdot \log \frac{\pi_{\theta_0}(switch|1)}{\pi_\theta(switch|1)}\)
- \(d_{KL}(\theta_0, \theta)\) is the sum of these four terms
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- Distance between \(\theta_0\) and \(\theta\)
- the two \(s=0\) terms are \(0\): the log-ratio is zero for stay, and switch is never taken in state \(0\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1,\)stay\()\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1,\)switch\()\cdot \log \frac{1+\exp \theta}{1+\exp \theta_0} \)
- \(d_{KL}(\theta_0, \theta)\) is the sum of these terms
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- Distance between \(\theta_0\) and \(\theta\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1)\, \pi_{\theta_0}(\mathsf{stay}|1)\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1)\, \pi_{\theta_0}(\mathsf{switch}|1) \cdot \log \frac{1+\exp \theta}{1+\exp \theta_0} \)
- summing, \(d_{KL}(\theta_0,\theta) = d_{\mu_0}^{\pi_{\theta_0}}(1) \left(\log \frac{1+\exp \theta}{1+\exp \theta_0} + \frac{\exp \theta_0}{1+\exp \theta_0}(\theta_0-\theta)\right)\)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
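A quick numerical check of this simplification against the direct per-state KL (the values of \(\theta_0,\theta\) are arbitrary, and \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) only scales both sides):

```python
import numpy as np

def sigmoid(t):
    return np.exp(t) / (1.0 + np.exp(t))

theta0, theta, d1 = 0.3, 1.7, 1.0   # d1 stands in for d(1); it just scales the result

# direct KL between the two Bernoulli action distributions at s = 1
p0, p = sigmoid(theta0), sigmoid(theta)
direct = d1 * (p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p)))

# simplified closed form from the slide
closed = d1 * (np.log((1 + np.exp(theta)) / (1 + np.exp(theta0))) + p0 * (theta0 - theta))

print(np.isclose(direct, closed))   # True
```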
Agenda
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
Natural Policy Gradient
- We will derive the update $$ \theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i $$
- This is called natural policy gradient
- Intuition: the gradient \(g_i\) is "preconditioned" by \(F_i^{-1}\), so the update adapts to the geometry of the policy class
- We derive this update as approximating $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta, \theta_0)<\delta$$
[Figure: level sets of a quadratic, comparing the first-order (gradient) direction \(g_i\) with the preconditioned direction \(F_i^{-1}g_i\) from the second-order approximation]
Second order Divergence Approx
- Second order approximation of $$\ell(\theta) = d_{KL}(\theta_0,\theta) = \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right] $$
- Given by $$\ell(\theta_0) + \nabla \ell(\theta_0)^\top (\theta-\theta_0) + \frac{1}{2}(\theta-\theta_0)^\top \nabla^2 \ell(\theta_0) (\theta-\theta_0)$$
- Claim: Zeroth and first order terms are zero, \(\ell(\theta_0) = 0\), \(\nabla \ell(\theta_0)=0\), and the second order term (Hessian) is $$\nabla^2\ell(\theta_0) = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[ \log \pi_\theta(a|s)]_{\theta=\theta_0} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_0}^\top ]$$
- The Hessian is known as the Fisher information matrix
For proof of claim, refer to
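As a sanity check of the claim on the two-state example policy (per state \(s=1\), so the \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) weight is dropped from both sides), a finite-difference sketch:

```python
import numpy as np

def sigmoid(t):
    return np.exp(t) / (1.0 + np.exp(t))

def kl_state1(theta0, theta):
    """KL(pi_theta0(.|1) | pi_theta(.|1)); the s=0 term of l(theta) is identically zero."""
    p0, p = sigmoid(theta0), sigmoid(theta)
    return p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))

theta0, eps = 0.4, 1e-3
# finite-difference second derivative of l(theta) at theta = theta0 (l and its gradient vanish there)
hess = (kl_state1(theta0, theta0 + eps) - 2 * kl_state1(theta0, theta0)
        + kl_state1(theta0, theta0 - eps)) / eps ** 2
fisher = sigmoid(theta0) * (1 - sigmoid(theta0))   # E_a[(d/dtheta log pi_theta(a|1))^2]
print(hess, fisher)   # both approximately 0.240
```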

Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- Fisher information (a scalar here, since \(\theta\in\mathbb R\))
- \(\nabla\log \pi_\theta(a|s) = \begin{cases}0 & s=0\\ \frac{\exp \theta}{(1+\exp \theta)^2}\cdot \frac{1+\exp \theta}{\exp \theta} & s=1,a=\mathsf{stay} \\ \frac{-\exp \theta}{(1+\exp \theta)^2} \cdot \frac{1+\exp \theta}{1} & s=1,a=\mathsf{switch} \end{cases}\)
- \(F_0 = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[\log \pi_\theta(a|s)]^2_{\theta=\theta_0} ]\)
- \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{1+\exp \theta_0} \cdot \frac{1}{(1+\exp \theta_0)^2} + \frac{1}{1+\exp \theta_0}\cdot \frac{(-\exp \theta_0)^2}{(1+\exp \theta_0)^2}\right)\)
- \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{(1+\exp \theta_0)^2} \right)\)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
Constrained Optimization
- Our approximation to \(\max_\theta J(\theta)~ \text{s.t.} ~d(\theta, \theta_0)<\delta\) is (absorbing the factor \(\frac{1}{2}\) from the quadratic term into \(\delta\)) $$\max_\theta\quad g_0^\top(\theta-\theta_0) \quad \text{s.t.}\quad (\theta-\theta_0)^\top F_{0} (\theta-\theta_0)<\delta$$
- Claim: The maximum has the closed form expression $$\theta_\star =\theta_0+\alpha F_0^{-1}g_0$$ where \(\alpha = (\delta /g_0^\top F_0^{-1} g_0)^{1/2}\)
- Proof outline:
- Start with solving \(\max c^\top v \) s.t. \(\|v\|_2^2\leq \delta\)
- Consider change of variables $$v=F_0^{1/2}(\theta-\theta_0),\quad c=F_0^{-1/2}g_0$$
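A quick numerical check of this claim with an arbitrary positive-definite \(F_0\) and gradient \(g_0\):

```python
import numpy as np

rng = np.random.default_rng(1)

# arbitrary positive-definite F_0 and gradient g_0 (3-dimensional for illustration)
A = rng.normal(size=(3, 3))
F0 = A @ A.T + np.eye(3)
g0 = rng.normal(size=3)
delta = 0.1

# claimed maximizer: step along F0^{-1} g0, scaled so the constraint is tight
step = np.linalg.solve(F0, g0)
alpha = np.sqrt(delta / (g0 @ step))
dtheta = alpha * step
print(np.isclose(dtheta @ F0 @ dtheta, delta))   # lies on the constraint boundary

# no random feasible direction achieves a larger linear objective g0^T (theta - theta0)
best = max(g0 @ (v * np.sqrt(delta / (v @ F0 @ v)))
           for v in rng.normal(size=(2000, 3)))
print(g0 @ dtheta >= best)   # True
```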
Natural Policy Gradient
Algorithm: Natural PG
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- Rollout policy
- Estimate \(\nabla J(\theta_i)\) with \(g_i\) using rollouts (REINFORCE, value, etc)
- Estimate the Fisher information matrix from the same rollouts: $$F_i = \mathbb E_{s,a\sim d^{\pi_{\theta_i}}_{\mu_0}}\left[\nabla \log \pi_{\theta_i}(a|s)\, \nabla \log \pi_{\theta_i}(a|s)^\top\right]$$
- Update \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\)
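A minimal sketch of one such update step, assuming the per-sample score vectors \(\nabla\log\pi_{\theta_i}(a|s)\) and the gradient estimate \(g_i\) have already been computed from rollouts (the ridge term is an implementation convenience, not part of the algorithm above):

```python
import numpy as np

def npg_step(theta, score_samples, g, alpha=0.1, reg=1e-3):
    """One natural PG update from sampled score vectors grad log pi_theta(a|s).

    score_samples: (num_samples, dim) array with s,a drawn from rollouts of pi_theta.
    g: policy-gradient estimate (REINFORCE, advantage-based, etc.), shape (dim,).
    """
    F = score_samples.T @ score_samples / len(score_samples)  # sample Fisher matrix
    F = F + reg * np.eye(len(theta))                          # small ridge so F is invertible
    return theta + alpha * np.linalg.solve(F, g)              # theta + alpha F^{-1} g
```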

Example
- Parametrized policy: \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- NPG: \(\theta_1=\theta_0 + \alpha \frac{1}{F_0}g_0\); GA: \(\theta_1=\theta_0 + \alpha g_0 \)
- \(F_0 \propto \frac{\exp \theta_0}{(1+\exp \theta_0)^2}\to 0\) as \(\theta_0\to\pm\infty\)
- NPG takes bigger and bigger steps as \(\theta\) becomes more extreme
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
[Plot: \(\pi_\theta(\mathsf{stay}|1)\to 1\) as \(\theta\to+\infty\) and \(\pi_\theta(\mathsf{switch}|1)\to 1\) as \(\theta\to-\infty\)]
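A tiny numerical illustration of this effect, dropping the \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) factor:

```python
import numpy as np

def fisher_scalar(theta):
    """F_0 up to the d(1) factor: exp(theta) / (1 + exp(theta))^2."""
    return np.exp(theta) / (1.0 + np.exp(theta)) ** 2

for theta0 in [0.0, 2.0, 4.0, 8.0]:
    print(theta0, 1.0 / fisher_scalar(theta0))   # the NPG rescaling 1/F_0 grows rapidly
```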
Recap
- PSet due Fri
- PA due Fri
- Trust Regions
- KL Divergence
- Natural Policy Gradient
- Next lecture: Unit 3, Exploration
Sp23 CS 4/5789: Lecture 18
By Sarah Dean