Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
For a parameterized policy \(\pi_\theta\), the goal becomes:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
where \(R(\tau)=\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\) is the cumulative reward of trajectory \(\tau\) and \(\mathbb P^{\pi_\theta}_{\mu_0}\) is the trajectory distribution induced by \(\pi_\theta\) from initial distribution \(\mu_0\).
We can "rollout" policy \(\pi_\theta\) to observe:
a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\) or \(s,a\sim d^{\pi_\theta}_{\mu_0}\)
the resulting cumulative reward \(R(\tau)\)
Note: we do not need to know \(P\) or \(r\)!
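To make this access model concrete, here is a minimal Monte Carlo sketch of estimating \(J(\theta)\) from rollouts. The `env.reset`/`env.step` interface and the `policy` callable are illustrative assumptions; note that only sampled states, actions, and rewards are used, never \(P\) or \(r\) in closed form.

```python
import numpy as np

def estimate_J(env, policy, gamma=0.99, horizon=500, n_rollouts=100):
    """Monte Carlo estimate of J(theta) from rollouts.

    Assumed (hypothetical) interface: env.reset() -> s,
    env.step(a) -> (s, r, done), policy(s) -> sampled action.
    """
    returns = []
    for _ in range(n_rollouts):
        s = env.reset()
        R, discount = 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite horizon
            a = policy(s)
            s, r, done = env.step(a)
            R += discount * r             # accumulate gamma^t * r(s_t, a_t)
            discount *= gamma
            if done:
                break
        returns.append(R)
    return np.mean(returns)               # sample average approximates J(theta)
```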
Meta-Algorithm: Policy Optimization
Initialize \(\theta_0\); for \(i=0,1,2,\dots\): roll out \(\pi_{\theta_i}\), estimate the gradient \(g_i \approx \nabla J(\theta_i)\), and update \(\theta_{i+1} = \theta_i + \alpha g_i\).
Today we will derive an alternative update: \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\), where \(F_i\) is the Fisher information matrix of the policy.
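In code, the only change from the vanilla update is a linear solve against \(F_i\). A minimal sketch, assuming \(g_i\) and \(F_i\) have already been estimated; the damping term is a common practical addition, not part of the idealized update:

```python
import numpy as np

def vanilla_step(theta, g, alpha=0.1):
    # theta_{i+1} = theta_i + alpha * g_i
    return theta + alpha * g

def natural_step(theta, g, F, alpha=0.1, damping=1e-3):
    # theta_{i+1} = theta_i + alpha * F_i^{-1} g_i
    # Solve (F + damping*I) x = g rather than forming the inverse explicitly.
    x = np.linalg.solve(F + damping * np.eye(len(g)), g)
    return theta + alpha * x
```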
[Figure: two-state MDP example. States \(0\) and \(1\); actions stay and switch, with the policy's stay/switch probabilities parameterized by \(p_1\) and \(p_2\). Reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch.]
[Plot: the policy's preference for stay vs. switch as the parameter \(\theta\) ranges from \(-\infty\) to \(+\infty\).]
2. Trust Regions
[Figure: a 2D quadratic function in \((\theta_1, \theta_2)\) and the level sets of the quadratic.]
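The figure's point can be reproduced numerically: on an ill-conditioned quadratic, plain gradient steps crawl along the flat direction of the level sets, while a step preconditioned by the curvature heads straight to the minimum. A small sketch (the specific matrix \(A\) and step size are illustrative choices):

```python
import numpy as np

A = np.diag([1.0, 25.0])          # ill-conditioned 2D quadratic f(x) = 0.5 x^T A x
grad = lambda x: A @ x

x_gd = np.array([1.0, 1.0])
x_pre = np.array([1.0, 1.0])
alpha = 0.03                       # must be small to stay stable in the steep direction
for _ in range(50):
    x_gd = x_gd - alpha * grad(x_gd)                  # plain gradient step
    x_pre = x_pre - np.linalg.solve(A, grad(x_pre))   # curvature-preconditioned step

print(x_gd)    # still far from the optimum in the flat direction
print(x_pre)   # reaches the minimum after a single preconditioned step
```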
3. KL Divergence
[Figure: three copies of the two-state MDP example (states \(0\) and \(1\); actions stay/switch with probabilities \(p_1\), \(p_2\); reward \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch), joined by \(+\) signs.]
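In this example the policy's action distribution at each state is a Bernoulli over {stay, switch}, so the KL divergence between two policies at a state has a simple closed form, and per-state terms can be summed across states, which is what the repeated diagrams joined by \(+\) suggest. A minimal sketch:

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL( Bern(p) || Bern(q) ) in nats, for 0 < p, q < 1."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# KL between two policies at a single state, e.g. P(stay) = p vs. p'
print(kl_bernoulli(0.9, 0.8))   # small: the policies act similarly
print(kl_bernoulli(0.9, 0.1))   # large: the policies disagree sharply
```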
4. Natural PG
[Figure: first-order approximation (gradient \(g_0\)) and second-order approximation of the objective, drawn over the level sets of the quadratic.]
For a proof of the claim, refer to
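The claim in question is, in the standard natural-gradient development, that the KL divergence between nearby policies is locally quadratic with the Fisher information as its Hessian: \(KL(\pi_\theta \,\|\, \pi_{\theta+\delta}) \approx \frac{1}{2}\delta^\top F(\theta)\,\delta\). This can be checked numerically for a one-parameter Bernoulli policy \(p=\sigma(\theta)\), whose Fisher information is \(\sigma(\theta)(1-\sigma(\theta))\):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta, delta = 0.7, 1e-2
p, q = sigmoid(theta), sigmoid(theta + delta)
F = p * (1 - p)                   # Fisher information of Bern(sigmoid(theta))

print(kl_bernoulli(p, q))         # exact KL divergence
print(0.5 * F * delta**2)         # second-order approximation: nearly equal
```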
Algorithm: Natural PG
Initialize \(\theta_0\); for \(i=0,1,2,\dots\): roll out \(\pi_{\theta_i}\), estimate the gradient \(g_i\) and Fisher matrix \(F_i\), and update \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\).
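Putting the pieces together, here is a minimal sketch of the loop. The empirical Fisher matrix is the average outer product of score vectors \(\nabla_\theta \log \pi_\theta(a\mid s)\) over sampled state-action pairs; the rollout and estimation routines (`collect_rollouts`, `score`, `estimate_gradient`) are hypothetical interfaces, not part of the lecture's pseudocode.

```python
import numpy as np

def natural_pg(theta0, collect_rollouts, score, estimate_gradient,
               n_iters=100, alpha=0.1, damping=1e-3):
    """Natural policy gradient: theta <- theta + alpha * F^{-1} g.

    Hypothetical interfaces:
      collect_rollouts(theta) -> list of (s, a, advantage) samples
      score(theta, s, a)      -> grad_theta log pi_theta(a | s)
      estimate_gradient(theta, samples) -> policy gradient estimate g
    """
    theta = np.array(theta0, dtype=float)
    d = theta.size
    for _ in range(n_iters):
        samples = collect_rollouts(theta)
        g = estimate_gradient(theta, samples)
        # Empirical Fisher: average outer product of score vectors.
        F = np.zeros((d, d))
        for s, a, _ in samples:
            u = score(theta, s, a)
            F += np.outer(u, u)
        F /= len(samples)
        # Natural gradient step via a damped linear solve.
        theta = theta + alpha * np.linalg.solve(F + damping * np.eye(d), g)
    return theta
```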
[Figure: the two-state MDP example revisited, with the plot of the policy's stay/switch preference as \(\theta\) ranges from \(-\infty\) to \(+\infty\).]
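To see why the ordinary gradient stalls at the extremes of this plot while the natural gradient does not, consider a simplified one-step reduction of the example (an illustrative assumption, not the full MDP): the policy switches with probability \(\sigma(\theta)\) and switching costs \(\frac{1}{2}\). The ordinary gradient carries a factor \(\sigma(\theta)(1-\sigma(\theta))\) that vanishes as \(\theta\to\pm\infty\), but this factor is exactly the Fisher information, so the natural gradient stays constant:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

for theta in [-8.0, -2.0, 0.0, 2.0, 8.0]:
    p = sigmoid(theta)            # probability of switching
    grad_J = -0.5 * p * (1 - p)   # d/dtheta of J(theta) = const - 0.5 * sigmoid(theta)
    F = p * (1 - p)               # Fisher information of the Bernoulli policy
    natural = grad_J / F          # F^{-1} g: constant -0.5 at every theta
    print(f"theta={theta:+.0f}  grad={grad_J:+.6f}  natural grad={natural:+.2f}")
```

The printout shows the vanilla gradient shrinking toward zero as the policy saturates, while the natural gradient step is the same size everywhere, which is precisely the pathology the \(F_i^{-1}\) preconditioning is meant to fix.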