CS 4/5789: Introduction to Reinforcement Learning
Lecture 18: Trust Regions and Natural Policy Gradient
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- PA 3 and PSet 4 due Friday
- 5789 Paper Reviews due weekly on Mondays
- Office hours cancelled over break
Agenda
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
- Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
- Trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\) and distribution \(\mathbb P^{\pi}_{\mu_0}\)
- Cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)
- For parametric (e.g. deep) policy \(\pi_\theta\), the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right] $$
Recap: Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
- Goal: achieve high expected cumulative reward:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
- We can "rollout" policy \(\pi_\theta\) to observe:
  - a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\) or \(s,a\sim d^{\pi_\theta}_{\mu_0}\)
  - the resulting cumulative reward \(R(\tau)\)
- Note: we do not need to know \(P\) or \(r\)!
Recap: Policy Optimization Setting

Meta-Algorithm: Policy Optimization
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- Rollout policy
- Estimate \(\nabla J(\theta_i)\) as \(g_i\) using rollouts
- Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
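A minimal NumPy sketch of this meta-loop; the `estimate_gradient` callable is a hypothetical stand-in for whatever rollout-based estimator (e.g. REINFORCE) is used:

```python
import numpy as np

def policy_optimization(estimate_gradient, theta0, alpha=0.01, iters=100):
    """Generic meta-loop: rollout, estimate the gradient, take a step."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        g = estimate_gradient(theta)  # any estimator built from rollouts (e.g. REINFORCE)
        theta = theta + alpha * g     # plain gradient-ascent update
    return theta
```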
Recap: Policy Optimization
Today we will derive an alternative update: \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\)

[Diagram: two-state MDP with states \(0\) and \(1\); arrows labeled stay/switch with transition probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\)]
Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- What is optimal \(\theta_\star\)?
- PollEV
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
[Plot: \(\pi_\theta(\mathsf{stay}|1)\to 1\) as \(\theta\to+\infty\) and \(\pi_\theta(\mathsf{switch}|1)\to 1\) as \(\theta\to-\infty\)]
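For concreteness, a small sketch of this parametrized policy in code (the name `pi_theta` is just for illustration):

```python
import numpy as np

def pi_theta(theta, s):
    """Action probabilities for the two-state example policy."""
    if s == 0:
        return {"stay": 1.0, "switch": 0.0}          # state 0: always stay
    p_stay = np.exp(theta) / (1.0 + np.exp(theta))   # sigmoid(theta)
    return {"stay": p_stay, "switch": 1.0 - p_stay}

print(pi_theta(2.0, 1))   # stay w.p. ~0.88, switch w.p. ~0.12
```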
Agenda
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
Motivation: Trust Regions
- Recall: motivation of gradient ascent as first-order approximate maximization
- \(\max_\theta J(\theta) \approx \max_{\theta} J(\theta_0) + \nabla J(\theta_0)^\top (\theta-\theta_0)\)
- The maximum occurs when \(\theta-\theta_0\) is parallel to \(\nabla J(\theta_0)\)
- \(\theta - \theta_0 = \alpha \nabla J(\theta_0) \)
- Q: Why do we normally use a small step size \(\alpha\)?
- A: The linear approximation is only locally valid, so a small \(\alpha\) ensures that \(\theta\) is close to \(\theta_0\)
Motivation: Trust Regions
- Q: Why do we normally use a small step size \(\alpha\)?
- A: The linear approximation is only locally valid, so a small \(\alpha\) ensures that \(\theta\) is close to \(\theta_0\)
[Figure: a 2D quadratic function of \((\theta_1, \theta_2)\) and its level sets]
Trust Regions
- A trust region approach makes the intuition about step size more precise: $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
- The trust region is described by a bounded "distance" from \(\theta_0\)
- General notion of distance, "divergence"
- Another motivation relevant in RL: we might estimate \(J(\theta)\) using data collected with \(\theta_0\) (i.e. with policy \(\pi_{\theta_0}\))
- The estimate may only be good close to \(\theta_0\)
- e.g. in Conservative PI, recall the incremental update
- \(\pi^{i+1}(a|s) = (1-\alpha) \pi^i(a|s) + \alpha \bar\pi(a|s)\)
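A minimal sketch of that incremental mixing step, assuming policies are stored as \(|\mathcal S|\times|\mathcal A|\) probability tables:

```python
import numpy as np

def conservative_pi_update(pi_old, pi_bar, alpha):
    """Incremental mixing: pi_new(a|s) = (1 - alpha) pi_old(a|s) + alpha pi_bar(a|s)."""
    pi_new = (1.0 - alpha) * pi_old + alpha * pi_bar
    assert np.allclose(pi_new.sum(axis=1), 1.0)   # rows remain valid distributions
    return pi_new

pi_old = np.array([[1.0, 0.0], [0.5, 0.5]])   # |S| x |A| table
pi_bar = np.array([[1.0, 0.0], [1.0, 0.0]])   # greedy policy to mix toward
print(conservative_pi_update(pi_old, pi_bar, alpha=0.1))
```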
Agenda
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
KL Divergence
- Motivation: what is a good notion of distance?
- The KL Divergence measures the "distance" between two distributions
- Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
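A direct translation of this definition for finite \(\mathcal X\) (a sketch; it assumes \(Q(x)>0\) wherever \(P(x)>0\)):

```python
import numpy as np

def kl(p, q):
    """KL(P | Q) = sum_x P(x) log(P(x)/Q(x)) for distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with P(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl([0.9, 0.1], [0.5, 0.5]))  # about 0.368
```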
KL Divergence
- Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
- Example: if \(P,Q\) are Bernoullis with mean \(p,q\)
- then \(KL(P|Q) = p\log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\) (plot)
- Example: if \(P=\mathcal N(\mu_1, \sigma^2I)\) and \(Q=\mathcal N(\mu_2, \sigma^2I)\)
- then \(KL(P|Q) = \|\mu_1-\mu_2\|_2^2/(2\sigma^2)\)
- Fact: the KL divergence is always nonnegative, and it is zero if and only if \(P=Q\).
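A quick Monte Carlo sanity check of the Gaussian formula (the means, \(\sigma\), and sample size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma = np.array([1.0, 0.0]), np.array([0.0, 0.0]), 0.5

# Monte Carlo estimate of E_{x~P}[log P(x)/Q(x)] for isotropic Gaussians with equal sigma
x = rng.normal(mu1, sigma, size=(200_000, 2))
log_ratio = (np.sum((x - mu2) ** 2, axis=1) - np.sum((x - mu1) ** 2, axis=1)) / (2 * sigma ** 2)

print(log_ratio.mean())                             # roughly 2.0
print(np.sum((mu1 - mu2) ** 2) / (2 * sigma ** 2))  # closed form: 2.0
```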
KL Divergence for Policies
- We define a measure of "distance" between \(\pi_{\theta_0}\) and \(\pi_\theta\)
- KL divergence between action distributions
- averaged over states \(s\) from the discounted steady state distribution of \(\pi_{\theta_0}\)
- \(d_{KL}(\theta_0, \theta)= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[KL(\pi_{\theta_0}(\cdot |s)|\pi_\theta(\cdot |s))\right] \)
- \(= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\mathbb E_{a\sim \pi_{\theta_0}(\cdot|s)}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right]\right] \)
- we will use the shorthand \(s,a\sim d_{\mu_0}^{\pi_{\theta_0}}\)
- \(= \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}] \)
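A sketch of estimating \(d_{KL}(\theta_0,\theta)\) from such samples, using the two-state example policy; the batch of state-action pairs below is hypothetical:

```python
import numpy as np

def log_pi(theta, s, a):
    """log pi_theta(a|s) for the two-state example (a in {"stay", "switch"})."""
    if s == 0:
        return 0.0 if a == "stay" else -np.inf
    z = np.log(1.0 + np.exp(theta))
    return (theta - z) if a == "stay" else -z

def estimate_policy_kl(theta0, theta, states, actions):
    """Sample average of log pi_theta0(a|s) - log pi_theta(a|s), with s,a ~ d^{pi_theta0}."""
    return np.mean([log_pi(theta0, s, a) - log_pi(theta, s, a)
                    for s, a in zip(states, actions)])

# hypothetical batch of state-action samples from rollouts of pi_theta0
states, actions = [0, 1, 1, 1], ["stay", "stay", "switch", "stay"]
print(estimate_policy_kl(0.0, 1.0, states, actions))
```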

Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- Distance between \(\theta_0\) and \(\theta\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(0,\)stay\()\cdot \log \frac{\pi_{\theta_0}(stay|0)}{\pi_\theta(stay|0)}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(0,\)switch\()\cdot \log \frac{\pi_{\theta_0}(switch|0)}{\pi_\theta(switch|0)}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1,\)stay\()\cdot \log \frac{\pi_{\theta_0}(stay|1)}{\pi_\theta(stay|1)}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1,\)switch\()\cdot \log \frac{\pi_{\theta_0}(switch|1)}{\pi_\theta(switch|1)}\)
- \(d_{KL}(\theta_0, \theta)\) is the sum of these four terms
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- Distance between \(\theta_0\) and \(\theta\)
- the two \(s=0\) terms are \(0\): the log-ratio is zero for stay, and switch is never taken in state \(0\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1,\)stay\()\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1,\)switch\()\cdot \log \frac{1+\exp \theta}{1+\exp \theta_0} \)
- \(d_{KL}(\theta_0, \theta)\) is the sum of these terms
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- Distance between \(\theta_0\) and \(\theta\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1)\, \pi_{\theta_0}(\mathsf{stay}|1)\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta}\)
- \(d_{\mu_0}^{\pi_{\theta_0}}(1)\, \pi_{\theta_0}(\mathsf{switch}|1) \cdot \log \frac{1+\exp \theta}{1+\exp \theta_0} \)
- summing, \(d_{KL}(\theta_0,\theta) = d_{\mu_0}^{\pi_{\theta_0}}(1) \left(\log \frac{1+\exp \theta}{1+\exp \theta_0} + \frac{\exp \theta_0}{1+\exp \theta_0}(\theta_0-\theta)\right)\)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
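A quick numerical check of this simplification against the direct per-state KL (the values of \(\theta_0,\theta\) are arbitrary, and \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) only scales both sides):

```python
import numpy as np

def sigmoid(t):
    return np.exp(t) / (1.0 + np.exp(t))

theta0, theta, d1 = 0.3, 1.7, 1.0   # d1 stands in for d(1); it just scales the result

# direct KL between the two Bernoulli action distributions at s = 1
p0, p = sigmoid(theta0), sigmoid(theta)
direct = d1 * (p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p)))

# simplified closed form from the slide
closed = d1 * (np.log((1 + np.exp(theta)) / (1 + np.exp(theta0))) + p0 * (theta0 - theta))

print(np.isclose(direct, closed))   # True
```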
Agenda
1. Recap
2. Trust Regions
3. KL Divergence
4. Natural PG
Natural Policy Gradient
- We will derive the update $$ \theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i $$
- This is called natural policy gradient
- Intuition: the gradient \(g_i\) is "preconditioned" by \(F_i^{-1}\), so the update adapts to the geometry of the policy class
- We derive this update as approximating $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta, \theta_0)<\delta$$
[Figure: level sets of a quadratic, comparing the first-order (gradient) direction \(g_i\) with the preconditioned direction \(F_i^{-1}g_i\) from the second-order approximation]
Second order Divergence Approx
- Second order approximation of $$\ell(\theta) = d_{KL}(\theta_0,\theta) = \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right] $$
- Given by $$\ell(\theta_0) + \nabla \ell(\theta_0)^\top (\theta-\theta_0) + \frac{1}{2}(\theta-\theta_0)^\top \nabla^2 \ell(\theta_0) (\theta-\theta_0)$$
- Claim: Zeroth and first order terms are zero, \(\ell(\theta_0) = 0\), \(\nabla \ell(\theta_0)=0\), and the second order term (Hessian) is $$\nabla^2\ell(\theta_0) = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[ \log \pi_\theta(a|s)]_{\theta=\theta_0} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_0}^\top ]$$
- The Hessian is known as the Fisher information matrix
For proof of claim, refer to
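As a sanity check of the claim on the two-state example policy (per state \(s=1\), so the \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) weight is dropped from both sides), a finite-difference sketch:

```python
import numpy as np

def sigmoid(t):
    return np.exp(t) / (1.0 + np.exp(t))

def kl_state1(theta0, theta):
    """KL(pi_theta0(.|1) | pi_theta(.|1)); the s=0 term of l(theta) is identically zero."""
    p0, p = sigmoid(theta0), sigmoid(theta)
    return p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))

theta0, eps = 0.4, 1e-3
# finite-difference second derivative of l(theta) at theta = theta0 (l and its gradient vanish there)
hess = (kl_state1(theta0, theta0 + eps) - 2 * kl_state1(theta0, theta0)
        + kl_state1(theta0, theta0 - eps)) / eps ** 2
fisher = sigmoid(theta0) * (1 - sigmoid(theta0))   # E_a[(d/dtheta log pi_theta(a|1))^2]
print(hess, fisher)   # both approximately 0.240
```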

Example
- Parametrized policy:
- \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- Fisher information (a scalar here, since \(\theta\in\mathbb R\))
- \(\nabla\log \pi_\theta(a|s) = \begin{cases}0 & s=0\\ \frac{\exp \theta}{(1+\exp \theta)^2}\cdot \frac{1+\exp \theta}{\exp \theta} & s=1,a=\mathsf{stay} \\ \frac{-\exp \theta}{(1+\exp \theta)^2} \cdot \frac{1+\exp \theta}{1} & s=1,a=\mathsf{switch} \end{cases}\)
- \(F_0 = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[\log \pi_\theta(a|s)]^2_{\theta=\theta_0} ]\)
- \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{1+\exp \theta_0} \cdot \frac{1}{(1+\exp \theta_0)^2} + \frac{1}{1+\exp \theta_0}\cdot \frac{(-\exp \theta_0)^2}{(1+\exp \theta_0)^2}\right)\)
- \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{(1+\exp \theta_0)^2} \right)\)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
Constrained Optimization
- Our approximation to \(\max_\theta J(\theta)~ \text{s.t.} ~d(\theta, \theta_0)<\delta\) is (absorbing the factor \(\frac{1}{2}\) from the quadratic term into \(\delta\)) $$\max_\theta\quad g_0^\top(\theta-\theta_0) \quad \text{s.t.}\quad (\theta-\theta_0)^\top F_{0} (\theta-\theta_0)<\delta$$
- Claim: The maximum has the closed form expression $$\theta_\star =\theta_0+\alpha F_0^{-1}g_0$$ where \(\alpha = (\delta /g_0^\top F_0^{-1} g_0)^{1/2}\)
- Proof outline:
- Start with solving \(\max c^\top v \) s.t. \(\|v\|_2^2\leq \delta\)
- Consider change of variables $$v=F_0^{1/2}(\theta-\theta_0),\quad c=F_0^{-1/2}g_0$$
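A quick numerical check of this claim with an arbitrary positive-definite \(F_0\) and gradient \(g_0\):

```python
import numpy as np

rng = np.random.default_rng(1)

# arbitrary positive-definite F_0 and gradient g_0 (3-dimensional for illustration)
A = rng.normal(size=(3, 3))
F0 = A @ A.T + np.eye(3)
g0 = rng.normal(size=3)
delta = 0.1

# claimed maximizer: step along F0^{-1} g0, scaled so the constraint is tight
step = np.linalg.solve(F0, g0)
alpha = np.sqrt(delta / (g0 @ step))
dtheta = alpha * step
print(np.isclose(dtheta @ F0 @ dtheta, delta))   # lies on the constraint boundary

# no random feasible direction achieves a larger linear objective g0^T (theta - theta0)
best = max(g0 @ (v * np.sqrt(delta / (v @ F0 @ v)))
           for v in rng.normal(size=(2000, 3)))
print(g0 @ dtheta >= best)   # True
```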
Natural Policy Gradient
Algorithm: Natural PG
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- Rollout policy
- Estimate \(\nabla J(\theta_i)\) with \(g_i\) using rollouts (REINFORCE, value, etc)
- Estimate the Fisher information matrix from the same rollouts: $$F_i = \mathbb E_{s,a\sim d^{\pi_{\theta_i}}_{\mu_0}}\left[\nabla \log \pi_{\theta_i}(a|s)\, \nabla \log \pi_{\theta_i}(a|s)^\top\right]$$
- Update \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\)
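A minimal sketch of one such update step, assuming the per-sample score vectors \(\nabla\log\pi_{\theta_i}(a|s)\) and the gradient estimate \(g_i\) have already been computed from rollouts (the ridge term is an implementation convenience, not part of the algorithm above):

```python
import numpy as np

def npg_step(theta, score_samples, g, alpha=0.1, reg=1e-3):
    """One natural PG update from sampled score vectors grad log pi_theta(a|s).

    score_samples: (num_samples, dim) array with s,a drawn from rollouts of pi_theta.
    g: policy-gradient estimate (REINFORCE, advantage-based, etc.), shape (dim,).
    """
    F = score_samples.T @ score_samples / len(score_samples)  # sample Fisher matrix
    F = F + reg * np.eye(len(theta))                          # small ridge so F is invertible
    return theta + alpha * np.linalg.solve(F, g)              # theta + alpha F^{-1} g
```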

Example
- Parametrized policy: \(\pi_\theta(0)=\) stay
- \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
- NPG: \(\theta_1=\theta_0 + \alpha \frac{1}{F_0}g_0\); GA: \(\theta_1=\theta_0 + \alpha g_0 \)
- \(F_0 \propto \frac{\exp \theta_0}{(1+\exp \theta_0)^2}\to 0\) as \(\theta_0\to\pm\infty\)
- NPG takes bigger and bigger steps as \(\theta\) becomes more extreme
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
[Plot: \(\pi_\theta(\mathsf{stay}|1)\to 1\) as \(\theta\to+\infty\) and \(\pi_\theta(\mathsf{switch}|1)\to 1\) as \(\theta\to-\infty\)]
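A tiny numerical illustration of this effect, dropping the \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) factor:

```python
import numpy as np

def fisher_scalar(theta):
    """F_0 up to the d(1) factor: exp(theta) / (1 + exp(theta))^2."""
    return np.exp(theta) / (1.0 + np.exp(theta)) ** 2

for theta0 in [0.0, 2.0, 4.0, 8.0]:
    print(theta0, 1.0 / fisher_scalar(theta0))   # the NPG rescaling 1/F_0 grows rapidly
```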
Recap
- PSet due Fri
- PA due Fri
- Trust Regions
- KL Divergence
- Natural Policy Gradient
- Next lecture: Unit 3, Exploration
Sp23 CS 4/5789: Lecture 18
By Sarah Dean