Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
  • PA 3 and PSet 4 due Friday
  • 5789 Paper Reviews due weekly on Mondays
• Office hours cancelled over break

## Agenda

1. Recap

2. Trust Regions

3. KL Divergence

4. Natural PG

## Recap: Policy Optimization Setting

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}$$

• Goal: achieve high expected cumulative reward:

$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ]$$

• Trajectory $$\tau = (s_0, a_0, s_1, a_1, \dots)$$ and distribution $$\mathbb P^{\pi}_{\mu_0}$$
• Cumulative reward $$R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)$$
• For parametric (e.g. deep) policy $$\pi_\theta$$, the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

## Recap: Policy Optimization Setting


$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}$$

• Goal: achieve high expected cumulative reward:

$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

• We can "rollout" policy $$\pi_\theta$$ to observe:

• a sample $$\tau$$ from $$\mathbb P^{\pi_\theta}_{\mu_0}$$ or $$s,a\sim d^{\pi_\theta}_{\mu_0}$$

• the resulting cumulative reward $$R(\tau)$$

• Note: we do not need to know $$P$$ or $$r$$!

## Recap: Policy Optimization Setting


Meta-Algorithm: Policy Optimization

• Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• Rollout policy
• Estimate $$\nabla J(\theta_i)$$ as $$g_i$$ using rollouts
• Update $$\theta_{i+1} = \theta_i + \alpha g_i$$
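The loop above can be sketched in a few lines of Python. This is a toy stand-in where `J`, `estimate_gradient`, and the constants are hypothetical placeholders for the rollout-based estimator; it only shows the structure of the meta-algorithm.

```python
import random

# Toy stand-in for the policy-optimization meta-algorithm. J and
# estimate_gradient are hypothetical placeholders: in RL, g would come
# from rollouts (e.g. REINFORCE), and J would not be known in closed form.
def J(theta):
    return -(theta - 2.0) ** 2  # maximized at theta = 2

def estimate_gradient(theta, noise=0.1):
    exact = -2.0 * (theta - 2.0)
    return exact + random.gauss(0.0, noise)  # rollouts give noisy estimates

def policy_optimization(theta0, alpha=0.1, iters=200, seed=0):
    random.seed(seed)
    theta = theta0
    for _ in range(iters):
        g = estimate_gradient(theta)  # "rollout policy, estimate grad J"
        theta = theta + alpha * g     # ascent update
    return theta

print(policy_optimization(0.0))  # converges near the maximizer theta = 2
```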

## Recap: Policy Optimization

Today we will derive an alternative update: $$\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i$$

(Figure: two-state MDP over states $$0$$ and $$1$$, with stay/switch transition probabilities $$1$$, $$p_1$$, $$1-p_1$$, $$1-p_2$$, $$p_2$$ labeling the arrows.)

## Example

• Parametrized policy:
• $$\pi_\theta(0)=$$ stay
• $$\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases}$$
• What is optimal $$\theta_\star$$?
• PollEV

reward: $$+1$$ if $$s=0$$ and $$-\frac{1}{2}$$ if $$a=$$ switch
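A quick numerical sketch of this policy class (the helper names `pi_stay`/`pi_switch` are just the two cases of the definition above):

```python
import math

# The policy at state 1 is a sigmoid in theta:
# pi_theta(stay|1) = exp(theta) / (1 + exp(theta)).
def pi_stay(theta):
    return math.exp(theta) / (1.0 + math.exp(theta))

def pi_switch(theta):
    return 1.0 / (1.0 + math.exp(theta))

for theta in [-5.0, 0.0, 5.0]:
    # always a valid distribution; theta -> +inf makes stay deterministic,
    # theta -> -inf makes switch deterministic
    print(theta, pi_stay(theta), pi_switch(theta))
```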

(Plot: $$\pi_\theta(\cdot|1)$$ as a function of $$\theta$$; the stay probability approaches $$1$$ as $$\theta\to+\infty$$ and the switch probability approaches $$1$$ as $$\theta\to-\infty$$.)

## Agenda

1. Recap

2. Trust Regions

3. KL Divergence

4. Natural PG

## Motivation: Trust Regions

• Recall: motivation of gradient ascent as first-order approximate maximization
• $$\max_\theta J(\theta) \approx \max_{\theta} J(\theta_0) + \nabla J(\theta_0)^\top (\theta-\theta_0)$$
• The maximum occurs when $$\theta-\theta_0$$ is parallel to $$\nabla J(\theta_0)$$
• $$\theta - \theta_0 = \alpha \nabla J(\theta_0)$$
• Q: Why do we normally use a small step size $$\alpha$$?
• A: The linear approximation is only locally valid, so a small $$\alpha$$ ensures that $$\theta$$ is close to $$\theta_0$$

## Motivation: Trust Regions

• Q: Why do we normally use a small step size $$\alpha$$?
• A: The linear approximation is only locally valid, so a small $$\alpha$$ ensures that $$\theta$$ is close to $$\theta_0$$

$$\theta_1$$

$$\theta_2$$

$$\theta_1$$

$$\theta_2$$

## Trust Regions

• A trust region approach makes the intuition about step size more precise: $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
• The trust region is described by a bounded "distance" from $$\theta_0$$
• General notion of distance, "divergence"
• Another motivation relevant in RL: we might estimate $$J(\theta)$$ using data collected with $$\theta_0$$ (i.e. with policy $$\pi_{\theta_0}$$)
• The estimate may only be good close to $$\theta_0$$
• e.g. in Conservative PI, recall the incremental update
• $$\pi^{i+1}(a|s) = (1-\alpha) \pi^i(a|s) + \alpha \bar\pi(a|s)$$
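The incremental update can be sketched as follows; the dict-of-dicts policy representation and the particular policies below are illustrative assumptions, not from the lecture.

```python
# Sketch of the conservative PI update: the new policy is an alpha-mixture
# of the current policy pi_i and an improved policy pi_bar.
# Policies are represented as {state: {action: probability}} (an assumption).
def conservative_update(pi_i, pi_bar, alpha):
    return {
        s: {a: (1 - alpha) * pi_i[s].get(a, 0.0) + alpha * pi_bar[s].get(a, 0.0)
            for a in set(pi_i[s]) | set(pi_bar[s])}
        for s in pi_i
    }

pi_i   = {0: {"stay": 1.0}, 1: {"stay": 0.8, "switch": 0.2}}
pi_bar = {0: {"stay": 1.0}, 1: {"switch": 1.0}}
pi_next = conservative_update(pi_i, pi_bar, alpha=0.1)
print(sorted(pi_next[1].items()))  # mixture is still a valid distribution
```

A small `alpha` keeps the new policy close to the old one, which is exactly the trust-region intuition above.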

## Agenda

1. Recap

2. Trust Regions

3. KL Divergence

4. Natural PG

## KL Divergence

• Motivation: what is a good notion of distance?
• The KL Divergence measures the "distance" between two distributions
• Def: Given $$P\in\Delta(\mathcal X)$$ and $$Q\in\Delta(\mathcal X)$$, the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$

## KL Divergence

• Def: Given $$P\in\Delta(\mathcal X)$$ and $$Q\in\Delta(\mathcal X)$$, the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
• Example: if $$P,Q$$ are Bernoullis with mean $$p,q$$
• then $$KL(P|Q) = p\log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}$$ (plot)
• Example: if $$P=\mathcal N(\mu_1, \sigma^2I)$$ and $$Q=\mathcal N(\mu_2, \sigma^2I)$$
• then $$KL(P|Q) = \|\mu_1-\mu_2\|_2^2/(2\sigma^2)$$
• Fact: $$KL(P|Q)\geq 0$$ always, with equality if and only if $$P=Q$$.
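A small numerical sketch of the Bernoulli case, which also illustrates that KL is not symmetric in its arguments (so it is a divergence, not a metric):

```python
import math

def kl_bernoulli(p, q):
    # KL(P|Q) = p log(p/q) + (1-p) log((1-p)/(1-q)) for 0 < p, q < 1
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(kl_bernoulli(0.5, 0.5))  # zero: KL(P|Q) = 0 iff P = Q
print(kl_bernoulli(0.5, 0.1))  # strictly positive when P != Q
print(kl_bernoulli(0.1, 0.5))  # differs from the previous line: asymmetric
```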

## KL Divergence for Policies

• We define a measure of "distance" between $$\pi_{\theta_0}$$ and $$\pi_\theta$$
• KL divergence between action distributions
• averaged over states $$s$$ from the discounted steady state distribution of $$\pi_{\theta_0}$$
• $$d_{KL}(\theta_0, \theta)= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[KL(\pi_{\theta_0}(\cdot |s)|\pi_\theta(\cdot |s))\right]$$
• $$= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\mathbb E_{a\sim \pi_{\theta_0}(\cdot|s)}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right]\right]$$
• we will use the shorthand $$s,a\sim d_{\mu_0}^{\pi_{\theta_0}}$$
• $$= \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}]$$
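This expectation is exactly what a rollout-based (Monte Carlo) estimate computes: sample $$(s,a)$$ from the behavior policy's distribution and average the log-ratio. The two-state distributions below are made-up numbers for illustration.

```python
import math, random

# Monte Carlo estimate of d_KL(theta0, theta): sample (s, a) from the
# behavior policy's state-action distribution, average log(pi0/pi).
# The state weights and action probabilities below are made-up numbers.
d_state = {0: 0.6, 1: 0.4}
pi0 = {0: {"stay": 1.0}, 1: {"stay": 0.7, "switch": 0.3}}
pi  = {0: {"stay": 1.0}, 1: {"stay": 0.5, "switch": 0.5}}

def exact_kl():
    return sum(ds * p * math.log(p / pi[s][a])
               for s, ds in d_state.items()
               for a, p in pi0[s].items())

def mc_kl(n=200_000, seed=0):
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        s = 0 if rng.random() < d_state[0] else 1
        a = "stay" if rng.random() < pi0[s].get("stay", 0.0) else "switch"
        acc += math.log(pi0[s][a] / pi[s][a])
    return acc / n

print(exact_kl(), mc_kl())  # the estimate approaches the exact value
```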


## Example

• Parametrized policy:
• $$\pi_\theta(0)=$$ stay
• $$\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases}$$
• Distance between $$\theta_0$$ and $$\theta$$: $$d_{KL}(\theta_0, \theta)$$ is the sum of the four weighted terms
• $$d_{\mu_0}^{\pi_{\theta_0}}(0,\mathsf{stay})\cdot \log \frac{\pi_{\theta_0}(\mathsf{stay}|0)}{\pi_\theta(\mathsf{stay}|0)}$$
• $$d_{\mu_0}^{\pi_{\theta_0}}(0,\mathsf{switch})\cdot \log \frac{\pi_{\theta_0}(\mathsf{switch}|0)}{\pi_\theta(\mathsf{switch}|0)}$$
• $$d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{stay})\cdot \log \frac{\pi_{\theta_0}(\mathsf{stay}|1)}{\pi_\theta(\mathsf{stay}|1)}$$
• $$d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{switch})\cdot \log \frac{\pi_{\theta_0}(\mathsf{switch}|1)}{\pi_\theta(\mathsf{switch}|1)}$$


## Example

• Parametrized policy:
• $$\pi_\theta(0)=$$ stay
• $$\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases}$$
• Distance between $$\theta_0$$ and $$\theta$$
• the two state-$$0$$ terms are $$0$$, since both policies play stay deterministically at $$s=0$$
• $$d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{stay})\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta}$$
• $$d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{switch})\cdot \log \frac{1+\exp \theta}{1+\exp \theta_0}$$
• summing gives $$d_{KL}(\theta_0, \theta)$$


## Example

• Parametrized policy:
• $$\pi_\theta(0)=$$ stay
• $$\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases}$$
• Distance between $$\theta_0$$ and $$\theta$$, writing $$d_{\mu_0}^{\pi_{\theta_0}}(1,a) = d_{\mu_0}^{\pi_{\theta_0}}(1)\,\pi_{\theta_0}(a|1)$$:
• $$d_{\mu_0}^{\pi_{\theta_0}}(1)\, \pi_{\theta_0}(\mathsf{stay}|1)\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta}$$
• $$d_{\mu_0}^{\pi_{\theta_0}}(1)\, \pi_{\theta_0}(\mathsf{switch}|1) \cdot \log \frac{1+\exp \theta}{1+\exp \theta_0}$$
• combining, $$d_{KL}(\theta_0,\theta) = d_{\mu_0}^{\pi_{\theta_0}}(1) \left(\log \frac{1+\exp \theta}{1+\exp \theta_0} + \frac{\exp \theta_0}{1+\exp \theta_0}(\theta_0 - \theta)\right)$$
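A numerical sanity check of the state-$$1$$ algebra (with the common factor $$d_{\mu_0}^{\pi_{\theta_0}}(1)$$ divided out): the two weighted log terms should sum to $$\log\frac{1+e^\theta}{1+e^{\theta_0}} + \frac{e^{\theta_0}}{1+e^{\theta_0}}(\theta_0-\theta)$$.

```python
import math

def sigmoid(t):
    return math.exp(t) / (1.0 + math.exp(t))

# Sum of the two weighted KL terms at state 1 (common factor d(1) removed)
def kl_terms(theta0, theta):
    p0, p = sigmoid(theta0), sigmoid(theta)
    return p0 * math.log(p0 / p) + (1 - p0) * math.log((1 - p0) / (1 - p))

# Claimed simplification
def kl_closed_form(theta0, theta):
    p0 = sigmoid(theta0)
    return (math.log((1 + math.exp(theta)) / (1 + math.exp(theta0)))
            + p0 * (theta0 - theta))

for t0, t in [(0.0, 1.0), (-2.0, 3.0), (1.5, 1.5)]:
    assert abs(kl_terms(t0, t) - kl_closed_form(t0, t)) < 1e-9
print("simplification verified")
```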


## Agenda

1. Recap

2. Trust Regions

3. KL Divergence

4. Natural PG

• We will derive the update $$\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i$$
• This is called natural policy gradient
• Intuition: update direction $$g_i$$ is "preconditioned" by a matrix $$F_i$$ and adapts to geometry

• We derive this update as approximating $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
• the objective is approximated to first order, giving the gradient $$g_i$$
• the divergence constraint is approximated to second order, giving the matrix $$F_i$$ and the step direction $$F_i^{-1}g_i$$

## Second-Order Divergence Approximation

• Second order approximation of $$\ell(\theta) = d_{KL}(\theta_0,\theta) = \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right]$$
• Given by $$\ell(\theta_0) + \nabla \ell(\theta_0)^\top (\theta-\theta_0) + \frac{1}{2}(\theta-\theta_0)^\top \nabla^2 \ell(\theta_0) (\theta-\theta_0)$$
• Claim: the zeroth- and first-order terms are zero, $$\ell(\theta_0) = 0$$ and $$\nabla \ell(\theta_0)=0$$ (the expected score vanishes: $$\mathbb E_{a\sim \pi_{\theta_0}(\cdot|s)}[\nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_0}]=0$$), and the second-order term (Hessian) is $$\nabla^2\ell(\theta_0) = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[ \log \pi_\theta(a|s)]_{\theta=\theta_0} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_0}^\top ]$$
• The Hessian is known as the Fisher information matrix
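The claim can be checked numerically in one dimension with the sigmoid policy from the running example: the finite-difference Hessian of $$\theta\mapsto KL$$ at $$\theta_0$$ should match the Fisher information $$p_0(1-p_0)$$, and the zeroth- and first-order terms should vanish.

```python
import math

def sigmoid(t):
    return math.exp(t) / (1.0 + math.exp(t))

def kl(theta0, theta):
    # KL between the two action distributions at state 1
    p0, p = sigmoid(theta0), sigmoid(theta)
    return p0 * math.log(p0 / p) + (1 - p0) * math.log((1 - p0) / (1 - p))

def fisher(theta0):
    # E_{a ~ pi_theta0}[ (d/dtheta log pi_theta(a))^2 ]: the score is
    # (1 - p0) for stay and -p0 for switch, giving p0 (1 - p0)
    p0 = sigmoid(theta0)
    return p0 * (1 - p0) ** 2 + (1 - p0) * p0 ** 2

theta0, h = 0.7, 1e-4
zeroth = kl(theta0, theta0)
first_fd = (kl(theta0, theta0 + h) - kl(theta0, theta0 - h)) / (2 * h)
second_fd = (kl(theta0, theta0 + h) - 2 * zeroth + kl(theta0, theta0 - h)) / h**2

assert zeroth == 0.0                           # zeroth-order term vanishes
assert abs(first_fd) < 1e-6                    # first-order term vanishes
assert abs(second_fd - fisher(theta0)) < 1e-5  # Hessian = Fisher information
print("claim verified at theta0 =", theta0)
```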


## Example

• Parametrized policy:
• $$\pi_\theta(0)=$$ stay
• $$\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases}$$
• Fisher information "matrix" (here a scalar, since $$\theta\in\mathbb R$$)
• $$\nabla\log \pi_\theta(a|s) = \begin{cases}0 & s=0\\ \frac{\exp \theta}{(1+\exp \theta)^2}\cdot \frac{1+\exp \theta}{\exp \theta} & s=1,a=\mathsf{stay} \\ \frac{-\exp \theta}{(1+\exp \theta)^2} \cdot \frac{1+\exp \theta}{1} & s=1,a=\mathsf{switch} \end{cases}$$
• $$F_0 = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[\log \pi_\theta(a|s)]^2_{\theta=\theta_0} ]$$
• $$=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{1+\exp \theta_0} \cdot \frac{1}{(1+\exp \theta_0)^2} + \frac{1}{1+\exp \theta_0}\cdot \frac{(-\exp \theta_0)^2}{(1+\exp \theta_0)^2}\right)$$
• $$=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{(1+\exp \theta_0)^2} \right)$$


## Constrained Optimization

• Our approximation to  $$\max_\theta J(\theta)~ \text{s.t.} ~d(\theta, \theta_0)<\delta$$ is $$\max_\theta\quad g_0^\top(\theta-\theta_0) \quad \text{s.t.}\quad (\theta-\theta_0)^\top F_{0} (\theta-\theta_0)<\delta$$
• Claim: The maximum has the closed form expression $$\theta_\star =\theta_0+\alpha F_0^{-1}g_0$$ where $$\alpha = (\delta /g_0^\top F_0^{-1} g_0)^{1/2}$$
• Proof outline:
• Start by solving $$\max c^\top v$$ s.t. $$\|v\|_2^2\leq \delta$$: by Cauchy-Schwarz, the maximizer is $$v_\star=\sqrt{\delta}\, c/\|c\|_2$$
• Then apply the change of variables $$v=F_0^{1/2}(\theta-\theta_0),\quad c=F_0^{-1/2}g_0$$
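A numerical check of the claim on a hypothetical 2-D instance with diagonal $$F_0$$ (so the inverse is coordinate-wise): the closed-form step saturates the constraint, and no randomly drawn feasible direction achieves a larger objective.

```python
import math, random

# Check the closed form on a hypothetical 2-D instance with diagonal F,
# so F^{-1} g is just coordinate-wise division.
g, F, delta = [1.0, -2.0], [4.0, 1.0], 0.5

Finv_g = [g[i] / F[i] for i in range(2)]
alpha = math.sqrt(delta / sum(g[i] * Finv_g[i] for i in range(2)))
step = [alpha * v for v in Finv_g]            # theta_star - theta_0

obj_star = sum(g[i] * step[i] for i in range(2))
constraint = sum(F[i] * step[i] ** 2 for i in range(2))
assert abs(constraint - delta) < 1e-9         # constraint is active

# no randomly drawn feasible direction does better
rng = random.Random(0)
for _ in range(1000):
    d = [rng.gauss(0, 1) for _ in range(2)]
    scale = math.sqrt(delta / sum(F[i] * d[i] ** 2 for i in range(2)))
    d = [scale * di for di in d]              # on the trust-region boundary
    assert sum(g[i] * d[i] for i in range(2)) <= obj_star + 1e-9
print("closed form verified; objective value", obj_star)
```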

Algorithm: Natural PG

• Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• Rollout policy
• Estimate $$\nabla J(\theta_i)$$ with $$g_i$$ using rollouts (REINFORCE, value, etc)
• Estimate the Fisher information matrix $$F_i = \mathbb E_{s,a\sim d^{\pi_{\theta_i}}_{\mu_0}}\left[\nabla \log \pi_{\theta_i}(a|s) \nabla \log \pi_{\theta_i}(a|s)^\top\right]$$ using the same rollouts
• Update $$\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i$$
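To see the effect of the preconditioning, here is a simplified one-state ("bandit") variant of the running example, with reward $$1$$ for stay and $$0$$ for switch so that $$J(\theta)=\pi_\theta(\mathsf{stay})$$; this is an illustrative reduction, not the full two-state MDP. Vanilla gradient ascent stalls as the sigmoid saturates, while NPG's step $$\alpha F^{-1}\nabla J$$ stays constant.

```python
import math

def sigmoid(t):
    return math.exp(t) / (1.0 + math.exp(t))

def grad_J(theta):       # d/dtheta sigmoid(theta) = p (1 - p)
    p = sigmoid(theta)
    return p * (1 - p)

def fisher(theta):       # E[(d/dtheta log pi)^2] = p (1 - p) for a sigmoid
    p = sigmoid(theta)
    return p * (1 - p)

theta_ga, theta_npg, alpha = 0.0, 0.0, 0.5
for _ in range(20):
    theta_ga += alpha * grad_J(theta_ga)                        # vanilla GA
    theta_npg += alpha * grad_J(theta_npg) / fisher(theta_npg)  # NPG

# GA's steps shrink as the sigmoid saturates; NPG's step is exactly alpha
# every iteration, since grad_J / fisher = 1 here.
print(theta_ga, theta_npg)
```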


## Example

• Parametrized policy: $$\pi_\theta(0)=$$ stay
• $$\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases}$$
• NPG: $$\theta_1=\theta_0 + \alpha \frac{1}{F_0}g_0$$; GA: $$\theta_1=\theta_0 + \alpha g_0$$
• $$F_0 \propto \frac{\exp \theta_0}{(1+\exp \theta_0)^2}\to 0$$ as $$\theta_0\to\pm\infty$$
• NPG takes bigger and bigger steps as $$\theta$$ becomes more extreme


## Recap

• PSet due Fri
• PA due Fri

• Trust Regions
• KL Divergence