CS 4/5789: Introduction to Reinforcement Learning

Lecture 18: Policy Opt. with Trust Regions

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • PSet 6 due Friday, PA 3 due Sunday
    • 5789 Paper Assignments
  • Break 3/30-4/7: no office hours or lectures
  • Class & prof office hours cancelled on Monday 4/8
    • Extra TA office hours before prelim
    • Prelim questions on Ed: use tag!
  • Prelim on Wednesday 4/10 in class

Prelim on 4/10 in Lecture

  • Prelim Wednesday 4/10
  • During lecture (2:55-4:10pm in 255 Olin)
  • 1 hour exam, closed-book, equation sheet provided
  • Materials:
    • slides (Lectures 1-18, emphasis on 11-18)
      • slides.com tips: ESC, /scroll
    • PSets 1-6, emphasis on 4-6 (solutions to be posted)
    • lecture notes: extra but imperfect resource
  • Prelim tag on Ed
  • Extra TA office hours

Agenda

1. Recap

2. Natural PG

3. Proximal Policy Opt

4. Review

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]=\mathop{\mathbb E}_{s\sim \mu_0}\left[V^{\pi_\theta}(s)\right]$$

Recap: Policy Optimization

Meta-Algorithm: Policy Optimization

  • Initialize \(\theta_0\). For \(i=0,1,...\):
    • Rollout policy
    • Estimate \(\nabla J(\theta_i)\) as \(g_i\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
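
As a point of reference, here is a minimal Python sketch of this meta-algorithm, where `rollout` and `estimate_gradient` are hypothetical helpers (not course code) returning sampled trajectories and a gradient estimate \(g_i\):

```python
import numpy as np

def policy_optimization(theta0, rollout, estimate_gradient, alpha=0.05, iters=100):
    """Generic policy optimization loop: rollout, estimate the gradient, ascend."""
    theta = np.array(theta0, dtype=float)
    for i in range(iters):
        trajectories = rollout(theta)                 # sample tau ~ P^{pi_theta}_{mu_0}
        g = estimate_gradient(theta, trajectories)    # g_i, an estimate of grad J(theta_i)
        theta = theta + alpha * g                     # ascent step: theta_{i+1} = theta_i + alpha g_i
    return theta
```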

Recap: Trust Regions & KL Div

  • A trust region approach to optimization: $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
  • The trust region is described by a bounded "distance" (divergence) from \(\theta_0\)
  • Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
  • For parametrized policies, we define $$d_{KL}(\theta_0, \theta)= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[KL(\pi_{\theta_0}(\cdot |s)|\pi_\theta(\cdot |s))\right]= \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}] $$
  • Notation note:  \(s, a\sim d_{\mu_0}^{\pi_{\theta_0}}\) means \(s\sim d_{\mu_0}^{\pi_{\theta_0}}\), \(a\sim\pi_{\theta_0}(s)\)

KL Divergence

  • Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
  • Example: if \(P,Q\) are Bernoullis with mean \(p,q\)
    • then \(KL(P|Q) = p\log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\) (plot)
  • Example: if \(P=\mathcal N(\mu_1, \sigma^2I)\) and \(Q=\mathcal N(\mu_2, \sigma^2I)\)
    • then \(KL(P|Q) = \|\mu_1-\mu_2\|_2^2/(2\sigma^2)\)
  • Fact: KL is always strictly positive unless \(P=Q\), in which case it is zero.
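
As a quick numeric illustration of the Bernoulli case above (a standalone Python sketch, not course code):

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(Bern(p) | Bern(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

print(bernoulli_kl(0.5, 0.5))   # 0.0: the divergence is zero when P = Q
print(bernoulli_kl(0.5, 0.1))   # positive
print(bernoulli_kl(0.1, 0.5))   # positive, but different: KL is not symmetric
```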

Agenda

1. Recap

2. Natural PG

3. Proximal Policy Opt

4. Review

Natural Policy Gradient

  • We will derive the update $$ \theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i $$
  • This is called natural policy gradient
  • Intuition: update direction \(g_i\) is "preconditioned" by a matrix \(F_i\) and adapts to geometry



     
  • We derive this update as approximating $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta, \theta_0)<\delta$$

[Figure: the objective is replaced by its first-order approximation (gradient \(g_0\)) and the divergence constraint by its second-order approximation; level sets of the quadratic are shown, along with the gradient direction \(g_i\) and the preconditioned direction \(F_i^{-1}g_i\).]
Second order Divergence Approx

  • Second order approximation of $$\ell(\theta) = d_{KL}(\theta_0,\theta) = \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right] $$
  • Given by $$\ell(\theta_0) + \nabla \ell(\theta_0)^\top (\theta-\theta_0) + (\theta-\theta_0)^\top \nabla^2 \ell(\theta_0) (\theta-\theta_0)$$
  • Claim: the zeroth- and first-order terms vanish, \(\ell(\theta_0) = 0\) and \(\nabla \ell(\theta_0)=0\), while the second-order term (Hessian) is $$\nabla^2\ell(\theta_0) = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[ \log \pi_\theta(a|s)]_{\theta=\theta_0} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_0}^\top ]$$
  • The Hessian is known as the Fisher information matrix
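
For the zeroth- and first-order parts of the claim, a short sketch using only the normalization \(\sum_a \pi_\theta(a|s)=1\): $$\ell(\theta_0) = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_{\theta_0}(a|s)}\right]=0,\qquad \nabla\ell(\theta_0) = -\mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}\big[\nabla_\theta \log\pi_\theta(a|s)\big|_{\theta=\theta_0}\big] = -\mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\Big[\textstyle\sum_a \nabla_\theta\pi_\theta(a|s)\big|_{\theta=\theta_0}\Big] = -\mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\Big[\nabla_\theta \textstyle\sum_a\pi_\theta(a|s)\big|_{\theta=\theta_0}\Big]=0$$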

[Diagram: two-state MDP (states \(0\) and \(1\)) with transition arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).]

Example

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Fisher information matrix (a scalar, since \(\theta\) is one-dimensional)
    • \(\nabla\log \pi_\theta(a|s) = \begin{cases}0 & s=0\\  \frac{\exp \theta}{(1+\exp \theta)^2}\cdot  \frac{1+\exp \theta}{\exp \theta} & s=1,a=\mathsf{stay}  \\  \frac{-\exp \theta}{(1+\exp \theta)^2} \cdot  \frac{1+\exp \theta}{1} & s=1,a=\mathsf{switch} \end{cases}\)
  • \(F_0 = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[\log \pi_\theta(a|s)]^2_{\theta=\theta_0} ]\)
    • \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{1+\exp \theta_0} \cdot \frac{1}{(1+\exp \theta_0)^2} + \frac{1}{1+\exp \theta_0}\cdot \frac{(-\exp \theta_0)^2}{(1+\exp \theta_0)^2}\right)\)
    • \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left(  \frac{\exp \theta_0}{(1+\exp \theta_0)^2} \right)\)

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
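
A small numeric sanity check of this calculation (a standalone sketch; the state-distribution weight \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) is set to 1 for illustration):

```python
import numpy as np

theta0 = 0.7
p_stay = np.exp(theta0) / (1 + np.exp(theta0))      # pi_theta0(stay | 1)
p_switch = 1 / (1 + np.exp(theta0))                 # pi_theta0(switch | 1)

# Scores d/dtheta log pi_theta(a | 1) at theta0, from the cases above
score_stay = 1 / (1 + np.exp(theta0))
score_switch = -np.exp(theta0) / (1 + np.exp(theta0))

# Fisher information: expected squared score under pi_theta0(. | 1)
F0 = p_stay * score_stay**2 + p_switch * score_switch**2
closed_form = np.exp(theta0) / (1 + np.exp(theta0))**2
print(F0, closed_form)   # the two values agree
```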

Constrained Optimization

  • Our approximation to  \(\max_\theta J(\theta)~ \text{s.t.} ~d(\theta, \theta_0)<\delta\) is $$\max_\theta\quad g_0^\top(\theta-\theta_0) \quad \text{s.t.}\quad (\theta-\theta_0)^\top F_{0} (\theta-\theta_0)<\delta$$
  • Claim: The maximum has the closed form expression $$\theta_\star =\theta_0+\alpha F_0^{-1}g_0$$ where \(\alpha = (\delta /g_0^\top F_0^{-1} g_0)^{1/2}\)
  • Proof outline:
    • Start by solving \(\max c^\top v \) s.t. \(\|v\|_2^2\leq \delta\) (PollEv)
    • Consider change of variables $$v=F_0^{1/2}(\theta-\theta_0),\quad c=F_0^{-1/2}g_0$$
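
Filling in the outline: by Cauchy–Schwarz the simpler problem is maximized at \(v_\star = \sqrt{\delta}\, c/\|c\|_2\), and undoing the change of variables recovers the claimed step: $$\theta_\star-\theta_0 = F_0^{-1/2}v_\star = \sqrt{\delta}\,\frac{F_0^{-1/2}F_0^{-1/2}g_0}{\|F_0^{-1/2}g_0\|_2} = \left(\frac{\delta}{g_0^\top F_0^{-1}g_0}\right)^{1/2}F_0^{-1}g_0 = \alpha\, F_0^{-1}g_0$$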

Natural Policy Gradient

Algorithm: Natural PG

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy and estimate \(\nabla J(\theta_i)\) as \(g_i\) (REINFORCE, Actor-Critic, etc.)
    • Estimate the Fisher information matrix $$F_i = \nabla \log \pi_{\theta_i}(a|s) \nabla \log \pi_{\theta_i}(a|s)^\top ,\quad s,a\sim d^{\pi_i}_{\mu_0}$$
    • Update \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\)

In practice, it is common to use minibatches of samples from \( d^{\pi_i}_{\mu_0}\)
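
A minimal Python sketch of one NPG update under this scheme, where `grad_log_pi(theta, s, a)` is a hypothetical helper returning the score vector \(\nabla_\theta\log\pi_\theta(a|s)\) and `states, actions` is a minibatch drawn from \(d^{\pi_i}_{\mu_0}\); the small ridge term is an added safeguard, not part of the idealized algorithm:

```python
import numpy as np

def npg_step(theta, g, states, actions, grad_log_pi, alpha=0.05, ridge=1e-3):
    """One natural policy gradient step: precondition the estimate g by F_i^{-1}."""
    d = theta.shape[0]
    F = np.zeros((d, d))
    for s, a in zip(states, actions):
        score = grad_log_pi(theta, s, a)              # grad_theta log pi_theta(a|s)
        F += np.outer(score, score)
    F /= len(states)                                  # sample average of the outer products
    direction = np.linalg.solve(F + ridge * np.eye(d), g)   # F^{-1} g without forming the inverse
    return theta + alpha * direction
```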

[Diagram: two-state MDP (states \(0\) and \(1\)) with transition arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).]

Example

  • Parametrized policy: \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • NPG: \(\theta_1=\theta_0 + \alpha \frac{1}{F_0}g_0\); GA: \(\theta_1=\theta_0 + \alpha g_0 \)
  • \(F_0 \propto  \frac{\exp \theta_0}{(1+\exp \theta_0)^2}\to 0\) as \(\theta_0\to\pm\infty\)
  • NPG takes bigger and bigger steps as \(\theta\) becomes more extreme

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

[Figure: the \(\theta\) axis from \(-\infty\) to \(+\infty\); the policy tends toward always switching as \(\theta\to-\infty\) and always staying as \(\theta\to+\infty\).]

Agenda

1. Recap

2. Natural PG

3. Proximal Policy Opt

4. Review

Motivation: Lagrangian Relaxation

  • Trust region optimization $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d_{KL}(\theta_0, \theta)<\delta$$
  • Dealing with constraints and with the matrix inversion in NPG ("second order") is computationally costly
  • Methods that only use gradients ("first order") are less costly
  • Idea: run gradient ascent on the relaxed objective $$\max_\theta\quad J(\theta)-\lambda d_{KL}(\theta_0, \theta)$$

Local objective \(J(\theta; \theta_0)\)

  • Define: New local objective centered at \(\theta_0\): $$J(\theta;\theta_0)= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a) \right]  \right]$$
  • Recall the performance difference lemma $$\mathbb E_{s\sim\mu_0}[V^\pi(s) - V^{\pi'}(s)] = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{\mu_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$
  • Intuition: maximize the local advantage against \(\pi_{\theta_0}\) (while the trust region keeps the distributions induced by \(\theta\) and \(\theta_0\) close)
  • Fact: the gradients coincide at the center: \(\nabla_\theta J(\theta;\theta_0)\big|_{\theta=\theta_0}=\nabla_\theta J(\theta)\big|_{\theta=\theta_0}\)
  • Importance weighting: $$J(\theta;\theta_0)= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_{\theta_0} (s)}\left[\frac{\pi_{\theta}(a|s) }{\pi_{\theta_0}(a|s) }A^{\pi_{\theta_0}}(s,a) \right]  \right]$$

Local objective has the same gradient as \(J(\theta)\) when \(\theta=\theta_0\)$$\nabla_{\theta} \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a) \right]  \right]= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\nabla_{\theta}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a) \right]  \right]$$

Using the importance weighting trick from Lecture 16

$$= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[\nabla_{\theta}\log\pi_\theta(a|s)A^{\pi_{\theta_0}}(s,a) \right]  \right]$$

If \(\theta=\theta_0\), this is the gradient expression from Actor-Critic with Advantage (Lecture 17)
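
Spelled out, the identity being used is (for a fixed state \(s\) and any function \(f\) that does not depend on \(\theta\)): $$\nabla_\theta\,\mathbb E_{a\sim\pi_\theta(s)}\big[f(a)\big] = \sum_a \nabla_\theta\pi_\theta(a|s)f(a) = \mathbb E_{a\sim\pi_{\theta_0}(s)}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_0}(a|s)}\,\nabla_\theta\log\pi_\theta(a|s)\,f(a)\Big]$$ Setting \(\theta=\theta_0\) makes the importance weight equal to 1, which gives the expression above.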

Distance Penalty

$$\max_\theta\quad J(\theta;\theta_0)-\lambda d_{KL}(\theta_0, \theta)$$

  • Let's simplify the penalty term
  • \(d_{KL}(\theta_0, \theta)=\mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right]\)
    • \( = \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{1}{\pi_\theta(a|s)}\right]+\mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log \pi_{\theta_0}(a|s)\right]\)
    • \( = -\mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\pi_\theta(a|s)\right]+\) [term independent of \(\theta\)]

Proximal Policy Optimization

$$\max_\theta\quad \mathbb E_{s,a\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \frac{\pi_{\theta}(a|s) }{\pi_{\theta_0}(a|s) }A^{\pi_{\theta_0}}(s,a)   \right]+\lambda  \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log{\pi_\theta(a|s)}\right]$$

Algorithm: Idealized PPO

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy to sample \(s,a\sim d^{\pi_i}_{\mu_0}\)
    • Define \(L(\theta)=\frac{\pi_{\theta}(a|s) }{\pi_{\theta_i}(a|s) }A^{\pi_{\theta_i}}(s,a)  +\lambda \log{\pi_\theta(a|s)}\)
    • Take several gradient steps on \(L(\theta)\), resulting in \(\theta_{i+1} \)

In practice, estimate \(\hat A^{\pi_{\theta_i}}\) and use minibatches of samples from \( d^{\pi_i}_{\mu_0}\)
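
A minimal Python sketch of the inner loop, where `log_prob(theta, s, a)` and `log_prob_old(s, a)` are hypothetical helpers for \(\log\pi_\theta(a|s)\) and \(\log\pi_{\theta_i}(a|s)\), advantages are precomputed estimates, and a crude finite-difference gradient stands in for the autodiff a real implementation would use; note this is the KL-penalty form from the slide, not the clipped-surrogate PPO variant:

```python
import numpy as np

def ppo_objective(theta, states, actions, adv, log_prob, log_prob_old, lam=0.01):
    """Surrogate from the slide: importance ratio times advantage, plus lambda * log pi."""
    total = 0.0
    for s, a, A in zip(states, actions, adv):
        ratio = np.exp(log_prob(theta, s, a) - log_prob_old(s, a))
        total += ratio * A + lam * log_prob(theta, s, a)
    return total / len(states)

def ppo_inner_loop(theta, states, actions, adv, log_prob, log_prob_old, steps=10, lr=0.01, eps=1e-5):
    """Several ascent steps on the surrogate, producing theta_{i+1}."""
    for _ in range(steps):
        base = ppo_objective(theta, states, actions, adv, log_prob, log_prob_old)
        grad = np.zeros_like(theta)
        for j in range(theta.size):
            theta_pert = theta.copy()
            theta_pert[j] += eps
            grad[j] = (ppo_objective(theta_pert, states, actions, adv, log_prob, log_prob_old) - base) / eps
        theta = theta + lr * grad
    return theta
```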

Agenda

1. Recap

2. Natural PG

3. Proximal Policy Opt

4. Review

  • Supervised learning: features \(x\) and labels \(y\)
    • Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
    • Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
    • Method: \(\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
  • Fitted Value Iteration: fixed point iteration algorithm, like VI
    • Instead of Bellman Optimality Operator in iteration \(k\), use supervised learning with $$x_i=(s_i,a_i),\quad y_i=r(s_i,a_i)+\gamma \max _{a} Q^k (s_{i+1}, a)$$ to find \(Q^{k+1}\)
    • Dataset can be off policy
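
A sketch of one such iteration under the stated recipe, assuming states and actions are NumPy feature vectors, `regressor` is any object with a scikit-learn-style `fit`/`predict` interface, and `actions` is the finite action set:

```python
import numpy as np

def fitted_q_iteration_step(regressor, transitions, Q_prev, actions, gamma=0.99):
    """One fitted-VI step: regress Q^{k+1}(s, a) onto r + gamma * max_a' Q^k(s', a')."""
    X, y = [], []
    for (s, a, r, s_next) in transitions:             # dataset may be off-policy
        target = r + gamma * max(Q_prev(s_next, a_next) for a_next in actions)
        X.append(np.concatenate([s, a]))              # feature x_i = (s_i, a_i)
        y.append(target)                              # label y_i = Bellman backup
    regressor.fit(np.array(X), np.array(y))           # the supervised learning step
    return lambda s, a: regressor.predict(np.concatenate([s, a])[None, :])[0]
```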

Value-based RL

  • Fitted Policy Iteration: replace Policy Evaluation step with Fitted Policy Evaluation
    • Incremental policy updates to avoid oscillation (Performance Difference Lemma)
  • Fitted Policy Evaluation: given on-policy data from \(\pi\), estimate \(Q^{\pi}\) (PSets: how to use off-policy data)
    • Approximate: at iteration \(j\), replace Bellman Consistency Equation with supervised learning on $$x_i=(s_i,a_i),\quad y_i=r(s_i,a_i)+\gamma Q^j (s_{i+1}, a_{i+1})$$
    • Direct: supervised learning on $$x_i=(s_i,a_i),\quad \textstyle y_i=\sum_{\ell=i}^{i+h_i} r\left(s_\ell, a_\ell\right) $$
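
The two label choices, written out as a small sketch (rewards are the observed values along an on-policy trajectory):

```python
def bootstrap_label(r, s_next, a_next, Q_prev, gamma=0.99):
    """'Approximate' label: Bellman-consistency backup r + gamma * Q^j(s', a')."""
    return r + gamma * Q_prev(s_next, a_next)

def rollout_label(rewards_window):
    """'Direct' label: sum of observed rewards over the sampled window starting at i."""
    return sum(rewards_window)
```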

Value-Based RL

Policy Optimization

  • \(J(\theta)=\) expected cumulative reward under policy \(\pi_\theta\)
  • Estimate \(\nabla_\theta J(\theta)\) via rollouts \(\tau\), observed reward \(R(\tau)\)
    • Random Search: \(\theta + \delta v\) , \(g=\frac{1}{2\delta}R(\tau) v\)
    • REINFORCE: \(g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\)
    • Actor-Critic: \(s,a\sim d^{\pi_\theta}_{\mu_0}\),
      \(g=\frac{1}{1-\gamma} \nabla_\theta \log \pi_\theta(a|s) (\hat Q^{\pi_\theta}(s,a)-b(s)) \)
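
A sketch of the REINFORCE estimator from this list, where `grad_log_pi(theta, s, a)` is a hypothetical helper returning \(\nabla_\theta\log\pi_\theta(a|s)\) and each rollout is a (states, actions, rewards) triple:

```python
import numpy as np

def reinforce_gradient(theta, rollouts, grad_log_pi):
    """g = average over rollouts of [ sum_t grad log pi(a_t|s_t) ] * R(tau)."""
    g = np.zeros_like(theta)
    for states, actions, rewards in rollouts:
        R = sum(rewards)                                          # observed return R(tau)
        score_sum = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
        g = g + score_sum * R
    return g / len(rollouts)
```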

Food for thought: how to compute off-policy gradient estimate?

Food for thought: compare the bias and variance of different gradient estimates or supervised learning labels.

Policy Optimization

  • Policy Gradient Meta-Algorithm
    for \(i=0,1,...\)
    1. collect rollouts using \(\theta_i\)
    2. estimate gradient with \(g_i\)
    3. \(\theta_{i+1} = \theta_i+ \alpha g_i\)
  • Trust regions $$ \max ~J(\theta)~~\text{s.t.} ~~d_{KL}(\theta_0, \theta)\leq \delta $$
    • Natural PG: first/second order approximation
    • Proximal PO: Lagrangian relaxation

Recap

  • PSet due Fri, PA due Mon
  • OH and Ed changes due to break, prelim
  • Prelim in lecture 4/10

 

  • Natural Policy Gradient
  • Proximal Policy Optimization

 

  • Happy spring break!