Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Policy Optimization
3. REINFORCE
4. Value-based Gradients
Algorithm: SGA
Algorithm: One Point Random Search
\(\nabla J(\theta) \approx g = \frac{1}{2\delta} J(\theta + \delta v)\, v\)
[Figure: plot of \(J(\theta) = -\theta^2 - 1\) as a function of \(\theta\)]
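To make the recap concrete, here is a minimal Python sketch (my own illustration, not code from the lecture) of stochastic gradient ascent driven by a one-point random-search gradient estimate on the example \(J(\theta)=-\theta^2-1\). The step size, perturbation scale, and iteration count are arbitrary, and the estimate is scaled by \(1/\delta\) (for a Rademacher direction this matches the symmetric finite difference in expectation), whereas the slide shows a \(1/(2\delta)\) factor.

```python
import numpy as np

def J(theta):
    # objective from the slide's example: J(theta) = -theta^2 - 1
    return -theta**2 - 1.0

rng = np.random.default_rng(0)
theta, delta, eta = 2.0, 0.1, 0.001   # illustrative starting point and hyperparameters

for i in range(2000):
    v = rng.choice([-1.0, 1.0])           # random direction (Rademacher, since theta is scalar)
    g = J(theta + delta * v) * v / delta  # one-point random-search gradient estimate
    theta += eta * g                      # stochastic gradient ascent step

print(theta)  # noisily approaches the maximizer theta = 0
```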
1. Recap
2. Policy Optimization
3. REINFORCE
4. Value-based Gradients
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
Assume that we can "rollout" policy \(\pi_\theta\) to observe:
a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)
the resulting cumulative reward \(R(\tau)\)
Note: we do not need to know \(P\)! (Also easy to extend to the case that we don't know \(r\)!)
We consider infinite-length trajectories \(\tau\) without worrying about computational feasibility
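As a sketch of what a rollout buys us (my own illustration, not code from the lecture): sample a trajectory using only black-box access to \(\mu_0\), \(\pi_\theta\), \(r\), and \(P\), and accumulate \(R(\tau)=\sum_{t} \gamma^t r(s_t,a_t)\), truncating the infinite sum at a horizon where \(\gamma^t\) is negligible. The sampler arguments are hypothetical placeholders.

```python
def rollout(sample_initial_state, policy, reward, transition, gamma, horizon=1000):
    """Sample a trajectory tau under the policy and return (tau, R(tau)).

    The dynamics P and reward r are used only as black-box samplers, matching
    the assumption that we can roll out pi_theta without knowing the model.
    """
    s = sample_initial_state()              # s_0 ~ mu_0
    tau, R = [], 0.0
    for t in range(horizon):                # truncation of the infinite-length trajectory
        a = policy(s)                       # a_t ~ pi(s_t)
        r = reward(s, a)                    # r_t ~ r(s_t, a_t)
        tau.append((s, a, r))
        R += gamma**t * r                   # discounted cumulative reward
        s = transition(s, a)                # s_{t+1} ~ P(s_t, a_t)
    return tau, R
```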
Random Search Policy Optimization
[Figure: two-state MDP with states \(0\) and \(1\); transition/policy arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)
sample random perturbation, e.g. $$\theta^{(1)}_0+\delta,\qquad \theta^{(2)}_0-\delta$$
update \(\theta_1\) based on magnitude of reward
repeat
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
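Below is a minimal Python sketch of random-search policy optimization on a two-state MDP of this flavor. Since the figure is not fully recoverable, it assumes deterministic transitions (stay keeps the current state, switch flips it) and a direct parameterization \(\theta^{(s)} = \pi_\theta(\mathsf{stay}\mid s)\); the reward is \(+1\) when \(s=0\) and \(-\frac{1}{2}\) whenever \(a=\mathsf{switch}\), as on the slide. Batch averaging, clipping, and all hyperparameters are illustrative additions to tame the variance of the one-point estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon = 0.9, 100

def rollout_return(theta):
    """Discounted return of one rollout in the two-state example.

    Assumed parameterization (not fully recoverable from the figure):
    theta[s] = probability of action `stay` in state s; `switch` flips the state.
    Reward: +1 if s == 0, minus 1/2 whenever the action is `switch`.
    """
    s, R = int(rng.integers(2)), 0.0        # s_0 sampled uniformly (mu_0 not specified here)
    for t in range(horizon):
        stay = rng.random() < theta[s]      # a_t ~ pi_theta(s_t)
        R += gamma**t * ((1.0 if s == 0 else 0.0) - (0.0 if stay else 0.5))
        s = s if stay else 1 - s            # deterministic transitions
    return R

theta, delta, eta, batch = np.array([0.5, 0.5]), 0.1, 0.02, 64
for i in range(150):
    g = np.zeros(2)
    for _ in range(batch):                   # average one-point estimates over a batch
        v = rng.choice([-1.0, 1.0], size=2)  # random +/- delta perturbation
        g += rollout_return(np.clip(theta + delta * v, 0.01, 0.99)) * v / delta
    theta = np.clip(theta + eta * g / batch, 0.01, 0.99)   # gradient ascent step, kept in (0,1)

print(theta)  # should (noisily) move toward staying in state 0 and switching out of state 1
```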
1. Recap
2. Policy Optimization
3. REINFORCE
4. Value-based Gradients
Algorithm: Monte-Carlo DFO
\(\nabla J(\theta) \approx g = \underbrace{\nabla_\theta \log P_\theta(z)}_{\text{score}}\, h(z)\)
\(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
Example: \(P_\theta = \mathcal N(\theta, 1)\), so \(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\) and the score is \(\nabla_\theta \log P_\theta(z) = z-\theta\)
With \(h(z) = -z^2\): \(J(\theta) = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2] = -\theta^2 - 1\)
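A quick numerical sanity check of the estimator on this Gaussian example (not from the lecture; the sample size and value of \(\theta\) are arbitrary): averaging \(g = \nabla_\theta \log P_\theta(z)\, h(z) = (z-\theta)(-z^2)\) over draws \(z \sim \mathcal N(\theta,1)\) should recover \(\nabla J(\theta) = -2\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000

z = rng.normal(theta, 1.0, size=n)     # z ~ P_theta = N(theta, 1)
g = (z - theta) * (-z**2)              # score * h(z), the Monte-Carlo DFO gradient estimate

print(g.mean(), -2 * theta)            # sample mean of g vs. the true gradient -2*theta
```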
Claim: The gradient estimate is unbiased: \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
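Why the claim holds (the standard log-derivative trick, assuming differentiation and integration can be exchanged):
$$\mathbb E_{z\sim P_\theta}\big[\nabla_\theta \log P_\theta(z)\, h(z)\big] = \int \frac{\nabla_\theta P_\theta(z)}{P_\theta(z)}\, h(z)\, P_\theta(z)\, dz = \nabla_\theta \int P_\theta(z)\, h(z)\, dz = \nabla_\theta J(\theta)$$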
Algorithm: REINFORCE
[Figure: two-state MDP with states \(0\) and \(1\); transition/policy arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)
rollout, then sum score over trajectory $$g_0 \propto \begin{bmatrix} \text{\# times } s=1,a=\mathsf{stay} \\ \text{\# times } s=1,a=\mathsf{switch} \end{bmatrix} $$
Direction of the update depends on the empirical action frequencies; its size depends on \(R(\tau)\)
reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
Claim: The gradient estimate \(g_i=\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\) is unbiased
We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
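For concreteness, a minimal Python sketch of REINFORCE on the same two-state example, under the same assumed parameterization as in the random-search sketch above (\(\theta^{(s)} = \pi_\theta(\mathsf{stay}\mid s)\), which may differ from the lecture's figure). Batch averaging, truncation, clipping, and the hyperparameters are illustrative additions; the core update is \(g = \sum_t \nabla_\theta \log\pi_\theta(a_t|s_t)\, R(\tau)\) as in the claim above.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon = 0.9, 100

def rollout_with_score(theta):
    """One rollout; returns (sum over t of grad_theta log pi_theta(a_t|s_t), R(tau)).

    Assumed parameterization: theta[s] = pi_theta(stay | s), so the score in the
    coordinate of the visited state is +1/theta[s] for `stay`, -1/(1-theta[s]) for `switch`.
    """
    s, R, score = int(rng.integers(2)), 0.0, np.zeros(2)   # s_0 sampled uniformly (assumption)
    for t in range(horizon):
        stay = rng.random() < theta[s]
        score[s] += 1.0 / theta[s] if stay else -1.0 / (1.0 - theta[s])
        R += gamma**t * ((1.0 if s == 0 else 0.0) - (0.0 if stay else 0.5))
        s = s if stay else 1 - s                            # deterministic transitions
    return score, R

theta, eta, batch = np.array([0.5, 0.5]), 0.01, 200
for i in range(100):
    g = np.zeros(2)
    for _ in range(batch):              # average the REINFORCE estimate over a batch of rollouts
        score, R = rollout_with_score(theta)
        g += score * R                  # g = sum_t grad log pi(a_t|s_t) * R(tau)
    theta = np.clip(theta + eta * g / batch, 0.01, 0.99)

print(theta)  # pi(stay|0) rises toward 1 and pi(stay|1) falls toward 0 (noisily)
```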
1. Recap
2. Policy Optimization
3. REINFORCE
4. Value-based Gradients
Rollout:
\(\dots,\ s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
Algorithm: Idealized Actor Critic
Claim: The gradient estimate is unbiased: \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
The Advantage function is \(A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)\)
Algorithm: Idealized Actor Critic with Advantage
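The extract names these algorithms without their updates. For reference, a common form of the idealized value-based gradient estimate (this is the standard policy-gradient expression; the lecture's exact notation may differ) replaces \(R(\tau)\) in REINFORCE with the critic's evaluation of each action:
$$g_i = \sum_{t=0}^\infty \gamma^t\, \nabla_\theta\big[\log \pi_\theta(a_t\mid s_t)\big]_{\theta=\theta_i}\, A^{\pi_{\theta_i}}(s_t, a_t)$$
Using \(Q^{\pi_{\theta_i}}(s_t,a_t)\) in place of \(A^{\pi_{\theta_i}}(s_t,a_t)\) gives the plain actor-critic estimate; subtracting the baseline \(V^{\pi_{\theta_i}}(s_t)\) leaves the estimate unbiased (since \(\mathbb E_{a\sim\pi_\theta(s)}[\nabla_\theta\log\pi_\theta(a\mid s)] = 0\)) while reducing variance.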
Meta-Algorithm: Policy Optimization
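The meta-algorithm ties the section together: repeatedly estimate the gradient from rollouts (by random search, REINFORCE, or a value-based estimator) and take a gradient ascent step. A minimal sketch of this outer loop (my own illustration; `gradient_estimate` is a hypothetical callable standing in for any of the estimators above):

```python
def policy_optimization(theta_init, gradient_estimate, step_size, num_iters):
    """Meta-algorithm: iterate theta_{i+1} = theta_i + eta * g_i, where g_i is an
    (approximately) unbiased estimate of grad J(theta_i) built from rollouts of pi_theta_i."""
    theta = theta_init
    for i in range(num_iters):
        g = gradient_estimate(theta)   # e.g. one-point random search, REINFORCE, actor-critic
        theta = theta + step_size * g  # stochastic gradient ascent step
    return theta
```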