CS 4/5789: Introduction to Reinforcement Learning

Lecture 17: Policy Optimization

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • PA 3 due Friday 3/31
    • PSet 4 due Friday (moved from Wednesday)
    • 5789 Paper Reviews due weekly on Mondays
      • Hard deadline for 3 reviews by Friday
  • My Tuesday office hours moved this week to 3-4pm

Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

Recap: SGA

  • Rather than exact gradients, SGA uses unbiased estimates of the gradient \(g_i\), i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$

Algorithm: SGA

  • Initialize \(\theta_0\); For \(i=0,1,...\):
    • \(\theta_{i+1} = \theta_i + \alpha g_i\)
[Figure: iterates of gradient ascent vs. stochastic gradient ascent on the level sets of a 2D quadratic function, plotted in the \((\theta_1, \theta_2)\) plane]
Recap: DFO

  • Random Search
    • Based on finite difference approximation
    • \(g_i=\frac{1}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v\) for \(v\sim \mathcal N(0,I)\)
  • Monte Carlo / Importance Weighting
    • Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
    • \(g_i=\underbrace{\left[\nabla_{\theta}\log P_\theta(z)\right]_{\theta=\theta_i}}_{\text{score}} h(z)\) for \(z\sim P_{\theta_i}\)

Recap: Random Search Example

\(J(\theta) = -\theta^2 - 1\)

\(\nabla J(\theta) \approx \frac{1}{2\delta} \left(J(\theta+\delta v) - J(\theta-\delta v)\right)v\)

[Figure: the parabola \(J(\theta) = -\theta^2 - 1\) plotted against \(\theta\), with the perturbed evaluation points \(\theta\pm\delta v\)]
  • start with \(\theta\) positive
  • suppose \(v\) is positive
  • then \(J(\theta+\delta v)<J(\theta-\delta v)\)
  • therefore \(g\) is negative
  • indeed, \(\nabla J(\theta) = -2\theta<0\) when \(\theta>0\)
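
A minimal numpy sketch of this random-search example; the step size, perturbation scale, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def J(theta):
    # objective from the example: J(theta) = -theta^2 - 1
    return -theta**2 - 1

rng = np.random.default_rng(0)
theta, alpha, delta = 2.0, 0.1, 0.1   # illustrative values

for i in range(50):
    v = rng.standard_normal()                                   # v ~ N(0, 1) for scalar theta
    g = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v
    theta = theta + alpha * g                                   # stochastic gradient ascent step
print(theta)  # approaches 0, the maximizer of J
```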

Recap: Monte Carlo Example

\(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\) with \(P_\theta = \mathcal N(\theta, 1)\) and \(h(z) = -z^2\), so \(J(\theta)=\mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2] = -\theta^2 - 1\)

\(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), so the score is \(\nabla_\theta \log P_\theta(z)= (z-\theta)\)

\(\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(z))\, h(z)\)

  • start with \(\theta\) positive
  • suppose \(z>\theta\)
  • then the score is positive
  • therefore \(g\) is negative (since \(h(z)<0\))

[Figure: the parabola \(h(z) = -z^2\) plotted against \(z\), with the sampling distribution \(P_\theta = \mathcal N(\theta, 1)\)]
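
A minimal sketch of the score-function (Monte Carlo) estimator for this example; the averaging over many single-sample estimates is only there to check that the mean matches the true gradient \(\nabla J(\theta) = -2\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5

def g_estimate(theta):
    # single-sample estimate: g = score * h(z)
    z = rng.normal(loc=theta, scale=1.0)   # z ~ P_theta = N(theta, 1)
    score = z - theta                      # grad_theta log N(z; theta, 1)
    return score * (-z**2)                 # h(z) = -z^2

# check unbiasedness: the empirical mean should be close to -2 * theta = -3
samples = np.array([g_estimate(theta) for _ in range(200_000)])
print(samples.mean(), -2 * theta)
```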

Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$

  • Recall notation for a trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\) and probability of a trajectory \(\mathbb P^{\pi}_{\mu_0}\)
  • Define cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)
  • For a parametric (e.g. deep) policy \(\pi_\theta\), the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right] $$

Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

  • Assume that we can "rollout" policy \(\pi_\theta\) to observe:

    • a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)

    • the resulting cumulative reward \(R(\tau)\)

  • Note: we do not need to know \(P\) or \(r\)!

Policy Optimization

Meta-Algorithm: Policy Optimization

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy
    • Estimate \(\nabla J(\theta_i)\) as \(g_i\) using rollouts
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
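
The meta-algorithm is stochastic gradient ascent with rollout-based gradient estimates. A schematic sketch, where `rollout` and `estimate_gradient` are hypothetical placeholders standing in for the estimators covered below:

```python
import numpy as np

def policy_optimization(theta0, rollout, estimate_gradient, alpha=0.01, iters=100):
    """Generic policy optimization loop.

    rollout(theta)                 -> data from running pi_theta (e.g. trajectories)
    estimate_gradient(theta, data) -> estimate g with E[g | theta] = grad J(theta)
    """
    theta = np.asarray(theta0, dtype=float)
    for i in range(iters):
        data = rollout(theta)               # interact with the MDP
        g = estimate_gradient(theta, data)  # unbiased gradient estimate
        theta = theta + alpha * g           # SGA update
    return theta
```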

Policy Optimization Overview

  • In today's lecture, we review four ways to construct the estimates \(g_i\) such that $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$
  • We will have
    • two estimates that use trajectories \(\tau\)
    • two estimates that also use Q/Value functions
  • We consider infinite-length trajectories \(\tau\) without worrying about computational feasibility
    • note that we could use a similar trick as in Lectures 12-13 to ensure finite sampling size/time

Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

Random Policy Search

  • Successful in continuous action/control settings
  • Can improve accuracy of the gradient estimate by sampling more directions \(v\), which is easily parallelizable in simulation

Algorithm: Random Policy Search

  • Given \(\alpha, \delta\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Sample \(v\sim \mathcal N(0, I)\)
    • Rollout policies \(\pi_{\theta_i\pm\delta v}\) and observe trajectories \(\tau_+\) and \(\tau_-\)
    • Estimate \(g_i = \frac{1}{2\delta}\left(R(\tau_+)-R(\tau_-)\right) v\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
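
A sketch of Random Policy Search against a hypothetical `rollout_return(theta)` function, which is assumed to run \(\pi_\theta\) once and return the observed cumulative reward \(R(\tau)\):

```python
import numpy as np

def random_policy_search(theta0, rollout_return, alpha=0.01, delta=0.05, iters=1000, seed=0):
    """Random Policy Search with a two-rollout finite-difference gradient estimate."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for i in range(iters):
        v = rng.standard_normal(theta.shape)            # v ~ N(0, I)
        R_plus = rollout_return(theta + delta * v)      # rollout of pi_{theta + delta v}
        R_minus = rollout_return(theta - delta * v)     # rollout of pi_{theta - delta v}
        g = (R_plus - R_minus) / (2 * delta) * v        # gradient estimate
        theta = theta + alpha * g
    return theta
```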

We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\) up to accuracy of finite difference approximation

  • \(\mathbb E[g_i| \theta_i] = \mathbb E[\mathbb E[g_i|v, \theta_i]| \theta_i]\) by tower property
  • \(\mathbb E[g_i|v, \theta_i]\)
    • \(= \mathbb E_{\tau_+, \tau_-}[\frac{1}{2\delta}\left(R(\tau_+)-R(\tau_-)\right) v]\)
    • \(=\frac{1}{2\delta}\left( \mathbb E_{\tau_+\sim \mathbb P^{\pi_{\theta_i+\delta v}}_{\mu_0}}[R(\tau_+)]-\mathbb E_{\tau_-\sim \mathbb P^{\pi_{\theta_i-\delta v}}_{\mu_0}}[R(\tau_-)]\right) v\)
    • \(=\frac{1}{2\delta}\left( J(\theta_i+\delta v) - J(\theta_i-\delta v)\right) v\)
    • \(=\nabla J(\theta_i)^\top v v\) if the finite difference approximation is perfect
  • \(\mathbb E[g_i| \theta_i] = \mathbb E_v[\nabla J(\theta_i)^\top v v] = \nabla J(\theta_i)\)
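
The last step uses \(\mathbb E[vv^\top] = I\) for \(v\sim\mathcal N(0,I)\), so \(\mathbb E[(\nabla J^\top v)\, v] = \nabla J\). A quick numerical sanity check, where the vector `grad` is an arbitrary stand-in for \(\nabla J(\theta_i)\):

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([1.0, -2.0, 0.5])     # stand-in for grad J(theta_i)

# average (grad^T v) v over many Gaussian directions v ~ N(0, I)
vs = rng.standard_normal((500_000, 3))
estimate = ((vs @ grad)[:, None] * vs).mean(axis=0)
print(estimate)   # close to grad, since E[v v^T] = I
```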

Example

[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/switch; transition probabilities labeled \(1\), \(p_1\), \(1-p_1\), \(1-p_2\), \(p_2\)]

  • Parametrized policy: \(\pi_\theta(0)=\)stay, \(\pi_\theta(\mathsf{stay}|1) = \theta^{(1)}\) and \(\pi_\theta(\mathsf{switch}|1) = \theta^{(2)}\).
  • Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)

    • try perturbation in favor of "switch", then in favor of "stay"

    • update in direction of policy which receives more cumulative reward

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

[Figure: iterates \(i=0\) and \(i=1\) in the \((\theta^{(1)}, \theta^{(2)})\) parameter plane]

Policy Gradient (REINFORCE)

Algorithm: REINFORCE

  • Given \(\alpha\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy \(\pi_{\theta_i}\) and observe trajectory \(\tau\)
    • Estimate \(g_i = \sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)

Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • Recall the Monte Carlo gradient and that \(J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]\)
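
A sketch of the REINFORCE estimate for one finite-length trajectory, assuming a tabular softmax policy; the `grad_log_pi` helper and the trajectory format (parallel lists of states, actions, rewards) are illustrative assumptions, not the only way to implement this:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Score for a tabular softmax policy; theta has shape (num_states, num_actions)."""
    probs = softmax(theta[s])
    g = np.zeros_like(theta)
    g[s] = -probs
    g[s, a] += 1.0               # grad of log softmax: one-hot(a) - probs
    return g

def reinforce_estimate(theta, states, actions, rewards, gamma=0.99):
    """g = (sum_t grad log pi(a_t|s_t)) * R(tau), with R(tau) = sum_t gamma^t r_t."""
    R = sum(gamma**t * r for t, r in enumerate(rewards))
    score_sum = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
    return score_sum * R
```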

Example

[Figure: the same two-state MDP with states \(0\) and \(1\) and stay/switch transition probabilities \(1\), \(p_1\), \(1-p_1\), \(1-p_2\), \(p_2\)]

  • Parametrized policy: \(\pi_\theta(0)=\)stay, \(\pi_\theta(\mathsf{stay}|1) = \theta^{(1)}\) and \(\pi_\theta(\mathsf{switch}|1) = \theta^{(2)}\).
  • Compute the score:
    • \(\nabla_\theta \log \pi_\theta(a|s)=\begin{bmatrix} 1/\theta^{(1)} \cdot \mathbb 1\{a=\mathsf{stay}\} \\ 1/\theta^{(2)}  \cdot \mathbb 1\{a=\mathsf{switch}\}\end{bmatrix}\)
  • Initialize \(\theta^{(1)}_0=\theta^{(2)}_0=1/2\)

    • rollout, then sum score over trajectory $$g_0 \propto \begin{bmatrix}  \text{\# times } s=1,a=\mathsf{stay} \\ \text{\# times } s=1,a=\mathsf{switch} \end{bmatrix} $$

  • Direction of update depends on empirical action frequency, size depends on \(R(\tau)\)

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
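
A small sketch of the score and the resulting estimate for this example's two-parameter policy; the trajectory and discount factor below are made-up illustrations, not drawn from the actual MDP dynamics:

```python
import numpy as np

theta = np.array([0.5, 0.5])   # [theta^(1), theta^(2)]: probabilities of stay, switch in state 1

def score(s, a):
    """grad_theta log pi_theta(a|s); zero at state 0, where the policy is fixed."""
    if s == 0:
        return np.zeros(2)
    return np.array([1 / theta[0], 0.0]) if a == "stay" else np.array([0.0, 1 / theta[1]])

# hypothetical trajectory fragment as (state, action, reward) triples,
# with reward +1 if s=0 and -1/2 if a=switch
traj = [(1, "switch", -0.5), (0, "stay", 1.0), (0, "stay", 1.0), (1, "stay", 0.0)]

gamma = 0.9   # illustrative discount factor
R = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
g0 = sum(score(s, a) for s, a, _ in traj) * R
print(g0)   # direction ~ counts of (s=1, stay) and (s=1, switch); magnitude scales with R(tau)
```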

Policy Gradient (REINFORCE)

Claim: The gradient estimate \(g_i=\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)\) is unbiased

  • Recall that \(J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]\)
  • by the Monte Carlo gradient, \(\nabla J(\theta_i) = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i} R(\tau)] \)
  • Since \(\tau \sim \mathbb P^{\pi_{\theta_i}}_{\mu_0} \), it suffices to show that $$\textstyle \nabla_\theta\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau) = \sum_{t=0}^\infty \nabla_\theta\log \pi_\theta(a_t|s_t)  $$
  • Key ideas:
    • \(\mathbb P^{\pi_{\theta}}_{\mu_0}\) factors into terms depending on \(P\) and \(\pi_\theta\)
    • the logarithm of a product is the sum of the logarithms
    • only terms depending on \(\theta\) affect the gradient

We have that \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • Using the Monte Carlo derivation from last lecture $$\nabla J(\theta_i) = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i} R(\tau)] $$
  • \(\log \mathbb P^{\pi_{\theta_i}}_{\mu_0}(\tau)\)
    • \( =\log \left(\mu_0(s_0) \pi_{\theta_i} (a_0|s_0) P(s_1|a_0,s_0) \pi_{\theta_i} (a_1|s_1) P(s_2|a_1,s_1)...\right) \)
    • \( =\log \mu_0(s_0) + \sum_{t=0}^\infty \left(\log \pi_{\theta_i} (a_t|s_t)+ \log P(s_{t+1}|a_t,s_t)\right)  \)
  • \(\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i}\)
    • \( =\cancel{\nabla_\theta \log \mu_0(s_0)} + \sum_{t=0}^\infty \left(\nabla_\theta [\log \pi_{\theta} (a_t|s_t)]_{\theta=\theta_i}+ \cancel{\nabla_\theta \log P(s_{t+1}|a_t,s_t)}\right)  \)
    • Thus \(\nabla_\theta \log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)\) ends up having no dependence on unknown \(P\)!
  • \(\mathbb E[g_i| \theta_i] = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau) ]\)
    • \(= \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i}R(\tau) ]\) by above
    • \(= \nabla J(\theta_i)\)

Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

Motivation: PG with Value

  • So far, the methods depend on an entire trajectory rollout
  • This leads to high-variance estimates
  • Incorporating the (Q) Value function can reduce variance
  • In practice, we can only use estimates of the Q/Value function
    • this results in bias (Lecture 15)
    • we ignore this issue today


Policy Gradient with (Q) Value

Algorithm: Idealized Actor Critic

  • Given \(\alpha\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Roll "in" policy \(\pi_{\theta_i}\) to sample \(s,a\sim d_{\mu_0}^{\pi_{\theta_i}}\)
    • Estimate \(g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
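
A sketch of one idealized actor-critic gradient estimate, assuming access to an exact \(Q^{\pi_\theta}\) and an environment interface; stopping each step with probability \(1-\gamma\) is one standard way to sample \((s,a)\sim d^{\pi_\theta}_{\mu_0}\). The `env`, `sample_action`, `grad_log_pi`, and `Q` arguments are hypothetical interfaces:

```python
import numpy as np

def actor_critic_estimate(theta, env, sample_action, grad_log_pi, Q, gamma=0.99, seed=None):
    """One sample of g = 1/(1-gamma) * grad log pi(a|s) * Q(s, a), with (s, a) ~ d^pi_mu0.

    Assumed (hypothetical) interfaces:
      env.reset() -> initial state,  env.step(a) -> next state
      sample_action(theta, s) -> a ~ pi_theta(.|s)
      grad_log_pi(theta, s, a) -> score vector
      Q(s, a) -> exact Q^{pi_theta}(s, a)
    """
    rng = np.random.default_rng(seed)
    s = env.reset()
    a = sample_action(theta, s)
    # roll in: continue with probability gamma, stop with probability 1 - gamma,
    # so the stopping pair (s, a) is distributed according to d^{pi_theta}_{mu_0}
    while rng.random() < gamma:
        s = env.step(a)
        a = sample_action(theta, s)
    return grad_log_pi(theta, s, a) * Q(s, a) / (1 - gamma)
```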

Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • Product rule on \(J(\theta) =\mathbb E_{\substack{s_0\sim \mu_0 \\ a_0\sim\pi_\theta(s_0)}}[ Q^{\pi_\theta}(s_0, a_0)] \) to derive recursion $$\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] =  \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]$$
  • Starting with a different decomposition of cumulative reward: $$\nabla J(\theta) = \nabla_{\theta} \mathbb E_{s_0\sim\mu_0}[V^{\pi_\theta}(s_0)] =\mathbb E_{s_0\sim\mu_0}[ \nabla_{\theta} V^{\pi_\theta}(s_0)]$$
  • \(\nabla_{\theta} V^{\pi_\theta}(s_0) = \nabla_{\theta} \mathbb E_{a_0\sim\pi_\theta(s_0)}[ Q^{\pi_\theta}(s_0, a_0)] \)
    • \(= \nabla_{\theta} \sum_{a_0\in\mathcal A} \pi_\theta(a_0|s_0)  Q^{\pi_\theta}(s_0, a_0) \)
    • \(=\sum_{a_0\in\mathcal A} \left( \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) +  \pi_\theta(a_0|s_0)  \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)]\right)  \)
  • Considering each term:
    • \(\sum_{a_0\in\mathcal A} \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) = \sum_{a_0\in\mathcal A}  \pi_\theta(a_0|s_0) \frac{\nabla_{\theta} [\pi_\theta(a_0|s_0) ]}{\pi_\theta(a_0|s_0) } Q^{\pi_\theta}(s_0, a_0) \)
      • \( = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] \)
    • \(\sum_{a_0\in\mathcal A}\pi_\theta(a_0|s_0)  \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)] = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} Q^{\pi_\theta}(s_0, a_0)]  \)
      • \(= \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [r(s_0,a_0) + \gamma \mathbb E_{s_1\sim P(s_0, a_0)}V^{\pi_\theta}(s_1)]]\)
      • \(=\gamma \mathbb E_{a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]\)
    • Recursion \(\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] =  \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]\)
    • Iterating this recursion leads to $$\nabla J(\theta) = \sum_{t=0}^\infty \gamma^t \mathbb E_{s_t, a_t}[\nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t)] $$ $$= \sum_{t=0}^\infty \gamma^t \sum_{s_t, a_t} d_{\mu_0, t}^{\pi_\theta}(s_t, a_t) \nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t) =\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{\mu_0}^{\pi_\theta}}[\nabla_{\theta} [\log \pi_\theta(a|s) ] Q^{\pi_\theta}(s, a)] $$

Policy Gradient with Advantage

The Advantage function is \(A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)\)

  • Claim: Same as previous slide in expectation over actions: $$ \mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]A^{\pi_{\theta}}(s,a)] = \mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]Q^{\pi_{\theta}}(s,a)]$$
  • It suffices to show that \(\mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]V^{\pi_{\theta}}(s)]=0\)

Algorithm: Idealized Actor Critic with Advantage

  • Same as previous slide, except estimation step
    • Estimate \(g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}A^{\pi_{\theta_i}}(s,a)\)
PG with "Baselines"

  • Claim: \(\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = 0\)
  • General principle: subtracting any action-independent "baseline" \(b(s)\) does not affect the expected value of the gradient estimate
  • Proof of claim:
    • \(\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]\)
      • \(=\sum_{a\in\mathcal A} \pi_\theta(a|s)\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]\)
      • \(=\sum_{a\in\mathcal A} \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \cdot b(s)\)
      • \(=\nabla_\theta \left[\sum_{a\in\mathcal A}\pi_\theta(a|s)\right] \cdot b(s)\) since \(b(s)\) does not depend on \(a\) or \(\theta\)
      • \(=\nabla_\theta [1] \cdot b(s) = 0\) since \(\sum_{a\in\mathcal A}\pi_\theta(a|s)=1\)
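
A quick numerical check of the baseline claim for a softmax policy at a single state: the exact expectation \(\sum_a \pi_\theta(a|s)\,\nabla_\theta \log\pi_\theta(a|s)\, b(s)\) is zero for any \(b(s)\). The logits and baseline value below are arbitrary:

```python
import numpy as np

logits = np.array([0.3, -1.2, 0.7])          # policy parameters at one state (arbitrary)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # pi_theta(.|s) via softmax

b = 5.0                                      # any action-independent baseline b(s)

# grad_theta log pi(a|s) for a softmax policy is one-hot(a) - probs
expectation = sum(probs[a] * ((np.eye(3)[a] - probs) * b) for a in range(3))
print(expectation)   # numerically ~ [0, 0, 0]
```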

Recap

  • PSet 4 due Friday (moved from Wednesday)
  • PA 3 due Friday

 

  • PG with rollouts: random search and REINFORCE
  • PG with value: Actor-Critic and baselines

 

  • Next lecture: Trust Regions
