Sp24 CS 4/5789: Lecture 17

CS 4/5789: Introduction to Reinforcement Learning

Lecture 17: Trust Regions

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

Homework
- PA 3 and PSet 6 due Friday
- 5789 Paper Assignments
Break 3/30-4/7: no office hours, lectures
Class & prof office hours cancelled on Monday 4/8
- Extra TA office hours before prelim
- Prelim questions on Ed: use tag!
Second prelim on Wednesday 4/10 in class

Agenda

1. Recap: Policy Optimization

2. Gradients with Q/Value

3. Trust Regions

4. KL Divergence

$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}$

Goal: achieve high expected cumulative reward:

$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
Policy $\pi_\theta$ parametrized by $\theta$ (e.g. deep network)
Assume that we can "rollout" policy $\pi_\theta$ to observe:
- a sample $\tau = (s_0, a_0, s_1, a_1, \dots)$ from $\mathbb P^{\pi_\theta}_{\mu_0}$
- the resulting cumulative reward $R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)$
Note: we do not need to know $P$! (Also easy to extend to the case that we don't know $r$!)

Recap: Setting

Meta-Algorithm: Policy Optimization

Initialize $\theta_0$
For $i=0,1,...$:
- Rollout policy
- Estimate $\nabla J(\theta_i)$ as $g_i$ using rollouts
- Update $\theta_{i+1} = \theta_i + \alpha g_i$

Recap: Policy Optimization

Last time, we discussed two algorithms (Random Search and REINFORCE) for estimating gradients using a trajectory

Parametrized Policies

Policy $\pi_\theta$ maps $\mathcal S\to\Delta(\mathcal A)$
Example: tabular policy, $\theta\in\mathbb R^{SA}$, $$\pi_\theta(a|s) = \theta_{s,a}$$
Example: softmax linear policy, $\theta\in\mathbb R^{d}$, $\varphi:\mathcal S\times\mathcal A\to \mathbb R^d$, $$\pi_\theta(a|s) = \frac{\exp(\theta^\top\varphi(s,a))}{\sum_{a'\in\mathcal A}\exp(\theta^\top\varphi(s,a'))}$$
Example: neural network, weights $\theta\in\mathbb R^{d}$

$\pi(a_1|s)$

$\vdots$

$\pi(a_A|s)$

$s$

$0$

$1$

stay: $1$

switch: $1$

stay: $p_1$

switch: $1-p_2$

stay: $1-p_1$

switch: $p_2$

Example

Parametrized policy:
- $\pi_\theta(0)=$ stay
- $\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} $
What is optimal $\theta_\star$?
- PollEV

reward: $+1$ if $s=0$ and $-\frac{1}{2}$ if $a=$ switch

$\theta$

$+\infty$

$-\infty$

stay

switch

Agenda

1. Recap: Policy Optimization

2. Gradients with Q/Value

3. Trust Regions

4. KL Divergence

Gradient Estimates with Value

We now cover gradient estimates $g_i$ which depend on the value or Q function
These estimates are lower variance than trajectory-based estimates (Random Search and REINFORCE)
However, they are only unbiased when we can use the true Q function $Q^\pi$
- In reality, we will use estimates $\hat Q^\pi$ created with supervised learning (recall previous lectures)
- This makes the gradient estimates biased, but often worth it for the lower variance

Sampling from $d_\gamma^{\mu_0,\pi}$

Recall the discounted "steady-state" distribution $$ d^{\mu_0,\pi}_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d^{\mu_0,\pi}_t $$
On PSet, you showed that $V^\pi(s)=\mathbb E_{s'\sim d^{e_{s},\pi}_\gamma}[r(s',\pi(s'))]$
Can we sample from this distribution?
- Sample $s_0\sim\mu_0$ and $h\sim\mathrm{Geom}(1-\gamma)$
- Roll out $\pi$ for $h$ steps
- Claim: then $s_{h}\sim d_\gamma^\pi$
Shorthand: $s,a\sim d^{\mu_0,\pi}$ if $s\sim d^{\mu_0,\pi}$
and $a\sim \pi(s)$

Rollout:

$s_t$

$a_t\sim \pi(s_t)$

$r_t\sim r(s_t, a_t)$

$s_{t+1}\sim P(s_t, a_t)$

$a_{t+1}\sim \pi(s_{t+1})$

...

Algorithm: Idealized Actor Critic

Given $\alpha$. Initialize $\theta_0$
For $i=0,1,...$:
- Roll out $\pi_{\theta_i}$ to sample $s,a\sim d_\gamma^{\mu_0,\pi_{\theta_i}}$
- Estimate $g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)$
- Update $\theta_{i+1} = \theta_i + \alpha g_i$

Policy Gradient with (Q) Value

Claim: The gradient estimate is unbiased $\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$

I.e. $\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_\gamma^{\mu_0,\pi_{\theta_i}}}[ \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)]=\nabla J(\theta_i)$
Why? Product rule on $J(\theta) =\mathbb E_{\substack{s_0\sim \mu_0 \\ a_0\sim\pi_\theta(s_0)}}[ Q^{\pi_\theta}(s_0, a_0)] $

Starting with a different decomposition of cumulative reward: $$\nabla J(\theta) = \nabla_{\theta} \mathbb E_{s_0\sim\mu_0}[V^{\pi_\theta}(s_0)] =\mathbb E_{s_0\sim\mu_0}[ \nabla_{\theta} V^{\pi_\theta}(s_0)]$$
$\nabla_{\theta} V^{\pi_\theta}(s_0) = \nabla_{\theta} \mathbb E_{a_0\sim\pi_\theta(s_0)}[ Q^{\pi_\theta}(s_0, a_0)] $
- $= \nabla_{\theta} \sum_{a_0\in\mathcal A} \pi_\theta(a_0|s_0) Q^{\pi_\theta}(s_0, a_0) $
- $=\sum_{a_0\in\mathcal A} \left( \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) + \pi_\theta(a_0|s_0) \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)]\right) $
Considering each term:
- $\sum_{a_0\in\mathcal A} \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) = \sum_{a_0\in\mathcal A} \pi_\theta(a_0|s_0) \frac{\nabla_{\theta} [\pi_\theta(a_0|s_0) ]}{\pi_\theta(a_0|s_0) } Q^{\pi_\theta}(s_0, a_0) $
  - $ = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] $
- $\sum_{a_0\in\mathcal A}\pi_\theta(a_0|s_0) \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)] = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} Q^{\pi_\theta}(s_0, a_0)] $
  - $= \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [r(s,a) + \gamma \mathbb E_{s_1\sim P(s_0, a_0)}V^{\pi_\theta}(s_1)]]$
  - $=\gamma \mathbb E_{a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]$
- Recursion $\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] = \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]$
- Iterating this recursion leads to $$\nabla J(\theta) = \sum_{t=0}^\infty \gamma^t \mathbb E_{s_t, a_t}[\nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t)]] $$ $$= \sum_{t=0}^\infty \gamma^t \sum_{s_t, a_t} d_{\mu_0, t}^{\pi_\theta}(s_t, a_t) [\nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t)]] =\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{\mu_0}^{\pi_\theta}}[\nabla_{\theta} [\log \pi_\theta(a|s) ] Q^{\pi_\theta}(s, a)] $$

Sampling from $d_\gamma^{\mu_0,\pi}$

Recall the discounted "steady-state" distribution $$ d^{\mu_0,\pi}_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d^{\mu_0,\pi}_t $$
On PSet, you showed that $V^\pi(s)=\mathbb E_{s'\sim d^{e_{s},\pi}_\gamma}[r(s',\pi(s'))]$
Can we sample from this distribution?
- Sample $s_0\sim\mu_0$ and $h\sim\mathrm{Geom}(1-\gamma)$
- Roll out $\pi$ for $h$ steps
- Claim: then $s_{h}\sim d_\gamma^\pi$
Shorthand: $s,a\sim d^{\mu_0,\pi}$ if $s\sim d^{\mu_0,\pi}$
and $a\sim \pi(s)$

Rollout:

$s_t$

$a_t\sim \pi(s_t)$

$r_t\sim r(s_t, a_t)$

$s_{t+1}\sim P(s_t, a_t)$

$a_{t+1}\sim \pi(s_{t+1})$

...

The Advantage function is $A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)$

Claim: The gradient estimate is unbiased $\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$
Follows because we can show that $\mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]V^{\pi_{\theta}}(s)]=0$
- (next slide)

Policy Gradient with Advantage

Algorithm: Idealized Actor Critic with Advantage

Same as previous slide, except estimation step
- Estimate $g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}A^{\pi_{\theta_i}}(s,a)$

Claim: for any $b(s)$, $\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = 0$
- General principle: subtracting any action-independent "baseline" does not affect expected value $$g_i=\frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}\left(Q^{\pi_{\theta_i}}(s,a)-b(s)\right)\quad\text{is unbiased}$$
Proof of claim:
- $\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]$
  - $=\sum_{a\in\mathcal A} \pi_\theta(a|s)\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]$
  - $=\sum_{a\in\mathcal A} \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \cdot b(s)$
  - $=\nabla_\theta \sum_{a\in\mathcal A}\pi_\theta(a|s) \cdot b(s)$
  - $=\nabla_\theta b(s) = 0$

PG with "Baselines"

Agenda

1. Recap: Policy Optimization

2. Gradients with Q/Value

3. Trust Regions

4. KL Divergence

Motivation: Trust Regions

Recall: motivation of gradient ascent as first-order approximate maximization
- $\max_\theta J(\theta) \approx \max_{\theta} J(\theta_0) + \nabla J(\theta_0)^\top (\theta-\theta_0)$
The maximum occurs when $\theta-\theta_0$ is parallel to $\nabla J(\theta_0)$
- $\theta - \theta_0 = \alpha \nabla J(\theta_0) $
Q: Why do we normally use a small step size $\alpha$?
- A: The linear approximation is only locally valid, so a small $\alpha$ ensures that $\theta$ is close to $\theta_0$

Motivation: Trust Regions

Q: Why do we normally use a small step size $\alpha$?
- A: The linear approximation is only locally valid, so a small $\alpha$ ensures that $\theta$ is close to $\theta_0$

$\theta_1$

$\theta_2$

$\theta_1$

$\theta_2$

2D quadratic function

level sets of quadratic

Trust Regions

A trust region approach makes the intuition about step size more precise: $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
The trust region is described by a bounded "distance" from $\theta_0$
- General notion of distance, "divergence"
Another motivation relevant in RL: we might estimate $J(\theta)$ using data collected with $\theta_0$ (i.e. with policy $\pi_{\theta_0}$)
- The estimate may only be good close to $\theta_0$
- e.g. in Conservation PI, recall the incremental update
  - $\pi^{i+1}(a|s) = (1-\alpha) \pi^i(a|s) + \alpha \bar\pi(a|s)$

Agenda

1. Recap: Policy Optimization

2. Gradients with Q/Value

3. Trust Regions

4. KL Divergence

KL Divergence

Motivation: what is a good notion of distance?
The KL Divergence measures the "distance" between two distributions
Def: Given $P\in\Delta(\mathcal X)$ and $Q\in\Delta(\mathcal X)$, the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$

KL Divergence

Def: Given $P\in\Delta(\mathcal X)$ and $Q\in\Delta(\mathcal X)$, the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
Example: if $P,Q$ are Bernoullis with mean $p,q$
- then $KL(P|Q) = p\log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}$ (plot)
Example: if $P=\mathcal N(\mu_1, \sigma^2I)$ and $Q=\mathcal N(\mu_2, \sigma^2I)$
- then $KL(P|Q) = \|\mu_1-\mu_2\|_2^2/\sigma^2$
Fact: KL is always strictly positive unless $P=Q$ in which case it is zero.

KL Divergence for Policies

We define a measure of "distance" between $\pi_{\theta_0}$ and $\pi_\theta$
- KL divergence between action distributions
- averaged over states $s$ from the discounted steady state distribution of $\pi_{\theta_0}$
$d_{KL}(\theta_0, \theta)= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[KL(\pi_{\theta_0}(\cdot |s)|\pi_\theta(\cdot |s))\right] $
- $= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\mathbb E_{a\sim \pi_{\theta_0}}\right [\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\left]\right] $
- we will use the shorthand $s,a\sim d_{\mu_0}^{\pi_{\theta_0}}$
- $= \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}] $

$0$

$1$

stay: $1$

switch: $1$

stay: $p_1$

switch: $1-p_2$

stay: $1-p_1$

switch: $p_2$

Example

Parametrized policy:
- $\pi_\theta(0)=$ stay
- $\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} $
Distance between $\theta_0$ and $\theta$
- $d_{\mu_0}^{\pi_{\theta_0}}(0,$stay$)\cdot \log \frac{\pi_{\theta_0}(stay|0)}{\pi_\theta(stay|0)}$
- $d_{\mu_0}^{\pi_{\theta_0}}(0,$switch$)\cdot \log \frac{\pi_{\theta_0}(switch|0)}{\pi_\theta(switch|0)}$
- $d_{\mu_0}^{\pi_{\theta_0}}(1,$stay$)\cdot \log \frac{\pi_{\theta_0}(stay|1)}{\pi_\theta(stay|1)}$
- $d_{\mu_0}^{\pi_{\theta_0}}(1,$switch$)\cdot \log \frac{\pi_{\theta_0}(switch|1)}{\pi_\theta(switch|1)}$
- $d_{KL}(\theta_0, \theta)$

reward: $+1$ if $s=0$ and $-\frac{1}{2}$ if $a=$ switch

$+$

$0$

$1$

stay: $1$

switch: $1$

stay: $p_1$

switch: $1-p_2$

stay: $1-p_1$

switch: $p_2$

Example

Parametrized policy:
- $\pi_\theta(0)=$ stay
- $\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} $
Distance between $\theta_0$ and $\theta$
- 0
- 0
- $d_{\mu_0}^{\pi_{\theta_0}}(1,$stay$)\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta}$
- $d_{\mu_0}^{\pi_{\theta_0}}(1,$switch$)\cdot \log \frac{1+\exp \theta}{1+\exp \theta_0} $
- $d_{KL}(\theta_0, \theta)$

reward: $+1$ if $s=0$ and $-\frac{1}{2}$ if $a=$ switch

$+$

$0$

$1$

stay: $1$

switch: $1$

stay: $p_1$

switch: $1-p_2$

stay: $1-p_1$

switch: $p_2$

Example

Parametrized policy:
- $\pi_\theta(0)=$ stay
- $\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} $
Distance between $\theta_0$ and $\theta$
- $d_{\mu_0}^{\pi_{\theta_0}}(1)) \pi_\theta(stay|1)\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta}$
- $d_{\mu_0}^{\pi_{\theta_0}}(1) \pi_\theta(switch|1) \cdot \log \frac{1+\exp \theta}{1+\exp \theta_0} $
- $d_{\mu_0}^{\pi_{\theta_0}}(1) \left(\log \frac{1+\exp \theta}{1+\exp \theta_0} + \frac{\exp \theta}{1+\exp \theta} \log \frac{\exp \theta_0}{1+\exp \theta_0}\right)$

reward: $+1$ if $s=0$ and $-\frac{1}{2}$ if $a=$ switch

$+$

Recap

PSet/PA due Fri
OH and Ed changes due to break, prelim
No lecture Monday 4/8
Prelim in lecture 4/10

PO with value
Trust Regions
KL Divergence

Next lecture: Policy Optimization Algorithms based on Trust Regions

CS 4/5789: Introduction to Reinforcement Learning

Lecture 17: Trust Regions

Reminders

Agenda

Recap: Setting

Recap: Policy Optimization

Parametrized Policies

Example

Agenda

Gradient Estimates with Value

Sampling from \(d_\gamma^{\mu_0,\pi}\)

Policy Gradient with (Q) Value

Sampling from \(d_\gamma^{\mu_0,\pi}\)

Policy Gradient with Advantage

PG with "Baselines"

Agenda

Motivation: Trust Regions

Motivation: Trust Regions

Trust Regions

Agenda

KL Divergence

KL Divergence

KL Divergence for Policies

Example

Example

Example

Recap