CS 4/5789: Introduction to Reinforcement Learning

Lecture 17: Trust Regions

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • PA 3 and PSet 6 due Friday
    • 5789 Paper Assignments
  • Break 3/30-4/7: no office hours or lectures
  • Class & prof office hours cancelled on Monday 4/8
    • Extra TA office hours before prelim
    • Prelim questions on Ed: use tag!
  • Second prelim on Wednesday 4/10 in class

Agenda

1. Recap: Policy Optimization

2. Gradients with Q/Value

3. Trust Regions

4. KL Divergence

Recap: Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

  • Policy \(\pi_\theta\) parametrized by \(\theta\) (e.g. deep network)

  • Assume that we can "rollout" policy \(\pi_\theta\) to observe:

    • a sample \(\tau = (s_0, a_0, s_1, a_1, \dots)\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)

    • the resulting cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)

  • Note: we do not need to know \(P\)! (Also easy to extend to the case that we don't know \(r\)!)


Recap: Policy Optimization

Meta-Algorithm: Policy Optimization

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy
    • Estimate \(\nabla J(\theta_i)\) as \(g_i\) using rollouts
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)


Last time, we discussed two algorithms (Random Search and REINFORCE) for estimating gradients using a trajectory

Parametrized Policies

  • Policy \(\pi_\theta\) maps \(\mathcal S\to\Delta(\mathcal A)\)
  • Example: tabular policy, \(\theta\in\mathbb R^{SA}\), $$\pi_\theta(a|s) = \theta_{s,a}$$
  • Example: softmax linear policy, \(\theta\in\mathbb R^{d}\), \(\varphi:\mathcal S\times\mathcal A\to \mathbb R^d\), $$\pi_\theta(a|s) = \frac{\exp(\theta^\top\varphi(s,a))}{\sum_{a'\in\mathcal A}\exp(\theta^\top\varphi(s,a'))}$$
  • Example: neural network, weights \(\theta\in\mathbb R^{d}\)
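
As a concrete illustration of the softmax linear policy above, here is a minimal numpy sketch (the dimensions and feature values are made up) of computing \(\pi_\theta(\cdot|s)\) and the score \(\nabla_\theta \log \pi_\theta(a|s) = \varphi(s,a) - \mathbb E_{a'\sim\pi_\theta(\cdot|s)}[\varphi(s,a')]\):

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Action probabilities pi_theta(.|s) for a softmax linear policy.

    phi_s: (A, d) array whose rows are the features phi(s, a), one per action.
    """
    logits = phi_s @ theta
    logits -= logits.max()            # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def score(theta, phi_s, a):
    """Gradient of log pi_theta(a|s): phi(s, a) - E_{a' ~ pi_theta(.|s)}[phi(s, a')]."""
    p = softmax_policy(theta, phi_s)
    return phi_s[a] - p @ phi_s

# toy usage with made-up dimensions and features
d, A = 4, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=d)
phi_s = rng.normal(size=(A, d))       # stand-in for phi(s, a), a = 1..A
print(softmax_policy(theta, phi_s))
print(score(theta, phi_s, a=1))
```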

[Figure: a neural network policy maps state \(s\) to action probabilities \(\pi(a_1|s),\dots,\pi(a_A|s)\)]

Example

[Figure: two-state MDP with states \(0\) and \(1\) and actions stay and switch; the transition arrows are labeled with probabilities \(1\), \(p_1\), \(1-p_1\), \(p_2\), and \(1-p_2\)]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • What is optimal \(\theta_\star\)?
    • PollEV

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

[Figure: the policy in state \(1\) as a function of \(\theta\): as \(\theta\to+\infty\) it concentrates on stay, and as \(\theta\to-\infty\) on switch]

Agenda

1. Recap: Policy Optimization

2. Gradients with Q/Value

3. Trust Regions

4. KL Divergence

Gradient Estimates with Value

  • We now cover gradient estimates \(g_i\) which depend on the value or Q function
  • These estimates are lower variance than trajectory-based estimates (Random Search and REINFORCE)
  • However, they are only unbiased when we can use the true Q function \(Q^\pi\)
    • In reality, we will use estimates \(\hat Q^\pi\) created with supervised learning (recall previous lectures)
    • This makes the gradient estimates biased, but often worth it for the lower variance

Sampling from \(d_\gamma^{\mu_0,\pi}\)

  • Recall the discounted "steady-state" distribution $$ d^{\mu_0,\pi}_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d^{\mu_0,\pi}_t $$
  • On PSet, you showed that \(V^\pi(s)=\frac{1}{1-\gamma}\mathbb E_{s'\sim d^{e_{s},\pi}_\gamma}[r(s',\pi(s'))]\)
  • Can we sample from this distribution?
    • Sample \(s_0\sim\mu_0\) and \(h\sim\mathrm{Geom}(1-\gamma)\)
    • Roll out \(\pi\) for \(h\) steps
    • Claim: then \(s_{h}\sim d_\gamma^{\mu_0,\pi}\) (a code sketch of this sampling procedure follows the rollout below)
  • Shorthand: \(s,a\sim d^{\mu_0,\pi}\) means \(s\sim d^{\mu_0,\pi}_\gamma\) and \(a\sim \pi(s)\)

Rollout: \(s_t,\;\; a_t\sim \pi(s_t),\;\; r_t\sim r(s_t, a_t),\;\; s_{t+1}\sim P(s_t, a_t),\;\; a_{t+1}\sim \pi(s_{t+1}),\;\dots\)
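
A minimal sketch of this sampling procedure, assuming hypothetical `env.reset()` / `env.step(a)` helpers and an action sampler `pi(s)`:

```python
import numpy as np

def sample_from_d_gamma(env, pi, gamma, rng):
    """Sample (s, a) with s ~ d_gamma^{mu0, pi} and a ~ pi(s).

    Draw h ~ Geom(1 - gamma) supported on {0, 1, 2, ...}, roll out pi for
    h steps from s_0 ~ mu_0, and return (s_h, a_h).
    """
    h = rng.geometric(1 - gamma) - 1   # numpy's geometric starts at 1
    s = env.reset()                    # assumed to return s_0 ~ mu_0
    for _ in range(h):
        a = pi(s)
        s, _, _ = env.step(a)          # assumed to return (next state, reward, done)
    a = pi(s)
    return s, a
```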

Policy Gradient with (Q) Value

Algorithm: Idealized Actor Critic

  • Given \(\alpha\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Roll out \(\pi_{\theta_i}\) to sample \(s,a\sim d_\gamma^{\mu_0,\pi_{\theta_i}}\)
    • Estimate \(g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
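
Below is a sketch of one iteration of this idealized update, reusing `sample_from_d_gamma` from the previous slide and assuming hypothetical helpers `Q_pi(s, a, theta)` (the Q function of \(\pi_{\theta_i}\); in practice an estimate \(\hat Q^{\pi_{\theta_i}}\)) and `grad_log_pi(theta, s, a)` (the score):

```python
def actor_critic_step(theta, alpha, gamma, env, pi, Q_pi, grad_log_pi, rng):
    """One idealized actor-critic update: theta <- theta + alpha * g_i."""
    # s, a ~ d_gamma^{mu0, pi_theta}, sampled by rolling out the current policy
    s, a = sample_from_d_gamma(env, lambda state: pi(theta, state), gamma, rng)
    # g_i = 1/(1 - gamma) * grad_theta log pi_theta(a|s) * Q^{pi_theta}(s, a)
    g = grad_log_pi(theta, s, a) * Q_pi(s, a, theta) / (1 - gamma)
    return theta + alpha * g
```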


Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)

  • I.e. \(\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_\gamma^{\mu_0,\pi_{\theta_i}}}[ \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)]=\nabla J(\theta_i)\)
  • Why? Product rule on \(J(\theta) =\mathbb E_{\substack{s_0\sim \mu_0 \\ a_0\sim\pi_\theta(s_0)}}[ Q^{\pi_\theta}(s_0, a_0)] \)
  • Starting with a different decomposition of cumulative reward: $$\nabla J(\theta) = \nabla_{\theta} \mathbb E_{s_0\sim\mu_0}[V^{\pi_\theta}(s_0)] =\mathbb E_{s_0\sim\mu_0}[ \nabla_{\theta} V^{\pi_\theta}(s_0)]$$
  • \(\nabla_{\theta} V^{\pi_\theta}(s_0) = \nabla_{\theta} \mathbb E_{a_0\sim\pi_\theta(s_0)}[ Q^{\pi_\theta}(s_0, a_0)] \)
    • \(= \nabla_{\theta} \sum_{a_0\in\mathcal A} \pi_\theta(a_0|s_0)  Q^{\pi_\theta}(s_0, a_0) \)
    • \(=\sum_{a_0\in\mathcal A} \left( \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) +  \pi_\theta(a_0|s_0)  \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)]\right)  \)
  • Considering each term:
    • \(\sum_{a_0\in\mathcal A} \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) = \sum_{a_0\in\mathcal A}  \pi_\theta(a_0|s_0) \frac{\nabla_{\theta} [\pi_\theta(a_0|s_0) ]}{\pi_\theta(a_0|s_0) } Q^{\pi_\theta}(s_0, a_0) \)
      • \( = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] \)
    • \(\sum_{a_0\in\mathcal A}\pi_\theta(a_0|s_0)  \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)] = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} Q^{\pi_\theta}(s_0, a_0)]  \)
      • \(= \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [r(s_0,a_0) + \gamma \mathbb E_{s_1\sim P(s_0, a_0)}V^{\pi_\theta}(s_1)]]\)
      • \(=\gamma \mathbb E_{a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]\) since \(r\) does not depend on \(\theta\)
    • Recursion \(\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] =  \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]\)
    • Iterating this recursion leads to $$\nabla J(\theta) = \sum_{t=0}^\infty \gamma^t \mathbb E_{s_t, a_t}\left[\nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t)\right] = \sum_{t=0}^\infty \gamma^t \sum_{s_t, a_t} d^{\mu_0,\pi_\theta}_{t}(s_t, a_t) \nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t) =\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_\gamma^{\mu_0,\pi_\theta}}\left[\nabla_{\theta} [\log \pi_\theta(a|s) ] Q^{\pi_\theta}(s, a)\right] $$


Policy Gradient with Advantage

The Advantage function is \(A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)\)

  • Claim: The gradient estimate is unbiased \(\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)\)
  • Follows because we can show that \(\mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]V^{\pi_{\theta}}(s)]=0\)
    • (next slide)


PG with "Baselines"

Algorithm: Idealized Actor Critic with Advantage

  • Same as previous slide, except estimation step
    • Estimate \(g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}A^{\pi_{\theta_i}}(s,a)\)
  • Claim: for any \(b(s)\), \(\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = 0\)
    • General principle: subtracting any action-independent "baseline" does not affect expected value $$g_i=\frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}\left(Q^{\pi_{\theta_i}}(s,a)-b(s)\right)\quad\text{is unbiased}$$
  • Proof of claim:
    • \(\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]\)
      • \(=\sum_{a\in\mathcal A} \pi_\theta(a|s)\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]\)
      • \(=\sum_{a\in\mathcal A} \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \cdot b(s)\)
      • \(=\nabla_\theta  \sum_{a\in\mathcal A}\pi_\theta(a|s) \cdot b(s)\)
      • \(=\nabla_\theta\, b(s) = 0\) since \(\sum_{a\in\mathcal A}\pi_\theta(a|s)=1\) and \(b(s)\) does not depend on \(\theta\)
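
The claim is also easy to check numerically; here is a sketch for a single state with a tabular softmax policy (all values are made up for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
A = 4
theta = rng.normal(size=A)                  # logits for a single state s
pi = np.exp(theta) / np.exp(theta).sum()    # pi_theta(a|s)
b = 3.7                                     # arbitrary baseline value b(s)

# For tabular logits, grad_theta log pi_theta(a|s) = e_a - pi; row a is the score for action a.
scores = np.eye(A) - pi

# E_{a ~ pi_theta(s)}[ grad log pi_theta(a|s) * b(s) ] -- should be (numerically) zero
expectation = (pi[:, None] * scores * b).sum(axis=0)
print(expectation)                          # ~ [0, 0, 0, 0]
```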

PG with "Baselines"

Agenda

1. Recap: Policy Optimization

2. Gradients with Q/Value

3. Trust Regions

4. KL Divergence

Motivation: Trust Regions

  • Recall: motivation of gradient ascent as first-order approximate maximization
    • \(\max_\theta J(\theta) \approx \max_{\theta} J(\theta_0) + \nabla J(\theta_0)^\top (\theta-\theta_0)\)
  • The maximum occurs when \(\theta-\theta_0\) is parallel to \(\nabla J(\theta_0)\)
    • \(\theta - \theta_0 = \alpha \nabla J(\theta_0) \)
  • Q: Why do we normally use a small step size \(\alpha\)?
    • A: The linear approximation is only locally valid, so a small \(\alpha\) ensures that \(\theta\) is close to \(\theta_0\)

Motivation: Trust Regions

  • Q: Why do we normally use a small step size \(\alpha\)?
    • A: The linear approximation is only locally valid, so a small \(\alpha\) ensures that \(\theta\) is close to \(\theta_0\)
Parabola
Parabola

\(\theta_1\)

\(\theta_2\)

\(\theta_1\)

\(\theta_2\)

2D quadratic function

level sets of quadratic

Trust Regions

  • A trust region approach makes the intuition about step size more precise: $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
  • The trust region is described by a bounded "distance" from \(\theta_0\)
    • General notion of distance, "divergence"
  • Another motivation relevant in RL: we might estimate \(J(\theta)\) using data collected with \(\theta_0\) (i.e. with policy \(\pi_{\theta_0}\))
    • The estimate may only be good close to \(\theta_0\)
    • e.g. in Conservative PI, recall the incremental update
      • \(\pi^{i+1}(a|s) = (1-\alpha) \pi^i(a|s) + \alpha \bar\pi(a|s)\)
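
As an illustration only (the trust-region policy optimization algorithms built on this idea come next lecture), one simple way to respect the constraint \(d(\theta_0,\theta)<\delta\) is to backtrack the step size; the gradient estimate `grad` and divergence `d` are assumed to be supplied:

```python
def trust_region_step(theta0, grad, d, delta, alpha0=1.0, shrink=0.5, max_tries=50):
    """Take the largest step alpha * grad (alpha <= alpha0) with d(theta0, theta) < delta."""
    alpha = alpha0
    theta = theta0
    for _ in range(max_tries):
        theta = theta0 + alpha * grad
        if d(theta0, theta) < delta:
            return theta               # candidate lies inside the trust region
        alpha *= shrink                # otherwise shrink the step and try again
    return theta0                      # give up and stay at theta0
```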

Agenda

1. Recap: Policy Optimization

2. Gradients with Q/Value

3. Trust Regions

4. KL Divergence

KL Divergence

  • Motivation: what is a good notion of distance?
  • The KL Divergence measures the "distance" between two distributions
  • Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$

KL Divergence

  • Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
  • Example: if \(P,Q\) are Bernoullis with mean \(p,q\)
    • then \(KL(P|Q) = p\log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\) (plot)
  • Example: if \(P=\mathcal N(\mu_1, \sigma^2I)\) and \(Q=\mathcal N(\mu_2, \sigma^2I)\)
    • then \(KL(P|Q) = \|\mu_1-\mu_2\|_2^2/(2\sigma^2)\)
  • Fact: KL is always strictly positive unless \(P=Q\) in which case it is zero.
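
A small numerical sketch of the definition, which also checks the Bernoulli formula above:

```python
import numpy as np

def kl(P, Q):
    """KL(P || Q) for discrete distributions given as arrays of probabilities."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0                       # terms with P(x) = 0 contribute 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

p, q = 0.3, 0.7
print(kl([p, 1 - p], [q, 1 - q]))                               # general definition
print(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))  # Bernoulli formula
```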

KL Divergence for Policies

  • We define a measure of "distance" between \(\pi_{\theta_0}\) and \(\pi_\theta\)
    • KL divergence between action distributions
    • averaged over states \(s\) from the discounted steady state distribution of \(\pi_{\theta_0}\)
  • \(d_{KL}(\theta_0, \theta)= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[KL(\pi_{\theta_0}(\cdot |s)|\pi_\theta(\cdot |s))\right] \)
    • \(= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\mathbb E_{a\sim \pi_{\theta_0}(\cdot|s)}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right]\right] \)
    • we will use the shorthand \(s,a\sim d_{\mu_0}^{\pi_{\theta_0}}\)
    • \(= \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}] \)
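
Because \(d_{KL}(\theta_0,\theta)\) is an expectation over states and actions generated by \(\pi_{\theta_0}\), it can be estimated from rollouts of \(\pi_{\theta_0}\) alone. A sketch, reusing the hypothetical `sample_from_d_gamma` helper from earlier, an assumed action sampler `pi0(s)` for \(\pi_{\theta_0}\), and an assumed `pi_prob(theta, s, a)` returning \(\pi_\theta(a|s)\):

```python
import numpy as np

def estimate_policy_kl(env, pi0, pi_prob, theta0, theta, gamma, rng, n=1000):
    """Monte Carlo estimate of d_KL(theta0, theta)."""
    logs = []
    for _ in range(n):
        # s, a ~ d^{mu0, pi_theta0}: both sampled under the OLD policy
        s, a = sample_from_d_gamma(env, pi0, gamma, rng)
        logs.append(np.log(pi_prob(theta0, s, a) / pi_prob(theta, s, a)))
    return float(np.mean(logs))
```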

Example

[Figure: the two-state MDP from the earlier example]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Distance between \(\theta_0\) and \(\theta\): $$d_{KL}(\theta_0, \theta) = d_{\mu_0}^{\pi_{\theta_0}}(0,\mathsf{stay}) \log \frac{\pi_{\theta_0}(\mathsf{stay}|0)}{\pi_\theta(\mathsf{stay}|0)} + d_{\mu_0}^{\pi_{\theta_0}}(0,\mathsf{switch}) \log \frac{\pi_{\theta_0}(\mathsf{switch}|0)}{\pi_\theta(\mathsf{switch}|0)} + d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{stay}) \log \frac{\pi_{\theta_0}(\mathsf{stay}|1)}{\pi_\theta(\mathsf{stay}|1)} + d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{switch}) \log \frac{\pi_{\theta_0}(\mathsf{switch}|1)}{\pi_\theta(\mathsf{switch}|1)}$$

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch


Example

[Figure: the two-state MDP from the earlier example]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Distance between \(\theta_0\) and \(\theta\): the state-\(0\) terms vanish because \(\pi_\theta(\cdot|0)\) does not depend on \(\theta\) (both policies stay), so $$d_{KL}(\theta_0, \theta) = d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{stay})\cdot \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta} + d_{\mu_0}^{\pi_{\theta_0}}(1,\mathsf{switch})\cdot \log \frac{1+\exp \theta}{1+\exp \theta_0}$$

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch


Example

[Figure: the two-state MDP from the earlier example]

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Distance between \(\theta_0\) and \(\theta\): writing \(d_{\mu_0}^{\pi_{\theta_0}}(1,a) = d_{\mu_0}^{\pi_{\theta_0}}(1)\,\pi_{\theta_0}(a|1)\), $$d_{KL}(\theta_0, \theta) = d_{\mu_0}^{\pi_{\theta_0}}(1)\left(\frac{\exp \theta_0}{1+\exp \theta_0} \log \frac{\exp \theta_0(1+\exp \theta)}{(1+\exp \theta_0)\exp \theta} + \frac{1}{1+\exp \theta_0} \log \frac{1+\exp \theta}{1+\exp \theta_0}\right) = d_{\mu_0}^{\pi_{\theta_0}}(1)\left(\log \frac{1+\exp \theta}{1+\exp \theta_0} + \frac{\exp \theta_0}{1+\exp \theta_0}(\theta_0 - \theta)\right)$$
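
A quick numerical sanity check of this closed form against the direct Bernoulli KL computation (the weight \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) is set to an arbitrary value):

```python
import numpy as np

theta0, theta, d1 = 0.4, -1.3, 0.25           # arbitrary values for the check

def stay_prob(t):                             # pi_t(stay | 1) = exp(t) / (1 + exp(t))
    return np.exp(t) / (1 + np.exp(t))

p0, p = stay_prob(theta0), stay_prob(theta)

# direct computation: d(1) * KL( Bernoulli(p0) || Bernoulli(p) )
direct = d1 * (p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p)))

# closed form from the slide
closed = d1 * (np.log((1 + np.exp(theta)) / (1 + np.exp(theta0))) + p0 * (theta0 - theta))

print(direct, closed)                         # the two agree
```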

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch


Recap

  • PSet/PA due Fri
  • OH and Ed changes due to break, prelim
  • No lecture Monday 4/8
  • Prelim in lecture 4/10

 

  • PO with value
  • Trust Regions
  • KL Divergence

 

  • Next lecture: Policy Optimization Algorithms based on Trust Regions
