CS 4/5789: Introduction to Reinforcement Learning

Lecture 18: Policy Opt. with Trust Regions

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • PSet 6 due Friday, PA 3 due Sunday
    • 5789 Paper Assignments
  • Break 3/30-4/7: no office hours or lectures
  • Class & prof office hours cancelled on Monday 4/8
    • Extra TA office hours before prelim
    • Prelim questions on Ed: use tag!
  • Prelim on Wednesday 4/10 in class

Prelim on 4/10 in Lecture

  • Prelim Wednesday 4/10
  • During lecture (2:55-4:10pm in 255 Olin)
  • 1 hour exam, closed-book, equation sheet provided
  • Materials:
    • slides (Lectures 1-18, emphasis on 11-18)
      • slides.com tips: ESC, /scroll
    • PSets 1-6, emphasis on 4-6 (solutions to be posted)
    • lecture notes: extra but imperfect resource
  • Prelim tag on Ed
  • Extra TA office hours

Agenda

1. Recap

2. Natural PG

3. Proximal Policy Opt

4. Review

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]=\mathop{\mathbb E}_{s\sim \mu_0}\left[V^{\pi_\theta}(s)\right]$$

Recap: Policy Optimization

Meta-Algorithm: Policy Optimization

  • Initialize \(\theta_0\). For \(i=0,1,...\):
    • Rollout policy
    • Estimate \(\nabla J(\theta_i)\) as \(g_i\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
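
As a point of reference, here is a minimal Python sketch of this meta-algorithm, where `rollout` and `estimate_gradient` are hypothetical helpers (not course code) returning sampled trajectories and a gradient estimate \(g_i\):

```python
import numpy as np

def policy_optimization(theta0, rollout, estimate_gradient, alpha=0.05, iters=100):
    """Generic policy optimization loop: rollout, estimate the gradient, ascend."""
    theta = np.array(theta0, dtype=float)
    for i in range(iters):
        trajectories = rollout(theta)                 # sample tau ~ P^{pi_theta}_{mu_0}
        g = estimate_gradient(theta, trajectories)    # g_i, an estimate of grad J(theta_i)
        theta = theta + alpha * g                     # ascent step: theta_{i+1} = theta_i + alpha g_i
    return theta
```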

Recap: Trust Regions & KL Div

  • A trust region approach to optimization: $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta_0, \theta)<\delta$$
  • The trust region is described by a bounded "distance" (divergence) from \(\theta_0\)
  • Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
  • For parametrized policies, we define $$d_{KL}(\theta_0, \theta)= \mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[KL(\pi_{\theta_0}(\cdot |s)|\pi_\theta(\cdot |s))\right]= \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}] $$
  • Notation note:  \(s, a\sim d_{\mu_0}^{\pi_{\theta_0}}\) means \(s\sim d_{\mu_0}^{\pi_{\theta_0}}\), \(a\sim\pi_{\theta_0}(s)\)

KL Divergence

  • Def: Given \(P\in\Delta(\mathcal X)\) and \(Q\in\Delta(\mathcal X)\), the KL Divergence is $$KL(P|Q) = \mathbb E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = \sum_{x\in\mathcal X} P(x)\log\frac{P(x)}{Q(x)}$$
  • Example: if \(P,Q\) are Bernoullis with mean \(p,q\)
    • then \(KL(P|Q) = p\log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\) (plot)
  • Example: if \(P=\mathcal N(\mu_1, \sigma^2I)\) and \(Q=\mathcal N(\mu_2, \sigma^2I)\)
    • then \(KL(P|Q) = \|\mu_1-\mu_2\|_2^2/(2\sigma^2)\)
  • Fact: KL is always strictly positive unless \(P=Q\), in which case it is zero.
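
As a quick numeric illustration of the Bernoulli case above (a standalone Python sketch, not course code):

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(Bern(p) | Bern(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

print(bernoulli_kl(0.5, 0.5))   # 0.0: the divergence is zero when P = Q
print(bernoulli_kl(0.5, 0.1))   # positive
print(bernoulli_kl(0.1, 0.5))   # positive, but different: KL is not symmetric
```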

Agenda

1. Recap

2. Natural PG

3. Proximal Policy Opt

4. Review

Natural Policy Gradient

  • We will derive the update $$ \theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i $$
  • This is called natural policy gradient
  • Intuition: update direction \(g_i\) is "preconditioned" by a matrix \(F_i\) and adapts to geometry



     
  • We derive this update as approximating $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d(\theta, \theta_0)<\delta$$

[Figure: the objective is replaced by its first-order approximation (gradient \(g_0\)) and the divergence constraint by its second-order approximation; level sets of the quadratic are shown, along with the gradient direction \(g_i\) and the preconditioned direction \(F_i^{-1}g_i\).]
Second order Divergence Approx

  • Second order approximation of $$\ell(\theta) = d_{KL}(\theta_0,\theta) = \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right] $$
  • Given by $$\ell(\theta_0) + \nabla \ell(\theta_0)^\top (\theta-\theta_0) + (\theta-\theta_0)^\top \nabla^2 \ell(\theta_0) (\theta-\theta_0)$$
  • Claim: the zeroth- and first-order terms vanish, \(\ell(\theta_0) = 0\) and \(\nabla \ell(\theta_0)=0\), while the second-order term (Hessian) is $$\nabla^2\ell(\theta_0) = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[ \log \pi_\theta(a|s)]_{\theta=\theta_0} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_0}^\top ]$$
  • The Hessian is known as the Fisher information matrix
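
For the zeroth- and first-order parts of the claim, a short sketch using only the normalization \(\sum_a \pi_\theta(a|s)=1\): $$\ell(\theta_0) = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_{\theta_0}(a|s)}\right]=0,\qquad \nabla\ell(\theta_0) = -\mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}\big[\nabla_\theta \log\pi_\theta(a|s)\big|_{\theta=\theta_0}\big] = -\mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\Big[\textstyle\sum_a \nabla_\theta\pi_\theta(a|s)\big|_{\theta=\theta_0}\Big] = -\mathbb E_{s\sim d_{\mu_0}^{\pi_{\theta_0}}}\Big[\nabla_\theta \textstyle\sum_a\pi_\theta(a|s)\big|_{\theta=\theta_0}\Big]=0$$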

[Diagram: two-state MDP (states \(0\) and \(1\)) with transition arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).]

Example

  • Parametrized policy:
    • \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • Fisher information matrix (a scalar, since \(\theta\) is one-dimensional)
    • \(\nabla\log \pi_\theta(a|s) = \begin{cases}0 & s=0\\  \frac{\exp \theta}{(1+\exp \theta)^2}\cdot  \frac{1+\exp \theta}{\exp \theta} & s=1,a=\mathsf{stay}  \\  \frac{-\exp \theta}{(1+\exp \theta)^2} \cdot  \frac{1+\exp \theta}{1} & s=1,a=\mathsf{switch} \end{cases}\)
  • \(F_0 = \mathbb E_{s,a\sim d_{\mu_0}^{\pi_{\theta_0}}}[\nabla_\theta[\log \pi_\theta(a|s)]^2_{\theta=\theta_0} ]\)
    • \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left( \frac{\exp \theta_0}{1+\exp \theta_0} \cdot \frac{1}{(1+\exp \theta_0)^2} + \frac{1}{1+\exp \theta_0}\cdot \frac{(-\exp \theta_0)^2}{(1+\exp \theta_0)^2}\right)\)
    • \(=d_{\mu_0}^{\pi_{\theta_0}}(1) \left(  \frac{\exp \theta_0}{(1+\exp \theta_0)^2} \right)\)

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
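
A small numeric sanity check of this calculation (a standalone sketch; the state-distribution weight \(d_{\mu_0}^{\pi_{\theta_0}}(1)\) is set to 1 for illustration):

```python
import numpy as np

theta0 = 0.7
p_stay = np.exp(theta0) / (1 + np.exp(theta0))      # pi_theta0(stay | 1)
p_switch = 1 / (1 + np.exp(theta0))                 # pi_theta0(switch | 1)

# Scores d/dtheta log pi_theta(a | 1) at theta0, from the cases above
score_stay = 1 / (1 + np.exp(theta0))
score_switch = -np.exp(theta0) / (1 + np.exp(theta0))

# Fisher information: expected squared score under pi_theta0(. | 1)
F0 = p_stay * score_stay**2 + p_switch * score_switch**2
closed_form = np.exp(theta0) / (1 + np.exp(theta0))**2
print(F0, closed_form)   # the two values agree
```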

Constrained Optimization

  • Our approximation to  \(\max_\theta J(\theta)~ \text{s.t.} ~d(\theta, \theta_0)<\delta\) is $$\max_\theta\quad g_0^\top(\theta-\theta_0) \quad \text{s.t.}\quad (\theta-\theta_0)^\top F_{0} (\theta-\theta_0)<\delta$$
  • Claim: The maximum has the closed form expression $$\theta_\star =\theta_0+\alpha F_0^{-1}g_0$$ where \(\alpha = (\delta /g_0^\top F_0^{-1} g_0)^{1/2}\)
  • Proof outline:
    • Start by solving \(\max c^\top v \) s.t. \(\|v\|_2^2\leq \delta\) (PollEv)
    • Consider change of variables $$v=F_0^{1/2}(\theta-\theta_0),\quad c=F_0^{-1/2}g_0$$
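
Filling in the outline: by Cauchy–Schwarz the simpler problem is maximized at \(v_\star = \sqrt{\delta}\, c/\|c\|_2\), and undoing the change of variables recovers the claimed step: $$\theta_\star-\theta_0 = F_0^{-1/2}v_\star = \sqrt{\delta}\,\frac{F_0^{-1/2}F_0^{-1/2}g_0}{\|F_0^{-1/2}g_0\|_2} = \left(\frac{\delta}{g_0^\top F_0^{-1}g_0}\right)^{1/2}F_0^{-1}g_0 = \alpha\, F_0^{-1}g_0$$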

Natural Policy Gradient

Algorithm: Natural PG

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy and estimate \(\nabla J(\theta_i)\) as \(g_i\) (REINFORCE, Actor-Critic, etc.)
    • Estimate the Fisher information matrix $$F_i = \nabla \log \pi_{\theta_i}(a|s) \nabla \log \pi_{\theta_i}(a|s)^\top ,\quad s,a\sim d^{\pi_i}_{\mu_0}$$
    • Update \(\theta_{i+1} = \theta_i + \alpha F_i^{-1} g_i\)

In practice, it is common to use minibatches of samples from \( d^{\pi_i}_{\mu_0}\)
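
A minimal Python sketch of one NPG update under this scheme, where `grad_log_pi(theta, s, a)` is a hypothetical helper returning the score vector \(\nabla_\theta\log\pi_\theta(a|s)\) and `states, actions` is a minibatch drawn from \(d^{\pi_i}_{\mu_0}\); the small ridge term is an added safeguard, not part of the idealized algorithm:

```python
import numpy as np

def npg_step(theta, g, states, actions, grad_log_pi, alpha=0.05, ridge=1e-3):
    """One natural policy gradient step: precondition the estimate g by F_i^{-1}."""
    d = theta.shape[0]
    F = np.zeros((d, d))
    for s, a in zip(states, actions):
        score = grad_log_pi(theta, s, a)              # grad_theta log pi_theta(a|s)
        F += np.outer(score, score)
    F /= len(states)                                  # sample average of the outer products
    direction = np.linalg.solve(F + ridge * np.eye(d), g)   # F^{-1} g without forming the inverse
    return theta + alpha * direction
```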

[Diagram: two-state MDP (states \(0\) and \(1\)) with transition arrows labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).]

Example

  • Parametrized policy: \(\pi_\theta(0)=\) stay
    • \(\pi_\theta(a|1) = \begin{cases} \frac{\exp \theta}{1+\exp \theta} & a=\mathsf{stay}\\ \frac{1}{1+\exp \theta} & a=\mathsf{switch}\end{cases} \)
  • NPG: \(\theta_1=\theta_0 + \alpha \frac{1}{F_0}g_0\); GA: \(\theta_1=\theta_0 + \alpha g_0 \)
  • \(F_0 \propto  \frac{\exp \theta_0}{(1+\exp \theta_0)^2}\to 0\) as \(\theta_0\to\pm\infty\)
  • NPG takes bigger and bigger steps as \(\theta\) becomes more extreme

reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

[Figure: the \(\theta\) axis from \(-\infty\) to \(+\infty\); the policy tends toward always switching as \(\theta\to-\infty\) and always staying as \(\theta\to+\infty\).]

Agenda

1. Recap

2. Natural PG

3. Proximal Policy Opt

4. Review

Motivation: Lagrangian Relaxation

  • Trust region optimization $$\max_\theta\quad J(\theta)\quad \text{s.t.}\quad d_{KL}(\theta_0, \theta)<\delta$$
  • Dealing with constraints and with the matrix inversion in NPG ("second order") is computationally costly
  • Methods that only use gradients ("first order") are less costly
  • Idea: run gradient ascent on the relaxed objective $$\max_\theta\quad J(\theta)-\lambda d_{KL}(\theta_0, \theta)$$

Local objective \(J(\theta; \theta_0)\)

  • Define: New local objective centered at \(\theta_0\): $$J(\theta;\theta_0)= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a) \right]  \right]$$
  • Recall the performance difference lemma $$\mathbb E_{s\sim\mu_0}[V^\pi(s) - V^{\pi'}(s)] = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{\mu_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$
  • Intuition: maximize the local advantage against \(\pi_{\theta_0}\) (while the trust region keeps the distributions induced by \(\theta\) and \(\theta_0\) close)
  • Fact: the gradients coincide at the center: \(\nabla_\theta J(\theta;\theta_0)\big|_{\theta=\theta_0}=\nabla_\theta J(\theta)\big|_{\theta=\theta_0}\)
  • Importance weighting: $$J(\theta;\theta_0)= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_{\theta_0} (s)}\left[\frac{\pi_{\theta}(a|s) }{\pi_{\theta_0}(a|s) }A^{\pi_{\theta_0}}(s,a) \right]  \right]$$

Local objective has the same gradient as \(J(\theta)\) when \(\theta=\theta_0\)$$\nabla_{\theta} \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a) \right]  \right]= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\nabla_{\theta}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[A^{\pi_{\theta_0}}(s,a) \right]  \right]$$

Using the importance weighting trick from Lecture 16

$$= \mathbb E_{s\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_\theta (s)}\left[\nabla_{\theta}\log\pi_\theta(a|s)A^{\pi_{\theta_0}}(s,a) \right]  \right]$$

If \(\theta=\theta_0\), this is the gradient expression from Actor-Critic with Advantage (Lecture 17)
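
Spelled out, the identity being used is (for a fixed state \(s\) and any function \(f\) that does not depend on \(\theta\)): $$\nabla_\theta\,\mathbb E_{a\sim\pi_\theta(s)}\big[f(a)\big] = \sum_a \nabla_\theta\pi_\theta(a|s)f(a) = \mathbb E_{a\sim\pi_{\theta_0}(s)}\Big[\frac{\pi_\theta(a|s)}{\pi_{\theta_0}(a|s)}\,\nabla_\theta\log\pi_\theta(a|s)\,f(a)\Big]$$ Setting \(\theta=\theta_0\) makes the importance weight equal to 1, which gives the expression above.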

Distance Penalty

$$\max_\theta\quad J(\theta;\theta_0)-\lambda d_{KL}(\theta_0, \theta)$$

  • Let's simplify the penalty term
  • \(d_{KL}(\theta_0, \theta)=\mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{\pi_{\theta_0}(a|s)}{\pi_\theta(a|s)}\right]\)
    • \( = \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\frac{1}{\pi_\theta(a|s)}\right]+\mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log \pi_{\theta_0}(a|s)\right]\)
    • \( = -\mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log\pi_\theta(a|s)\right]+\) [term independent of \(\theta\)]

Proximal Policy Optimization

$$\max_\theta\quad \mathbb E_{s,a\sim d^{\pi_{\theta_0}}_{\mu_0}}\left[ \frac{\pi_{\theta}(a|s) }{\pi_{\theta_0}(a|s) }A^{\pi_{\theta_0}}(s,a)   \right]+\lambda  \mathbb E_{s, a\sim d_{\mu_0}^{\pi_{\theta_0}}}\left[\log{\pi_\theta(a|s)}\right]$$

Algorithm: Idealized PPO

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Rollout policy to sample \(s,a\sim d^{\pi_i}_{\mu_0}\)
    • Define \(L(\theta)=\frac{\pi_{\theta}(a|s) }{\pi_{\theta_i}(a|s) }A^{\pi_{\theta_i}}(s,a)  +\lambda \log{\pi_\theta(a|s)}\)
    • Take several gradient steps on \(L(\theta)\), resulting in \(\theta_{i+1} \)

In practice, estimate \(\hat A^{\pi_{\theta_i}}\) and use minibatches of samples from \( d^{\pi_i}_{\mu_0}\)
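
A minimal Python sketch of the inner loop, where `log_prob(theta, s, a)` and `log_prob_old(s, a)` are hypothetical helpers for \(\log\pi_\theta(a|s)\) and \(\log\pi_{\theta_i}(a|s)\), advantages are precomputed estimates, and a crude finite-difference gradient stands in for the autodiff a real implementation would use; note this is the KL-penalty form from the slide, not the clipped-surrogate PPO variant:

```python
import numpy as np

def ppo_objective(theta, states, actions, adv, log_prob, log_prob_old, lam=0.01):
    """Surrogate from the slide: importance ratio times advantage, plus lambda * log pi."""
    total = 0.0
    for s, a, A in zip(states, actions, adv):
        ratio = np.exp(log_prob(theta, s, a) - log_prob_old(s, a))
        total += ratio * A + lam * log_prob(theta, s, a)
    return total / len(states)

def ppo_inner_loop(theta, states, actions, adv, log_prob, log_prob_old, steps=10, lr=0.01, eps=1e-5):
    """Several ascent steps on the surrogate, producing theta_{i+1}."""
    for _ in range(steps):
        base = ppo_objective(theta, states, actions, adv, log_prob, log_prob_old)
        grad = np.zeros_like(theta)
        for j in range(theta.size):
            theta_pert = theta.copy()
            theta_pert[j] += eps
            grad[j] = (ppo_objective(theta_pert, states, actions, adv, log_prob, log_prob_old) - base) / eps
        theta = theta + lr * grad
    return theta
```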

Agenda

1. Recap

2. Natural PG

3. Proximal Policy Opt

4. Review

  • Supervised learning: features \(x\) and labels \(y\)
    • Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
    • Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
    • Method: \(\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
  • Fitted Value Iteration: fixed point iteration algorithm, like VI
    • Instead of Bellman Optimality Operator in iteration \(k\), use supervised learning with $$x_i=(s_i,a_i),\quad y_i=r(s_i,a_i)+\gamma \max _{a} Q^k (s_{i+1}, a)$$ to find \(Q^{k+1}\)
    • Dataset can be off policy
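
A sketch of one such iteration under the stated recipe, assuming states and actions are NumPy feature vectors, `regressor` is any object with a scikit-learn-style `fit`/`predict` interface, and `actions` is the finite action set:

```python
import numpy as np

def fitted_q_iteration_step(regressor, transitions, Q_prev, actions, gamma=0.99):
    """One fitted-VI step: regress Q^{k+1}(s, a) onto r + gamma * max_a' Q^k(s', a')."""
    X, y = [], []
    for (s, a, r, s_next) in transitions:             # dataset may be off-policy
        target = r + gamma * max(Q_prev(s_next, a_next) for a_next in actions)
        X.append(np.concatenate([s, a]))              # feature x_i = (s_i, a_i)
        y.append(target)                              # label y_i = Bellman backup
    regressor.fit(np.array(X), np.array(y))           # the supervised learning step
    return lambda s, a: regressor.predict(np.concatenate([s, a])[None, :])[0]
```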

Value-based RL

  • Fitted Policy Iteration: replace Policy Evaluation step with Fitted Policy Evaluation
    • Incremental policy updates to avoid oscillation (Performance Difference Lemma)
  • Fitted Policy Evaluation: given on-policy data from \(\pi\), estimate \(Q^{\pi}\) (PSets: how to use off-policy data)
    • Approximate: at iteration \(j\), replace Bellman Consistency Equation with supervised learning on $$x_i=(s_i,a_i),\quad y_i=r(s_i,a_i)+\gamma Q^j (s_{i+1}, a_{i+1})$$
    • Direct: supervised learning on $$x_i=(s_i,a_i),\quad \textstyle y_i=\sum_{\ell=i}^{i+h_i} r\left(s_\ell, a_\ell\right) $$
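
The two label choices, written out as a small sketch (rewards are the observed values along an on-policy trajectory):

```python
def bootstrap_label(r, s_next, a_next, Q_prev, gamma=0.99):
    """'Approximate' label: Bellman-consistency backup r + gamma * Q^j(s', a')."""
    return r + gamma * Q_prev(s_next, a_next)

def rollout_label(rewards_window):
    """'Direct' label: sum of observed rewards over the sampled window starting at i."""
    return sum(rewards_window)
```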

Value-Based RL

Policy Optimization

  • \(J(\theta)=\) expected cumulative reward under policy \(\pi_\theta\)
  • Estimate \(\nabla_\theta J(\theta)\) via rollouts \(\tau\), observed reward \(R(\tau)\)
    • Random Search: \(\theta + \delta v\) , \(g=\frac{1}{2\delta}R(\tau) v\)
    • REINFORCE: \(g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\)
    • Actor-Critic: \(s,a\sim d^{\pi_\theta}_{\mu_0}\),
      \(g=\frac{1}{1-\gamma} \nabla_\theta \log \pi_\theta(a|s) (\hat Q^{\pi_\theta}(s,a)-b(s)) \)
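
A sketch of the REINFORCE estimator from this list, where `grad_log_pi(theta, s, a)` is a hypothetical helper returning \(\nabla_\theta\log\pi_\theta(a|s)\) and each rollout is a (states, actions, rewards) triple:

```python
import numpy as np

def reinforce_gradient(theta, rollouts, grad_log_pi):
    """g = average over rollouts of [ sum_t grad log pi(a_t|s_t) ] * R(tau)."""
    g = np.zeros_like(theta)
    for states, actions, rewards in rollouts:
        R = sum(rewards)                                          # observed return R(tau)
        score_sum = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
        g = g + score_sum * R
    return g / len(rollouts)
```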

Food for thought: how to compute off-policy gradient estimate?

Food for thought: compare the bias and variance of different gradient estimates or supervised learning labels.

Policy Optimization

  • Policy Gradient Meta-Algorithm
    for \(i=0,1,...\)
    1. collect rollouts using \(\theta_i\)
    2. estimate gradient with \(g_i\)
    3. \(\theta_{i+1} = \theta_i+ \alpha g_i\)
  • Trust regions $$ \max ~J(\theta)~~\text{s.t.} ~~d_{KL}(\theta_0, \theta)\leq \delta $$
    • Natural PG: first/second order approximation
    • Proximal PO: Lagrangian relaxation

Recap

  • PSet due Fri, PA due Mon
  • OH and Ed changes due to break, prelim
  • Prelim in lecture 4/10

 

  • Natural Policy Gradient
  • Proximal Policy Optimization

 

  • Happy spring break!