CS 4/5789: Introduction to Reinforcement Learning

Lecture 16: Optimization Overview

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • PA 3 released Friday, due 3/31
    • PSet 4 released Wednesday
    • 5789 Paper Reviews due weekly on Mondays
  • Prelim
    • Regrade requests open until Wednesday 11:59pm

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

Recap: Value-based RL

[Figure: agent-environment loop — the policy selects actions, the environment returns state and reward, and the resulting experience is collected as data]

Key components of a value-based RL algorithm:

  1. Rollout policy
  2. Construct/update dataset
  3. Learn/update Q function

1. PI with MC

Data-driven PI

  • Alternate learning \(Q^\pi\) w/ improving \(\pi\)
  • MC estimate of \(Q^\pi(s_t, a_t)\): \(\sum_{k=t}^{h} r_k\)
  • On policy

Recap: Comparison

2. PI with TD

Data-driven PI

  • Alternate learning \(Q^\pi\) w/ improving \(\pi\)
  • TD target for \(Q^\pi(s_t, a_t)\): \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_{t+1})\)
  • On policy

3. Q-learning

Data-driven VI

  • Alternate learning \(Q^\star\) w/ updating \(\pi\)
  • Q-learning target: \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_\star)\), where \(a_\star = \arg\max_a \hat Q_i(s_{t+1}, a)\)
  • Off policy
  • Ultimate Goal: find (near) optimal policy
  • Value-based RL estimates intermediate quantities
    • \(Q^{\pi}\) or \(Q^{\star}\) are indirectly useful for finding optimal policy
  • Imitation learning had no intermediaries, but requires data from an expert policy
  • Idea: optimize policy without relying on intermediaries:
    • objective as a function of policy: \(J(\pi) = \mathbb E_{s\sim \mu_0}[V^\pi(s)]\)
    • For parametric (e.g. deep) policy \(\pi_\theta\): $$J(\theta) = \mathbb E_{s\sim \mu_0}[V^{\pi_\theta}(s)]$$

Preview: Policy Optimization

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

[Figure: parabola \(J(\theta)\) plotted against \(\theta\), with maximum at \(\theta^\star\)]

Motivation: Optimization

  • So far, we have discussed tabular and quadratic optimization
    • np.amin(J, axis=1)
    • for \(J(\theta) = a\theta^2 + b\theta +c\) with \(a<0\), the maximum is \(\theta^\star = -\frac{b}{2a}\)
  • Today, we discuss strategies for arbitrary differentiable functions (even ones that are unknown!)


Maxima

Consider a function \(J(\theta) :\mathbb R^d \to\mathbb R\).

  • Def: A global maximum is a point \(\theta_\star\) such that \(J(\theta_\star)\geq J(\theta)\) for all \(\theta\in\mathbb R^d\). A local maximum is a point satisfying the inequality for all \(\theta\) s.t. \(\|\theta_\star-\theta\|\leq \epsilon\) for some \(\epsilon>0\).
[Figures: a function with a single global max; a function with many global maxima; a function with a global max and a separate local max]

Ascent Directions


  • Definition: An ascent direction at \(\theta_0\) is any \(v\) such that \(J(\theta_0+\alpha v)\geq J(\theta_0)\) for all \(0<\alpha<\alpha_0\) for some \(\alpha_0>0\).
    • ascent directions help us find maxima
  • The gradient of a differentiable function is the direction of steepest ascent

[Figure: a 2D quadratic function and its level sets, marking a point \(\theta_0\), its ascent directions, and the gradient \(\nabla J(\theta_0)\)]
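To make the steepest-ascent claim concrete, here is a short NumPy check (a sketch; the quadratic \(A\) and the point \(\theta_0\) are illustrative choices, not from the slides): the directional derivative \(\nabla J(\theta_0)^\top v\) along any unit direction \(v\) is at most \(\|\nabla J(\theta_0)\|_2\), and that value is attained by the normalized gradient.

```python
import numpy as np

# Concave 2D quadratic J(theta) = -theta^T A theta with A positive definite
# (A and theta0 chosen only for illustration).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def grad_J(theta):
    return -2 * A @ theta

rng = np.random.default_rng(0)
theta0 = np.array([1.0, -1.0])
g = grad_J(theta0)

# The directional derivative along the normalized gradient equals ||grad J||,
# which upper-bounds the directional derivative along any unit direction.
print("along gradient:", np.linalg.norm(g))
for _ in range(5):
    v = rng.standard_normal(2)
    v /= np.linalg.norm(v)
    print("along random unit direction:", g @ v)
```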

Gradient Ascent

  • GA is a first order method because at each iteration, it locally maximizes a first order approximation $$J(\theta) \approx J(\theta_i) + \nabla J(\theta_i)^\top(\theta-\theta_i)$$
    • the RHS is maximized when \(\theta-\theta_i\) is parallel to \(\nabla J(\theta_i)\)
    • step size \(\alpha\) prevents \(\theta_{i+1}\) from moving too far away from \(\theta_i\), where approximation would be inaccurate

Algorithm: Gradient Ascent

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • \(\theta_{i+1} = \theta_i + \alpha\nabla J(\theta_i)\)
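A minimal Python sketch of the loop above (not the course's reference implementation), run on a concave 1D quadratic \(J(\theta) = a\theta^2 + b\theta + c\) with \(a<0\); the coefficients, step size, and iteration count are illustrative.

```python
# Gradient ascent on J(theta) = a*theta^2 + b*theta + c with a < 0,
# so the maximum is at theta* = -b / (2a). Values below are illustrative.
a, b, c = -1.0, 2.0, 0.0

def grad_J(theta):
    return 2 * a * theta + b

theta = 5.0     # initialize theta_0
alpha = 0.1     # step size
for i in range(100):
    theta = theta + alpha * grad_J(theta)   # theta_{i+1} = theta_i + alpha * grad J(theta_i)

print("theta after GA:", theta)             # approaches 1.0
print("closed form -b/(2a):", -b / (2 * a))
```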

Critical Points

  • The gradient is equal to zero at a local maximum
    • by definition, no ascent directions at local max
  • Def: a critical point is a point \(\theta_0\) where \(\nabla J(\theta_0) = 0\)
  • Critical points are fixed points of the gradient ascent algorithm
  • Critical points can be (local or global) maxima, (local or global) minima, or saddle points
[Figures: critical points of example functions, including a saddle point]

Concave Functions

  • A function is concave if the line segment connecting any two points on its graph lies on or below the graph
  • If \(J\) is concave, then \(\nabla J(\theta_0)=0 \implies \theta_0\) is a global maximum
[Figures: two concave functions and one non-concave function]

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

Stochastic Gradient Ascent

  • Rather than exact gradients, SGA uses unbiased estimates of the gradient \(g_i\), i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$

Algorithm: SGA

  • Initialize \(\theta_0\); For \(i=0,1,...\):
    • \(\theta_{i+1} = \theta_i + \alpha g_i\)
[Figure: level sets of a 2D quadratic function, comparing the iterates of gradient ascent and stochastic gradient ascent]

Example: Linear Regression

  • Supervised learning with linear functions \(\theta^\top x\) and dataset of \(N\) training examples $$\min_\theta \underbrace{\textstyle \frac{1}{N} \sum_{j=1}^N (\theta^\top x_j - y_j)^2}_{J(\theta)}$$
  • The gradient is \(\nabla J(\theta_i)= \frac{1}{N} \sum_{j=1}^N 2(\theta_i^\top x_j - y_j)x_j\)
  • Training with SGD means sampling \(j\) uniformly and
    • \(g_i = \nabla (\theta_i^\top x_j - y_j)^2 =  2(\theta_i^\top x_j - y_j)x_j\)
  • Verifying that \(g_i\) is unbiased:
    • \(\mathbb E[g_i] =  \mathbb E[2(\theta_i^\top x_j - y_j)x_j] = \sum_{j=1}^N \frac{1}{N} 2(\theta_i^\top x_j - y_j)x_j=\nabla J(\theta_i)\)
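A short NumPy sketch of SGD for this objective, using the single-example gradient above; the synthetic dataset and hyperparameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = theta_true^T x + noise (illustrative)
N, d = 1000, 5
theta_true = rng.standard_normal(d)
X = rng.standard_normal((N, d))
y = X @ theta_true + 0.1 * rng.standard_normal(N)

theta = np.zeros(d)
alpha = 0.01

for i in range(5000):
    j = rng.integers(N)                       # sample j uniformly
    g = 2 * (theta @ X[j] - y[j]) * X[j]      # single-example gradient of (theta^T x_j - y_j)^2
    theta = theta - alpha * g                 # descent step (here we minimize J)

print("parameter error:", np.linalg.norm(theta - theta_true))
```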


SGA Convergence

  • SGA converges to critical points (when \(J\) is concave, this means that SGA converges to the global max)
  • Specifically, for "well behaved" function \(J\) and "well chosen" step size, running SGA for \(T\) iterations:
    • the average gradient norm \(\frac{1}{T}\sum_{i=1}^T\| \nabla J(\theta_i)\|_2\) converges to zero in expectation
    • at a rate of \(\sqrt{\sigma^2/T}\), where \(\sigma^2\) is the variance of \(g_i\)
  • Example: mini-batching is a strategy to reduce variance when training supervised learning algorithms with SGD

Additional Details

Theorem: Suppose that \(J\) is differentiable and "well behaved"

  • i.e. \(\beta\) smooth and \(M\) bounded
  • i.e. \(\|\nabla J(\theta)-\nabla J(\theta')\|_2\leq\beta\|\theta-\theta'\|_2\) and \(\sup_\theta J(\theta)\leq M\).

Then for SGA with independent gradient estimates which are

  • unbiased, \(\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)\), with bounded second moment \(\mathbb E[\|g_i\|_2^2] \leq \sigma^2\)

$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$

Example: Minibatching

  • Continuing linear regression example
  • Minibatching means sampling \(j_1, j_2, ..., j_M\) uniformly and
    • \(g_i = \nabla \frac{1}{M} \sum_{\ell=1}^M (\theta^\top x_{j_\ell} - y_{j_\ell})^2 =  2\frac{1}{M} \sum_{\ell=1}^M(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell}\)
  • Same argument verifies that \(g_i\) is unbiased
  • Variance:
    • \(\mathbb E[\|g_i - \nabla J(\theta_i)\|_2^2] = \frac{1}{M^2}\sum_{\ell=1}^M \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] = \frac{\sigma^2}{M}\)
  • Where we define \(\sigma^2 = \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] \) to be the variance of a single data-point estimate
  • Larger minibatch size \(M\) means lower variance!
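A quick empirical check of the \(\sigma^2/M\) scaling (a sketch; the dataset and the fixed \(\theta\) are illustrative): compare the spread of minibatch gradient estimates around the exact gradient for a few batch sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed dataset and parameter (illustrative)
N, d = 1000, 5
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d)
theta = np.zeros(d)

full_grad = 2 * X.T @ (X @ theta - y) / N      # exact gradient of J(theta)

def minibatch_grad(M):
    idx = rng.integers(N, size=M)              # sample j_1, ..., j_M uniformly
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ theta - yb) / M

for M in (1, 10, 100):
    errs = [np.sum((minibatch_grad(M) - full_grad) ** 2) for _ in range(2000)]
    print(f"M={M:3d}  empirical variance ~ {np.mean(errs):.3f}")   # shrinks roughly as 1/M
```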

Gradients in RL

  • Can we use sampled trajectories to estimate the gradient of \(J(\theta) = V^{\pi_\theta}(s_0)\) analogous to SGD for supervised learning?
  • Sampled trajectories can estimate \(V^{\pi_\theta}(s_0)\) as we saw in the past several lectures
  • They cannot be used to estimate \(\nabla_\theta V^{\pi_\theta}(s_0)\) analogously to supervised learning
    • RL: \(\theta\to P\to d_{s_0}^{\pi_\theta} \to V^{\pi_\theta}(s_0)=J(\theta)\) but \(P\) is unknown :(
    • SL: \(\theta\to \)loss function\(\to J(\theta)\) and loss function is known!

Gradients in RL

  • Can we use sampled trajectories to estimate the gradient of \(J(\theta) = V^{\pi_\theta}(s_0)\) analogous to SGD for supervised learning?
  • Simple example: consider \(s_0=1\), \(\pi_\theta(0) =\) stay, and
    \(\pi_\theta(a|1) = \begin{cases}\mathsf{stay} & \text{w.p.} ~\theta \\ \mathsf{switch} & \text{w.p.} ~1-\theta\end{cases}\)

\(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\)

  • One step horizon:
  • \(V^{\pi_\theta}(1) = \mathbb E[ r(s_0,a_0) + r(s_1)]\)
  • \(= -0.5(1-\theta) + 1 (\theta (1-p) + (1-\theta))\)

[Figure: two-state MDP with states 0 and 1 — stay keeps state 0 w.p. 1; from state 1, stay remains w.p. \(p\) and moves to 0 w.p. \(1-p\); switch moves to the other state w.p. 1]
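A small simulation sketch of this one-step example (the particular values of \(\theta\) and \(p\) are illustrative): estimate \(V^{\pi_\theta}(1)\) from sampled trajectories and compare with the closed form above.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, p = 0.7, 0.3      # illustrative policy parameter and transition probability

def rollout():
    s0 = 1
    a0 = "stay" if rng.random() < theta else "switch"   # pi_theta at state 1
    if a0 == "stay":
        s1 = 1 if rng.random() < p else 0               # stay: remain at 1 w.p. p
    else:
        s1 = 0                                          # switch: move to 0 w.p. 1
    # one-step horizon: r(s0, a0) + r(s1), with r(s, a) = -0.5*1{a=switch} + 1{s=0}
    return -0.5 * (a0 == "switch") + 1.0 * (s1 == 0)

estimate = np.mean([rollout() for _ in range(100_000)])
closed_form = -0.5 * (1 - theta) + theta * (1 - p) + (1 - theta)
print("MC estimate:", estimate, " closed form:", closed_form)
```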

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

Derivative Free Optimization


  • Setting: we can query \(J(\theta_i)\) but not \(\nabla J(\theta_i)\).
  • Simple idea for finding ascent direction: sample random directions, test them, and see which lead to an increase
  • Many variations on this idea: simulated annealing, cross-entropy method, genetic algorithms, evolutionary strategies

[Figure: level sets of a 2D quadratic function, showing a point \(\theta_0\), randomly sampled test points, and the resulting ascent directions]

1) Random Search

  • Recall the finite difference approximation:
    • In one dimension: \(J'(\theta) \approx \frac{J(\theta+\delta)-J(\theta-\delta)}{2\delta} \)
    • For multivariate functions, \(\langle \nabla J(\theta), v\rangle \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta} \)

Algorithm: Random Search

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • sample \(v\sim \mathcal N(0,I)\)
    • \(\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta}( J(\theta_i+\delta v) - J(\theta_i - \delta v))\,v\)

1) Random Search

  • For multivariate functions, \(\langle \nabla J(\theta), v\rangle \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta} \)

Algorithm: Random Search

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • sample \(v\sim \mathcal N(0,I)\)
    • \(\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v\)
  • We can understand this as SGA
  • \(\mathbb E[(J(\theta_i+\delta v) - J(\theta_i - \delta v))v|\theta_i] \approx \mathbb E[2\delta \nabla J(\theta_i)^\top v v|\theta_i] \)
    • \(=\mathbb E[2\delta v v^\top  \nabla J(\theta_i)|\theta_i] = 2\delta \mathbb E[v v^\top] \nabla J(\theta_i) = 2\delta \nabla J(\theta_i) \)

\(\nabla J(\theta) \approx \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))\,v\)

[Figure: parabola \(J(\theta) = -\theta^2 - 1\) plotted against \(\theta\)]

Random Search Example

  • start with \(\theta\) positive
  • suppose \(v\) is positive
  • then \(J(\theta+\delta v)<J(\theta-\delta v)\)
  • therefore \(g\) is negative
  • indeed, \(\nabla J(\theta) = -2\theta<0\) when \(\theta>0\)
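A minimal sketch of random search on this example, using only evaluations of \(J(\theta) = -\theta^2 - 1\); the step size, perturbation scale, and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):
    return -theta ** 2 - 1        # objective; only function values are used

theta = 2.0                       # start with theta positive
alpha, delta = 0.02, 0.1

for i in range(500):
    v = rng.standard_normal()     # random direction (1D)
    g = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v
    theta = theta + alpha * g     # SGA step with the two-point gradient estimate

print("theta:", theta)            # should approach the maximizer 0
```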

2) Importance Weighting

  • Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
    • E.g. in reinforcement learning \(V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]\)
  • Fact: The gradient \(\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\nabla_\theta \left[\log P_\theta(z) \right] h(z)\right]\)
    • Proof: pick an arbitrary distribution $$\rho\in \Delta(\mathcal Z)\quad \text{s.t.} \quad \frac{P_\theta(z)}{\rho(z)}<\infty $$
    • Then \(\mathbb E_{z\sim P_\theta}[h(z)] = \sum_{z\in\mathcal Z} h(z) P_\theta(z) \cdot \frac{\rho(z)}{\rho(z)} = \mathbb E_{z\sim \rho}[h(z) \frac{P_\theta(z) }{\rho(z)}] \)
      • general principle: reweight by ratio of probability distributions (PSet 5)
    • The gradient \(\nabla J(\theta) = \nabla_\theta \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim \rho}[h(z) \frac{\nabla_\theta P_\theta(z) }{\rho(z)}] \)
    • Set \(\rho = P_\theta\) and notice that \(\nabla_\theta \left[\log P_\theta(z) \right]  = \frac{\nabla_\theta P_\theta(z) }{P_\theta(z)}\)

2) Importance Weighting

  • Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
    • E.g. in reinforcement learning \(V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]\)
  • Fact: The gradient \(\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\underbrace{\nabla_\theta \left[\log P_\theta(z) \right]}_{\text{score}} h(z)\right]\)
  • SGA-inspired algorithm:

Algorithm: Monte-Carlo DFO

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • sample \(z\sim P_{\theta_i}\)
    • \(\theta_{i+1} = \theta_i + \alpha\left[\nabla_{\theta}\log P_\theta(z)\right]_{\theta=\theta_i} h(z)\)

Importance Weighting Example

\(\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(z))\, h(z)\)

  • Let \(P_\theta = \mathcal N(\theta, 1)\) and \(h(z) = -z^2\), so \(J(\theta) = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2]\) (\(=-\theta^2 - 1\))
  • Since \(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), the score is \(\nabla_\theta \log P_\theta(z)= (z-\theta)\)

[Figure: parabola \(h(z) = -z^2\) plotted against \(z\)]

  • start with \(\theta\) positive
  • suppose \(z>\theta\)
  • then the score \(z-\theta\) is positive
  • therefore \(g\) is negative (since \(h(z)<0\))
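A minimal sketch of the Monte-Carlo DFO update on this Gaussian example, with \(P_\theta = \mathcal N(\theta, 1)\), \(h(z) = -z^2\), and score \(z-\theta\); the step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(z):
    return -z ** 2                # J(theta) = E_{z~N(theta,1)}[h(z)] = -theta^2 - 1

theta = 2.0                       # start with theta positive
alpha = 0.01

for i in range(5000):
    z = rng.normal(theta, 1.0)    # sample z ~ P_theta = N(theta, 1)
    score = z - theta             # grad_theta log P_theta(z)
    theta = theta + alpha * score * h(z)

print("theta:", theta)            # hovers near the maximizer 0 (up to SGA noise)
```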

Recap

  • PSet released tonight
  • PA was released Fri

 

  • Maxima and critical points
  • Stochastic gradient ascent

 

  • Next lecture: Policy Optimization

Sp23 CS 4/5789: Lecture 16

By Sarah Dean
