CS 4/5789: Introduction to Reinforcement Learning
Lecture 16: Optimization Overview
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- PA 3 released Friday, due 3/31
- PSet 4 released Wednesday
- 5789 Paper Reviews due weekly on Mondays
- Prelim
- Regrade requests open until Wednesday 11:59pm
Agenda
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Recap: Value-based RL


[Figure: agent-environment interaction loop; the policy selects actions, the environment returns state and reward, and experience is collected as data]
Key components of a value-based RL algorithm:
- Rollout policy
- Construct/update dataset
- Learn/update Q function
Recap: Comparison
1. PI with MC
- Data-driven PI
- Alternate learning \(Q^\pi\) w/ improving \(\pi\)
- \(\sum_{k=t}^{h} r_k\)
- On policy
2. PI with TD
- Data-driven PI
- Alternate learning \(Q^\pi\) w/ improving \(\pi\)
- \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_{t+1})\)
- On policy
3. Q-learning
- Data-driven VI
- Alternate learning \(Q^\star\) w/ updating \(\pi\)
- \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_\star)\)
- Off policy
Preview: Policy Optimization
- Ultimate Goal: find a (near) optimal policy
- Value-based RL estimates intermediate quantities
- \(Q^{\pi}\) or \(Q^{\star}\) are indirectly useful for finding the optimal policy
- Imitation learning had no intermediaries, but requires data from an expert policy
- Idea: optimize the policy without relying on intermediaries:
- objective as a function of policy: \(J(\pi) = \mathbb E_{s\sim \mu_0}[V^\pi(s)]\)
- For a parametric (e.g. deep) policy \(\pi_\theta\): $$J(\theta) = \mathbb E_{s\sim \mu_0}[V^{\pi_\theta}(s)]$$
Agenda
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Motivation: Optimization
[Figure: plot of \(J(\theta)\) vs \(\theta\) with maximum at \(\theta^\star\)]
- So far, we have discussed tabular and quadratic optimization
- tabular: np.amax(J, axis=1)
- quadratic: for \(J(\theta) = a\theta^2 + b\theta +c\) with \(a<0\), the maximum is at \(\theta^\star = -\frac{b}{2a}\)
- Today, we discuss strategies for arbitrary differentiable functions (even ones that are unknown!)
Maxima
Consider a function \(J(\theta) :\mathbb R^d \to\mathbb R\).
- Def: A global maximum is a point \(\theta_\star\) such that \(J(\theta_\star)\geq J(\theta)\) for all \(\theta\in\mathbb R^d\). A local maximum is a point satisfying the inequality for all \(\theta\) s.t. \(\|\theta_\star-\theta\|\leq \epsilon\) for some \(\epsilon>0\).
[Figure: example functions with a single global max; many global maxima; and a global max alongside a local max]
Ascent Directions
- Definition: An ascent direction at \(\theta_0\) is any \(v\) such that \(J(\theta_0+\alpha v)\geq J(\theta_0)\) for all \(0<\alpha<\alpha_0\) for some \(\alpha_0>0\).
- ascent directions help us find maxima
- The gradient of a differentiable function is the direction of steepest ascent
[Figure: level sets of a 2D quadratic function in \((\theta_1,\theta_2)\), showing a point \(\theta_0\), its ascent directions, and the gradient \(\nabla J(\theta_0)\)]
Gradient Ascent
- GA is a first order method because at each iteration, it locally maximizes a first order approximation $$J(\theta) \approx J(\theta_i) + \nabla J(\theta_i)^\top(\theta-\theta_i)$$
- the RHS is maximized when \(\theta-\theta_i\) is parallel to \(\nabla J(\theta_i)\)
- step size \(\alpha\) prevents \(\theta_{i+1}\) from moving too far away from \(\theta_i\), where approximation would be inaccurate
Algorithm: Gradient Ascent
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- \(\theta_{i+1} = \theta_i + \alpha\nabla J(\theta_i)\)
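As an illustration (not from the slides), here is a minimal Python sketch of the gradient ascent update, assuming a concrete concave objective \(J(\theta) = -\|\theta\|_2^2\) with known gradient:

```python
import numpy as np

def grad_J(theta):
    # gradient of the assumed objective J(theta) = -||theta||^2 (concave, maximized at 0)
    return -2.0 * theta

def gradient_ascent(theta0, alpha=0.1, iters=100):
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        theta = theta + alpha * grad_J(theta)  # theta_{i+1} = theta_i + alpha * grad J(theta_i)
    return theta

print(gradient_ascent([3.0, -2.0]))  # converges toward the maximizer [0, 0]
```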
Critical Points
- The gradient is equal to zero at a local maximum
- by definition, no ascent directions at local max
- Def: a critical point is a point \(\theta_0\) where \(\nabla J(\theta_0) = 0\)
- Critical points are fixed points of the gradient ascent algorithm
- Critical points can be (local or global) maxima, (local or global) minima, or saddle points

[Figure: a saddle point]
Concave Functions
- A function is concave if the line segment connecting any two points on its graph lies on or below the graph
- If \(J\) is concave, then \(\nabla J(\theta_0)=0 \implies \theta_0\) is a global maximum
[Figure: two concave functions and one non-concave function]
Agenda
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Stochastic Gradient Ascent
- Rather than exact gradients, SGA uses unbiased estimates of the gradient \(g_i\), i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$
Algorithm: SGA
- Initialize \(\theta_0\); For \(i=0,1,...\):
- \(\theta_{i+1} = \theta_i + \alpha g_i\)
[Figure: level sets of a 2D quadratic function in \((\theta_1,\theta_2)\), comparing gradient ascent and stochastic gradient ascent iterates]
Example: Linear Regression
- Supervised learning with linear functions \(\theta^\top x\) and dataset of \(N\) training examples $$\min_\theta \underbrace{\textstyle \frac{1}{N} \sum_{j=1}^N (\theta^\top x_j - y_j)^2}_{J(\theta)}$$
- The gradient is \(\nabla J(\theta_i)= \frac{1}{N} \sum_{j=1}^N 2(\theta_i^\top x_j - y_j)x_j\)
- Training with SGD means sampling \(j\) uniformly and
- \(g_i = \nabla (\theta_i^\top x_j - y_j)^2 = 2(\theta_i^\top x_j - y_j)x_j\)
- Verifying that \(g_i\) is unbiased:
- \(\mathbb E[g_i] = \mathbb E[2(\theta_i^\top x_j - y_j)x_j] = \sum_{j=1}^N \frac{1}{N} 2(\theta_i^\top x_j - y_j)x_j=\nabla J(\theta_i)\)
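A minimal sketch of this SGD procedure on synthetic data (the data-generating choices below are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic regression data (assumed for illustration): y_j ~ theta_true^T x_j + noise
N, d = 1000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
alpha = 0.01
for i in range(5000):
    j = rng.integers(N)                        # sample j uniformly from the dataset
    g = 2 * (theta @ X[j] - y[j]) * X[j]       # g_i = 2 (theta^T x_j - y_j) x_j, unbiased for grad J
    theta = theta - alpha * g                  # descent step, since this example is a minimization

print(np.linalg.norm(theta - theta_true))      # should be small after training
```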
PollEV
SGA Convergence
- SGA converges to critical points (when \(J\) is concave, this means that SGA converges to the global max)
- Specifically, for "well behaved" function \(J\) and "well chosen" step size, running SGA for \(T\) iterations:
- norm \(\| \nabla J(\theta_i)\|_2\) converges in expectation
- at a rate of \(\sqrt{\sigma^2/T}\) where \(\sigma^2\) is the variance of \(g_i\)
- Example: mini-batching is a strategy to reduce variance when training supervised learning algorithms with SGD
Additional Details
Theorem: Suppose that \(J\) is differentiable and "well behaved"
- i.e. \(\beta\) smooth and \(M\) bounded
- i.e. \(\|\nabla J(\theta)-\nabla J(\theta')\|_2\leq\beta\|\theta-\theta'\|_2\) and \(\sup_\theta J(\theta)\leq M\).
Then for SGA with independent gradient estimates which are
- unbiased (\(\mathbb E[g_i\mid\theta_i] = \nabla J(\theta_i)\)) with bounded second moment (\(\mathbb E[\|g_i\|_2^2] \leq \sigma^2\)),
$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$
Example: Minibatching
- Continuing linear regression example
- Minibatching means sampling \(j_1, j_2, ..., j_M\) uniformly and
- \(g_i = \nabla \frac{1}{M} \sum_{\ell=1}^M (\theta_i^\top x_{j_\ell} - y_{j_\ell})^2 = \frac{2}{M} \sum_{\ell=1}^M(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell}\)
- Same argument verifies that \(g_i\) is unbiased
- Variance:
- \(\mathbb E[\|g_i - \nabla J(\theta_i)\|_2^2] = \frac{1}{M^2}\sum_{\ell=1}^M \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] = \frac{\sigma^2}{M}\)
- Where we define \(\sigma^2 = \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] \) to be the variance of a single data-point estimate
- Larger minibatch size \(M\) means lower variance!
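A quick numerical check (an illustration with assumed synthetic data) that the minibatch gradient's error shrinks roughly like \(\sigma^2/M\):

```python
import numpy as np

rng = np.random.default_rng(1)

# assumed synthetic regression data, as in the SGD sketch above
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

theta = rng.normal(size=d)                      # arbitrary query point
full_grad = 2 * X.T @ (X @ theta - y) / N       # exact gradient of J at theta

def minibatch_grad(M):
    idx = rng.integers(N, size=M)               # sample j_1, ..., j_M uniformly (with replacement)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ theta - yb) / M     # g_i = (1/M) sum_l 2 (theta^T x - y) x

for M in (1, 10, 100):
    errs = [np.sum((minibatch_grad(M) - full_grad) ** 2) for _ in range(2000)]
    print(M, np.mean(errs))                     # empirical E||g_i - grad J||^2, roughly sigma^2 / M
```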
Gradients in RL
- Can we use sampled trajectories to estimate the gradient of \(J(\theta) = V^{\pi_\theta}(s_0)\) analogous to SGD for supervised learning?
- Sampled trajectories can estimate \(V^{\pi_\theta}(s_0)\) as we saw in the past several lectures
- They cannot be used to estimate \(\nabla_\theta V^{\pi_\theta}(s_0)\) analogously to supervised learning
- RL: \(\theta, P\to d_{s_0}^{\pi_\theta} \to V^{\pi_\theta}(s_0)=J(\theta)\), but \(P\) is unknown :(
- SL: \(\theta\to\) loss function \(\to J(\theta)\), and the loss function is known!
Gradients in RL
- Can we use sampled trajectories to estimate the gradient of \(J(\theta) = V^{\pi_\theta}(s_0)\) analogous to SGD for supervised learning?
- Simple example: consider \(s_0=1\), \(\pi_\theta(0) =\) stay, and
\(\pi_\theta(a|1) = \begin{cases}\mathsf{stay} & \text{w.p.} ~\theta \\ \mathsf{switch} & \text{w.p.} ~1-\theta\end{cases}\)
\(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\)
- One step horizon:
- \(V^{\pi_\theta}(1) = \mathbb E[ r(s_0,a_0) + r(s_1)]\)
- \(= -0.5(1-\theta) + 1 (\theta (1-p) + (1-\theta))\)

[Figure: two-state MDP over states 0 and 1; stay keeps state 0 w.p. 1; from state 1, stay remains w.p. \(p\) and moves to 0 w.p. \(1-p\); switch moves to the other state w.p. 1]
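A sketch of estimating \(V^{\pi_\theta}(1)\) for this one-step example by sampling trajectories (the values of \(p\) and \(\theta\) below are arbitrary choices for illustration); sampling estimates \(J(\theta)\), but it does not directly give \(\nabla_\theta J(\theta)\):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3       # assumed transition parameter (illustrative)
theta = 0.6   # pi_theta(stay | 1) = theta

def rollout():
    stay = rng.random() < theta                 # a0 = stay w.p. theta, switch w.p. 1 - theta
    r0 = 0.0 if stay else -0.5                  # r(s0, a0): penalty for switching
    if stay:
        s1 = 1 if rng.random() < p else 0       # stay keeps s = 1 w.p. p, moves to 0 w.p. 1 - p
    else:
        s1 = 0                                  # switch moves to 0 w.p. 1
    r1 = 1.0 if s1 == 0 else 0.0                # r(s1): bonus for being in state 0
    return r0 + r1

estimate = np.mean([rollout() for _ in range(100_000)])
closed_form = -0.5 * (1 - theta) + (theta * (1 - p) + (1 - theta))
print(estimate, closed_form)                    # the Monte Carlo estimate matches the closed form
```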
Agenda
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Derivative-Free Optimization
- Setting: we can query \(J(\theta_i)\) but not \(\nabla J(\theta_i)\).
- Simple idea for finding ascent direction: sample random directions, test them, and see which lead to an increase
- Many variations on this idea: simulated annealing, cross-entropy method, genetic algorithms, evolutionary strategies
[Figure: level sets of a 2D quadratic function in \((\theta_1,\theta_2)\), showing a point \(\theta_0\), sampled test points, and the resulting ascent directions]
1) Random Search
- Recall the finite difference approximation:
- In one dimension: \(J'(\theta) \approx \frac{J(\theta+\delta)-J(\theta-\delta)}{2\delta} \)
- For multivariate functions, \(\langle \nabla J(\theta), v\rangle \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta} \)
Algorithm: Random Search
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- sample \(v\sim \mathcal N(0,I)\)
- \(\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta}( J(\theta_i+\delta v) - J(\theta_i - \delta v))v\)
1) Random Search
- For multivariate functions, \(\langle \nabla J(\theta), v\rangle \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta} \)
Algorithm: Random Search
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- sample \(v\sim \mathcal N(0,I)\)
- \(\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v\)
- We can understand this as SGA
- \(\mathbb E[(J(\theta_i+\delta v) - J(\theta_i - \delta v))v|\theta_i] \approx \mathbb E[2\delta \nabla J(\theta_i)^\top v v|\theta_i] \)
- \(=\mathbb E[2\delta v v^\top \nabla J(\theta_i)|\theta_i] = 2\delta \mathbb E[v v^\top] \nabla J(\theta_i) = 2\delta \nabla J(\theta_i) \)
Random Search Example
- \(J(\theta) = -\theta^2 - 1\)
- gradient estimate: \(\nabla J(\theta) \approx \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))v\)
- start with \(\theta\) positive
- suppose \(v\) is positive
- then \(J(\theta+\delta v)<J(\theta-\delta v)\)
- therefore \(g\) is negative
- indeed, \(\nabla J(\theta) = -2\theta<0\) when \(\theta>0\)
[Figure: plot of \(J(\theta) = -\theta^2 - 1\) versus \(\theta\)]
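A minimal sketch of random search on this example, extending \(J(\theta) = -\theta^2 - 1\) to vector \(\theta\) so the same code runs in any dimension (step size and \(\delta\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):
    # the slide's example objective, extended to vectors: J(theta) = -||theta||^2 - 1
    return -np.sum(theta ** 2) - 1.0

theta = np.array([2.0, -1.5])
alpha, delta = 0.05, 0.1
for i in range(500):
    v = rng.normal(size=theta.shape)            # sample direction v ~ N(0, I)
    g = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v  # two-point estimate
    theta = theta + alpha * g                   # ascent step with the estimated gradient

print(theta)  # approaches the maximizer at 0
```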
2) Importance Weighting
- Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
- E.g. in reinforcement learning \(V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]\)
Fact: The gradient \(\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\nabla_\theta \left[\log P_\theta(z) \right] h(z)\right]\)
- Proof: pick an arbitrary distribution $$\rho\in \Delta(\mathcal Z)\quad \text{s.t.} \quad \frac{P_\theta(z)}{\rho(z)}<\infty $$
- Then \(\mathbb E_{z\sim P_\theta}[h(z)] = \sum_{z\in\mathcal Z} h(z) P_\theta(z) \cdot \frac{\rho(z)}{\rho(z)} = \mathbb E_{z\sim \rho}[h(z) \frac{P_\theta(z) }{\rho(z)}] \)
- general principle: reweight by ratio of probability distributions (PSet 5)
- The gradient \(\nabla J(\theta) = \nabla_\theta \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim \rho}[h(z) \frac{\nabla_\theta P_\theta(z) }{\rho(z)}] \)
- Set \(\rho = P_\theta\) and notice that \(\nabla_\theta \left[\log P_\theta(z) \right] = \frac{\nabla_\theta P_\theta(z) }{P_\theta(z)}\)
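A quick numerical check of this fact (an illustration assuming the Gaussian family \(P_\theta = \mathcal N(\theta, 1)\) and \(h(z) = -z^2\), for which \(J(\theta) = -\theta^2 - 1\) and \(\nabla J(\theta) = -2\theta\)):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
z = rng.normal(theta, 1.0, size=1_000_000)     # z ~ P_theta = N(theta, 1)
score = z - theta                              # grad_theta log P_theta(z) for the Gaussian family
print(np.mean(score * (-z ** 2)), -2 * theta)  # Monte Carlo estimate of grad J vs. exact value
```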
2) Importance Weighting
- Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
- E.g. in reinforcement learning \(V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]\)
- Fact: The gradient \(\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\underbrace{\nabla_\theta \left[\log P_\theta(z) \right]}_{\text{score}}\, h(z)\right]\)
- SGA-inspired algorithm:
Algorithm: Monte-Carlo DFO
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- sample \(z\sim P_{\theta_i}\)
- \(\theta_{i+1} = \theta_i + \alpha\left[\nabla_{\theta}\log P_\theta(z)\right]_{\theta=\theta_i} h(z)\)
Importance Weighting Example
- \(P_\theta = \mathcal N(\theta, 1)\), \(h(z) = -z^2\)
- \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2]\) (\(=-\theta^2 - 1\))
- \(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), so the score is \(\nabla_\theta \log P_\theta(z)= (z-\theta)\)
- gradient estimate: \(\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(z))\, h(z)\)
- start with \(\theta\) positive
- suppose \(z>\theta\)
- then the score is positive
- therefore \(g\) is negative (since \(h(z)<0\))
[Figure: density of \(P_\theta = \mathcal N(\theta, 1)\) over \(z\)]
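A minimal sketch of Monte-Carlo DFO on this example (step size and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

h = lambda z: -z ** 2        # h(z) from the example
theta = 2.0                  # start with theta positive
alpha = 0.001
for i in range(20_000):
    z = rng.normal(theta, 1.0)             # sample z ~ P_theta = N(theta, 1)
    score = z - theta                      # grad_theta log P_theta(z) evaluated at theta_i
    theta = theta + alpha * score * h(z)   # theta_{i+1} = theta_i + alpha * score * h(z)

print(theta)  # fluctuates around the maximizer theta = 0
```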
Recap
- PSet released tonight
- PA was released Fri
- Maxima and critical points
- Stochastic gradient ascent
- Next lecture: Policy Optimization