Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
[Figure: agent-environment interaction loop; the policy selects actions, the environment returns states and rewards, producing the data/experience used for learning.]
Key components of a value-based RL algorithm:
1. PI with MC (data-driven PI)
2. PI with TD (data-driven PI)
3. Q-learning (data-driven VI)
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
[Figure: plot of an objective \(J(\theta)\) over \(\theta\), with maximizer \(\theta^\star\).]
Consider a function \(J:\mathbb R^d \to\mathbb R\) of parameters \(\theta\in\mathbb R^d\).
[Figure: example objectives: one with a single global max, one with many global maxima, and one with both a global max and a local max.]
[Figure: a 2D quadratic function and the level sets of the quadratic, with axes \(\theta_1\) and \(\theta_2\).]
Algorithm: Gradient Ascent
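A minimal sketch of gradient ascent, \(\theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t)\), assuming gradient access; the names (`gradient_ascent`, `grad_J`) and the quadratic example are illustrative, not from the slides:

```python
import numpy as np

def gradient_ascent(grad_J, theta0, alpha=0.1, T=100):
    """Plain gradient ascent: theta <- theta + alpha * grad_J(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * grad_J(theta)
    return theta

# Example: the concave quadratic J(theta) = -||theta||^2 has gradient -2*theta
# and a unique global max at the origin.
theta_hat = gradient_ascent(lambda th: -2 * th, theta0=[3.0, -1.0])
```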
[Figure: a saddle point, and example functions labeled concave, concave, and not concave.]
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Algorithm: SGA
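A minimal sketch of SGA, assuming an unbiased but noisy gradient estimate (here simulated by adding Gaussian noise to the true gradient of a concave quadratic; all names are illustrative):

```python
import numpy as np

def sga(grad_est, theta0, alpha=0.05, T=1000, rng=None):
    """Stochastic gradient ascent: theta <- theta + alpha * g,
    where g is an unbiased (noisy) estimate of grad J(theta)."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * grad_est(theta, rng)
    return theta

# Simulated unbiased gradient for J(theta) = -||theta||^2: the true gradient
# -2*theta plus zero-mean Gaussian noise.
noisy_grad = lambda th, rng: -2 * th + rng.normal(scale=1.0, size=th.shape)
theta_hat = sga(noisy_grad, theta0=[3.0, -1.0])
```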
[Figure: a 2D quadratic function and the level sets of the quadratic, with axes \(\theta_1\) and \(\theta_2\).]
Theorem: Suppose that \(J\) is differentiable and "well behaved" (\(\beta\)-smooth, with values bounded by \(M\)).
Then for SGA with independent gradient estimates which are unbiased with variance at most \(\sigma^2\),
$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2^2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \text{for }\alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$
[Figure: two-state MDP with states \(0\) and \(1\), actions stay and switch, and reward \(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\). Edge labels: from state \(0\), stay: \(1\), switch: \(1\); from state \(1\), stay: \(1-p\) (remain) or \(p\) (to state \(0\)), switch: \(1\).]
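For concreteness, a sketch of this example MDP as numpy arrays, assuming the transition probabilities read off the edge labels above; the value of `p` and the array names are illustrative:

```python
import numpy as np

# States: 0, 1.  Actions: 0 = stay, 1 = switch.
# P[a, s, s'] = probability of moving from s to s' under action a
# (assumption: from state 1, "stay" returns to 0 w.p. p and remains w.p. 1-p;
# the other edges are deterministic as labeled).
p = 0.3  # illustrative value
P = np.array([
    [[1.0, 0.0],       # stay from state 0
     [p,   1.0 - p]],  # stay from state 1
    [[0.0, 1.0],       # switch from state 0
     [1.0, 0.0]],      # switch from state 1
])

# r(s, a) = -0.5 * 1{a = switch} + 1{s = 0}, arranged as r[s, a]
r = np.array([[1.0,  0.5],
              [0.0, -0.5]])
```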
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
[Figure: a 2D quadratic function and the level sets of the quadratic, with axes \(\theta_1\) and \(\theta_2\).]
Algorithm: Random Search
\(\nabla J(\theta) \approx \frac{1}{2\delta}\left(J(\theta+\delta v) - J(\theta-\delta v)\right)v\)
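A minimal sketch of this two-point estimate, assuming only function-value (zeroth-order) access to \(J\) and a Gaussian random direction \(v\) (the slides' sampling distribution for \(v\) may differ; names are illustrative):

```python
import numpy as np

def two_point_grad(J, theta, delta=1e-2, rng=None):
    """Random-search gradient estimate:
    (1 / (2*delta)) * (J(theta + delta*v) - J(theta - delta*v)) * v,
    where v is a random direction."""
    rng = np.random.default_rng(0) if rng is None else rng
    v = rng.normal(size=np.shape(theta))  # random direction
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v

# Example: J(theta) = -theta^2 - 1 has true gradient -2*theta.
g_hat = two_point_grad(lambda th: -th**2 - 1, theta=np.array([1.5]))
```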
[Figure: plot of the example objective \(J(\theta) = -\theta^2 - 1\) over \(\theta\).]
Algorithm: Monte-Carlo DFO
\(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\), with gradient estimate
$$\nabla J(\theta) \approx \underbrace{\nabla_\theta \log(P_\theta(z))}_{\text{score}}\, h(z), \qquad z \sim P_\theta$$
Example: \(P_\theta = \mathcal N(\theta, 1)\) and \(h(z) = -z^2\), so \(J(\theta) = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2]\) \((= -\theta^2 - 1)\).
Since \(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), the score is \(\nabla_\theta \log P_\theta(z) = (z-\theta)\).
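A minimal sketch of this score-function estimator on the Gaussian example above (\(P_\theta = \mathcal N(\theta,1)\), \(h(z) = -z^2\)); names are illustrative:

```python
import numpy as np

def score_function_grad(theta, n_samples=1000, rng=None):
    """Monte-Carlo DFO estimate of grad J(theta) for J(theta) = E_{z~N(theta,1)}[-z^2],
    averaging grad_theta log P_theta(z) * h(z) = (z - theta) * (-z**2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.normal(loc=theta, scale=1.0, size=n_samples)
    return np.mean((z - theta) * (-z**2))

# True gradient of J(theta) = -theta^2 - 1 is -2*theta, so at theta = 1.5 the
# estimate should be near -3 (up to Monte-Carlo noise).
g_hat = score_function_grad(1.5)
```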