Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
[Figure: agent-environment interaction loop — the policy maps state to action, and data/experience are collected from the environment]
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
[Figure: plot of \(J(\theta)\) versus \(\theta\), with maximizer \(\theta^\star\)]
Consider a function \(J(\theta) :\mathbb R^d \to\mathbb R\).
[Figure: examples of functions with a single global max, many global maxima, and a local max versus a global max]
[Figure: a 2D quadratic function over \((\theta_1, \theta_2)\) and its level sets]
Algorithm: Gradient Ascent
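A minimal sketch of the gradient ascent update, assuming the gradient of \(J\) is available in closed form; the step size alpha, iteration count T, and the quadratic example are illustrative choices:

import numpy as np

def gradient_ascent(grad_J, theta0, alpha=0.1, T=100):
    # Repeatedly step in the gradient direction: theta <- theta + alpha * grad J(theta).
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * grad_J(theta)
    return theta

# Example: J(theta) = -||theta||^2 has gradient -2*theta and unique global max at theta = 0.
theta_hat = gradient_ascent(lambda th: -2 * th, theta0=[1.0, -2.0])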
[Figure: a saddle point; examples of concave and non-concave functions]
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
Algorithm: Stochastic Gradient Ascent (SGA)
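A minimal sketch of SGA, assuming access to an unbiased (but noisy) gradient estimate; the Gaussian noise model, step size, and iteration count below are illustrative:

import numpy as np

def stochastic_gradient_ascent(grad_est, theta0, alpha=0.05, T=1000, rng=None):
    # Same update as gradient ascent, but using a stochastic estimate of grad J(theta).
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        g = grad_est(theta, rng)        # unbiased estimate of grad J(theta)
        theta = theta + alpha * g
    return theta

# Example: noisy gradient of J(theta) = -||theta||^2 (true gradient -2*theta plus Gaussian noise).
noisy_grad = lambda th, rng: -2 * th + rng.normal(scale=0.5, size=th.shape)
theta_hat = stochastic_gradient_ascent(noisy_grad, theta0=[1.0, -2.0])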
[Figure: SGA iterates on a 2D quadratic function over \((\theta_1, \theta_2)\) and its level sets]
Theorem: Suppose that \(J\) is differentiable and "well behaved" (e.g. \(\beta\)-smooth).
Then for SGA with independent gradient estimates which are unbiased and have variance at most \(\sigma^2\), choosing the step size \(\alpha\) below gives
$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$
\(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\)
[Figure: two-state MDP over states \(0\) and \(1\) with actions stay and switch; switch transitions with probability \(1\), while stay transitions are labeled with probabilities \(1\), \(1-p\), and \(p\)]
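As a small check of the reward definition above, the sketch below computes the discounted return \(\sum_t \gamma^t r(s_t, a_t)\) along a given trajectory; the trajectory and \(\gamma = 0.9\) are illustrative, and no transition model is needed for this computation:

def reward(s, a):
    # r(s, a) = -0.5 * 1{a == switch} + 1{s == 0}
    return -0.5 * (a == "switch") + 1.0 * (s == 0)

def discounted_return(trajectory, gamma=0.9):
    # Sum of gamma^t * r(s_t, a_t) over a finite trajectory of (state, action) pairs.
    return sum(gamma**t * reward(s, a) for t, (s, a) in enumerate(trajectory))

# Example trajectory: start in state 1, switch to 0, then stay twice.
tau = [(1, "switch"), (0, "stay"), (0, "stay")]
print(discounted_return(tau))  # -0.5 + 0.9*1.0 + 0.81*1.0 = 1.21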
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
[Figure: a 2D quadratic function over \((\theta_1, \theta_2)\) and its level sets]
Algorithm: Two Point Random Search
\(\nabla J(\theta) \approx g = \frac{1}{2\delta}\left(J(\theta + \delta v) - J(\theta - \delta v)\right)v\)
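A minimal sketch of two point random search using the estimator above, assuming \(J\) can be evaluated at arbitrary points; the Gaussian choice of direction \(v\), the smoothing radius delta, and the step size are illustrative:

import numpy as np

def two_point_gradient_estimate(J, theta, delta=0.1, rng=None):
    # g = (J(theta + delta*v) - J(theta - delta*v)) / (2*delta) * v for a random direction v.
    rng = np.random.default_rng() if rng is None else rng
    v = rng.normal(size=theta.shape)
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v

def two_point_random_search(J, theta0, alpha=0.05, delta=0.1, T=500, rng=None):
    # Gradient ascent driven only by evaluations of J (no analytic gradient).
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * two_point_gradient_estimate(J, theta, delta, rng)
    return theta

# Example: maximize J(theta) = -||theta||^2 using only function values.
theta_hat = two_point_random_search(lambda th: -np.sum(th**2), theta0=[1.0, -2.0])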
[Figure: plot of \(J(\theta) = -\theta^2 - 1\) versus \(\theta\)]
Algorithm: One Point Random Search
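The one point variant uses a single evaluation of \(J\) per step. Below is a minimal sketch using the standard one point estimator \(g = \frac{1}{\delta} J(\theta + \delta v)\, v\); this specific form is an assumption (it is unbiased for the gradient of a smoothed version of \(J\) when \(v\) is Gaussian) and it has higher variance than the two point estimator:

import numpy as np

def one_point_gradient_estimate(J, theta, delta=0.1, rng=None):
    # g = J(theta + delta*v) / delta * v; unbiased for the gradient of a smoothed version of J.
    rng = np.random.default_rng() if rng is None else rng
    v = rng.normal(size=theta.shape)
    return J(theta + delta * v) / delta * v

def one_point_random_search(J, theta0, alpha=0.01, delta=0.1, T=2000, rng=None):
    # One evaluation of J per iteration; noisier updates than the two point version.
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * one_point_gradient_estimate(J, theta, delta, rng)
    return theta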
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
Assume that we can "rollout" policy \(\pi_\theta\) to observe:
a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)
the resulting cumulative reward \(R(\tau)\)
Note: we do not need to know \(P\)! (Also easy to extend to not knowing \(r\)!)
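Putting the pieces together, a minimal sketch of random policy search: two point random search applied to \(J(\theta)\), where each function evaluation is replaced by a rollout of the perturbed policy. The rollout interface below is a hypothetical placeholder for whatever simulator or environment access is available:

import numpy as np

def policy_random_search(rollout, theta0, alpha=0.01, delta=0.1, T=1000, rng=None):
    # Two point random search on J(theta), estimating J by rolling out pi_theta.
    # rollout(theta) is assumed to return the cumulative reward R(tau) of one trajectory
    # sampled by running pi_theta in the environment (no knowledge of P or r is needed).
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        v = rng.normal(size=theta.shape)            # random perturbation direction
        J_plus = rollout(theta + delta * v)         # R(tau) under pi_{theta + delta v}
        J_minus = rollout(theta - delta * v)        # R(tau) under pi_{theta - delta v}
        g = (J_plus - J_minus) / (2 * delta) * v    # two point gradient estimate of J
        theta = theta + alpha * g                   # stochastic gradient ascent step
    return theta

Averaging several rollouts per perturbed policy would reduce the variance of each gradient estimate, at the cost of more interaction with the environment.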
Random Search Policy Optimization
By Sarah Dean