CS 4/5789: Introduction to Reinforcement Learning
Lecture 16: Optimization Overview
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- PA 3 released Friday, due 3/31
- PSet 4 released Wednesday
- 5789 Paper Reviews due weekly on Mondays
- Prelim
- Regrade requests open until Wednesday 11:59pm
Agenda
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Recap: Value-based RL


[Figure: agent-environment interaction loop; the policy selects actions, the environment returns state and reward, and experience is collected as data]
Key components of a value-based RL algorithm:
- Rollout policy
- Construct/update dataset
- Learn/update Q function
Recap: Comparison
1. PI with MC
- Data-driven PI
- Alternate learning \(Q^\pi\) w/ improving \(\pi\)
- \(\sum_{k=t}^{h} r_k\)
- On policy
2. PI with TD
- Data-driven PI
- Alternate learning \(Q^\pi\) w/ improving \(\pi\)
- \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_{t+1})\)
- On policy
3. Q-learning
- Data-driven VI
- Alternate learning \(Q^\star\) w/ updating \(\pi\)
- \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_\star)\)
- Off policy
Preview: Policy Optimization
- Ultimate Goal: find a (near) optimal policy
- Value-based RL estimates intermediate quantities
- \(Q^{\pi}\) or \(Q^{\star}\) are indirectly useful for finding the optimal policy
- Imitation learning had no intermediaries, but requires data from an expert policy
- Idea: optimize the policy without relying on intermediaries:
- objective as a function of policy: \(J(\pi) = \mathbb E_{s\sim \mu_0}[V^\pi(s)]\)
- For a parametric (e.g. deep) policy \(\pi_\theta\): $$J(\theta) = \mathbb E_{s\sim \mu_0}[V^{\pi_\theta}(s)]$$
Agenda
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Motivation: Optimization
[Figure: plot of \(J(\theta)\) vs \(\theta\) with maximum at \(\theta^\star\)]
- So far, we have discussed tabular and quadratic optimization
- tabular: np.amax(J, axis=1)
- quadratic: for \(J(\theta) = a\theta^2 + b\theta +c\) with \(a<0\), the maximum is at \(\theta^\star = -\frac{b}{2a}\)
- Today, we discuss strategies for arbitrary differentiable functions (even ones that are unknown!)
Maxima
Consider a function \(J(\theta) :\mathbb R^d \to\mathbb R\).
- Def: A global maximum is a point \(\theta_\star\) such that \(J(\theta_\star)\geq J(\theta)\) for all \(\theta\in\mathbb R^d\). A local maximum is a point satisfying the inequality for all \(\theta\) s.t. \(\|\theta_\star-\theta\|\leq \epsilon\) for some \(\epsilon>0\).
[Figure: example functions with a single global max; many global maxima; and a global max alongside a local max]
Ascent Directions
- Definition: An ascent direction at \(\theta_0\) is any \(v\) such that \(J(\theta_0+\alpha v)\geq J(\theta_0)\) for all \(0<\alpha<\alpha_0\) for some \(\alpha_0>0\).
- ascent directions help us find maxima
- The gradient of a differentiable function is the direction of steepest ascent
[Figure: level sets of a 2D quadratic function in \((\theta_1,\theta_2)\), showing a point \(\theta_0\), its ascent directions, and the gradient \(\nabla J(\theta_0)\)]
Gradient Ascent
- GA is a first order method because at each iteration, it locally maximizes a first order approximation $$J(\theta) \approx J(\theta_i) + \nabla J(\theta_i)^\top(\theta-\theta_i)$$
- the RHS is maximized when \(\theta-\theta_i\) is parallel to \(\nabla J(\theta_i)\)
- step size \(\alpha\) prevents \(\theta_{i+1}\) from moving too far away from \(\theta_i\), where approximation would be inaccurate
Algorithm: Gradient Ascent
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- \(\theta_{i+1} = \theta_i + \alpha\nabla J(\theta_i)\)
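As an illustration (not from the slides), here is a minimal Python sketch of the gradient ascent update, assuming a concrete concave objective \(J(\theta) = -\|\theta\|_2^2\) with known gradient:

```python
import numpy as np

def grad_J(theta):
    # gradient of the assumed objective J(theta) = -||theta||^2 (concave, maximized at 0)
    return -2.0 * theta

def gradient_ascent(theta0, alpha=0.1, iters=100):
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        theta = theta + alpha * grad_J(theta)  # theta_{i+1} = theta_i + alpha * grad J(theta_i)
    return theta

print(gradient_ascent([3.0, -2.0]))  # converges toward the maximizer [0, 0]
```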
Critical Points
- The gradient is equal to zero at a local maximum
- by definition, no ascent directions at local max
- Def: a critical point is a point \(\theta_0\) where \(\nabla J(\theta_0) = 0\)
- Critical points are fixed points of the gradient ascent algorithm
- Critical points can be (local or global) maxima, (local or global) minima, or saddle points

[Figure: a saddle point]
Concave Functions
- A function is concave if the line segment connecting any two points on its graph lies on or below the graph
- If \(J\) is concave, then \(\nabla J(\theta_0)=0 \implies \theta_0\) is a global maximum
[Figure: two concave functions and one non-concave function]
Agenda
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Stochastic Gradient Ascent
- Rather than exact gradients, SGA uses unbiased estimates of the gradient \(g_i\), i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$
Algorithm: SGA
- Initialize \(\theta_0\); For \(i=0,1,...\):
- \(\theta_{i+1} = \theta_i + \alpha g_i\)
[Figure: level sets of a 2D quadratic function in \((\theta_1,\theta_2)\), comparing gradient ascent and stochastic gradient ascent iterates]
Example: Linear Regression
- Supervised learning with linear functions \(\theta^\top x\) and dataset of \(N\) training examples $$\min_\theta \underbrace{\textstyle \frac{1}{N} \sum_{j=1}^N (\theta^\top x_j - y_j)^2}_{J(\theta)}$$
- The gradient is \(\nabla J(\theta_i)= \frac{1}{N} \sum_{j=1}^N 2(\theta_i^\top x_j - y_j)x_j\)
- Training with SGD means sampling \(j\) uniformly and
- \(g_i = \nabla (\theta_i^\top x_j - y_j)^2 = 2(\theta_i^\top x_j - y_j)x_j\)
- Verifying that \(g_i\) is unbiased:
- \(\mathbb E[g_i] = \mathbb E[2(\theta_i^\top x_j - y_j)x_j] = \sum_{j=1}^N \frac{1}{N} 2(\theta_i^\top x_j - y_j)x_j=\nabla J(\theta_i)\)
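A minimal sketch of this SGD procedure on synthetic data (the data-generating choices below are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic regression data (assumed for illustration): y_j ~ theta_true^T x_j + noise
N, d = 1000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
alpha = 0.01
for i in range(5000):
    j = rng.integers(N)                        # sample j uniformly from the dataset
    g = 2 * (theta @ X[j] - y[j]) * X[j]       # g_i = 2 (theta^T x_j - y_j) x_j, unbiased for grad J
    theta = theta - alpha * g                  # descent step, since this example is a minimization

print(np.linalg.norm(theta - theta_true))      # should be small after training
```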
PollEV
SGA Convergence
- SGA converges to critical points (when \(J\) is concave, this means that SGA converges to the global max)
- Specifically, for "well behaved" function \(J\) and "well chosen" step size, running SGA for \(T\) iterations:
- norm \(\| \nabla J(\theta_i)\|_2\) converges in expectation
- at a rate of \(\sqrt{\sigma^2/T}\) where \(\sigma^2\) is the variance of \(g_i\)
- Example: mini-batching is a strategy to reduce variance when training supervised learning algorithms with SGD
Additional Details
Theorem: Suppose that \(J\) is differentiable and "well behaved"
- i.e. \(\beta\) smooth and \(M\) bounded
- i.e. \(\|\nabla J(\theta)-\nabla J(\theta')\|_2\leq\beta\|\theta-\theta'\|_2\) and \(\sup_\theta J(\theta)\leq M\).
Then for SGA with independent gradient estimates which are
- unbiased (\(\mathbb E[g_i\mid\theta_i] = \nabla J(\theta_i)\)) with bounded second moment (\(\mathbb E[\|g_i\|_2^2] \leq \sigma^2\)),
$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$
Example: Minibatching
- Continuing linear regression example
- Minibatching means sampling \(j_1, j_2, ..., j_M\) uniformly and
- \(g_i = \nabla \frac{1}{M} \sum_{\ell=1}^M (\theta_i^\top x_{j_\ell} - y_{j_\ell})^2 = \frac{2}{M} \sum_{\ell=1}^M(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell}\)
- Same argument verifies that \(g_i\) is unbiased
- Variance:
- \(\mathbb E[\|g_i - \nabla J(\theta_i)\|_2^2] = \frac{1}{M^2}\sum_{\ell=1}^M \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] = \frac{\sigma^2}{M}\)
- Where we define \(\sigma^2 = \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] \) to be the variance of a single data-point estimate
- Larger minibatch size \(M\) means lower variance!
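A quick numerical check (an illustration with assumed synthetic data) that the minibatch gradient's error shrinks roughly like \(\sigma^2/M\):

```python
import numpy as np

rng = np.random.default_rng(1)

# assumed synthetic regression data, as in the SGD sketch above
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

theta = rng.normal(size=d)                      # arbitrary query point
full_grad = 2 * X.T @ (X @ theta - y) / N       # exact gradient of J at theta

def minibatch_grad(M):
    idx = rng.integers(N, size=M)               # sample j_1, ..., j_M uniformly (with replacement)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ theta - yb) / M     # g_i = (1/M) sum_l 2 (theta^T x - y) x

for M in (1, 10, 100):
    errs = [np.sum((minibatch_grad(M) - full_grad) ** 2) for _ in range(2000)]
    print(M, np.mean(errs))                     # empirical E||g_i - grad J||^2, roughly sigma^2 / M
```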
Gradients in RL
- Can we use sampled trajectories to estimate the gradient of \(J(\theta) = V^{\pi_\theta}(s_0)\) analogous to SGD for supervised learning?
- Sampled trajectories can estimate \(V^{\pi_\theta}(s_0)\) as we saw in the past several lectures
- They cannot be used to estimate \(\nabla_\theta V^{\pi_\theta}(s_0)\) analogously to supervised learning
- RL: \(\theta, P\to d_{s_0}^{\pi_\theta} \to V^{\pi_\theta}(s_0)=J(\theta)\), but \(P\) is unknown :(
- SL: \(\theta\to\) loss function \(\to J(\theta)\), and the loss function is known!
Gradients in RL
- Can we use sampled trajectories to estimate the gradient of \(J(\theta) = V^{\pi_\theta}(s_0)\) analogous to SGD for supervised learning?
- Simple example: consider \(s_0=1\), \(\pi_\theta(0) =\) stay, and
\(\pi_\theta(a|1) = \begin{cases}\mathsf{stay} & \text{w.p.} ~\theta \\ \mathsf{switch} & \text{w.p.} ~1-\theta\end{cases}\)
\(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\)
- One step horizon:
- \(V^{\pi_\theta}(1) = \mathbb E[ r(s_0,a_0) + r(s_1)]\)
- \(= -0.5(1-\theta) + 1 (\theta (1-p) + (1-\theta))\)

[Figure: two-state MDP over states 0 and 1; stay keeps state 0 w.p. 1; from state 1, stay remains w.p. \(p\) and moves to 0 w.p. \(1-p\); switch moves to the other state w.p. 1]
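A sketch of estimating \(V^{\pi_\theta}(1)\) for this one-step example by sampling trajectories (the values of \(p\) and \(\theta\) below are arbitrary choices for illustration); sampling estimates \(J(\theta)\), but it does not directly give \(\nabla_\theta J(\theta)\):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3       # assumed transition parameter (illustrative)
theta = 0.6   # pi_theta(stay | 1) = theta

def rollout():
    stay = rng.random() < theta                 # a0 = stay w.p. theta, switch w.p. 1 - theta
    r0 = 0.0 if stay else -0.5                  # r(s0, a0): penalty for switching
    if stay:
        s1 = 1 if rng.random() < p else 0       # stay keeps s = 1 w.p. p, moves to 0 w.p. 1 - p
    else:
        s1 = 0                                  # switch moves to 0 w.p. 1
    r1 = 1.0 if s1 == 0 else 0.0                # r(s1): bonus for being in state 0
    return r0 + r1

estimate = np.mean([rollout() for _ in range(100_000)])
closed_form = -0.5 * (1 - theta) + (theta * (1 - p) + (1 - theta))
print(estimate, closed_form)                    # the Monte Carlo estimate matches the closed form
```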
Agenda
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Derivative-Free Optimization
- Setting: we can query \(J(\theta_i)\) but not \(\nabla J(\theta_i)\).
- Simple idea for finding ascent direction: sample random directions, test them, and see which lead to an increase
- Many variations on this idea: simulated annealing, cross-entropy method, genetic algorithms, evolutionary strategies
[Figure: level sets of a 2D quadratic function in \((\theta_1,\theta_2)\), showing a point \(\theta_0\), sampled test points, and the resulting ascent directions]
1) Random Search
- Recall the finite difference approximation:
- In one dimension: \(J'(\theta) \approx \frac{J(\theta+\delta)-J(\theta-\delta)}{2\delta} \)
- For multivariate functions, \(\langle \nabla J(\theta), v\rangle \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta} \)
Algorithm: Random Search
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- sample \(v\sim \mathcal N(0,I)\)
- \(\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta}( J(\theta_i+\delta v) - J(\theta_i - \delta v))v\)
1) Random Search
- For multivariate functions, \(\langle \nabla J(\theta), v\rangle \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta} \)
Algorithm: Random Search
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- sample \(v\sim \mathcal N(0,I)\)
- \(\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v\)
- We can understand this as SGA
- \(\mathbb E[(J(\theta_i+\delta v) - J(\theta_i - \delta v))v|\theta_i] \approx \mathbb E[2\delta \nabla J(\theta_i)^\top v v|\theta_i] \)
- \(=\mathbb E[2\delta v v^\top \nabla J(\theta_i)|\theta_i] = 2\delta \mathbb E[v v^\top] \nabla J(\theta_i) = 2\delta \nabla J(\theta_i) \)
Random Search Example
- \(J(\theta) = -\theta^2 - 1\)
- gradient estimate: \(\nabla J(\theta) \approx \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))v\)
- start with \(\theta\) positive
- suppose \(v\) is positive
- then \(J(\theta+\delta v)<J(\theta-\delta v)\)
- therefore \(g\) is negative
- indeed, \(\nabla J(\theta) = -2\theta<0\) when \(\theta>0\)
[Figure: plot of \(J(\theta) = -\theta^2 - 1\) versus \(\theta\)]
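A minimal sketch of random search on this example, extending \(J(\theta) = -\theta^2 - 1\) to vector \(\theta\) so the same code runs in any dimension (step size and \(\delta\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):
    # the slide's example objective, extended to vectors: J(theta) = -||theta||^2 - 1
    return -np.sum(theta ** 2) - 1.0

theta = np.array([2.0, -1.5])
alpha, delta = 0.05, 0.1
for i in range(500):
    v = rng.normal(size=theta.shape)            # sample direction v ~ N(0, I)
    g = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v  # two-point estimate
    theta = theta + alpha * g                   # ascent step with the estimated gradient

print(theta)  # approaches the maximizer at 0
```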
2) Importance Weighting
- Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
- E.g. in reinforcement learning \(V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]\)
Fact: The gradient \(\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\nabla_\theta \left[\log P_\theta(z) \right] h(z)\right]\)
- Proof: pick an arbitrary distribution $$\rho\in \Delta(\mathcal Z)\quad \text{s.t.} \quad \frac{P_\theta(z)}{\rho(z)}<\infty $$
- Then \(\mathbb E_{z\sim P_\theta}[h(z)] = \sum_{z\in\mathcal Z} h(z) P_\theta(z) \cdot \frac{\rho(z)}{\rho(z)} = \mathbb E_{z\sim \rho}[h(z) \frac{P_\theta(z) }{\rho(z)}] \)
- general principle: reweight by ratio of probability distributions (PSet 5)
- The gradient \(\nabla J(\theta) = \nabla_\theta \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim \rho}[h(z) \frac{\nabla_\theta P_\theta(z) }{\rho(z)}] \)
- Set \(\rho = P_\theta\) and notice that \(\nabla_\theta \left[\log P_\theta(z) \right] = \frac{\nabla_\theta P_\theta(z) }{P_\theta(z)}\)
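A quick numerical check of this fact (an illustration assuming the Gaussian family \(P_\theta = \mathcal N(\theta, 1)\) and \(h(z) = -z^2\), for which \(J(\theta) = -\theta^2 - 1\) and \(\nabla J(\theta) = -2\theta\)):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
z = rng.normal(theta, 1.0, size=1_000_000)     # z ~ P_theta = N(theta, 1)
score = z - theta                              # grad_theta log P_theta(z) for the Gaussian family
print(np.mean(score * (-z ** 2)), -2 * theta)  # Monte Carlo estimate of grad J vs. exact value
```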
2) Importance Weighting
- Suppose that \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\)
- E.g. in reinforcement learning \(V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]\)
- Fact: The gradient \(\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\underbrace{\nabla_\theta \left[\log P_\theta(z) \right]}_{\text{score}}\, h(z)\right]\)
- SGA-inspired algorithm:
Algorithm: Monte-Carlo DFO
- Initialize \(\theta_0\)
- For \(i=0,1,...\):
- sample \(z\sim P_{\theta_i}\)
- \(\theta_{i+1} = \theta_i + \alpha\left[\nabla_{\theta}\log P_\theta(z)\right]_{\theta=\theta_i} h(z)\)
Importance Weighting Example
- \(P_\theta = \mathcal N(\theta, 1)\), \(h(z) = -z^2\)
- \(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2]\) (\(=-\theta^2 - 1\))
- \(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), so the score is \(\nabla_\theta \log P_\theta(z)= (z-\theta)\)
- gradient estimate: \(\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(z))\, h(z)\)
- start with \(\theta\) positive
- suppose \(z>\theta\)
- then the score is positive
- therefore \(g\) is negative (since \(h(z)<0\))
[Figure: density of \(P_\theta = \mathcal N(\theta, 1)\) over \(z\)]
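A minimal sketch of Monte-Carlo DFO on this example (step size and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

h = lambda z: -z ** 2        # h(z) from the example
theta = 2.0                  # start with theta positive
alpha = 0.001
for i in range(20_000):
    z = rng.normal(theta, 1.0)             # sample z ~ P_theta = N(theta, 1)
    score = z - theta                      # grad_theta log P_theta(z) evaluated at theta_i
    theta = theta + alpha * score * h(z)   # theta_{i+1} = theta_i + alpha * score * h(z)

print(theta)  # fluctuates around the maximizer theta = 0
```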
Recap
- PSet released tonight
- PA was released Fri
- Maxima and critical points
- Stochastic gradient ascent
- Next lecture: Policy Optimization