Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
    • PA 3 released Friday, due 3/31
    • PSet 4 released Wednesday
    • 5789 Paper Reviews due weekly on Mondays
• Prelim
    • Regrade requests open until Wednesday 11:59pm

## Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

## Recap: Value-based RL

[Diagram: the policy selects actions and receives states and rewards from the environment; this experience is collected into a dataset.]

Key components of a value-based RL algorithm:

1. Rollout policy
2. Construct/update dataset
3. Learn/update Q function

## Recap: Comparison

| Algorithm | Framework | Approach | Q target | Data |
|---|---|---|---|---|
| 1. PI with MC | Data-driven PI | Alternate learning $$Q^\pi$$ w/ improving $$\pi$$ | $$\sum_{k=t}^{h} r_k$$ | On policy |
| 2. PI with TD | Data-driven PI | Alternate learning $$Q^\pi$$ w/ improving $$\pi$$ | $$r_t+\gamma \hat Q_{i}(s_{t+1}, a_{t+1})$$ | On policy |
| 3. Q-learning | Data-driven VI | Alternate learning $$Q^\star$$ w/ updating $$\pi$$ | $$r_t+\gamma \hat Q_{i}(s_{t+1}, a_\star)$$ | Off policy |
• Ultimate Goal: find (near) optimal policy
• Value-based RL estimates intermediate quantities
• $$Q^{\pi}$$ or $$Q^{\star}$$ are only indirectly useful for finding the optimal policy
• Imitation learning has no intermediaries, but requires data from an expert policy
• Idea: optimize policy without relying on intermediaries:
• objective as a function of policy: $$J(\pi) = \mathbb E_{s\sim \mu_0}[V^\pi(s)]$$
• For parametric (e.g. deep) policy $$\pi_\theta$$: $$J(\theta) = \mathbb E_{s\sim \mu_0}[V^{\pi_\theta}(s)]$$

## Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization


## Motivation: Optimization

• So far, we have discussed tabular and quadratic optimization
    • tabular: np.amin(J, axis=1)
    • quadratic: for $$J(\theta) = a\theta^2 + b\theta +c$$ with $$a<0$$, the maximum is $$\theta^\star = -\frac{b}{2a}$$
• Today, we discuss strategies for arbitrary differentiable functions (even ones that are unknown!)


## Maxima

Consider a function $$J(\theta) :\mathbb R^d \to\mathbb R$$.

• Def: A global maximum is a point $$\theta_\star$$ such that $$J(\theta_\star)\geq J(\theta)$$ for all $$\theta\in\mathbb R^d$$. A local maximum is a point satisfying the inequality for all $$\theta$$ s.t. $$\|\theta_\star-\theta\|\leq \epsilon$$ for some $$\epsilon>0$$.

[Plots: a function with a single global max; a function with many global maxima; a function with both a global max and a local max.]

## Ascent Directions


• Definition: An ascent direction at $$\theta_0$$ is any $$v$$ such that $$J(\theta_0+\alpha v)\geq J(\theta_0)$$ for all $$0<\alpha<\alpha_0$$ for some $$\alpha_0>0$$.
• ascent directions help us find maxima
• The gradient of a differentiable function is the direction of steepest ascent

[Figure: level sets of a quadratic in $$(\theta_1, \theta_2)$$, showing a point $$\theta_0$$, its ascent directions, and the gradient $$\nabla J(\theta_0)$$.]

• GA is a first order method because at each iteration, it locally maximizes a first order approximation $$J(\theta) \approx J(\theta_i) + \nabla J(\theta_i)^\top(\theta-\theta_i)$$
• the RHS is maximized when $$\theta-\theta_i$$ is parallel to $$\nabla J(\theta_i)$$
• step size $$\alpha$$ prevents $$\theta_{i+1}$$ from moving too far away from $$\theta_i$$, where approximation would be inaccurate

Algorithm: Gradient Ascent (GA)

• Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• $$\theta_{i+1} = \theta_i + \alpha\nabla J(\theta_i)$$
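As a concrete illustration (not from the lecture), here is a minimal numpy sketch of this update on a hypothetical concave quadratic, where the exact gradient is available:

```python
import numpy as np

# Hypothetical concave quadratic J(theta) = -||theta||^2 + b^T theta (illustrative only)
b = np.array([2.0, -1.0])
grad_J = lambda theta: -2 * theta + b      # exact gradient of J

theta = np.zeros(2)                        # initialize theta_0
alpha = 0.1                                # step size
for i in range(100):
    theta = theta + alpha * grad_J(theta)  # theta_{i+1} = theta_i + alpha * grad J(theta_i)

print(theta, b / 2)                        # converges to the unique maximizer b / 2
```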

## Critical Points

• The gradient is equal to zero at a local maximum
• by definition, no ascent directions at local max
• Def: a critical point is a point $$\theta_0$$ where $$\nabla J(\theta_0) = 0$$
• Critical points are fixed points of the gradient ascent algorithm
• Critical points can be (local or global) maxima, (local or global) minima, or saddle points

## Concave Functions

• A function is concave if the line segment connecting any two points on its graph lies entirely on or below the graph
• If $$J$$ is concave, then $$\nabla J(\theta_0)=0 \implies \theta_0$$ is a global maximum

[Plots: two examples of concave functions and one non-concave function.]

## Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

## Stochastic Gradient Ascent

• Rather than exact gradients, SGA uses unbiased estimates of the gradient $$g_i$$, i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$

Algorithm: SGA

• Initialize $$\theta_0$$; For $$i=0,1,...$$:
• $$\theta_{i+1} = \theta_i + \alpha g_i$$

[Figure: stochastic gradient ascent iterates on the level sets of a quadratic in $$(\theta_1, \theta_2)$$.]
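A minimal sketch of SGA (illustrative only): reuse the hypothetical quadratic from the sketch above and add zero-mean Gaussian noise to the exact gradient, so that $$g_i$$ is unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([2.0, -1.0])
grad_J = lambda theta: -2 * theta + b          # exact gradient, used only to build a noisy oracle

theta = np.zeros(2)
alpha = 0.05
for i in range(2000):
    g = grad_J(theta) + rng.normal(size=2)     # unbiased estimate: E[g | theta] = grad J(theta)
    theta = theta + alpha * g                  # SGA update

print(theta, b / 2)                            # hovers near the maximizer b / 2
```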

## Example: Linear Regression

• Supervised learning with linear functions $$\theta^\top x$$ and dataset of $$N$$ training examples $$\min_\theta \underbrace{\textstyle \frac{1}{N} \sum_{j=1}^N (\theta^\top x_j - y_j)^2}_{J(\theta)}$$
• The gradient is $$\nabla J(\theta_i)= \frac{1}{N} \sum_{j=1}^N 2(\theta_i^\top x_j - y_j)x_j$$
• Training with SGD means sampling $$j$$ uniformly and
• $$g_i = \nabla (\theta_i^\top x_j - y_j)^2 = 2(\theta_i^\top x_j - y_j)x_j$$
• Verifying that $$g_i$$ is unbiased:
• $$\mathbb E[g_i] = \mathbb E[2(\theta_i^\top x_j - y_j)x_j] = \sum_{j=1}^N \frac{1}{N} 2(\theta_i^\top x_j - y_j)x_j=\nabla J(\theta_i)$$
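A small numpy sketch of this procedure on synthetic data (the data and hyperparameters are illustrative, not from the lecture); note the update descends, since $$J$$ is being minimized here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative): y = theta_true^T x + noise
N, d = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
alpha = 0.01
for i in range(5000):
    j = rng.integers(N)                        # sample j uniformly from {0, ..., N-1}
    g = 2 * (theta @ X[j] - y[j]) * X[j]       # g_i = 2 (theta^T x_j - y_j) x_j, unbiased for grad J
    theta = theta - alpha * g                  # descent step, since J is minimized here

print(theta, theta_true)                       # theta approaches theta_true
```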


## SGA Convergence

• SGA converges to critical points (when $$J$$ is concave, this means that SGA converges to the global max)
• Specifically, for "well behaved" function $$J$$ and "well chosen" step size, running SGA for $$T$$ iterations:
• norm $$\| \nabla J(\theta_i)\|_2$$ converges in expectation
• at a rate of $$\sqrt{\sigma^2/T}$$ where $$\sigma^2$$ is the variance of $$g_i$$
• Example: mini-batching is a strategy to reduce variance when training supervised learning algorithms with SGD

Theorem: Suppose that $$J$$ is differentiable and "well behaved"

• i.e. $$\beta$$ smooth and $$M$$ bounded
• i.e. $$\|\nabla J(\theta)-\nabla J(\theta')\|_2\leq\beta\|\theta-\theta'\|_2$$ and $$\sup_\theta J(\theta)\leq M$$.

Then for SGA with independent gradient estimates which are

• unbiased, $$\mathbb E[g_i] = \nabla J(\theta_i)$$, and have variance $$\mathbb E[\|g_i\|_2^2] = \sigma^2$$

$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$

## Example: Minibatching

• Continuing linear regression example
• Minibatching means sampling $$j_1, j_2, ..., j_M$$ uniformly and
• $$g_i = \nabla \frac{1}{M} \sum_{\ell=1}^M (\theta^\top x_{j_\ell} - y_{j_\ell})^2 = 2\frac{1}{M} \sum_{\ell=1}^M(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell}$$
• Same argument verifies that $$g_i$$ is unbiased
• Variance:
• $$\mathbb E[\|g_i - \nabla J(\theta_i)\|_2^2] = \frac{1}{M^2}\sum_{\ell=1}^M \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] = \frac{\sigma^2}{M}$$
• Where we define $$\sigma^2 = \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2]$$ to be the variance of a single data-point estimate
• Larger minibatch size $$M$$ means lower variance!
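A quick numerical check of the $$\sigma^2/M$$ scaling on the same kind of synthetic data (function and variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 200, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)                                    # evaluate variance at an arbitrary theta

full_grad = 2 * X.T @ (X @ theta - y) / N              # exact grad J(theta) over all N examples

def minibatch_grad(M):
    j = rng.integers(N, size=M)                        # sample j_1, ..., j_M uniformly (with replacement)
    return 2 * X[j].T @ (X[j] @ theta - y[j]) / M      # unbiased estimate of grad J(theta)

for M in [1, 10, 100]:
    errs = [np.sum((minibatch_grad(M) - full_grad) ** 2) for _ in range(2000)]
    print(M, np.mean(errs))                            # empirical E||g - grad J||^2 scales like sigma^2 / M
```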

## Gradients in RL

• Can we use sampled trajectories to estimate the gradient of $$J(\theta) = V^{\pi_\theta}(s_0)$$ analogous to SGD for supervised learning?
• Sampled trajectories can estimate $$V^{\pi_\theta}(s_0)$$ as we saw in the past several lectures
• They cannot be used to estimate $$\nabla_\theta V^{\pi_\theta}(s_0)$$ analogously to supervised learning
• RL: $$\theta\to P\to d_{s_0}^{\pi_\theta} \to V^{\pi_\theta}(s_0)=J(\theta)$$ but $$P$$ is unknown :(
• SL: $$\theta\to$$loss function$$\to J(\theta)$$ and loss function is known!

## Gradients in RL

• Can we use sampled trajectories to estimate the gradient of $$J(\theta) = V^{\pi_\theta}(s_0)$$ analogous to SGD for supervised learning?
• Simple example: consider $$s_0=1$$, $$\pi_\theta(0) =$$ stay, and
$$\pi_\theta(a|1) = \begin{cases}\mathsf{stay} & \text{w.p.} ~\theta \\ \mathsf{switch} & \text{w.p.} ~1-\theta\end{cases}$$

$$r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}$$

• One step horizon:
• $$V^{\pi_\theta}(1) = \mathbb E[ r(s_0,a_0) + r(s_1)]$$
• $$= -0.5(1-\theta) + 1 (\theta (1-p) + (1-\theta))$$

[Diagram: two-state MDP over states 0 and 1. From state 1, stay moves to 0 w.p. $$1-p$$ and remains at 1 w.p. $$p$$, while switch moves to 0 w.p. $$1$$; from state 0, stay remains at 0 w.p. $$1$$.]
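A small Monte Carlo sketch of this example, assuming the transition structure in the diagram above and counting only the state bonus at $$s_1$$ (the values of $$\theta$$ and $$p$$ are illustrative): rollouts estimate $$V^{\pi_\theta}(1)$$, but nothing in the sampling gives $$\nabla_\theta V^{\pi_\theta}(1)$$ directly, since the trajectories pass through the unknown $$P$$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, p = 0.7, 0.3                                     # illustrative values

def rollout():
    # s0 = 1; a0 ~ pi_theta(.|1): stay w.p. theta, switch w.p. 1 - theta
    stay = rng.random() < theta
    r0 = 0.0 if stay else -0.5                          # r(s0, a0): s0 = 1 gives no bonus; switch costs 0.5
    if stay:
        s1 = 0 if rng.random() < 1 - p else 1           # stay: reach 0 w.p. 1 - p, remain at 1 w.p. p
    else:
        s1 = 0                                          # switch: reach 0 w.p. 1
    r1 = 1.0 if s1 == 0 else 0.0                        # state bonus at s1
    return r0 + r1

estimate = np.mean([rollout() for _ in range(100_000)])
print(estimate, -0.5 * (1 - theta) + theta * (1 - p) + (1 - theta))   # matches the closed form above
```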

## Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

## Derivative Free Optimization


• Setting: we can query $$J(\theta_i)$$ but not $$\nabla J(\theta_i)$$.
• Simple idea for finding ascent direction: sample random directions, test them, and see which lead to an increase
• Many variations on this idea: simulated annealing, cross-entropy method, genetic algorithms, evolutionary strategies

[Figure: level sets of a quadratic in $$(\theta_1, \theta_2)$$, showing a point $$\theta_0$$, random test points, and the resulting ascent directions.]

## 1) Random Search

• Recall the finite difference approximation:
• In one dimension: $$J'(\theta) \approx \frac{J(\theta+\delta)-J(\theta-\delta)}{2\delta}$$
• For multivariate functions, $$\langle \nabla J(\theta), v\rangle \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta}$$

Algorithm: Random Search

• Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• sample $$v\sim \mathcal N(0,I)$$
• $$\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta}( J(\theta_i+\delta v) - J(\theta_i - \delta v))v$$

## 1) Random Search

• For multivariate functions, $$\langle \nabla J(\theta), v\rangle \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta}$$

Algorithm: Random Search

• Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• sample $$v\sim \mathcal N(0,I)$$
• $$\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v$$
• We can understand this as SGA
• $$\mathbb E[(J(\theta_i+\delta v) - J(\theta_i - \delta v))v|\theta_i] \approx \mathbb E[2\delta \nabla J(\theta_i)^\top v v|\theta_i]$$
• $$=\mathbb E[2\delta v v^\top \nabla J(\theta_i)|\theta_i] = 2\delta \mathbb E[v v^\top] \nabla J(\theta_i) = 2\delta \nabla J(\theta_i)$$
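A quick numerical sanity check of this claim (illustrative, reusing a hypothetical quadratic): averaging the two-point estimate over many sampled directions recovers the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([2.0, -1.0])
J = lambda theta: -theta @ theta + b @ theta            # hypothetical quadratic, as in earlier sketches
grad_J = lambda theta: -2 * theta + b

theta, delta = np.array([1.0, 3.0]), 0.01
V = rng.normal(size=(50_000, 2))                        # many directions v ~ N(0, I)
ests = [(J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v for v in V]
print(np.mean(ests, axis=0), grad_J(theta))             # the average recovers grad J(theta)
```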

## Random Search Example

• $$J(\theta) = -\theta^2 - 1$$, with gradient estimate $$g = \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))\,v \approx \nabla J(\theta)$$
• start with $$\theta$$ positive
• suppose $$v$$ is positive
• then $$J(\theta+\delta v)<J(\theta-\delta v)$$
• therefore $$g$$ is negative
• indeed, $$\nabla J(\theta) = -2\theta<0$$ when $$\theta>0$$
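A minimal numerical sketch of this example (the starting point, step size, and $$\delta$$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
J = lambda theta: -theta**2 - 1.0                        # only function values of J are queried

theta = 2.0                                              # start with theta positive
alpha, delta = 0.05, 0.01
for i in range(500):
    v = rng.normal()                                                          # v ~ N(0, 1)
    g = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v       # two-point estimate of grad J
    theta = theta + alpha * g                                                 # ascent step

print(theta)                                             # approaches the maximizer theta = 0
```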

## 2) Importance Weighting

• Suppose that $$J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]$$
• E.g. in reinforcement learning $$V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]$$
• Fact: The gradient $$\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\left [\nabla_\theta \left[\log P_\theta(z) \right] h(z)\right]$$
• Proof: pick an arbitrary distribution $$\rho\in \Delta(\mathcal Z)$$ such that $$\frac{P_\theta(z)}{\rho(z)}<\infty$$ for all $$z$$
• Then $$\mathbb E_{z\sim P_\theta}[h(z)] = \sum_{z\in\mathcal Z} h(z) P_\theta(z) \cdot \frac{\rho(z)}{\rho(z)} = \mathbb E_{z\sim \rho}[h(z) \frac{P_\theta(z) }{\rho(z)}]$$
• general principle: reweight by ratio of probability distributions (PSet 5)
• The gradient $$\nabla J(\theta) = \nabla_\theta \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim \rho}[h(z) \frac{\nabla_\theta P_\theta(z) }{\rho(z)}]$$
• Set $$\rho = P_\theta$$ and notice that $$\nabla_\theta \left[\log P_\theta(z) \right] = \frac{\nabla_\theta P_\theta(z) }{P_\theta(z)}$$

## 2) Importance Weighting

• Suppose that $$J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]$$
• E.g. in reinforcement learning $$V^{\pi_\theta}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{s_0}^{\pi_\theta}}[r(s,a)]$$
• Fact: The gradient $$\nabla J(\theta) = \mathbb E_{z\sim P_\theta}\big[\underbrace{\nabla_\theta \log P_\theta(z)}_{\text{score}}\, h(z)\big]$$
• SGA-inspired algorithm:

Algorithm: Monte-Carlo DFO

• Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• sample $$z\sim P_{\theta_i}$$
• $$\theta_{i+1} = \theta_i + \alpha\left[\nabla_{\theta}\log P_\theta(z)\right]_{\theta=\theta_i} h(z)$$

## Importance Weighting Example

• Setting: $$P_\theta = \mathcal N(\theta, 1)$$ and $$h(z) = -z^2$$, so $$J(\theta) = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2] = -\theta^2 - 1$$
• Score: $$\nabla_\theta \log P_\theta(z)= z-\theta$$, so the single-sample gradient estimate is $$g = \nabla_\theta \log (P_\theta(z))\, h(z) \approx \nabla J(\theta)$$

• start with $$\theta$$ positive
• suppose $$z>\theta$$
• then score is positive
• therefore $$g$$ is negative (since $$h(z)<0$$)

$$\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2$$
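A minimal sketch of Monte-Carlo DFO on this Gaussian example (the batch size, step size, and starting point are illustrative; the lecture's algorithm uses a single sample per step):

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda z: -z**2                                      # h(z) from the example
score = lambda z, theta: z - theta                       # grad_theta log N(z; theta, 1)

theta = 2.0                                              # illustrative start
alpha = 0.01
for i in range(2000):
    z = rng.normal(loc=theta, scale=1.0, size=10)        # z ~ P_theta = N(theta, 1); small batch to cut variance
    g = np.mean(score(z, theta) * h(z))                  # unbiased estimate of grad J(theta) = -2 theta
    theta = theta + alpha * g                            # score-function ascent step

print(theta)                                             # approaches 0, the maximizer of J(theta) = -theta^2 - 1
```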

## Recap

• PSet released tonight
• PA was released Fri

• Maxima and critical points
• Stochastic gradient ascent

• Next lecture: Policy Optimization
