CS 4/5789: Introduction to Reinforcement Learning

Lecture 15: Optimization Overview

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • PSet 5 due Friday
    • PA 3 due 3/29
    • 5789: Paper assignments posted on Canvas
  • Prelims
    • Regrade requests open until tonight at 11:59pm
    • Next prelim is Wednesday, April 10

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

5. Random Policy Search

Recap: Value-based RL

[Diagram: the agent-environment loop — the policy maps states to actions, and the resulting experience is collected as data]

  • We use a dataset containing a trajectory \(\tau=(s_i,a_i)_{i=0}^n\)
  • Fitted Value Iteration: use supervised learning to update \(Q^{k+1}\)
    • \(y_i=r(s_i,a_i)+\gamma\max_a Q^k(s_{i+1},a)\)
  • Fitted Policy Iteration: use supervised learning for Policy Evaluation
    • iterative: \(y_i=r(s_i,a_i)+\gamma Q^k(s_{i+1},a_{i+1})\)
    • direct: \(y_i = \sum_{\ell=i}^{i+h_i}r(s_\ell,a_\ell) \)
  • Ultimate Goal: find (near) optimal policy
  • Value-based RL estimates intermediate quantities
    • \(Q^{\pi}\) or \(Q^{\star}\) are only indirectly useful for finding the optimal policy
  • Idea: optimize policy without relying on intermediaries:
    • objective as a function of policy: \(J(\pi) = \mathbb E_{s\sim \mu_0}[V^\pi(s)]\)
    • For parametric (e.g. deep) policy \(\pi_\theta\): $$J(\theta) = \mathbb E_{s\sim \mu_0}[V^{\pi_\theta}(s)]$$

Preview: Policy Optimization

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

5. Random Policy Search


Motivation: Optimization

  • So far, we have discussed tabular and quadratic optimization
    • np.amin(J, axis=1)
    • for \(J(\theta) = a\theta^2 + b\theta +c\), maximum \(\theta^\star = -\frac{b}{2a}\)
  • Today, we discuss strategies for arbitrary differentiable functions (even ones that are unknown!)

[Figure: a parabola \(J(\theta)\) plotted against \(\theta\), with its maximum at \(\theta^\star\)]

Maxima

Consider a function \(J(\theta) :\mathbb R^d \to\mathbb R\).

  • Def: A global maximum is a point \(\theta_\star\) such that \(J(\theta_\star)\geq J(\theta)\) for all \(\theta\in\mathbb R^d\). A local maximum is a point satisfying the inequality for all \(\theta\) s.t. \(\|\theta_\star-\theta\|\leq \epsilon\) for some \(\epsilon>0\).
[Figures: three examples — a function with a single global max, a function with many global maxima, and a function with both a global max and a local max]

Ascent Directions


  • Definition: An ascent direction at \(\theta_0\) is any \(v\) such that \(J(\theta_0+\alpha v)\geq J(\theta_0)\) for all \(0<\alpha<\alpha_0\) for some \(\alpha_0>0\).
    • ascent directions help us find maxima
  • The gradient of a differentiable function is the direction of steepest ascent

[Figure: a 2D quadratic function and its level sets in the \((\theta_1,\theta_2)\) plane, marking a point \(\theta_0\), its ascent directions, and the gradient \(\nabla J(\theta_0)\)]

Gradient Ascent

  • GA is a first order method because at each iteration, it locally maximizes a first order approximation $$J(\theta) \approx J(\theta_i) + \nabla J(\theta_i)^\top(\theta-\theta_i)$$
    • the RHS is maximized when \(\theta-\theta_i\) is parallel to \(\nabla J(\theta_i)\)
    • step size \(\alpha\) prevents \(\theta_{i+1}\) from moving too far away from \(\theta_i\), where approximation would be inaccurate

Algorithm: Gradient Ascent

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • \(\theta_{i+1} = \theta_i + \alpha\nabla J(\theta_i)\)
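To make the update concrete, here is a minimal sketch of gradient ascent on an illustrative concave quadratic (the objective, step size, and iteration count are chosen for the example, not taken from the slides):

```python
# Illustrative concave objective J(theta) = -(theta - 3)^2, with gradient -2(theta - 3)
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta = 0.0      # initialize theta_0
alpha = 0.1      # step size
for i in range(100):
    theta = theta + alpha * grad_J(theta)   # theta_{i+1} = theta_i + alpha * grad J(theta_i)

print(theta)     # converges to the global maximum theta* = 3
```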

Critical Points

  • The gradient is equal to zero at a local maximum
    • by definition, no ascent directions at local max
  • Def: a critical point is a point \(\theta_0\) where \(\nabla J(\theta_0) = 0\)
  • Critical points are fixed points of the gradient ascent algorithm
  • Critical points can be (local or global) maxima, (local or global) minima, or saddle points
[Figures: examples of critical points, including a saddle point]

Concave Functions

  • A function is concave if the line segment connecting any two points on its graph lies on or below the graph
  • If \(J\) is concave, then \(\nabla J(\theta_0)=0 \implies \theta_0\) is a global maximum
[Figures: two concave functions and one non-concave function]

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

5. Random Policy Search

Stochastic Gradient Ascent

  • Rather than exact gradients, SGA uses unbiased estimates of the gradient \(g_i\), i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$

Algorithm: SGA

  • Initialize \(\theta_0\); For \(i=0,1,...\):
    • \(\theta_{i+1} = \theta_i + \alpha g_i\)
[Figure: level sets of a 2D quadratic function in the \((\theta_1,\theta_2)\) plane, comparing the iterates of gradient ascent and stochastic gradient ascent]

Example: Linear Regression

  • Supervised learning with linear functions \(\theta^\top x\) and dataset of \(N\) training examples $$\min_\theta \underbrace{\textstyle \frac{1}{N} \sum_{j=1}^N (\theta^\top x_j - y_j)^2}_{J(\theta)}$$
  • The gradient is \(\nabla J(\theta_i)= \frac{1}{N} \sum_{j=1}^N 2(\theta_i^\top x_j - y_j)x_j\)
  • Training with SGD means sampling \(j\) uniformly and
    • \(g_i = \nabla (\theta_i^\top x_j - y_j)^2 =  2(\theta_i^\top x_j - y_j)x_j\)
  • Verifying that \(g_i\) is unbiased:
    • \(\mathbb E[g_i] =  \mathbb E[2(\theta_i^\top x_j - y_j)x_j] = \sum_{j=1}^N \frac{1}{N} 2(\theta_i^\top x_j - y_j)x_j=\nabla J(\theta_i)\)
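A small numpy sketch of this update on synthetic data (the data-generating model, step size, and iteration count are illustrative, not from the slides); since the slide's objective is a minimization, the update subtracts the gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (illustrative): y = theta_true^T x + noise
N, d = 1000, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
alpha = 0.01
for i in range(5000):
    j = rng.integers(N)                       # sample j uniformly from {0, ..., N-1}
    g = 2.0 * (theta @ X[j] - y[j]) * X[j]    # single-example gradient estimate g_i
    theta = theta - alpha * g                 # SGD step (minimizing J)

print(np.linalg.norm(theta - theta_true))     # should be small
```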

SGA Convergence

  • SGA converges to critical points (when \(J\) is concave, this means that SGA converges to the global max)
  • Specifically, for "well behaved" function \(J\) and "well chosen" step size, running SGA for \(T\) iterations:
    • the average gradient norm \(\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2\) goes to zero in expectation
    • at a rate of \(\sqrt{\sigma^2/T}\), where \(\sigma^2\) is the variance of \(g_i\)
  • Example: mini-batching is a strategy to reduce variance when training supervised learning algorithms with SGD

Additional Details

Theorem: Suppose that \(J\) is differentiable and "well behaved"

  • i.e. \(\beta\) smooth and \(M\) bounded
  • i.e. \(\|\nabla J(\theta)-\nabla J(\theta')\|_2\leq\beta\|\theta-\theta'\|_2\) and \(\sup_\theta J(\theta)\leq M\).

Then for SGA with independent gradient estimates which are

  • unbiased, \(\mathbb E[g_i] = \nabla J(\theta_i)\), with bounded second moment \(\mathbb E[\|g_i\|_2^2] \leq \sigma^2\)

$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$

Example: Minibatching

  • Continuing linear regression example
  • Minibatching means sampling \(j_1, j_2, ..., j_M\) uniformly and
    • \(g_i = \nabla \frac{1}{M} \sum_{\ell=1}^M (\theta_i^\top x_{j_\ell} - y_{j_\ell})^2 =  \frac{2}{M} \sum_{\ell=1}^M(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell}\)
  • Same argument verifies that \(g_i\) is unbiased
  • Variance:
    • \(\mathbb E[\|g_i - \nabla J(\theta_i)\|_2^2] = \frac{1}{M^2}\sum_{\ell=1}^M \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] = \frac{\sigma^2}{M}\)
  • Where we define \(\sigma^2 = \mathbb E[\|2(\theta_i^\top x_{j_\ell} - y_{j_\ell})x_{j_\ell} - \nabla J(\theta_i)\|_2^2] \) to be the variance of a single data-point estimate
  • Larger minibatch size \(M\) means lower variance!
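A quick empirical check of this \(\sigma^2/M\) scaling (reusing the illustrative synthetic-data setup from the SGD sketch above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same illustrative synthetic data as before
N, d = 1000, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)                              # fixed point at which we compare estimators
full_grad = 2.0 * X.T @ (X @ theta - y) / N      # exact gradient of J at theta

def minibatch_grad(M):
    idx = rng.integers(N, size=M)                # sample j_1, ..., j_M uniformly
    return 2.0 * X[idx].T @ (X[idx] @ theta - y[idx]) / M

for M in [1, 10, 100]:
    errs = [np.sum((minibatch_grad(M) - full_grad) ** 2) for _ in range(2000)]
    print(M, np.mean(errs))                      # shrinks roughly like sigma^2 / M
```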

Gradients in RL

  • Can we use sampled trajectories to estimate the gradient of \(J(\theta) = V^{\pi_\theta}(s_0)\) analogous to SGD for supervised learning?
  • Sampled trajectories can estimate \(V^{\pi_\theta}(s_0)\) as we saw in the past several lectures
  • They cannot be used to estimate \(\nabla_\theta V^{\pi_\theta}(s_0)\) analogously to supervised learning
    • SL: \(\theta\to \)loss function\(\to J(\theta)\) and loss function is known!
    • RL: \(\theta\to P\to d_{s_0}^{\pi_\theta} \to V^{\pi_\theta}(s_0)=J(\theta)\) but \(P\) is unknown :(

Gradients in RL

  • Can we use sampled trajectories to estimate the gradient of \(J(\theta) = V^{\pi_\theta}(s_0)\) analogous to SGD for supervised learning?
  • Simple example: consider \(s_0=1\), \(\pi_\theta(0) =\) stay, and
    \(\pi_\theta(a|1) = \begin{cases}\mathsf{stay} & \text{w.p.} ~\theta \\ \mathsf{switch} & \text{w.p.} ~1-\theta\end{cases}\)

\(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\)

  • One step horizon:
  • \(V^{\pi_\theta}(1) = \mathbb E[ r(s_0,a_0) + r(s_1)]\)
  • \(= -0.5(1-\theta) + 1 (\theta (1-p) + (1-\theta))\)

[Diagram: two-state MDP with states 0 and 1. From state 1, stay leads to state 0 w.p. \(1-p\) and stays in state 1 w.p. \(p\), while switch leads to state 0 w.p. 1; from state 0, stay remains in state 0 w.p. 1 and switch leads to state 1 w.p. 1.]
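Expanding the expression above:

$$V^{\pi_\theta}(1) = -0.5(1-\theta) + \theta(1-p) + (1-\theta) = 0.5 + (0.5 - p)\,\theta \quad\implies\quad \nabla_\theta V^{\pi_\theta}(1) = 0.5 - p,$$

so even in this two-state example, the gradient depends on the unknown transition probability \(p\).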

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

5. Random Policy Search

Derivative Free Optimization


  • Setting: we can query \(J(\theta_i)\) but not \(\nabla J(\theta_i)\). Brainstorm!
  • Simple idea for finding ascent direction: sample random directions, test them, and see which lead to an increase
  • Many variations on this idea: simulated annealing, cross-entropy method, genetic algorithms, evolutionary strategies

[Figure: a 2D quadratic function and its level sets, showing a point \(\theta_0\), randomly sampled test points, and the resulting ascent directions]

Random Finite Difference Approx.

  • Recall the finite difference approximation:
    • In one dimension: \(J'(\theta) \approx \frac{J(\theta+\delta)-J(\theta-\delta)}{2\delta} \)
    • Partial derivatives \(\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta+\delta e_i)-J(\theta-\delta e_i)}{2\delta} \)
  • How have we used this before?
    • Nonlinear control: we constructed gradients via finite differences along each coordinate direction \(e_i\)
  • More generally: $$\nabla J(\theta)^\top v \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta} $$
  • Thought experiment: use random \(v\)
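A quick numerical sanity check of the directional finite-difference formula on an illustrative function (for a quadratic, the centered difference is exact):

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):                         # illustrative smooth objective
    return -np.sum((theta - 1.0) ** 2)

def grad_J(theta):                    # its exact gradient, for comparison
    return -2.0 * (theta - 1.0)

theta = rng.normal(size=3)
v = rng.normal(size=3)
delta = 1e-4

fd = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta)
print(fd, grad_J(theta) @ v)          # the two values agree
```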

Random Search (two-point)

  • For multivariate functions, \( \nabla J(\theta)^\top v \approx \frac{J(\theta+\delta v)-J(\theta-\delta v)}{2\delta} \)

Algorithm: Two Point Random Search

  • Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • sample \(v\sim \mathcal N(0,I)\)
    • \(\theta_{i+1} = \theta_i + \frac{\alpha}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v\)
  • \(\mathbb E[(J(\theta_i+\delta v) - J(\theta_i - \delta v))v|\theta_i] \approx \mathbb E[2\delta \nabla J(\theta_i)^\top v v|\theta_i] \)
    • \(=\mathbb E[2\delta v v^\top  \nabla J(\theta_i)|\theta_i] = 2\delta \mathbb E[v v^\top] \nabla J(\theta_i) = 2\delta \nabla J(\theta_i) \)
  • We can understand this as SGA with \(g_i=\frac{1}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v\)
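A minimal sketch of two-point random search on an illustrative objective; note that only function evaluations of \(J\) are used (the objective and hyperparameters are chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):                          # illustrative objective, maximized at theta* = (1, ..., 1)
    return -np.sum((theta - 1.0) ** 2)

d = 5
theta = np.zeros(d)
alpha, delta = 0.02, 0.1
for i in range(2000):
    v = rng.normal(size=d)                                            # v ~ N(0, I)
    g = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v
    theta = theta + alpha * g                                         # SGA step with estimate g_i

print(theta)                           # close to the maximizer (1, ..., 1)
```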

PollEV

\(\nabla J(\theta) \approx g = \frac{1}{2\delta} \big(J(\theta+\delta v) - J(\theta-\delta v)\big)v\)

[Figure: the parabola \(J(\theta) = -\theta^2 - 1\)]

Random Search Example

  • start with \(\theta\) positive
  • suppose \(v\) is positive
  • then \(J(\theta+\delta v)<J(\theta-\delta v)\)
  • therefore \(g\) is negative
  • indeed, \(\nabla J(\theta) = -2\theta<0\) when \(\theta>0\)
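Plugging in illustrative numbers, say \(\theta = 1\), \(v = 1\), \(\delta = 0.1\):

$$g = \frac{J(1.1) - J(0.9)}{2(0.1)}\cdot 1 = \frac{-2.21 - (-1.81)}{0.2} = -2 = \nabla J(1),$$

matching the exact gradient (the centered difference is exact for a quadratic).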

Random Search (one point)

Algorithm: One Point Random Search

  • Initialize \(\theta_0\). For \(i=0,1,...\):
    • sample \(v\sim \mathcal N(0,I)\)
    • \(\theta_{i+1} = \theta_i + \frac{\alpha}{\delta}J(\theta_i+\delta v) v\)
  • Claim: This is SGA with \(g_i=\frac{1}{\delta} J(\theta_i+\delta v) v\)
  • One-sided (off-centered) finite difference approx.: \(\nabla J(\theta)^\top v \approx \frac{J(\theta+\delta v)-J(\theta)}{\delta} \)
  • First notice \(\mathbb E[(J(\theta_i+\delta v) - J(\theta_i))v|\theta_i]=\mathbb E[J(\theta_i+\delta v)v|\theta_i] - \mathbb E[J(\theta_i) v|\theta_i]=\mathbb E[J(\theta_i+\delta v)v|\theta_i]-0\)

  • \(\mathbb E[(J(\theta_i+\delta v) - J(\theta_i))v|\theta_i]=\mathbb E[J(\theta_i+\delta v)v|\theta_i]\)
  • \(\mathbb E[(J(\theta_i+\delta v) - J(\theta_i))v|\theta_i] \approx \mathbb E[\delta \nabla J(\theta_i)^\top v v|\theta_i] \)
    • \(=\mathbb E[\delta v v^\top  \nabla J(\theta_i)|\theta_i] = \delta \mathbb E[v v^\top] \nabla J(\theta_i) = \delta \nabla J(\theta_i) \)
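Both estimators target \(\nabla J(\theta_i)\) in expectation, but the one-point estimate typically has much higher variance, because \(J(\theta_i+\delta v)\) itself (not just a difference) multiplies \(v\). A quick empirical comparison on an illustrative quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):                              # illustrative objective
    return -np.sum((theta - 1.0) ** 2)

d, delta = 5, 0.1
theta = np.zeros(d)
grad = -2.0 * (theta - 1.0)                # exact gradient at theta, for reference

one_pt, two_pt = [], []
for _ in range(100_000):
    v = rng.normal(size=d)
    one_pt.append(J(theta + delta * v) / delta * v)
    two_pt.append((J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v)

for name, ests in [("one-point", one_pt), ("two-point", two_pt)]:
    ests = np.array(ests)
    # sample means approximate grad in both cases; the squared error is far larger for one-point
    print(name, ests.mean(axis=0).round(2), np.mean(np.sum((ests - grad) ** 2, axis=1)).round(1))
```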

Agenda

1. Recap

2. Maxima and Critical Points

3. Stochastic Gradient Ascent

4. Derivative-Free Optimization

 5. Random Policy Search 

Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$

  • Recall notation for a trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\) and probability of a trajectory \(\mathbb P^{\pi}_{\mu_0}\)
  • Define cumulative reward \(R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)\)
  • For parametric (e.g. deep) policy \(\pi_\theta\), the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right] $$

Policy Optimization Setting

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)

  • Goal: achieve high expected cumulative reward:

    $$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

  • Assume that we can "rollout" policy \(\pi_\theta\) to observe:

    • a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)

    • the resulting cumulative reward \(R(\tau)\)

  • Note: we do not need to know \(P\)! (Also easy to extend to not knowing \(r\)!)


Policy Opt. with Random Search

Algorithm: Random Search Policy Optimization

  • Given \(\alpha, \delta\). Initialize \(\theta_0\)
  • For \(i=0,1,...\):
    • Sample \(v\sim \mathcal N(0, I)\)
    • Rollout policy \(\pi_{\theta_i+ \delta v}\) and observe trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\)
    • Estimate \(g_i = \frac{1}{\delta} R(\tau) v\)
    • Update \(\theta_{i+1} = \theta_i + \alpha g_i\)
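A sketch of this loop in Python, assuming a hypothetical rollout(theta) function that runs \(\pi_\theta\) for one episode and returns the observed cumulative reward \(R(\tau)\) (this is not a specific library API, and the default hyperparameters are illustrative):

```python
import numpy as np

def random_search_policy_opt(rollout, d, alpha=0.01, delta=0.05, iters=1000, seed=0):
    """One-point random search over policy parameters theta in R^d.

    `rollout(theta)` is assumed to execute pi_theta for one episode and return
    the cumulative reward R(tau); neither P nor r needs to be known.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)                      # initialize theta_0
    for i in range(iters):
        v = rng.normal(size=d)               # sample v ~ N(0, I)
        R = rollout(theta + delta * v)       # rollout pi_{theta_i + delta v}, observe R(tau)
        g = (R / delta) * v                  # gradient estimate g_i
        theta = theta + alpha * g            # SGA update
    return theta
```

In practice, averaging several rollouts per perturbation (or using the two-point variant with \(\pm\delta v\)) reduces the variance of \(g_i\).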


Recap

  • PSet due Friday
  • PA due next Friday

 

  • Maxima and critical points
  • Stochastic gradient ascent
  • Random search

 

  • Next lecture: Policy Optimization