Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
[Figure: agent-environment interaction loop — the policy maps state to action, and data/experience are collected from the environment]
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
[Figure: plot of \(J(\theta)\) versus \(\theta\), with maximizer \(\theta^\star\)]
Consider a function \(J(\theta) :\mathbb R^d \to\mathbb R\).
[Figure: examples of functions with a single global max, many global maxima, and a local max versus a global max]
[Figure: a 2D quadratic function over \((\theta_1, \theta_2)\) and its level sets]
Algorithm: Gradient Ascent
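A minimal sketch of the gradient ascent update, assuming the gradient of \(J\) is available in closed form; the step size alpha, iteration count T, and the quadratic example are illustrative choices:

import numpy as np

def gradient_ascent(grad_J, theta0, alpha=0.1, T=100):
    # Repeatedly step in the gradient direction: theta <- theta + alpha * grad J(theta).
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * grad_J(theta)
    return theta

# Example: J(theta) = -||theta||^2 has gradient -2*theta and unique global max at theta = 0.
theta_hat = gradient_ascent(lambda th: -2 * th, theta0=[1.0, -2.0])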
[Figure: a saddle point; examples of concave and non-concave functions]
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
Algorithm: Stochastic Gradient Ascent (SGA)
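A minimal sketch of SGA, assuming access to an unbiased (but noisy) gradient estimate; the Gaussian noise model, step size, and iteration count below are illustrative:

import numpy as np

def stochastic_gradient_ascent(grad_est, theta0, alpha=0.05, T=1000, rng=None):
    # Same update as gradient ascent, but using a stochastic estimate of grad J(theta).
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        g = grad_est(theta, rng)        # unbiased estimate of grad J(theta)
        theta = theta + alpha * g
    return theta

# Example: noisy gradient of J(theta) = -||theta||^2 (true gradient -2*theta plus Gaussian noise).
noisy_grad = lambda th, rng: -2 * th + rng.normal(scale=0.5, size=th.shape)
theta_hat = stochastic_gradient_ascent(noisy_grad, theta0=[1.0, -2.0])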
[Figure: SGA iterates on a 2D quadratic function over \((\theta_1, \theta_2)\) and its level sets]
Theorem: Suppose that \(J\) is differentiable and "well behaved" (e.g. \(\beta\)-smooth).
Then for SGA with independent gradient estimates which are unbiased and have variance at most \(\sigma^2\), choosing the step size \(\alpha\) below gives
$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$
\(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\)
[Figure: two-state MDP over states \(0\) and \(1\) with actions stay and switch; switch transitions with probability \(1\), while stay transitions are labeled with probabilities \(1\), \(1-p\), and \(p\)]
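As a small check of the reward definition above, the sketch below computes the discounted return \(\sum_t \gamma^t r(s_t, a_t)\) along a given trajectory; the trajectory and \(\gamma = 0.9\) are illustrative, and no transition model is needed for this computation:

def reward(s, a):
    # r(s, a) = -0.5 * 1{a == switch} + 1{s == 0}
    return -0.5 * (a == "switch") + 1.0 * (s == 0)

def discounted_return(trajectory, gamma=0.9):
    # Sum of gamma^t * r(s_t, a_t) over a finite trajectory of (state, action) pairs.
    return sum(gamma**t * reward(s, a) for t, (s, a) in enumerate(trajectory))

# Example trajectory: start in state 1, switch to 0, then stay twice.
tau = [(1, "switch"), (0, "stay"), (0, "stay")]
print(discounted_return(tau))  # -0.5 + 0.9*1.0 + 0.81*1.0 = 1.21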
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
[Figure: a 2D quadratic function over \((\theta_1, \theta_2)\) and its level sets]
Algorithm: Two Point Random Search
\(\nabla J(\theta) \approx g = \frac{1}{2\delta}\left(J(\theta + \delta v) - J(\theta - \delta v)\right)v\)
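A minimal sketch of two point random search using the estimator above, assuming \(J\) can be evaluated at arbitrary points; the Gaussian choice of direction \(v\), the smoothing radius delta, and the step size are illustrative:

import numpy as np

def two_point_gradient_estimate(J, theta, delta=0.1, rng=None):
    # g = (J(theta + delta*v) - J(theta - delta*v)) / (2*delta) * v for a random direction v.
    rng = np.random.default_rng() if rng is None else rng
    v = rng.normal(size=theta.shape)
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v

def two_point_random_search(J, theta0, alpha=0.05, delta=0.1, T=500, rng=None):
    # Gradient ascent driven only by evaluations of J (no analytic gradient).
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * two_point_gradient_estimate(J, theta, delta, rng)
    return theta

# Example: maximize J(theta) = -||theta||^2 using only function values.
theta_hat = two_point_random_search(lambda th: -np.sum(th**2), theta0=[1.0, -2.0])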
[Figure: plot of \(J(\theta) = -\theta^2 - 1\) versus \(\theta\)]
Algorithm: One Point Random Search
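The one point variant uses a single evaluation of \(J\) per step. Below is a minimal sketch using the standard one point estimator \(g = \frac{1}{\delta} J(\theta + \delta v)\, v\); this specific form is an assumption (it is unbiased for the gradient of a smoothed version of \(J\) when \(v\) is Gaussian) and it has higher variance than the two point estimator:

import numpy as np

def one_point_gradient_estimate(J, theta, delta=0.1, rng=None):
    # g = J(theta + delta*v) / delta * v; unbiased for the gradient of a smoothed version of J.
    rng = np.random.default_rng() if rng is None else rng
    v = rng.normal(size=theta.shape)
    return J(theta + delta * v) / delta * v

def one_point_random_search(J, theta0, alpha=0.01, delta=0.1, T=2000, rng=None):
    # One evaluation of J per iteration; noisier updates than the two point version.
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * one_point_gradient_estimate(J, theta, delta, rng)
    return theta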
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
5. Random Policy Search
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ] $$
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}\)
Goal: achieve high expected cumulative reward:
$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
Assume that we can "rollout" policy \(\pi_\theta\) to observe:
a sample \(\tau\) from \(\mathbb P^{\pi_\theta}_{\mu_0}\)
the resulting cumulative reward \(R(\tau)\)
Note: we do not need to know \(P\)! (Also easy to extend to not knowing \(r\)!)
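Putting the pieces together, a minimal sketch of random policy search: two point random search applied to \(J(\theta)\), where each function evaluation is replaced by a rollout of the perturbed policy. The rollout interface below is a hypothetical placeholder for whatever simulator or environment access is available:

import numpy as np

def policy_random_search(rollout, theta0, alpha=0.01, delta=0.1, T=1000, rng=None):
    # Two point random search on J(theta), estimating J by rolling out pi_theta.
    # rollout(theta) is assumed to return the cumulative reward R(tau) of one trajectory
    # sampled by running pi_theta in the environment (no knowledge of P or r is needed).
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(T):
        v = rng.normal(size=theta.shape)            # random perturbation direction
        J_plus = rollout(theta + delta * v)         # R(tau) under pi_{theta + delta v}
        J_minus = rollout(theta - delta * v)        # R(tau) under pi_{theta - delta v}
        g = (J_plus - J_minus) / (2 * delta) * v    # two point gradient estimate of J
        theta = theta + alpha * g                   # stochastic gradient ascent step
    return theta

Averaging several rollouts per perturbed policy would reduce the variance of each gradient estimate, at the cost of more interaction with the environment.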
Random Search Policy Optimization
By Sarah Dean