Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
[Figure: agent-environment interaction loop; the policy selects actions, the environment returns states and rewards, producing the data/experience used for learning.]
Key components of a value-based RL algorithm:
1. PI with MC (data-driven PI)
2. PI with TD (data-driven PI)
3. Q-learning (data-driven VI)
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
[Figure: plot of an objective \(J(\theta)\) over \(\theta\), with maximizer \(\theta^\star\).]
Consider a function \(J:\mathbb R^d \to\mathbb R\) of parameters \(\theta\in\mathbb R^d\).
[Figure: example objectives: one with a single global max, one with many global maxima, and one with both a global max and a local max.]
[Figure: a 2D quadratic function and the level sets of the quadratic, with axes \(\theta_1\) and \(\theta_2\).]
Algorithm: Gradient Ascent
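A minimal sketch of gradient ascent, \(\theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t)\), assuming gradient access; the names (`gradient_ascent`, `grad_J`) and the quadratic example are illustrative, not from the slides:

```python
import numpy as np

def gradient_ascent(grad_J, theta0, alpha=0.1, T=100):
    """Plain gradient ascent: theta <- theta + alpha * grad_J(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * grad_J(theta)
    return theta

# Example: the concave quadratic J(theta) = -||theta||^2 has gradient -2*theta
# and a unique global max at the origin.
theta_hat = gradient_ascent(lambda th: -2 * th, theta0=[3.0, -1.0])
```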
[Figure: a saddle point, and example functions labeled concave, concave, and not concave.]
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
Algorithm: SGA
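A minimal sketch of SGA, assuming an unbiased but noisy gradient estimate (here simulated by adding Gaussian noise to the true gradient of a concave quadratic; all names are illustrative):

```python
import numpy as np

def sga(grad_est, theta0, alpha=0.05, T=1000, rng=None):
    """Stochastic gradient ascent: theta <- theta + alpha * g,
    where g is an unbiased (noisy) estimate of grad J(theta)."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta + alpha * grad_est(theta, rng)
    return theta

# Simulated unbiased gradient for J(theta) = -||theta||^2: the true gradient
# -2*theta plus zero-mean Gaussian noise.
noisy_grad = lambda th, rng: -2 * th + rng.normal(scale=1.0, size=th.shape)
theta_hat = sga(noisy_grad, theta0=[3.0, -1.0])
```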
[Figure: a 2D quadratic function and the level sets of the quadratic, with axes \(\theta_1\) and \(\theta_2\).]
Theorem: Suppose that \(J\) is differentiable and "well behaved" (\(\beta\)-smooth, with values bounded by \(M\)).
Then for SGA with independent gradient estimates which are unbiased with variance at most \(\sigma^2\),
$$\mathbb E\left[\frac{1}{T}\sum_{i=1}^T \| \nabla J(\theta_i)\|_2^2\right] \lesssim \sqrt{\frac{\beta M\sigma^2}{T}},\quad \text{for }\alpha = \sqrt{\frac{M}{\beta\sigma^2 T}}$$
[Figure: two-state MDP with states \(0\) and \(1\), actions stay and switch, and reward \(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\). Edge labels: from state \(0\), stay: \(1\), switch: \(1\); from state \(1\), stay: \(1-p\) (remain) or \(p\) (to state \(0\)), switch: \(1\).]
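For concreteness, a sketch of this example MDP as numpy arrays, assuming the transition probabilities read off the edge labels above; the value of `p` and the array names are illustrative:

```python
import numpy as np

# States: 0, 1.  Actions: 0 = stay, 1 = switch.
# P[a, s, s'] = probability of moving from s to s' under action a
# (assumption: from state 1, "stay" returns to 0 w.p. p and remains w.p. 1-p;
# the other edges are deterministic as labeled).
p = 0.3  # illustrative value
P = np.array([
    [[1.0, 0.0],       # stay from state 0
     [p,   1.0 - p]],  # stay from state 1
    [[0.0, 1.0],       # switch from state 0
     [1.0, 0.0]],      # switch from state 1
])

# r(s, a) = -0.5 * 1{a = switch} + 1{s = 0}, arranged as r[s, a]
r = np.array([[1.0,  0.5],
              [0.0, -0.5]])
```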
1. Recap
2. Maxima and Critical Points
3. Stochastic Gradient Ascent
4. Derivative-Free Optimization
[Figure: a 2D quadratic function and the level sets of the quadratic, with axes \(\theta_1\) and \(\theta_2\).]
Algorithm: Random Search
\(\nabla J(\theta) \approx \frac{1}{2\delta}\left(J(\theta+\delta v) - J(\theta-\delta v)\right)v\)
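A minimal sketch of this two-point estimate, assuming only function-value (zeroth-order) access to \(J\) and a Gaussian random direction \(v\) (the slides' sampling distribution for \(v\) may differ; names are illustrative):

```python
import numpy as np

def two_point_grad(J, theta, delta=1e-2, rng=None):
    """Random-search gradient estimate:
    (1 / (2*delta)) * (J(theta + delta*v) - J(theta - delta*v)) * v,
    where v is a random direction."""
    rng = np.random.default_rng(0) if rng is None else rng
    v = rng.normal(size=np.shape(theta))  # random direction
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v

# Example: J(theta) = -theta^2 - 1 has true gradient -2*theta.
g_hat = two_point_grad(lambda th: -th**2 - 1, theta=np.array([1.5]))
```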
[Figure: plot of the example objective \(J(\theta) = -\theta^2 - 1\) over \(\theta\).]
Algorithm: Monte-Carlo DFO
\(J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]\), with gradient estimate
$$\nabla J(\theta) \approx \underbrace{\nabla_\theta \log(P_\theta(z))}_{\text{score}}\, h(z), \qquad z \sim P_\theta$$
Example: \(P_\theta = \mathcal N(\theta, 1)\) and \(h(z) = -z^2\), so \(J(\theta) = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2]\) \((= -\theta^2 - 1)\).
Since \(\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2\), the score is \(\nabla_\theta \log P_\theta(z) = (z-\theta)\).
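A minimal sketch of this score-function estimator on the Gaussian example above (\(P_\theta = \mathcal N(\theta,1)\), \(h(z) = -z^2\)); names are illustrative:

```python
import numpy as np

def score_function_grad(theta, n_samples=1000, rng=None):
    """Monte-Carlo DFO estimate of grad J(theta) for J(theta) = E_{z~N(theta,1)}[-z^2],
    averaging grad_theta log P_theta(z) * h(z) = (z - theta) * (-z**2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.normal(loc=theta, scale=1.0, size=n_samples)
    return np.mean((z - theta) * (-z**2))

# True gradient of J(theta) = -theta^2 - 1 is -2*theta, so at theta = 1.5 the
# estimate should be near -3 (up to Monte-Carlo noise).
g_hat = score_function_grad(1.5)
```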