Reinforcement Learning

(Part 2)

MIT 6.421:

Robotic Manipulation

Fall 2023, Lecture 20

Follow live at https://slides.com/d/HoT1aag/live

(or later at https://slides.com/russtedrake/fall23-lec20)

Beware "artificial" discontinuities

Do Differentiable Simulators Give Better Policy Gradients?

H. J. Terry Suh and Max Simchowitz and Kaiqing Zhang and Russ Tedrake

ICML 2022

Available at: https://arxiv.org/abs/2202.00817

Smoothing with stochasticity

\begin{gathered} \min_\theta f(\theta) \end{gathered}

vs

\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}
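One way to see why the noise helps (a sketch, assuming additive noise, i.e. \(f(\theta, w) = f(\theta + w)\)): the smoothed objective is a Gaussian convolution of the original one,

\[ E_w\left[ f(\theta + w) \right] = \int f(\theta + w)\, \mathcal{N}(w; 0, \Sigma)\, dw = \left( f * \mathcal{N}(0, \Sigma) \right)(\theta), \]

and it inherits the smoothness of the Gaussian kernel: the surrogate is differentiable (in fact \(C^\infty\)) even when \(f\) itself has jumps.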

Smoothing with stochasticity for Multibody Contact

Do Differentiable Simulators Give Better Policy Gradients?

The answer is subtle; the Heaviside example might shed some light.


Differentiable simulators give \(\frac{\partial f}{\partial \theta}\), but we want \(\frac{\partial}{\partial \theta} E_w[f(\theta, w)]\).

Randomized smoothing

J. Burke, F. E. Curtis, A. Lewis, M. Overton, and L. Simões, "Gradient Sampling Methods for Nonsmooth Optimization," 2020, pp. 201–225.

  • Approximate the smoothed objective via Monte Carlo: \[ E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K f(x_i), \quad x_i \sim \mathcal{N}(\mu, \Sigma) \]
  • First-order gradient estimate: \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu}, \quad w_i \sim \mathcal{N}(0, \Sigma) \]
  • Zero-order gradient estimate (aka REINFORCE): \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \left[ f(\mu + w_i) - f(\mu) \right] \Sigma^{-1} w_i, \quad w_i \sim \mathcal{N}(0, \Sigma) \]

(both estimators are sketched in code below)
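A minimal sketch of the two estimators in NumPy (hypothetical code, not from the paper; 1-D, so \(\Sigma^{-1} w = w / \sigma^2\)):

```python
import numpy as np

rng = np.random.default_rng(0)

def first_order_estimate(df, mu, sigma, K):
    # Average the analytic gradient at K noisy sample points.
    w = rng.normal(0.0, sigma, size=K)
    return np.mean(df(mu + w))

def zero_order_estimate(f, mu, sigma, K):
    # REINFORCE with baseline f(mu); in 1-D, Sigma^{-1} w = w / sigma**2.
    w = rng.normal(0.0, sigma, size=K)
    return np.mean((f(mu + w) - f(mu)) * w / sigma**2)

f = np.tanh                           # a smooth test objective
df = lambda x: 1.0 / np.cosh(x)**2    # its derivative
print(first_order_estimate(df, mu=0.5, sigma=0.1, K=1000))
print(zero_order_estimate(f, mu=0.5, sigma=0.1, K=1000))
# Both converge to (d/dmu) E[f(mu + w)] as K grows; the zero-order
# estimate typically needs many more samples for the same accuracy.
```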

Lessons from stochastic optimization

  1. The two gradient estimates converge to the same quantity under sufficient regularity conditions (see the identity below).
  2. The convergence rate scales with the variance of the estimator, and the zero-order estimator often has higher variance.
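The identity behind lesson 1 (a sketch, assuming additive noise and enough regularity to justify the first exchange):

\[ E_w\left[ \frac{\partial}{\partial \mu} f(\mu + w) \right] = \frac{\partial}{\partial \mu} E_w\left[ f(\mu + w) \right] = E_w\left[ f(\mu + w)\, \Sigma^{-1} w \right] \]

The second equality is the score-function (REINFORCE) identity and needs no differentiability of \(f\); it is the first equality, moving the derivative inside the expectation, that fails at discontinuities.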

But the regularity conditions aren't met at contact discontinuities, leading to a biased first-order estimator.

Often, but not always.

Example: The Heaviside function

\(\frac{\partial f(x)}{\partial x} = 0\) almost everywhere!

\( \Rightarrow \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu} = 0 \not\approx \frac{\partial}{\partial \mu} E_\mu [f(x)] \)

The first-order estimator is biased; the zero-order estimator is (still) unbiased.
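Worked out for the Heaviside step with scalar noise \(w \sim \mathcal{N}(0, \sigma^2)\), the smoothed objective and its true gradient are

\[ E_w\left[ f(\mu + w) \right] = P(\mu + w \geq 0) = \Phi\!\left(\tfrac{\mu}{\sigma}\right), \qquad \frac{\partial}{\partial \mu} E_w\left[ f(\mu + w) \right] = \frac{1}{\sigma}\, \varphi\!\left(\tfrac{\mu}{\sigma}\right) > 0, \]

where \(\Phi\) and \(\varphi\) are the standard normal CDF and PDF. Every first-order sample reports a gradient of exactly zero, so no amount of averaging recovers the strictly positive slope; the zero-order estimator targets this quantity directly.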

What about smooth (but stiff) approximations?

  • Continuous yet stiff approximations look like strict discontinuities in the finite-sample regime.
  • In the paper, we formalize "empirical bias" to capture this (illustrated in the sketch below).

First-order estimates can also have high variance

e.g. with stiff contact models (large gradient \(\Rightarrow\) high variance)
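Both failure modes show up on a stiff sigmoid \(f(x) = 1/(1 + e^{-kx})\). A sketch (illustrative code, not from the paper): for large stiffness \(k\), almost every sample sees a gradient near zero (empirical bias at finite \(K\)), while the rare sample landing within \(\mathcal{O}(1/k)\) of the boundary sees a gradient of order \(k\) (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def df_sigmoid(x, k):
    # Derivative of 1 / (1 + exp(-k x)), written via tanh for stability.
    return 0.25 * k * (1.0 - np.tanh(0.5 * k * x) ** 2)

mu, sigma, K = 0.2, 0.1, 100
for k in [1.0, 10.0, 1000.0]:
    grads = df_sigmoid(mu + rng.normal(0.0, sigma, size=K), k)
    print(f"k={k:7.1f}  mean={grads.mean():8.4f}  std={grads.std():8.4f}")
# As k grows, almost all per-sample gradients vanish (empirical bias);
# a sample that happens to land near x = 0 contributes a gradient of
# order k, so the estimator is also extremely high-variance.
```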


Is stochasticity essential?

Deterministic smoothing - force at a distance

Global Planning for Contact-Rich Manipulation via Local Smoothing of Quasi-dynamic Contact Models

Tao Pang, H. J. Terry Suh, Lujie Yang, and Russ Tedrake

Available at: https://arxiv.org/abs/2206.10787

Establish equivalence between randomized smoothing and a (deterministic/differentiable) force-at-a-distance contact model.
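A flavor of that equivalence, in one dimension (an illustrative penalty-style computation, not the paper's quasi-dynamic model): take a contact force \(\lambda(\phi) = \max(0, -k\phi)\) on the signed distance \(\phi\), and randomize with \(w \sim \mathcal{N}(0, \sigma^2)\):

\[ E_w\left[ \max(0, -k(\phi + w)) \right] = -k\phi\, \Phi\!\left(\tfrac{-\phi}{\sigma}\right) + k\sigma\, \varphi\!\left(\tfrac{\phi}{\sigma}\right) \]

This is smooth in \(\phi\) and strictly positive even for \(\phi > 0\) (no penetration): the randomized smoothing of a penalty force is exactly a deterministic force that acts at a distance, decaying on the length scale \(\sigma\).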
