Do Differentiable Simulators Give Better Policy Gradients?

Russ Tedrake

RSS 2022 Workshop on Differentiable Physics for Robotics

Follow live at https://slides.com/d/PcFMXLM/live

(or later at https://slides.com/russtedrake/rss-differentiable)

Do Differentiable Simulators Give Better Policy Gradients?

H. J. Terry Suh and Max Simchowitz and Kaiqing Zhang and Russ Tedrake

ICML 2022

Available at: https://arxiv.org/abs/2202.00817

Before we take gradients, let's discuss the optimization landscape...

Contact dynamics can lead to discontinuous landscapes, but mostly in the corner cases.

The "compass gait" biped


The view from hybrid dynamics

Continuity of solutions w.r.t. parameters

A key question for the success of gradient-based optimization

Use initial conditions here as a surrogate for dependence on policy parameters, etc.; final conditions as surrogate for reward.

Continuity of solutions w.r.t. parameters

For the mathematical model... (ignoring numerical issues)

we do expect \(q(t_f) = F\left(q(t_0)\right)\) to be continuous.

  • Contact time, pre-/post-contact pos/vel all vary continuously.
  • Simulators will have artifacts from making discrete-time approximations; these can be made small (but often aren't)

point contact on half-plane
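A toy numerical sketch of this continuity claim (the event-based simulator, restitution model, and all parameter values here are illustrative assumptions, not from the talk): a point mass dropped onto a half-plane, with exact impact-event handling, has a final height that varies continuously with the initial height.

```python
import numpy as np

def bounce_sim(h0, t_f=1.0, g=9.81, e=0.9):
    """Drop a point mass from height h0 onto the half-plane h = 0,
    with coefficient of restitution e, using exact event times.
    Returns the height at time t_f."""
    h, v, t = h0, 0.0, 0.0
    while True:
        # Time to next impact: solve h + v*dt - 0.5*g*dt^2 = 0 (positive root).
        dt_impact = (v + np.sqrt(v**2 + 2.0 * g * h)) / g
        if t + dt_impact > t_f:
            dt = t_f - t
            return h + v * dt - 0.5 * g * dt**2
        t += dt_impact
        h = 0.0
        v = -e * (v - g * dt_impact)  # restitution flips the impact velocity

# Final conditions vary continuously with initial conditions,
# even though each trajectory passes through impact events.
heights = [bounce_sim(1.0 + d) for d in (0.0, 1e-3, 2e-3)]
```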

Continuity of solutions w.r.t. parameters

We have "real" discontinuities at the corner cases

  • making contact w/ a different face
  • transitions to/from contact and no contact

Soft/compliant contact can replace discontinuities with stiff approximations

Beware "artificial" discontinuities

Non-smooth optimization

\[ \min_x f(x) \]

For gradient descent, discontinuities / non-smoothness can

  • introduce local minima
  • destroy convergence (e.g. \(l_1\)-minimization)

Smoothing discontinuous objectives

  • A natural idea: can we smooth the objective?

 

  • Probabilistic formulation, for small \(\Sigma\): \[ \min_x f(x) \approx \min_\mu E \left[ f(x) \right], x \sim \mathcal{N}(\mu, \Sigma) \]

 

  • A low-pass filter in parameter space with a Gaussian kernel.

Example: The Heaviside function

  • Smooth local minima
  • Alleviate flat regions
  • Encode robustness
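A minimal numerical sketch of this smoothing on the Heaviside step (the noise scale `sigma` and sample count `K` are illustrative choices): convolving with a Gaussian turns the step into the Gaussian CDF \(\Phi(\mu/\sigma)\), which is smooth and has an informative gradient everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def heaviside(x):
    """The discontinuous objective: H(x) = 1 if x >= 0, else 0."""
    return (np.asarray(x) >= 0).astype(float)

def smoothed(mu, sigma=0.5, K=100_000):
    """Monte Carlo estimate of E[H(x)] with x ~ N(mu, sigma^2).
    Analytically this equals the Gaussian CDF Phi(mu / sigma):
    the step becomes a smooth sigmoid in mu."""
    return heaviside(mu + sigma * rng.standard_normal(K)).mean()
```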

Smoothing with stochasticity

\( \min_\theta f(\theta) \quad \text{vs} \quad \min_\theta E_w\left[ f(\theta, w) \right], \ w \sim N(0, \Sigma) \)

Smoothing with stochasticity for Multibody Contact

Relationship to RL Policy Gradient / CMA / MPPI

In reinforcement learning (RL) and "deep" model-predictive control, we add stochasticity via

  • Stochastic policies
  • Random initial conditions
  • "Domain randomization"

then optimize a stochastic optimal control objective (e.g. maximize expected reward)

 

These can all smooth the optimization landscape.

Do Differentiable Simulators Give Better Policy Gradients?

The answer is subtle; the Heaviside example might shed some light.

\( \min_\theta f(\theta) \quad \text{vs} \quad \min_\theta E_w\left[ f(\theta, w) \right], \ w \sim N(0, \Sigma) \)

Differentiable simulators give \(\frac{\partial f}{\partial \theta}\), but we want \(\frac{\partial}{\partial \theta} E_w[f(\theta, w)]\).

Randomized smoothing

J. Burke, F. E. Curtis, A. Lewis, M. Overton, and L. Simões, "Gradient Sampling Methods for Nonsmooth Optimization," 2020, pp. 201–225.

  • Approximate the smoothed objective via Monte Carlo: \[ E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K f(x_i), \quad x_i \sim \mathcal{N}(\mu, \Sigma) \]
  • First-order gradient estimate \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu}, \quad w_i \sim \mathcal{N}(0, \Sigma) \]

 

  • Zero-order gradient estimate (aka REINFORCE) \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \left[f(\mu + w_i) - f(\mu)\right] \Sigma^{-1} w_i, \quad w_i \sim \mathcal{N}(0, \Sigma) \]
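A sketch of the two estimators on a smooth 1-D objective (the choice of \(\sin\), `sigma`, and `K` are illustrative assumptions): both converge to the smoothed gradient, but from different information and with different variance.

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin    # a smooth test objective
df = np.cos   # its analytic derivative (what a differentiable simulator returns)

def first_order(mu, sigma=0.1, K=10_000):
    """First-order estimate: average the analytic derivative at sampled points."""
    w = sigma * rng.standard_normal(K)
    return df(mu + w).mean()

def zero_order(mu, sigma=0.1, K=10_000):
    """Zero-order (REINFORCE) estimate: uses only function values.
    The 1/sigma^2 factor is Sigma^{-1} for this scalar Gaussian."""
    w = sigma * rng.standard_normal(K)
    return ((f(mu + w) - f(mu)) * w / sigma**2).mean()
```

On this smooth objective both estimates approach \(\cos(\mu)\) as the smoothing shrinks; the zero-order estimate is noticeably noisier at the same sample count.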

Lessons from stochastic optimization

  1. The two gradient estimates converge to the same quantity under sufficient regularity conditions.


  2. Convergence rate scales directly with the variance of the estimator; zero-order estimates often have higher variance.

But these regularity conditions aren't met at contact discontinuities, leading to a biased first-order estimator.

Often, but not always.

Example: The Heaviside function

\(\frac{\partial f(x)}{\partial x} = 0\) almost everywhere!

\( \Rightarrow \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu} = 0 \)

First-order estimator is biased

\( \not\approx  \frac{\partial}{\partial \mu} E_\mu [f(x)]  \)

Zero-order estimator is (still) unbiased

What about smooth (but stiff) approximations?

  • Continuous yet stiff approximations look like strict discontinuities in the finite-sample regime.
  • In the paper, we formalize "empirical bias" to capture this.

First-order estimates can also have high variance

e.g. with stiff contact models (large gradient \(\Rightarrow\) high variance)
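A sketch of this effect with a sigmoid of increasing stiffness standing in for a stiff contact model (the stiffness values, `sigma`, and `K` are illustrative assumptions): the per-sample first-order gradients keep roughly the same mean, but their variance blows up with stiffness.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_samples(k, mu=0.0, sigma=0.5, K=100_000):
    """Per-sample first-order gradients of a sigmoid with stiffness k."""
    w = sigma * rng.standard_normal(K)
    s = 1.0 / (1.0 + np.exp(-k * (mu + w)))
    return k * s * (1.0 - s)   # analytic derivative at each sample

soft = grad_samples(10.0)      # mildly stiff model
stiff = grad_samples(1000.0)   # very stiff model: rare, huge gradient spikes

# The averages are similar (both estimate the smoothed gradient),
# but the stiff model's per-sample gradients have far higher variance.
```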


Summary so far

First-order estimators are often lower variance than zero-order estimators.  But they have some pathologies:

  • Bias/empirical bias around (near) discontinuities
  • High variance from stiffness

 

Zero-order estimators are robust in these regimes.

This may explain the experimental success of zero-order methods in contact-rich RL.

The best of both worlds?

Define \(\alpha\)-order gradient estimate as 

\begin{gathered} \bar\nabla^\alpha F(x) = \alpha \underbrace{\bar\nabla^1 F(x)}_{\text{first-order estimate}} + (1-\alpha) \underbrace{\bar\nabla^0 F(x)}_{\text{zero-order estimate}} , \qquad 0 \le \alpha \le 1 \end{gathered}

We give an algorithm to choose \(\alpha\) automatically based on the empirical variance (+ a trust region using empirical bias).
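A simplified sketch of the interpolated estimator. The inverse-variance weighting used to pick \(\alpha\) here is an illustrative assumption (the paper's actual rule also involves a trust region based on empirical bias, omitted here), and the test objective is a toy.

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_order_grad(f, df, mu, sigma=0.1, K=1_000):
    """Blend first- and zero-order gradient estimates, choosing alpha
    from the empirical per-sample variances (simplified sketch)."""
    w = sigma * rng.standard_normal(K)
    g1 = df(mu + w)                            # per-sample first-order
    g0 = (f(mu + w) - f(mu)) * w / sigma**2    # per-sample zero-order
    v1, v0 = g1.var(), g0.var()
    alpha = v0 / (v0 + v1 + 1e-12)   # inverse-variance weighting (assumption)
    return alpha * g1.mean() + (1 - alpha) * g0.mean()

g = alpha_order_grad(np.sin, np.cos, 0.3)
```

On a smooth objective the weighting leans almost entirely on the lower-variance first-order term; at a Heaviside-like discontinuity the first-order variance signal vanishes while its bias appears, which is why the paper pairs this with a bias-based trust region.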

Is stochasticity essential?

Force at a distance

Smoothing of time-stepping contact model

Deterministic smoothing - force at a distance

Global Planning for Contact-Rich Manipulation via
Local Smoothing of Quasi-dynamic Contact Models

Tao Pang, H. J. Terry Suh, Lujie Yang, and Russ Tedrake

Available at: https://arxiv.org/abs/2206.10787

Establish equivalence between randomized smoothing and a (deterministic/differentiable) force-at-a-distance contact model.

Smoothed gradients (stochastic)

Gradients:

\[ x_{n+1} = f(x_n, u_n) \approx A(x-x_0) + B(u-u_0) + c, \quad A = \frac{\partial f}{\partial x},\quad B = \frac{\partial f}{\partial u},\quad c = f(x_0, u_0) \]

Smoothed gradients (under distribution \(\rho\)):

\[ x_{n+1} \approx E_{w \sim \rho}\left[f(x+w_x, u+w_u)\right], \quad A_\rho = \frac{\partial}{\partial x} E_{w \sim \rho}\left[f(x+w_x, u+w_u)\right],\quad B_\rho = \frac{\partial}{\partial u}E_{w \sim \rho}\left[f(x+w_x, u+w_u)\right],\quad c_\rho = E_{w \sim \rho}\left[f(x_0+w_x, u_0+w_u)\right] \]
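A Monte Carlo sketch of these smoothed linearization terms on a toy piecewise-linear "contact" map (the dynamics \(f = \max(0, x+u)\), `sigma`, and `K` are illustrative assumptions): at the contact boundary, the smoothed Jacobians average the two modes instead of picking one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contact-like dynamics: the next state is clipped at a wall at 0.
f = lambda x, u: np.maximum(0.0, x + u)
dfdx = lambda x, u: (x + u > 0).astype(float)   # a.e. Jacobian w.r.t. x
dfdu = dfdx                                      # same w.r.t. u here

def smoothed_linearization(x0, u0, sigma=0.1, K=100_000):
    """Monte Carlo estimate of (A_rho, B_rho, c_rho) under Gaussian rho."""
    wx = sigma * rng.standard_normal(K)
    wu = sigma * rng.standard_normal(K)
    A = dfdx(x0 + wx, u0 + wu).mean()
    B = dfdu(x0 + wx, u0 + wu).mean()
    c = f(x0 + wx, u0 + wu).mean()
    return A, B, c

# At the contact boundary (x0 + u0 = 0), the smoothed Jacobians are ~0.5:
# a blend of the "in contact" and "no contact" modes.
A, B, c = smoothed_linearization(0.0, 0.0)
```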

Smoothed gradients (deterministic)

  • \( x_{n+1} = f(x_n, u_n) \) : the solution of an optimization under contact complementarity constraints.
     
  • Relaxed problem: move the hard complementarity constraints into the objective (e.g. via a log-barrier penalty term)
    • Results in force at a distance
    • For simple problems, we show that each barrier function corresponds to a choice of \( \rho \), and vice versa.
       
  • RRT distance metric / Trajectory optimization using (differentiable) quasi-dynamic model.
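A 1-D sketch of the barrier relaxation (the quasi-static setup, stiffness `k`, and barrier weight `1/kappa` are illustrative assumptions, not the paper's model): a point at \(q\), pulled toward a command \(q_{cmd}\) through a spring, with a wall at \(q = 0\). Replacing the hard non-penetration constraint with a log barrier yields a closed-form step in which the wall pushes back even before contact.

```python
import numpy as np

def relaxed_step(q_cmd, k=100.0, kappa=100.0):
    """Minimize 0.5*k*(q - q_cmd)^2 - (1/kappa)*log(q) over q > 0.
    Stationarity k*(q - q_cmd) - 1/(kappa*q) = 0 gives a closed form."""
    q = 0.5 * (q_cmd + np.sqrt(q_cmd**2 + 4.0 / (k * kappa)))
    contact_force = 1.0 / (kappa * q)   # barrier force: nonzero at any distance
    return q, contact_force

# Commanding toward the wall from outside already feels a (small) force
# at a distance; commanding into the wall produces a large, smooth force.
q_far, f_far = relaxed_step(0.5)
q_pen, f_pen = relaxed_step(-0.5)
```

As \(\kappa \to \infty\) the barrier force vanishes away from the wall and the hard complementarity solution is recovered; finite \(\kappa\) is the deterministic smoothing.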

Summary (do gradients help?)

  • RL uses stochastic optimal control / smooths discontinuities
    • Here we need \( \frac{\partial}{\partial \theta} E_w[f(\theta, w)] \), not just \(\frac{\partial f}{\partial \theta}\)
    • First-order estimators have some pathologies with stiffness/discontinuities; zero-order is robust.
    • \(\alpha\)-order estimator can achieve faster convergence + robust performance.
  • Examining smoothing for simple systems reveals a deterministic equivalent (e.g. force at a distance)
    • Now \(\frac{\partial f}{\partial \theta}\) is all you need
    • Enabled RRT / trajectory opt. for dexterous hands

Further reading

Do Differentiable Simulators Give Better Policy Gradients?

H. J. Terry Suh and Max Simchowitz and Kaiqing Zhang and Russ Tedrake

ICML 2022

Available at: https://arxiv.org/abs/2202.00817

Global Planning for Contact-Rich Manipulation via
Local Smoothing of Quasi-dynamic Contact Models

Tao Pang, H. J. Terry Suh, Lujie Yang, and Russ Tedrake

Available at: https://arxiv.org/abs/2206.10787

Beware "artificial" discontinuities

My claim: Subtle interactions between the collision and physics engines can cause artificial discontinuities

(sometimes with dramatic results)

 

Understanding this requires a few steps

  1. Numerical methods must deal with overlapping geometry.
  2. Standard approaches summarize the contact forces / constraints at one or more points.
  3. It is effectively impossible to do this without introducing (potentially severe) discontinuities.

Rich collision geometries

Green arrow is the force on the red box due to the overlap with the blue box.

"Point contact" as implemented in Drake


Multi-point contact

Many heuristics for using multiple points...

"Hydroelastic contact" as implemented in Drake

major contributions from Damrong Guoy, Sean Curtis, Rick Cory, Alejandro Castro, ...

"Hydroelastic contact" as implemented in Drake

Red box is rigid, blue box is soft.

"Hydroelastic contact" as implemented in Drake

Both boxes are soft.

Point contact vs hydroelastic

Point contact (discontinuous) vs hydroelastic (continuous)

Hydroelastic is

  • more expensive than point contact
  • (much) less expensive than finite-element models

 

State-space (for simulation, planning, control) is the original rigid-body state.

"Hydroelastic contact" as implemented in Drake

Point contact and multi-point contact can produce qualitatively wrong behavior.

Hydroelastic often resolves it.

Point contact vs hydroelastic

Example: Simulating LEGO® block mating

Manually-curated point contacts

Hydroelastic contact surfaces

Stable and symmetrical hydroelastic forces

Before

Now


The corner cases

Point contact

Hydroelastic contact

the frictionless case

The corner cases

Point contact (no friction) vs hydroelastic (no friction)
