# Do Differentiable Simulators Give Better Policy Gradients?

Russ Tedrake

RSS 2022 Workshop on Differentiable Physics for Robotics

Follow live at https://slides.com/d/PcFMXLM/live

(or later at https://slides.com/russtedrake/rss-differentiable)

Do Differentiable Simulators Give Better Policy Gradients?

H. J. Terry Suh and Max Simchowitz and Kaiqing Zhang and Russ Tedrake

ICML 2022

Available at: https://arxiv.org/abs/2202.00817

Before we take gradients, let's discuss the optimization landscape...

Contact dynamics can lead to discontinuous landscapes, but mostly in the corner cases.

## Continuity of solutions w.r.t. parameters

A key question for the success of gradient-based optimization

Use initial conditions here as a surrogate for dependence on policy parameters, etc.; final conditions as surrogate for reward.

## Continuity of solutions w.r.t. parameters

For the mathematical model... (ignoring numerical issues)

we do expect $$q(t_f) = F\left(q(t_0)\right)$$ to be continuous.

• Contact time, pre-/post-contact pos/vel all vary continuously.
• Simulators will have artifacts from making discrete-time approximations; these can be made small (but often aren't)

point contact on half-plane

## Continuity of solutions w.r.t. parameters

We have "real" discontinuities at the corner cases

• making contact w/ a different face
• transitions to/from contact and no contact

Soft/compliant contact can replace discontinuities with stiff approximations

## Non-smooth optimization

$\min_x f(x)$

For gradient descent, discontinuities / non-smoothness can

• introduce local minima
• destroy convergence (e.g. $$\ell_1$$-minimization)

## Smoothing discontinuous objectives

• A natural idea: can we smooth the objective?

• Probabilistic formulation, for small $$\Sigma$$: $\min_x f(x) \approx \min_\mu E \left[ f(x) \right], x \sim \mathcal{N}(\mu, \Sigma)$

• A low-pass filter in parameter space with a Gaussian kernel.
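This low-pass-filter view is easy to check numerically. Below is a minimal Monte Carlo sketch (plain NumPy; the function names are illustrative, not from the paper) that estimates the smoothed objective $E[f(x)]$, $x \sim \mathcal{N}(\mu, \sigma^2)$, for the Heaviside step, a canonical discontinuity:

```python
import numpy as np

def heaviside(x):
    return (x >= 0).astype(float)

def smoothed(f, mu, sigma, K=10000, rng=None):
    """Monte Carlo estimate of E[f(x)], x ~ N(mu, sigma^2)."""
    rng = rng or np.random.default_rng(0)
    return f(mu + sigma * rng.standard_normal(K)).mean()

# The smoothed Heaviside is the Gaussian CDF Phi(mu/sigma):
# the discontinuity at 0 becomes a smooth sigmoid-like ramp.
for mu in [-0.5, 0.0, 0.5]:
    print(mu, smoothed(heaviside, mu, sigma=0.5))
```

For $\mu = 0$ the estimate is close to $0.5$, and it ramps smoothly from 0 to 1 as $\mu$ sweeps across the discontinuity.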

## Example: The Heaviside function

• Smooths out local minima
• Alleviates flat regions
• Encodes robustness

## Smoothing with stochasticity

\begin{gathered} \min_\theta f(\theta) \end{gathered}
\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}

vs

## Relationship to RL Policy Gradient / CMA / MPPI

In reinforcement learning (RL) and "deep" model-predictive control, we add stochasticity via

• Stochastic policies
• Random initial conditions
• "Domain randomization"

then optimize a stochastic optimal control objective (e.g. maximize expected reward)

These can all smooth the optimization landscape.

### Do Differentiable Simulators Give Better Policy Gradients?

The answer is subtle; the Heaviside example might shed some light.

\begin{gathered} \min_\theta f(\theta) \end{gathered}
\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}

vs

Differentiable simulators give $$\frac{\partial f}{\partial \theta}$$, but we want $$\frac{\partial}{\partial \theta} E_w[f(\theta, w)]$$.

## Randomized smoothing

J. V. Burke, F. E. Curtis, A. S. Lewis, M. L. Overton, and L. E. A. Simões, "Gradient Sampling Methods for Nonsmooth Optimization," in Numerical Nonsmooth Optimization, Springer, 2020, pp. 201–225.

• Approximate the smoothed objective via Monte Carlo: $E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K f(x_i), \quad x_i \sim \mathcal{N}(\mu, \Sigma)$
• First-order gradient estimate $\frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu}, \quad w_i \sim \mathcal{N}(0, \Sigma)$

• Zero-order gradient estimate (aka REINFORCE) $\frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \left[f(\mu + w_i) - f(\mu)\right] \Sigma^{-1} w_i, \quad w_i \sim \mathcal{N}(0, \Sigma)$
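The two estimators can be compared side by side on a smooth objective. A minimal sketch (names and the test function $f(x)=x^2$ are illustrative; for a scalar Gaussian, $\Sigma^{-1} w_i$ reduces to $w_i/\sigma^2$):

```python
import numpy as np

def estimators(f, dfdx, mu, sigma, K, rng):
    """First- and zero-order Monte Carlo estimates of d/dmu E[f(mu+w)]."""
    w = sigma * rng.standard_normal(K)
    first = dfdx(mu + w).mean()                         # average the pathwise gradients
    zero = ((f(mu + w) - f(mu)) * w / sigma**2).mean()  # REINFORCE, baseline f(mu)
    return first, zero

rng = np.random.default_rng(0)
f, dfdx = lambda x: x**2, lambda x: 2 * x
# Both converge to the smoothed gradient 2*mu, but at the same K the
# zero-order estimate typically has (much) higher variance.
fo, zo = estimators(f, dfdx, mu=1.0, sigma=0.1, K=5000, rng=rng)
```

Repeating the call with fresh seeds shows the zero-order estimate scattering far more widely around $2\mu$ than the first-order one, which is lesson 2 below.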

## Lessons from stochastic optimization

1. The two gradient estimates converge to the same quantity under sufficient regularity conditions.

2. The convergence rate scales with the variance of the estimator; the zero-order estimate often has higher variance.

But the regularity conditions aren't met at contact discontinuities, leaving the first-order estimator biased.

Often, but not always.

## Example: The Heaviside function

$$\frac{\partial f(x)}{\partial x} = 0$$ almost everywhere!

$$\Rightarrow \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu} = 0$$

First-order estimator is biased

$$\not\approx \frac{\partial}{\partial \mu} E_\mu [f(x)]$$

Zero-order estimator is (still) unbiased

## What about smooth (but stiff) approximations?

• Continuous yet stiff approximations look like strict discontinuities in the finite-sample regime.
• In the paper, we formalize "empirical bias" to capture this.

## First-order estimates can also have high variance

e.g. with stiff contact models (large gradient $$\Rightarrow$$ high variance)

## Summary so far

First-order estimators often have lower variance than zero-order estimators, but they have some pathologies:

• Bias/empirical bias around (near) discontinuities
• High variance from stiffness

Zero-order estimators are robust in these regimes.

This may explain the experimental success of zero-order methods in contact-rich RL.

## The best of both worlds?

Define $$\alpha$$-order gradient estimate as

\begin{gathered} \bar\nabla^\alpha F(x) = \alpha \underbrace{\bar\nabla^1 F(x)}_{\text{first-order estimate}} + (1-\alpha) \underbrace{\bar\nabla^0 F(x)}_{\text{zero-order estimate}}, \qquad 0 \le \alpha \le 1 \end{gathered}

We give an algorithm to choose $$\alpha$$ automatically based on the empirical variance

(+ a trust region using empirical bias).
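One simple way to realize this idea is inverse-variance weighting of the two estimates. The sketch below is a simplification of the paper's rule (it omits the trust region on empirical bias; names are illustrative):

```python
import numpy as np

def alpha_order_gradient(f, dfdx, mu, sigma, K, rng):
    """Interpolated gradient estimate; alpha is picked by inverse-variance
    weighting of the two estimators (a simplification of the paper's rule,
    which additionally enforces a trust region using empirical bias)."""
    w = sigma * rng.standard_normal(K)
    g1 = dfdx(mu + w)                          # first-order samples
    g0 = (f(mu + w) - f(mu)) * w / sigma**2    # zero-order samples
    v1, v0 = g1.var(ddof=1) / K, g0.var(ddof=1) / K
    alpha = v0 / (v0 + v1 + 1e-12)             # trust the lower-variance estimate
    return alpha * g1.mean() + (1 - alpha) * g0.mean(), alpha
```

On a smooth objective the first-order samples have far lower variance, so $\alpha$ lands near 1; near a (stiff) discontinuity the first-order variance blows up and the weight shifts toward the robust zero-order estimate.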

## Force at a distance

Smoothing of time-stepping contact model

## Deterministic smoothing - force at a distance

Global Planning for Contact-Rich Manipulation via
Local Smoothing of Quasi-dynamic Contact Models

Tao Pang, H. J. Terry Suh, Lujie Yang, and Russ Tedrake

Available at: https://arxiv.org/abs/2206.10787

Establish equivalence between randomized smoothing and a (deterministic/differentiable) force-at-a-distance contact model.

## Smoothed gradients (stochastic)

Linearizing the dynamics:

$$x_{n+1} = f(x_n, u_n) \approx A(x-x_0) + B(u-u_0) + c, \qquad A = \frac{\partial f}{\partial x},\quad B = \frac{\partial f}{\partial u},\quad c = f(x_0, u_0)$$

Smoothed gradients (under distribution $$\rho$$):

$$A_\rho = \frac{\partial}{\partial x} E_{w \sim \rho}\left[f(x+w_x, u+w_u)\right],\quad B_\rho = \frac{\partial}{\partial u} E_{w \sim \rho}\left[f(x+w_x, u+w_u)\right],\quad c_\rho = E_{w \sim \rho}\left[f(x_0+w_x, u_0+w_u)\right]$$
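A minimal Monte Carlo sketch of the smoothed linearization, using a toy 1D contact-like nonsmoothness in place of a real simulator (all names and the dynamics are illustrative):

```python
import numpy as np

# Toy piecewise-linear "dynamics": a 1D contact-like kink at x = 0.
f = lambda x, u: np.maximum(x, 0.0) + u

def smoothed_linearization(x0, u0, sigma, K=10000, rng=None):
    """Monte Carlo estimate of the rho-smoothed A_rho, B_rho, c_rho
    (Gaussian rho; gradients by averaging pathwise derivatives)."""
    rng = rng or np.random.default_rng(0)
    wx = sigma * rng.standard_normal(K)
    A = (x0 + wx > 0).mean()   # d/dx max(x,0) = 1{x>0}, averaged over noise
    B = 1.0                    # df/du = 1 everywhere
    c = f(x0 + wx, u0).mean()
    return A, B, c

# At the kink x0 = 0 the smoothed A_rho ~= 0.5: the smoothed model
# "feels" the contact mode on both sides of the discontinuity.
A, B, c = smoothed_linearization(x0=0.0, u0=0.0, sigma=0.1)
```

The hard model's $A$ jumps from 0 to 1 at $x_0 = 0$; the smoothed $A_\rho$ interpolates, which is exactly what makes it useful for gradient-based planning.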

## Smoothed gradients (deterministic)

• $$x_{n+1} = f(x_n, u_n)$$ : the solution of an optimization under contact complementarity constraints.

• Relaxed problem: move the hard complementarity constraints into the objective (e.g. via a log-barrier penalty term)
• Results in force at a distance
• For simple problems, we show that each barrier function corresponds with a choice of $$\rho$$ and vice versa.

• RRT distance metric / Trajectory optimization using (differentiable) quasi-dynamic model.
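The force-at-a-distance effect of the barrier relaxation can be seen in a 1D sketch. This is an illustrative simplification, not the paper's or Drake's exact formulation; `kappa` and the function names are assumptions:

```python
import numpy as np

# Hard complementarity: 0 <= phi (signed distance)  perp  lambda (contact force) >= 0,
# i.e. the force is exactly zero whenever the bodies are separated.
# A log-barrier relaxation replaces the constraint with lambda * phi = kappa,
# so lambda(phi) = kappa / phi: a smooth force that acts at a distance
# and sharpens back to the hard model as kappa -> 0.
def relaxed_contact_force(phi, kappa):
    return kappa / phi

phi = np.array([0.01, 0.1, 1.0])  # separation distances
for kappa in [1e-1, 1e-3]:
    print(kappa, relaxed_contact_force(phi, kappa))
```

The force decays smoothly with separation instead of switching off, and shrinking `kappa` concentrates it back near contact, mirroring how the choice of barrier corresponds to a choice of smoothing distribution $\rho$.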

## Summary (do gradients help?)

• RL uses stochastic optimal control / smooths discontinuities
• Here we need $$\frac{\partial}{\partial \theta} E_w[f(\theta, w)]$$, not just $$\frac{\partial f}{\partial \theta}$$
• First-order estimators have some pathologies with stiffness/discontinuities; zero-order is robust.
• $$\alpha$$-order estimator can achieve faster convergence + robust performance.
• Examining smoothing for simple systems reveals a deterministic equivalent (e.g. force at a distance)
• Now $$\frac{\partial f}{\partial \theta}$$ is all you need
• Enabled RRT / trajectory opt. for dexterous hands


## Beware "artificial" discontinuities

My claim: Subtle interactions between the collision and physics engines can cause artificial discontinuities

(sometimes with dramatic results)

Understanding this requires a few steps

1. Numerical methods must deal with overlapping geometry.
2. Standard approaches summarize the contact forces / constraints at one or more points.
3. It is effectively impossible to do this without introducing (potentially severe) discontinuities.

## Rich collision geometries

Green arrow is the force on the red box due to the overlap with the blue box.

## Multi-point contact

Many heuristics for using multiple points...

## "Hydroelastic contact" as implemented in Drake

major contributions from Damrong Guoy, Sean Curtis, Rick Cory, Alejandro Castro, ...

## "Hydroelastic contact" as implemented in Drake

Red box is rigid, blue box is soft.

## "Hydroelastic contact" as implemented in Drake

Both boxes are soft.

## Point contact vs hydroelastic

Point contact (discontinuous)

Hydroelastic

(continuous)

vs

Hydroelastic is

• more expensive than point contact
• (much) less expensive than finite-element models

State-space (for simulation, planning, control) is the original rigid-body state.

## "Hydroelastic contact" as implemented in Drake

Point contact and multi-point contact can produce qualitatively wrong behavior.

Hydroelastic often resolves it.

## Example: Simulating LEGO® block mating

Manually-curated point contacts

Hydroelastic contact surfaces

Stable and symmetrical hydroelastic forces

Before

Now


## The corner cases

Point contact

Hydroelastic contact

the frictionless case

## The corner cases

Point contact (no friction)

Hydroelastic

(no friction)
