# Do Differentiable Simulators Give Better Policy Gradients?

Russ Tedrake

RSS 2022 Workshop on Differentiable Physics for Robotics

Do Differentiable Simulators Give Better Policy Gradients?

H. J. Terry Suh and Max Simchowitz and Kaiqing Zhang and Russ Tedrake

ICML 2022

Available at: https://arxiv.org/abs/2202.00817

Before we take gradients, let's discuss the optimization landscape...

Contact dynamics can lead to discontinuous landscapes, but mostly in the corner cases.

## Continuity of solutions w.r.t parameters

A key question for the success of gradient-based optimization

Use initial conditions here as a surrogate for dependence on policy parameters, etc.; final conditions as surrogate for reward.

## Continuity of solutions w.r.t parameters

For the mathematical model (ignoring numerical issues), we do expect $$q(t_f) = F\left(q(t_0)\right)$$ to be continuous.

• Contact time, pre-/post-contact pos/vel all vary continuously.
• Simulators will have artifacts from making discrete-time approximations; these can be made small (but often aren't)

point contact on half-plane

## Continuity of solutions w.r.t parameters

We have "real" discontinuities at the corner cases

• making contact w/ a different face
• transitions to/from contact and no contact

Soft/compliant contact can replace discontinuities with stiff approximations

## Non-smooth optimization

$\min_x f(x)$

For gradient descent, discontinuities / non-smoothness can

• introduce local minima
• destroy convergence (e.g. $$l_1$$-minimization)
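A minimal sketch of the convergence failure, assuming the 1-D objective $$f(x) = |x|$$ and a fixed step size (both chosen here purely for illustration). Because the subgradient never shrinks near the kink, the iterates end up hopping across the minimizer instead of settling on it.

```python
def subgradient_abs(x: float) -> float:
    """Return a subgradient of f(x) = |x| (0 is chosen at the kink)."""
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return -1.0
    return 0.0


def gradient_descent(x0: float, step: float, num_iters: int) -> list[float]:
    """Run fixed-step (sub)gradient descent on f(x) = |x|."""
    xs = [x0]
    for _ in range(num_iters):
        xs.append(xs[-1] - step * subgradient_abs(xs[-1]))
    return xs


iterates = gradient_descent(x0=0.25, step=0.1, num_iters=50)
# Distance to the minimizer x = 0 over the last few iterations.
tail_distances = [abs(x) for x in iterates[-10:]]
print(tail_distances)  # stays near step / 2 instead of shrinking to 0
```

A decaying step size would restore convergence here, at the cost of a slower rate than gradient descent achieves on smooth problems.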

## Smoothing discontinuous objectives

• A natural idea: can we smooth the objective?

• Probabilistic formulation, for small $$\Sigma$$: $\min_x f(x) \approx \min_\mu E \left[ f(x) \right], x \sim \mathcal{N}(\mu, \Sigma)$

• A low-pass filter in parameter space with a Gaussian kernel.

## Example: The Heaviside function

• Smooth local minima
• Alleviate flat regions
• Encode robustness
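As a sketch of the smoothing at work, assume scalar Gaussian noise with standard deviation `sigma` (an illustrative choice). The smoothed Heaviside function is then the Gaussian CDF, $$E[H(\mu + w)] = \Phi(\mu / \sigma)$$, which the Monte Carlo average below should reproduce.

```python
import math

import numpy as np


def heaviside(x):
    """Heaviside step function: 1 for x >= 0, else 0."""
    return np.where(np.asarray(x) >= 0.0, 1.0, 0.0)


def smoothed_heaviside_exact(mu: float, sigma: float) -> float:
    """E[H(mu + w)] for w ~ N(0, sigma^2), i.e. the Gaussian CDF at mu / sigma."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))


def smoothed_heaviside_mc(mu: float, sigma: float, num_samples: int, rng) -> float:
    """Monte Carlo estimate of the same expectation."""
    w = rng.normal(0.0, sigma, size=num_samples)
    return float(np.mean(heaviside(mu + w)))


rng = np.random.default_rng(0)
sigma = 0.5
results = {}
for mu in (-1.0, 0.0, 1.0):
    exact = smoothed_heaviside_exact(mu, sigma)
    estimate = smoothed_heaviside_mc(mu, sigma, 100_000, rng)
    results[mu] = (exact, estimate)
    print(f"mu={mu:+.1f}  exact={exact:.4f}  monte_carlo={estimate:.4f}")
```

The flat regions of the step now have a nonzero slope, so a gradient-based method started at `mu = -1.0` has a direction to follow.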

## Smoothing with stochasticity

$$\min_\theta f(\theta) \quad \text{vs} \quad \min_\theta E_w\left[ f(\theta, w) \right], \quad w \sim \mathcal{N}(0, \Sigma)$$

## Relationship to RL Policy Gradient / CMA / MPPI

In reinforcement learning (RL) and "deep" model-predictive control, we add stochasticity via

• Stochastic policies
• Random initial conditions
• "Domain randomization"

then optimize a stochastic optimal control objective (e.g. maximize expected reward)

These can all smooth the optimization landscape.

### Do Differentiable Simulators Give Better Policy Gradients?

The answer is subtle; the Heaviside example might shed some light.

$$\min_\theta f(\theta) \quad \text{vs} \quad \min_\theta E_w\left[ f(\theta, w) \right], \quad w \sim \mathcal{N}(0, \Sigma)$$

Differentiable simulators give $$\frac{\partial f}{\partial \theta}$$, but we want $$\frac{\partial}{\partial \theta} E_w[f(\theta, w)]$$.

## Randomized smoothing

J. Burke, F. E. Curtis, A. Lewis, M. Overton, and L. Simoes, Gradient Sampling Methods for Nonsmooth Optimization, 02 2020, pp. 201–225.

• Approximate the smoothed objective via Monte Carlo: $E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K f(x_i), \quad x_i \sim \mathcal{N}(\mu, \Sigma)$
• First-order gradient estimate $\frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu}, \quad w_i \sim \mathcal{N}(0, \Sigma)$

• Zero-order gradient estimate (aka REINFORCE) $\frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \left[f(\mu + w_i) - f(\mu)\right] \Sigma^{-1} w_i, \quad w_i \sim \mathcal{N}(0, \Sigma)$
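A minimal sketch of both estimators on a smooth test objective, $$f(x) = \sum_j \sin(x_j)$$, chosen here only because the gradient of its Gaussian-smoothed version is known in closed form: $$\cos(\mu_j)\, e^{-\Sigma_{jj}/2}$$. The function names and the values of `mu` and `cov` are illustrative assumptions. The zero-order samples are scaled by $$\Sigma^{-1}$$ so that their mean matches the gradient of the smoothed objective.

```python
import numpy as np


def f(x):
    """Smooth test objective (an assumption for illustration)."""
    return np.sum(np.sin(x), axis=-1)


def grad_f(x):
    """Analytic gradient of f, standing in for a differentiable simulator."""
    return np.cos(x)


def first_order_estimate(mu, cov, num_samples, rng):
    """Average the gradient of f at perturbed points mu + w_i."""
    w = rng.multivariate_normal(np.zeros_like(mu), cov, size=num_samples)
    return grad_f(mu + w).mean(axis=0)


def zero_order_estimate(mu, cov, num_samples, rng):
    """Average [f(mu + w_i) - f(mu)] * Sigma^{-1} w_i (function values only)."""
    w = rng.multivariate_normal(np.zeros_like(mu), cov, size=num_samples)
    diffs = f(mu + w) - f(mu)  # subtracting f(mu) is a variance-reducing baseline
    scaled_w = np.linalg.solve(cov, w.T).T  # rows are Sigma^{-1} w_i
    return (diffs[:, None] * scaled_w).mean(axis=0)


rng = np.random.default_rng(0)
mu = np.array([0.3, -1.0])
cov = np.array([[0.04, 0.01], [0.01, 0.09]])

exact = np.cos(mu) * np.exp(-0.5 * np.diag(cov))
first = first_order_estimate(mu, cov, 200_000, rng)
zero = zero_order_estimate(mu, cov, 200_000, rng)
print("exact      ", exact)
print("first-order", first)
print("zero-order ", zero)
```

On a smooth objective like this one, both estimates should agree with the closed form up to sampling noise.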

## Lessons from stochastic optimization

1. The two gradient estimates converge to the same quantity under sufficient regularity conditions.

2. Convergence rate scales directly with the variance of the estimators; the zero-order estimator often has higher variance.

But the regularity conditions aren't met in contact discontinuities, leading to a biased first-order estimator.

Often, but not always.

## Example: The Heaviside function

$$\frac{\partial f(x)}{\partial x} = 0$$ almost everywhere!

$$\Rightarrow \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu} = 0 \not\approx \frac{\partial}{\partial \mu} E_\mu [f(x)]$$

First-order estimator is biased.

Zero-order estimator is (still) unbiased.
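The Heaviside case can be checked numerically. This sketch assumes scalar noise with standard deviation `sigma` and models the simulator's derivative as identically zero, which is what differentiating a step function returns almost everywhere. The reference value is the Gaussian density, $$\frac{\partial}{\partial \mu} \Phi(\mu/\sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\mu^2 / (2\sigma^2)}$$.

```python
import math

import numpy as np


def heaviside(x):
    """Heaviside step function: 1 for x >= 0, else 0."""
    return np.where(np.asarray(x) >= 0.0, 1.0, 0.0)


def heaviside_grad(x):
    """Pointwise derivative of the step function: zero almost everywhere."""
    return np.zeros_like(np.asarray(x, dtype=float))


mu, sigma, num_samples = 0.2, 0.5, 200_000
rng = np.random.default_rng(0)
w = rng.normal(0.0, sigma, size=num_samples)

# First-order: average of pointwise derivatives at the perturbed points.
first_order = float(np.mean(heaviside_grad(mu + w)))

# Zero-order: average of [f(mu + w) - f(mu)] * w / sigma^2.
zero_order = float(np.mean((heaviside(mu + w) - heaviside(mu)) * w) / sigma**2)

# Gradient of the smoothed objective Phi(mu / sigma).
true_grad = math.exp(-0.5 * (mu / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

print(f"first-order={first_order:.4f}  zero-order={zero_order:.4f}  true={true_grad:.4f}")
```

No number of samples repairs the first-order estimate here: every sample contributes exactly zero.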

## What about smooth (but stiff) approximations?

• Continuous yet stiff approximations look like strict discontinuities in the finite-sample regime.
• In the paper, we formalize "empirical bias" to capture this.
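A sketch of the finite-sample effect, assuming a very stiff sigmoid (`stiffness = 1e6`, an illustrative value) as the continuous approximation of the step. The first-order estimator is unbiased in expectation for this smooth function, but with 100 samples per estimate almost no sample lands in the narrow region where the gradient is nonzero, so a typical estimate is close to zero.

```python
import math

import numpy as np


def stiff_sigmoid(x, stiffness):
    """Numerically stable logistic function of stiffness * x."""
    return 0.5 * (1.0 + np.tanh(0.5 * stiffness * x))


def stiff_sigmoid_grad(x, stiffness):
    """Derivative of stiff_sigmoid with respect to x."""
    s = stiff_sigmoid(x, stiffness)
    return stiffness * s * (1.0 - s)


def first_order_estimate(mu, sigma, stiffness, num_samples, rng):
    w = rng.normal(0.0, sigma, size=num_samples)
    return float(np.mean(stiff_sigmoid_grad(mu + w, stiffness)))


def zero_order_estimate(mu, sigma, stiffness, num_samples, rng):
    w = rng.normal(0.0, sigma, size=num_samples)
    diffs = stiff_sigmoid(mu + w, stiffness) - stiff_sigmoid(mu, stiffness)
    return float(np.mean(diffs * w) / sigma**2)


mu, sigma, stiffness = 0.2, 0.5, 1e6
num_trials, samples_per_trial = 200, 100
rng = np.random.default_rng(0)

first = [first_order_estimate(mu, sigma, stiffness, samples_per_trial, rng)
         for _ in range(num_trials)]
zero = [zero_order_estimate(mu, sigma, stiffness, samples_per_trial, rng)
        for _ in range(num_trials)]

# For very large stiffness the smoothed gradient approaches the Heaviside value.
approx_true_grad = math.exp(-0.5 * (mu / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
median_first = float(np.median(first))
mean_zero = float(np.mean(zero))
print(f"median first-order={median_first:.2e}  mean zero-order={mean_zero:.3f}  "
      f"reference={approx_true_grad:.3f}")
```

The median is reported for the first-order estimates because their mean is dominated by the rare trial in which a sample falls inside the stiff region.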

## First-order estimates can also have high variance

e.g. with stiff contact models (large gradient $$\Rightarrow$$ high variance)
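A rough numerical check of this claim, again assuming a sigmoid contact-like nonlinearity with an adjustable `stiffness` (an illustrative stand-in for a stiff contact model). The per-sample variance of the first-order estimator grows with stiffness, while the zero-order estimator only sees bounded function values.

```python
import numpy as np


def sigmoid(x, stiffness):
    """Numerically stable logistic function of stiffness * x."""
    return 0.5 * (1.0 + np.tanh(0.5 * stiffness * x))


def sigmoid_grad(x, stiffness):
    """Derivative of sigmoid with respect to x."""
    s = sigmoid(x, stiffness)
    return stiffness * s * (1.0 - s)


def per_sample_variances(mu, sigma, stiffness, num_samples, rng):
    """Empirical variance of one first-order and one zero-order sample."""
    w = rng.normal(0.0, sigma, size=num_samples)
    first_order = sigmoid_grad(mu + w, stiffness)
    zero_order = (sigmoid(mu + w, stiffness) - sigmoid(mu, stiffness)) * w / sigma**2
    return float(np.var(first_order)), float(np.var(zero_order))


rng = np.random.default_rng(0)
variances = {
    stiffness: per_sample_variances(0.0, 0.5, stiffness, 400_000, rng)
    for stiffness in (1.0, 10.0, 100.0)
}
for stiffness, (var_first, var_zero) in variances.items():
    print(f"stiffness={stiffness:6.1f}  var(first-order)={var_first:8.4f}  "
          f"var(zero-order)={var_zero:8.4f}")
```

In this setup the first-order estimator has the lower variance for the compliant model and the higher variance for the stiff one.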

## Summary so far

First-order estimators often have lower variance than zero-order estimators. But they have some pathologies:

• Bias/empirical bias around (near) discontinuities
• High variance from stiffness

Zero-order estimators are robust in these regimes.

This may explain the experimental success of zero-order methods in contact-rich RL.

## The best of both worlds?

Define $$\alpha$$-order gradient estimate as

$$\bar\nabla^\alpha F(x) = \alpha \underbrace{\bar\nabla^1 F(x)}_{\text{first-order estimate}} + (1-\alpha) \underbrace{\bar\nabla^0 F(x)}_{\text{zero-order estimate}} , \qquad 0 \le \alpha \le 1$$

We give an algorithm to choose $$\alpha$$ automatically based on the empirical variance (+ a trust region using empirical bias).
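One simplified reading of this idea, not the paper's exact algorithm: pick the $$\alpha$$ that minimizes the empirical variance of the blend (inverse-variance weighting), then shrink it whenever the two estimates disagree by more than a threshold. The function name `alpha_order_estimate` and the parameter `max_discrepancy` are hypothetical names introduced for this sketch.

```python
import numpy as np


def alpha_order_estimate(first_order_samples, zero_order_samples, max_discrepancy):
    """Blend per-sample gradient estimates of shape (K, n); returns (gradient, alpha)."""
    num_samples = first_order_samples.shape[0]
    g1 = first_order_samples.mean(axis=0)
    g0 = zero_order_samples.mean(axis=0)
    # Empirical variance of each sample mean (summed over coordinates).
    var1 = first_order_samples.var(axis=0, ddof=1).sum() / num_samples
    var0 = zero_order_samples.var(axis=0, ddof=1).sum() / num_samples
    # Minimizer of alpha^2 * var1 + (1 - alpha)^2 * var0.
    alpha = var0 / (var0 + var1) if (var0 + var1) > 0.0 else 1.0
    # Trust region: the zero-order estimate is the unbiased reference, so cap
    # alpha when the first-order estimate strays too far from it.
    gap = float(np.linalg.norm(g1 - g0))
    if alpha * gap > max_discrepancy:
        alpha = max_discrepancy / gap
    return alpha * g1 + (1.0 - alpha) * g0, float(alpha)


rng = np.random.default_rng(0)
sigma, num_samples = 0.5, 20_000
w = rng.normal(0.0, sigma, size=(num_samples, 1))

# Smooth objective f(x) = sin(x): the first-order samples are trustworthy.
mu = 0.3
smooth_first = np.cos(mu + w)
smooth_zero = (np.sin(mu + w) - np.sin(mu)) * w / sigma**2
grad_smooth, alpha_smooth = alpha_order_estimate(smooth_first, smooth_zero, 0.1)

# Heaviside objective: the first-order samples are all zero (and biased).
mu = 0.2
step_first = np.zeros_like(w)
step_zero = ((mu + w >= 0.0).astype(float) - 1.0) * w / sigma**2
grad_step, alpha_step = alpha_order_estimate(step_first, step_zero, 0.1)

print(f"smooth:    alpha={alpha_smooth:.3f}  grad={grad_smooth[0]:.3f}")
print(f"heaviside: alpha={alpha_step:.3f}  grad={grad_step[0]:.3f}")
```

Without the cap, the zero-variance first-order samples in the Heaviside case would drive $$\alpha$$ to 1 and return a gradient of exactly zero; the discrepancy check is what pushes the weight back toward the zero-order estimate.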

## Force at a distance

Smoothing of time-stepping contact model

## Deterministic smoothing - force at a distance

Global Planning for Contact-Rich Manipulation via Local Smoothing of Quasi-dynamic Contact Models

Tao Pang, H. J. Terry Suh, Lujie Yang, and Russ Tedrake

Available at: https://arxiv.org/abs/2206.10787

Establish equivalence between randomized smoothing and a (deterministic/differentiable) force-at-a-distance contact model.

$$x_{n+1} = f(x_n, u_n) \approx A(x-x_0) + B(u-u_0) + c, \qquad A = \frac{\partial f}{\partial x},\quad B = \frac{\partial f}{\partial u},\quad c = f(x_0, u_0)$$

Smoothed gradients (under distribution $$\rho$$) of $$E_{w \sim \rho}[f(x+w_x, u+w_u)]$$:

$$A_\rho = \frac{\partial}{\partial x} E_{w \sim \rho}[f(x+w_x, u+w_u)],\quad B_\rho = \frac{\partial}{\partial u}E_{w \sim \rho}[f(x+w_x, u+w_u)],\quad c_\rho = E_{w \sim \rho}[f(x_0+w_x, u_0+w_u)]$$

• $$x_{n+1} = f(x_n, u_n)$$ : the solution of an optimization under contact complementarity constraints.

• Relaxed problem: move the hard complementarity constraints into the objective (e.g. via a log-barrier penalty term)
• Results in force at a distance
• For simple problems, we show that each barrier function corresponds with a choice of $$\rho$$ and vice versa.
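A toy 1-D instance, chosen here for illustration and not taken from the paper: a point commanded to position $$u$$ above a wall at $$x = 0$$, so the hard-contact update is $$x = \arg\min_x \tfrac{1}{2}(x-u)^2$$ s.t. $$x \ge 0$$, i.e. $$x = \max(u, 0)$$. Replacing the constraint with $$-\kappa \log x$$ gives $$x = \tfrac{1}{2}\left(u + \sqrt{u^2 + 4\kappa}\right)$$ with contact force $$\lambda = \kappa / x > 0$$ at any distance. Differentiating that solution twice in $$u$$ yields a density $$\rho(w) = 2\kappa / (w^2 + 4\kappa)^{3/2}$$, and the sketch below checks numerically that smoothing $$\max(u + w, 0)$$ under this $$\rho$$ reproduces the barrier solution. All function names and the value of `kappa` are assumptions.

```python
import math

import numpy as np


def hard_contact(u: float) -> float:
    """argmin_x 0.5 * (x - u)^2 subject to x >= 0."""
    return max(u, 0.0)


def barrier_contact(u: float, kappa: float) -> float:
    """argmin_x 0.5 * (x - u)^2 - kappa * log(x), from x - u - kappa / x = 0."""
    return 0.5 * (u + math.sqrt(u * u + 4.0 * kappa))


def barrier_force(u: float, kappa: float) -> float:
    """Contact force lambda = kappa / x: nonzero even when x > 0."""
    return kappa / barrier_contact(u, kappa)


def implied_density(w, kappa: float):
    """Second derivative of barrier_contact in u, read as a noise density rho."""
    return 2.0 * kappa / (w * w + 4.0 * kappa) ** 1.5


def randomized_smoothing(u: float, kappa: float,
                         half_width: float = 200.0, num_points: int = 400_001) -> float:
    """E_{w ~ rho}[max(u + w, 0)] by trapezoidal quadrature on a truncated grid."""
    w = np.linspace(-half_width, half_width, num_points)
    integrand = np.maximum(u + w, 0.0) * implied_density(w, kappa)
    dw = w[1] - w[0]
    return float(dw * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))


kappa = 0.01
for u in (-0.5, 0.0, 0.5):
    print(f"u={u:+.1f}  hard={hard_contact(u):.4f}  "
          f"barrier={barrier_contact(u, kappa):.4f}  "
          f"smoothed={randomized_smoothing(u, kappa):.4f}  "
          f"force={barrier_force(u, kappa):.4f}")
```

The small residual between the last two columns comes from truncating the heavy-tailed density at `half_width`.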

• RRT distance metric / Trajectory optimization using (differentiable) quasi-dynamic model.

• RL uses stochastic optimal control / smooths discontinuities
• Here we need $$\frac{\partial}{\partial \theta} E_w[f(\theta, w)]$$, not just $$\frac{\partial f}{\partial \theta}$$
• First-order estimators have some pathologies with stiffness/discontinuities; zero-order is robust.
• $$\alpha$$-order estimator can achieve faster convergence + robust performance.
• Examining smoothing for simple systems reveals a deterministic equivalent (e.g. force at a distance)
• Now $$\frac{\partial f}{\partial \theta}$$ is all you need
• Enabled RRT / trajectory opt. for dexterous hands

## Beware "artificial" discontinuities

My claim: Subtle interactions between the collision and physics engines can cause artificial discontinuities

(sometimes with dramatic results)

Understanding this requires a few steps:

1. Numerical methods must deal with overlapping geometry.
2. Standard approaches summarize the contact forces / constraints at one or more points.
3. It is effectively impossible to do this without introducing (potentially severe) discontinuities.
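A small illustration of step 3, assuming one common heuristic for two overlapping axis-aligned boxes: summarize the contact with a single normal along the axis of least penetration. This is a generic sketch with hypothetical names, not the collision engine of any particular simulator. A tiny change in configuration flips the force direction by 90 degrees.

```python
def point_contact_normal(overlap_x: float, overlap_y: float):
    """Pick the separating direction along the axis with the smaller overlap."""
    if overlap_x < overlap_y:
        return (1.0, 0.0), overlap_x
    return (0.0, 1.0), overlap_y


def penalty_force(overlap_x: float, overlap_y: float, stiffness: float = 1000.0):
    """Penalty force: stiffness * penetration depth along the chosen normal."""
    normal, depth = point_contact_normal(overlap_x, overlap_y)
    return (stiffness * depth * normal[0], stiffness * depth * normal[1])


# Sweep the x-overlap across the y-overlap; the geometry changes by 0.002.
force_before = penalty_force(overlap_x=0.099, overlap_y=0.100)
force_after = penalty_force(overlap_x=0.101, overlap_y=0.100)
print("before:", force_before)  # force along x
print("after: ", force_after)   # force along y
```

The force magnitude barely changes across the switch, but its direction jumps, which is the kind of artificial discontinuity a gradient-based optimizer cannot see through.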

## Rich collision geometries

Green arrow is the force on the red box due to the overlap with the blue box.

## Multi-point contact

Many heuristics for using multiple points...

## "Hydroelastic contact" as implemented in Drake

major contributions from Damrong Guoy, Sean Curtis, Rick Cory, Alejandro Castro, ...

## "Hydroelastic contact" as implemented in Drake

Red box is rigid, blue box is soft.

## "Hydroelastic contact" as implemented in Drake

Both boxes are soft.

## Point contact vs hydroelastic

Point contact (discontinuous) vs hydroelastic (continuous)

Hydroelastic is

• more expensive than point contact
• (much) less expensive than finite-element models

State-space (for simulation, planning, control) is the original rigid-body state.

## "Hydroelastic contact" as implemented in Drake

Point contact and multi-point contact can produce qualitatively wrong behavior.

Hydroelastic often resolves it.

## Example: Simulating LEGO® block mating

Manually-curated point contacts

Hydroelastic contact surfaces

Stable and symmetrical hydroelastic forces

Before

Now


## The corner cases

Point contact

Hydroelastic contact

the frictionless case

## The corner cases

Point contact (no friction)

Hydroelastic (no friction)
