Smoothing Dynamics for Planning through Contact

Is "Deep RL" solving problems we couldn't solve before?

If yes, why (precisely)?

Claim: We can't model the real world

Real-world robustness trained only on (simple) models!

Claim: From RL => Deep RL

Deep learning theory (for supervised learning):

  • Overparameterization (training error=0)
  • Implicit regularization

Are we really doing "Deep RL"?

  • Most papers use ~3-layer MLPs with ~256 hidden units.  ANYmal's policy was {256, 160, 128}.
  • Perception layers are deep, but often trained separately.

Certainly the deep learning ecosystem has helped! (big compute, Adam, weight initializations, hyperparameter searches, ...)

Direct policy search (vs e.g. motion planning)

Classic control problems can be solved with policy gradient

More direct path to (dynamic) output feedback policies (aka "pixels to torques")

These optimization landscapes are not convex, but (e.g. for LQR) have no spurious local minima.
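A toy check of that claim (my example, not from the talk: a hypothetical scalar discrete-time LQR). The cost \(J(k)\) below is non-convex in the gain \(k\), yet plain gradient descent from any stabilizing gain reaches the global optimum.

    import numpy as np

    # Scalar discrete-time LQR: x_next = a*x + b*u, policy u = -k*x.
    # J(k) is non-convex in k, but has no spurious local minima on the
    # stabilizing set |a - b*k| < 1.
    a, b, q, r = 1.2, 1.0, 1.0, 0.1

    def cost(k):
        s = a - b * k                            # closed-loop pole
        if abs(s) >= 1.0:
            return np.inf                        # unstable => infinite cost
        return (q + r * k**2) / (1.0 - s**2)     # sum_t (q + r*k^2) x_t^2, x_0 = 1

    def grad(k, eps=1e-6):
        return (cost(k + eps) - cost(k - eps)) / (2 * eps)

    k = 0.5                                      # any stabilizing initial gain
    for _ in range(5000):
        k -= 1e-2 * grad(k)
    print(k, cost(k))                            # converges to the global LQR optimum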

Focus for today

  • Stochastic gradient descent can smooth the discontinuities of multibody contact.
  • We can extract this idea and use it for trajectory optimization, RRT, etc.
  • Stochasticity is not essential (deterministic smoothing works, too).

Terry Suh

Tao Pang

Randomized Smoothing

\begin{gathered} \min_\theta f(\theta) \end{gathered}

vs

\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}
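In the additive-noise case, \(f(\theta, w) = f(\theta + w)\), both the smoothed objective and its gradient can be estimated by Monte Carlo. A minimal numpy sketch (function names are mine):

    import numpy as np

    def smoothed_value_and_grad(f, theta, sigma, n_samples=1000, seed=0):
        """Estimate F(theta) = E_w[f(theta + w)], w ~ N(0, sigma^2 I),
        and its zero-order (REINFORCE-style) gradient, by Monte Carlo."""
        rng = np.random.default_rng(seed)
        w = sigma * rng.standard_normal((n_samples, theta.size))
        fw = np.array([f(theta + wi) for wi in w])
        # grad F = E[(f(theta + w) - f(theta)) w] / sigma^2; subtracting the
        # baseline f(theta) reduces variance without biasing the estimate.
        g = ((fw - f(theta))[:, None] * w).mean(axis=0) / sigma**2
        return fw.mean(), g

    # Smoothing f(x) = |x|: the smoothed gradient is well-defined at the kink.
    F0, g0 = smoothed_value_and_grad(lambda x: np.abs(x).sum(), np.zeros(1), 0.1)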

Randomized Smoothing for Multibody Contact

Convex quasi-dynamic time-stepping model

  • State space: \( q_a, q_u \) for actuated / unactuated DOF
  • Input, \(u\), is commanded position of actuated joints
  • Assume robot is impedance (stiffness) controlled, yielding:
\begin{aligned} h K_a \left(q_a + \delta q_a - u \right) &= h\tau_A + \sum_i (J_a[i])^\intercal \lambda_i, \\ \left( \frac{1}{h} M_u \right) \delta q_u &= h\tau_U + \sum_i (J_u[i])^\intercal \lambda_i, \end{aligned}
\begin{aligned} \min_{\delta q} \quad & \frac{1}{2} \delta q^\intercal \mathbf{Q} \delta q + b^\intercal \delta q, \\ \text{subject to} \quad & J_i {\delta q} + \begin{bmatrix} \phi_i \\ 0_2 \end{bmatrix} \in \mathcal{K}_i^\star, \qquad \text{(dual friction cone)}\\ & \mathbf{Q} \coloneqq \begin{bmatrix} M_u/h & 0 \\ 0 & h K_a \end{bmatrix}, \; b \coloneqq - h\begin{bmatrix} \tau_U \\ K_a(u - q_a) + \tau_A \end{bmatrix}, \end{aligned}

The result is an SOCP: in the equations above, \(\tau_U, \tau_A\) collect the gravity terms, \(\lambda_i\) are the contact forces (impulses), \(M_u\) is the unactuated mass matrix, and \(K_a\) is the controller stiffness.
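A minimal sketch of one such step, for a 1-DOF toy: an actuated finger pushing an unactuated box along a line, with frictionless contact so the cone constraint reduces to non-penetration. All numbers are made up, and cvxpy stands in for whatever conic solver you prefer:

    import numpy as np
    import cvxpy as cp

    h, K_a, M_u = 0.1, 100.0, 1.0      # time step, stiffness, box mass
    q_u, q_a = 0.0, -0.05              # box at the origin, finger 5 cm to its left
    u = 0.02                           # commanded finger position, into the box
    phi = q_u - q_a                    # signed distance (gap)

    dq = cp.Variable(2)                # (dq_u, dq_a)
    Q = np.diag([M_u / h, h * K_a])
    b = -h * np.array([0.0, K_a * (u - q_a)])   # tau_U = tau_A = 0 here
    J = np.array([1.0, -1.0])          # contact Jacobian rows for (dq_u, dq_a)

    prob = cp.Problem(cp.Minimize(0.5 * cp.quad_form(dq, Q) + b @ dq),
                      [J @ dq + phi >= 0])      # non-penetration
    prob.solve()
    print(dq.value)   # ~ [0.01, 0.06]: the finger reaches the box and pushes it 1 cm

Working the KKT conditions by hand for these numbers gives the same answer: the finger stops short of its commanded position because the contact impulse balances the stiffness force.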

Trajectory optimization via smoothed iLQR/iMPC

Linearizing a smoothed function

\begin{gathered} F(x) = E_w\left[ f(x + w) \right] \\ F(x) \approx A(x - x_0) + b \end{gathered}
\min_{A,b} \frac{1}{2} E_w\left[ \| Aw + b - f(x_0 + w) \|^2 \right]

This can be approximated by (zero-order or first-order) Monte Carlo gradient estimation, as in RL.
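A zero-order sketch of that least-squares fit (hypothetical black-box dynamics \(f: \mathbb{R}^n \to \mathbb{R}^m\); a first-order variant would regress on sampled gradients instead):

    import numpy as np

    def smoothed_linearization(f, x0, sigma, n_samples=500, seed=0):
        """Fit F(x) ~= A (x - x0) + b by least squares over sampled perturbations,
        i.e. min_{A,b} E_w ||A w + b - f(x0 + w)||^2 with w ~ N(0, sigma^2 I)."""
        rng = np.random.default_rng(seed)
        n = x0.size
        W = sigma * rng.standard_normal((n_samples, n))
        Y = np.array([f(x0 + w) for w in W])            # (n_samples, m)
        X = np.hstack([W, np.ones((n_samples, 1))])     # regressors [w, 1]
        sol, *_ = np.linalg.lstsq(X, Y, rcond=None)     # (n+1, m)
        A, b = sol[:n].T, sol[n]                        # A: (m, n), b: (m,)
        return A, b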

Randomized smoothing of the quasi-dynamic model gives "force at a distance"

Deterministic smoothing: force at a distance

\begin{aligned} \min_{\delta q} & \frac{1}{2} \delta q^\intercal \mathbf{Q} \delta q + b^\intercal \delta q \\ &- \frac{1}{\kappa} \sum_i \log \left[\frac{(J_n[i] \delta q + \phi_i)^2}{\mu_i^2} - (J_t[i]\delta q)^\intercal J_t[i]\delta q \right] \end{aligned}

Log-barrier penalty method

In simple cases, one can establish equivalence with the randomized smoothing (from RL).
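The same 1-DOF toy as above, with the barrier in place of the hard constraint, shows the force at a distance directly: the box moves even though the commanded motion stops short of contact, and the effect vanishes as \(\kappa \to \infty\). (Normal contact only, so the cone term reduces to a log of the gap; numbers made up.)

    import numpy as np
    from scipy.optimize import minimize

    h, K_a, M_u = 0.1, 100.0, 1.0
    q_u, q_a, u = 0.0, -0.05, -0.02    # command stops 2 cm short of the box
    phi = q_u - q_a
    Q = np.diag([M_u / h, h * K_a])
    b = -h * np.array([0.0, K_a * (u - q_a)])
    J = np.array([1.0, -1.0])

    def objective(dq, kappa):
        gap = J @ dq + phi
        if gap <= 0:
            return np.inf                  # outside the barrier's domain
        return 0.5 * dq @ Q @ dq + b @ dq - np.log(gap) / kappa

    for kappa in [10.0, 100.0, 1000.0]:
        res = minimize(objective, x0=np.zeros(2), args=(kappa,), method="Nelder-Mead")
        print(kappa, res.x)   # dq_u > 0 ("force at a distance"), shrinking with kappa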

RRT distance metrics and extend operators

Idea: Grow RRT only in unactuated DOFs; distance metric based on smoothed linearization

\begin{aligned} d_{\rho,\gamma}^\mathrm{u}(q;\bar{q}) & \coloneqq \|q^\mathrm{u} - \mu^\mathrm{u}_\rho\|_{(\mathbf{\Sigma}^{\mathrm{u}}_{\rho,\gamma})^{-1}}, \\ \mathbf{\Sigma}_{\rho,\gamma}^\mathrm{u} & \coloneqq \mathbf{B}^\mathrm{u}_\rho(\bar{q},\bar{q}^\mathrm{a})\mathbf{B}^\mathrm{u}_\rho(\bar{q},\bar{q}^\mathrm{a})^\intercal + \gamma\mathbf{I}_{n_\mathrm{u}},\\ \mu_\rho^\mathrm{u} & \coloneqq c_\rho^\mathrm{u}(\bar{q},\bar{q}^\mathrm{a}). \end{aligned}
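A sketch of the metric (here \(\mathbf{B}^\mathrm{u}_\rho\) is the unactuated block of the smoothed linearization from the previous section; the helper name is mine):

    import numpy as np

    def rrt_distance(q_u, mu_u, B_u, gamma=1e-3):
        """Mahalanobis distance on the unactuated DOFs:
        d = ||q_u - mu_u||_{Sigma^{-1}}, Sigma = B_u B_u^T + gamma I."""
        Sigma = B_u @ B_u.T + gamma * np.eye(len(q_u))
        d = q_u - mu_u
        return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))

Directions in which the smoothed dynamics can actually move the object (large singular values of \(\mathbf{B}^\mathrm{u}_\rho\)) count as near; directions it cannot reach count as far, so the tree extends along feasible contact interactions.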

Discussion

  • I think the quasi-dynamic model is very useful; we should probably be exploring it here.
  • Graphs of Convex Sets brings in graph optimization and can consume the discontinuities more directly.  Q: is it the right tool for a dexterous hand, or is smoothing more natural?
  • I still want output feedback (control without an explicit state representation); this is what we are studying in Intuitive.
  • The next wave of theoretical RL results will help us get to the heart of what is working and when.  I don't think PPO is the final answer.

2022 International Conference on Machine Learning (ICML); accepted as a long talk.

Will be submitted (and available on arXiv) very soon!

For more details:

Motion planning through contact via local smoothing, TRI dexterous group meeting.