# Is "Deep RL" solving problems we couldn't solve before?

If yes, why (precisely)?

## Claim: From RL => Deep RL

Deep learning theory (for supervised learning):

• Overparameterization (training error=0)
• Implicit regularization

Are we really doing "Deep RL"?

• Most papers use a ~3-layer MLP with ~256 hidden units. ANYmal's policy was {256, 160, 128}.
• Perception layers are deep, but often trained separately.

Certainly the deep learning ecosystem has helped! (big compute, Adam, weight initializations, hyperparameter searches, ...)

## Direct policy search (vs e.g. motion planning)

Classic control problems can be solved with policy gradient

More direct path to (dynamic) output feedback policies (aka "pixels to torques")
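As a toy illustration (not from the talk), direct policy search on a classic control problem can be as simple as tuning a linear feedback gain with a zero-order (weight-perturbation) gradient estimate; the scalar system and cost below are made up for the sketch:

```python
import numpy as np

def rollout_cost(k, x0=1.0, T=20):
    """Cost of the linear policy u = -k*x on the scalar system x[t+1] = x[t] + u[t]."""
    x, cost = x0, 0.0
    for _ in range(T):
        u = -k * x
        cost += x**2 + 0.1 * u**2
        x = x + u
    return cost

# Zero-order policy-gradient estimate, averaged over a small batch of
# Gaussian perturbations of the policy parameter.
rng = np.random.default_rng(0)
k, sigma, lr = 0.0, 0.1, 2e-3
for _ in range(100):
    w = sigma * rng.standard_normal(10)
    grad = np.mean([(rollout_cost(k + wi) - rollout_cost(k)) * wi for wi in w]) / sigma**2
    k -= lr * grad

assert rollout_cost(k) < rollout_cost(0.0)  # learned gain beats doing nothing
```

No dynamics model is differentiated; only rollouts of the cost are needed, which is exactly what makes this "direct" policy search.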

## Focus for today

• Stochastic gradient descent can smooth the discontinuities of multibody contact.

• We can extract this idea, and use it for trajectory optimization, RRT, etc.

• Stochasticity is not essential (deterministic smoothing works, too).

Terry Suh

Tao Pang

## Randomized Smoothing

\begin{gathered} \min_\theta f(\theta) \end{gathered}

vs

\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}
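A one-dimensional sketch of why the smoothed objective helps (the step function and constants are my own example): $f$ is discontinuous, so its gradient is zero almost everywhere, but the Monte Carlo score-function estimator of $\nabla_\theta E_w[f(\theta+w)]$ recovers a useful nonzero gradient:

```python
import numpy as np

f = lambda x: (x > 0).astype(float)   # discontinuous step: gradient is 0 a.e.

rng = np.random.default_rng(0)
sigma, theta = 0.5, -0.3
w = sigma * rng.standard_normal(100_000)

# Smoothed objective F(theta) = E_w[f(theta + w)]  (here, a Gaussian CDF).
F = f(theta + w).mean()
# Score-function estimate of dF/dtheta: E[f(theta + w) * w] / sigma^2.
dF = (f(theta + w) * w).mean() / sigma**2

assert dF > 0.5   # smoothing recovers a nonzero gradient at the discontinuity
```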

## Convex quasi-dynamic time-stepping model

• State space: $$q_a, q_u$$ for actuated / unactuated DOF
• Input, $$u$$, is commanded position of actuated joints
• Assume robot is impedance (stiffness) controlled, yielding:
\begin{aligned} h K \left(q_a + \delta q_a - u \right) &= h\tau_A + \sum_i (J_a[i])^\intercal \lambda_i, \\ \left( \frac{1}{h} M_u \right) \delta q_u &= h\tau_U + \sum_i (J_u[i])^\intercal \lambda_i, \end{aligned}
\begin{aligned} \min_{\delta q} \quad & \frac{1}{2} \delta q^\intercal \mathbf{Q} \delta q + b^\intercal \delta q, \\ \text{subject to} \quad & J_i {\delta q} + \begin{bmatrix} \phi_i \\ 0_2 \end{bmatrix} \in \mathcal{K}_i^\star, \qquad \text{(dual friction cone)}\\ & \mathbf{Q} \coloneqq \begin{bmatrix} M_u/h & 0 \\ 0 & h K_a \end{bmatrix}, \; b \coloneqq - h\begin{bmatrix} \tau_U \\ K_a(u - q_a) + \tau_A \end{bmatrix}, \end{aligned}

The program above is an SOCP; in the model, $\tau$ are gravity terms, $\lambda_i$ the contact forces, $M_u$ the mass matrix, and $K$ the joint stiffness.
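A minimal sketch of one quasi-dynamic step, under heavy simplifications I am introducing (1 DOF each, frictionless, so the dual friction cone reduces to a non-penetration inequality; no gravity): a stiffness-controlled finger at $q_a$ is commanded past a box at $q_u$, and the QP decides how far each moves:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D frictionless quasi-dynamic step: actuated finger (q_a, stiffness K)
# commanded to position u, unactuated box (q_u, mass M) to its right,
# signed gap phi = q_u - q_a.  Decision variable dq = [dq_u, dq_a].
h, K, M = 0.1, 100.0, 1.0
q_a, q_u, u = 0.0, 0.2, 0.5          # command drives the finger into the box

Q = np.diag([M / h, h * K])          # the Q matrix from the slide, 1-DOF each
b = -h * np.array([0.0, K * (u - q_a)])   # no gravity/external torques here
phi = q_u - q_a

res = minimize(lambda dq: 0.5 * dq @ Q @ dq + b @ dq,
               x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda dq: phi + dq[0] - dq[1]}])
dq_u, dq_a = res.x
assert dq_u > 0.0                    # the box gets pushed
assert abs((q_u + dq_u) - (q_a + dq_a)) < 1e-4   # contact stays closed
```

With these numbers the analytic solution is $\delta q_a = 0.35$, $\delta q_u = 0.15$: the finger stops short of its command because the contact force balances the impedance spring.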

## Trajectory optimization via smoothed iLQR/iMPC

Linearizing a smoothed function

\begin{gathered} F(x) = E_w\left[ f(x + w) \right] \\ F(x) \approx A(x - x_0) + b \end{gathered}
\min_{A,b} \frac{1}{2} E_w\left[ \| Aw + b - f(x_0 + w) \|^2 \right]

This can be approximated by (zero-order or first-order) Monte Carlo gradient estimation, as seen in RL.
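A zero-order version of this least-squares linearization is just a linear regression of sampled function values on the perturbations; the nonsmooth test function $f = |\cdot|$ and the constants are my own example. At the kink, the smoothed slope is ~0 and the offset is $E|w| = \sigma\sqrt{2/\pi}$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.abs                       # nonsmooth at 0; the subgradient there is ill-defined
x0, sigma, N = 0.0, 1.0, 50_000

# Monte Carlo least-squares linearization: regress f(x0 + w) on w to get (A, b).
w = sigma * rng.standard_normal(N)
X = np.column_stack([w, np.ones(N)])
(A, b), *_ = np.linalg.lstsq(X, f(x0 + w), rcond=None)

assert abs(A) < 0.05                                # smoothed slope at the kink is ~0
assert abs(b - sigma * np.sqrt(2 / np.pi)) < 0.05   # offset matches E|w|
```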

Randomized smoothing of quasi-dynamic model gives "force at a distance"

## Deterministic smoothing - force at a distance

\begin{aligned} \min_{\delta q} & \frac{1}{2} \delta q^\intercal \mathbf{Q} \delta q + b^\intercal \delta q \\ &- \frac{1}{\kappa} \sum_i \log \left[\frac{(J_n[i] \delta q + \phi_i)^2}{\mu_i^2} - (J_t[i]\delta q)^\intercal J_t[i]\delta q \right] \end{aligned}

Log-barrier penalty method

In simple cases, can establish equivalence with the randomized smoothing (from RL)
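Reusing my earlier 1-DOF finger/box toy (again frictionless and with made-up constants), replacing the hard non-penetration constraint with a log-barrier makes the box move even when the commanded motion stops short of contact, i.e. force at a distance:

```python
import numpy as np
from scipy.optimize import minimize

# 1-D finger/box toy with a log-barrier on the contact gap instead of a
# hard constraint.  dq = [dq_u, dq_a]; gap = phi + dq_u - dq_a.
h, K, M, kappa = 0.1, 100.0, 1.0, 100.0
q_a, q_u, u = 0.0, 0.2, 0.1          # command stops SHORT of the box
Q = np.diag([M / h, h * K])
b = -h * np.array([0.0, K * (u - q_a)])
phi = q_u - q_a

def cost(dq):
    gap = phi + dq[0] - dq[1]
    if gap <= 0:                      # stay inside the barrier's domain
        return np.inf
    return 0.5 * dq @ Q @ dq + b @ dq - np.log(gap) / kappa

res = minimize(cost, x0=np.zeros(2), method="Nelder-Mead")
dq_u, dq_a = res.x
assert dq_u > 1e-3                    # the box moves before contact is made
assert dq_a < u - q_a                 # the finger stops short of its command
```

With the hard-constrained model the box would not move at all here ($\delta q_u = 0$); the barrier's gradient through the gap is the deterministic analogue of the smoothed contact force.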

## RRT distance metrics and extend operators

Idea: Grow RRT only in unactuated DOFs; distance metric based on smoothed linearization

\begin{aligned} d_{\rho,\gamma}^\mathrm{u}(q;\bar{q}) & \coloneqq \|q^\mathrm{u} - \mu^\mathrm{u}_\rho\|_{\mathbf{\Sigma}^{\mathrm{u}^{-1}}_{\rho,\gamma}}, \\ \mathbf{\Sigma}_{\rho,\gamma}^\mathrm{u} & \coloneqq \mathbf{B}^\mathrm{u}_\rho(\bar{q},\bar{q}^\mathrm{a})\mathbf{B}^\mathrm{u}_\rho(\bar{q},\bar{q}^\mathrm{a})^\intercal + \gamma\mathbf{I}_{n_\mathrm{u}},\\ \mu_\rho^\mathrm{u} & \coloneqq c_\rho^\mathrm{u}(\bar{q},\bar{q}^\mathrm{a}). \end{aligned}
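A numerical sketch of the metric's effect, with a hypothetical smoothed-linearization output $\mathbf{B}^\mathrm{u}$ (my own numbers): directions the smoothed dynamics can actually move the unactuated DOFs are cheap, while orthogonal directions cost $\sim 1/\sqrt{\gamma}$:

```python
import numpy as np

# B_u and mu stand in for outputs of the smoothed linearization (assumed here).
gamma = 1e-2
B_u = np.array([[1.0], [0.0]])        # the dynamics can only push along x
mu = np.zeros(2)

Sigma = B_u @ B_u.T + gamma * np.eye(2)    # regularized reachability ellipsoid
dist = lambda q: np.sqrt((q - mu) @ np.linalg.solve(Sigma, q - mu))

d_reachable   = dist(np.array([0.1, 0.0]))   # along the pushable direction
d_unreachable = dist(np.array([0.0, 0.1]))   # orthogonal to it

assert d_unreachable > 5 * d_reachable
```

This is what biases the RRT toward extensions the contact dynamics can actually realize.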

## Discussion

• I think the quasi-dynamic model is very useful; we should probably be exploring it here.
• Graphs of Convex Sets brings in graph optimization; it can consume the discontinuities more directly.  Q: is it the right tool for a dexterous hand, or is smoothing more natural?
• I still want output feedback (control without explicit state representation); this is what we are studying in Intuitive.
• Next wave of theoretical RL results will help us get to the heart of what is working and when.  I don't think PPO is the final answer.

2022 International Conference on Machine Learning (ICML), Accepted as Long Talk

will be submitted (and available on arXiv) very soon!

For more details:

By russtedrake

# Motion planning through contact via local smoothing

TRI dexterous group meeting

• 298