Smoothing Techniques for Optimal Control of Highly Contact Rich Systems

H.J. Terry Suh

MIT

Motivation: Why should we care?

Agile & Autonomous Locomotion

Dexterous Manipulation

Whole-Body Loco-Manipulation

"The art of robots making contact where they are not supposed to make contact"

Motivation: What makes this problem difficult?

The Non-Smooth Nature of Contact makes tools from smooth optimization difficult to use.

Making & Breaking Contact

Non-smoothness of Friction

Non-smoothness of Geometry

Motivation: Hybrid Dynamics

Hybrid Dynamics

  • Is appealing because it's a framework describes dynamical systems with mixed continuous and discrete variables.
  • Divides the landscape into piecewise smooth regions, divided by manifolds that cause non-smoothness.
  • General solutions to optimal control of hybrid dynamics involve having to decide which of these pieces to choose, while optimizing within each individual piece. (Mixed-Integer Nonlinear Programming)

Motivation: The Fallacy of Hybrid Dynamics

Contact is non-smooth. But Is it truly "discrete"?

The core thesis of this talk:

Counting contact modes for these kinds of systems results in trillions of modes.

Are we truly thinking of whether we're making contact or not for all possible fingers?

Smoothing Techniques for Non-Smooth Problems

Some non-smooth problems are successfully tackled by smooth approximations without sacrificing much from bias.

Is contact one of these problems?

*Figures taken from Yuxin Chen's slides on "Smoothing for Non-smooth Optimization"

Smoothing in Optimization

We can formally define smoothing as a process of convolution with a smooth kernel,

f(x)
f_\rho(x)\coloneqq \mathbb{E}_{w\sim\rho }f(x + w)

In addition, for purposes of optimization, we are interested in methods that provide easy access to the derivative of the smooth surrogate.

Original Function

Smooth Surrogate

Derivative of the Smooth Surrogate:

\frac{\partial}{\partial x} f_\rho(x) = \frac{\partial}{\partial x} \mathbb{E}_{w\sim\rho }f(x + w)

These provide linearization Jacobians in the setting when f is dynamics, and policy gradients in the setting when f is a value function.

Taxonomy of Smoothing

Case 1. Analytic Smoothing

f_\rho(x)\coloneqq \mathbb{E}_{w\sim\rho }f(x + w)
  • If the original function f and the distribution rho is sufficiently structured, we can also evaluate the smooth surrogate in closed form by computing the integral.
  • This can be analytically differentiated to give the derivative.
  • Commonly used in ML as smooth nonlinearities.

Taxonomy of Smoothing

Case 2. Randomized Smoothing, First Order

\begin{aligned} f_\rho(x) & \coloneqq \mathbb{E}_{w\sim\rho }f(x + w) \\ & = \textstyle\frac{1}{N}\sum^N_{i=1} f(x + w_i) \end{aligned}

 

  • When we write convolution as an expectation, it motivates Monte-Carlo sampling methods that can estimate the value of the smooth surrogate.
  • In order to obtain the derivative, we can use the Leibniz integral rule to exchange the expectation and the derivative.
  • This means we can sample derivatives to approximate the derivative of the sampled function.
  • Requires access to the derivative of the original function f.
  • Also known as the Reparametrization (RP) gradient.
\begin{aligned} \frac{\partial}{\partial x} f_\rho(x) & = \frac{\partial}{\partial x} \mathbb{E}_{w\sim\rho }f(x + w) = \mathbb{E}_{w\sim\rho}\left[\frac{\partial}{\partial x} f(x+w)\right] \\ & \approx \frac{1}{N} \sum^N_{i=1} \frac{\partial}{\partial x}f(x+w_i) \end{aligned}

Taxonomy of Smoothing

Case 2. Randomized Smoothing, First Order

*Figures taken from John Duchi's slides on Randomized Smoothing

Taxonomy of Smoothing

Case 2. Randomized Smoothing, Zeroth-Order

\begin{aligned} f_\rho(x) & \coloneqq \mathbb{E}_{w\sim\rho }f(x + w) \\ & = \textstyle\frac{1}{N}\sum^N_{i=1} f(x + w_i) \end{aligned}

 

  • Interestingly, we can obtain the derivative of the randomized smoothing objective WITHOUT having access to the gradients of f.
  • This gradient is derived from Stein's lemma
  • Known by many names: Likelihood Ratio gradient, Score Function gradient, REINFORCE gradient.
\begin{aligned} \frac{\partial}{\partial x}\mathbb{E}_{w\sim\rho}[f(x+w)] = \mathbb{E}_{w\sim\rho}\left[f(x + w)S(w)^\top\right] \end{aligned}
S(w) = \nabla_w \log \rho(w) = -\frac{\nabla \rho(w)}{\rho(w)}

This seems like it came out of nowhere? How can this be true?

Taxonomy of Smoothing

Rethinking Linearization as a Minimizer.

\min_{\mathbf{J},\mu} \frac{1}{2}\mathbb{E}_{w\sim\rho}\left[\|f(\bar{x} + w) - \mathbf{J}w - \mu\|^2_2\right]

 

  • The linearization of a function provides the best linear model (i.e. up to first order) to approximate the function locally.
  • We could use the same principle for a stochastic function.
  • Fix a point xbar. If we were to sample bunch of f(xbar + w_i) and run a least-squares procedure to find the best linear model, this converges to the linearization of the smooth surrogate.
\begin{aligned} \mathbf{J}^* & = \mathbb{E}_{w\sim \rho}\left[f(\bar{x} + w)w^\top\right]\mathbf{\Sigma}^{-1} \\ \mu^* & = \mathbb{E}_{w\sim\rho}\left[f(\bar{x} + w)\right] \end{aligned}

Also provides a convenient way to compute the gradient in zeroth-order. Just sample and run least-squares!

Tradeoffs between structure and performance.

The generally accepted wisdom: more structure gives more performance.

Analytic smoothing

Randomized Smoothing

First-Order

Randomized Smoothing

Zeroth-Order

 

  • Requires closed-form evaluation of the integral.
  • No sampling required.

 

  • Requires access to first-order oracle (derivative of f).
  • Generally less variance than zeroth-order.

 

  • Only requires zeroth-order oracle (value of f)
  • High variance.

Structure Requirements

Performance / Efficiency

Smoothing of Contact Dynamics

Without going too much into details of multibody contact dynamics, we will use time-stepping, quasidynamic formulation of contact.

  • We assume that velocities die out instantly
  • Inputs to the system are defined by position commands to actuated bodies.
  • The actuated body and the commanded position is connected through an impedance source k.
\begin{aligned} \text{find} \quad & \delta q \\ \text{s.t.} \quad & hk(q + \delta q - u) = \lambda \\ & 0 \leq q + \delta q \perp \lambda \geq 0 \end{aligned}

Equations of Motion (KKT Conditions)

Non-penetration

(Primal feasibility)

Complementary slackness

Dual feasibility

Force Balance

(Stationarity)

 

\begin{aligned} \min_{\delta q}\quad & \frac{1}{2} hk(\delta q)^2 - hk(u - q)\delta q \\ \text{s.t.} \quad & q + \delta q \geq 0 \end{aligned}

Quasistatic QP Dynamics

We can randomize smooth this with first order methods using sensitivity analysis or use zeroth-order randomized smoothing.

 

But can we smooth this analytically?

Barrier (Interior-Point) Smoothing

\begin{aligned} \min_{\delta q}\quad & \frac{1}{2} hk(\delta q)^2 - hk(u - q)\delta q \\ \text{s.t.} \quad & q + \delta q \geq 0 \end{aligned}

Quasistatic QP Dynamics

\begin{aligned} \text{find} \quad & \delta q \\ \text{s.t.} \quad & hk(q + \delta q - u) = \lambda \\ & 0 \leq q + \delta q \perp \lambda \geq 0 \end{aligned}

Equations of Motion (KKT Conditions)

\begin{aligned} \min_{\delta q}\quad & \frac{1}{2} hk(\delta q)^2 - hk(u - q)\delta q - \frac{1}{\kappa}\log(q + \delta q) \end{aligned}

Interior-Point Relaxation of the QP

Equations of Motion (Stationarity)

hk(q + \delta q - u) = \frac{1}{\kappa}\left(\frac{1}{q + \delta q}\right)
\lambda (q + \delta q) = \kappa^{-1}

Impulse

Relaxation of complementarity

"Force from a distance"

What does smoothing do to contact dynamics?

 

  • Both schemes (randomized smoothing and barrier smoothing) provides "force from a distance effects" where the exerted force increases  with distance.
  • Provides gradient information from a distance.
  • In contrast, without smoothing, zero gradients and flat landscapes cause problems for gradient-based optimization.

Is barrier smoothing a form of convolution?

Equivalence of Randomized and Barrier Smoothing.

 

  • For the simple 1D pusher system, it turns out that one can show that barrier smoothing also implements a convolution with the original function and a kernel.
  • This is an elliptical distribution with a "fatter tail" compared to a Gaussian".
\rho(w) = \sqrt{\frac{4 hk\kappa}{(w^\top 4h\kappa w + 4)^3}}

 

Later result shows that there always exists such a kernel for Linear Complementary Systems (LCS).

Optimal Control with Dynamics Smoothing

Replace linearization in iLQR with smoothed linearization

Exact

Smoothed

Smoothing of Value Functions.

Optimal Control thorugh Non-smooth Dynamics

\begin{aligned} \min_{\theta} & \; \bigg[V(\theta) = c_T(x_T) + \sum^{T-1}_{t=0} \gamma^t c(x_t, u_t)\bigg] \\ \text{s.t.} & \; x_{t+1} = f(x_t, u_t) \\ & \; u_t = \pi(x_t, \theta) \end{aligned}
\begin{aligned} \min_{\theta} & \; \bigg[V(\theta) = c_T(x_T) + \sum^{T-1}_{t=0} \gamma^t c(x_t, u_t)\bigg] \\ \text{s.t.} & \; {\color{red} x_{t+1} = \mathbb{E}_{w,v}\left[f(x_t + w_t, u_t+v_t)\right]} \\ & \; u_t = \pi(x_t, \theta) \end{aligned}

Policy Optimization

Cumulative Cost

Dynamics

Policy (can be open-loop)

Dynamics Smoothing

\begin{aligned} \min_\theta V(\theta) \end{aligned}

Value Smoothing

\begin{aligned} \min_{\theta} & \; {\color{red}\mathbb{E}_{w}}\bigg[V(\theta + w) = c_T(x_T) + \sum^{T-1}_{t=0} \gamma^t c(x_t, u_t)\bigg] \\ \text{s.t.} & \; x_{t+1} = f(x_t, u_t) \\ & \; {\color{red} u_t = \pi(x_t, \theta + w)} \end{aligned}
f(x)
f_\rho(x)\coloneqq \mathbb{E}_{w\sim\rho }f(x + w)

Recall that smoothing turns              into                                                       .

Why not just smooth the value function directly and run policy optimization?

Smoothing of Value Functions.

Original Problem

\begin{aligned} \min_\theta V(\theta) \end{aligned}

Long-horizon problems involving contact can have terrible landscapes. 

Smoothing of Value Functions.

Smooth Surrogate

\begin{aligned} \min_\theta \mathbb{E}_w V(\theta + w) \end{aligned}

The benefits of smoothing are much more pronounced in the value smoothing case. 

Beautiful story - noise sometimes regularizes the problem, developing into helpful bias.

How do we take gradients of smoothed value function?

Analytic smoothing

Randomized Smoothing

First-Order

Randomized Smoothing

Zeroth-Order

​​​​​

 

  • Requires differentiability over dynamics, reward, policy.
  • Generally lower variance.

 

  • Only requires zeroth-order oracle (value of f)
  • High variance.

Structure Requirements

Performance / Efficiency

Pretty much not possible.

How do we take gradients of smoothed value function?

First-Order Policy Search with Differentiable Simulation

Policy Gradient Methods in RL (REINFORCE / TRPO / PPO)

 

 

  • Requires differentiability over dynamics, reward, policy.
  • Generally lower variance.

 

  • Only requires zeroth-order oracle (value of f)
  • High variance.

Structure Requirements

Performance / Efficiency

Turns out there is an important question hidden here regarding the utility of differentiable simulators.

Do Differentiable Simulators Give Better Policy Gradients?

Very important question for RL, as it promises lower variance, faster convergence rates, and more sample efficiency.

What do we mean by "better"?

Consider a simple stochastic optimization problem

\min_\theta F(\theta) = \min_\theta \mathbb{E}_{w\sim\mathcal{N}(w;0,\sigma^2)} f(\theta + w)

First-Order Gradient Estimator

\begin{aligned} \nabla_\theta \mathbb{E}_w f(\theta + w) & = \mathbb{E}_w f(\theta + w) \\ & = \frac{1}{N}\sum^N_{i=1}\nabla_\theta f(\theta + w_i) \end{aligned}

Zeroth-Order Gradient Estimator

\begin{aligned} \nabla_\theta \mathbb{E}_w f(\theta + w) & = \frac{1}{\sigma^2}\mathbb{E}_w [f(\theta + w)w] \\ & = \frac{1}{N}\sum^N_{i=1}\frac{1}{\sigma^2} f(\theta + w_i)w_i \end{aligned}

Then, we can define two different gradient estimators.

What do we mean by "better"?

First-Order Gradient Estimator

\begin{aligned} \nabla_\theta \mathbb{E}_w f(\theta + w) & = \mathbb{E}_w f(\theta + w) \\ & = \frac{1}{N}\sum^N_{i=1}\nabla_\theta f(\theta + w_i) \end{aligned}

Zeroth-Order Gradient Estimator

\begin{aligned} \nabla_\theta \mathbb{E}_w f(\theta + w) & = \frac{1}{\sigma^2}\mathbb{E}_w [f(\theta + w)w] \\ & = \frac{1}{N}\sum^N_{i=1}\frac{1}{\sigma^2} f(\theta + w_i)w_i \end{aligned}

Bias

Variance

Common lesson from stochastic optimization:

 

1. Both are unbiased under sufficient regularity conditions

2. First-order generally has less variance than zeroth order.

What happens in Contact-Rich Scenarios?

Bias

Variance

Common lesson from stochastic optimization:

 

1. Both are unbiased under sufficient regularity conditions

2. First-order generally has less variance than zeroth order.

Bias

Variance

Bias

Variance

We show two cases where the commonly accepted wisdom is not true.

1st Pathology: First-Order Estimators CAN be biased.

2nd Pathology: First-Order Estimators can have MORE

variance than zeroth-order.

Bias from Discontinuities

1st Pathology: First-Order Estimators CAN be biased.

\begin{aligned} \nabla_\theta \mathbb{E}_w f(\theta + w) & = \mathbb{E}_w f(\theta + w) = \frac{1}{N}\sum^N_{i=1}\nabla_\theta f(\theta + w_i) \end{aligned}

What's worse: the empirical variance is also zero!

(The estimator is absolutely sure about an estimate that is wrong)

Not just a pathology, could happen quite often in contact.

Empirical Bias leads to High Variance

Perhaps it's a modeling artifact? Contact can be softened.

  • From a low-sample regime, no statistical way to distinguish between an actual discontinuity and a very stiff function.
  • Generally, stiff relaxations lead to high variance. As relaxation converges to true discontinuity, variance blows up to infinity, and suddenly turns into bias!
  • Zeroth-order escapes by always thinking about the cumulative effect over some finite interval.

Variance of First-Order Estimators

2nd Pathology: First-order Estimators CAN have more variance than zeroth-order ones.

\begin{aligned} \textbf{Var}_\theta\left[{\color{blue}\frac{1}{N}\frac{1}{\sigma^2}\sum^N_{i=1}f(\theta + w_i)w_i}\right] \leq \frac{\dim \theta}{N\sigma^2}\max_w\|f(\theta + w)\|^2_2 \end{aligned}
\begin{aligned} \textbf{Var}_\theta\left[{\color{red}\frac{1}{N}\sum^N_{i=1}\nabla_\theta f(\theta + w_i)}\right] \leq \frac{1}{N}\max_w \|\nabla_\theta f(\theta + w)\|^2_2 \end{aligned}

Scales with Gradient

Scales with Function Value

Scales with dimension of decision variables.

High-Variance Events

Case 1. Persistent Stiffness

Case 2. Chaotic

  • Contact modeling using penalty method is a bad idea for differentiable policy search
  • Gradients always has the spring stiffness value!
  • High variance depending on initial condition
  • Zeroth-order always bounded in value, but the gradients can be very high.

Motivating Gradient Interpolation

Bias

Variance

Common lesson from stochastic optimization:

 

1. Both are unbiased under sufficient regularity conditions

2. First-order generally has less variance than zeroth order.

Bias

Variance

Bias

Variance

1st Pathology: First-Order Estimators CAN be biased.

2nd Pathology: First-Order Estimators can have MORE

variance than zeroth-order.

Can we automatically decide which of these categories we fall into based on statistical data?

The Alpha-Ordered Gradient Estimator

Perhaps we can do some interpolation of the two gradients based on some criteria.

\min_{\alpha\in[0, 1]} \textbf{Var}(\bar{\nabla}^{[\alpha]}F(\theta)) = \alpha^2\hat{\sigma}^2_1 + (1-\alpha)^2 \hat{\sigma}_0^2

Previous works attempt to minimize the variance of the interpolated estimator using empirical variance.

Robust Interpolation

\min_{\alpha\in[0, 1]} \textbf{Var}(\bar{\nabla}^{[\alpha]}F(\theta)) = \alpha^2\hat{\sigma}^2_1 + (1-\alpha)^2 \hat{\sigma}_0^2

Thus, we propose a robust interpolation criteria that also restricts the bias of the interpolated estimator.

\begin{aligned} \min_{\alpha\in[0, 1]} \; & \textbf{Var}(\bar{\nabla}^{[\alpha]}F(\theta)) \\ \text{s.t.} \; & \textbf{Bias}(\bar{\nabla}^{[\alpha]}F(\theta)) \leq \gamma \end{aligned}

Robust Interpolation

Robust Interpolation

\begin{aligned} \min_{\alpha\in[0, 1]} \; & \textbf{Var}(\bar{\nabla}^{[\alpha]}F(\theta)) \\ \text{s.t.} \; & \textbf{Bias}(\bar{\nabla}^{[\alpha]}F(\theta)) \leq \gamma \end{aligned}

Implementation

\begin{aligned} \min_{\alpha\in[0, 1]} \; & \alpha^2\hat{\sigma}_1^2 + (1 - \alpha)^2\hat{\sigma}^2_0 \\ \text{s.t.} \; & \varepsilon + \alpha\|\bar{\nabla}^{[1]}F - \bar{\nabla}^{[0]} F \| \leq \gamma \end{aligned}

Confidence Interval on the zeroth-order gradient.

Difference between the gradients.

Key idea: Unit-test the first-order estimate against the unbiased zeroth-order estimate to guarantee correctness probabilistically. .

Results: Ball throwing on Wall

Key idea: Do not commit to zeroth or first uniformly,

but decide coordinate-wise which one to trust more.

Results: Policy Optimization

Able to capitalize on better convergence of first-order methods while being robust to their pitfalls.

Limitations of Smoothing

Contact is non-smooth. But Is it truly "discrete"?

The core thesis of this talk:

The local decisions of where to make contact are better modeled as continuous decisions with some smooth approximations.

My viewpoint so far:

Limitations of Smoothing

Contact is non-smooth. But Is it truly "discrete"?

The core thesis of this talk:

The local decisions of where to make contact are better modeled as continuous decisions with some smooth approximations.

My viewpoint so far:

The remaining "discrete decisions" come not from contact, but from discrete-level decisions during planning.

Smoothing CANNOT handle these high-level discrete decisions.

Limitations of Smoothing

These reveal true discrete "modes" of the decision making process.

Limitations of Smoothing

\theta

Apply negative impulse

to stand up.

Apply positive impulse to bounce on the wall.

Limitations of Smoothing

Can we smooth local contact decisions and efficiently search through high-level discrete decisions?

Our ideal solution

Global Search with Smoothing: Contact-Rich RRT

Motivating Contact-Rich RRT

Sampling-Based Motion Planning is a popular solution in robotics for complex non-convex motion planning

How do we define notions of nearest? 

How do we extend (steer)? 

  •  Nearest states in Euclidean space are not necessarily reachable according to system dynamics (Dubin's car)
  • Typically, kinodynamic RRT solves trajopt
  • Potentially a bit costly.

Reachability-Consistent Distance Metric

Reachability-based Mahalanobis Distance

d(q;\bar{q})

How do we come up with a distance metric between q and qbar in a dynamically consistent manner?

Reachability-Consistent Distance Metric

Reachability-based Mahalanobis Distance

How do we come up with a distance metric between q and qbar in a dynamically consistent manner?

q_+ = \mathbf{B}(\bar{q},\bar{u})\delta u + f(\bar{q},\bar{u})

Consider a one-step input linearization of the system.

Then we could consider a "reachability ellipsoid" under this linearized dynamics, 

\mathcal{E} = \{q_+ | q_+ = \mathbf{B}(\bar{q},\bar{u})\delta u + f(\bar{q},\bar{u}), \|\delta u \| \leq 1\}

Note: For quasidynamic formulations, ubar is a position command, which we set as the actuated part of qbar.

Reachability Ellipsoid

Reachability Ellipsoid

\mathcal{E} = \{q_+ | q_+ = \mathbf{B}(\bar{q},\bar{u})\delta u + f(\bar{q},\bar{u}), \|\delta u \| \leq 1\}

Intuitively, if B lengthens the direction towards q from a uniform ball, q is easier to reach.

On the other hand, if B decreases the direction towards q, q is hard to reach.

Mahalanobis Distance of an Ellipsoid

Reachability Ellipsoid

\mathcal{E} = \{q_+ | q_+ = \mathbf{B}(\bar{q},\bar{u})\delta u + f(\bar{q},\bar{u}), \|\delta u \| \leq 1\}

The B matrix induces a natural quadratic form for an ellipsoid,

\begin{aligned} d_\gamma(q;\bar{q}) & \coloneqq \frac{1}{2}(q-\mu)^\top \mathbf{\Sigma}_\gamma^{-1}(q - \mu) \\ \mathbf{\Sigma}_\gamma & \coloneqq \mathbf{B}(\bar{q},\bar{u})\mathbf{B}(\bar{q},\bar{u})^\top + \gamma\mathbf{I}_n \\ \mu & \coloneqq f(\bar{q},\bar{u}) \end{aligned}

Mahalanobis Distance using 1-Step Reachability

Note: if BBT is not invertible, we need to regularize to property define a quadratic distance metric numerically.

Smoothed Distance Metric

\begin{aligned} d_{{\color{red}\rho},\gamma}(q;\bar{q}) & \coloneqq \frac{1}{2}(q-\mu)^\top \mathbf{\Sigma}_{{\color{red}\rho},\gamma}^{-1}(q - \mu) \\ \mathbf{\Sigma}_\gamma & \coloneqq \mathbf{B}_{\color{red}\rho}(\bar{q},\bar{u})\mathbf{B}_{\color{red}\rho}(\bar{q},\bar{u})^\top + \gamma\mathbf{I}_n \\ \mu & \coloneqq f_{\color{red}\rho}(\bar{q},\bar{u}) \end{aligned}

For Contact:

Don't use the exact linearization, but the smooth linearization.

Global Search with Smoothing

Dynamically consistent extension

Theoretically, it is possible to use long-horizon trajopt algorithms such as iLQR / DDP.

 

Here we simply do one-step trajopt and solve least-squares.

Importantly, the actuation matrix for least-squares is smoothed, but we rollout the actual dynamics with the found action. 

Dynamically consistent extension

Contact Sampling

With some probability, we execute a regrasp (sample another valid contact configuration) in order to encourage further exploration.

 

 

Global Search with Smoothing

Before I go...

Will be hosting an IROS workshop on

Leveraging Models for Contact-Rich Manipulation.

https://sites.google.com/view/iros2023-contactrich/home

Excited to talk to more model-based folks who are interested in manipulation! (Or conversely, manipulation folks who still have an ounce of hope for models).

Thank you!

  • H.J. Terry Suh*, Tao Pang*, Russ Tedrake,
    "Bundled Gradients through Contact via Randomized Smoothing",
    RA-L 2022, Presented at ICRA 2022
     
  • H.J. Terry Suh, Max Simchowitz, Kaiqing Zhang, Russ Tedrake,
    "Do Differentiable Simulators Give Better Policy Gradients?"
    ICML 2022 Outstanding Paper Award,
    2022 TC on Model-based Optimization for Robotics Best Paper Award
     
  • Tao Pang*, H.J. Terry Suh*, Lujie Yang, Russ Tedrake,
    "Global Planning for Contact-Rich Manipulation via Local Smoothing of Quasidynamic Contact Models",
    To appear in Transactions of Robotics (T-RO)