Active Tactile Exploration for Rigid Body State Estimation

Ethan K. Gordon, Bruke Baraki, Michael Posa

<Items in Brackets are Meta Notes / Still in Progress. Feedback Appreciated!>

Known / Estimated:

  • Object Geometry

  • Object Pose

  • Object Mass / Inertia
  • Frictional Properties

In Robotics, Models are Powerful

Max Planck Real Robotics Challenge 2020

Arbitrary Convex Object Repose Task

Bauer et al. "Real Robot Challenge: A Robotics Competition in the Cloud". NeurIPS 2021 Competition.

Models are Difficult to Build Online

  • Occlusions / Darkness

  • Clutter

  • Heterogeneous Materials

  • Broken Objects

Visual Model Learning

Structure from Motion (SFM)

Bianco et al. "Evaluating the Performance of Structure from Motion Pipelines", Journal of Imaging 2018

Wen et al. "BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects", CVPR 2023

Geometry from Video

Pros: Spatially Dense, Mature HW and SW

Cons:

  • Occlusions / Darkness
  • SFM: can't capture physical properties
  • Video: What's doing the manipulating?

State-of-the-Art Tactile Model Learning

Hu et al. "Active shape reconstruction using a novel visuotactile palm sensor", Biomimetic Intelligence and Robotics 2024

Xu et al. "TANDEM3D: Active Tactile Exploration for 3D Object Recognition", ICRA 2023

Single-Finger Poking: No friction or inertia.

Utilizes discrete object priors.

Spatially Sparse Data -> Active Learning

Active Tactile Exploration: Problem Statement


Assumptions:

  • Rigidity

  • Convexity

  • Coulomb friction

What we know / measure:

  • Robot state trajectory \(r[t]\)

  • Contact force \(\lambda_m[t]\)

  • Contact normal \(\hat{n}_m[t]\)

Unknown object properties:

  • State \(x[t] = (q[t], v[t])\) (configuration and velocity)

  • Geometry, inertial, and frictional properties, collected in \(\theta\)

Measurement Probability Model

<Gaussian: Major (likely incorrect) Assumption>

<A Gamma is likely more accurate (>0 and mean-dependent variance, with variance -> 0 when mean -> 0). However, in practice, a Gaussian estimator often achieves similar performance to a Gamma.>

\lambda_m[t] = \hat{\lambda}(\theta, x[t]; r[t]) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \Sigma)

Given a simulator that can compute \(\hat{\lambda}\), minimize the negative log-likelihood as a loss function for a Maximum Likelihood Estimate:

\mathcal{L} = -\log\mathbb{P}(\lambda_m | \theta, x; r) = \sum_t\left|\left|\hat{\lambda}-\lambda_m\right|\right|_2^2
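As a minimal sketch (illustrative names only, not an existing API), assuming the Gaussian model above with covariance \(\Sigma\):

```python
import torch

def gaussian_nll(lambda_hat, lambda_m, Sigma):
    """Negative log-likelihood of measured impulses, up to constants,
    under lambda_m = lambda_hat + eps, eps ~ N(0, Sigma).
    With Sigma = I this reduces to the sum-of-squares loss above."""
    resid = (lambda_m - lambda_hat).reshape(-1)
    return 0.5 * resid @ torch.linalg.solve(Sigma, resid)
```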

One Possibility is Differentiable Simulation + Shooting

\mathcal{L} = \sum_t\left|\left|\hat{\lambda}(\theta, x)-\lambda_m\right|\right|_2^2
\text{s.t. } \hat{\lambda} = \arg\min_{\lambda \in \mathcal{FC}(\mu)}\sum_t\left|\left|M\Delta v_c - J^{\textrm T}\lambda\right|\right| _{M^{-1}}^2 + \phi^{\textrm T}\lambda
\text{s.t. } q[t] = q[t-1] + \Delta t\, v[t-1], \quad \phi > 0
\text{Given: } x[0]

(\(\phi\): signed distance to contact; \(\mathcal{FC}(\mu)\): Coulomb friction cone with coefficient \(\mu\))

Anitescu. "Optimization-based simulation of nonsmooth rigid multibody dynamics," Mathematical Programming 2006
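A rollout sketch of this shooting loss. Here `sim_step` is a hypothetical differentiable contact solver (e.g., the Anitescu QP above); gradients flow through the whole rollout, which is exactly what makes the loss sensitive to \(x[0]\):

```python
import torch

def shooting_loss(theta, x0, r_traj, lambda_m_traj, sim_step):
    """Differentiable-simulation shooting loss.

    sim_step(theta, x, r) -> (lambda_hat, x_next) is assumed to solve the
    contact QP and integrate q[t] = q[t-1] + dt * v[t-1] internally.
    """
    x, loss = x0, torch.tensor(0.0)
    for r, lam_m in zip(r_traj, lambda_m_traj):
        lam_hat, x = sim_step(theta, x, r)
        loss = loss + torch.sum((lam_hat - lam_m) ** 2)
    return loss
```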

Improving Stability with an Implicit Loss

\mathcal{L} = \min_{\lambda}\sum_t\left|\left|\lambda-\lambda_m\right|\right|_2^2 + \left|\left|M\Delta v_c - J^{\textrm T}\lambda\right|\right| _{M^{-1}}^2 + \phi^{\textrm T}\lambda

Bianchini et al. "Generalization Bounded Implicit Learning of Nearly Discontinuous Functions," L4DC 2022

<TODO: Replace with self-made plot>

DiffSim + Shooting Limitations:

  • Sensitivity to \(x[0]\)
  • Discontinuities given process noise \(\epsilon_p\): measurements become \(\lambda_m[t] = \hat{\lambda}(\theta, x[t] + \epsilon_p; r[t]) + \epsilon_m\)
  • Only gets worse with smaller \(dt\)

The solution is to bring the optimization into the loss function: add integration and penetration residuals \(\left|\left|\Delta q - v\right|\right|_2^2 + \min(\phi, 0)\), and replace the MSE with a distance to the graph of the contact dynamics (MSE -> Graph Distance).

Violation Implicit Loss Summary

Pfrommer et al. "ContactNets: Learning Discontinuous Contact Dynamics with Smooth, Implicit Representations," CoRL 2020

  • Measurement: \(\left|\left|\lambda-\lambda_m\right|\right|_2^2 + (1 - \hat{n}_m\cdot\hat{n})\) (measured normal \(\hat{n}_m\) vs. predicted normal \(\hat{n}\))

  • Prediction: \(\left|\left|M\Delta v_c - J^{\textrm T}\lambda\right|\right| _{M^{-1}}^2 + \left|\left|\Delta q - v\right|\right|_2^2\)

  • Complementarity: \(\phi^{\textrm T}\lambda\)

  • Penetration: \(\min(\phi, 0)\)

  • Power Dissipation (relaxed in Anitescu): \(\left|\left|\mu J_tv\right|\right|\lambda_n + \lambda_t^{\textrm T}\mu J_tv + \max(J_nv,0)^{\textrm T}\lambda_n\); a separating contact (\(J_nv > 0\)) should carry \(\lambda=0\)
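Putting the terms together, a single-contact PyTorch sketch (argument names, shapes, and the sign convention on the penetration term are assumptions, not the authors' implementation):

```python
import torch

def violation_loss(lam_n, lam_t, lam_m, n_hat, n_m,
                   M, J, Jn, Jt, phi, dv_c, dq, v, mu):
    """Single-contact sketch of the violation implicit loss terms above.

    Assumed shapes: lam_n scalar, lam_t (2,), lam = [lam_n; lam_t] (3,),
    phi scalar, Jn (nv,), Jt (2, nv), J (3, nv). The implicit loss
    minimizes this expression over (lam_n, lam_t).
    """
    lam = torch.cat([lam_n.reshape(1), lam_t])
    # Measurement: match measured impulse and measured contact normal
    meas = torch.sum((lam - lam_m) ** 2) + (1.0 - torch.dot(n_m, n_hat))
    # Prediction: inertia-weighted dynamics residual + integration residual
    resid = M @ dv_c - J.T @ lam
    pred = resid @ torch.linalg.solve(M, resid) + torch.sum((dq - v) ** 2)
    # Complementarity: no force at a distance
    comp = phi * lam_n
    # Penetration: penalize phi < 0 (one sign convention for min(phi, 0))
    pen = -torch.clamp(phi, max=0.0)
    # Power dissipation (relaxed in Anitescu), plus: a separating contact
    # (Jn v > 0) should carry no normal impulse
    slip = Jt @ v
    diss = (torch.norm(mu * slip) * lam_n + lam_t @ (mu * slip)
            + torch.clamp(torch.dot(Jn, v), min=0.0) * lam_n)
    return meas + pred + comp + pen + diss
```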

Learning <Preliminary> Results

Real-Time Simulated Data Collection, Real-Time Gradient Descent

Active Exploration: What is Information?

We want to (possibly) be surprised

[Plots: loss \(\mathcal{L}\) as a function of parameters \(\Theta\)]

Ideally, information is local (i.e., no belief distribution on \(\Theta\)).

  • Estimate \(\hat{\Theta}\)
  • Choose \(r\)
  • Observe (random) \(\lambda_m\)


Fisher Information: Variance of the score

\mathcal{L}(\Theta, r, \lambda_m) = -\log\mathbb{P}(\lambda_m | \Theta; r)

"log-likelihood"

\nabla_\Theta\mathcal{L}(\Theta, r, \lambda_m)

"score"

We are surprised if, at \(\hat{\Theta}\), the score varies a lot with new data.

\mathcal{I} = Var_{\lambda_m}\left[\nabla_\Theta\mathcal{L}(\Theta, r, \lambda_m)\Bigr\rvert_{\hat{\Theta}}\right]

"Fisher Information"

Fisher Information Definitions

\mathcal{I} = Var_{\lambda_m}\left[\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right] = \mathbb{E}_{\lambda_m}\left[\left(\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right)^2\right] - \mathbb{E}_{\lambda_m}\left[\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right]^2

The second term is \(0\) because \(\hat{\Theta}\) is a Maximum Likelihood Estimate; here \((\cdot)^2\) denotes an outer product. Under regularity conditions (below), this also equals the expected Hessian:

Var_{\lambda_m}\left[\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right] = \mathbb{E}_{\lambda_m}\left[\nabla_\Theta\otimes\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right]

The variance of the gradient is the expected sensitivity of the gradient to small changes in the loss function.
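As a toy sanity check (not from the slides): for a scalar Gaussian measurement \(\lambda_m \sim \mathcal{N}(\mu, \sigma^2)\), the Fisher information about \(\mu\) is \(1/\sigma^2\), and the empirical variance of the score recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_hat, sigma = 2.0, 0.5                      # MLE and known noise scale
lam_m = mu_hat + sigma * rng.standard_normal(100_000)

# Score of L = (lam_m - mu)^2 / (2 sigma^2), evaluated at mu = mu_hat
score = -(lam_m - mu_hat) / sigma**2

print(score.mean())  # ~ 0: expected score vanishes at the MLE
print(score.var())   # ~ 1 / sigma^2 = 4.0: the Fisher information
```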


Mathematically requires "certain regularity conditions":

  • \(\mathbb{E}\) is necessary: requires swapping integral and derivative order
  • Requires the \(\log\mathbb{P}\): uses normalization of the probability distribution

How to Calculate Fisher Information

\mathcal{I} \approx \mathbb{E}_{\lambda_m}\left[\left(\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right)^2\right]
  1. Start with the probability model: \(\lambda_m = \hat{\lambda} + \epsilon\)
    1. Not Necessarily Gaussian
  2. For a given \(r\), simulate forward to find \(\hat{\lambda}\)
  3. Sample possible forward values for \(\lambda_m\)
  4. Use autodiff to calculate \(\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\) for each sample
  5. Take the empirical mean of the outer products
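A sketch of these steps with PyTorch autodiff (all names are placeholders, not an existing API; the Gaussian sampling in step 3 inherits the caveat from the measurement-model slide):

```python
import torch

def empirical_fisher(loss_fn, theta_hat, r, lambda_hat, Sigma_chol, n=200):
    """Monte Carlo Fisher information at theta_hat.

    loss_fn(theta, r, lambda_m) -> scalar negative log-likelihood;
    lambda_hat: simulated noiseless measurement for action r (step 2);
    Sigma_chol: Cholesky factor of the measurement covariance.
    """
    d = theta_hat.numel()
    fisher = torch.zeros(d, d)
    for _ in range(n):
        # Step 3: sample a possible measurement lambda_m = lambda_hat + eps
        eps = Sigma_chol @ torch.randn(lambda_hat.numel())
        lambda_m = lambda_hat + eps.reshape(lambda_hat.shape)
        # Step 4: autodiff the score at theta_hat
        theta = theta_hat.detach().clone().requires_grad_(True)
        (score,) = torch.autograd.grad(loss_fn(theta, r, lambda_m), theta)
        g = score.reshape(-1)
        # Step 5: average the outer products
        fisher += torch.outer(g, g) / n
    return fisher
```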

<What is the right probability model? Can also simulate with process noise>

<Currently I have a bug in my implementation, so I don't have complete results. I feed in \(\lambda_m\) as post-optimization impulses. Instead, I need to re-optimize in the calculation of \(\mathcal{L}\).>

Example: Complementarity

With \(\lambda_m = \hat{\lambda} + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \Sigma)\), and \(\nabla\mathcal{L}(\hat{\lambda}) = 0\):

\mathbb{E}_{\lambda_m}\left[\left(\nabla_\Theta\phi\,\lambda_m + ...\right)^2\right] = \mathbb{E}_{\lambda_m}\left[\lambda_m\left(\nabla_\Theta\phi\right)^2\lambda_m\right] = \hat{\lambda}\left(\nabla_\Theta\phi\right)^2\hat{\lambda} + tr\left(\left(\nabla_\Theta\phi\right)^2\Sigma\right) + ... \text{(cross terms)}

For example, with \(\phi = (l, w)\), different contact configurations give

(\nabla\phi)^2 = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}, \quad \text{or} \quad \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
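A quick Monte Carlo check (illustrative numbers, not from the slides) of \(\mathbb{E}[\lambda_m^{\textrm T} A \lambda_m] = \hat{\lambda}^{\textrm T} A \hat{\lambda} + tr(A\Sigma)\) for the corner-contact \(A = (\nabla\phi)^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [1.0, 1.0]])   # (grad phi)^2, corner contact
lam_hat = np.array([3.0, 1.0])           # nominal impulse (made up)
Sigma = np.diag([0.2, 0.1])              # measurement covariance (made up)

lam_m = rng.multivariate_normal(lam_hat, Sigma, size=200_000)
mc = np.einsum('ni,ij,nj->n', lam_m, A, lam_m).mean()

print(mc)                                           # ~16.3 (Monte Carlo)
print(lam_hat @ A @ lam_hat + np.trace(A @ Sigma))  # 16.3 (closed form)
```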

Example with action library

For the actions [2-finger X Pinch, 2-finger Z Pinch w/ Ground, 1-finger Cube Corner Hit]:

\(tr(\mathcal{I})\) = [1388150.4359, 2878543.4818, 2905122.0841]

Expected Info Gain: Avoid Redundancy

  • \(\mathcal{I}\) is independent of past data \(\mathcal{D}\)
  • Same action will be taken every time!
  • Solution: de-prioritize info we've already seen.

Note \(\sum_\mathcal{D}\left(\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right) = \nabla_\Theta\left(\sum_\mathcal{D}\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right) = 0\), since \(\hat{\Theta}\) is the MLE

\mathcal{I}_\mathcal{O} = \sum_\mathcal{D}\left(\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right)^2 \approx \sum_\mathcal{D}\left(\nabla_\Theta\otimes\nabla_\Theta\mathcal{L}\Bigr\rvert_{\hat{\Theta}}\right) \quad \text{(observed information)}
EIG(r) = \mathcal{I}(r)\mathcal{I}_\mathcal{O}^{-1}

Final Maximization Problem

r = \arg\max_r \text{scalarization}(EIG(r))

Choosing a scalarization is a whole field of study. Common choices include:

  1. A (average): \(tr(EIG)\) -> average EIG across parameters
  2. E (eigenvalue): \(\min(eigenvalue(EIG))\) -> prioritize the parameter we know the least about
  3. D (determinant): \(\det(EIG)\) -> maximize the volume of the "uncertainty ellipsoid" around the score
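A sketch of the resulting selection rule (hypothetical names; `fisher_per_action` could come from the `empirical_fisher` sketch above, `fisher_observed` from \(\mathcal{I}_\mathcal{O}\)):

```python
import numpy as np

def choose_action(fisher_per_action, fisher_observed, scalarization="A"):
    """Pick the action r maximizing a scalarized EIG(r) = I(r) inv(I_O)."""
    scores = []
    for I_r in fisher_per_action:
        eig = I_r @ np.linalg.inv(fisher_observed)
        eig = 0.5 * (eig + eig.T)            # symmetrize for eig/det
        if scalarization == "A":             # average across parameters
            s = np.trace(eig)
        elif scalarization == "E":           # least-known parameter direction
            s = np.linalg.eigvalsh(eig).min()
        else:                                # "D": score-ellipsoid volume
            s = np.linalg.det(eig)
        scores.append(s)
    return int(np.argmax(scores))
```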

Thank You!

<Other Funding Orgs>