Ethan K. Gordon
Postdoc, University of Pennsylvania
PhD 2023, University of Washington
with Contact-Rich Active Learning
“If I can have a robot do it, I can learn to adapt to it, but it would be me feeding me, and that would be huge”
Tyler Schrenk
1985-2023
The Promise of PARs (Physically Assistive Robots):
Contact-Rich Manipulation
Online Adaptation
Online Adaptation
There is no time for
re-training!
How can robots adapt at deployment-time
efficiently, safely, and portably?
Policy Space Reduction
Model-Based Methods
Leveraging Haptics
Support
Inform
It is important to ask users: do observational and qualitative research before experimentation.
Time Per Bite:
It's intuitive and familiar.
T. Bhattacharjee, E.K. Gordon et al, “Is more autonomy always better?...", HRI 2020
Users with more limited mobility preferred greater autonomy, even when it made errors.
Autonomy Preference Given Errors
User Rating
Trade-off between autonomy (with chance of error) and high-effort manual control.
What errors are tolerable? Minimum Food Acquisition Success Rate: 80%
T. Bhattacharjee, E.K. Gordon et al, “Is more autonomy always better?...", HRI 2020
R. Feng, Y. Kim, G. Lee, E. K. Gordon, et al, "...Generalizing skewering strategies...", ISRR 2019
Food simulation is hard*, but we can collect real data. What if we just RL-train a policy?
Example: 10 trajectories x 16 food types
85 person-hours
E. Heiden et al, “DiSECt", RSS 2021
(only planar cutting)
Hierarchy and Bandits
T. Bhattacharjee et al, “Towards Robotic Feeding...", R-AL 2019
Qualitative Taxonomy
Insights:
Discrete classes of strategies
Lots of variations within those classes
E. K. Gordon, A. Nanavati et al, “Towards General Single-Utensil...", CoRL 2023
Splines and force/torque thresholds.
Comparable via a Euclidean metric.
Wiggling
Tilting
High Pressure
Scooping
E. K. Gordon, A. Nanavati et al, “Towards General Single-Utensil...", CoRL 2023
Yes!
(Note the 80% acceptance threshold)
E. K. Gordon, A. Nanavati et al, “Towards General Single-Utensil...", CoRL 2023
Online learning with a discrete action space maps cleanly onto the contextual bandit setting.
Interaction Protocol: choose an action \(a \in \{1, 2, 3\}\); observe reward \(r \sim \mathcal{N}(0.5, 1)\), \(\mathcal{N}(0.8, 1)\), or \(\mathcal{N}(0.1, 1)\) respectively.
Metric: Regret: \(\mathbb{E}[r(a^*) - r(a_t)]\)
A test-time metric that balances exploration vs. exploitation, and is often theoretically bounded.
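The three-arm example above can be simulated directly. Below is a minimal sketch (not from any of the cited papers) of UCB1 on arms with means 0.5, 0.8, and 0.1, tracking cumulative expected regret:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.5, 0.8, 0.1])        # the three arms from the slide

def ucb_regret(T=5000):
    counts = np.zeros(3)
    sums = np.zeros(3)
    regret = 0.0
    for t in range(T):
        if t < 3:
            a = t                        # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        counts[a] += 1
        sums[a] += rng.normal(means[a], 1.0)
        regret += means.max() - means[a]  # expected (pseudo-)regret
    return regret

print(ucb_regret())   # sublinear: far below the ~3500 of always pulling the worst arm
```

The optimism bonus \(\sqrt{2\log t / n_a}\) shrinks as an arm is pulled, so exploration decays on its own and regret stays bounded by \(O(\log T)\) for fixed gaps.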
Interaction Protocol: observe visual context \(c_t\); choose a discrete action \(a \in \{1, 2, 3\}\); observe reward \(r \sim \mathcal{N}(\mu_a(c_t), \sigma)\).
Visual Context: \(c_t\)
Discrete Actions: \(a\)
\(r\): Success/Failure \(\{1, 0\}\)
E. K. Gordon et al, “Adaptive robot-assisted feeding...", IROS 2020
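A hedged sketch of this protocol with a stand-in algorithm: epsilon-greedy over per-action ridge-regression reward models. The cited system uses its own algorithm and learned features; everything here (dimensions, noise levels, the linear reward map) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 5, 3, 3000
W_true = rng.normal(size=(K, d))              # hidden per-action reward params

A = [np.eye(d) for _ in range(K)]             # per-action ridge statistics
b = [np.zeros(d) for _ in range(K)]
regret = 0.0
for t in range(T):
    c = rng.normal(size=d)                    # visual context c_t
    mu = W_true @ c                           # true expected rewards (unknown)
    est = np.array([np.linalg.solve(A[k], b[k]) @ c for k in range(K)])
    a = int(rng.integers(K)) if rng.random() < 0.05 else int(np.argmax(est))
    r = mu[a] + rng.normal(0, 0.1)            # bandit feedback: only arm a
    A[a] += np.outer(c, c)                    # update ONLY the chosen arm
    b[a] += r * c
    regret += mu.max() - mu[a]

print(regret / T)   # per-round regret stays small once the models converge
```

Note the bandit structure in the update: only the chosen arm's statistics change, which is exactly why the post hoc context trick later in the talk is valuable.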
Eye-in-Hand RGBD
SegmentAnything
ResNet Features
Haptic data is really good for food classification.
55ms of force data:
T. Bhattacharjee et al, R-AL 2019; E. Gordon et al, "Leveraging post hoc context...", ICRA 2021
\(l_a(c) = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]\)
Supervised Learning sees \(l_a(c) \forall a\)
Full Feedback
\(c\)
Bandit Algorithm sees \(l_{a_t}(c)\)
Bandit Feedback (Harder)
No counterfactual.
E. Gordon et al, "Leveraging post hoc context...", ICRA 2021
Regret: \(O(\sqrt{T})\)
Regret: \(O(\dim a \cdot \dim c \cdot \sqrt{T})\)
E. Gordon et al, "Leveraging post hoc context...", ICRA 2021
Consider a joint loss model:
\(r_a + \epsilon = f_{\theta^*}(c) = g_{\phi^*}(p)\)
\(\epsilon \sim \mathcal{N}\)
visual context
haptic context
\(f_\theta(c, a) = [0.9, 0.1, 0.5, 0.8, \ldots]\)
Once either model is learned, the other model reduces from bandit feedback to full feedback
Example:
After only 1 action, robot determines that kiwi \(\approx\) banana, and can impute the counterfactual.
Observe \(a=1 \rightarrow [0, ?, ?, ?, \ldots]\)
\(g_\phi(p, a) = [0.8, 0.2, 0.4, 0.7, \ldots]\)
Can provide the counterfactual
Regret: \(O(\dim c) \rightarrow O(\min(\dim c,\dim p))\)
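To illustrate the reduction, here is a toy linear version of the joint loss model. The haptic map \(g\) is assumed already learned (exactly, via a made-up link matrix `M`), so each interaction hands the visual model \(f\) a full counterfactual reward vector, i.e. full feedback:

```python
import numpy as np

rng = np.random.default_rng(2)
K, dc, dp = 4, 6, 6
Wf_true = rng.normal(size=(K, dc))     # true visual-context reward map f*
M = rng.normal(size=(dp, dc))          # made-up visual->haptic link: p = M c
Wg = Wf_true @ np.linalg.inv(M)        # haptic model g, assumed already learned

Wf = np.zeros((K, dc))                 # visual model f, learned online
for t in range(3000):
    c = rng.normal(size=dc)            # visual context before the bite
    p = M @ c                          # post hoc haptic context after the bite
    full_r = Wg @ p                    # g imputes rewards for ALL K actions
    Wf += 0.05 * np.outer(full_r - Wf @ c, c) / dc   # full-feedback step

print(np.abs(Wf - Wf_true).max())      # near 0: f recovered from full feedback
```

Contrast with the bandit update earlier: every row of `Wf` is updated every round, which is what removes the \(\dim a\) factor from the regret.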
E. Gordon et al, "Leveraging post hoc context...", ICRA 2021
Yes
New foods take ~7-8 actions to learn to user satisfaction.
[1] Kapusta et al, Autonomous Robots 2019; [2] Hello Robot; [3] KukaDressing
Grooming
Rehabilitation
How can we perform active learning with:
Dynamic Objects?
Visual Occlusions?
Robust Simulation Tools?
Can a robot do this?
Robot Trajectory \(r_t\)
Object Geometry \(\theta^*\)
Object Pose \(x^*_T\)
Static Objects: "assume a sensor that can detect contact before causing movement" [2]
Utilizes 2D OR discrete object priors.
Spatially Sparse Data \(\rightarrow\) Active Learning
[1] Hu et al, Biomimetic Intelligence and Robotics 2024 ; [2] Xu et al. "TANDEM3D...", ICRA 2023
E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
\(m_0\)
\(\ldots\)
\(x_5\)?
We measure contact at \(t=0\).
(Learning) Where is the object at \(t=5\)?
(Information) How certain are we?
Dynamics: \(x_{t+1} = f(x_t)\)
Measurement Model: \(\mathbb{P}(m_t | x_t)\)
Likelihood: \(\mathbb{P}(m_0 | x_5) = \mathbb{P}(m_0 | x_0 = f^{-5}(x_5))\)
That's a backwards simulation.
Difficulty #1: It is impossible to simulate Coulomb friction backwards.
(much less differentiate it backwards)
E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
Avoid Difficulty #1 by optimizing the entire trajectory:
\(\mathbb{P}(m_0 | x_5) \rightarrow \mathbb{P}(m_0 | \tau=[x_0,\ldots,x_5])\mathbb{P}(\tau)\)
Loss \(\mathcal{L} := -\sum_t\log\mathbb{P}(m_t | x_t) + ||x_t - f(x_{t-1})||^2\)
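A toy instance of this trajectory-level loss, with 1-D linear dynamics \(f(x) = 0.9x\) standing in for contact dynamics and a single Gaussian measurement of \(x_0\). Plain gradient descent over the whole trajectory recovers \(x_0 \to m_0\) and \(x_5 \to f^5(x_0)\) with no backward simulation; all constants are illustrative.

```python
import numpy as np

a_dyn, sigma, m0 = 0.9, 1.0, 1.0         # dynamics gain, meas. noise, measurement

def loss_grad(x):
    """Gradient of -log P(m0|x0) + sum_t ||x_t - f(x_{t-1})||^2 (Gaussian meas.)."""
    g = np.zeros_like(x)
    g[0] += (x[0] - m0) / sigma**2        # measurement term acts on x_0 only
    for t in range(1, len(x)):            # dynamics-consistency penalty
        r = x[t] - a_dyn * x[t - 1]
        g[t] += 2 * r
        g[t - 1] += -2 * a_dyn * r
    return g

x = np.zeros(6)                           # decision variables x_0 ... x_5
for _ in range(1000):
    x -= 0.2 * loss_grad(x)

print(x[0], x[5])                         # x_0 -> 1.0, x_5 -> 0.9**5 ~ 0.59
```

The dynamics only ever appear in the forward direction, inside the penalty term, which is the point of optimizing the whole trajectory.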
Difficulty #2: contact dynamics \(f\) often have near-0 or near-\(\infty\) gradients.
B. Bianchini et al, "Generalization Bounded...", L4DC 2022; E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
Analogy: \(y = H(x-\theta)\)
Mean Square Error
\(\mathcal{L}_{MSE} = \sum_\mathcal{D}||y_{\mathcal{D}} - H(x_\mathcal{D}-\theta)||^2\)
Problem: \(\nabla_\theta \mathcal{L}_{MSE}\) is 0 almost everywhere!
Alternative: Graph Distance
\(\mathcal{L}_{GD} = \sum_\mathcal{D}\min_x||(x_{\mathcal{D}}, y_{\mathcal{D}}) - (x, H(x-\theta))||^2\)
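The Heaviside analogy is easy to check numerically. A central finite difference of \(\mathcal{L}_{MSE}\) in \(\theta\) returns exactly zero away from the data points, while the graph distance, which for a step function reduces to \(\min((x-\theta)^2, 1)\) per misclassified point, yields a usable descent direction (grid and constants here are illustrative):

```python
import numpy as np

theta_true = 0.3
xs = np.linspace(-0.99, 0.99, 40)          # grid chosen to avoid a sample at 0
ys = (xs >= theta_true).astype(float)      # step-function data y = H(x - theta*)

H = lambda z: (z >= 0).astype(float)

def mse(theta):
    return np.sum((ys - H(xs - theta)) ** 2)

def graph_dist(theta):
    # squared distance from each data point to the graph of x -> H(x - theta):
    # 0 if classified correctly, else min((x - theta)^2, 1)
    mis = ys != H(xs - theta)
    return np.sum(np.minimum((xs[mis] - theta) ** 2, 1.0))

eps, theta0 = 1e-6, 0.0
g_mse = (mse(theta0 + eps) - mse(theta0 - eps)) / (2 * eps)
g_gd = (graph_dist(theta0 + eps) - graph_dist(theta0 - eps)) / (2 * eps)
print(g_mse, g_gd)   # 0.0 (flat almost everywhere) vs. a negative descent signal
```

Gradient descent on the graph distance pushes \(\theta\) toward 0.3; on the MSE it goes nowhere.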
Trade-Off:
The graph distance requires an inner optimization (a QP) over contact forces.
\(\mathbb{P}(m_t | x_t)\)
Maximum Likelihood Estimate: \(\tilde{\Theta} = \arg\min_\Theta\mathcal{L}(\Theta)\)
\(\rightarrow \frac{d\mathcal{L}}{d\Theta}(\tilde{\Theta}) = 0\)
Information \(\rightarrow\) How certain am I? Ideally: answer without a strong prior.
How certain is this?
Noise Floor
Low Info
High Info
Past (Observed) Information:
\(\mathcal{I} := \sum_{m_t}\nabla_{\Theta}\mathcal{L}\left(\nabla_{\Theta}\mathcal{L}\right)^T\)
Future (Fisher) Information:
\(\mathcal{F} := Var_{m_t}\left[\nabla_{\Theta}\mathcal{L}\right]\)
\(= \mathbb{E}_{m_t}\left[\nabla_{\Theta}\mathcal{L}\left(\nabla_{\Theta}\mathcal{L}\right)^T\right] \)
EIG \(:= \log\det(\mathcal{F}\mathcal{I}^{-1} + \mathbf{I})\)
E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
Learn; Compute
Observed Info \(\mathcal{I}\)
Sample Actions + Simulate
Expected Fisher Info \(\mathcal{F}\)
\(\max EIG := \log\det\left(\mathcal{F}\mathcal{I}^{-1} + \mathbf{I}\right)\)
Choose actions where simulated, expected Fisher info is distinct from Observed info.
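A small numeric sketch of this scoring rule (the matrices are toy stand-ins, not from the system): an action whose simulated Fisher information covers the dimension the observed information has not yet pinned down scores higher.

```python
import numpy as np

def eig_score(fisher, observed):
    """EIG = log det(F I^{-1} + Id), as defined on the slide."""
    k = len(observed)
    return np.linalg.slogdet(fisher @ np.linalg.inv(observed) + np.eye(k))[1]

I_obs = np.diag([10.0, 10.0, 0.1])       # dims 0,1 well observed; dim 2 uncertain
F_redundant = np.diag([5.0, 5.0, 0.0])   # action that re-probes what we know
F_novel = np.diag([0.0, 0.0, 5.0])       # action that probes the uncertain dim

print(eig_score(F_redundant, I_obs), eig_score(F_novel, I_obs))
# ~0.81 vs ~3.93: the novel action wins
```

The `+ Id` term keeps the score finite when a candidate action yields no information at all.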
There is no avoiding Difficulty #1: we need \(\nabla_{x_5}\log\mathbb{P}(m_0 | x_5)\) to compute \(\mathbb{E}_{m_0}[(\nabla_{x_5}\log\mathbb{P}(m_0 | x_5))^2]\)
E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
"Quasi-Static" Solution: pretend \(\nabla_{x_5}x_0 = \mathbf{I}\)
\(\mathcal{F} = \sum_t \mathbb{E}_{m_t}[(\nabla_{x_t}\log\mathbb{P}(m_t|x_t))^2]\)
Pro: Easy to compute
For a Gaussian measurement model, no sampling is required for \(\mathbb{E}\).
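The "no sampling required" claim can be checked on a toy scalar model \(m \sim \mathcal{N}(h(x), \sigma^2)\), with a made-up measurement function \(h\): the closed-form Fisher information \(h'(x)^2/\sigma^2\) matches a Monte Carlo estimate of \(\mathbb{E}[(\nabla_x \log\mathbb{P}(m|x))^2]\).

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, x = 0.2, 0.7
h = lambda x: np.sin(x)                   # toy measurement function (assumed)
dh = lambda x: np.cos(x)

closed = dh(x) ** 2 / sigma**2            # closed-form Fisher info, no sampling
m = rng.normal(h(x), sigma, 200_000)
score = (m - h(x)) / sigma**2 * dh(x)     # d/dx log N(m; h(x), sigma^2)
mc = np.mean(score ** 2)                  # Monte Carlo estimate of E[score^2]
print(closed, mc)                         # agree to within ~1%
```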
E. Gordon et al, "Active Tactile Exploration...", ICRA 2026
Ongoing Work
"Quasi-Static" Solution won't work for more dynamic systems.
Can we do better? Yes, through marginalization.
We want \(\nabla_{x_T} \log\mathbb{P}(m_0 | x_T)\)
\(=\nabla_{x_T} \log\int_{\tau}\mathbb{P}(m_0 | \tau)\mathbb{P}(\tau| x_T)\)
\(\approx softmax_{\tau\sim MCMC}\left(\log \mathbb{P}(m_0 | x_0)\right) \cdot \left(\nabla_{x_T}||f(x_{T-1}) - x_T||_2^2\right)\)
Con: We introduce sampling. However...
Journal Extension in the works...
We want \(\nabla_{x_T} \log\mathbb{P}(m_t | x_T)\)
\(=\nabla_{x_T} \log\int_{x_t}\mathbb{P}(m_t | x_t)\mathbb{P}(x_t | x_T)\)
Assume:
\(\mathbb{P}(x_t | x_T) \propto \exp(f(x_t, x_T))\)
\(\approx\nabla_{x_T} \log\sum_{x_t\sim U}\mathbb{P}(m_t | x_t)\mathbb{P}(x_t | x_T)\)
\(=softmax_{x_t\sim U}\left(\log \mathbb{P}(m_t | x_t)+ f(x_t, x_T)\right) \cdot \left(\nabla_{x_T}f(x_t, x_T)\right)\)
\(=softmax_{x_t\sim MCMC}\left(\log \mathbb{P}(m_t | x_t)\right) \cdot \left(\nabla_{x_T}f(x_t, x_T)\right)\)
No gradient through sampling
At the cost of sampling trajectories \(x_t\) (via MCMC), we bypass the inverse Jacobian.
(If we have a good guess \(\tilde{x}_t\), MCMC should be quick)
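A 1-D linear-Gaussian check of the softmax estimator (toy constants; plain uniform sampling stands in for MCMC): backward dynamics are never inverted, yet the softmax-weighted average of \(\nabla_{x_T} f\) matches the closed-form gradient of the log marginal.

```python
import numpy as np

rng = np.random.default_rng(5)
a, s, sig_m = 0.9, 0.3, 0.2              # dynamics gain, transition & meas. noise
m, xT = 0.5, 1.0                         # observed measurement, current estimate

f = lambda xt: -((xt - a * xT) ** 2) / (2 * s**2)        # log P(xt|xT) + const
df_dxT = lambda xt: (xt - a * xT) * a / s**2             # grad of f wrt xT
log_meas = lambda xt: -((m - xt) ** 2) / (2 * sig_m**2)  # log P(m|xt) + const

xs = rng.uniform(-2.0, 3.0, 100_000)     # x_t ~ U: uniform proposal, not MCMC
logw = log_meas(xs) + f(xs)
w = np.exp(logw - logw.max())
w /= w.sum()                             # softmax weights
est = np.sum(w * df_dxT(xs))             # softmax(...) . grad_{xT} f

true = a * (m - a * xT) / (sig_m**2 + s**2)   # closed-form grad log P(m|xT)
print(est, true)                         # agree closely; no inverse Jacobian used
```

The gradient never passes through the sampler, only through \(f(x_t, x_T)\), which is the "no gradient through sampling" point above.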
E. Gordon et al, "An adaptable, safe, and portable robot-assisted feeding system.", HRI Companion 2024
A. Nanavati, E. Gordon et al, "Lessons learned from designing...", HRI 2025
Images
\(\mathcal{L}\)
We used privileged simulator information to compute observed info \(\approx\) uncertainty.
Idea: have the VLA learn that uncertainty explicitly.
States
Observed Information:
\(\mathcal{I} := \sum_{m_t}\nabla_{\Theta}\mathcal{L}\left(\nabla_{\Theta}\mathcal{L}\right)^T\)
Dataset \(\mathcal{D}\)
Perception
World Model + Policy
\(\mathcal{L}\)
Run Classical Techniques on World-Model:
As et al, "ActSafe...", ICLR 2025
Beneficial to play optimistically w.r.t. loss
Safer to play pessimistically w.r.t. model parameters
Identify loss components that are safety-critical (pessimistic) vs. performance-critical (optimistic).
Residual Policy
\(a\)
\(a + \tilde{a}\)
\(\mathcal{L}\)
DAIR Lab
Amal Nanavati
Ethan K. Gordon