Ethan K. Gordon
Postdoc, University of Pennsylvania
PhD 2023, University of Washington
with Contact-Rich Active Learning
“If I can have a robot do it, I can learn to adapt to it, but it would be me feeding me, and that would be huge”
Tyler Schrenk
1985-2023
The Promise of PARs (Physically Assistive Robots):
Contact-Rich Manipulation
Online Adaptation
Online Adaptation
There is no time for
re-training!
How can robots adapt at deployment-time
efficiently, safely, and portably?
Policy Space Reduction
Model-Based Methods
Leveraging Haptics
Support
Inform
It is important to ask users: do observational and qualitative research before experimentation.
Time Per Bite:
It's intuitive and familiar.
T. Bhattacharjee, E.K. Gordon et al, “Is more autonomy always better?...", HRI 2020
Users with more limited mobility preferred greater autonomy, even when it made errors.
Autonomy Preference Given Errors
User Rating
Trade-off between autonomy (with chance of error) and high-effort manual control.
What errors are tolerable? Minimum Food Acquisition Success Rate: 80%
T. Bhattacharjee, E.K. Gordon et al, “Is more autonomy always better?...", HRI 2020
R. Feng, Y. Kim, G. Lee, E. K. Gordon, et al, "...Generalizing skewering strategies...", ISRR 2019
Food simulation is hard*, but we can collect real data. What if we just RL-train a policy?
Example: 10 trajectories x 16 food types
85 person-hours
E. Heiden et al, “DiSECt", RSS 2021
(only planar cutting)
Hierarchy and Bandits
T. Bhattacharjee et al, “Towards Robotic Feeding...", R-AL 2019
Qualitative Taxonomy
Insights:
Discrete classes of strategies
Lots of variations within those classes
E. K. Gordon, A. Nanavati et al, “Towards General Single-Utensil...", CoRL 2023
Splines and force/torque thresholds.
Comparable via a Euclidean metric.
Wiggling
Tilting
High Pressure
Scooping
E. K. Gordon, A. Nanavati et al, “Towards General Single-Utensil...", CoRL 2023
Yes!
(Note the 80% acceptance threshold)
E. K. Gordon, A. Nanavati et al, “Towards General Single-Utensil...", CoRL 2023
Online learning with a discrete action space maps cleanly onto the contextual bandit setting.
Interaction Protocol: choose an action \(a \in \{1, 2, 3\}\); observe reward \(r \sim \mathcal{N}(0.5, 1)\), \(\mathcal{N}(0.8, 1)\), or \(\mathcal{N}(0.1, 1)\) respectively.
Metric: Regret: \(\mathbb{E}[r(a^*) - r(a_t)]\)
A test-time metric that balances exploration vs. exploitation, and is often theoretically bounded.
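The three-arm example above can be simulated directly. Below is a minimal sketch (not from any of the cited papers) of UCB1 on arms with means 0.5, 0.8, and 0.1, tracking cumulative expected regret:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.5, 0.8, 0.1])        # the three arms from the slide

def ucb_regret(T=5000):
    counts = np.zeros(3)
    sums = np.zeros(3)
    regret = 0.0
    for t in range(T):
        if t < 3:
            a = t                        # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        counts[a] += 1
        sums[a] += rng.normal(means[a], 1.0)
        regret += means.max() - means[a]  # expected (pseudo-)regret
    return regret

print(ucb_regret())   # sublinear: far below the ~3500 of always pulling the worst arm
```

The optimism bonus \(\sqrt{2\log t / n_a}\) shrinks as an arm is pulled, so exploration decays on its own and regret stays bounded by \(O(\log T)\) for fixed gaps.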
Interaction Protocol: observe visual context \(c_t\); choose a discrete action \(a \in \{1, 2, 3\}\); observe reward \(r \sim \mathcal{N}(\mu_a(c_t), \sigma)\).
Visual Context: \(c_t\)
Discrete Actions: \(a\)
\(r\): Success/Failure \(\{1, 0\}\)
E. K. Gordon et al, “Adaptive robot-assisted feeding...", IROS 2020
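A hedged sketch of this protocol with a stand-in algorithm: epsilon-greedy over per-action ridge-regression reward models. The cited system uses its own algorithm and learned features; everything here (dimensions, noise levels, the linear reward map) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 5, 3, 3000
W_true = rng.normal(size=(K, d))              # hidden per-action reward params

A = [np.eye(d) for _ in range(K)]             # per-action ridge statistics
b = [np.zeros(d) for _ in range(K)]
regret = 0.0
for t in range(T):
    c = rng.normal(size=d)                    # visual context c_t
    mu = W_true @ c                           # true expected rewards (unknown)
    est = np.array([np.linalg.solve(A[k], b[k]) @ c for k in range(K)])
    a = int(rng.integers(K)) if rng.random() < 0.05 else int(np.argmax(est))
    r = mu[a] + rng.normal(0, 0.1)            # bandit feedback: only arm a
    A[a] += np.outer(c, c)                    # update ONLY the chosen arm
    b[a] += r * c
    regret += mu.max() - mu[a]

print(regret / T)   # per-round regret stays small once the models converge
```

Note the bandit structure in the update: only the chosen arm's statistics change, which is exactly why the post hoc context trick later in the talk is valuable.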
Eye-in-Hand RGBD
SegmentAnything
ResNet Features
Haptic data is really good for food classification.
55ms of force data:
T. Bhattacharjee et al, R-AL 2019; E. Gordon et al, "Leveraging post hoc context...", ICRA 2021
\(l_a(c) = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]\)
Supervised Learning sees \(l_a(c) \forall a\)
Full Feedback
\(c\)
Bandit Algorithm sees \(l_{a_t}(c)\)
Bandit Feedback (Harder)
No counterfactual.
E. Gordon et al, "Leveraging post hoc context...", ICRA 2021
Regret: \(O(\sqrt{T})\)
Regret: \(O(\dim a \cdot \dim c \cdot \sqrt{T})\)
E. Gordon et al, "Leveraging post hoc context...", ICRA 2021
Consider a joint loss model:
\(r_a + \epsilon = f_{\theta^*}(c) = g_{\phi^*}(p)\)
\(\epsilon \sim \mathcal{N}\)
visual context
haptic context
\(f_\theta(c, a) = [0.9, 0.1, 0.5, 0.8, \ldots]\)
Once either model is learned, the other model reduces from bandit feedback to full feedback
Example:
After only 1 action, robot determines that kiwi \(\approx\) banana, and can impute the counterfactual.
Observe \(a=1 \rightarrow [0, ?, ?, ?, \ldots]\)
\(g_\phi(p, a) = [0.8, 0.2, 0.4, 0.7, \ldots]\)
Can provide the counterfactual
Regret: \(O(\dim c) \rightarrow O(\min(\dim c,\dim p))\)
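To illustrate the reduction, here is a toy linear version of the joint loss model. The haptic map \(g\) is assumed already learned (exactly, via a made-up link matrix `M`), so each interaction hands the visual model \(f\) a full counterfactual reward vector, i.e. full feedback:

```python
import numpy as np

rng = np.random.default_rng(2)
K, dc, dp = 4, 6, 6
Wf_true = rng.normal(size=(K, dc))     # true visual-context reward map f*
M = rng.normal(size=(dp, dc))          # made-up visual->haptic link: p = M c
Wg = Wf_true @ np.linalg.inv(M)        # haptic model g, assumed already learned

Wf = np.zeros((K, dc))                 # visual model f, learned online
for t in range(3000):
    c = rng.normal(size=dc)            # visual context before the bite
    p = M @ c                          # post hoc haptic context after the bite
    full_r = Wg @ p                    # g imputes rewards for ALL K actions
    Wf += 0.05 * np.outer(full_r - Wf @ c, c) / dc   # full-feedback step

print(np.abs(Wf - Wf_true).max())      # near 0: f recovered from full feedback
```

Contrast with the bandit update earlier: every row of `Wf` is updated every round, which is what removes the \(\dim a\) factor from the regret.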
E. Gordon et al, "Leveraging post hoc context...", ICRA 2021
Yes
New foods take ~7-8 actions to learn to user satisfaction.
[1] Kapusta et al, Autonomous Robots 2019; [2] Hello Robot; [3] KukaDressing
Grooming
Rehabilitation
How can we perform active learning with:
Dynamic Objects?
Visual Occlusions?
Robust Simulation Tools?
Can a robot do this?
Robot Trajectory \(r_t\)
Object Geometry \(\theta^*\)
Object Pose \(x^*_T\)
Static Objects: "assume a sensor that can detect contact before causing movement" [2]
Utilizes 2D OR discrete object priors.
Spatially Sparse Data \(\rightarrow\) Active Learning
[1] Hu et al, Biomimetic Intelligence and Robotics 2024 ; [2] Xu et al. "TANDEM3D...", ICRA 2023
E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
\(m_0\)
\(\ldots\)
\(x_5\)?
We measure contact at \(t=0\).
(Learning) Where is the object at \(t=5\)?
(Information) How certain are we?
Dynamics: \(x_{t+1} = f(x_t)\)
Measurement Model: \(\mathbb{P}(m_t | x_t)\)
Likelihood: \(\mathbb{P}(m_0 | x_5) = \mathbb{P}(m_0 | x_0 = f^{-5}(x_5))\)
That's a backwards simulation.
Difficulty #1: It is impossible to simulate Coulomb friction backwards.
(much less differentiate it backwards)
E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
Avoid Difficulty #1 by optimizing the entire trajectory:
\(\mathbb{P}(m_0 | x_5) \rightarrow \mathbb{P}(m_0 | \tau=[x_0,\ldots,x_5])\mathbb{P}(\tau)\)
Loss \(\mathcal{L} := -\sum_t\log\mathbb{P}(m_t | x_t) + ||x_t - f(x_{t-1})||^2\)
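A toy instance of this trajectory-level loss, with 1-D linear dynamics \(f(x) = 0.9x\) standing in for contact dynamics and a single Gaussian measurement of \(x_0\). Plain gradient descent over the whole trajectory recovers \(x_0 \to m_0\) and \(x_5 \to f^5(x_0)\) with no backward simulation; all constants are illustrative.

```python
import numpy as np

a_dyn, sigma, m0 = 0.9, 1.0, 1.0         # dynamics gain, meas. noise, measurement

def loss_grad(x):
    """Gradient of -log P(m0|x0) + sum_t ||x_t - f(x_{t-1})||^2 (Gaussian meas.)."""
    g = np.zeros_like(x)
    g[0] += (x[0] - m0) / sigma**2        # measurement term acts on x_0 only
    for t in range(1, len(x)):            # dynamics-consistency penalty
        r = x[t] - a_dyn * x[t - 1]
        g[t] += 2 * r
        g[t - 1] += -2 * a_dyn * r
    return g

x = np.zeros(6)                           # decision variables x_0 ... x_5
for _ in range(1000):
    x -= 0.2 * loss_grad(x)

print(x[0], x[5])                         # x_0 -> 1.0, x_5 -> 0.9**5 ~ 0.59
```

The dynamics only ever appear in the forward direction, inside the penalty term, which is the point of optimizing the whole trajectory.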
Difficulty #2: contact dynamics \(f\) often have near-0 or near-\(\infty\) gradients.
B. Bianchini et al, "Generalization Bounded...", L4DC 2022; E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
Analogy: \(y = H(x-\theta)\)
Mean Square Error
\(\mathcal{L}_{MSE} = \sum_\mathcal{D}||y_{\mathcal{D}} - H(x_\mathcal{D}-\theta)||^2\)
Problem: \(\nabla_\theta \mathcal{L}_{MSE}\) is 0 almost everywhere!
Alternative: Graph Distance
\(\mathcal{L}_{GD} = \sum_\mathcal{D}\min_x||(x_{\mathcal{D}}, y_{\mathcal{D}}) - (x, H(x-\theta))||^2\)
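The Heaviside analogy is easy to check numerically. A central finite difference of \(\mathcal{L}_{MSE}\) in \(\theta\) returns exactly zero away from the data points, while the graph distance, which for a step function reduces to \(\min((x-\theta)^2, 1)\) per misclassified point, yields a usable descent direction (grid and constants here are illustrative):

```python
import numpy as np

theta_true = 0.3
xs = np.linspace(-0.99, 0.99, 40)          # grid chosen to avoid a sample at 0
ys = (xs >= theta_true).astype(float)      # step-function data y = H(x - theta*)

H = lambda z: (z >= 0).astype(float)

def mse(theta):
    return np.sum((ys - H(xs - theta)) ** 2)

def graph_dist(theta):
    # squared distance from each data point to the graph of x -> H(x - theta):
    # 0 if classified correctly, else min((x - theta)^2, 1)
    mis = ys != H(xs - theta)
    return np.sum(np.minimum((xs[mis] - theta) ** 2, 1.0))

eps, theta0 = 1e-6, 0.0
g_mse = (mse(theta0 + eps) - mse(theta0 - eps)) / (2 * eps)
g_gd = (graph_dist(theta0 + eps) - graph_dist(theta0 - eps)) / (2 * eps)
print(g_mse, g_gd)   # 0.0 (flat almost everywhere) vs. a negative descent signal
```

Gradient descent on the graph distance pushes \(\theta\) toward 0.3; on the MSE it goes nowhere.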
Trade-Off:
The graph distance requires an inner optimization (a QP) over contact forces.
\(\mathbb{P}(m_t | x_t)\)
Maximum Likelihood Estimate: \(\tilde{\Theta} = \arg\min_\Theta\mathcal{L}(\Theta)\)
\(\rightarrow \frac{d\mathcal{L}}{d\Theta}(\tilde{\Theta}) = 0\)
Information \(\rightarrow\) How certain am I? Ideally: answer without a strong prior.
How certain is this?
Noise Floor
Low Info
High Info
Past (Observed) Information:
\(\mathcal{I} := \sum_{m_t}\nabla_{\Theta}\mathcal{L}\left(\nabla_{\Theta}\mathcal{L}\right)^T\)
Future (Fisher) Information:
\(\mathcal{F} := Var_{m_t}\left[\nabla_{\Theta}\mathcal{L}\right]\)
\(= \mathbb{E}_{m_t}\left[\nabla_{\Theta}\mathcal{L}\left(\nabla_{\Theta}\mathcal{L}\right)^T\right] \)
EIG \(:= \log\det(\mathcal{F}\mathcal{I}^{-1} + \mathbf{I})\)
E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
Learn; Compute
Observed Info \(\mathcal{I}\)
Sample Actions + Simulate
Expected Fisher Info \(\mathcal{F}\)
\(\max EIG := \log\det\left(\mathcal{F}\mathcal{I}^{-1} + \mathbf{I}\right)\)
Choose actions where simulated, expected Fisher info is distinct from Observed info.
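A small numeric sketch of this scoring rule (the matrices are toy stand-ins, not from the system): an action whose simulated Fisher information covers the dimension the observed information has not yet pinned down scores higher.

```python
import numpy as np

def eig_score(fisher, observed):
    """EIG = log det(F I^{-1} + Id), as defined on the slide."""
    k = len(observed)
    return np.linalg.slogdet(fisher @ np.linalg.inv(observed) + np.eye(k))[1]

I_obs = np.diag([10.0, 10.0, 0.1])       # dims 0,1 well observed; dim 2 uncertain
F_redundant = np.diag([5.0, 5.0, 0.0])   # action that re-probes what we know
F_novel = np.diag([0.0, 0.0, 5.0])       # action that probes the uncertain dim

print(eig_score(F_redundant, I_obs), eig_score(F_novel, I_obs))
# ~0.81 vs ~3.93: the novel action wins
```

The `+ Id` term keeps the score finite when a candidate action yields no information at all.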
There is no avoiding Difficulty #1: we need \(\nabla_{x_5}\log\mathbb{P}(m_0 | x_5)\) to compute \(\mathbb{E}_{m_0}[(\nabla_{x_5}\log\mathbb{P}(m_0 | x_5))^2]\)
E.K. Gordon et al, "Active Tactile Exploration...", ICRA 2026
"Quasi-Static" Solution: pretend \(\nabla_{x_5}x_0 = \mathbf{I}\)
\(\mathcal{F} = \sum_t \mathbb{E}_{m_t}[(\nabla_{x_t}\log\mathbb{P}(m_t|x_t))^2]\)
Pro: Easy to compute
For a Gaussian measurement model, no sampling is required for \(\mathbb{E}\).
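The "no sampling required" claim can be checked on a toy scalar model \(m \sim \mathcal{N}(h(x), \sigma^2)\), with a made-up measurement function \(h\): the closed-form Fisher information \(h'(x)^2/\sigma^2\) matches a Monte Carlo estimate of \(\mathbb{E}[(\nabla_x \log\mathbb{P}(m|x))^2]\).

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, x = 0.2, 0.7
h = lambda x: np.sin(x)                   # toy measurement function (assumed)
dh = lambda x: np.cos(x)

closed = dh(x) ** 2 / sigma**2            # closed-form Fisher info, no sampling
m = rng.normal(h(x), sigma, 200_000)
score = (m - h(x)) / sigma**2 * dh(x)     # d/dx log N(m; h(x), sigma^2)
mc = np.mean(score ** 2)                  # Monte Carlo estimate of E[score^2]
print(closed, mc)                         # agree to within ~1%
```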
E. Gordon et al, "Active Tactile Exploration...", ICRA 2026
Ongoing Work
"Quasi-Static" Solution won't work for more dynamic systems.
Can we do better? Yes, through marginalization.
We want \(\nabla_{x_T} \log\mathbb{P}(m_0 | x_T)\)
\(=\nabla_{x_T} \log\int_{\tau}\mathbb{P}(m_0 | \tau)\mathbb{P}(\tau| x_T)\)
\(\approx softmax_{\tau\sim MCMC}\left(\log \mathbb{P}(m_0 | x_0)\right) \cdot \left(\nabla_{x_T}||f(x_{T-1}) - x_T||_2^2\right)\)
Con: We introduce sampling. However...
Journal Extension in the works...
We want \(\nabla_{x_T} \log\mathbb{P}(m_t | x_T)\)
\(=\nabla_{x_T} \log\int_{x_t}\mathbb{P}(m_t | x_t)\mathbb{P}(x_t | x_T)\)
Assume:
\(\mathbb{P}(x_t | x_T) \propto \exp(f(x_t, x_T))\)
\(\approx\nabla_{x_T} \log\sum_{x_t\sim U}\mathbb{P}(m_t | x_t)\mathbb{P}(x_t | x_T)\)
\(=softmax_{x_t\sim U}\left(\log \mathbb{P}(m_t | x_t)+ f(x_t, x_T)\right) \cdot \left(\nabla_{x_T}f(x_t, x_T)\right)\)
\(=softmax_{x_t\sim MCMC}\left(\log \mathbb{P}(m_t | x_t)\right) \cdot \left(\nabla_{x_T}f(x_t, x_T)\right)\)
No gradient through sampling
At the cost of sampling trajectories \(x_t\) (via MCMC), we bypass the inverse Jacobian.
(If we have a good guess \(\tilde{x}_t\), MCMC should be quick)
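A 1-D linear-Gaussian check of the softmax estimator (toy constants; plain uniform sampling stands in for MCMC): backward dynamics are never inverted, yet the softmax-weighted average of \(\nabla_{x_T} f\) matches the closed-form gradient of the log marginal.

```python
import numpy as np

rng = np.random.default_rng(5)
a, s, sig_m = 0.9, 0.3, 0.2              # dynamics gain, transition & meas. noise
m, xT = 0.5, 1.0                         # observed measurement, current estimate

f = lambda xt: -((xt - a * xT) ** 2) / (2 * s**2)        # log P(xt|xT) + const
df_dxT = lambda xt: (xt - a * xT) * a / s**2             # grad of f wrt xT
log_meas = lambda xt: -((m - xt) ** 2) / (2 * sig_m**2)  # log P(m|xt) + const

xs = rng.uniform(-2.0, 3.0, 100_000)     # x_t ~ U: uniform proposal, not MCMC
logw = log_meas(xs) + f(xs)
w = np.exp(logw - logw.max())
w /= w.sum()                             # softmax weights
est = np.sum(w * df_dxT(xs))             # softmax(...) . grad_{xT} f

true = a * (m - a * xT) / (sig_m**2 + s**2)   # closed-form grad log P(m|xT)
print(est, true)                         # agree closely; no inverse Jacobian used
```

The gradient never passes through the sampler, only through \(f(x_t, x_T)\), which is the "no gradient through sampling" point above.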
E. Gordon et al, "An adaptable, safe, and portable robot-assisted feeding system.", HRI Companion 2024
A. Nanavati, E. Gordon et al, "Lessons learned from designing...", HRI 2025
Images
\(\mathcal{L}\)
We used privileged simulator information to compute observed info \(\approx\) uncertainty.
Idea: have the VLA learn that uncertainty explicitly.
States
Observed Information:
\(\mathcal{I} := \sum_{m_t}\nabla_{\Theta}\mathcal{L}\left(\nabla_{\Theta}\mathcal{L}\right)^T\)
Dataset \(\mathcal{D}\)
Perception
World Model + Policy
\(\mathcal{L}\)
Run Classical Techniques on World-Model:
As et al, "ActSafe...", ICLR 2025
Beneficial to play optimistically w.r.t. loss
Safer to play pessimistically w.r.t. model parameters
Identify loss components that are safety-critical (pessimistic) vs. performance-critical (optimistic).
Residual Policy
\(a\)
\(a + \tilde{a}\)
\(\mathcal{L}\)
DAIR Lab
Amal Nanavati
Ethan K. Gordon