Learning From Corrupt Data

Sept 3, 2025

Adam Wei

Agenda

Sources of Robot Data

[Figure: robot data sources (simulation, Open-X, robot teleop, expert robot teleop) arranged on a spectrum from "corrupt" to "clean".]

Goal: sample "high-quality" trajectories for your task.

Train on the entire spectrum to learn "high-level" reasoning, semantics, etc.

Types of "Corruption" in Robotics

  • Sim2real gaps
  • Task-level corruptions
  • High-quality vs low-quality teleop data
  • Change of low-level controller
  • Embodiment gap
  • Corruption in the conditioning variables
    • e.g. camera intrinsics/extrinsics, motion blur, etc.

Research Questions (w/ Giannis Daras)

  1. How can we learn from both clean and corrupt data?
  2. Can these algorithms be adapted for robotics?

Project Outline

North Star Goal: Train with internet-scale data (Open-X, AgiBot, etc.)

Stepping Stone Experiments:

  • Motion planning
  • Sim-and-real co-training (planar pushing)
  • Cross-embodied data (Lu's data)
  • Bin picking

Algorithm Overview (w/o \(\sigma_{max}\))

Repeat:

  1. Sample (O, A, \(\sigma_{min}\)) ~ \(\mathcal{D}\)
  2. Choose noise level \(\sigma > \sigma_{min}\)
  3. Optimize denoiser or ambient loss

[Figure: the noise axis from \(\sigma=0\) to \(\sigma=1\), with \(\sigma_{min}\) marking the lowest noise level at which a datapoint is used.]

Corrupt data: \(\sigma_{min}>0\). Clean data: \(\sigma_{min}=0\).
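A minimal sketch of one iteration of this loop, assuming the VP-style parameterization \(x_t=\sqrt{1-\sigma_t^2}\,x_0+\sigma_t\epsilon\) implied by the loss slides; the function name is mine and the model/optimizer are stubbed out, so this is an illustration of the sampling logic, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def training_step(x_tmin, sigma_min, sigma_max=1.0):
    """One iteration of the loop above. Clean points (sigma_min = 0) admit
    any noise level; corrupt points are only used at noise levels above
    their own sigma_min."""
    # Step 2: choose a noise level above this datapoint's sigma_min.
    sigma = rng.uniform(sigma_min, sigma_max)
    # Diffuse x_{t_min} forward to noise level sigma so that the marginal
    # of x_t matches sqrt(1 - sigma^2) x_0 + sigma * eps.
    a = np.sqrt((1 - sigma**2) / (1 - sigma_min**2))
    b = np.sqrt(sigma**2 - a**2 * sigma_min**2)
    x_t = a * x_tmin + b * rng.standard_normal(x_tmin.shape)
    # Step 3: clean data -> denoising loss; corrupt data -> ambient loss.
    use_ambient = sigma_min > 0
    return x_t, sigma, use_ambient
```

With `sigma_min = 0` this reduces to the ordinary diffusion training step over the full noise range.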

Loss Function (for \(x_0\sim q_0\))

|  | Denoising Loss (assumes access to \(x_0\)) | Ambient Loss (assumes access to \(x_{t_{min}}\)) |
|---|---|---|
| \(x_0\)-prediction | \(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\) | \(\mathbb E[\lVert h_\theta(x_t, t) + \frac{\sigma_{min}^2\sqrt{1-\sigma_{t}^2}}{\sigma_t^2-\sigma_{min}^2}x_{t} - \frac{\sigma_{t}^2\sqrt{1-\sigma_{min}^2}}{\sigma_t^2-\sigma_{min}^2} x_{t_{min}} \rVert_2^2]\) |
| \(\epsilon\)-prediction | \(\mathbb E[\lVert h_\theta(x_t, t) - \epsilon \rVert_2^2]\) | \(\mathbb E[\lVert h_\theta(x_t, t) - \frac{\sigma_t (1-\sigma_{min}^2)}{\sigma_t^2 - \sigma_{min}^2}x_t + \frac{\sigma_t \sqrt{1-\sigma_t^2}\sqrt{1-\sigma_{min}^2}}{\sigma_t^2 - \sigma_{min}^2}x_{t_{min}}\rVert_2^2]\) |
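The ambient \(x_0\)-prediction loss can be read off directly: the regression target is the linear combination of \(x_t\) and \(x_{t_{min}}\) that equals \(x_0\) in expectation. A sketch under the VP parameterization \(x_t=\sqrt{1-\sigma_t^2}\,x_0+\sigma_t\epsilon\) (function names are mine):

```python
import numpy as np

def ambient_x0_target(x_t, x_tmin, sigma_t, sigma_min):
    """Linear combination of x_t and x_{t_min} that equals x_0 in
    expectation; the coefficients match the ambient x0-prediction loss."""
    denom = sigma_t**2 - sigma_min**2
    return (sigma_t**2 * np.sqrt(1 - sigma_min**2) * x_tmin
            - sigma_min**2 * np.sqrt(1 - sigma_t**2) * x_t) / denom

def ambient_x0_loss(h_pred, x_t, x_tmin, sigma_t, sigma_min):
    """Mean squared error between the network output and the ambient target."""
    target = ambient_x0_target(x_t, x_tmin, sigma_t, sigma_min)
    return float(np.mean((h_pred - target) ** 2))
```

Note that at \(\sigma_{min}=0\) the target reduces to \(x_{t_{min}}=x_0\), so the ambient loss recovers the ordinary denoising loss on clean data.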

Experiment: RRT vs GCS

[Figure: sample plans from the GCS and RRT planners.]

Task: Co-train on GCS and RRT data

Goal: Sample clean, smooth GCS-style plans

  • RRT data is an example of high-frequency corruption
  • Common in robotics, e.g. from low-quality teleoperation or data generation

Baselines

| Training Data | Success Rate | Average Jerk Squared |
|---|---|---|
| 100 GCS demos | 50% | 7.5k |
| ~5000 GCS demos | 99% | 2.5k |
| ~5000 RRT demos | 100% | 17k |
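One way to compute the smoothness metric reported here, assuming "average jerk squared" means the mean squared third finite difference of the trajectory (the slides' exact definition may differ):

```python
import numpy as np

def mean_squared_jerk(traj, dt):
    """Average squared jerk of a discretized trajectory of shape (T, d):
    third finite difference of position divided by dt^3, squared, summed
    over dimensions, and averaged over time. Lower means smoother."""
    jerk = np.diff(traj, n=3, axis=0) / dt**3
    return float(np.mean(np.sum(jerk**2, axis=-1)))
```

A constant-velocity trajectory has zero jerk under this metric, while jagged RRT-style plans score high.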

Cotraining vs Ambient Diffusion

| Method (Training Data) | Success Rate | Average Jerk Squared |
|---|---|---|
| Ambient (100 GCS + 5000 RRT demos) | 98% | 5.5k |
| Co-training (100 GCS + 5000 RRT demos) | 91% | 14.5k |
| GCS only (100 GCS demos) | 50% | 48.5k |


Sim & Real Cotraining

  • "Clean" data: real-world demos, \(|\mathcal{D}_T|=50\)
  • "Corrupt" data: simulated demos, \(|\mathcal{D}_S|=2000\)

Eval criteria: Success rate for planar pushing across 200 randomized trials

[Figures: \(\sigma_{min}\) swept per dataset; performance vs. classifier epoch; performance vs. classifier threshold; performance vs. number of sim demos at 10, 50, 150, and 500 real demos.]
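The classifier sweeps suggest a sim-vs-real classifier is used to decide how much to trust each datapoint. One plausible assignment rule, purely an assumption on my part (the threshold, the \(\sigma_{min}\) value for sim data, and the function name are all placeholders), maps the classifier's confidence to a per-datapoint \(\sigma_{min}\):

```python
import numpy as np

def assign_sigma_min(p_real, threshold=0.5, sigma_corrupt=0.4):
    """Map a classifier's P(real | x) to a per-datapoint sigma_min:
    points confidently classified as real are treated as clean
    (sigma_min = 0); the rest are only used above sigma_corrupt."""
    p_real = np.asarray(p_real, dtype=float)
    return np.where(p_real >= threshold, 0.0, sigma_corrupt)
```

Sweeping `threshold` then trades off how much sim data is treated as clean versus corrupt, matching the classifier-threshold sweeps above.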

Denoising Loss vs Ambient Loss

Ambient Diffusion: Scaling \(|\mathcal{D}_S|\)

Hypothesis: As sim data increases, ambient diffusion approaches "sim-only" training, which harms performance.

[Figures: results at \(|\mathcal{D}_S|=500\) and \(|\mathcal{D}_S|=8000\).]


Hypothesis 1: Ambient diffusion has plateaued.

  • I think this is unlikely, although Giannis has an experiment to confirm it.

Hypothesis 2: A softer version of ambient diffusion is needed.

  • Instead of a hard \(\sigma_{min}\) cutoff, weight datapoints softly: rely on them more at high noise levels and less at low noise levels.
  • The exact form of this soft "mixing" function still needs to be worked out.
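One candidate for such a soft mixing function, purely as an illustration (the sigmoid shape and the sharpness parameter are placeholders; the slides leave the function open):

```python
import numpy as np

def soft_weight(sigma, sigma_min, sharpness=10.0):
    """Soft alternative to the hard sigma >= sigma_min cutoff: a sigmoid
    in sigma that gives a corrupt datapoint nearly full weight at high
    noise levels and smoothly downweights it at low ones."""
    return 1.0 / (1.0 + np.exp(-sharpness * (sigma - sigma_min)))
```

As `sharpness` grows this recovers the hard cutoff; the open question is which shape interpolates best.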


Ambient Diffusion: \(\sigma_{max}\)

2025/09/03: Costis/Russ
