Learning From Out-of-Distribution Data In Robotics

April 10, 2026

Adam Wei

Agenda

  1. Reviewing the Ambient Algorithm
  2. Properties of Robot Data
  3. Motion planning experiments
  4. Bin Sorting Experiments
  5. Real-World Experiments
  6. Concluding Thoughts

Part 1

Reviewing the Ambient Algorithm

[Diagram: a learning algorithm maps in-distribution data (e.g., Open-X; distribution \(p\)) to a policy \(\pi(a | o, l)\), while out-of-distribution data (e.g., simulation; distribution \(q\)) goes unused.]

What are principled algorithms for learning from out-of-distribution data sources?

Our approach: Ambient Diffusion Policy (collab. w/ Giannis Daras)

Distribution Shifts in Robot Data

  • Sim2real gaps
  • Noisy/low-quality teleop
  • Task-level mismatch
  • Changes in low-level controller
  • Embodiment gap
  • Camera models, poses, etc.
  • Different environments, objects, etc.

Data Filtering: Pro

CC12M: 12M+ image–text caption pairs

"Corrupt" data: low-quality images
"Clean" data: high-quality images

Pro: protects the final sampling distribution

Data Filtering: Con

There is still value and utility in OOD data!

... we just aren't using it correctly

Goal: to develop principled algorithms that change the way we use low-quality or OOD data


Diffusion Training

\(\sigma=0\)

\(\sigma=1\)

"Clean" Data

For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)
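The objective above can be made concrete with a toy numpy sketch of one training step (a sketch only: observation conditioning is dropped for brevity, and the variance-preserving noising \(A_\sigma = \sqrt{1-\sigma^2}\,A_0 + \sigma z\), array shapes, and identity "denoiser" are illustrative assumptions, not the actual policy network):

```python
import numpy as np

def diffusion_training_step(denoise, A0, rng):
    """One x0-prediction denoising training step (toy sketch).

    A0      : (batch, horizon, action_dim) clean action chunks
    denoise : callable (A_sigma, sigma) -> estimate of E[A0 | A_sigma]
    Assumes the variance-preserving noising
    A_sigma = sqrt(1 - sigma^2) * A0 + sigma * z, matching the
    sqrt(1 - sigma_t^2) factors in the loss formulas later on.
    """
    batch = A0.shape[0]
    sigma = rng.uniform(0.0, 1.0, size=(batch, 1, 1))    # noise level per sample
    z = rng.standard_normal(A0.shape)                    # Gaussian noise
    A_sigma = np.sqrt(1.0 - sigma**2) * A0 + sigma * z   # noised actions
    pred = denoise(A_sigma, sigma)
    return np.mean((pred - A0) ** 2)                     # E||h - A0||^2

rng = np.random.default_rng(0)
A0 = rng.standard_normal((16, 8, 2))                # toy "clean" action chunks
identity_denoiser = lambda A_sigma, sigma: A_sigma  # trivial stand-in model
loss = diffusion_training_step(identity_denoiser, A0, rng)
```

In practice `denoise` would be the trained network \(h_\theta\) and the loss would be backpropagated; the sketch only shows the sampling of \(\sigma\), the noising, and the regression target.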

Co-training

\(\sigma=0\)

\(\sigma=1\)

"Corrupt" Data

"Clean" Data

For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)

Co-training

\(\sigma=0\)

\(\sigma=1\)

"Clean" Data

\(p^{train} = \alpha p + (1-\alpha) q\)

For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)

\(p^{train}\) contains \(q\) \(\implies\) this is the wrong objective
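The mixture \(p^{train} = \alpha p + (1-\alpha) q\) corresponds to how a co-training batch is drawn; a minimal sketch (the stand-in datasets and helper name are hypothetical):

```python
import numpy as np

def cotrain_batch(clean, corrupt, alpha, batch_size, rng):
    """Draw a co-training batch from p_train = alpha * p + (1 - alpha) * q.

    Each example comes from the clean dataset (p) with probability alpha,
    else from the corrupt dataset (q)."""
    from_clean = rng.random(batch_size) < alpha
    idx_p = rng.integers(0, len(clean), batch_size)
    idx_q = rng.integers(0, len(corrupt), batch_size)
    return np.where(from_clean[:, None], clean[idx_p], corrupt[idx_q])

rng = np.random.default_rng(0)
clean = np.zeros((100, 4))    # stand-in dataset drawn from p (all zeros)
corrupt = np.ones((500, 4))   # stand-in dataset drawn from q (all ones)
batch = cotrain_batch(clean, corrupt, alpha=0.8, batch_size=1000, rng=rng)
frac_clean = 1.0 - batch.mean()  # zero rows mark clean examples
```

The fraction of clean examples in the batch concentrates around \(\alpha\), which is exactly why the effective training distribution contains \(q\).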

Ambient

\(\sigma=0\)

\(\sigma=1\)

For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)

\(\sigma > \sigma_{min}\)

"Clean" Data

\(\sigma_{min}\)

\(p_\sigma\) \(\not\approx\) \(q_\sigma\)

\(p_\sigma\) \(\approx\) \(q_\sigma\)

Gaussian Noise As Contraction

\(p_0\)

\(q_0\)

\(p_\sigma\)

\(q_\sigma\)

\(D(p_0, q_0) \geq D(p_\sigma, q_\sigma)\)

\(D(p_\sigma, q_\sigma) \to 0\) as \(\sigma\to \infty\) 

\(\implies \exists \sigma_{min} \ \mathrm{s.t.}\ D(p_\sigma, q_\sigma) < \epsilon\ \forall \sigma > \sigma_{min}\)

Noisy Channel

\(Y = X + \sigma Z\)

\(D(p_0, q_0)\)

\(D(p_\sigma, q_\sigma)\)
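The contraction can be checked empirically: pushing samples from two different distributions through the same noisy channel \(Y = X + \sigma Z\) shrinks a histogram estimate of their total variation distance. A self-contained numpy sketch (the two Gaussian example distributions and the binning are illustrative):

```python
import numpy as np

def tv_estimate(x, y, bins=60, lo=-8.0, hi=8.0):
    """Histogram (Monte Carlo) estimate of total variation distance."""
    px, _ = np.histogram(x, bins=bins, range=(lo, hi))
    qy, _ = np.histogram(y, bins=bins, range=(lo, hi))
    return 0.5 * np.abs(px / len(x) - qy / len(y)).sum()

rng = np.random.default_rng(0)
n = 200_000
p0 = rng.normal(-1.0, 0.5, n)  # samples from p_0
q0 = rng.normal(+1.0, 0.5, n)  # samples from q_0

# Push both through the same noisy channel Y = X + sigma * Z.
tvs = [tv_estimate(p0 + s * rng.standard_normal(n),
                   q0 + s * rng.standard_normal(n)) for s in (0.0, 1.0, 3.0)]
# The divergence estimate shrinks monotonically as sigma grows.
```

This is the empirical face of \(D(p_\sigma, q_\sigma) \to 0\) as \(\sigma \to \infty\): for any \(\epsilon\) there is a \(\sigma_{min}\) beyond which the noised distributions are \(\epsilon\)-close.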


Choosing \(\sigma_{min}\)

How to choose \(\sigma_{min}\)?

\(\sigma_{min}= \inf\{\sigma\in[0,1]: c_\theta (x_\sigma, \sigma) > 0.5-\epsilon\}\)
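One way to operationalize this rule: the best achievable balanced accuracy between \(p_\sigma\) and \(q_\sigma\) is \((1 + TV(p_\sigma, q_\sigma))/2\), so we can sweep a \(\sigma\) grid and stop at the first \(\sigma\) where even a near-optimal classifier is within \(\epsilon\) of chance (0.5). A numpy sketch with a histogram stand-in for \(c_\theta\) (the example distributions, grid, and \(\epsilon\) are illustrative):

```python
import numpy as np

def best_accuracy(x, y, bins=60, lo=-8.0, hi=8.0):
    """Balanced accuracy of a (near-)optimal clean-vs-corrupt classifier,
    estimated from samples via accuracy = (1 + TV) / 2."""
    px, _ = np.histogram(x, bins=bins, range=(lo, hi))
    qy, _ = np.histogram(y, bins=bins, range=(lo, hi))
    tv = 0.5 * np.abs(px / len(x) - qy / len(y)).sum()
    return 0.5 + 0.5 * tv

def choose_sigma_min(p0, q0, sigmas, eps, rng):
    """Smallest sigma on the grid at which even the best classifier is
    within eps of chance, i.e. p_sigma and q_sigma are indistinguishable."""
    for sigma in sigmas:
        x = p0 + sigma * rng.standard_normal(len(p0))
        y = q0 + sigma * rng.standard_normal(len(q0))
        if best_accuracy(x, y) < 0.5 + eps:
            return sigma
    return None  # still distinguishable at every noise level tried

rng = np.random.default_rng(0)
n = 100_000
p0 = rng.normal(0.0, 0.3, n)   # "clean" samples
q0 = rng.normal(0.4, 0.3, n)   # "corrupt" samples: shifted mean
sigma_min = choose_sigma_min(p0, q0, np.linspace(0.0, 4.0, 41), eps=0.05, rng=rng)
```

In the actual method \(c_\theta\) is a learned classifier rather than this histogram surrogate; the stopping logic is the same.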


* assuming \(c_\theta\) the best possible classifier

Original claim*: this bounds \(TV(p_{\sigma_{min}}, q_{\sigma_{min}})\)

Corrected claim* [Kerem]: this bounds \(\Delta (p_{\sigma_{min}}, q_{\sigma_{min}})\)

\(\geq TV (p_{\sigma_{min}}, q_{\sigma_{min}})^2\)



Key Takeaway: poor classifier performance at \(\sigma_{min}\) \(\implies p_{\sigma} \approx q_{\sigma} \quad \forall \sigma > \sigma_{min}\)


How to choose \(\sigma_{min}\)?

Increasing granularity:

  • Assign \(\sigma_{min}\) per dataset (run the classifier per dataset)
  • Assign \(\sigma_{min}\) per datapoint (run the classifier per datapoint)
  • Can do in-between... e.g., last long talk

Theory [from original paper]

For \(\sigma > \sigma_{min}\): \(p_{\sigma} \approx q_{\sigma}\), but \(p_{\sigma} \neq q_{\sigma}\) exactly

\(\mathrm{MSE}(h_\theta) = \mathrm{bias}(h_\theta)^2 + \mathrm{var}(h_\theta)\)

... so training on \(q_\sigma\) still introduces bias!

[Informal] Theorem: For all \(\mathcal{D}_p\) and \(\mathcal{D}_q\), there exists \(\sigma_{min}\) sufficiently high s.t. training on \(\mathcal{D}_q\) for \(\sigma > \sigma_{min}\) improves distribution learning for \(p\)

* happens in co-training as well...



Ambient + Locality

\(\sigma=0\)

\(\sigma=1\)

For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)

"Corrupt" Data

"Clean" Data

\(\sigma_{min}\)

\(\sigma > \sigma_{min}\)

\(\sigma \leq \sigma_{max}\)

"Locality"

More on this later...

\(p_\sigma\) \(\approx\) \(q_\sigma\)

\(\sigma_{max}\)

Loss Function (for \(x_0\sim q_0\))

\(\mathbb E[\lVert h_\theta(x_t, t) + \frac{\sigma_{min}^2\sqrt{1-\sigma_{t}^2}}{\sigma_t^2-\sigma_{min}^2}x_{t} - \frac{\sigma_{t}^2\sqrt{1-\sigma_{min}^2}}{\sigma_t^2-\sigma_{min}^2} x_{t_{min}} \rVert_2^2]\)

Denoising Loss vs Ambient Loss

Denoising Loss (assumes access to \(x_0\)):

  \(x_0\)-prediction: \(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\)

  \(\epsilon\)-prediction: \(\mathbb E[\lVert h_\theta(x_t, t) - \epsilon \rVert_2^2]\)

Ambient Loss (assumes access only to \(x_{\sigma_{min}}\)):

  \(\mathbb E[\lVert h_\theta(x_t, t) - \frac{\sigma_t^2 (1-\sigma_{min}^2)}{(\sigma_t^2 - \sigma_{min}^2)\sqrt{1-\sigma_t^2}}x_t + \frac{\sigma_t \sqrt{1-\sigma_t^2}\sqrt{1-\sigma_{min}^2}}{\sigma_t^2 - \sigma_{min}^2}x_{t_{min}}\rVert_2^2]\)
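For concreteness, the \(x_0\)-prediction ambient target from the "Loss Function (for \(x_0\sim q_0\))" slide can be transcribed directly (a sketch; the array shapes and broadcasting are assumptions):

```python
import numpy as np

def ambient_target_x0(x_t, x_tmin, sigma_t, sigma_min):
    """Regression target of the x0-prediction ambient loss, transcribed
    from the slide: h_theta(x_t, t) regresses onto
        [s_t^2 sqrt(1 - s_min^2) x_tmin - s_min^2 sqrt(1 - s_t^2) x_t]
        / (s_t^2 - s_min^2),
    which only requires the partially-noised sample x_tmin (not x_0).
    Valid for sigma_t > sigma_min."""
    denom = sigma_t**2 - sigma_min**2
    return (sigma_t**2 * np.sqrt(1.0 - sigma_min**2) * x_tmin
            - sigma_min**2 * np.sqrt(1.0 - sigma_t**2) * x_t) / denom

# Sanity check: with sigma_min = 0 (i.e. x_tmin = x_0) the target reduces
# to the plain denoising target x_0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 2))
x_t = np.sqrt(1.0 - 0.7**2) * x0 + 0.7 * rng.standard_normal((8, 2))
target = ambient_target_x0(x_t, x0, sigma_t=0.7, sigma_min=0.0)
```

The \(\sigma_{min} = 0\) reduction is a useful check that the ambient loss generalizes, rather than replaces, the standard denoising loss.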

Part 2

Why Ambient for Robotics


When Does Ambient Work?

1. Data is scarce

(e.g., proteins: ~100,000 solved protein structures + 5 years = 200M+ protein structures)


When Does Ambient Work?

Criteria (vs. robotics):

  1. Data is scarce (robotics: yes, at least short-term)
  2. Data quality and sources are heterogeneous
  3. Data exhibits certain structure
     • Spectral Decay
     • Locality


Images: Power Spectral Density (PSD)

By Sander Dieleman


"Radially Averaged" PSD of the green channel
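A radially averaged PSD can be computed with a 2-D FFT followed by averaging \(|\hat{x}|^2\) over annuli of constant frequency radius; a numpy sketch (the binning choices are illustrative):

```python
import numpy as np

def radially_averaged_psd(img):
    """Radially averaged power spectral density of one 2-D channel:
    mean |FFT|^2 over annuli of constant spatial-frequency radius."""
    h, w = img.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    r = np.sqrt(fx**2 + fy**2)                 # frequency radius per grid cell
    nbins = min(h, w) // 2
    edges = np.linspace(0.0, r.max() + 1e-9, nbins + 1)
    which = np.digitize(r.ravel(), edges) - 1  # annulus index per grid cell
    counts = np.bincount(which, minlength=nbins)
    sums = np.bincount(which, weights=power.ravel(), minlength=nbins)
    psd = sums / np.maximum(counts, 1)         # mean power per annulus
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, psd

# White noise has a flat PSD; natural images instead show power-law decay.
rng = np.random.default_rng(0)
freqs, psd = radially_averaged_psd(rng.standard_normal((128, 128)))
```

Running this on a natural image's green channel instead of noise would show the spectral decay discussed on the next slides.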


Images: Noisy Image PSD

"Radially Averaged" PSD of the green channel: added Gaussian noise has a flat spectrum, so the noisy image has high SNR at low frequencies and low SNR at high frequencies.


Diffusion Process PSD


Implications for Image Diffusion

At high noise (\(\sigma\) near 1): generate low frequency features.

At low noise (\(\sigma\) near 0): generate high frequency features.


Implications for Ambient

Ambient uses noise as a high-freq. mask

\(\implies\) ambient is best for high-freq corruptions


Robot Data: PSD

Punch line: robot data exhibits spectral decay, and many robot corruptions are "motion-level" (high frequency).


Implications for Robotics


Implications for Ambient in Robotics

If the action corruption is motion-level, then Ambient will work well.

Low freq: high-level (task) planning and decision making
High freq: low-level motion primitives (e.g., how to grasp, smoothness)

Examples of motion-level corruptions: cross-embodiment, sim2real, noisy teleop, etc.


Implications for Ambient in Robotics

Concern: task and motion planning means something very specific...

Possible experiment:

Show that a classifier can predict the task variables at high noise

\(a_{\sigma(t_2)}\) sufficient to predict task variables?


Task-level Corruptions

What if the action corruption is task-level?

Red \(\rightarrow\) Left

Blue \(\rightarrow\) Right

Red \(\rightarrow\) Right

Blue \(\rightarrow\) Left

(outdated video...)


Locality

Sensitivity of \(\hat a_0^{(8)}\) to \(a_\sigma^{(i)}\) at different noise levels

\(\lVert \frac{\partial h_\theta^{(8)}(a_\sigma)}{\partial a_\sigma^{(i)}}\rVert\) vs. action index \(i\)
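The sensitivity plot can be approximated without autograd by finite differences; a numpy sketch with a toy local-averaging "denoiser" standing in for \(h_\theta\) (illustrative only, not the trained policy):

```python
import numpy as np

def sensitivity_profile(denoise, a_sigma, sigma, out_idx=8, eps=1e-4):
    """Finite-difference estimate of ||d h(a_sigma)[out_idx] / d a_sigma[i]||
    for every action index i (cf. the Jacobian-norm plot on the slide)."""
    H, D = a_sigma.shape                      # horizon x action_dim
    base = denoise(a_sigma, sigma)[out_idx]
    sens = np.zeros(H)
    for i in range(H):
        for d in range(D):
            pert = a_sigma.copy()
            pert[i, d] += eps
            col = (denoise(pert, sigma)[out_idx] - base) / eps
            sens[i] += np.sum(col**2)         # accumulate squared Jacobian column
    return np.sqrt(sens)

def smoothing_denoiser(a, sigma, width=2):
    """Toy stand-in for h_theta: local averaging, so it is local by design."""
    out = np.zeros_like(a)
    for t in range(len(a)):
        lo, hi = max(0, t - width), min(len(a), t + width + 1)
        out[t] = a[lo:hi].mean(axis=0)
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 2))              # one noised action chunk
profile = sensitivity_profile(smoothing_denoiser, a, sigma=0.1)
# Sensitivity concentrates near index 8: the toy denoiser ignores distant actions.
```

Applying the same probe to a real denoiser at several noise levels reproduces the locality measurement described here.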


Locality

At low-noise the denoiser does not attend to distant actions (i.e. ignores low-frequencies)

\(\implies\) can use data with task-level distribution shift at low-noise







Example: Audio



Misconceptions

Misconception: "coarse" to "fine" is a property of diffusion.
Reality: "coarse" to "fine" is a property of the data.

Misconception: spectral decay is required for diffusion.
Reality: diffusion (and Ambient) work regardless of the PSD.

  • e.g., diffusion still works for audio
  • spectral decay just provides a nice interpretation

Part 3

Motion Planning Experiments

Question break!


Motion Planning Experiments

Distribution shift: low-quality, noisy trajectories

High quality: 100 GCS trajectories
Low quality: 5000 RRT trajectories


Task vs Motion Level

Distribution shift: low-quality, noisy trajectories

[Diagram: noise axis from \(\sigma=0\) to \(\sigma=1\) with threshold \(\sigma_{min}\); 100 GCS trajectories (clean), 5000 RRT trajectories (corrupt).]

Task level: learn the maze structure
Motion level: learn smooth motions


Results

(Swept for best \(\sigma_{min}\) per dataset; policies evaluated over 1000 trials each)

                      Success Rate   Avg. Acc^2 (motion-level)
GCS                   57.5%          141.65
RRT                   99.0%          74.8
GCS+RRT (Co-train)    99.4%          62.2
GCS+RRT (Ambient)     99.5%          30.9


Qualitative Results

Co-trained

Ambient


Pareto Front


7-DoF Motion Planning

Clean data:

  • 100k trajopt trajectories

Corrupt data:

  • 1M RRT trajectories


Results

(Swept for best \(\sigma_{min}\) per dataset; policies evaluated over 1000 trials each)

                          Success Rate   Avg. Acc^2 (motion-level)
Trajopt                   46.0%          3.9
RRT                       52.0%          54.9
Trajopt+RRT (Co-train)    59.9%          42.7
Trajopt+RRT (Ambient)     65.9%          31.4


Pareto Front


Part 3.5

Sim-and-Real Cotraining

Distribution shift: sim2real gap

In-Distribution: 50 demos in "target" environment
Out-of-Distribution: 2000 demos in sim environment

Results: it works!

  • Best co-trained policy (IROS): 84.5%
  • Best Ambient policy (CoRL?): 93.5%

Part 4

Bin Sorting (Locality)

Algorithm Reminder

\(\sigma=0\)

\(\sigma=1\)

"Corrupt" Data

"Clean" Data

\(\sigma_{min}\)

\(\sigma > \sigma_{min}\)

\(\sigma \leq \sigma_{max}\)

"Locality"

\(p_\sigma\) \(\approx\) \(q_\sigma\)

\(\sigma_{max}\)

Locality Only

\(\sigma=0\)

\(\sigma=1\)

"Clean" Data

\(\sigma \leq \sigma_{max}\)

"Locality"

\(\sigma_{max}\)

Goal: isolate the effect of locality in robotics

Example: Bin Sorting

Distribution shift: task level mismatch, motion level correctness

In-Distribution: 

50 demos with correct sorting logic

Out-of-Distribution: 

200 demos with incorrect sorting


Metrics

Robot needs to learn two things:

1. Motion Planning: \(\frac{\#\ blocks \ in \ any \ bin}{\#\ total \ blocks}\)

2. Logic: \(\frac{\# \ blocks \ in \ correct \ bin}{\# \ blocks \ in \ any \ bin}\)

Goal: learn motion planning from the bad data, but not the task planning

Success rate: \(\frac{\# \ blocks \ in \ correct \ bin}{\# \ total \ blocks}\) = (motion planning) x (logic)
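The three quantities compose as shown; a small helper (hypothetical names) that computes them from rollout counts:

```python
def bin_sorting_metrics(in_any_bin, in_correct_bin, total_blocks):
    """Bin-sorting metrics from rollout counts:
    motion  = # blocks in any bin / # total blocks
    logic   = # blocks in correct bin / # blocks in any bin
    success = # blocks in correct bin / # total blocks = motion * logic
    """
    motion = in_any_bin / total_blocks
    logic = in_correct_bin / in_any_bin if in_any_bin else 0.0
    success = in_correct_bin / total_blocks
    return motion, logic, success

# e.g. 10 blocks, 9 placed in some bin, 6 of those in the correct bin
motion, logic, success = bin_sorting_metrics(9, 6, 10)
```

The factorization success = motion x logic is what lets the tables below attribute failures to motion planning vs. sorting logic separately.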

Results

            Success Rate   Logic Metric   Motion Metric
Diffusion   61.0%          61.9%          98.6%
Cotrain     22.7%          26.0%          87.2%
Locality    93.3%          95.0%          98.2%

Pareto Front: task planning vs. motion planning

Task Conditioning

                                 Success Rate   Logic Metric   Motion Metric
Cotrain (with task condition)    90.3%          91.5%          98.6%
Locality                         93.3%          95.0%          98.2%
Locality (with task condition)   92.8%          94.2%          98.5%

Corrupt-only below \(\sigma_{max}\)


Ablation Results

           Success Rate   Logic Metric   Motion Metric
Ablation   91.9%          93.8%          97.9%
Locality   93.3%          95.0%          98.2%

Part 5

Real-World Experiments

Scaling to Real-World Datasets

Goal: move objects from the table into the drawer

\(\mathcal{D}_{clean}\): 50 in-distribution demos

\(\mathcal{D}_{corrupt}\): Open-X Embodiment

Can we learn from unstructured distribution shifts in large real-world datasets?

Open-X

  • cross-embodied
  • different teleoperators
  • sim data!
  • mislabeled data
  • different tasks, environments, cameras

Task: Table Cleaning

Open-X

Magic Soup++: 27 Datasets

Custom OXE: 48 Datasets

  • 1.4M episodes
  • 55M "datagrams"

Algorithm

"Ambient": clean data at all noise levels; corrupt data only for \(\sigma > \sigma_{min}\).

"Ambient + Locality": additionally uses corrupt data for \(\sigma \leq \sigma_{max}\).

Task: Table Cleaning

Task completion = 0.1 x [opened drawer] + 0.8 x [# obj. cleaned / # obj.] + 0.1 x [closed drawer]
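The score above as a small helper (hypothetical function name):

```python
def task_completion(opened_drawer, n_cleaned, n_objects, closed_drawer):
    """Table-cleaning score from the slide:
    0.1 * [opened drawer] + 0.8 * (# cleaned / # objects) + 0.1 * [closed drawer]."""
    return (0.1 * float(opened_drawer)
            + 0.8 * (n_cleaned / n_objects)
            + 0.1 * float(closed_drawer))

# e.g. opened the drawer and cleaned 3 of 4 objects, but left the drawer open
score = task_completion(True, 3, 4, False)
```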

Results


Ablations: Reweighting


1. Ambient benefits from reweighting, but does not need it.

Note: Ambient and reweighting are orthogonal (can be applied simultaneously)

2. Co-training must reweight.

Ablations: Finetuning


1. Finetuning the Ambient base model is always better

2. When good data is limited, ambient outperforms co-train + finetune

Ablations: Noisy Observations

Ablations: CFG

Part 6

Concluding Thoughts

Concluding Thoughts

  • Sim2real gaps
  • Noisy/low-quality teleop
  • Task-level mismatch 
  • Changes in low-level controller
  • Embodiment gap
  • Camera models, poses, etc
  • Different environment, objects, etc

Ambient can be used to learn from any distribution shift in robotics


Next Directions

Q: What is "in-distribution"?

A [in this paper]: expert teleoperator on your robot, your task, your environment

A [more generally]: data quality?

How you define "in-distribution" changes depending on whether Ambient is used as a pre-training or a finetuning recipe

Next Directions

  1. Defining "in-distribution"
  2. Picking \(\sigma_{max}\), \(\sigma_{min}\)
  3. Rejection sampling idea [Kerem]
  4. Soft ambient [Pablo/Asu]
  5. Corruption in observation space
  6. Extending to Flows
  7. Ambient for video/latent actions

Thank you

Long Talk: 04/10/26

By weiadam
