April 10, 2026
Adam Wei
Part 1
Reviewing the Ambient Algorithm
Open-X
Learning Algorithm
In-Distribution Data
Policy
\(\pi(a | o, l)\)
What are principled algorithms for learning from out-of-distribution data sources?
simulation
Colab w/ Giannis Daras
Our approach: Ambient Diffusion Policy
\(p\)
\(q\)
Out-of-Distribution Data
Open-X
In-Distribution Data
simulation
Out-of-Distribution Data
CC12M: 12M+ image-text caption pairs
"Corrupt" Data:
Low quality images
"Clean" Data:
High quality images
Pro: Protects the final sampling distribution
There is still value and utility in OOD data!
... we just aren't using it correctly
Goal: to develop principled algorithms that change the way we use low-quality or OOD data
Open-X
In-Distribution Data
simulation
Out-of-Distribution Data
\(\sigma=0\)
\(\sigma=1\)
"Clean" Data
For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)
\(\sigma=0\)
\(\sigma=1\)
"Corrupt" Data
"Clean" Data
For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)
\(\sigma=0\)
\(\sigma=1\)
"Clean" Data
\(p^{train} = \alpha\, p + (1-\alpha)\, q\)
For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)
\(\alpha\)
\(1-\alpha\)
\(p^{train}\) contains \(q\) \(\implies\) This is the wrong objective
"Corrupt" Data
\(\sigma=0\)
\(\sigma=1\)
For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)
\(\sigma > \sigma_{min}\)
"Clean" Data
\(\sigma_{min}\)
\(p_\sigma\) \(\not\approx\) \(q_\sigma\)
\(p_\sigma\) \(\approx\) \(q_\sigma\)
\(p_0\)
\(q_0\)
\(p_\sigma\)
\(q_\sigma\)
\(D(p_\sigma, q_\sigma) \to 0\) as \(\sigma\to \infty\)
\(\implies \exists \sigma_{min} \ \mathrm{s.t.}\ D(p_\sigma, q_\sigma) < \epsilon\ \forall \sigma > \sigma_{min}\)
Noisy Channel
\(Y = X + \sigma Z\)
\(D(p_0, q_0)\)
\(D(p_\sigma, q_\sigma)\)
\(\sigma_{min}= \inf\{\sigma\in[0,1]: c_\theta (x_\sigma, \sigma) > 0.5-\epsilon\}\)
* assuming \(c_\theta\) the best possible classifier
Original claim*: this bounds \(TV(p_{\sigma_{min}}, q_{\sigma_{min}})\)
\(\sigma_{min}\)
* assuming \(c_\theta\) the best possible classifier
Original claim*: this bounds \(TV(p_{\sigma_{min}}, q_{\sigma_{min}})\)
Corrected claim* [Kerem]: this bounds \(\Delta (p_{\sigma_{min}}, q_{\sigma_{min}})\)
\(\geq TV (p_{\sigma_{min}}, q_{\sigma_{min}})^2\)
\(\sigma_{min}\)
Key Takeaway: poor classifier performance at \(\sigma_{min}\) \(\implies p_{\sigma} \approx q_{\sigma} \quad \forall \sigma > \sigma_{min}\)
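The classifier rule above can be sketched with a toy Bayes-optimal classifier (the two Gaussians, the gap \(\delta\), and \(\epsilon\) are all hypothetical): sweep \(\sigma \in [0,1]\) and return the first noise level where even the best classifier is within \(\epsilon\) of chance.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def bayes_accuracy(sigma, delta=0.3, s0=1.0):
    # Best achievable accuracy separating N(0, s0^2) from N(delta, s0^2)
    # after the channel y = x + sigma*z, with equal priors: Phi(delta / 2s).
    s = math.sqrt(s0 ** 2 + sigma ** 2)
    return norm_cdf(delta / (2 * s))

def sigma_min(eps=0.05, grid=200):
    # Smallest sigma in [0, 1] where the Bayes classifier is within eps
    # of chance (0.5), mirroring the slide's inf definition.
    for i in range(grid + 1):
        sigma = i / grid
        if bayes_accuracy(sigma) < 0.5 + eps:
            return sigma
    return None  # still distinguishable everywhere on [0, 1]

print(sigma_min())
```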
Increasing granularity
Assign \(\sigma_{min}\) per datapoint
Assign \(\sigma_{min}\) per dataset
Run the classifier per dataset
Run the classifier per datapoint
Can do in-between... ex. last long talk
\(p_{\sigma_{min}} \approx q_{\sigma_{min}}\)
For \(\sigma > \sigma_{min}\):
but \(p_{\sigma_{min}} \neq q_{\sigma_{min}}\)
\(\mathrm{MSE}(h_\theta) = \mathrm{bias}(h_\theta)^2 + \mathrm{var}(h_\theta)\)
... so training on \(q_\sigma\) still introduces bias!
[Informal] Theorem: For all \(\mathcal{D}_p\) and \(\mathcal{D}_q\), there exists \(\sigma_{min}\) sufficiently high s.t. training on \(\mathcal{D}_q\) for \(\sigma > \sigma_{min}\) improves distribution learning for \(p\)
* happens in co-training as well...
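A back-of-the-envelope version of this tradeoff (sample sizes and shift \(\delta\) are hypothetical): pooling \(m\) off-distribution samples whose mean is shifted by \(\delta\) with \(n\) clean samples cuts variance to \(1/(n+m)\) but adds bias \(m\delta/(n+m)\). When \(\delta\) is small, i.e. at noise levels where \(q_\sigma \approx p_\sigma\), the pooled estimator wins.

```python
def mse_pooled(n, m, delta, var=1.0):
    # MSE of the pooled sample mean: n draws from p plus m draws from a
    # distribution whose mean is shifted by delta (MSE = bias^2 + variance).
    bias = m * delta / (n + m)
    variance = var / (n + m)
    return bias ** 2 + variance

mse_clean_only = 1.0 / 50                     # n = 50 samples from p alone
mse_small_shift = mse_pooled(50, 200, 0.05)   # q_sigma close to p_sigma: helps
mse_large_shift = mse_pooled(50, 200, 1.0)    # q_sigma far from p_sigma: hurts
print(mse_clean_only, mse_small_shift, mse_large_shift)
```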
*
\(\sigma=0\)
\(\sigma=1\)
For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)
"Corrupt" Data
"Clean" Data
\(\sigma_{min}\)
\(\sigma > \sigma_{min}\)
\(p_\sigma\) \(\approx\) \(q_\sigma\)
\(\sigma=0\)
\(\sigma=1\)
For all \(\sigma \in [0,1]\): train \(h_\theta(A_\sigma, O, \sigma) \approx \mathbb{E}[A_0 \mid A_\sigma, O]\)
"Corrupt" Data
"Clean" Data
\(\sigma_{min}\)
\(\sigma > \sigma_{min}\)
\(\sigma \leq \sigma_{max}\)
"Locality"
More on this later...
\(p_\sigma\) \(\approx\) \(q_\sigma\)
\(\sigma_{max}\)
\(\mathbb E[\lVert h_\theta(x_t, t) + \frac{\sigma_{min}^2\sqrt{1-\sigma_{t}^2}}{\sigma_t^2-\sigma_{min}^2}x_{t} - \frac{\sigma_{t}^2\sqrt{1-\sigma_{min}^2}}{\sigma_t^2-\sigma_{min}^2} x_{t_{min}} \rVert_2^2]\)
Ambient Loss
Denoising Loss
\(x_0\)-prediction
\(\epsilon\)-prediction
(assumes access to \(x_0\))
(assumes access to \(x_{\sigma_{min}}\))
\(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\)
\(\mathbb E[\lVert h_\theta(x_t, t) - \epsilon \rVert_2^2]\)
\(\mathbb E[\lVert h_\theta(x_t, t) - \frac{\sigma_t^2 (1-\sigma_{min}^2)}{(\sigma_t^2 - \sigma_{min}^2)\sqrt{1-\sigma_t^2}}x_t + \frac{\sigma_t \sqrt{1-\sigma_t^2}\sqrt{1-\sigma_{min}^2}}{\sigma_t^2 - \sigma_{min}^2}x_{t_{min}}\rVert_2^2]\)
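A quick sanity check on the \(x_0\)- vs \(\epsilon\)-prediction pairing above (the forward process here is my assumption, inferred from the \(\sqrt{1-\sigma^2}\) coefficients on the slide): an \(x_0\)-prediction can be converted into the equivalent \(\epsilon\)-prediction by rearranging the forward process, so a perfect \(x_0\) denoiser recovers the noise exactly.

```python
import math
import random

def add_noise(x0, sigma, eps):
    # VP-style forward process implied by the sqrt(1 - sigma^2) factors:
    # x_sigma = sqrt(1 - sigma^2) * x0 + sigma * eps.
    return math.sqrt(1 - sigma ** 2) * x0 + sigma * eps

def eps_from_x0_pred(x_sigma, x0_hat, sigma):
    # Rearranged forward process: eps-prediction implied by an x0-prediction.
    return (x_sigma - math.sqrt(1 - sigma ** 2) * x0_hat) / sigma

x0, sigma = 1.7, 0.6
eps = random.gauss(0.0, 1.0)
x_sigma = add_noise(x0, sigma, eps)
eps_hat = eps_from_x0_pred(x_sigma, x0, sigma)  # perfect x0 predictor
print(abs(eps_hat - eps))
```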
Part 2
Why Ambient and Robotics
1. Data is scarce
+ 5 years
=
~100,000 solved protein structures
200M+ protein structures
Criteria
Robotics?
1. Data is scarce
Criteria
Robotics?
✅
(short-term)
2. Data quality and sources are heterogeneous
✅
Open-X
simulation
Out-of-Distribution Data
1. Data is scarce
Criteria
Robotics?
✅
(short-term)
2. Data quality and sources are heterogeneous
✅
3. Data exhibits certain structure
✅
By Sander Dieleman
"Radially Averaged" PSD of the green channel
"Radially Averaged" PSD of the green channel
High SNR
Low SNR
\(\sigma=0\)
\(\sigma=1\)
at high noise, generate low frequency features
\(\sigma=0\)
\(\sigma=1\)
at high noise, generate low frequency features
\(\sigma=0\)
\(\sigma=1\)
at high noise, generate low frequency features
\(\sigma=0\)
\(\sigma=1\)
at low noise, generate high frequency features
\(\sigma=0\)
\(\sigma=1\)
at low noise, generate high frequency features
Ambient uses noise as a high-freq. mask
\(\implies\) ambient is best for high-freq corruptions
Punch line: Robot data exhibits spectral decay,
and many robot corruptions are "motion-level" (high freq)
If action corruption is motion-level, then Ambient will work well
Low Freq: high-level (task) planning and decision making
High Freq: low-level motion primitives (ex. how to grasp, smoothness)
ex. cross-embodiment, sim2real, noisy teleop, etc.
\(\sigma=0\)
\(\sigma=1\)
"Corrupt" Data
"Clean" Data
\(\sigma_{min}\)
\(\sigma > \sigma_{min}\)
Concern: task and motion planning means something very specific...
Possible experiment:
Show that a classifier can predict the task variables at high noise
\(a_{\sigma(t_2)}\) sufficient to predict task variables?
What if the action corruption is task-level?
Red \(\rightarrow\) Left
Blue \(\rightarrow\) Right
Red \(\rightarrow\) Right
Blue \(\rightarrow\) Left
(outdated video...)
Sensitivity of \(\hat a_0^{(8)}\) to \(a_\sigma^{(i)}\) at different noise levels
\(\lVert \frac{\partial h_\theta^{(8)}(a_\sigma)}{\partial a_\sigma^{(i)}}\rVert\) vs action index \(i\)
At low-noise the denoiser does not attend to distant actions (i.e. ignores low-frequencies)
\(\implies\) can use data with task-level distribution shift at low-noise
\(\sigma=0\)
\(\sigma=1\)
"Corrupt" Data
"Clean" Data
\(\sigma_{min}\)
\(\sigma > \sigma_{min}\)
\(\sigma \leq \sigma_{max}\)
"Locality"
\(p_\sigma\) \(\approx\) \(q_\sigma\)
\(\sigma_{max}\)
1. Data is scarce
Criteria
Robotics?
✅
(short-term)
2. Data quality and sources are heterogeneous
✅
3. Data exhibits certain structure
✅
"Coarse" to "Fine" is a property of diffusion
"Coarse" to "Fine" is a property of the data
✅
❌
"Coarse" to "Fine" is a property of diffusion
"Coarse" to "Fine" is a property of the data
Spectral Decay is required for diffusion
Diffusion (and Ambient) work regardless of PSD
✅
❌
❌
✅
Part 3
Motion Planning Experiments
Question break!
Distribution shift: Low-quality, noisy trajectories
High Quality:
100 GCS trajectories
Low Quality:
5000 RRT trajectories
Distribution shift: Low-quality, noisy trajectories
\(\sigma=0\)
5000 RRT Trajectories
\(\sigma_{min}\)
\(\sigma=1\)
100 GCS Trajectories
Task level:
learn the maze structure
Motion level:
learn smooth motions
Success Rate | Avg. Acc\(^2\) (motion-level)
GCS: 57.5% | 141.65
RRT: 99.0% | 74.8
GCS+RRT (Co-train): 99.4% | 62.2
GCS+RRT (Ambient): 99.5% | 30.9
Swept for best \(\sigma_{min}\) per dataset
Policies evaluated over 1000 trials each
Co-trained
Ambient
Clean data:
Corrupt data:
Success Rate | Avg. Acc\(^2\) (motion-level)
Trajopt: 46.0% | 3.9
RRT: 52.0% | 54.9
Trajopt+RRT (Co-train): 59.9% | 42.7
Trajopt+RRT (Ambient): 65.9% | 31.4
Swept for best \(\sigma_{min}\) per dataset
Policies evaluated over 1000 trials each
Part 3.5
Sim-and-Real Cotraining
It works!
Best co-trained policy (IROS): 84.5%
Best Ambient policy (CoRL?): 93.5%
Distribution shift: sim2real gap
In-Distribution:
50 demos in "target" environment
Out-of-Distribution:
2000 demos in sim environment
Part 4
Bin Sorting (Locality)
\(\sigma=0\)
\(\sigma=1\)
"Corrupt" Data
"Clean" Data
\(\sigma_{min}\)
\(\sigma > \sigma_{min}\)
\(\sigma \leq \sigma_{max}\)
"Locality"
\(p_\sigma\) \(\approx\) \(q_\sigma\)
\(\sigma_{max}\)
\(\sigma=0\)
\(\sigma=1\)
"Clean" Data
\(\sigma \leq \sigma_{max}\)
"Locality"
\(\sigma_{max}\)
Goal: isolate the effect of locality in robotics
Distribution shift: task level mismatch, motion level correctness
In-Distribution:
50 demos with correct sorting logic
Out-of-Distribution:
200 demos with incorrect sorting
Robot needs to learn two things:
1. Motion Planning
2. Logic
\(\frac{\#\,\text{blocks in any bin}}{\#\,\text{total blocks}}\)
\(\frac{\#\,\text{blocks in correct bin}}{\#\,\text{blocks in any bin}}\)
Goal: learn motion planning from the bad data, but not the task planning
Success rate:
\(\frac{\#\,\text{blocks in correct bin}}{\#\,\text{total blocks}}\) = (motion planning) \(\times\) (logic)
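The factorization can be checked mechanically from raw counts (the block counts below are made up for illustration):

```python
def sorting_metrics(in_correct_bin, in_any_bin, total_blocks):
    # Motion: fraction of blocks placed into some bin at all.
    motion = in_any_bin / total_blocks
    # Logic: among placed blocks, fraction placed in the correct bin.
    logic = in_correct_bin / in_any_bin
    # Success rate factorizes exactly as motion * logic.
    success = in_correct_bin / total_blocks
    return motion, logic, success

motion, logic, success = sorting_metrics(
    in_correct_bin=7, in_any_bin=9, total_blocks=10
)
print(motion, logic, success)
```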
Success Rate | Logic Metric | Motion Metric
Diffusion: 61.0% | 98.6% | 61.9%
Cotrain: 22.7% | 26.0% | 87.2%
Locality: 93.3% | 95.0% | 98.2%
Task Planning
Motion Planning
Success Rate | Logic Metric | Motion Metric
Cotrain (with task condition): 90.3% | 91.5% | 98.6%
Locality: 93.3% | 95.0% | 98.2%
Locality (with task condition): 92.8% | 94.2% | 98.5%
\(\sigma=0\)
\(\sigma=1\)
"Clean" Data
\(\sigma \leq \sigma_{max}\)
"Locality"
\(\sigma_{max}\)
Success Rate | Logic Metric | Motion Metric
Ablation: 91.9% | 93.8% | 97.9%
Locality: 93.3% | 95.0% | 98.2%
Part 5
Real-World Experiments
Goal: move objects from the table into the drawer
\(\mathcal{D}_{clean}\): 50 in-distribution demos
\(\mathcal{D}_{corrupt}\): Open-X Embodiment
Can we learn from unstructured distribution shifts in large real-world datasets?
Open-X
Open-X
Magic Soup++: 27 Datasets
Custom OXE: 48 Datasets
\(\sigma=0\)
\(\sigma=1\)
"Clean" Data
\(\sigma_{min}\)
\(\sigma > \sigma_{min}\)
"Ambient"
"Ambient
+ Locality"
"Clean" Data
\(\sigma_{min}\)
\(\sigma > \sigma_{min}\)
\(\sigma \leq \sigma_{max}\)
\(\sigma_{max}\)
Task completion =
0.1 x [opened drawer]
+ 0.8 x [# obj. cleaned / # obj.]
+ 0.1 x [closed drawer]
1. Ambient benefits from reweighting, but does not need it.
Note: Ambient and reweighting are orthogonal (can be applied simultaneously)
2. Co-training must reweight.
1. Finetuning the Ambient base model is always better
2. When good data is limited, ambient outperforms co-train + finetune
Part 6
Concluding Thoughts
Ambient can be used to learn from any distribution shift in robotics
In-Distribution Data
Open-X
simulation
Out-of-Distribution Data
Q: What is "in-distribution"?
A [in this paper]: expert teleoperator on your robot, your task, your environment
A [more generally]: data quality?
How you define "in-distribution" changes if Ambient is used as pre-training or finetuning recipe