Sept 24, 2025
Adam Wei
Part 1
Interpretations of Ambient Diffusion
CC12M: 12M+ image-text pairs
"Corrupt" Data:
Low quality images
"Clean" Data:
High quality images
Not just in CV: also in language, audio, robotics!
robot teleop
simulation
Open-X
There is still value and utility in this data!
... we just aren't using it correctly
Goal: to develop principled algorithms that change the way we use low-quality or OOD data
2025
"Ambient diffusion is best in domains where good data is hard to get, but bad data is easy"
"low data"
robot teleop
simulation
Open-X
2025
My bets:
"low data"
1. Robot data will be plentiful
2. High quality data is hard to get
\(\implies\) this problem will remain relevant
202[?]
"big data"
Repeat:
\(\sigma=0\)
\(\sigma>\sigma_{min}\)
\(\sigma_{min}\)
*\(\sigma_{min} = 0\) for all clean samples
\(\sigma=1\)
Denoising Loss (assumes access to \(x_0\)):
\(x_0\)-prediction: \(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\)
\(\epsilon\)-prediction: \(\mathbb E[\lVert h_\theta(x_t, t) - \epsilon \rVert_2^2]\)
Ambient Loss (assumes access to \(x_{t_{min}}\)):
\(x_0\)-prediction: \(\mathbb E[\lVert h_\theta(x_t, t) + \frac{\sigma_{min}^2\sqrt{1-\sigma_{t}^2}}{\sigma_t^2-\sigma_{min}^2}x_{t} - \frac{\sigma_{t}^2\sqrt{1-\sigma_{min}^2}}{\sigma_t^2-\sigma_{min}^2} x_{t_{min}} \rVert_2^2]\)
\(\epsilon\)-prediction: \(\mathbb E[\lVert h_\theta(x_t, t) - \frac{\sigma_t (1-\sigma_{min}^2)}{\sigma_t^2 - \sigma_{min}^2}x_t + \frac{\sigma_t \sqrt{1-\sigma_t^2}\sqrt{1-\sigma_{min}^2}}{\sigma_t^2 - \sigma_{min}^2}x_{t_{min}}\rVert_2^2]\)
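As a concreteness check, here is a minimal Python sketch of the \(x_0\)-prediction ambient loss above, under the VP-style noising implied by the \(\sqrt{1-\sigma^2}\) factors. Function and variable names are my own, and the noise-level sampling is an assumption, not the talk's exact implementation.

import numpy as np

def noise_to_level(x_min, sigma_min, sigma, rng):
    # Treat a corrupt sample as if it were already noised at sigma_min,
    # then noise it further to level sigma > sigma_min (VP-style scaling).
    scale = np.sqrt((1.0 - sigma**2) / (1.0 - sigma_min**2))
    added_std = np.sqrt((sigma**2 - sigma_min**2) / (1.0 - sigma_min**2))
    return scale * x_min + added_std * rng.standard_normal(x_min.shape)

def ambient_x0_loss(h, x_t, x_min, sigma, sigma_min):
    # || h + (sigma_min^2 sqrt(1-sigma^2))/(sigma^2-sigma_min^2) * x_t
    #      - (sigma^2 sqrt(1-sigma_min^2))/(sigma^2-sigma_min^2) * x_min ||^2
    # Reduces to the usual denoising loss || h - x_0 ||^2 when sigma_min = 0.
    a = sigma_min**2 * np.sqrt(1.0 - sigma**2) / (sigma**2 - sigma_min**2)
    b = sigma**2 * np.sqrt(1.0 - sigma_min**2) / (sigma**2 - sigma_min**2)
    return np.sum((h + a * x_t - b * x_min) ** 2)

# Usage: sample sigma in (sigma_min, 1], set x_t = noise_to_level(x_min, sigma_min, sigma, rng),
# run the network, then penalize ambient_x0_loss(h_theta(x_t, sigma), x_t, x_min, sigma, sigma_min).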
"Good" Data
"Bad" Data
1. Gaussian noise = corruption
2. Gaussian noise = contraction
\(p_0\)
\(q_0\)
\(p_\sigma\)
\(q_\sigma\)
\(D(p_\sigma, q_\sigma) \to 0\) as \(\sigma\to \infty\)
\(\implies \exists \sigma_{min} \ \mathrm{s.t.}\ D(p_\sigma, q_\sigma) < \epsilon\ \forall \sigma > \sigma_{min}\)
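A toy 1-D illustration of this contraction (my own example, not from the talk): the total variation distance between two different mixtures shrinks toward 0 as both are convolved with \(\mathcal N(0, \sigma^2)\).

import numpy as np
from scipy.stats import norm

def noised_density(centers, sigma, grid):
    # p_sigma = p_0 * N(0, sigma^2), with p_0 a uniform mixture of point masses.
    return np.mean([norm.pdf(grid, loc=c, scale=max(sigma, 1e-3)) for c in centers], axis=0)

grid = np.linspace(-12, 12, 4001)
dx = grid[1] - grid[0]
p_centers = [-2.0, 2.0]        # "clean" p_0
q_centers = [-3.0, 1.0, 4.0]   # "corrupt" q_0
for sigma in [0.1, 0.5, 1.0, 2.0, 4.0]:
    p = noised_density(p_centers, sigma, grid)
    q = noised_density(q_centers, sigma, grid)
    tv = 0.5 * np.sum(np.abs(p - q)) * dx
    print(f"sigma={sigma:.1f}  d_TV(p_sigma, q_sigma) ~ {tv:.3f}")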
Noisy Channel
\(Y = X + \sigma Z\)
\(p_0\)
\(q_0\)
\(p_\sigma=p_0 * \mathcal N(0, \sigma^2\mathrm{I})\)
\(q_\sigma=q_0 * \mathcal N(0, \sigma^2\mathrm{I})\)
1. "Noise as corruption": for \(\sigma > \sigma_{min}\), corruption appears Gaussian
Goal: Learn \(\nabla \log p_\sigma(x)\)
\(p_0\)
\(q_0\)
\(p_\sigma=p_0 * \mathcal N(0, \sigma^2\mathrm{I})\)
\(q_\sigma=q_0 * \mathcal N(0, \sigma^2\mathrm{I})\)
Goal: Learn \(\nabla \log p_\sigma(x)\)
2. "Noise as contraction": for \(\sigma > \sigma_{min}\), \(\red{\nabla \log p_\sigma (x)} \approx \blue{\nabla \log q_\sigma (x)}\)
Any process that contracts distributions to the same final distribution is viable!
Different ways to mix the data... more on this later ;)
Noise as contraction: for \(\sigma > \sigma_{min}\), \(\red{\nabla \log p_\sigma (x)} \approx \blue{\nabla \log q_\sigma (x)}\)
Goal: Learn \(\nabla \log p_\sigma(x)\)
Part 2
Motion Planning Experiments
Repeat:
\(\sigma=0\)
\(\sigma>\sigma_{min}\)
\(\sigma_{min}\)
*\(\sigma_{min} = 0\) for all clean samples
\(\sigma=1\)
Finding \(\sigma_{min}\)
Increasing granularity
Assign \(\sigma_{min}\) per datapoint
Assign \(\sigma_{min}\) per dataset
\(\sigma_{min}^i = \inf\{\sigma\in[0,1]: c_\theta (x_\sigma, \sigma) > 0.5-\epsilon\}\)
\(\implies \sigma_{min}^i = \inf\{\sigma\in[0,1]: d_\mathrm{TV}(p_\sigma, q_\sigma) = 2\epsilon\}\)*
* assuming \(c_\theta\) is the best possible classifier
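A minimal sketch of this assignment rule, assuming c_theta is a noise-conditional classifier returning the probability that a noised sample came from the clean distribution \(p_\sigma\) (the grid search, the averaging over noise draws, and all names are assumptions):

import numpy as np

def estimate_sigma_min(x, c_theta, eps=0.05, n_grid=50, n_draws=8, seed=0):
    # sigma_min^i = inf{ sigma in [0, 1] : c_theta(x_sigma, sigma) > 0.5 - eps }
    rng = np.random.default_rng(seed)
    for sigma in np.linspace(0.0, 1.0, n_grid):
        probs = []
        for _ in range(n_draws):
            noise = rng.standard_normal(np.shape(x))
            x_sigma = np.sqrt(1.0 - sigma**2) * x + sigma * noise  # VP-style noising
            probs.append(c_theta(x_sigma, sigma))
        if np.mean(probs) > 0.5 - eps:  # indistinguishable from clean at this noise level
            return sigma
    return 1.0  # never looks clean: only usable at sigma = 1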
Distribution shift: Low-quality, noisy trajectories
High Quality:
100 GCS trajectories
Low Quality:
5000 RRT trajectories
Note: This experiment is easily generalizable to 7-DoF robot motion planning
Distribution shift: Low-quality, noisy trajectories
\(\sigma=0\)
5000 RRT Trajectories
\(\sigma_{min}\)
\(\sigma=1\)
100 GCS Trajectories
Task level:
learn the maze structure
Motion level:
learn smooth motions
Results (policies evaluated over 100 trials each; swept for best \(\sigma_{min}\) per dataset):
Success Rate (Task-level) / Avg. Jerk^2 (Motion-level)
GCS: 50% / 7.5k
RRT: 100% / 17k
GCS+RRT (Co-train): 91% / 14.5k
GCS+RRT (Ambient): 98% / 5.5k
Co-trained
Ambient
Part 3
Sim-and-Real Cotraining
Distribution shift: sim2real gap
In-Distribution:
50 demos in "target" environment
Out-of-Distribution:
2000 demos in sim environment
Experiment 1: swept \(\sigma_{min}\) per dataset
Experiment 2: estimated \(\sigma_{min}\) per datapoint with a classifier
Caveat! Classifier was trained to predict sim vs real actions
Experiment 3: exploring ambient diffusion at different data scales
Moving forwards, all experiments will use the classifier
!!
Part 4
Scaling Low-Quality Data
Problem 1
Classifier biases the data distribution at different noise levels
Let's examine the problems one at a time...
Ambient \(\to\) sim-only as \(|\mathcal{D}_S| \to \infty\)
... and sim-only is bad!
Classifier biases the data distribution at different noise levels
Problem 1
Problem 2
Some "types" of data are more likely to be assigned low \(\sigma_{min}\)
\(q_0\) =
w.p. \(\frac{1}{2}\)
w.p. \(\frac{1}{2}\)
\(p_0\) =
w.p. \(\frac{1}{2}\)
w.p. \(\frac{1}{2}\)
\(q_0\) =
w.p. \(\frac{1}{2}\)
w.p. \(\frac{1}{2}\)
not in \(p_0\)
\(\implies\) \(\sigma_{min}\approx 1\)
(don't use this data)
in \(p_0\)
\(\implies\) \(\sigma_{min}\approx 0\)
(use this data!)
Mode 1
Mode 2
Mode 3
Mode 4
Mode 1
Mode 2
Mode 3
Mode 4
"Can't do statistics with n=1" - Costis
Increasing granularity
Assign \(\sigma_{min}\) per datapoint
Assign \(\sigma_{min}\) per dataset
Assign \(\sigma_{min}\) per bucket
Split dataset into N buckets
\(N=1\)
\(N=|\mathcal{D}|\)
How to bucket?
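One possible answer, offered only as a sketch (the quantile bucketing and the conservative per-bucket statistic are my assumptions, not the deck's method): group datapoints by their noisy per-point estimates and share a robust \(\sigma_{min}\) within each bucket.

import numpy as np

def bucketed_sigma_min(per_point_sigma_min, n_buckets=10, quantile=0.9):
    # Sort points by their per-point estimates, split into n_buckets equal-size
    # buckets, and assign each bucket a single conservative sigma_min.
    order = np.argsort(per_point_sigma_min)
    out = np.zeros_like(per_point_sigma_min, dtype=float)
    for idx in np.array_split(order, n_buckets):
        out[idx] = np.quantile(per_point_sigma_min[idx], quantile)
    return out

# n_buckets = 1 gives one shared value; n_buckets = |D| gives back per-datapoint estimates.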
Increasing granularity
Assign \(\sigma_{min}\) per datapoint
Assign \(\sigma_{min}\) per dataset
Assign \(\sigma_{min}\) per bucket
\(N=1\)
\(N=|\mathcal{D}|\)
Solution: bucketing
Ambient \(\to\) sim-only as \(|\mathcal{D}_S| \to \infty\)
... and sim-only is bad!
Classifier biases the data distribution at different noise levels
Problem 1
Problem 2
\(|\mathcal{D}_S|=500\)
\(|\mathcal{D}_S|=4000\)
Sampling procedure:
"Doesn't make sense to use a hard threshold" - Pablo
(I am paraphrasing...)
Hard threshold!
\(\sigma=1\)
\(\sigma=0\)
\(\sigma=\sigma_{min}\)
Corrupt Data (\(\sigma>\sigma_{min}\))
Clean Data (\(\sigma_{min}=0\))
"Doesn't make sense to use a hard threshold" - Pablo
(I am paraphrasing...)
\(\sigma=1\)
\(\sigma=0\)
Clean Data
Corrupt Data
Soft Ambient: ratio of clean to corrupt is a function of \(\alpha_\sigma\)
Ambient: ratio of clean to corrupt is a function of dataset size
Optimal mixing ratio is a function of:
Ex. In binary classification with fixed \(D(p,q)\)...
Ben-David et al., A theory of learning from different domains, 2009
\(\sigma\)
\(D(p_\sigma, q_\sigma)\)
ratio for "bad" data
This idea fits naturally into the noise-as-contraction interpretation.
Have some ideas to find \(\alpha^*\) based on theoretical bounds...
... first, try a linear function just to get some signal.
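A minimal Python sketch of this soft sampling procedure (names assumed; the linear \(\alpha_\sigma\) below is exactly that first-cut linear function, not a tuned schedule): at noise level \(\sigma\), draw from the corrupt pool with probability \(\alpha_\sigma\) and from the clean pool otherwise, instead of switching hard at \(\sigma_{min}\).

import numpy as np

def alpha_linear(sigma):
    # More corrupt data at high noise, less at low noise; alpha_sigma in [0, 1].
    return sigma

def soft_ambient_batch(clean, corrupt, batch_size, rng):
    batch = []
    for sigma in rng.uniform(0.0, 1.0, size=batch_size):
        pool = corrupt if rng.uniform() < alpha_linear(sigma) else clean
        batch.append((pool[rng.integers(len(pool))], sigma))
    return batch

# Usage: rng = np.random.default_rng(0); batch = soft_ambient_batch(clean_data, corrupt_data, 256, rng)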
\(\sigma=1\)
\(\sigma=0\)
Clean Data
Corrupt Data
Solution: bucketing
Ambient \(\to\) sim-only as \(|\mathcal{D}_S| \to \infty\)
... and sim-only is bad!
Classifier biases the data distribution at different noise levels
Problem 1
Problem 2
Solution: soft ambient
Part 5
Ambient Omni
Ambient: "use low-quality data at high noise levels"
Ambient Omni: "use low-quality data at low and high noise levels"
Which photo is the cat?
Which photo is the cat?
Which photo is the cat?
Which photo is the cat?
receptive field = \(f(\sigma)\)
Intuition
receptive field = \(f(\sigma)\)
Repeat:
\(\sigma=0\)
\(\sigma>\sigma_{min}\)
\(\sigma_{min}\)
*\(\sigma_{min} = 0\) for all clean samples
\(\sigma=1\)
\(\sigma_{max}\)
\(\sigma>\sigma_{min}\)
\(\sigma_{max}\)
\(\sigma=0\)
\(\sigma>\sigma_{min}\)
\(\sigma_{min}\)
\(\sigma=1\)
\(\sigma_{max}\)
\(\sigma>\sigma_{min}\)
\(\sigma_{max}\)
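My reading of how \(\sigma_{min}\) and \(\sigma_{max}\) gate the data, as a hedged sketch (the Sample fields and the eligibility rule are assumptions): clean data is always usable, corrupt data is usable above its \(\sigma_{min}\) (noise as contraction), and locally-correct OOD data is additionally usable below its \(\sigma_{max}\), where the receptive field is small and only motion-level structure matters.

from dataclasses import dataclass

@dataclass
class Sample:
    x: object                # trajectory / image / action chunk
    is_clean: bool
    sigma_min: float = 1.0   # usable above this noise level (1.0 = never)
    sigma_max: float = 0.0   # usable below this noise level (0.0 = never)

def eligible(sample: Sample, sigma: float) -> bool:
    if sample.is_clean:
        return True                      # in-distribution data: all noise levels
    if sigma >= sample.sigma_min:
        return True                      # Ambient: high noise levels
    return sigma <= sample.sigma_max     # Ambient Omni: also low noise levels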
Task level
Motion level
Distribution shift: task level mismatch, motion level correctness
In-Distribution:
50 demos with correct sorting logic
Out-of-Distribution:
200 demos with arbitrary sorting
2x
2x
Distribution shift: task level mismatch, motion level correctness
Contrived experiment... but it effectively illustrates the effect of \(\sigma_{max}\)
In-Distribution
Out-of-Distribution
2x
2x
Contrived experiment... but it effectively illustrates the effect of \(\sigma_{max}\)
Repeat:
\(\sigma=0\)
\(\sigma=1\)
\(\sigma_{max}\)
\(\sigma>\sigma_{min}\)
\(\sigma_{max}\)
Results (Correct logic (Task level) / Completion (Motion level) / Score (Task + motion)):
Diffusion: 70.4% / 48% / High...
Cotrain: 55.2% / 88% / Low...
Ambient-Omni (\(\sigma_{max}=0.24\)): 94.0% / 88% / 97.5%
Results (Correct logic (Task level) / Completion (Motion level) / Score (Task + motion)):
Diffusion: 70.4% / 48% / High...
Cotrain: 55.2% / 88% / Low...
Cotrain (task conditioned): 88.8% / 86% / 92.5%
Ambient-Omni (\(\sigma_{max}=0.24\)): 94.0% / 88% / 97.5%
Motion Planning*
Task Planning*
* not a binary distinction!!
Motion Planning
Task Planning
* not a binary distinction!!
Motion Planning
Task Planning
* not a binary distinction!!
Distribution shift: task level mismatch, motion level correctness
In-Distribution
Out-of-Distribution
2x
2x
Open-X
Part 6
Concluding Thoughts
North Star Goal: Train with internet-scale data (Open-X, AgiBot, etc.)
So far: "Stepping stone" experiments
Please suggest task ideas! Think big!
robot teleop
simulation
Open-X
Variant of Ambient Omni
Cool Task!!
(TBD)
Cool Demo!!
Thank you!! (please suggest tasks...)