RLG Long Talk
May 2, 2025
Adam Wei
... works in progress
Part 1
Motivation
[Diagram: spectrum of robot data sources — Open-X, expert robot teleop, robot teleop, simulation — labeled from "clean" to "corrupt"]
Goal: sample "high-quality" trajectories for your task
Train on the entire spectrum to learn "high-level" reasoning, semantics, etc.
Corrupt vs. clean data in other domains:
Computer Vision: [image examples of corrupt vs. clean data]
Language: poor writing, profanity, toxicity, grammar/spelling errors, etc.
👀 Giannis Daras
Part 2a
Ambient Diffusion
Epsilon prediction: \(\epsilon_\theta(x_t, t) \approx \mathbb E[Z \mid X_t = x_t]\)
Sample prediction: \(h_\theta(x_t, t) \approx \mathbb E[X_0 \mid X_t = x_t]\)
(the two are related by an affine transform: \(h_\theta(x_t,t) = x_t - \sigma_t \epsilon_\theta(x_t,t)\) under variance exploding)
Variance preserving: \(X_t = \sqrt{\bar\alpha_t} X_0 + \sqrt{1-\bar\alpha_t} Z\)
Variance exploding: \(X_t = X_0 + \sigma_t Z\)
Variance exploding and sample prediction simplify the analysis
... the analysis can be extended to DDPM
Assumptions
1. Each corrupt data point contains additive gaussian noise up to some noise level \(t_n\)
2. \(t_n\) is known for each dataset or datapoint
[Diagram: noise-level timeline \(t=0 \to t_n \to T\)]
Corrupt data: \(X_{t_n} = X_0 + \sigma_{t_n} Z\)
[Diagram: the forward process adds noise from \(t=0\) to \(t=T\); the backward process denoises from \(t=T\) to \(t=0\); corrupt data enters the forward process at its native noise level \(t=t_n\)]
How can we learn \(h_\theta^*(x_t, t) = \mathbb E[X_0 \mid X_t = x_t]\) using noisy samples \(X_{t_n}\)?
Diffusion loss: \(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\) ... requires \(x_0\), which we do not have access to
Ambient loss: \(\mathbb E[\lVert \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t - x_{t_n} \rVert_2^2]\) ... does not require access to \(x_0\)
We can learn \(\mathbb E[X_0 \mid X_t=x_t]\) using only corrupted data!
Remarks
\(X_{t} = X_0 + \sigma_{t}Z\)
\(X_{t} = X_{t_n} + \sqrt{\sigma_t^2 - \sigma_{t_n}^2}Z\)
\(\nabla \log p_t(x_t) = \frac{\mathbb E[X_0 \mid X_t=x_t]-x_t}{\sigma_t^2}\) (Tweedie's formula)
\(\nabla \log p_t(x_t) = \frac{\mathbb E[X_{t_n} \mid X_t=x_t]-x_t}{\sigma_t^2-\sigma_{t_n}^2}\) (Tweedie's formula)
Ambient loss: \(\mathbb E[\lVert g_\theta(x_t,t) - x_{t_n} \rVert_2^2]\), where \(g_\theta(x_t,t) := \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t\)
Its minimizer satisfies \(g_\theta^*(x_t,t) = \mathbb E[X_{t_n} \mid X_t = x_t]\)
\(\implies \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta^*(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t = \mathbb E [X_{t_n} \mid X_t = x_t]\) (definition of \(g_\theta\))
\(\implies h_\theta^*(x_t,t) = \mathbb E[X_0 \mid X_t = x_t]\) (from the double Tweedie identity)
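Filling in the "double Tweedie" step explicitly (standard algebra, not spelled out on the slide): equating the two score expressions from the Remarks eliminates \(\nabla \log p_t(x_t)\),
\[
\mathbb E[X_{t_n} \mid X_t = x_t]
= x_t + (\sigma_t^2 - \sigma_{t_n}^2)\,\nabla \log p_t(x_t)
= \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2}\,\mathbb E[X_0 \mid X_t = x_t] + \frac{\sigma_{t_n}^2}{\sigma_t^2}\,x_t,
\]
and matching this against the minimizer identity above forces \(h_\theta^*(x_t,t) = \mathbb E[X_0 \mid X_t = x_t]\).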
For each data point:
Clean data: sample \(t\in (0, T]\)
Forward process: \(x_t = x_0 + \sigma_t z\)
Backprop: \(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\)
Corrupt data: sample \(t\in (t_n, T]\)
Forward process: \(x_t = x_{t_n} + \sqrt{\sigma_t^2-\sigma_{t_n}^2} z_2\)
Backprop: \(\mathbb E[\lVert \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t - x_{t_n} \rVert_2^2]\)
[Diagram: timeline \(t=0 \to t_n \to T\); clean data trains \(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\), corrupt data trains the ambient loss \(\mathbb E[\lVert \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t - x_{t_n} \rVert_2^2]\)]
Use clean data to learn denoisers for \(t\in(0,T]\)
Use corrupt data to learn denoisers for \(t\in(t_n,T]\)
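A minimal PyTorch sketch of this per-datapoint update, assuming the variance-exploding, sample-prediction setup above (the function and argument names are illustrative, not from the talk):

```python
import torch

def ambient_loss(h_theta, x, sigma_t, sigma_tn):
    """x is x_0 for clean data (sigma_tn = 0) or x_{t_n} for corrupt data.
    For corrupt data, sigma_t must correspond to some t in (t_n, T]."""
    # Forward process: noise the sample from its native level t_n up to t.
    z = torch.randn_like(x)
    x_t = x + (sigma_t**2 - sigma_tn**2) ** 0.5 * z

    # Ambient prediction g_theta(x_t, t) is trained to match x_{t_n};
    # when sigma_tn = 0 this reduces to the standard diffusion loss on x_0.
    w = (sigma_t**2 - sigma_tn**2) / sigma_t**2
    g = w * h_theta(x_t, sigma_t) + (1 - w) * x_t
    return ((g - x) ** 2).mean()
```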
Part 2b
Ambient Diffusion w/ Contraction
Protein Folding Case Study
Assumptions
1. Each corrupt data point contains additive gaussian noise up to some noise level \(t_n\)
2. \(t_n\) is known for each dataset or datapoint
[Diagram: noise-level timeline \(t=0 \to t_n \to T\)]
Corrupt data: \(X_{t_n} = X_0 + \sigma_{t_n} Z\)
Clean data: ~100,000 solved protein structures
Corrupt data (+ 5 years): 200M+ predicted protein structures
Breaks both assumptions: \(X_0^{corrupt} \neq X_0^{clean} + \sigma_{t_n} Z\)
[Diagram: clean data (\(X_0^{clean}\)) vs. corrupt data (\(X_0^{corrupt}\)) distributions]
Key idea: corrupt the data further with Gaussian noise: \(X_{t_n}^{corrupt} = X_0^{corrupt} + \sigma_{t_n} Z\)
For sufficiently high \(t_n\): \(X_{t_n}^{corrupt} = X_0^{corrupt} + \sigma_{t_n} Z \approx X_0^{clean} + \sigma_{t_n} Z\)
[Diagram: distributions of clean data (\(X_0^{clean}\)), corrupt data (\(X_0^{corrupt}\)), and corrupt data + noise (\(X_{t_n}^{corrupt}\))]
[Diagram: timeline \(t=0 \to t_n \to T\) with \(x_0^{clean}\) and \(x_0^{corrupt}\) diffusing to overlap at \(t_n\)]
For sufficiently high \(t_n\): \(X_{t_n}^{corrupt} \approx X_0^{clean} + \sigma_{t_n} Z\)
[Diagram: passing \(P_{X_0}^{clean}\) and \(Q_{X_0}^{corrupt}\) through the Gaussian channel \(P_{Y|X}\), \(Y = X + \sigma_{t_n} Z\), yields \(P_{X_{t_n}}^{clean}\) and \(Q_{X_{t_n}}^{corrupt}\)]
\(\implies\) Adding noise makes the corruption appear Gaussian (Assumption 1)
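The contraction here is (presumably) the data-processing inequality for the Gaussian channel: pushing both distributions through \(P_{Y|X}\) can only shrink their divergence,
\[
D\!\left(P_{X_{t_n}}^{clean} \,\middle\|\, Q_{X_{t_n}}^{corrupt}\right) \;\le\; D\!\left(P_{X_0}^{clean} \,\middle\|\, Q_{X_0}^{corrupt}\right),
\]
and for additive Gaussian noise the contraction strengthens as \(\sigma_{t_n}\) grows, so a large enough \(t_n\) makes the two noised distributions nearly indistinguishable.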
[Diagram: timeline \(t=0 \to t_n \to T\) with \(x_0^{clean}\) and \(x_0^{corrupt}\)]
1. Train a "clean" vs "corrupt" classifier for all \(t \in [0, T]\)
2. For each corrupt data point, increase noise until the classification accuracy drops. This is \(t_n\).
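A hedged sketch of this search in PyTorch (the classifier interface and threshold are assumptions for illustration):

```python
import torch

def estimate_tn(x0_corrupt, classifier, sigmas, threshold=0.55):
    """Return the smallest noise level at which the clean-vs-corrupt
    classifier can no longer flag this point as corrupt.
    `classifier(x, sigma)` is assumed to return P(corrupt)."""
    for sigma in sigmas:  # sigmas sorted in increasing order
        x_noised = x0_corrupt + sigma * torch.randn_like(x0_corrupt)
        if classifier(x_noised, sigma) < threshold:  # accuracy ~ chance
            return sigma  # this is sigma_{t_n} for the datapoint
    return sigmas[-1]  # never fooled: assign the maximum noise level
```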
For each data point:
Clean data:
Forward process: \(x_t = x_0 + \sigma_t z\)
Backprop: \(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\)
Corrupt data:
Contract data: \(x_{t_n} = x_0 + \sigma_{t_n} z_1\)
Forward process: \(x_t = x_{t_n} + \sqrt{\sigma_t^2-\sigma_{t_n}^2} z_2\)
Backprop: \(\mathbb E[\lVert \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t - x_{t_n} \rVert_2^2]\)
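The corrupt-data branch then reuses the loss sketch from Part 2a, with one extra contraction line (again illustrative names; `ambient_loss` is the helper sketched earlier):

```python
# Contract: push the corrupt point up to its estimated noise level t_n,
# then train exactly as before with the per-datapoint sigma_tn.
z1 = torch.randn_like(x0_corrupt)
x_tn = x0_corrupt + sigma_tn * z1
loss = ambient_loss(h_theta, x_tn, sigma_t, sigma_tn)  # sigma_t > sigma_tn
```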
Part 3
Interpretations
Diffusion is AR sampling in the frequency domain
[Diagram: timeline \(t=0 \to t_n \to T\) mapped to frequency: high noise levels carry low-frequency content (coarse structure), low noise levels carry high-frequency content (fine-grained details)]
* assuming the power spectral density of the data is decreasing
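A small numerical illustration of this reading (illustrative only, and it relies on the decreasing-PSD footnote): additive white Gaussian noise has power \(\sigma_t^2\) at every frequency, so as \(\sigma_t\) grows, high frequencies drown first, and denoising from \(t=T\) down to \(t=0\) resolves coarse structure before fine detail.

```python
import torch

def frequencies_above_noise_floor(x0, sigma_t):
    # Per-frequency power of a clean 1-D signal (real FFT).
    psd = torch.fft.rfft(x0).abs() ** 2 / x0.numel()
    # White noise contributes sigma_t**2 at every frequency, so only
    # frequencies whose PSD exceeds the noise floor remain recoverable.
    return psd > sigma_t**2
```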
[Diagram: timeline \(t=0 \to t_n \to T\) with the diffusion loss \(\mathbb E[\lVert h_\theta(x_t, t) - x_0 \rVert_2^2]\) on clean data and the ambient loss \(\mathbb E[\lVert \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t - x_{t_n} \rVert_2^2]\) on corrupt data, split across low and high frequencies]
Clean & corrupt data share the same "coarse structure" (i.e. the same low-frequency features)
Learn coarse structure from all data; learn fine-grained details from clean data only
Hypothesis: robot data shares low-frequency features, but differs in high-frequency features
Low frequency: learn with Open-X, AgiBot, sim, clean data, etc.
High frequency: learn from clean task-specific data
Diffusion is Euclidean projection onto the data manifold
Assumption: the clean and corrupt data manifolds lie in a similar part of space, but differ in shape
Large \(\sigma_t\): both clean and corrupt data project close to the clean data manifold
Small \(\sigma_t\): use clean data to project onto the clean data manifold
Part 4
Robotics??
[Diagram (recap): spectrum of robot data sources — Open-X, expert robot teleop, robot teleop, simulation — from "clean" to "corrupt"]
Start by exploring ideas inspired by Ambient Diffusion on sim-and-real cotraining
Corrupt Data (Sim) \(\mathcal D_S\): replayed GCS plans in Drake. Sources of corruption: non-Gaussian...
Clean Data (Real) \(\mathcal D_T\): teleoperated demos in a target-sim environment (... sorry, no video :( )
Training:
[Diagram: timeline \(t=0 \to t_n \to T\). Clean data trains \(\mathbb E[\lVert h_\theta(x_t, t) - x_0^{clean} \rVert_2^2]\); corrupt data trains \(\mathbb E[\lVert h_\theta(x_t, t) - x_0^{corrupt} \rVert_2^2]\). Low frequencies: learn with all data; high frequencies: learn with clean data.]
Intuition: mask out high-frequency components of "corrupt" data
Assume \(X_0^{corrupt} = X_{t_n}^{clean} = X_0^{clean} + \sigma_{t_n} Z\) (NOT TRUE), and train corrupt data with the ambient loss \(\mathbb E[\lVert \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t - x_0^{corrupt} \rVert_2^2]\)
[Diagram: timeline \(t=0 \to t_n \to T\). Clean data trains \(\mathbb E[\lVert h_\theta(x_t, t) - x_0^{clean} \rVert_2^2]\). Low frequencies: learn with all data; high frequencies: learn with clean data.]
Results are currently worse than the cotraining baseline :(
Ambient loss: \(\mathbb E[\lVert \frac{\sigma_t^2 - \sigma_{t_n}^2}{\sigma_t^2} h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2}x_t - x_{t_n} \rVert_2^2]\)
"Rescaled" ambient loss: \(\mathbb E[\lVert h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2-\sigma_{t_n}^2}x_t - \frac{\sigma_t^2}{\sigma_t^2 - \sigma_{t_n}^2}x_{t_n} \rVert_2^2]\) ... blows up for small \(\sigma_t\)
Worse than idea 1, but this was expected
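For completeness, the rescaling is just the ambient loss residual divided by \(w := \frac{\sigma_t^2-\sigma_{t_n}^2}{\sigma_t^2}\) (standard algebra):
\[
\frac{1}{w}\Big( w\, h_\theta(x_t,t) + (1-w)\,x_t - x_{t_n} \Big)
= h_\theta(x_t,t) + \frac{\sigma_{t_n}^2}{\sigma_t^2-\sigma_{t_n}^2}\,x_t - \frac{\sigma_t^2}{\sigma_t^2 - \sigma_{t_n}^2}\,x_{t_n},
\]
and the \(1/w\) factor diverges as \(\sigma_t \downarrow \sigma_{t_n}\), which is the blow-up at small \(\sigma_t\).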
Assume \(X_0^{corrupt} = X_{t_n}^{clean} = X_0^{clean} + \sigma_{t_n} Z\) (NOT TRUE)
1. Train a binary classifier to distinguish sim vs real actions
2. Use the classifier to find \(t_n\) for all sim actions in \(\mathcal D_S\)
3. Noise all sim actions to \(t_n\): \(X_{t_n}^{corrupt} = X_0^{corrupt} + \sigma_{t_n} Z \approx X_0^{clean} + \sigma_{t_n} Z\)
4. Train with ambient diffusion and \(t_n\) per datapoint
*same algorithm as SOTA protein folding
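Putting steps 1-4 together, a hedged end-to-end sketch (reusing the illustrative `estimate_tn` and `ambient_loss` helpers from earlier; `sample_sigma` and the data loader are likewise hypothetical):

```python
for x0, is_sim in dataloader:  # mix of D_S (sim) and D_T (real)
    # Steps 1-2: per-datapoint noise level from the sim-vs-real classifier.
    sigma_tn = estimate_tn(x0, classifier, sigmas) if is_sim else 0.0
    # Step 3: noise sim actions up to t_n.
    x_tn = x0 + sigma_tn * torch.randn_like(x0)
    # Step 4: ambient diffusion, sampling t in (t_n, T].
    sigma_t = sample_sigma(low=sigma_tn)
    ambient_loss(h_theta, x_tn, sigma_t, sigma_tn).backward()
    optimizer.step()
    optimizer.zero_grad()
```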
Result: the classifier assigns \(t_n = 0\) to 40%+ of action sequences... \(\implies\) the sim data is barely corrupt
[Diagram: timeline \(t=0 \to T\) over low and high frequencies. Clean data trains \(\mathbb E[\lVert h_\theta(x_t, t) - x_0^{clean} \rVert_2^2]\); corrupt data trains \(\mathbb E[\lVert h_\theta(x_t, t) - x_0^{corrupt} \rVert_2^2]\); the clean-data sampling ratio varies from \(\alpha_0\) at \(t=0\) to \(\alpha_T\) at \(t=T\).]
Low noise levels (small \(\sigma_t\)): \(D(p^{clean}_{\sigma_t} \,\|\, p^{corrupt}_{\sigma_t})\) is large \(\implies\) sample more clean data
High noise levels (large \(\sigma_t\)): \(D(p^{clean}_{\sigma_t} \,\|\, p^{corrupt}_{\sigma_t})\) is small \(\implies\) sample more corrupt data
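One way such a schedule could look in code (purely illustrative; the slides do not pin down the \(\alpha_t\) schedule, and the endpoint values here are made up):

```python
def p_clean(sigma_t, alpha_0=0.9, alpha_T=0.5, sigma_max=80.0):
    # Probability of drawing a clean sample at noise level sigma_t.
    # Interpolates from alpha_0 (low noise: the divergence is large, so
    # favor clean data) to alpha_T (high noise: the distributions have
    # merged, so corrupt data is safe to mix in).
    frac = min(sigma_t / sigma_max, 1.0)
    return alpha_0 + (alpha_T - alpha_0) * frac
```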
Takeaway: the motivating ideas for ambient diffusion could be useful for robotics, but the exact Ambient Diffusion algorithm might not??