Boyuan Chen
Diffusion Forcing (final name TBD) is a probabilistic sequence model that interleaves the time axis of auto-regressive models with the noise-level axis of diffusion models.
Prior model: predicts the state when there is no new observation
Posterior model: predicts the state when a new observation is made
What if we relax the Bayes filter beyond the binary choice of having an observation vs. no observation?
No observation -> Noisy observation -> Full observation
Masking is a fundamental technique in self-supervised learning
What if we mask by noise, instead of zero padding?
(Figure: frames ranging from masked to partially masked to not masked, laid out over a time step (t) axis and a diffusion step / noise level (k) axis.)
What does this give us:
We can diffuse a frame in the far future without fully diffusing the past!
(because we are trained on it)
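Roughly, this corresponds to a training loop where every frame in a sequence gets its own independently sampled noise level. Below is a minimal PyTorch sketch (not the authors' implementation); the `TinyDenoiser` architecture, the linear `alpha_bar` schedule, and all dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical denoiser: predicts the noise added to each frame, given the
# noisy sequence and each frame's noise level k (architecture is illustrative).
class TinyDenoiser(nn.Module):
    def __init__(self, obs_dim=16, hidden=64, K=100):
        super().__init__()
        self.k_embed = nn.Embedding(K + 1, hidden)
        self.rnn = nn.GRU(obs_dim + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, obs_dim)

    def forward(self, noisy_obs, k):                  # noisy_obs: (B,T,D), k: (B,T)
        h, _ = self.rnn(torch.cat([noisy_obs, self.k_embed(k)], dim=-1))
        return self.out(h)                            # predicted noise, (B,T,D)

K = 100                                               # number of noise levels
alpha_bar = torch.linspace(1.0, 1e-3, K + 1)          # toy schedule: 0 = clean, K = pure noise

def diffusion_forcing_loss(model, obs):
    """obs: (B,T,D) clean frames; every frame is 'masked by noise' to its own level."""
    B, T, D = obs.shape
    k = torch.randint(0, K + 1, (B, T))               # independent per-frame noise level
    ab = alpha_bar[k].unsqueeze(-1)                   # (B,T,1)
    eps = torch.randn_like(obs)
    noisy = ab.sqrt() * obs + (1 - ab).sqrt() * eps   # partially masked frames
    return F.mse_loss(model(noisy, k), eps)           # standard epsilon-prediction loss

model = TinyDenoiser()
loss = diffusion_forcing_loss(model, torch.randn(4, 8, 16))
loss.backward()
```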
And after two pages of math
(Figure labels: 50% / 50%; "Optimal one-step prediction in state space with deterministic model"; "Optimal one-step prediction in pixel space with deterministic model"; "Imitate human trajectories"; "Imitate human trajectories with deterministic model".)
(Figure: diffusion transports a distribution we know how to sample from to the target distribution.)
Let y be a variable we want to condition on, e.g. y = the label 'cat'
Can y be a learned variable? Yes!
It could be a differentiable latent code that encodes belief given all previous observations.
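For concreteness, here is a hedged sketch of one way such a learned conditioning variable could be produced: an RNN whose hidden state z_{t-1} summarizes all previous observations and is handed to the per-frame denoiser as its condition. `BeliefRNN` and all sizes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Sketch: y is a learned latent z_{t-1} summarizing all previous observations.
class BeliefRNN(nn.Module):
    def __init__(self, obs_dim=16, z_dim=64):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim, z_dim)

    def forward(self, obs_seq, z0=None):              # obs_seq: (B,T,D)
        B, T, _ = obs_seq.shape
        z = torch.zeros(B, self.cell.hidden_size) if z0 is None else z0
        beliefs = []
        for t in range(T):
            beliefs.append(z)                         # z_{t-1}: belief before seeing o_t
            z = self.cell(obs_seq[:, t], z)           # fold o_t into the belief
        return torch.stack(beliefs, dim=1)            # (B,T,z_dim): one condition per frame

beliefs = BeliefRNN()(torch.randn(2, 5, 16))          # each z_{t-1} would condition frame t's denoiser
```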
Diffusion steps are usually denoted by t, but from now on we will denote them by k, since we will introduce the actual time variable t.
Full Trajectory rollout
Single step rollout
Our method can do both
Given initial context \(s_0\) (omitted in the notation for simplicity)
$$ \ln p(o_{1:T})=\ln\int_{s_{1:T}}p(o_{1:T},s_{1:T}) \ d{s_{1:T}}$$
$$=\ln\int_{s_{1:T}} \prod_{t=1}^T p(o_t,s_t|s_{1:t-1},o_{1:t-1}) \ d{s_{1:T}}$$
$$=\ln\int_{s_{1:T}} \prod_{t=1}^T p(o_t|s_{1:t},o_{1:t-1}) p(s_t|s_{1:t-1},o_{1:t-1}) \ d{s_{1:T}}$$
$$=\ln\int_{s_{1:T}} \prod_{t=1}^T p(o_t|s_{t}) p(s_t|s_{t-1}) \ d{s_{1:T}}$$
$$=\ln\mathop{\mathbb{E}}_{s_{1:T}}[\prod_{t=1}^T p(o_t|s_{t})] $$
How to sample \( s_{1:T}\) as in the expectation \(\mathop{\mathbb{E}}_{s_{1:T}}[\prod_{t=1}^T p(o_t|s_{t})]\)? Candidate: free-running rollout
\(\ln p(o_{1:T})=\ln\mathop{\mathbb{E}}_{s_{1:T}}[\prod_{t=1}^T p(o_t|s_{t})]\ge \sum_{t=1}^T \mathop{\mathbb{E}}_{s_{1:T}}[\ln p(o_t|s_{t})]\)
What about importance sampling with the posterior?
$$\ln p(o_{1:T}) = \ln \mathop{\mathbb{E}}_{s_{1:T}}[\prod_{t=1}^T p(o_t|s_{t})]=\ln \mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1})}[\prod_{t=1}^T p(o_t|s_{t})]$$
$$=\ln\mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1},o_t)}[\prod_{t=1}^T p(o_t|s_{t}) p(s_t|s_{t-1}) / p(s_t|s_{t-1},o_t)]$$
$$=\ln\mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1},o_t)}[\prod_{t=1}^T p(o_t,s_t|s_{t-1}) / p(s_t|o_t,s_{t-1})]$$
$$=\ln\mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1},o_t)}[\prod_{t=1}^T p(o_t|s_{t-1}) ]$$
$$\ln p(o_{1:T}) =\ln\mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1},o_t)}[\prod_{t=1}^T p(o_t|s_{t-1}) ]$$
$$\ge \mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1},o_t)}[\ln\prod_{t=1}^T p(o_t|s_{t-1})]$$
$$= \mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1},o_t)}[\sum_{t=1}^T \ln p(o_t|s_{t-1})]$$
Where each \(p(o_t|s_{t-1})\) can be lower bounded by the ELBO of a diffusion model, conditioned on \(s_{t-1}\)!
\(\ln p(o_{1:T})\ge \mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1},o_t)}[\sum_{t=1}^T \ln p(o_t|s_{t-1})]\)
We can do importance sampling via teacher forcing
ELBO1: \(\ln p(o_{1:T})\ge \sum_{t=1}^T \mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1})}[\ln p(o_t|s_{t})]\)
ELBO2: \(\ln p(o_{1:T})\ge \sum_{t=1}^T \mathop{\mathbb{E}}_{s_t\sim p(s_t|s_{t-1},o_t)}[ \ln p(o_t|s_{t-1}) ]\)
ELBO1 samples from prior model \(p(s_t|s_{t-1})\)
ELBO2 samples from posterior model \( p(s_t|s_{t-1},o_t)\)
What about something in between?
\( p(s_t|s_{t-1},N(o_t))\), where \(N(o_t)\) is a noised version of \(o_t\)
\( p(s_t|s_{t-1},N(o_t)) \rightarrow p(s_t|s_{t-1},o_t)\) when \(N\) is zero noise
\( p(s_t|s_{t-1},N(o_t)) \rightarrow p(s_t|s_{t-1})\) when \(N(o_t)\) is pure noise
We constructed a smooth interpolation between prior and posterior!
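As a sketch (reusing the toy `alpha_bar` schedule and a GRU-cell belief update, both placeholder assumptions), the interpolation is just: noise the observation to level k before the posterior-style update.

```python
import torch
import torch.nn as nn

K = 100
alpha_bar = torch.linspace(1.0, 1e-3, K + 1)          # 0 = clean, K = pure noise

def noised_posterior_update(cell, z, o, k):
    """Belief update with a partially masked observation: p(s_t | s_{t-1}, N(o_t)).
    k = 0 uses the clean o_t (full posterior); k = K feeds pure noise, which
    approximately recovers the prior p(s_t | s_{t-1})."""
    ab = alpha_bar[k]
    noisy_o = ab.sqrt() * o + (1 - ab).sqrt() * torch.randn_like(o)
    return cell(noisy_o, z)

cell = nn.GRUCell(16, 64)
z, o = torch.zeros(1, 64), torch.randn(1, 16)
z_posterior_like = noised_posterior_update(cell, z, o, k=0)
z_prior_like     = noised_posterior_update(cell, z, o, k=K)
```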
(Figures: sampling schedules shown as grids of diffusion step (k) vs. time step (t).)
Trained on 72 frames, sampled 500 frames without a sliding window
(500 is only due to GIF size; it can do 3000+ without blowing up)
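A hedged sketch of how such long rollouts can work without a sliding window: each new frame is initialized as pure noise and denoised conditioned only on a recurrent latent, so history is carried by the latent rather than by a fixed-length window of raw frames. `denoise_frame` and `update_belief` are placeholder callables, not the released code.

```python
import torch

@torch.no_grad()
def rollout(denoise_frame, update_belief, z0, horizon=500, obs_shape=(16,), K=100):
    """Autoregressive sampling sketch: the horizon is not tied to the training
    context length because only the latent belief z is carried forward."""
    z, frames = z0, []
    for _ in range(horizon):
        x = torch.randn(1, *obs_shape)                # start the new frame at the noisiest level
        for k in range(K, 0, -1):                     # denoise it step by step
            x = denoise_frame(x, z, k)
        frames.append(x)
        z = update_belief(z, x)                       # fold the generated frame into the belief
    return torch.cat(frames, dim=0)

# Dummy callables just to show the interface; real networks go here.
video = rollout(lambda x, z, k: 0.99 * x, lambda z, x: z, torch.zeros(1, 64), horizon=10)
```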
64x64
128x128
Diffusion Planning
Diffusion Forcing
Now you are at \(o_t\). You want an action. The best you can do is to diffuse \(x_t=[o_t, a_t, r_t]\).
But you already observed \(o_t\)!
Joint distribution of policy and dynamics
After receiving the ground-truth \(o_{t+1}\), update the posterior state via \(x_t=[o_{GT, t+1}, a_t, r_t]\)
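In pseudocode, the planning loop looks roughly like the sketch below; `plan`, `update_belief`, and the gym-style `env.step` return signature are assumptions for illustration, not the paper's API.

```python
def mpc_step(plan, update_belief, z, o_t, env):
    """One model-predictive step (sketch). `plan(z, o_t)` diffuses
    x_t = [o_t, a_t, r_t] and the future, conditioned on the observed o_t,
    and returns only the first action."""
    a_t = plan(z, o_t)                       # diffuse a plan, execute its first action
    o_next, r_t, done, info = env.step(a_t)  # assumed gym-style signature
    # Update the posterior belief with the ground-truth observation rather than
    # the model's own prediction of it.
    z = update_belief(z, o_next, a_t, r_t)
    return z, o_next, done
```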
How do we get a good policy instead of copying a suboptimal one?
Define the probability of task success as \(\exp(V-V^*)\), where \(V^*\) denotes the maximum possible value
...
Take \(V = r_t + r_{t+1} + \dots + r_{T}\), do value guidance
You can take \(V = r_t + r_{t+1} + \dots + r_{T}\) for guidance
But you can also take \(V = r_t\) for guidance, unlike previous methods
...
(× 10 samples each)
Take \(V = \mathrm{avg}(r_t + r_{t+1} + \dots + r_{T})\), do value guidance
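A hedged sketch of what one guided denoising step could look like: since p(success) ∝ exp(V − V*), the V* term is a constant and drops out of the gradient, so guidance just nudges the noisy trajectory along ∇V. `denoise` and `reward_model` are placeholder callables, not the paper's networks.

```python
import torch

def guided_denoise_step(denoise, reward_model, x, k, scale=1.0):
    """One reverse-diffusion step with value guidance (sketch).
    reward_model(x) predicts V (e.g. a sum or average of future rewards) from
    the noisy trajectory; grad_x log exp(V - V*) = grad_x V."""
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(reward_model(x).sum(), x)[0]   # ∇_x V
    with torch.no_grad():
        return denoise(x, k) + scale * grad                    # nudge toward higher value

reward_model = torch.nn.Linear(16, 1)                          # toy stand-in for a value head
x = torch.randn(1, 16)
x = guided_denoise_step(lambda x, k: 0.99 * x, reward_model, x, k=50)
```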
Changing \([o_{t+1}, a_t, r_t]\) should make \([o_{t+2}, a_{t+1}, r_{t+1}]\) more uncertain,
while \(\mathrm{Noisy}([o_{t+2}, a_{t+1}, r_{t+1}])\) asks it to be certain
(Figure: pyramid sampling shown as a grid of per-frame noise levels over time step (t) and diffusion step (k); N denotes the most noisy level, with nearer frames denoised to N-1, N-2, ... while farther frames remain at N. Annotated: "No gradient!")
Calculate the guidance gradient without pyramid sampling, then resample to a higher noise level by adding noise
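Concretely, this could look like the sketch below (all names are illustrative assumptions): compute the value gradient on a clean estimate of the trajectory, apply it, then add noise back up to the target level of the pyramid schedule.

```python
import torch

K = 100
alpha_bar = torch.linspace(1.0, 1e-3, K + 1)          # toy schedule: 0 = clean, K = pure noise

def guidance_then_renoise(denoise_to_clean, reward_model, x, k_target, scale=1.0):
    """Compute the guidance gradient on a (nearly) clean estimate instead of on
    pyramid-sampled noisy frames, then resample to noise level k_target."""
    x0 = denoise_to_clean(x).detach().requires_grad_(True)
    grad = torch.autograd.grad(reward_model(x0).sum(), x0)[0]
    x0 = (x0 + scale * grad).detach()
    ab = alpha_bar[k_target]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * torch.randn_like(x0)   # add noise back

reward_model = torch.nn.Linear(16, 1)
x_guided = guidance_then_renoise(lambda x: 0.5 * x, reward_model, torch.randn(1, 16), k_target=80)
```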
Result
Consider a table with 3 slots. An apple and an orange start at random slots upon initialization. The task is to swap them using the third slot.
This is not Markovian!
Just do imitation learning on this, w/ or w/o memory
Diffusion Forcing can be used as a diffusion policy, but with flexible-horizon memory
(All images shown are predicted by diffusion)
Result