Generalization Dynamics of Linear Diffusion Models

NeurIPS Blitz talks, Claudia Merger, Sebastian Goldt

06.05.2025

Diffusion models\(^1\)

[Figure: model \( f_{\theta} \)]

[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. ‘Denoising Diffusion Probabilistic Models’, 2020

Reverse an iterative noising process by predicting the noise \( \epsilon_t = \epsilon_{\theta}(x_t,t) \) at each step
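The noising process that the model learns to reverse can be sketched as follows. This is a minimal NumPy illustration, assuming the closed-form forward step and the linear variance schedule of reference [1]; the function name `forward_noise`, the schedule values and the toy dimension are choices made for this sketch, not taken from the talk.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Draw x_t from x_0 in one shot and return the noise that was added.

    A noise-prediction model eps_theta(x_t, t) would be trained to
    recover `eps` from `x_t`.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Assumed schedule: linearly increasing noise variances over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal fraction per step

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)         # toy "data point" in 16 dimensions
t = 500
x_t, eps = forward_noise(x0, t, alpha_bar, rng)
```

Sampling then runs the chain backwards, replacing the true `eps` with the model's prediction at each step.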

Diffusion models\(^1\) first memorize, then generalize\(^2\)

[2] Kadkhodaie, Z. et al. ‘Generalization in Diffusion Models Arises from Geometry-Adaptive Harmonic Representations’, April 2024

\(\rightarrow\) Empirically, generalization occurs when the number of training examples \(N\) is "large enough"

How much data do diffusion models need?

\(\rightarrow\) Fully tractable model: Linear diffusion models

\(\rightarrow\) Empirically, generalization occurs around \(N \sim d\), where \(N\) is the number of training examples and \(d\) is the data dimension

Assume data are \(\rho = \mathcal{N} \left( \mu,\Sigma \right) \)

Linear models learn: \(\rho_{N} \approx \mathcal{N} \left( \mu_0,\Sigma_0 +c\text{Id}\right) \)

\( \mu_0,\Sigma_0 = \) empirical mean and covariance of training data \( \neq \mu, \Sigma \)
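A minimal NumPy sketch of this setup, under stated assumptions: the diagonal \(\Sigma\) with a power-law spectrum, the values of \(d\), \(N\) and \(c\), and the helper name `empirical_moments` are all hypothetical choices for illustration. It draws \(N\) samples from \(\rho\), forms \(\mu_0\) and \(\Sigma_0\), and builds the covariance \(\Sigma_0 + c\,\text{Id}\) of the learned Gaussian.

```python
import numpy as np

def empirical_moments(samples):
    """Empirical mean mu_0 and covariance Sigma_0 of samples with shape (N, d)."""
    mu0 = samples.mean(axis=0)
    centered = samples - mu0
    sigma0 = centered.T @ centered / samples.shape[0]
    return mu0, sigma0

rng = np.random.default_rng(0)
d, N, c = 20, 8, 1e-2                        # hypothetical sizes and offset c
mu = np.zeros(d)
Sigma = np.diag(1.0 / np.arange(1, d + 1))   # assumed true covariance

samples = rng.multivariate_normal(mu, Sigma, size=N)
mu0, sigma0 = empirical_moments(samples)
learned_cov = sigma0 + c * np.eye(d)         # covariance of rho_N

# After centering on mu_0, Sigma_0 has rank at most N - 1,
# so for N < d most directions carry no empirical variance.
rank = np.linalg.matrix_rank(sigma0)
```

With \(N < d\), `sigma0` is rank deficient, while `learned_cov` stays invertible because of the \(c\,\text{Id}\) term.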

How large should \(N\) be?

test loss \( \sim \text{Tr}\left[ \left(\Sigma -\Sigma_0\right) \left(\Sigma_0 + c\,\text{Id} \right)^{-2} \right] + \text{const.} \)

When \(N < d\), there are at least \(d-N\) directions \(e_{\nu}\) with \( \Sigma_0 e_{\nu} = 0\)

\(\Rightarrow \) test loss \( \gtrsim \sum_{\nu} \frac{ \Sigma_{\nu,\nu}}{c^2}  \)
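The bound can be checked numerically by splitting the trace into one term per eigendirection of \(\Sigma_0\). The sketch below is an illustration under assumptions: the diagonal \(\Sigma\), the values of \(d\), \(N\), \(c\), the tolerance used to detect empty directions, and the helper name `loss_terms_by_direction` are hypothetical, and the constant in the test loss is ignored.

```python
import numpy as np

def loss_terms_by_direction(Sigma, sigma0, c, tol=1e-10):
    """Per-direction terms of Tr[(Sigma - Sigma_0)(Sigma_0 + c Id)^{-2}].

    Works in the eigenbasis of Sigma_0 and also returns a boolean mask
    marking the directions e_nu where Sigma_0 e_nu = 0 (up to `tol`).
    """
    lam, vecs = np.linalg.eigh(sigma0)
    # e_nu^T Sigma e_nu for every eigenvector e_nu (columns of vecs)
    sigma_proj = np.einsum("ij,ik,kj->j", vecs, Sigma, vecs)
    terms = (sigma_proj - lam) / (lam + c) ** 2
    return terms, lam < tol

rng = np.random.default_rng(0)
d, N, c = 20, 8, 1e-2                        # hypothetical sizes and offset c
Sigma = np.diag(1.0 / np.arange(1, d + 1))   # assumed true covariance

x = rng.multivariate_normal(np.zeros(d), Sigma, size=N)
xc = x - x.mean(axis=0)
sigma0 = xc.T @ xc / N

terms, empty = loss_terms_by_direction(Sigma, sigma0, c)
total = terms.sum()               # full trace
empty_part = terms[empty].sum()   # sum over nu of Sigma_{nu,nu} / c^2
```

Each empty direction contributes \(\Sigma_{\nu,\nu}/c^2\), which is large for small \(c\); the remaining terms can have either sign, which is why the bound on the slide is approximate rather than strict.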

\(\rightarrow\) The training data need to "fill up" all relevant directions in \( \Sigma_0 \)

Beyond the test loss?

 

[Equation graphic: the test loss equals a quantity not recovered here, up to an additive constant]

\( \rightarrow \) test loss measures difference from the best model

Kullback-Leibler divergence

\( D_{\text{KL}} \left(\rho_N \,\|\, \rho\right) \sim \ln \frac{\left| \Sigma \right|}{\left| \Sigma_0 + c\, \text{Id} \right|} \)

\( \rightarrow \) distributions align when relevant directions in \(\Sigma \) are also present in \( \Sigma_0\)
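For two Gaussians the divergence has a standard closed form, of which the log-determinant ratio above is one term. The sketch below evaluates that full closed form for \(\rho_N = \mathcal{N}(\mu_0, \Sigma_0 + c\,\text{Id})\) against \(\rho = \mathcal{N}(\mu, \Sigma)\) at a small and a large \(N\). It is an illustration under assumptions: the diagonal \(\Sigma\), the values of \(d\), \(c\) and \(N\), and the helper names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) in closed form."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - d
                  + logdet_q - logdet_p)   # ln |cov_q| / |cov_p|

def kl_of_learned_model(N, mu, Sigma, c, rng):
    """D_KL(rho_N || rho) for a model built from N training samples."""
    x = rng.multivariate_normal(mu, Sigma, size=N)
    mu0 = x.mean(axis=0)
    xc = x - mu0
    sigma0 = xc.T @ xc / N
    return gaussian_kl(mu0, sigma0 + c * np.eye(len(mu)), mu, Sigma)

rng = np.random.default_rng(0)
d, c = 20, 1e-2                              # hypothetical dimension and offset c
mu = np.zeros(d)
Sigma = np.diag(1.0 / np.arange(1, d + 1))   # assumed true covariance

kl_few = kl_of_learned_model(4, mu, Sigma, c, rng)      # N << d
kl_many = kl_of_learned_model(2000, mu, Sigma, c, rng)  # N >> d
```

Sweeping \(N\) between these two values is one way to see where the divergence drops for a given spectrum of \(\Sigma\).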

\( N \sim d\)

Stay tuned for...

 

...the spectrum of \(\Sigma\)?

...regularization?

...early stopping?

...the difference between linear and non-linear diffusion models?
