Generalization Dynamics of Linear Diffusion Models

NeurIPS Blitz talks, Claudia Merger, Sebastian Goldt

06.05.2025

Diffusion models\(^1\)

\(f_{\theta}\)

[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. ‘Denoising Diffusion Probabilistic Models’, 2020

Reverse the iterative noising process by predicting the noise \( \epsilon_t = \epsilon_{\theta}(x_t,t) \) at each step
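The reverse process can be sketched in a few lines of NumPy. This is a minimal illustration of the ancestral sampling step from [1], not the code behind the talk: the linear variance schedule, the names `make_schedule`, `reverse_step`, `sample`, and the toy noise predictor `toy_eps_model` are all assumptions made for the example. The toy predictor stands in for a trained network \(\epsilon_{\theta}\); it is the optimal (and linear) predictor when the data are standard normal.

```python
import numpy as np

def make_schedule(T=100, beta_min=1e-4, beta_max=0.02):
    """Linear variance schedule (an assumed choice for this sketch)."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def reverse_step(x_t, t, eps_model, schedule, rng):
    """One ancestral sampling step x_t -> x_{t-1}."""
    betas, alphas, alpha_bars = schedule
    eps_hat = eps_model(x_t, t)  # predicted noise eps_theta(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no fresh noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def sample(eps_model, d, schedule, seed=0):
    """Start from pure noise and apply the reverse step T times."""
    rng = np.random.default_rng(seed)
    T = len(schedule[0])
    x = rng.standard_normal(d)
    for t in reversed(range(T)):
        x = reverse_step(x, t, eps_model, schedule, rng)
    return x

schedule = make_schedule()
alpha_bars = schedule[2]

# Hypothetical stand-in for a trained network: for data ~ N(0, Id),
# the optimal noise prediction is E[eps | x_t] = sqrt(1 - alpha_bar_t) * x_t.
toy_eps_model = lambda x_t, t: np.sqrt(1.0 - alpha_bars[t]) * x_t

x0 = sample(toy_eps_model, d=5, schedule=schedule)
```

Swapping `toy_eps_model` for any callable with the signature `(x_t, t)` gives the same loop for a learned predictor.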

Diffusion models\(^1\) first memorize, then generalize\(^2\)


[2] Kadkhodaie, Z. et al. ‘Generalization in Diffusion Models Arises from Geometry-Adaptive Harmonic Representations’, April 2024.

\(\rightarrow\) Empirically, generalization occurs when the number of training examples \(N\) is "large enough"

How much data do diffusion models need?

\(\rightarrow\) Fully tractable model: Linear diffusion models

\(\rightarrow\) Empirically, generalization occurs around \(N \sim d\) training examples, where \(d\) is the data dimension

Assume the data are drawn from \(\rho = \mathcal{N} \left( \mu,\Sigma \right) \)

Linear models learn: \(\rho_{N} \approx \mathcal{N} \left( \mu_0,\Sigma_0 +c\text{Id}\right) \)

\( \mu_0,\Sigma_0 = \) empirical mean and covariance of training data \( \neq \mu, \Sigma \)
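As a rough illustration of the learned distribution, the sketch below computes the empirical mean and covariance of a training set and draws new samples from \(\mathcal{N}(\mu_0, \Sigma_0 + c\,\text{Id})\). The function name `fit_linear_model`, the \(1/N\) normalization of \(\Sigma_0\), and the values of `d`, `N`, `c` and the true covariance are assumptions chosen for the example; the slides do not fix them.

```python
import numpy as np

def fit_linear_model(X, c):
    """Return mu_0, Sigma_0 and the learned covariance Sigma_0 + c*Id."""
    N, d = X.shape
    mu0 = X.mean(axis=0)
    Xc = X - mu0
    Sigma0 = Xc.T @ Xc / N  # empirical covariance (1/N normalization assumed)
    return mu0, Sigma0, Sigma0 + c * np.eye(d)

rng = np.random.default_rng(0)
d, N, c = 20, 10, 0.1                       # illustrative values, N < d
mu = np.zeros(d)
Sigma = np.diag(np.linspace(0.5, 2.0, d))   # assumed true covariance

X = rng.multivariate_normal(mu, Sigma, size=N)   # training data drawn from rho
mu0, Sigma0, learned_cov = fit_linear_model(X, c)

# Samples from the learned distribution rho_N
new_samples = rng.multivariate_normal(mu0, learned_cov, size=1000)
```

The \(c\,\text{Id}\) term keeps the learned covariance positive definite even when \(N < d\) and \(\Sigma_0\) is rank-deficient.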

How large should \(N\) be?


test loss \( \sim \text{Tr}\frac{ \Sigma -\Sigma_0}{\left(\Sigma_0 + c\,\text{Id} \right)^2} + \text{const.} \)

When \(N < d\), there are at least \(d-N\) directions \(\nu\) with \( \Sigma_0 e_{\nu} = 0\)

\(\Rightarrow \) test loss \( \gtrsim \sum_{\nu} \frac{ \Sigma_{\nu,\nu}}{c^2} \)
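The null-direction argument can be checked numerically. The sketch below counts the directions with \(\Sigma_0 e_{\nu} = 0\), evaluates the trace term in the eigenbasis of \(\Sigma_0\), and computes the contribution \(\sum_{\nu} \Sigma_{\nu,\nu}/c^2\) from the null directions. The helper name `null_direction_bound`, the tolerance, and the choices \(d = 50\), \(c = 0.1\), \(\Sigma = \text{Id}\) are assumptions for illustration only.

```python
import numpy as np

def null_direction_bound(X, Sigma, c, tol=1e-10):
    """Count null directions of Sigma_0 and evaluate the trace term and its null part."""
    N, d = X.shape
    Xc = X - X.mean(axis=0)
    Sigma0 = Xc.T @ Xc / N
    lam, E = np.linalg.eigh(Sigma0)           # columns of E are the directions e_nu
    null = lam < tol * lam.max()              # numerically zero eigenvalues
    # Sigma_{nu,nu} = e_nu^T Sigma e_nu for every direction nu
    Sigma_nn = np.einsum("in,ij,jn->n", E, Sigma, E)
    # Tr[(Sigma - Sigma_0)(Sigma_0 + c Id)^{-2}] written in the eigenbasis of Sigma_0
    trace_term = np.sum((Sigma_nn - lam) / (lam + c) ** 2)
    bound = np.sum(Sigma_nn[null]) / c**2     # contribution of the null directions
    return int(null.sum()), trace_term, bound

rng = np.random.default_rng(0)
d, c = 50, 0.1
Sigma = np.eye(d)

results = {}
for N in (10, 200):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=N)
    results[N] = null_direction_bound(X, Sigma, c)
    print(N, results[N])
```

For \(N < d\) the null part scales like \(1/c^2\) and dominates the trace term; once \(N\) exceeds \(d\), no null directions remain and that contribution vanishes.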

"fill up" all relevant directions in \( \Sigma_0 \)

Beyond the test loss?

[Equation graphic relating the test loss to another quantity, up to an additive constant]

\( \rightarrow \) test loss measures difference from the best model

Beyond the test loss?

Kullback-Leibler divergence

\( D_{\text{KL}} (\rho_N \,\|\, \rho) \sim \ln \frac{\left| \Sigma \right|}{\left| \Sigma_0 + c\, \text{Id} \right|} \)

\( \rightarrow \) distributions align when relevant directions in \(\Sigma \) are also present in \( \Sigma_0\)

\( N \sim d\)
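For two Gaussians the Kullback-Leibler divergence has a closed form, so the alignment of \(\rho_N\) with \(\rho\) can be tracked directly as \(N\) grows. The sketch below uses the standard Gaussian KL formula; the function names `gaussian_kl` and `kl_learned_vs_true` and the parameter values (\(d = 50\), \(c = 0.01\), \(\Sigma = \text{Id}\), \(\mu = 0\)) are assumptions made for this example.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) in closed form."""
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    trace = np.trace(np.linalg.solve(cov_q, cov_p))   # Tr(cov_q^{-1} cov_p)
    maha = diff @ np.linalg.solve(cov_q, diff)        # Mahalanobis term
    logdet_q = np.linalg.slogdet(cov_q)[1]
    logdet_p = np.linalg.slogdet(cov_p)[1]
    return 0.5 * (trace + maha - d + logdet_q - logdet_p)

def kl_learned_vs_true(N, d=50, c=0.01, seed=0):
    """KL(rho_N || rho) for a training set of size N (illustrative setup)."""
    rng = np.random.default_rng(seed)
    mu, Sigma = np.zeros(d), np.eye(d)
    X = rng.multivariate_normal(mu, Sigma, size=N)
    mu0 = X.mean(axis=0)
    Xc = X - mu0
    learned_cov = Xc.T @ Xc / N + c * np.eye(d)       # Sigma_0 + c*Id
    return gaussian_kl(mu0, learned_cov, mu, Sigma)

for N in (20, 50, 200, 2000):
    print(N, kl_learned_vs_true(N))
```

The log-determinant difference `logdet_q - logdet_p` is the term shown on the slide; it is large while \(\Sigma_0\) still has directions with eigenvalue zero and shrinks once \(N\) passes \(d\).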

Stay tuned for...

...the spectrum of \(\Sigma\)?

...regularization?

...early stopping?

...the difference between linear and non-linear diffusion models?
