Claudia Merger
10.07.2025
Established in 2020; today:
4 PIs
25 students
13 postdocs
TSDS & friends, October 2024
Theoretical Neuroscience and DS
Claudia Merger
12.08.2025
e.g. Image generation
Task: Given some data \( \mathcal{D} \) from an unknown distribution \( p \)
Generate \( x \sim p \)
The task is solved by learning \( p_{\theta} \approx p \)
"happy data scientist"
"summer in Trieste"
"intelligent bamboo"
Two questions:
\( \rightarrow \) How do we choose the model \( p_{\theta} \)?
\( \rightarrow \) How close does \( p_{\theta} \) come to the true \( p \)?
Statistical physics provides a language to span model space \( \rightarrow \) interactions
Write an interacting theory using a polynomial action \( S_{\theta}(x) = \ln p_{\theta}(x) \)
\( S_{\theta} (x)= A^{(0)} + A^{(1)}_{i} x_i + A^{(2)}_{ij} x_i x_j +A^{(3)}_{ijk} x_i x_j x_k + \dots \)
Example: Merger, C., et al., ‘Learning Interacting Theories from Data’, PRX, 2023
\( A^{(k)} \): interaction coefficients of order \( k \)
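A minimal sketch (not code from the paper) of how such a truncated polynomial action can be evaluated; the coefficient tensors and their values are purely illustrative assumptions.

```python
# Sketch: evaluate a polynomial action truncated at third order,
# S_theta(x) = A0 + A1_i x_i + A2_ij x_i x_j + A3_ijk x_i x_j x_k.
# The coefficient tensors below are illustrative placeholders, not values from the paper.
import numpy as np

def action(x, A0, A1, A2, A3):
    return (
        A0
        + np.einsum("i,i->", A1, x)
        + np.einsum("ij,i,j->", A2, x, x)
        + np.einsum("ijk,i,j,k->", A3, x, x, x)
    )

d = 3
rng = np.random.default_rng(1)
A0 = 0.0
A1 = rng.normal(size=d)
A2 = -0.5 * np.eye(d)                   # a Gaussian theory would stop at this order
A3 = 0.01 * rng.normal(size=(d, d, d))  # higher-order ("hard") interaction
x = rng.normal(size=d)
print(action(x, A0, A1, A2, A3))        # = S_theta(x) = ln p_theta(x), up to normalization
```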
Observation: neural networks learn "easy" statistics first, then more complex statistics
\( \rightarrow \) principled approach to studying how statistics are learned from data, from easy to hard
\( \rightarrow \) see also: Ingrosso & Goldt, 2022; Refinetti et al., 2023; Belrose et al., 2024, ...
NICE (Dinh et al., 2015), RealNVP (Dinh et al., 2017), GLOW (Kingma et al., 2018)
Sohl-Dickstein et al., ‘Deep Unsupervised Learning Using Nonequilibrium Thermodynamics’, 2015
Ho, Jain, and Abbeel, ‘Denoising Diffusion Probabilistic Models’, 2020
Reverse the iterative noising process by predicting the noise \( \epsilon_t = \epsilon_{\theta}(x_t, t) \) at each step
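As a sketch of what reversing the noising process looks like in code (standard DDPM-style sampling; the linear noise schedule is an assumption, and `eps_model` is a placeholder for the trained network \( \epsilon_{\theta} \)):

```python
# Sketch of DDPM-style reverse sampling; eps_model is a placeholder for the trained
# noise predictor epsilon_theta(x_t, t), and the linear beta schedule is an assumption.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    # Placeholder: a real model would be a trained neural network.
    return np.zeros_like(x_t)

def sample(d, rng):
    x = rng.normal(size=d)                         # start from pure noise, x_T ~ N(0, Id)
    for t in range(T - 1, -1, -1):                 # iterate the reverse (denoising) steps
        eps = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=d) if t > 0 else np.zeros(d)
        x = mean + np.sqrt(betas[t]) * noise       # sigma_t^2 = beta_t (one common choice)
    return x

print(sample(d=4, rng=np.random.default_rng(0)))
```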
Two solutions:
\( \rightarrow \) the global minimum of the training loss, i.e. memorization of the training set (see e.g. Ambrogioni, 2023)
\( \rightarrow \) generalization: empirically occurs when \( N \) is "large enough" (Kadkhodaie, Z., et al., ‘Generalization in Diffusion Models Arises from Geometry-Adaptive Harmonic Representations’, 2024)
Merger, Goldt, 2025, arXiv:2505.24769
[Figure: \( \text{DKL}\left(p_{\theta} | p\right) \) vs. number of training examples, comparing \( p \), \( p_{\theta} \), and \( p^{(k>2)}_{\theta} \)]
Good performance: at least \( \# \text{training examples} \asymp d \)
What if I don't have that much data, \( N \ll d \)?
\(\rightarrow \) Early stopping
\(\rightarrow \) Regularization
Two questions (recap):
\( \rightarrow \) How do we choose the model \( p_{\theta} \)?
\( \rightarrow \) How close does \( p_{\theta} \) come to the true \( p \)?
Using interactions, we can
Learned data distribution: \( p_{\theta}(x) = p_Z\left( f_{\theta}(x) \right) \big| \det J_{f_{\theta}}(x) \big| \)
Loss function: \( \mathcal{L}\left(\mathcal{D}\right) = -\sum_{x \in \mathcal{D}} \ln p_{\theta}(x) \)
NICE (Dinh et al., 2015), RealNVP (Dinh et al., 2017), GLOW (Kingma et al., 2018)
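A minimal sketch of these two formulas for a single RealNVP-style affine coupling layer (my own toy version; the scale and shift maps are simple placeholders, not the architectures of the cited papers):

```python
# Minimal sketch: log p_theta(x) via change of variables, and the NLL loss,
# for one RealNVP-style affine coupling layer. The scale/shift maps below are toy
# placeholders (assumptions), not the architectures of NICE / RealNVP / GLOW.
import numpy as np

rng = np.random.default_rng(0)
d = 4
d_half = d // 2
W_scale = 0.1 * rng.normal(size=(d_half, d_half))   # hypothetical parameters theta
W_shift = 0.1 * rng.normal(size=(d_half, d_half))

def f_theta(x):
    """Coupling layer: keep x1, affinely transform x2; return z and log|det J_f(x)|."""
    x1, x2 = x[:d_half], x[d_half:]
    s = np.tanh(W_scale @ x1)                        # log-scale
    t = W_shift @ x1                                 # shift
    z = np.concatenate([x1, x2 * np.exp(s) + t])
    return z, np.sum(s)                              # triangular Jacobian: log|det| = sum(s)

def log_p_theta(x):
    """ln p_theta(x) = ln p_Z(f_theta(x)) + ln|det J_{f_theta}(x)|, p_Z = standard normal."""
    z, log_det = f_theta(x)
    log_pz = -0.5 * np.sum(z**2) - 0.5 * d * np.log(2.0 * np.pi)
    return log_pz + log_det

D = rng.normal(size=(100, d))                        # stand-in for the training set
loss = -sum(log_p_theta(x) for x in D)               # L(D) = - sum_x ln p_theta(x)
print(loss)
```

The triangular Jacobian of the coupling layer is what keeps both the density and the loss cheap to evaluate; stacking such layers gives the expressive flows of the cited papers.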
Kullback-Leibler divergence
\( \text{DKL} (\rho_N| \rho) \sim \ln \frac{\left| \Sigma \right|}{\left| \Sigma_0 + c \text{Id} \right|} \)
\( \rightarrow \) distributions align when relevant directions in \(\Sigma \) are also present in \( \Sigma_0\)
\( \rightarrow \) this happens once \( N \asymp d \)
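A numerical sketch of this comparison (my own, not from the paper): the exact Gaussian KL divergence between \( \rho_N = \mathcal{N}(\mu_0, \Sigma_0 + c\,\text{Id}) \) and \( \rho = \mathcal{N}(\mu, \Sigma) \), with \( \mu_0, \Sigma_0 \) the empirical mean and covariance as defined just below; the dimension, \( N \), and \( c \) are arbitrary illustrative choices.

```python
# Sketch: exact KL divergence between two Gaussians, rho_N = N(mu0, Sigma0 + c*Id)
# and rho = N(mu, Sigma), and the log-determinant term quoted above.
# d, N, and c are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, N, c = 10, 50, 0.1

# True distribution rho = N(mu, Sigma).
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d + np.eye(d)
mu = rng.normal(size=d)

# Empirical mean and covariance of N training samples.
X = rng.multivariate_normal(mu, Sigma, size=N)
mu0, Sigma0 = X.mean(axis=0), np.cov(X, rowvar=False)
Sigma_N = Sigma0 + c * np.eye(d)

Sigma_inv = np.linalg.inv(Sigma)
diff = mu - mu0
log_det_term = np.log(np.linalg.det(Sigma) / np.linalg.det(Sigma_N))
kl = 0.5 * (np.trace(Sigma_inv @ Sigma_N) - d + diff @ Sigma_inv @ diff + log_det_term)
print("DKL(rho_N | rho) =", kl, "; log-det term ln|Sigma|/|Sigma0 + c Id| =", log_det_term)
```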
Data drawn from \( \rho = \mathcal{N}\left( \mu, \Sigma \right) \)
\( \mu_0,\Sigma_0 = \) empirical mean and covariance of training data \( \neq \mu, \Sigma \)
\(\rightarrow\) Fully tractable model: Linear diffusion models
Linear models learn:
\(\rho_{N} \approx \mathcal{N} \left( \mu_0,\Sigma_0 +c\text{Id}\right) \)
\(L= \sum_t \bigg\langle||\epsilon -\epsilon_{\theta} \left(x_t,t\right) ||^2 \bigg\rangle_{\epsilon, x_0} \)
\(\epsilon_{\theta} \left(x_t,t\right)=W_t(x_t+b_t) \)
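A sketch of fitting this linear noise predictor by least squares on the denoising loss (my own simplification: one fixed time step, an assumed value of the noise level, and Gaussian toy data):

```python
# Sketch: fit the linear noise predictor eps_theta(x_t) = W x_t + w0 (an affine map,
# equivalent to the W_t (x_t + b_t) parametrization) at one fixed noise level by least
# squares on the denoising loss. The schedule value abar and the toy data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_train = 5, 2000
abar = 0.5                                     # assumed alpha_bar at the chosen time t

# Toy training data x_0 (here: a correlated Gaussian).
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d
x0 = rng.multivariate_normal(np.zeros(d), Sigma, size=n_train)

# Forward noising at level t: x_t = sqrt(abar) x_0 + sqrt(1 - abar) eps, eps ~ N(0, Id).
eps = rng.normal(size=(n_train, d))
xt = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

# Least-squares minimizer of ||eps - (W x_t + w0)||^2 over the training set.
X_aug = np.hstack([xt, np.ones((n_train, 1))])
coef, *_ = np.linalg.lstsq(X_aug, eps, rcond=None)
W, w0 = coef[:d].T, coef[d]

loss = np.mean(np.sum((eps - (xt @ W.T + w0)) ** 2, axis=1))
print("denoising loss per sample:", loss)
```

For Gaussian data the minimizer can also be written in closed form, which is what makes the linear model fully tractable.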