Building a model-selection criterion which survives replication
Alexandre René
rene@netsci.rwth-aachen.de
D-IEP seminar • 6 Nov 2025 • Amsterdam
Paper (HTML): https://alcrene.github.io/emd-paper
Paper (PDF): https://doi.org/10.1038/s41467-025-64658-7
These slides: https://slides.com/alexrene/truth-in-the-age-of-dl
PyPI package: https://pypi.org/project/emdcmp
Executable capsule: https://codeocean.com/capsule/0868474/tree/v1
Anecdotal observations
Conceive theory
Conceive experiment
Accumulate data
Compare
Make prediction
New experiment?
Assumptions
Symmetry
Conservation
Exchangeability
Anecdotal observations
Validate/falsify
Prinz et al., Nat Neurosci (2004)
René, Pyloric simulator, PyPI (2025)
Fitting parameters
→ Distinct local solutions
Different equations
Different parameters
Same equations
Different parameters
Rayleigh-Jeans
Planck
Standard statistical criteria
EMD criterion
selection criterion
We will use the risk \(\mathbb{E}_{\mathcal{M}_{\mathrm{true}}}[Q]\) to rank models
Some loss
\(θ_A\) subsumed into \(\mathcal{M}_A\)
Not all comparisons should be conclusive
Result is not consistent across replications
Intuition: More predictive accuracy ⇒ More reliable comparison
Why? If we know a source of variability, we can account for it in the model.
EMD assumption: Model discrepancies are due to unknown variability
Unknown sources of variability may change across experiments
\(R\): Risk
(lower is better)
Are these differences in risk all meaningful?
We assume we are given:
Data: \((x_i, y_i) \sim \mathcal{D}_{\mathrm{true}} \)
Pointwise loss: \(Q(x_i, y_i \mid \mathcal{M}_A) \to \mathbb{R}\)
(Empirical) risk: \(\mathbb{E}\bigl[Q(x_i, y_i \mid \mathcal{M}_A) \bigr] \approx \frac{1}{L} \;\sum\limits_{\mathclap{\qquad(x_i, y_i) \sim \mathcal{D}_{\mathrm{true}}}}\;\; Q(x_i, y_i \mid \mathcal{M}_A) \)
NB: \(θ\) subsumed into \(\mathcal{M}_A\)
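To make these definitions concrete, a minimal numpy sketch; the sinusoidal model, Gaussian noise, and negative-log-likelihood loss are illustrative stand-ins, not the paper's setup:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative data-generating process standing in for D_true
x = rng.uniform(0, 1, size=400)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)

# Candidate model M_A (its parameters θ_A are subsumed into the model)
def predict_A(x):
    return np.sin(2 * np.pi * x)
sigma_A = 0.1

# Pointwise loss Q(x_i, y_i | M_A): here the negative log likelihood
def Q(x, y):
    return 0.5 * ((y - predict_A(x)) / sigma_A) ** 2 + np.log(sigma_A * np.sqrt(2 * np.pi))

# Empirical risk: the average of Q over the L observed pairs
R_A = Q(x, y).mean()
print(f"Empirical risk R_A ≈ {R_A:.3f}")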
Discrepancy
For the purposes of calculating risk, we can reduce any model to \(q(Φ)\) without loss of information
EMD assumption (reframed): Candidate models represent that part of the experiment which we understand and control across replications
We can estimate \(R_A\) in two different ways:
Mixed \(q_A^*\)
Synth \(\tilde{q}_A\)
Repeat for each \(\mathcal{M}\)
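A toy sketch of the two estimates (all distributions and the loss below are hypothetical; only the construction of the two loss quantile functions matters):

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: observed data vs. data generated by model A itself
data = rng.normal(0.0, 1.2, size=500)     # from the (unknown) true process
synth = rng.normal(0.0, 1.0, size=5000)   # from model A's own simulator
lossA = lambda y: 0.5 * y ** 2            # some pointwise loss for model A

# Loss quantile functions (PPFs): Φ ∈ (0,1) ↦ loss quantile
q_mixed = lambda phi: np.quantile(lossA(data), phi)    # mixed q*_A (uses real data)
q_synth = lambda phi: np.quantile(lossA(synth), phi)   # synth q̃_A (model only)

# The risk is the integral of the quantile function over Φ;
# the gap between q*_A and q̃_A is what δ^EMD(Φ) quantifies
phi = np.linspace(0.001, 0.999, 999)
R_mixed = q_mixed(phi).mean()   # ≈ ∫ q*_A(Φ) dΦ on a uniform Φ grid
R_synth = q_synth(phi).mean()   # ≈ ∫ q̃_A(Φ) dΦ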
Any process \(\mathcal{Q}\) should produce valid quantile paths, i.e. paths monotone in \(Φ\)
There is no way to coax a Wiener process to yield what we need
Variance must not depend on \(Φ\), only on \(δ^{\mathrm{EMD}}(Φ)\)
Instead of accumulating increments left-to-right, we successively refine the interval
We draw increment pairs, under the constraint
\(Δq_{ΔΦ}(Φ) \stackrel{!}{=} Δq_{ΔΦ/2}(Φ) + Δq_{ΔΦ/2}(Φ+ΔΦ/2)\)
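A minimal numpy sketch of one refinement pass; the fixed Beta(2, 2) split is a placeholder for the moment-matched α, β derived below:

import numpy as np

rng = np.random.default_rng(2)

def refine(phis, qs, a=2.0, b=2.0):
    # One pass: replace each increment Δq by two sub-increments x·Δq and (1−x)·Δq,
    # so Δq_{ΔΦ/2}(Φ) + Δq_{ΔΦ/2}(Φ+ΔΦ/2) = Δq_{ΔΦ}(Φ) holds by construction.
    new_phis, new_qs = [phis[0]], [qs[0]]
    for p0, p1, q0, q1 in zip(phis[:-1], phis[1:], qs[:-1], qs[1:]):
        x = rng.beta(a, b)                  # compositional split in (0, 1)
        new_phis += [0.5 * (p0 + p1), p1]
        new_qs += [q0 + x * (q1 - q0), q1]  # midpoint value; keeps the path monotone
    return np.array(new_phis), np.array(new_qs)

# Successively refine a single interval into a full quantile path
phis, qs = np.array([0.0, 1.0]), np.array([0.0, 1.0])
for _ in range(8):
    phis, qs = refine(phis, qs)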
We need a compositional distribution
Mateu-Figueras et al., Distributions on the Simplex Revisited, 2021
The simplest 2-D compositional distribution is the beta distribution
Beta
Compositional form
By construction
Determine \(α\) and \(β\)
Because of the constraint, mean and variance are not natural statistics for compositional distributions
Mateu-Figueras et al., Distributions on the Simplex Revisited, 2021
Instead it is better to use the center and metric variance
Two equations ⇒ Solve for \(α\) and \(β\)
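A sketch of that solve. It relies on the exact identities \(\mathbb{E}[\log\tfrac{x}{1-x}] = ψ(α) - ψ(β)\) and \(\mathrm{Var}[\log\tfrac{x}{1-x}] = ψ_1(α) + ψ_1(β)\) for \(x \sim \mathrm{Beta}(α, β)\); translating the center and metric variance into the targets \((m, v)\) is assumed to follow the paper's convention:

import numpy as np
from scipy.optimize import root
from scipy.special import digamma, polygamma

def solve_alpha_beta(m, v):
    # Match the log-ratio mean and variance of Beta(α, β) to the targets (m, v)
    def equations(log_ab):
        a, b = np.exp(log_ab)           # work in log space to keep α, β > 0
        return [digamma(a) - digamma(b) - m,
                polygamma(1, a) + polygamma(1, b) - v]
    sol = root(equations, x0=[0.0, 0.0])  # start from α = β = 1
    return np.exp(sol.x)

alpha, beta = solve_alpha_beta(m=0.3, v=0.5)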
[Plot: vol(posterior) (“strength” of evidence) vs. dataset size]
Use the fact that \(B^\mathrm{EMD}_{AB}\) are true probabilities:
[Calibration plot: “theory” vs. “true” curves; white region]
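Concretely, \(B^{\mathrm{EMD}}_{AB} = P(R_A < R_B)\) can be estimated directly from the sampled risk distributions; a sketch with hypothetical samples:

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical samples from the two risk distributions (in practice: draw_R_samples)
R_A = rng.normal(1.00, 0.05, size=4000)
R_B = rng.normal(1.08, 0.05, size=4000)

# B^EMD_{AB} = P(R_A < R_B): probability that model A achieves the lower risk
Bemd_AB = np.mean(R_A[:, None] < R_B[None, :])
print(f"B^EMD_AB ≈ {Bemd_AB:.2f}")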
Repeat for each \(\mathcal{M}\)
Calibration
All of this can be automated
emdcmp on PyPI
from emdcmp import Bemd, make_empirical_risk, draw_R_samples
# Synthetic loss quantile functions: each model scored on data it generates itself
synth_ppfA = make_empirical_risk(lossA(modelA.generate(Lsynth)))
synth_ppfB = make_empirical_risk(lossB(modelB.generate(Lsynth)))
# Mixed loss quantile functions: each model scored on the observed data
mixed_ppfA = make_empirical_risk(lossA(data))
mixed_ppfB = make_empirical_risk(lossB(data))
# Probability that model A achieves a lower risk than model B
Bemd(mixed_ppfA, mixed_ppfB, synth_ppfA, synth_ppfB, c=c)
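(Here c is the sensitivity constant of the EMD process; picking a defensible value of c is what the calibration step above is for. The imported draw_R_samples gives access to the sampled risk distributions behind \(B^{\mathrm{EMD}}\).)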
Chair of Computational Network Science
(Prof. Michael Schaub)
netsci.rwth-aachen.de
Alexandre René
rene@netsci.rwth-aachen.de
www.arene.ca
HOOC Workshop takes place here
(circa August 2026)
Multiple LP candidates with similar responses
Prinz et al., Nat Neurosci (2004)
René, Pyloric simulator, PyPI (2025)
8D Parameter sweep
…
René et al., Neural Comp (2020)
Back-propagation through time
At a large scale, what kinds of variations do we want to account for?
High-level
How do we define/quantify these variations and the selection objective?
Specific
Higher-level assessment.
These follow from the choice of paradigm.
Functional
Bayesian information criterion
aka model evidence
minimum description length
Akaike information criterion
expected log pointwise predictive density
Ignoring: model vs. discrete params vs. continuous params
See esp. “Holes in Bayesian Statistics”, Gelman, Yao, J. Phys. G (2020)
[Diagram: prior over models, posterior over params, vol(posterior)]
Dataset size
“Strength” of evidence
Would you confidently select the Planck model based on these data?
Why not?
And yet…
Statistical criteria are descriptive
They consider only the data we have today, not those we will collect tomorrow