Lessons learned in ML and Cosmology

IAIFI Fellow, MIT

Carolina Cuesta-Lazaro

Art: "A philosopher" by Salomon Koninck

Unicorns, rainbows and the real Universe

\Lambda \mathrm{CDM}

Carolina Cuesta-Lazaro IAIFI/MIT @ DL in Solar Physics 2024

["DESI 2024 VI: Cosmological Constraints from the Measurements of Baryon Acoustic Oscillations" arXiv:2404.03002]

What role did Machine Learning play?

Dark Energy is constant over time

DESI's Dark Energy constraints

Astrophysics dominates Simulation-based Inference


Dataset Size = 1 

Can't poke it in the lab 

Simulations

Bayesian statistics

Cosmology is hard


1-Dimensional

Machine Learning

Secondary anisotropies

Galaxy formation

Intrinsic alignments

DESI, DESI-II, Spec-S5

Euclid / LSST

Simons Observatory

CMB-S4

LIGO

Einstein

The era of Big Data Cosmology


Unicorn land: The promise of ML for Cosmology

Reality Check: Roadblocks & Bottlenecks

Outline of this talk

Mapping dark matter

Reverting gravitational evolution

Field Level Inference

Learning to represent baryonic feedback

Data-driven hybrid simulators

Unsupervised problems


[Image Credit: Claire Lamman (CfA/Harvard) / DESI Collaboration]


["A point cloud approach to generative modeling for galaxy surveys at the field level" 
Cuesta-Lazaro and Mishra-Sharma arXiv:2311.17141]

Base Distribution

Target Distribution

  • Sample
  • Evaluate
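The two operations listed above (Sample, Evaluate) can be illustrated with a deliberately tiny stand-in. This is a minimal sketch, assuming a one-dimensional affine map from a Gaussian base distribution; it is not the point-cloud model from the paper, and the class name `AffineFlow` and its parameters are hypothetical.

```python
import math
import random


class AffineFlow:
    """Toy 1D flow x = shift + scale * z, with base z ~ N(0, 1)."""

    def __init__(self, shift, scale):
        self.shift = shift
        self.scale = scale

    def sample(self, rng):
        # Sample: draw from the base distribution and push it through the map.
        z = rng.gauss(0.0, 1.0)
        return self.shift + self.scale * z

    def log_prob(self, x):
        # Evaluate: invert the map and apply the change-of-variables formula.
        z = (x - self.shift) / self.scale
        log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
        return log_base - math.log(abs(self.scale))


flow = AffineFlow(shift=2.0, scale=0.5)
rng = random.Random(0)
samples = [flow.sample(rng) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
```

A real generative model replaces the affine map with a learned transformation, but the interface (draw samples, evaluate log-density) stays the same.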

Long range correlations

Huge pointclouds (20M)

Homogeneity and isotropy

Siddharth Mishra-Sharma


Lesson #1: leverage data representations + symmetries

Fixed Initial Conditions / Varying Cosmology


p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}

Diffusion model


CNN

Diffusion

Increasing Noise

p(\sigma_8|\delta_m)
p(\sigma_8|\delta_m + 0.01 \epsilon)
p(\sigma_8|\delta_m + 0.02 \epsilon)
["Diffusion-HMC: Parameter Inference with Diffusion Model driven Hamiltonian Monte Carlo" 
Mudur, Cuesta-Lazaro and Finkbeiner]

 

Nayantara Mudur

["Your diffusion model is secretly a certifiably robust classifier" 
Chen et al arXiv:2402.02316]

CNN

Diffusion


Lesson #2: learning likelihoods can be more robust than posteriors

Do we actually need Density Estimation?

Just use binary classifiers!

x, \theta \sim p(x,\theta)
x, \theta \sim p(x)p(\theta)
y = 1
y = 0
\theta
x

Binary cross-entropy

Sample from simulator

Mix-up

Likelihood-to-evidence ratio


["Likelihood-free MCMC with Amortized Approximate Ratio Estimator" 
Hermans et al]

 

Lesson #3: Classifiers are awesome

r(x,\theta) = \frac{p(x\mid \theta)}{p(x)} = \frac{p(x,\theta)}{p(x)p(\theta)}

Likelihood-to-evidence ratio

p(x,\theta \mid y) = p(x,\theta) \, \, \mathrm{if} \, \, y=1
p(x,\theta \mid y) = p(x)p(\theta) \, \, \mathrm{if} \, \, y=0
p(y=1 \mid x,\theta) = \frac{p(x,\theta \mid y=1)p(y)}{\left(p(x,\theta \mid y=0) + p(x,\theta \mid y=1) \right)p(y)}
= \frac{p(x,\theta)}{p(x)p(\theta) + p(x,\theta)}
= \frac{r(x,\theta)}{r(x,\theta) + 1}
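The classifier trick can be checked numerically on a case where the ratio is known in closed form. This is a minimal sketch, assuming a toy Gaussian simulator (theta ~ N(0, 1), x | theta ~ N(theta, 1)) that is not from the talk, with a logistic regression on hand-picked quadratic features standing in for a neural network. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Toy simulator (an assumption): theta ~ N(0, 1), x | theta ~ N(theta, 1).
theta = rng.normal(size=n)
x = theta + rng.normal(size=n)

# y = 1: (x, theta) from the joint. y = 0: theta shuffled, so (x, theta) ~ p(x) p(theta).
xs = np.concatenate([x, x])
thetas = np.concatenate([theta, rng.permutation(theta)])
y = np.concatenate([np.ones(n), np.zeros(n)])


def features(x, theta):
    # Quadratic features suffice here because log r is quadratic for this toy.
    x, theta = np.asarray(x, dtype=float), np.asarray(theta, dtype=float)
    return np.stack([np.ones_like(x), x**2, x * theta, theta**2], axis=-1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


# Minimise the binary cross-entropy with a few Newton steps.
X = features(xs, thetas)
w = np.zeros(X.shape[1])
for _ in range(25):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X / len(y) + 1e-8 * np.eye(len(w))
    w -= np.linalg.solve(hess, grad)


def log_ratio(x, theta):
    # The classifier logit equals log r(x, theta), since p(y=1 | x, theta) = r / (r + 1).
    return features(x, theta) @ w


def true_log_ratio(x, theta):
    # Analytic answer for this toy: p(x | theta) = N(theta, 1), p(x) = N(0, 2).
    return -0.5 * (x - theta) ** 2 + 0.25 * x**2 + 0.5 * np.log(2.0)
```

Comparing `log_ratio(1.0, 1.0)` with `true_log_ratio(1.0, 1.0)` gives a quick sanity check that the logit recovers the likelihood-to-evidence ratio without any explicit density estimation.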


How good is my model?


["Do Deep Generative Models know what they don't know?" Nalisnick et al]

 

p(x)

Classifier

Simulations

\log p(x)

Observation

Lesson #4: What should x be?

Observed

Simulated

p_\phi(\rho_\mathrm{DM}|\rho_\mathrm{Galaxies})

1 to Many:

Distribution of Galaxies

Underlying Dark Matter 

["Debiasing with Diffusion: Probabilistic reconstruction of Dark Matter fields from galaxies" 
Ono et al arXiv:2403.10648]

 

Victoria Ono

Core Park


Lesson #5: Most problems are 1 to Many

[Figure: Truth, Sampled and Observed fields, with Power Spectrum and Cross correlation shown against Scale (k), from small to large]


TNG-300

True DM

Sample DM


["3D Reconstruction of Dark Matter Fields with Diffusion Models: Towards Application to Galaxy Surveys" 
Park, Mudur, Cuesta-Lazaro et al (in-prep)]

 

Posterior Sample

Posterior Mean

Debiasing Cosmic Flows


Reconstructing dark matter back in time

Stochastic Interpolants

NF

p(\delta_\mathrm{ICs}, \theta|\delta_\mathrm{Final}) =
p(\delta_\mathrm{ICs}|\delta_\mathrm{Final})
p(\theta|\delta_\mathrm{ICs},\delta_\mathrm{Final})


Lesson #6: Match two distributions that are already close!


p(x_1|x_0)
x_0

?

I_s = \alpha_s x_0 + \beta_s x_1 + \sigma_s W_s
x_1
s
["Probabilistic Forecasting with Stochastic Interpolants and Foellmer Processes" 
Chen et al arXiv:2403.13724 (Figure adapted from arXiv:2407.21097)]
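The interpolant I_s = \alpha_s x_0 + \beta_s x_1 + \sigma_s W_s can be drawn directly once a schedule is fixed. This is a minimal sketch, assuming the schedule \alpha_s = 1 - s, \beta_s = s^2, \sigma_s = 1 - s, picked here only because it pins I_0 = x_0 and I_1 = x_1; the schedule, array shapes, and function name are assumptions rather than the talk's settings.

```python
import numpy as np


def interpolant(x0, x1, s, rng):
    """Draw I_s = alpha_s x0 + beta_s x1 + sigma_s W_s at a single time s in [0, 1]."""
    # Assumed schedule, chosen only so that I_0 = x0 and I_1 = x1 exactly.
    alpha_s, beta_s, sigma_s = 1.0 - s, s**2, 1.0 - s
    # W_s is a Wiener process, so its marginal at time s is N(0, s).
    w_s = np.sqrt(s) * rng.normal(size=np.shape(x0))
    return alpha_s * x0 + beta_s * x1 + sigma_s * w_s


rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))               # stand-in for the conditioning field
x1 = x0 + 0.1 * rng.normal(size=(4, 8))    # a nearby target field
# Independent marginal draws at several times (not one coherent Brownian path).
marginals = [interpolant(x0, x1, s, rng) for s in np.linspace(0.0, 1.0, 5)]
```

Because the process starts at x_0 rather than at pure noise, the model only has to learn the (small) difference between two distributions that are already close.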

 

Simulating what you need (and sometimes what you want)

Guided simulations with fuzzy constraints


Simulating what you need (and sometimes what you want)

Can we run larger simulations? (DESI volumes)

At high resolution?

Faster?

All this work depends on simulations, but...

Thousands of them?


\frac{\mathrm{d} \mathbf{x}}{\mathrm{d} a } = \frac{1}{a^3 E(a)}\mathbf{v}
\frac{\mathrm{d} \mathbf{v}}{\mathrm{d} a } = \frac{1}{a^2 E(a)}\mathbf{F}(\mathbf{x},a)
\mathbf{F}(\mathbf{x},a) = \frac{3 \Omega_m}{2} \nabla \phi^\mathrm{PM}(\mathbf{x})

Gravitational evolution ODE

Particle-mesh
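The pair of equations above can be stepped forward in the scale factor with a simple kick-drift-kick scheme. This is a minimal sketch, assuming flat LCDM with radiation neglected for E(a) and a user-supplied `force(x, a)` callable in place of the particle-mesh solver, which is not reproduced here; the function names and the choice of integrator are assumptions.

```python
import numpy as np


def hubble_E(a, omega_m):
    """E(a) = H(a) / H0 for flat LCDM (radiation neglected; an assumption)."""
    return np.sqrt(omega_m * a**-3 + (1.0 - omega_m))


def evolve(x, v, force, a_start, a_end, n_steps, omega_m):
    """Kick-drift-kick integration of dx/da = v / (a^3 E), dv/da = F / (a^2 E)."""
    x = np.array(x, dtype=float)
    v = np.array(v, dtype=float)
    da = (a_end - a_start) / n_steps
    a = a_start
    for _ in range(n_steps):
        a_mid = a + 0.5 * da
        # Half kick with the force at the start of the step.
        v = v + 0.5 * da * force(x, a) / (a**2 * hubble_E(a, omega_m))
        # Full drift with the prefactor evaluated at the step midpoint.
        x = x + da * v / (a_mid**3 * hubble_E(a_mid, omega_m))
        a = a + da
        # Half kick with the force at the end of the step.
        v = v + 0.5 * da * force(x, a) / (a**2 * hubble_E(a, omega_m))
    return x, v


def no_force(x, a):
    # Free streaming: a stand-in for the particle-mesh force, used as a check.
    return np.zeros_like(x)


# Einstein-de Sitter check (omega_m = 1): the drift integral is 2 (a0^-1/2 - a1^-1/2).
x_end, v_end = evolve([0.0], [1.0], no_force, 0.5, 1.0, 1000, omega_m=1.0)
```

With zero force the velocity stays constant and the displacement matches the analytic drift integral, which makes this a convenient test before plugging in a real force solver.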


Particle-mesh

N-body


\mathbf{F}_\theta(\mathbf{x},a) = \frac{3 \Omega_m}{2} \nabla \left[\phi^\mathrm{PM}(\mathbf{x}) + \phi^\mathrm{corr}_\theta(\mathbf{x}, a, \phi^\mathrm{PM}, \delta^\mathrm{PM}) \right]

Hybrid Simulator - on the fly

\frac{\mathrm{d} \mathbf{x}}{\mathrm{d} a } = \frac{1}{a^3 E(a)}\mathbf{v}
\frac{\mathrm{d} \mathbf{v}}{\mathrm{d} a } = \frac{1}{a^2 E(a)}\mathbf{F}(\mathbf{x},a)

Gravitational evolution ODE

Trained to match particle velocities and positions: DIFFERENTIABLE
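The structure of the corrected force, a learned term added to the particle-mesh potential before taking the gradient, can be sketched on a one-dimensional periodic mesh. This is a minimal sketch under several assumptions: the Poisson equation is normalised as \nabla^2 \phi = \delta, the sign follows the force expression on the slide, and `correction` is a hypothetical callable standing in for the trained network \phi^\mathrm{corr}_\theta (here it sees only \phi^\mathrm{PM} and \delta^\mathrm{PM}, not positions or scale factor).

```python
import numpy as np


def wavenumbers(n, box_size):
    # Angular wavenumbers of a periodic mesh with n cells.
    return 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)


def pm_potential(delta, box_size):
    """Solve nabla^2 phi = delta on a periodic 1D mesh using FFTs."""
    k = wavenumbers(delta.size, box_size)
    delta_k = np.fft.fft(delta)
    phi_k = np.zeros_like(delta_k)
    nonzero = k != 0
    phi_k[nonzero] = -delta_k[nonzero] / k[nonzero] ** 2
    return np.fft.ifft(phi_k).real


def hybrid_force(delta, box_size, omega_m, correction=None):
    """F = (3 Omega_m / 2) * grad(phi_PM + phi_corr), evaluated on the mesh."""
    phi = pm_potential(delta, box_size)
    if correction is not None:
        # Hypothetical stand-in for the neural-network correction to the potential.
        phi = phi + correction(phi, delta)
    k = wavenumbers(delta.size, box_size)
    grad_phi = np.fft.ifft(1j * k * np.fft.fft(phi)).real
    return 1.5 * omega_m * grad_phi


n, box = 64, 2.0 * np.pi
grid = np.arange(n) * box / n
delta = np.cos(grid)
f_pm = hybrid_force(delta, box, omega_m=0.3)
f_hybrid = hybrid_force(delta, box, omega_m=0.3, correction=lambda phi, d: 0.1 * phi)
```

Every operation here is an FFT or an elementwise product, so the same structure written in an autodiff framework stays differentiable end to end, which is what allows training against N-body positions and velocities.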


Particle-mesh

N-body

Hybrid ML-Simulator

"Nbodyify: Adaptive mesh corrections for PM simulations" Cuesta-Lazaro, Modi (in prep)


Lesson #7: Substantial speed ups without accuracy loss are very hard to achieve

What is the space of plausible solutions and how do we search it?

Differentiable Galaxies ODEs

Our best bet

\frac{d \mathrm{Galaxies}}{dt} = \phi(\mathrm{Dark Matter}(t))
+ \phi_\theta(?)

Neural Network corrections

Finding the missing pieces

Data-driven hybrid simulators

Are these models predictive?


Compressing cosmological simulations

~ 10 trillion particles per snapshot stored

x Discrete snapshots

Can we learn compressed continuous representations with Neural Fields?


How do we learn what is the robust information?

Simulating dark matter is easy!

"Atoms" are hard :(

N-body Simulations

Hydrodynamics

Can we improve our simulators in a data-driven way?

How well can we simulate the Universe?

(if cold!)


 ~ Gpc

pc

kpc

Mpc

Gpc

[Video credit: Francisco Villaescusa-Navarro]

Gas density

Gas temperature


[Figure: \langle\mathrm{True}\,\,\mathrm{Pred}\rangle from small to large scales; panels labelled In-Distribution and Out-of-Distribution]


["Multifield Cosmology with Artificial Intelligence" 
Villaescusa-Navarro et al arXiv:2109.09747]

 

Out-of-Distribution

In-Distribution


\Omega_m, \sigma_8

Simulator 1

Simulator 2

z
p(
, z)

Dark Matter

Feedback

Learning to parametrise feedback

\Omega_m, \sigma_8

Contrastive


Lesson #8: Think carefully about the representations you care about

Parity violation cannot originate from gravity

7 \sigma
x
\mathrm{Mirror}(x)
["Measurements of parity-odd modes in the large-scale 4-point function of SDSS..." 
Hou, Slepian, Chan arXiv:2206.03625]
?
1 \sigma
["Could sample variance be responsible for the parity-violating signal seen in the BOSS galaxy survey?"
 Philcox, Ereza arXiv:2401.09523]


x
\mathrm{Mirror}(x)
\mathrm{max} \, \left( f_\theta(x) - f_\theta(\mathrm{Mirror}(x)) \right)

Real or Fake?

x or Mirror x?
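The quantity f_\theta(x) - f_\theta(\mathrm{Mirror}(x)) is parity-odd by construction, so its mean on held-out data vanishes for a parity-symmetric sample. This is a minimal sketch of the test statistic only: the training step that maximises over \theta is omitted, `f_fixed` is a hypothetical hand-written stand-in for the trained network, and both toy datasets are assumptions made for illustration.

```python
import numpy as np


def mirror(x):
    # Parity transformation for 1D toy "fields": reverse the last axis.
    return x[..., ::-1]


def parity_odd_score(f, x):
    # g(x) = f(x) - f(Mirror(x)) flips sign under Mirror for any f.
    return f(x) - f(mirror(x))


def detection_significance(f, x_test):
    # Mean score on held-out data, in units of its standard error.
    g = parity_odd_score(f, x_test)
    return g.mean() / (g.std(ddof=1) / np.sqrt(len(g)))


def f_fixed(x):
    # Hypothetical stand-in for a trained f_theta: a direction-sensitive cubic statistic.
    return np.sum(x[..., :-1] ** 2 * x[..., 1:], axis=-1)


rng = np.random.default_rng(0)
# Parity-symmetric toy data: independent Gaussian noise.
x_symmetric = rng.normal(size=(2000, 16))
# Parity-violating toy data: each cell depends on the square of its left neighbour's noise.
z = rng.normal(size=(2000, 17))
x_chiral = z[:, 1:] + 0.5 * (z[:, :-1] ** 2 - 1.0)

sig_symmetric = detection_significance(f_fixed, x_symmetric)
sig_chiral = detection_significance(f_fixed, x_chiral)
```

Evaluating the significance on a test split that the network never saw is what guards against a flexible f_\theta overfitting noise and reporting a spurious detection.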


Train

Test

Me: I can't wait to work with observations

Me working with observations:


Lesson #9: Low data regime + low signal to noise ratio = difficult to find data-efficient architectures

1. There is a lot of information in galaxy surveys that ML methods can access

2. We can tackle high-dimensional inference problems that were previously unattainable

3. Our ability to simulate limits the amount of information we can robustly extract

Hybrid simulators, forward models, robustness

Unsupervised problems: parity violation

 

Mapping dark matter, constrained simulations... Let's get creative!

Field level inference

Conclusions

