Corentin Cadiou\({}^{1,2}\)
Oscar Agertz\({}^{1}\)
Santi Roca-Fàbrega\({}^{1}\)
\({}^1\)Division of astrophysics, Department of physics, Faculty of Science, Lund University
\({}^2\)Institut d’Astrophysique de Paris, CNRS, France
Self-consistent formation of galaxies
Initial conditions
Simulated galaxies
NASA
\(<1\,\%\)
Exponentially-expanding Universe
Dark matter
Gas (H+He)
Stars
The simulation challenge
Time
Universe: \(13.7\,\mathrm{Gyr}\)
Supernova: \(1\,\mathrm{kyr}\)
⇒ \(10^7\) steps/simulation
NASA/CXC/A.Jubett
Mass
\(10^{11}\,M_\odot\): Milky-Way
\(1\,M_\odot\): typical star
⇒ \(10^{11}\) particles
Resolution:
\(\sim 10^{6}\,\mathrm{ly}\): cosmological env.
\(\sim 1\,\mathrm{ly}\): supernova scale
⇒ \(>6\) orders of magnitude resolution
⇒ \(10^{18}\) updates
\(230\,\mathrm{yr}\) on my laptop's CPU
\(18\,\mathrm{yr}\) on my GPU
\(11\,\mathrm{month}\) on an NVIDIA H100
of existing code (RAMSES) / Vintergatan simulation
Took:
RAMSES: Teyssier 2003
Vintergatan: Cadiou in prep, Agertz+21
Efficiency: green top 500
DM
gas
star
\(\sim 30-70\,\mathrm{Gflop/W}\)
\(\sim 5-50\,\mathrm{Gflop/W}\)
\(\sim2\,\mathrm{tCO_{2,eq}}\)
\(<1\,\mathrm{tCO_{2,eq}}\)?
Data: Top-500
Manage grid, simple computation
Computation-heavy
(hydro, gravity, …)
CPU
GPU
[…]
wasted
time
wasted
time
Typical approach: offloading
Dyablo's approach:
Amdahl's law: latency kills gains of parallelisation
Manage grid, simple computation
Computation-heavy
(hydro, gravity, …)
CPU
GPU
(or CPUs)
[…]
Typical approach: offloading
Dyablo's approach: “true” GPU computing, CPU as a puppeteer
Main developers: P. Kestener, A. Durocher, M. Delorme (CEA)
+ GINEA collaboration (ObAS, IAP, CRAL, CRIStAL …)
Image credits: Arnaud Durocher
Remember the 6 GPUs equivalent to 1500 CPUs?
Scaling test with replicated explosions (one per process)
\(80\times\) more operations
No gravity, no galaxy physics, no dark matter, no expansion of the Universe, …
But all the physics is missing and
important bottlenecks remain
CPU#1
CPU#2
CPUs: Computation-dominated
Exemple: solving gravity with conjugate gradient
\[{\color{cyan}\Delta} {\color{lime}\phi} = {\color{gray}4\pi G} {\color{salmon}\rho} \quad\Leftrightarrow\quad {\color{cyan}\textbf{A}} {\color{lime}x} = {\color{salmon}b} \]
GPU#1
GPU#2
GPUs: Communication-dominated
Exemple: solving gravity with conjugate gradient
\[{\color{cyan}\Delta} {\color{lime}\phi} = {\color{gray}4\pi G} {\color{salmon}\rho} \quad\Leftrightarrow\quad {\color{cyan}\textbf{A}} {\color{lime}x} = {\color{salmon}b} \]
GPU#1
GPU#2
GPUs: Communication-dominated
Ligo collaboration
In reality:
gravity propagates at speed of light
Exemple: solving gravity with conjugate gradient
\[{\color{cyan}\Delta} {\color{lime}\phi} = {\color{gray}4\pi G} {\color{salmon}\rho} \quad\Leftrightarrow\quad {\color{cyan}\textbf{A}} {\color{lime}x} = {\color{salmon}b} \]
GPU#1
GPU#2
GPUs: Communication-dominated
GPU#1
GPU#2
Work package 1: Propagate gravity from cell-to-cell
few global communications, more costly*
*cost is limited in galaxy formation models where timestep is already small
Hilbert curve partitioning:
Best for computation-dominated
Work package 2: Use Voronoi tesselation for load-balancing
Best for communication-dominated
\[\dfrac{\mathrm{d} X_i}{\mathrm{d} \tau} \propto \text{workload gradient}\]
First test in post-processing:
-40% cells at boundary
Work package 2: Use Voronoi tesselation for load-balancing
Best for communication-dominated
\(\mathrm{age}>10\,\mathrm{Myr}\)
skip particle
\(\mathrm{age}=10\,\mathrm{Myr}\)
supernova
\(\mathrm{age}<10\,\mathrm{Myr}\)
winds
Practical exemple: star explosions (simplified)
\(\mathrm{age}>10\,\mathrm{Myr}\)
skip particle
\(\mathrm{age}=10\,\mathrm{Myr}\)
supernova
\(\mathrm{age}<10\,\mathrm{Myr}\)
winds
Practical exemple: star explosions
if (age < 10) { // Winds
A;
B;
C;
} else if (age == 10) { // SN
D;
E;
F;
} else {
G;
}
Time
GPU “warp” (N threads)
inactive
inactive
in.
inactive
AMD EPYC 7413 | 128 GiB | 24 cores | 5 GiB/unit |
NVIDIA A40 | 48 GiB | 10,000 comp. units | 4.8 MiB/unit |
Exemple on Cosmos (Lund's cluster):
AMD EPYC 7413 | 128 GiB | 24 cores | 5 GiB/unit |
NVIDIA A40 | 48 GiB | 10,000 comp. units | 4.8 MiB/unit |
Exemple on Cosmos (Lund's cluster):
Log density
AMD EPYC 7413 | 128 GiB | 24 cores | 5 GiB/unit |
NVIDIA A40 | 48 GiB | 10,000 comp. units | 4.8 MiB/unit |
Exemple on Cosmos (Lund's cluster):
1.0001
0.9998
0.9999
1.0002
Log density
1.0001
0.9998
0.9999
1.0002
Work package 4: On-the-fly compression
Raw data
Update
Compressed data
Decompress
Compress
Raw data
Compressed data
1.0000
4
Mean
Common bits
+2
-2
-1
+1
Example: \(\Delta\) encoding
Requirements:
First test in post-processing: -30% compression
First test in post-processing:
-30% compression
Compression
Decompression
\( x_i := x_i \oplus x_\mathrm{mean} \)
1.0000
4
Mean
Common bits
+2
-2
-1
+1
Work package 4: On-the-fly compression
Porting codes to GPUs:
This project:
Promising new open source code: Dyablo (github/Dyablo-HPC/Dyablo)
We thank eSSENCE for the support
\({}^1\) Astronomy @ Lund
\({}^2\) Institut d'Astrophysique de Paris, France
Corentin Cadiou\({}^{1,2}\)
Santi Roca-Fàbrega\({}^1\)
Oscar Agertz\({}^1\)
☑ Hydrodynamics
(M. Delorme, A. Durocher)
☑ Gravity
(A. Durocher, M.-A. Breton)
☑ Cosmology
(O. Marchal, D. Aubert)
Today
Spring 2025?
“Full” cosmological sim on GPU
☐ Gas cooling (CC)
☐ Star form. & feedback (incl. CC)
+ RT, MHD, testing, setting up ICs, ...
Mellier+24
Schaye+24
Sims including baryons:
\(\sim 100\times\) smaller volumes than observed
There's a lot of freedom in the initial conditions.
\(\tilde\delta=\sum_k {\color{green} a_k} \exp(i\mathbf{k r} + {\color{red}\phi_\mathrm{k}})\)
Constraint:
\(\langle a_{\mathbf{k}} a_{\mathbf{k'}}^\dagger\rangle = P(k)\delta_\mathrm{D}(\mathbf{k}-\mathbf{k}')\)
Constraint:
\(\phi_\mathrm{k}\sim \mathcal{U}(0,2\pi)\)
Utilizing inverted initial conditions to enhance estimators accuracy.
Pontzen+15, see also Chartier+21, Gábor+23
\(\tilde\delta_\mathrm{S}=\sum_{k<k_\mathrm{thr}} {\color{green} a_k} \exp(i\mathbf{k r} + {\color{red}\phi_k})+\sum_{k>k_\mathrm{thr}} {\color{green} a_k} \exp(i\mathbf{k r} + {\color{red}\phi_k+\pi})\)
\( 3^{\mathrm{rd}}\)-order correction
/ up-to-\(2^{\mathrm{nd}}\)
Most likely field \(f\) with
Mathematically, \({\color{green}f}\) is the unique solution that satisfies:
CC+21a
Same halo, different location in the cosmic web
\(z=0\)
\(z=0\)
\(z=0\)
Storck, CC+24
Same halo, different location in the cosmic web
Halo mass
Intrinsic
alignment
\(\sigma\)
Storck, CC+24
Same halo, different location in the cosmic web
\(\sigma\)
⇒ 10-100% of I.A. signal driven by the cosmic web*
*couplings beyond standard predictions, from e.g. constrained TTT
Storck, CC+24
Std. dev / mean
Numerical noise
Population scatter
Same galaxy, different tidal enviroment
CC+21b
Study response at low-\(z\)
\(k>10\,h\,\mathrm{Mpc}^{-1}\)
Perturb tides at \(z\rightarrow\infty\)
on \(10-100\,h^{-1}\,\mathrm{Mpc}/h\)
LSS perturbations propagate quasi-linearly down to \(\mathrm{kpc}\) scales
⇒ Large \(k\) contain cosmological information
“When everything else fails, use more CPU time”
FLAMINGO (Schaye+23)
\(21\,\mathrm{Gpc}^3\), SWIFT
→ SPH approach
Millenium-TNG (Springel+22)
\(0.125\,\mathrm{Gpc}^3\), Arepo
→ moving-mesh approach
Grid-based code?
Quite a shame, because it's where France's experts are
(Dubois, Hennebelle, Bournaud, …)
?