Corentin Cadiou\({}^1\) (cadiou@iap.fr)
Lead developers: Maxime Delorme (CEA), Arnaud Durocher (CEA)
Scientific collaborators:
Dominique Aubert (ObAS), Olivier Marchal (ObAS), San Han (IAP), Guillaume Tcherniatinsky (IAP), Harley Katz (U. Chicago), …
\({}^1\)Institut d’Astrophysique de Paris, CNRS, France
Credits: M. Delorme
Data: Top-500
Typical approach: offloading
CPU: manage grid, simple computation
GPU: computation-heavy (hydro, gravity, …)
Wasted time on both sides while one waits for the other
Amdahl's law: latency kills the gains of parallelization
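For reference, Amdahl's law for a parallel fraction \(p\) spread over \(N\) units (the offloading latency above adds to the serial fraction \(1-p\)):
\[ S(N) = \frac{1}{(1-p) + p/N} \;\le\; \frac{1}{1-p} \]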
Dyablo's approach: “true” GPU computing, CPU as a puppeteer
The GPU (or CPUs) handles both the grid management / simple computation and the computation-heavy work (hydro, gravity, …); the CPU only orchestrates.
Physicists: write subgrid models
Computer scientists: optimize the code
Physics: independent of data layout, parallelization, ...
Code should provide high-level APIs
Implementations & models should be interchangeable depending on needs
CPU
GPU (or CPUs)
Cell- or block-based Adaptive Mesh Refinement: collection of \(N\times M\times L\) blocks
Traditional oct-based AMR
[...]
// Memory abstracted
Kokkos::View<double**> U(Ncell, Nfields);
// Parallelism abstracted
Kokkos::parallel_for(Ncell, KOKKOS_LAMBDA(size_t iCell) {
U(iCell, IRho) += ...
});
[...]
Single node / GPU, or …
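A self-contained sketch of the same pattern (not Dyablo's actual API; Ncell, Nfields and IRho are illustrative): the identical source compiles for a CPU backend (Serial/OpenMP) or a GPU backend (CUDA/HIP), only the build configuration changes.

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int Ncell = 1024, Nfields = 2, IRho = 0;   // illustrative sizes/indices

        // Memory abstracted: allocated in the default execution space's memory (host or device)
        Kokkos::View<double**> U("U", Ncell, Nfields);

        // Parallelism abstracted: dispatched as OpenMP threads, CUDA/HIP threads, ... depending on the backend
        Kokkos::parallel_for("init", Ncell, KOKKOS_LAMBDA(const int iCell) {
            U(iCell, IRho) = 1.0;
        });

        // Reductions are abstracted in the same way
        double total = 0.0;
        Kokkos::parallel_reduce("sum_rho", Ncell, KOKKOS_LAMBDA(const int iCell, double& acc) {
            acc += U(iCell, IRho);
        }, total);

        std::printf("backend: %s, sum(rho) = %g\n",
                    Kokkos::DefaultExecutionSpace().name(), total);
    }
    Kokkos::finalize();
    return 0;
}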
Solar convection (Delorme)
Zeldovich pancake (Aubert & Marchal)
Halo mass function (Aubert & Marchal)
Thermochem. (Cadiou & Katz)
KH instability (Delorme)
— weak scaling: solar convection
Image credits: Maxime Delorme
Ad Astra CPU (2× 96-core AMD EPYC Genoa)
Ad Astra GPU (4× AMD MI250, 2 GCDs each)
Note: the code was updated between the left and right panels, so raw performance cannot be compared directly.
Hydro only: \(\sim 200\,\mathrm{Mcell/s}\)
AthenaK \(\sim1000\,\mathrm{Mcell/s}\), Stone+24
Cholla-MHD \(\sim 200\, \mathrm{Mcell/s}\), Caddy+24
AREPO-RT \(\sim 1\,\mathrm{Mcell/s}\), Zier+24
Shamrock \(\sim 10\,\mathrm{Mcell/s}\), David-Cléris+25
24,576 cores
Physics: hydrodynamics, MHD, RT, gravity, cosmology, ...
Efficient scaling up to 10,000+ CPUs, 1000+ GPUs
Modular and future-proof code
Credits: M. Delorme
Dyablo-HPC/Dyablo
Corentin Cadiou (cadiou@iap.fr)
Implementations & models should be interchangeable depending on needs
Physicists: write subgrid models
Computer scientists: optimize the code
Physics: independent of data layout, parallelization, ...
Code should provide high-level APIs
— strong scaling
Image credits: Arnaud Durocher
Scaling test with replicated explosions (one per process)
\(80\times\) more operations
No gravity, no galaxy physics, no dark matter, no expansion of the Universe, …
But all the physics is missing and important bottlenecks remain
CPU#1
CPU#2
CPUs: Computation-dominated
Example: solving gravity with conjugate gradient
\[{\color{cyan}\Delta} {\color{lime}\phi} = {\color{gray}4\pi G} {\color{salmon}\rho} \quad\Leftrightarrow\quad {\color{cyan}\textbf{A}} {\color{lime}x} = {\color{salmon}b} \]
GPU#1
GPU#2
GPUs: Communication-dominated
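A minimal single-process conjugate-gradient loop (my own toy for a 1D Laplacian, not Dyablo's solver), with comments marking the steps that become global communications once the grid is distributed over several GPUs:

#include <vector>
#include <cmath>
#include <cstdio>

int main() {
    const int N = 128;
    std::vector<double> x(N, 0.0), b(N, 1.0), r(N), p(N), Ap(N);

    // A is the 1D Laplacian stencil: (A v)_i = 2 v_i - v_{i-1} - v_{i+1}
    auto apply_A = [&](const std::vector<double>& v, std::vector<double>& out) {
        for (int i = 0; i < N; ++i) {
            double left  = (i > 0)     ? v[i - 1] : 0.0;  // needs ghost cells -> halo exchange when distributed
            double right = (i < N - 1) ? v[i + 1] : 0.0;
            out[i] = 2.0 * v[i] - left - right;
        }
    };
    auto dot = [&](const std::vector<double>& u, const std::vector<double>& v) {
        double s = 0.0;
        for (int i = 0; i < N; ++i) s += u[i] * v[i];
        return s;  // would become a global reduction (e.g. MPI_Allreduce) -> latency at every iteration
    };

    r = b;   // r = b - A x, with x = 0
    p = r;
    double rs_old = dot(r, r);
    for (int it = 0; it < 1000 && std::sqrt(rs_old) > 1e-10; ++it) {
        apply_A(p, Ap);                          // 1 halo exchange per iteration
        double alpha = rs_old / dot(p, Ap);      // global reduction #1
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rs_new = dot(r, r);               // global reduction #2
        for (int i = 0; i < N; ++i) p[i] = r[i] + (rs_new / rs_old) * p[i];
        rs_old = rs_new;
    }
    std::printf("residual = %.3e\n", std::sqrt(rs_old));
    return 0;
}

Each iteration does only a few local operations per cell but two synchronising reductions plus one halo exchange; on GPUs the flops are cheap and the synchronisations dominate.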
LIGO collaboration
In reality: gravity propagates at the speed of light
GPU#1
GPU#2
Work package 1: propagate gravity from cell to cell
Few global communications, but more costly*
*the extra cost is limited in galaxy-formation models, where the timestep is already small
Hilbert curve partitioning:
Best for computation-dominated
Work package 2: Use Voronoi tessellation for load-balancing
Best for communication-dominated
\[\dfrac{\mathrm{d} X_i}{\mathrm{d} \tau} \propto \text{workload gradient}\]
First test in post-processing:
-40% cells at boundary
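A 1D toy of this idea (my own sketch, not the Dyablo scheme; the workload density, step size and iteration count are made up): generators drift along the discrete workload gradient until the per-cell workload evens out.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy workload density: flat background plus a "hot spot" around x = 0.7
static double w(double x) { return 1.0 + 8.0 * std::exp(-std::pow((x - 0.7) / 0.1, 2)); }

// Workload of generator i's 1D "Voronoi cell" (segment between midpoints), crude Riemann sum
static double cell_work(const std::vector<double>& X, int i) {
    int N = (int)X.size();
    double lo = (i == 0)     ? 0.0 : 0.5 * (X[i - 1] + X[i]);
    double hi = (i == N - 1) ? 1.0 : 0.5 * (X[i] + X[i + 1]);
    double s = 0.0;
    for (int j = 0; j < 100; ++j) s += w(lo + (hi - lo) * (j + 0.5) / 100) * (hi - lo) / 100;
    return s;
}

int main() {
    const int N = 8;
    std::vector<double> X(N), W(N), Xn(N);
    for (int i = 0; i < N; ++i) X[i] = (i + 0.5) / N;   // start from a uniform split

    auto imbalance = [&]() {
        double wmin = 1e30, wmax = 0.0;
        for (int i = 0; i < N; ++i) { double c = cell_work(X, i); wmin = std::min(wmin, c); wmax = std::max(wmax, c); }
        return wmax / wmin;
    };
    std::printf("max/min workload before: %.2f\n", imbalance());

    const double eta = 0.02;
    for (int step = 0; step < 3000; ++step) {
        for (int i = 0; i < N; ++i) W[i] = cell_work(X, i);
        for (int i = 0; i < N; ++i) {
            double Wl = (i > 0)     ? W[i - 1] : W[i];
            double Wr = (i < N - 1) ? W[i + 1] : W[i];
            Xn[i] = X[i] + eta * (Wr - Wl);              // dX_i/dtau ∝ workload gradient
        }
        X = Xn;
        std::sort(X.begin(), X.end());                   // keep generators ordered in [0, 1]
        for (double& xi : X) xi = std::clamp(xi, 1e-3, 1.0 - 1e-3);
    }
    std::printf("max/min workload after:  %.2f\n", imbalance());
    return 0;
}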
\(\mathrm{age}>10\,\mathrm{Myr}\)
skip particle
\(\mathrm{age}=10\,\mathrm{Myr}\)
supernova
\(\mathrm{age}<10\,\mathrm{Myr}\)
winds
Practical example: star explosions (simplified)
Practical example: star explosions
if (age < 10) { // Winds
A;
B;
C;
} else if (age == 10) { // SN
D;
E;
F;
} else { // age > 10: skip particle
G;
}
[Diagram: over time, a GPU “warp” (N threads) executes the branches one after another; threads not on the current branch sit inactive]
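A toy CPU-side cost model of the diagram above (my own illustration with made-up branch costs, not a GPU measurement): within one warp the branches are serialised, so a warp whose threads take different branches pays for all of them.

#include <array>
#include <cstdio>

int main() {
    constexpr int WARP = 32;                                  // threads executing in lock-step
    std::array<double, WARP> age{};
    for (int t = 0; t < WARP; ++t) age[t] = (t % 3) * 10.0;   // ages 0, 10, 20 Myr mixed in one warp

    const double t_winds = 3.0, t_sn = 5.0, t_skip = 1.0;     // made-up per-branch costs

    bool any_winds = false, any_sn = false, any_skip = false;
    for (int t = 0; t < WARP; ++t) {
        if (age[t] < 10.0)       any_winds = true;
        else if (age[t] == 10.0) any_sn    = true;
        else                     any_skip  = true;
    }

    // Divergent warp: every thread effectively waits for all branches taken by *any* thread in the warp
    double t_divergent = (any_winds ? t_winds : 0.0) + (any_sn ? t_sn : 0.0) + (any_skip ? t_skip : 0.0);
    // A warp whose 32 threads all take the same branch (e.g. winds) only pays that branch
    std::printf("mixed warp: %.1f time units, uniform winds-only warp: %.1f time units\n",
                t_divergent, t_winds);
    return 0;
}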
Example on Cosmos (Lund's cluster):
Hardware | Memory | Compute units | Memory per unit
AMD EPYC 7413 | 128 GiB | 24 cores | 5 GiB/unit
NVIDIA A40 | 48 GiB | 10,000 comp. units | 4.8 MiB/unit
[Figure: log-density map]
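Checking the last column (my arithmetic, rounded):
\[ \frac{128\,\mathrm{GiB}}{24\ \mathrm{cores}} \approx 5\,\mathrm{GiB/core}, \qquad \frac{48\,\mathrm{GiB}}{10{,}000\ \mathrm{units}} \approx 5\,\mathrm{MiB/unit} \]
i.e. roughly a thousand times less memory per compute unit on the GPU, presumably the motivation for the on-the-fly compression below.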
[Figure: cell values from the log-density field: 1.0001, 0.9998, 0.9999, 1.0002]
Work package 4: On-the-fly compression
Cycle: compressed data → decompress → raw data → update → compress → compressed data
Example: \(\Delta\) encoding
Mean: 1.0000 (common bits)
Deltas: +2, -2, -1, +1
Requirements: on-the-fly compression & decompression
Compression / decompression: \( x_i := x_i \oplus x_\mathrm{mean} \) (XOR is its own inverse)
First test in post-processing: \(-30\%\) data size
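A small illustration of the encoding above (my own sketch, not Dyablo's implementation), applied to the cell values from the slide:

#include <cstdint>
#include <cstring>
#include <cstdio>

static uint64_t bits_of(double x) { uint64_t u; std::memcpy(&u, &x, sizeof u); return u; }
static int leading_zeros(uint64_t u) { int n = 0; for (int b = 63; b >= 0 && !((u >> b) & 1); --b) ++n; return n; }

int main() {
    const double mean = 1.0000;
    const double x[4] = {1.0001, 0.9998, 0.9999, 1.0002};
    const uint64_t mbits = bits_of(mean);

    for (double xi : x) {
        uint64_t delta = bits_of(xi) ^ mbits;   // x_i := x_i XOR x_mean
        // Values close to the mean share their leading bits, so the delta typically starts
        // with a run of zero bits that a generic compressor squeezes well.
        std::printf("%.4f -> delta has %2d leading zero bits (out of 64)\n", xi, leading_zeros(delta));
        // Decompression is the same operation: (x_i XOR x_mean) XOR x_mean == x_i
    }
    return 0;
}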
Porting codes to GPUs:
This project:
Promising new open source code: Dyablo (github/Dyablo-HPC/Dyablo)
We thank eSSENCE for the support
\({}^1\) Astronomy @ Lund
\({}^2\) Institut d'Astrophysique de Paris, France
Corentin Cadiou\({}^{1,2}\)
Santi Roca-Fàbrega\({}^1\)
Oscar Agertz\({}^1\)
☑ Hydrodynamics
(M. Delorme, A. Durocher)
☑ Gravity
(A. Durocher, M.-A. Breton)
☑ Cosmology
(O. Marchal, D. Aubert)
Today
Spring 2025?
“Full” cosmological sim on GPU
☐ Gas cooling (CC)
☐ Star form. & feedback (incl. CC)
+ RT, MHD, testing, setting up ICs, ...
Mellier+24
Schaye+24
Sims including baryons:
\(\sim 100\times\) smaller volumes than observed
There's a lot of freedom in the initial conditions.
\(\tilde\delta=\sum_k {\color{green} a_k} \exp\left(i(\mathbf{k}\cdot\mathbf{r} + {\color{red}\phi_k})\right)\)
Constraint:
\(\langle a_{\mathbf{k}} a_{\mathbf{k'}}^\dagger\rangle = P(k)\delta_\mathrm{D}(\mathbf{k}-\mathbf{k}')\)
Constraint:
\(\phi_\mathrm{k}\sim \mathcal{U}(0,2\pi)\)
Using inverted initial conditions to enhance estimator accuracy.
Pontzen+15, see also Chartier+21, Gábor+23
\(\tilde\delta_\mathrm{S}=\sum_{k<k_\mathrm{thr}} {\color{green} a_k} \exp\left(i(\mathbf{k}\cdot\mathbf{r} + {\color{red}\phi_k})\right)+\sum_{k>k_\mathrm{thr}} {\color{green} a_k} \exp\left(i(\mathbf{k}\cdot\mathbf{r} + {\color{red}\phi_k+\pi})\right)\)
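A 1D toy of the two expressions above (my own sketch; the power-law \(P(k)\), box size and mode count are arbitrary, and real cosine modes are used for simplicity): one random field with \(a_k\) drawn from \(P(k)\) and uniform phases, and its partner with the phases shifted by \(\pi\) above \(k_\mathrm{thr}\).

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const double PI = std::acos(-1.0);
    const int Nmodes = 64, Nx = 16;
    const double L = 1.0, k_thr = 2.0 * PI * 8.0 / L;      // arbitrary threshold scale
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uniform(0.0, 2.0 * PI);
    std::normal_distribution<double> gauss(0.0, 1.0);

    std::vector<double> k(Nmodes), a(Nmodes), phi(Nmodes);
    for (int n = 0; n < Nmodes; ++n) {
        k[n]   = 2.0 * PI * (n + 1) / L;
        double Pk = std::pow(k[n], -2.0);                   // illustrative power law
        a[n]   = std::abs(gauss(rng)) * std::sqrt(Pk);      // amplitude with <a^2> ~ P(k)
        phi[n] = uniform(rng);                              // phase ~ U(0, 2*pi)
    }

    for (int i = 0; i < Nx; ++i) {
        double x = L * i / Nx, d = 0.0, d_inv = 0.0;
        for (int n = 0; n < Nmodes; ++n) {
            d     += a[n] * std::cos(k[n] * x + phi[n]);
            double shift = (k[n] > k_thr) ? PI : 0.0;       // invert the small scales only
            d_inv += a[n] * std::cos(k[n] * x + phi[n] + shift);
        }
        std::printf("x = %.3f  delta = %+7.3f  delta_inverted = %+7.3f\n", x, d, d_inv);
    }
    return 0;
}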
\( 3^{\mathrm{rd}}\)-order correction
/ up-to-\(2^{\mathrm{nd}}\)
Most likely field \(f\) with
Mathematically, \({\color{green}f}\) is the unique solution that satisfies:
CC+21a
Same halo, different location in the cosmic web
\(z=0\)
Storck, CC+24
[Plots: intrinsic alignment and scatter \(\sigma\) vs. halo mass]
⇒ 10-100% of I.A. signal driven by the cosmic web*
*couplings beyond standard predictions, from e.g. constrained TTT
Storck, CC+24
[Plot: std. dev. / mean, comparing numerical noise to population scatter]
Same galaxy, different tidal environment
CC+21b
Perturb tides at \(z\rightarrow\infty\) on \(10\text{--}100\,h^{-1}\,\mathrm{Mpc}\) scales
Study the response at low \(z\) at \(k>10\,h\,\mathrm{Mpc}^{-1}\)
LSS perturbations propagate quasi-linearly down to \(\mathrm{kpc}\) scales
⇒ Large \(k\) modes contain cosmological information
“When everything else fails, use more CPU time”
FLAMINGO (Schaye+23)
\(21\,\mathrm{Gpc}^3\), SWIFT
→ SPH approach
Millennium-TNG (Springel+22)
\(0.125\,\mathrm{Gpc}^3\), Arepo
→ moving-mesh approach
Grid-based code?
Quite a shame, because it's where France's experts are
(Dubois, Hennebelle, Bournaud, …)
Self-consistent formation of galaxies
Initial conditions
Simulated galaxies
NASA
\(<1\,\%\)
Exponentially-expanding Universe
Dark matter
Gas (H+He)
Stars
The simulation challenge
Time
Universe: \(13.7\,\mathrm{Gyr}\)
Supernova: \(1\,\mathrm{kyr}\)
⇒ \(10^7\) steps/simulation
NASA/CXC/A.Jubett
Mass
\(10^{11}\,M_\odot\): Milky Way
\(1\,M_\odot\): typical star
⇒ \(10^{11}\) particles
Resolution:
\(\sim 10^{6}\,\mathrm{ly}\): cosmological env.
\(\sim 1\,\mathrm{ly}\): supernova scale
⇒ \(>6\) orders of magnitude resolution
⇒ \(10^{18}\) updates
Took, with existing code (RAMSES / Vintergatan simulation):
\(230\,\mathrm{yr}\) on my laptop's CPU
\(18\,\mathrm{yr}\) on my GPU
\(11\,\mathrm{months}\) on an NVIDIA H100
RAMSES: Teyssier 2003; Vintergatan: Cadiou in prep., Agertz+21
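A rough cross-check of these runtimes (my arithmetic; the assumed throughput of \(10^{8}\) cell updates per second is illustrative):
\[ t \sim \frac{10^{18}\ \mathrm{updates}}{10^{8}\ \mathrm{updates/s}} = 10^{10}\,\mathrm{s} \approx 300\,\mathrm{yr} \]
which is the order of magnitude quoted for a single laptop CPU; the GPU and H100 figures correspond to throughputs roughly 10 to 300 times higher.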
Efficiency: Green500
DM
gas
star
\(\sim 30-70\,\mathrm{Gflop/W}\)
\(\sim 5-50\,\mathrm{Gflop/W}\)
\(\sim2\,\mathrm{tCO_{2,eq}}\)
\(<1\,\mathrm{tCO_{2,eq}}\)?