Lehman Garrison (SCC, formerly CCA)
FWAM, High-Performance Computing Case Study
Oct. 28, 2022
https://github.com/lemmingapex/N-bodyGravitySimulation
First 8 weeks of DESI Survey data (2021; Credit: D. Schlegel/Berkeley Lab)
Telescope surveys see that galaxies cluster together, forming a "cosmic web"
Credit: Benedikt Diemer & Phil Mansfield
Abacus cluster in the Harvard Astro Dept. basement
Lesson #0: Don't cook your hard drives
[Animated diagram: the simulation box is split into Slab 1 through Slab 7 along the X-axis; pipeline stages sweep across the slabs in order: Read, Near-field force (GPU), Far-field force, Kick/Drift/Multipoles, Write, with each stage running a few slabs behind the previous one.]
void slab_pipeline(){
    // Event loop: keep polling until every slab has been finished.
    // Each stage runs as soon as its precondition is satisfied.
    while(!Finish.doneall()){
        if(LaunchRead.precondition()) LaunchRead.action();
        if(NearForce.precondition()) NearForce.action();
        if(FarForce.precondition()) FarForce.action();
        if(Kick.precondition()) Kick.action();
        if(Drift.precondition()) Drift.action();
        if(Finish.precondition()) Finish.action();
    }
}
class NearForce : public Stage {
  public:
    // Ready once this slab and its RADIUS neighbors on each side have been read.
    int precondition(int slab){
        for(int i = -RADIUS; i <= RADIUS; i++){
            if(!Read.done(slab+i)) return 0;
        }
        return 1;
    }
    // Near-field forces are computed on the GPU.
    void action(int slab){
        SubmitToGPU(slab);
    }
};
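The Stage base class itself never appears on the slides; below is a hypothetical sketch, reconstructed only from the calls used above (precondition, action, done, doneall), of how such an interface could look. Everything beyond those four names is an assumption.

#include <vector>

// Hypothetical sketch of the Stage interface implied by the snippets above.
// Real Abacus code will differ; this just shows how per-slab completion
// tracking and the precondition()/action() pair could fit together.
class Stage {
  public:
    explicit Stage(int nslabs) : completed(nslabs, false) {}
    virtual ~Stage() = default;

    virtual int precondition(int slab) = 0;  // may this slab run yet?
    virtual void action(int slab) = 0;       // do the work for this slab

    bool done(int slab) const { return completed[slab]; }
    bool doneall() const {
        for(bool c : completed) if(!c) return false;
        return true;
    }

  protected:
    void mark_done(int slab) { completed[slab] = true; }

  private:
    std::vector<bool> completed;  // one completion flag per slab
};

Because preconditions like NearForce's require slab ± RADIUS to be read first, each stage naturally trails the one before it across the box, which is exactly the staggered sweep shown in the pipeline diagram.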
Abacus used to use static scheduling, but started choking on load imbalance in big simulations
[Diagram: a dual-socket node; CPUs 1–4 are split between Socket 1 and Socket 2, each socket with its own local memory.]
void KickSlab(int slab){
    // Which OpenMP schedule should we use here?
    #pragma omp parallel for schedule(???)
    for(int j = 0; j < nrow; j++){
        KickRow(slab,j);
    }
}
Lesson #1a: don't let your threads access memory on the wrong NUMA node

Given the following code:

#pragma omp parallel for
for(int j = 0; j < nrow; j++){
    KickRow(slab,j);
}

Rewrite as:

NUMA_FOR(nrow)
    KickRow(slab,j);
}   // closes the loop opened inside NUMA_FOR

where NUMA_FOR splits the iterations in half between the two NUMA nodes and hands them out dynamically within each node:

#define NUMA_FOR(N)
    // (schematic: a real macro also needs backslash continuations and _Pragma)
    _next_iter[0] = 0;    // node 0 takes the first half
    _last_iter[0] = N/2;
    _next_iter[1] = N/2;  // node 1 takes the second half
    _last_iter[1] = N;
    #pragma omp parallel
    for(int j; ; ){
        #pragma omp atomic capture
        j = _next_iter[NUMA_NODE]++;
        if(j >= _last_iter[NUMA_NODE])
            break;

Lesson #1b: don't write a custom OpenMP scheduler
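One way to get most of the same effect with stock OpenMP (a sketch, not what Abacus actually does): pin threads with OMP_PLACES=cores and OMP_PROC_BIND=close, and place data via first-touch initialization, so that using the identical static schedule in the init and compute loops keeps each thread on memory owned by its own node. The array, sizes, and stand-in kernel below are illustrative assumptions.

// Sketch: NUMA-friendly loop using only standard OpenMP features.
// Run with: OMP_PLACES=cores OMP_PROC_BIND=close ./a.out
const int NROW = 4096, ROWLEN = 4096;

void KickRowStandIn(double *row, int len){           // stand-in for KickRow(slab, j)
    for(int k = 0; k < len; k++) row[k] += 1.0;
}

int main(){
    double *rows = new double[(long)NROW * ROWLEN];  // uninitialized: pages not placed yet

    // First touch: each thread initializes the rows it will later process,
    // so those pages are allocated on its own NUMA node.
    #pragma omp parallel for schedule(static)
    for(int j = 0; j < NROW; j++)
        for(int k = 0; k < ROWLEN; k++)
            rows[(long)j*ROWLEN + k] = 0.0;

    // Compute: the identical static schedule gives each thread the same rows
    // it first-touched, i.e. local memory. (Static trades away some load
    // balance; dynamic with large chunks is a common middle ground.)
    #pragma omp parallel for schedule(static)
    for(int j = 0; j < NROW; j++)
        KickRowStandIn(rows + (long)j*ROWLEN, ROWLEN);

    delete[] rows;
}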
void multipoles(const double *pos, const int N, double *multipole){
    // Accumulate the Cartesian multipole moments sum_i x_i^a y_i^b z_i^c
    // for all (a,b,c) with a+b+c <= ORDER.
    // pos holds {x,y,z} triplets; multipole is flattened in (a,b,c) loop order.
    for(int i = 0; i < N; i++){
        double mx = 1.;
        int m = 0;  // flat index into multipole[]
        for(int a = 0; a <= ORDER; a++){
            double mxy = mx;
            for(int b = 0; b <= ORDER - a; b++){
                double mxyz = mxy;
                for(int c = 0; c <= ORDER - a - b; c++){
                    multipole[m++] += mxyz;
                    mxyz *= pos[3*i + 2];   // another power of z
                }
                mxy *= pos[3*i + 1];        // another power of y
            }
            mx *= pos[3*i + 0];             // another power of x
        }
    }
}
Performance: \(7\times10^6\) particles/sec (1 core)
# Code generator: unroll the (a,b,c) loops at code-generation time.
# m reproduces the same flat multipole index as the loop version above.
print('double mx = 1.;')
m = 0
for a in range(ORDER + 1):
    print('double mxy = mx;')
    for b in range(ORDER - a + 1):
        print('double mxyz = mxy;')
        for c in range(ORDER - a - b + 1):
            print(f'multipole[{m}] += mxyz;')
            m += 1
            print('mxyz *= pos[3*i+2];')
        print('mxy *= pos[3*i+1];')
    print('mx *= pos[3*i+0];')
\(19\times10^6\) particles/sec
# Same generator, emitting AVX-512 intrinsics: each __m512d covers 8 particles.
# Assumes the surrounding (not shown) code loads pos[0], pos[1], pos[2] as
# __m512d vectors of the 8 particles' x, y, z, and that multipole[] holds
# __m512d accumulators that are reduced across lanes afterwards.
print('__m512d mx = _mm512_set1_pd(1.);')
m = 0
for a in range(ORDER + 1):
    print('__m512d mxy = mx;')
    for b in range(ORDER - a + 1):
        print('__m512d mxyz = mxy;')
        for c in range(ORDER - a - b + 1):
            print(f'multipole[{m}] = '
                  f'_mm512_add_pd(multipole[{m}], mxyz);')
            m += 1
            print('mxyz = _mm512_mul_pd(mxyz, pos[2]);')
        print('mxy = _mm512_mul_pd(mxy, pos[1]);')
    print('mx = _mm512_mul_pd(mx, pos[0]);')
\(54\times10^6\) particles/sec
Lesson #2: don't trust the compiler to optimize complicated, performance-critical loops!
Other tricks: interleaving vectors, FMA, remainder iterations
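The slide doesn't expand on these tricks; here is a minimal sketch of all three on a generic sum-of-products kernel (not the multipole kernel itself): two interleaved accumulators so back-to-back FMAs can overlap, _mm512_fmadd_pd for the fused multiply-add, and a masked load so the last few elements don't need a scalar cleanup loop.

#include <immintrin.h>
#include <cstddef>

// Sum of a[i]*b[i]: interleaved accumulators, FMA, and masked remainder handling.
double dot(const double *a, const double *b, std::size_t n){
    __m512d acc0 = _mm512_setzero_pd();   // two independent accumulators let the
    __m512d acc1 = _mm512_setzero_pd();   // FMAs of adjacent iterations overlap
    std::size_t i = 0;
    for(; i + 16 <= n; i += 16){
        acc0 = _mm512_fmadd_pd(_mm512_loadu_pd(a+i),   _mm512_loadu_pd(b+i),   acc0);
        acc1 = _mm512_fmadd_pd(_mm512_loadu_pd(a+i+8), _mm512_loadu_pd(b+i+8), acc1);
    }
    // Remainder iterations: masked loads zero the unused lanes, no scalar loop needed.
    for(; i < n; i += 8){
        std::size_t rem = n - i < 8 ? n - i : 8;
        __mmask8 k = (__mmask8)((1u << rem) - 1);
        acc0 = _mm512_fmadd_pd(_mm512_maskz_loadu_pd(k, a+i),
                               _mm512_maskz_loadu_pd(k, b+i), acc0);
    }
    return _mm512_reduce_add_pd(_mm512_add_pd(acc0, acc1));
}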
A typical cache line is 64 bytes: keep each thread's writes at least that far apart from other threads'.
Lesson #3: don't share cache lines, especially in tight loops
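A minimal sketch of the padding idea (illustrative, not Abacus code): give each thread its own cache-line-sized accumulator so hot-loop writes from different threads never land on the same 64-byte line.

#include <omp.h>
#include <vector>

// alignas(64) rounds sizeof(PaddedSum) up to a full cache line, so adjacent
// entries of the vector below never share a line (no false sharing).
struct alignas(64) PaddedSum {
    double value = 0.0;
};

double sum(const double *x, int n){
    std::vector<PaddedSum> partial(omp_get_max_threads());
    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        #pragma omp for
        for(int i = 0; i < n; i++)
            partial[t].value += x[i];   // each thread writes only its own line
    }
    double total = 0.0;
    for(const PaddedSum &p : partial) total += p.value;
    return total;
}

Without the padding, all the per-thread partials would sit in one or two cache lines and every increment would ping-pong those lines between sockets.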
Lesson #4: don't send data to the GPU unless you have a lot of floating-point work to do on it
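A rough way to quantify "a lot of floating-point work", with hardware numbers that are illustrative assumptions (roughly PCIe-4.0-class bandwidth and a modern data-center GPU's FP64 rate), not measurements from the talk:

#include <cstdio>

int main(){
    // Assumed, order-of-magnitude hardware numbers; substitute your own.
    const double pcie_bytes_per_s = 25e9;   // ~effective PCIe 4.0 x16 bandwidth
    const double gpu_flop_per_s   = 10e12;  // ~sustained FP64 rate of the GPU

    // Break-even arithmetic intensity: below this many FLOPs per byte shipped
    // over the bus, the transfer costs more time than the GPU compute saves.
    const double flop_per_byte = gpu_flop_per_s / pcie_bytes_per_s;
    std::printf("need > ~%.0f FLOP per transferred byte\n", flop_per_byte);

    // A pairwise near-field kernel with ~10^4 neighbors does on the order of
    // 10^5 FLOPs per ~24-byte particle, comfortably above the bar; a simple
    // per-particle update (a few FLOPs per particle) is not worth the trip.
    return 0;
}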
Other topics: checksumming, compression and file formats, representation of particle positions, SSD TRIM, GPU intermediate accumulators