Multi-GPU Computing in JAX for Automatically Differentiable High Performance Computing

Francois Lanusse, @EiffL

Why do we need to start thinking about scaling up?

Motivation from ML perspective:
- Machine Learning models are getting better, but bigger
- The dimensionality of data increases (e.g. high resolution images, 3D)
Motivation from Physics perspective:
- Models become very large for Stage IV surveys (e.g. N-body sims)

=> In both cases, a given model will not fit on a single GPU!

Credit: Li et al. 2021

We also have access to a new generation of supercomputers

NERSC 9 system: Perlmutter

1536 GPU nodes, each one with 4x NVIDIA A100 (40GB)
High performance HPE/Cray Slingshot interconnect
Ranks in the top 10 most powerful systems in the world

How does parallel computing work?

GPUs are great for SIMD (Single Instruction Multiple Data)
- This requires many many simple cores, which all have access to the same memory
- If your problem fits in memory, this is the best solution!
When the data is so large that it cannot fit into a single computer, you need SPMD (Single Program Multiple Data)
- Each process can live on a different physical device, and is only in charge of storing and processing a fraction of the total data
- Processes need to talk to each other in order to complete the desired global computation (e.g. MPI)
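
To make the SPMD idea concrete, here is a minimal sketch that simulates four processes inside one ordinary Python program. The helpers `shard`, `local_work` and `allreduce_sum` are hypothetical stand-ins, not MPI calls; in a real run each rank would be a separate process and the reduction would go through a communication library such as MPI.

```python
# Simulated SPMD: every "rank" runs the same program on its own shard of the data.
N_RANKS = 4
data = list(range(1, 17))  # the full dataset, which in practice would not fit on one device

def shard(values, rank, n_ranks):
    """Return the contiguous slice of `values` owned by `rank`."""
    chunk = len(values) // n_ranks
    return values[rank * chunk:(rank + 1) * chunk]

def local_work(local_values):
    """The per-process part of the computation: a partial sum of squares."""
    return sum(v * v for v in local_values)

def allreduce_sum(partials):
    """Stand-in for a collective: every rank ends up with the global sum."""
    total = sum(partials)
    return [total for _ in partials]

# Each rank only ever touches its own shard...
partials = [local_work(shard(data, r, N_RANKS)) for r in range(N_RANKS)]
# ...and one communication step gives every rank the global answer.
results = allreduce_sum(partials)
print(results)
```

No rank ever holds the full dataset; only the small partial results are communicated.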

NVIDIA Ampere architecture

Technical solutions for fast communication between GPUs

CUDA-aware MPI: an implementation of the Message Passing Interface (MPI) standard that allows direct memory exchange between GPUs, potentially on different physical machines

NVIDIA Collective Communication Library (NCCL): NVIDIA-specific library, highly optimized for GPU communications directly within CUDA kernels

Where does JAX come into this picture?

JAX is awesome for several reasons:
- Allows you to write NumPy-style code that executes on GPU
- Provides automatic differentiation
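
As a small illustration of both points, the sketch below writes a loss with NumPy-style calls and asks JAX for its gradient. The linear model, the data and the name `loss` are made up for the example; `jax.grad` and `jax.jit` are the standard JAX transformations.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    """Mean squared error of a linear model, written with NumPy-style calls."""
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

x = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = jnp.array([1.0, 2.0, 3.0])
w = jnp.zeros(2)

# jax.grad differentiates with respect to the first argument (w) by default;
# jax.jit compiles the result with XLA for whichever backend is available.
grad_fn = jax.jit(jax.grad(loss))
g = grad_fn(w, x, y)
print(g)
```

The same script runs unchanged on CPU or GPU, depending on which backend JAX finds.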
How can we use it for large-scale High Performance Computing?

The Manual Way - MPI4JAX

MPI4JAX - Zero-copy MPI communication of JAX arrays

In a nutshell, provides a JAX wrapper around MPI primitives
- Compiled against MPI4PY, relies on CUDA-aware MPI for GPUDirect RDMA
- Primitives can be included directly in jitted code!

https://github.com/mpi4jax/mpi4jax

from mpi4py import MPI
import jax
import jax.numpy as jnp
import mpi4jax

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

@jax.jit
def foo(arr):
   arr = arr + rank
   arr_sum, _ = mpi4jax.allreduce(arr, op=MPI.SUM, comm=comm)
   return arr_sum

a = jnp.zeros((3, 3))
result = foo(a)

if rank == 0:
   print(result)

This code is executed on all processes, each of which has a single GPU:

mpirun -n 4 python myapp.py

How to make this work on Perlmutter?

Step I: Follow the instructions of the jax-perlmutter-tutorials GitHub repo to set up a JAX environment at NERSC:

$ module load python cudnn/8.2.0 nccl/2.11.4 cudatoolkit
$ pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_releases.html

Step II: For maximum convenience, "The mpi4py provided by the python or cray-python modules is not CUDA-aware. You will have to build CUDA-aware mpi4py in a custom environment using the instructions below." (source). So, you need to build it:

$ module load PrgEnv-gnu # In addition to the previously loaded modules
$ MPICC="cc -target-accel=nvidia80 -shared" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

Step III: Launch your Python script like so (from an salloc'd node, for instance):

$ srun -n 4 -c 32 --gpus-per-task 1 --gpu-bind=none python my_script.py

Examples of Applications

Find out more in the MPI4JAX documentation: https://mpi4jax.readthedocs.io/en/latest/shallow-water.html

Example of a physical nonlinear shallow-water model distributed on 4 GPUs (Häfner & Vicentini, 2021)
MPI is used to ensure proper boundary conditions between processes by performing a halo exchange
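
The halo exchange used in the shallow-water example can be illustrated with a deliberately simplified 1D sketch. This is not the mpi4jax shallow-water code: the ranks are simulated in a single process, and `halo_exchange` is a hypothetical stand-in for the MPI send/receive calls. It only shows why each block needs its neighbours' edge cells before applying a stencil.

```python
# 1D periodic domain split across simulated ranks, one ghost ("halo") cell per side.
N_RANKS = 4
field = [float(i * i % 7) for i in range(16)]

def stencil(padded):
    """3-point average applied to the interior of a halo-padded block."""
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]

def halo_exchange(blocks):
    """Stand-in for MPI send/recv: pad each block with its neighbours' edge cells."""
    n = len(blocks)
    return [[blocks[(r - 1) % n][-1]] + blocks[r] + [blocks[(r + 1) % n][0]]
            for r in range(n)]

chunk = len(field) // N_RANKS
blocks = [field[r * chunk:(r + 1) * chunk] for r in range(N_RANKS)]

# Distributed update: exchange halos, then each rank updates its own block.
distributed = [v for padded in halo_exchange(blocks) for v in stencil(padded)]

# Reference: the same stencil on the undivided periodic domain.
reference = stencil([field[-1]] + field + [field[0]])
print(distributed == reference)
```

Without the exchange, the cells at the edge of each block would be updated with missing neighbours and the distributed result would differ from the single-domain one.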

For a more cosmology-oriented problem: MPI parallelism in JaxPM

A key ingredient in fast N-body solvers is the ability to compute distributed 3D Fast Fourier Transforms.
- Requires transposing a 3D density field, each time redistributing the array differently on the processor mesh
- Requires AlltoAll operations
In a WIP branch, JaxPM has the tools needed to distribute a FastPM implementation.
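
Why transposes are the key operation can be seen on a single device: a 3D FFT is just three batches of 1D FFTs, with the axes rotated in between so that the axis being transformed is always the last one. The sketch below is a single-device illustration and not the JaxPM implementation; the function name `fft3d_by_transposes` is made up for the example. In the distributed setting each local transpose becomes an all-to-all exchange.

```python
import jax.numpy as jnp

def fft3d_by_transposes(arr):
    """3D FFT built from 1D FFTs over the last axis, rotating axes in between."""
    arr = jnp.fft.fft(arr)                        # transform z; layout [x, y, z]
    arr = jnp.fft.fft(arr.transpose([1, 2, 0]))   # layout [y, z, x]; transform x
    arr = jnp.fft.fft(arr.transpose([1, 2, 0]))   # layout [z, x, y]; transform y
    return arr                                    # output is left as [z, x, y]

# Small deterministic test field with shape [x, y, z] = [4, 6, 8].
field = jnp.sin(jnp.arange(4 * 6 * 8, dtype=jnp.float32).reshape(4, 6, 8))

out = fft3d_by_transposes(field)
# Reference: full 3D FFT, reordered from [x, y, z] to [z, x, y] for comparison.
reference = jnp.fft.fftn(field).transpose([2, 0, 1])
print(out.shape, bool(jnp.allclose(out, reference, atol=1e-3)))
```

Leaving the output in the transposed layout avoids a final redistribution, which is why distributed FFT routines often return transposed results.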

Dalcin et al. 2018

Density field computed on 8 GPUs with MPI4JAX

So, this works, but...

The developer (you!) needs to manually take care of all the collective operations needed to ensure the correct result.
For complex collectives (i.e. other than all gather) the gradients are not known a priori. The developer will have to implement custom gradients around the functions that have communications.
MPI has a well-known limitation that it does not handle messages larger than 2 GB. MPI4JAX currently doesn't implement a workaround for that.
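
To give an idea of what implementing custom gradients around a communicating function involves, here is a minimal single-device sketch using `jax.custom_vjp`. The function `exchange` is a hypothetical stand-in: a local axis swap plays the role of the data redistribution, and no MPI call is made. The point is only the pattern: the backward pass of a redistribution is the inverse redistribution applied to the incoming gradient.

```python
import jax
import jax.numpy as jnp

def _redistribute(x):
    # Stand-in for a collective such as an all-to-all: here just a local axis swap.
    return jnp.swapaxes(x, 0, 1)

@jax.custom_vjp
def exchange(x):
    return _redistribute(x)

def exchange_fwd(x):
    # Forward pass: run the "communication"; nothing needs saving for the backward pass.
    return _redistribute(x), None

def exchange_bwd(_, g):
    # Backward pass: send the cotangent back through the inverse redistribution.
    return (jnp.swapaxes(g, 0, 1),)

exchange.defvjp(exchange_fwd, exchange_bwd)

w = jnp.arange(6.0).reshape(3, 2)

def loss(x):
    return jnp.sum(exchange(x) * w)

x = jnp.ones((2, 3))
grad = jax.grad(loss)(x)
print(grad)
```

With a real collective, both `exchange_fwd` and `exchange_bwd` would contain communication calls, and getting the backward one right is the developer's responsibility.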

import jax.numpy as jnp
import mpi4jax

def fft3d(arr, comms=None):
    """ Computes forward FFT, note that the output is transposed
    """
    if comms is None:
        # Single-device fallback (assumed behaviour): same FFT sequence,
        # with local transposes in place of the all-to-all exchanges
        arr = jnp.fft.fft(arr).transpose([1, 2, 0])  # [y, z, x]
        arr = jnp.fft.fft(arr).transpose([1, 2, 0])  # [z, x, y]
        return jnp.fft.fft(arr)

    # One MPI communicator per axis of the 2D process mesh
    shape = list(arr.shape)
    nx = comms[0].Get_size()
    ny = comms[1].Get_size()

    # First FFT along z
    arr = jnp.fft.fft(arr)  # [x, y, z]
    arr = arr.reshape(shape[:-1]+[nx, shape[-1] // nx])
    arr = arr.transpose([2, 1, 3, 0])  # [y, z, x]
    arr, token = mpi4jax.alltoall(arr, comm=comms[0])
    arr = arr.transpose([1, 2, 0, 3]).reshape(shape)  # Now [y, z, x]

    # Second FFT along x
    arr = jnp.fft.fft(arr)
    arr = arr.reshape(shape[:-1]+[ny, shape[-1] // ny])
    arr = arr.transpose([2, 1, 3, 0])  # [z, x, y]
    arr, token = mpi4jax.alltoall(arr, comm=comms[1], token=token)
    arr = arr.transpose([1, 2, 0, 3]).reshape(shape)  # Now [z, x, y]

    # Third FFT along y
    return jnp.fft.fft(arr)

=> It's very artisanal, not very jaxy!

The Magical Near Future of JAX

The low-level technical side

JAX relies on the XLA (Accelerated Linear Algebra) library for compiling and executing jitted code.
Around 2021-2022, support for low-level collective operations was added to XLA, with NCCL as a backend on GPU clusters \o/

=> JAX is technically natively parallelisable through XLA communication primitives on machines like Perlmutter.
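
As a minimal sketch of what such a communication primitive looks like from Python, the example below uses `jax.lax.psum` inside `jax.pmap` to sum a value over all devices visible to the process. The function name `normalize` and the axis name `"devices"` are arbitrary choices for the example. On a machine with a single CPU device it still runs, with a mesh of size one.

```python
import jax
import jax.numpy as jnp

n = jax.local_device_count()  # number of devices visible to this process

def normalize(x):
    # psum is a collective: it sums x over every device along the named axis.
    return x / jax.lax.psum(x, axis_name="devices")

# One value per device: the leading axis is mapped over the devices.
x = jnp.arange(1.0, n + 1.0)
out = jax.pmap(normalize, axis_name="devices")(x)
print(out)
```

No MPI call appears anywhere: the collective is part of the compiled program, and gradients of `psum` are provided by JAX itself.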

The high-level JAX parallelism API

Things are still evolving a lot! JAX 0.4.0 is around the corner and will change everything!
The idea: You should be able to write your code as if it executed on a single GPU, and JAX should figure out the rest to make it run on many GPUs! Compatible with vmap, jit, grad, etc.
Up until JAX v0.3, two methods exist, xmap and pjit, each documented here:
- Named axes and easy-to-revise parallelism
- Introduction to pjit
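
For readers on a more recent JAX, here is a hedged sketch of the same "write it as single-device code" idea using the `jax.sharding` module. This assumes JAX 0.4 or later and is not part of the v0.3 API described here; the array sizes and the function `f` are made up for the example. It runs on any number of devices, including a single CPU.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# 1D device mesh over whatever devices are visible (a single CPU also works).
devices = np.asarray(jax.devices())
mesh = Mesh(devices, ("x",))

# Split the rows of the input across the 'x' mesh axis.
n_rows = 2 * devices.size
data = jnp.arange(n_rows * 4, dtype=jnp.float32).reshape(n_rows, 4)
data = jax.device_put(data, NamedSharding(mesh, PartitionSpec("x", None)))

@jax.jit
def f(x):
    # Written as ordinary single-device code; the compiler handles the partitioning.
    return 2 * x + 1

out = f(data)
print(out.shape, out.sharding)
```

The function body contains no reference to devices or communication; only the placement of the input says how the work is split.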

import jax
from jax.experimental import maps
from jax.experimental import PartitionSpec
from jax.experimental.pjit import pjit
import numpy as np

# Arrange the devices in a 4x2 logical mesh (requires exactly 8 visible devices)
mesh_shape = (4, 2)
devices = np.asarray(jax.devices()).reshape(*mesh_shape)
# 'x', 'y' axis names are used here for simplicity
mesh = maps.Mesh(devices, ('x', 'y'))

# Example input (assumed here), with dimensions divisible by the mesh shape
input_data = np.arange(8 * 2).reshape(8, 2)

f = pjit(
  lambda x: 2*x + 1,
  in_axis_resources=None,  # input is replicated on all devices
  out_axis_resources=PartitionSpec('x', 'y'))  # output is split over both mesh axes

# Sends data to accelerators based on partition_spec
with maps.Mesh(mesh.devices, mesh.axis_names):
  data = f(input_data)

pjit example

Conclusion

We should not shy away from thinking large-scale: it is already possible (e.g. mpi4jax), and will only get easier with time.
JAX is moving in the direction of automated parallelisation on GPU clusters!
Things to keep an eye on:
- New JAX Array mechanism with upcoming JAX v0.4.0
- jaxDecomp: JAX bindings to NVIDIA cuDecomp library (join me!)
