GPUs For All (Astrophysicists)!
DAp GPU Half Day, March 29th, 2023

Francois Lanusse, @EiffL






[Figure: CPU vs GPU comparison. Left: Intel Alder Lake i5 CPU, 6 cores. Right: NVIDIA GeForce RTX 3090 GPU, 10,496 CUDA cores grouped into Streaming Multiprocessors (128 cores each), with high-speed GPU memory, a connection to the CPU, and a fast interconnect to other GPUs.]


What is the difference between CPU and GPU cores?
- CPU cores are independent, full-featured, with complex instruction sets
- GPU cores are not independent and have a simpler instruction set
- All cores in a "warp" execute the same instruction at the same time!
- Single Instruction Multiple Data (SIMD) parallelism
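
As a minimal illustration of the SIMD idea (my sketch, not from the talk): in array frameworks like JAX, a single vectorized expression corresponds to one instruction applied across many data elements, which mirrors how GPU warps execute work.

import jax.numpy as jnp

# One instruction, many data: the same multiply-add is applied to every
# element of the array at once, much like a warp executing in lockstep
x = jnp.linspace(0.0, 1.0, 1_000_000)
y = 2.0 * x + 1.0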

Where GPUs excel
- Computer Graphics
- Deep Learning
- High Performance Computing

Why are GPUs so good for Deep Learning?
- Simple computations: Neural networks, at the end of the day, only compute very simple linear algebra operations.
- Batch processing: Training neural networks requires computing their loss function over a very large number of examples, which can be processed in "batches" (see the sketch below).
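
A toy sketch of how batching maps onto the GPU (illustrative, not from the talk), using jax.vmap to vectorize one simple layer over many examples:

import jax
import jax.numpy as jnp

# A toy "layer": one simple linear-algebra operation, as described above
W = jnp.ones((64, 64))

def layer(x):
    return jnp.tanh(W @ x)

# vmap vectorizes the layer over a batch of 1024 examples; on a GPU the
# whole batch is processed in parallel
batch = jnp.ones((1024, 64))
out = jax.vmap(layer)(batch)   # shape (1024, 64)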

[Figure: the attention layer, a cornerstone of ChatGPT]

Main technical takeaways
- Typically a GPU is a discrete accelerator optimized for parallel processing
- Used by the CPU to offload compute-intensive tasks
- Each GPU has its own memory (distinct from the main system RAM)
- You cannot program a GPU in the same way as you program a CPU!
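
A quick hedged illustration of the separate-memory point, using JAX (my example, not from the talk):

import jax
import jax.numpy as jnp
import numpy as np

print(jax.devices())          # e.g. [GpuDevice(id=0)] when a GPU is visible
x_host = np.ones(4)           # lives in system RAM
x_dev = jnp.asarray(x_host)   # transferred to the GPU's own memory (if present)
x_back = np.asarray(x_dev)    # explicit device-to-host copy back to RAM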

[Figure: illustration of one node of the Perlmutter supercomputer (NERSC)]
Why are we talking about GPUs today?
The road towards exascale!
NERSC 9 system: Perlmutter
- 1536 GPU nodes, each one with 4x NVIDIA A100 (40GB)
- High performance HPE/Cray Slingshot interconnect
- Ranks among the top 10 most powerful systems in the world (Oct. 2022: 93.75 PFlop/s)


=> GPUs are becoming essential for HPC, and key to reaching exascale!
- More and more computing centers will have GPU nodes

GPU Programming is Becoming Increasingly Easy!
- In 2012, Deep Learning started to boom, in part thanks to being able to run neural networks on GPUs
- Many frameworks have emerged to implement neural networks:
- Caffe (2012-2017)
- Theano (2007-2020)
- Caffe2 (2017-2018)
- CNTK (2016-2019)
- TensorFlow (2015-present)
- PyTorch (2016-present)
- JAX (2018-present)


=> Modern Deep Learning frameworks look like normal Python but execute code on the GPU!
- They can be used to perform all kinds of computations
Different ways to code on GPU
- If you are an HPC specialist and you care about performance (see Arnaud and Maxime's Dyablo talk):
  - The artisanal way (back in my day!): coding GPU kernels in CUDA C/C++/Fortran
  - The modern way: backend-agnostic, higher-level C++ frameworks like Kokkos
- If you are a computing-savvy astrophysicist, looking for performance, but who wants to code in Python as much as possible:
  - NumPy, SciPy -> CuPy (see the sketch after this list)
  - Scikit-learn -> RAPIDS cuML
- If you are an astrophysicist who wants to use GPUs, is not after every last millisecond, and wants to profit from autodiff (see Denise's and Benjamin's talks): JAX
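
For the NumPy -> CuPy route, a hedged sketch (assuming CuPy is installed and an NVIDIA GPU is present):

import cupy as cp   # near drop-in replacement for NumPy on NVIDIA GPUs

x = cp.random.randn(1_000_000)   # array allocated in GPU memory
y = cp.fft.fft(x)                # FFT executed on the GPU
y_host = cp.asnumpy(y)           # explicit copy back to system RAM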

JAX: NumPy + Autograd + XLA
- JAX uses the NumPy API
=> You can copy/paste existing code, works pretty much out of the box
- JAX is a successor of autograd
=> You can transform any function to get forward- or reverse-mode automatic derivatives (grad, Jacobians, Hessians, etc.)
- JAX uses XLA as a backend
=> Same framework as used by TensorFlow (transparently supports CPU, GPU, TPU execution)
import jax.numpy as np
from jax import grad

m = np.eye(10)                 # some matrix

def my_func(x):
    return m.dot(x).sum()      # plain NumPy-style linear algebra

x = np.linspace(0, 1, 10)
y = my_func(x)                 # runs on GPU if one is available

df_dx = grad(my_func)          # transform: gradient of my_func
g = df_dx(x)                   # evaluate the gradient at x
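
To actually benefit from the GPU, the same function can be jit-compiled with XLA (a short usage note, applying the standard jit transform to the function above):

from jax import jit

fast_func = jit(my_func)   # compiled by XLA on first call
y = fast_func(x)           # subsequent calls run the compiled kernel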


MPI4JAX - Zero-copy MPI communication of JAX arrays
- In a nutshell, provides a JAX wrapper around MPI primitives
- Compiled against mpi4py; relies on CUDA-aware MPI for GPUDirect RDMA
- Primitives can be included directly in jitted code!

from mpi4py import MPI
import jax
import jax.numpy as jnp
import mpi4jax

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

@jax.jit
def foo(arr):
    arr = arr + rank
    # All-reduce across all processes, called directly inside jitted code
    arr_sum, _ = mpi4jax.allreduce(arr, op=MPI.SUM, comm=comm)
    return arr_sum

a = jnp.zeros((3, 3))
result = foo(a)

if rank == 0:
    print(result)
This code is executed on all processes, each one driving a single GPU:

mpirun -n 4 python myapp.py
Find out more on the MPI4JAX doc: https://mpi4jax.readthedocs.io/en/latest/shallow-water.html
- Example of a physical nonlinear shallow-water model distributed over 4 GPUs (Häfner & Vicentini, 2021)
- MPI is used to ensure proper boundary conditions between processes by performing a halo exchange
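
A heavily hedged sketch of such a halo exchange with mpi4jax (illustrative names and a 1-D domain decomposition; not the shallow-water model's actual code):

from mpi4py import MPI
import jax.numpy as jnp
import mpi4jax

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

def halo_exchange(field):
    # Send our rightmost interior column to the right neighbour and
    # receive the left neighbour's edge into our left ghost column
    ghost, _ = mpi4jax.sendrecv(field[:, -2], field[:, 0],
                                source=left, dest=right, comm=comm)
    return field.at[:, 0].set(ghost)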
Examples of Applications
The near future: native SPMD in XLA
- JAX relies on the XLA (Accelerated Linear Algebra) library for compiling and executing jitted code.
- Around 2021-2022, support for low-level collective operations has been added to XLA, with NCCL as a backend on GPU clusters \o/
=> JAX is technically natively parallelisable through XLA communication primitives on machines like Jean Zay.
The JAX API is still in flux, but it's going to be awesome
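
A minimal sketch of these SPMD primitives through jax.pmap, the stable entry point at the time of writing (collectives map to NCCL on GPU clusters):

from functools import partial
import jax
import jax.numpy as jnp

@partial(jax.pmap, axis_name="devices")
def normalize(x):
    # jax.lax.psum is an XLA collective; on GPU clusters it runs as an
    # NCCL all-reduce across devices
    return x / jax.lax.psum(x, axis_name="devices")

# One value per local device; each device computes x[i] / sum(x)
x = jnp.arange(1.0, jax.local_device_count() + 1.0)
print(normalize(x))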


Example: jax-cosmo, GPU Accelerated Cosmology Library
- Looks like simple Python/NumPy code, but precisely computes cosmological predictions for weak lensing and galaxy clustering.
- Because it is coded in JAX you can:
- Run transparently on CPU/GPU/TPU
- Run large "batches" of computations in parallel
- Access gradients of cosmological likelihoods (useful for advanced MCMC samplers)

[Figure: DES Y1 3x2pt posterior, jax-cosmo HMC vs Cobaya MH (credit: Joe Zuntz)]
def log_posterior(theta):
    return gaussian_likelihood(theta) + log_prior(theta)

score = jax.grad(log_posterior)(theta)
Example: FlowPM, distributed GPU accelerated cosmological N-body solver
- High Performance Particle-Mesh N-body solver written in TensorFlow. Validated against reference FastPM.
- Supports distribution over multiple GPUs, tested on up to 256 GPUs on Jean-Zay
- Direct GPU-to-GPU communications through NCCL
- See Denise's talk for applications

Example: Deep Learning MRI reconstruction

- Implemented end-to-end in JAX
- Involves a large neural network (UNet)
- Requires performing a large number of FFTs to model the data acquisition of the MRI instrument
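
A hedged, illustrative sketch (not the talk's actual code) of why JAX fits here: the FFTs of the acquisition model run on the GPU under jit:

import jax
import jax.numpy as jnp

@jax.jit
def forward_op(image, mask):
    # Illustrative MRI-style forward model: go to k-space with a 2D FFT,
    # then keep only the sampled frequencies
    kspace = jnp.fft.fft2(image)
    return kspace * mask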
A few other examples in the wild
- WaveDiff: GPU accelerated Euclid PSF modeling code developed at CosmoStat (TensorFlow)
- JAX-GalSim: GPU accelerated clone of the GalSim galaxy image simulation code (JAX)
- DSPS: GPU accelerated Differentiable Stellar Population Synthesis (JAX)
- pmwd: GPU accelerated particle-mesh N-body code with derivatives (JAX)
Where can I get some GPUs?

Depending on your needs:
- If you just want to try a few things out: Google Colab: https://colab.research.google.com
  - Free access to GPUs and TPUs (within some limits)
  - Super easy to use, nothing to install!
  - Really great to allow others to reproduce your results/demos
  - (Check with your supervisor first that they are OK with you putting your research on Google platforms)
- If you want a free, secure, robust GPU platform, from getting started to doing serious compute: Jean Zay: http://www.idris.fr/eng/info/gestion/demandes-heures-eng.html
  - Supercomputer with thousands of GPUs (state-of-the-art ones!)
  - Dynamic Access (AD) has a low barrier to entry
  - Once everything is set up, very easy to use
  - Highly recommended over buying your own GPU machine!!!
Takeaways
- I usually expect a speedup of around 100x between CPU and GPU code (for my typical applications).
- GPUs can be used to accelerate a wide range of problems; they are not reserved only for HPC and deep learning!
- These days, getting started with GPUs is as easy as:

import jax.numpy as np

- Ask your doctor if GPU is right for you!
- Happy to chat about your use cases and orient you towards the best solution!

Wait, what about my Mac?
