CUDA Debugging Tools
A case study with jax-finufft
Lehman Garrison
SCC Group Meeting
January 30, 2024
The Setup
- Want to do high-dimensional inference, sampling, and optimization of non-uniform FFT problems
- Period fitting for astronomical time series
- MRI reconstruction
- Any model-fitting in Fourier space to unevenly sampled data
- Gradients of the likelihood enable greatly accelerated inference (e.g. Hamiltonian Monte Carlo)
- JAX JIT-compiles the Python model to arch-specific binary using LLVM via XLA -- including gradients
- Might want to include custom compiled code (e.g. finufft) rather than relying on Python JIT
The Problem
- Sporadic test failures on the GPU
- One possibility: JAX provides a CUDA stream that all operations are supposed to use. Is something accidentally using the default stream?
- CUDA streams
- Logical queue of operations for the GPU to execute in order
- Streams may execute asynchronously with respect to other streams (and host code), but operations in a stream are strictly sequential
(venv-23) scclin021:~/jax-finufft$ python -m pytest -v tests/
=============================================== test session starts ===============================================
platform linux -- Python 3.10.12, pytest-7.4.2, pluggy-1.3.0 -- /mnt/home/lgarrison/jax-finufft/venv-23/bin/python
cachedir: .pytest_cache
rootdir: /mnt/home/lgarrison/jax-finufft
collected 54 items
tests/ops_test.py::test_nufft1_forward[1-False-50-75--1] SKIPPED (1D transforms not implemented on GPU) [ 1%]
tests/ops_test.py::test_nufft1_forward[1-False-50-75-1] SKIPPED (1D transforms not implemented on GPU) [ 3%]
tests/ops_test.py::test_nufft1_forward[1-True-50-75--1] SKIPPED (1D transforms not implemented on GPU) [ 5%]
tests/ops_test.py::test_nufft1_forward[1-True-50-75-1] SKIPPED (1D transforms not implemented on GPU) [ 7%]
tests/ops_test.py::test_nufft1_forward[2-False-50-75--1] FAILED [ 9%]
tests/ops_test.py::test_nufft1_forward[2-False-50-75-1] FAILED [ 11%]
tests/ops_test.py::test_nufft1_forward[2-True-50-75--1] FAILED [ 12%]
tests/ops_test.py::test_nufft1_forward[2-True-50-75-1] FAILED [ 14%]
tests/ops_test.py::test_nufft1_forward[3-False-50-75--1] PASSED [ 16%]
tests/ops_test.py::test_nufft1_forward[3-False-50-75-1] PASSED [ 18%]
tests/ops_test.py::test_nufft1_forward[3-True-50-75--1] FAILED [ 20%]
tests/ops_test.py::test_nufft1_forward[3-True-50-75-1] FAILED [ 22%]
tests/ops_test.py::test_nufft2_forward[1-False-50-75--1] SKIPPED (1D transforms not implemented on GPU) [ 24%]
tests/ops_test.py::test_nufft2_forward[1-False-50-75-1] SKIPPED (1D transforms not implemented on GPU) [ 25%]
tests/ops_test.py::test_nufft2_forward[1-True-50-75--1] SKIPPED (1D transforms not implemented on GPU) [ 27%]
tests/ops_test.py::test_nufft2_forward[1-True-50-75-1] SKIPPED (1D transforms not implemented on GPU) [ 29%]
tests/ops_test.py::test_nufft2_forward[2-False-50-75--1] PASSED [ 31%]
tests/ops_test.py::test_nufft2_forward[2-False-50-75-1] PASSED [ 33%]
tests/ops_test.py::test_nufft2_forward[2-True-50-75--1] FAILED [ 35%]
tests/ops_test.py::test_nufft2_forward[2-True-50-75-1] FAILED [ 37%]
tests/ops_test.py::test_nufft2_forward[3-False-50-75--1] PASSED [ 38%]
tests/ops_test.py::test_nufft2_forward[3-False-50-75-1] PASSED [ 40%]
tests/ops_test.py::test_nufft2_forward[3-True-50-75--1] FAILED [ 42%]
tests/ops_test.py::test_nufft2_forward[3-True-50-75-1] FAILED [ 44%]
tests/ops_test.py::test_nufft1_grad[1-50-35--1] SKIPPED (1D transforms not implemented on GPU) [ 46%]
tests/ops_test.py::test_nufft1_grad[1-50-35-1] SKIPPED (1D transforms not implemented on GPU) [ 48%]
tests/ops_test.py::test_nufft1_grad[2-50-35--1] FAILED [ 50%]
tests/ops_test.py::test_nufft1_grad[2-50-35-1] FAILED [ 51%]
tests/ops_test.py::test_nufft1_grad[3-50-35--1] FAILED [ 53%]
tests/ops_test.py::test_nufft1_grad[3-50-35-1] FAILED [ 55%]
NVIDIA Nsight Systems
- We can systematically examine which operations are using which CUDA streams using NVIDIA Nsight Systems
- Nsight Systems is the successor to the NVIDIA Visual Profiler (nvvp)
- Collect information in CLI, analyze in GUI:
-
nsys profile python script.py
-
nsys-ui report1.nsys-rep
-
-
Works out of the box on FI Linux workstations
- https://docs.nvidia.com/nsight-systems/UserGuide/index.html
Fixing the CUDA Stream Usage
- cuFFT lets us set the stream a certain FFT plan should be executed on
- But it has to be set after plan creation, not before!
- Contrast with FFTW, where planning itself involves FFT execution
cufftHandle fftplan; cufftResult_t cufft_status; - cufftSetStream(fftplan, stream); cufft_status = cufftPlanMany(& fftplan, 1, n, inembed, 1, inembed[0], inembed, 1, inembed[0], cufft_type<T>(), maxbatchsize ); + cufftSetStream(fftplan, stream);
Debugging CUDA Illegal Memory Access
- Next test failure: a CUDA illegal memory access
-
cudaErrorIllegalAddress: an illegal memory access was encountered
- Looks like reading off the end of a device array
(venv-23) scclin021:~/jax-finufft$ python -m pytest -v tests/ops_test.py::test_issue37 ============================================= test session starts ============================================= platform linux -- Python 3.10.12, pytest-7.4.2, pluggy-1.3.0 -- /mnt/home/lgarrison/jax-finufft/venv-23/bin/python cachedir: .pytest_cache rootdir: /mnt/home/lgarrison/jax-finufft collected 1 item tests/ops_test.py::test_issue37 FAILED [100%] ================================================== FAILURES =================================================== ________________________________________________ test_issue37 _________________________________________________ jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exc eption. Set JAX_TRACEBACK_FILTERING=off to include these. The above exception was the direct cause of the following exception: def test_issue37():
@partial(jax.vmap, in_axes=(0, 0, None)) def cconv_test(f, xs, kernel): # f.shape = (n_grid, in_features) # x.shape = (n_grid, ndim) # kernel.shape = (*k_grid, in_features, out_features) ndim = xs.shape[-1] k_grid_shape = kernel.shape[:-2] f_ = f.astype( {np.float32: np.complex64, np.float64: np.complex128}[f.dtype.type] ).transpose() coords = [xs[..., i] for i in range(ndim)] f_hat = nufft1(k_grid_shape, f_, *coords, iflag=-1) c_hat = jnp.einsum("a...,...ab->b...", f_hat, kernel) return nufft2(c_hat, *coords, iflag=1) kernel = jnp.array(np.random.randn(32, 32, 32, 16, 16)) f = jnp.array(np.random.randn(8, 100, 16)) x = jnp.array(np.random.uniform(low=-np.pi, high=np.pi, size=(8, 100, 3))) a = cconv_test(f, x, kernel) > b = jax.jit(cconv_test)(f, x, kernel) E jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.custom_call' failed: nufft3d1f XLA extension have thrown an exception: inclusive_scan failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered; current tracing scope: custom-call.2; current profiling annotation: XlaModule:#hlo_module=jit_cconv_test,program_id=12#.
cuda compute sanitizer
- CUDA Compute Sanitizer is a suite of tools for runtime correctness checking
- memcheck, racecheck, initcheck, synccheck, etc
- Although the name invokes modern compiler sanitizers, behaves much more like Valgrind
- Don't need to recompile or instrument the executable
- Identifies the jax-finufft problem as a read off the end of an array
- Compute Sanitizer docs
(venv-23) scclin021:~/jax-finufft/scripts$ compute-sanitizer python jittest.py ========= COMPUTE-SANITIZER ========= Invalid __global__ read of size 4 bytes ========= at 0x31f0 in void cufinufft::spreadinterp::calc_inverse_of_global_sort_index_3d<float>(int, int, int, int, int, int, int, const int *, const int *, const T1 *, const T1 *, const T1 *, int *, int, int, int, int) ========= by thread (58,0,0) in block (0,0,0) ========= Address 0xed2002048 is out of bounds ========= and is 69 bytes after the nearest allocation at 0xed2001800 of size 2,052 bytes ========= Saved host backtrace up to driver entry point at kernel launch time ========= Host Frame: [0x3304e0] ========= in /lib64/libcuda.so.1 ========= Host Frame: [0x16987e] ========= in /mnt/home/lgarrison/jax-finufft/venv-23/lib/python3.10/site-packages/jax_finufft/jax_finufft_gpu.cpython-310-x86_64-linux-gnu.so ========= Host Frame: [0x1c9a8e] ========= in /mnt/home/lgarrison/jax-finufft/venv-23/lib/python3.10/site-packages/jax_finufft/jax_finufft_gpu.cpython-310-x86_64-linux-gnu.so ========= Host Frame:void cufinufft::spreadinterp::calc_inverse_of_global_sort_index_3d<float>(int, int, int, int, int, int, int, int const*, int const*, float const*, float const*, float const*, int*, int, int, int, int) [0x14dcc8] ========= in /mnt/home/lgarrison/jax-finufft/venv-23/lib/python3.10/site-packages/jax_finufft/jax_finufft_gpu.cpython-310-x86_64-linux-gnu.so ========= Host Frame:int cufinufft::spreadinterp::cuspread3d_subprob_prop<float>(int, int, int, int, cufinufft_plan_t<float>*) [0x150428] ========= in /mnt/home/lgarrison/jax-finufft/venv-23/lib/python3.10/site-packages/jax_finufft/jax_finufft_gpu.cpython-310-x86_64-linux-gnu.so ========= Host Frame:cufinufftf_setpts [0x15e2c7] ========= in /mnt/home/lgarrison/jax-finufft/venv-23/lib/python3.10/site-packages/jax_finufft/jax_finufft_gpu.cpython-310-x86_64-linux-gnu.so
cuda-gdb
- For interactive debugging, can use cuda-gdb
- Extends GDB to support device-side debugging
- Single-stepping execution
- Inspecting and modifying variables
- Jumping between host and device contexts
- Breakpoints and conditions
- Device hardware watchpoints *not* supported
- Can be slow to start sometimes
(venv-23) scclin021:~/jax-finufft/scripts$ cuda-gdb --args python jittest.py NVIDIA (R) CUDA Debugger CUDA Toolkit 12.2 release Portions Copyright (C) 2007-2023 NVIDIA Corporation GNU gdb (GDB) 12.1 [...] (cuda-gdb) start [...] CUDA Exception: Warp Illegal Address The exception was triggered at PC 0x7fff27d14a20 (spreadinterp3d.cuh:69) Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address. [Switching focus to CUDA kernel 51, grid 860, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 2, lane 0] 0x00007fff27d14ae0 in cufinufft::spreadinterp::calc_inverse_of_global_sort_index_3d<<<(1,1,1),(512,1,1)>>> (M=100, b in_size_x=16, bin_size_y=16, bin_size_z=2, nbinx=4, nbiny=4, nbinz=32, bin_startpts=0x302001000, sortidx=0x303002a00, x=0x7 ffaaa401980, y=0x7ffaaa402600, z=0x7ffaaa403280, index=0x303002800, pirange=1, nf1=64, nf2=64, nf3=64) at /mnt/home/lgarrison/jax-finufft/build/cp310-cp310-manylinux_2_28_x86_64/../../vendor/finufft/src/cuda/3d/spreadinterp3d.cuh:71 71 } (cuda-gdb)
Live cuda-gdb Demo
template <typename T> __global__ void calc_inverse_of_global_sort_index_3d( int M, int bin_size_x, int bin_size_y, int bin_size_z, int nbinx, int nbiny, int nbinz, const int *bin_startpts, const int *sortidx, const T *x, const T *y, const T *z, int *index, int pirange, int nf1, int nf2, int nf3 ) { int binx, biny, binz; int binidx; T x_rescaled, y_rescaled, z_rescaled; for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < M; i += gridDim.x * blockDim.x) { x_rescaled = RESCALE(x[i], nf1, pirange); y_rescaled = RESCALE(y[i], nf2, pirange); z_rescaled = RESCALE(z[i], nf3, pirange); binx = floor(x_rescaled / bin_size_x); binx = binx >= nbinx ? binx - 1 : binx; binx = binx < 0 ? 0 : binx; biny = floor(y_rescaled / bin_size_y); biny = biny >= nbiny ? biny - 1 : biny; biny = biny < 0 ? 0 : biny; binz = floor(z_rescaled / bin_size_z); binz = binz >= nbinz ? binz - 1 : binz; binz = binz < 0 ? 0 : binz; binidx = common::calc_global_index_v2(binx, biny, binz, nbinx, nbiny, nbinz); index[bin_startpts[binidx] + sortidx[i]] = i; } }
(cuda-gdb) info args M = 100 bin_size_x = 16 bin_size_y = 16 bin_size_z = 2 nbinx = 4 nbiny = 4 nbinz = 32 bin_startpts = 0x302001000 sortidx = 0x303002a00 x = 0x7ff840401980 y = 0x7ff840402600 z = 0x7ff840403280 index = 0x303002800 pirange = 1 nf1 = 64 nf2 = 64 nf3 = 64 (cuda-gdb) info locals x_rescaled = 59.3145142 z_rescaled = 270.639923 biny = 24 binidx = 2243 y_rescaled = 415.313934 binx = 3 binz = 134 (cuda-gdb) p bin_startpts[binidx] $1 = -499364480
Fixing CUDA Illegal Memory Access
- Problem turned out to be a host code error in how the device arrays were indexed for batched transforms
diff --git a/lib/kernels.cc.cu b/lib/kernels.cc.cu index 38ee6fc..ef4fcf5 100644 --- a/lib/kernels.cc.cu +++ b/lib/kernels.cc.cu @@ -22,11 +22,10 @@ void run_nufft(int type, const NufftDescriptor *descriptor, T *x, T *y, T *z, makeplan(type, ndim, descriptor->n_k, descriptor->iflag, descriptor->n_transf, descriptor->eps, &plan, opts); for (int64_t index = 0; index < descriptor->n_tot; ++index) { + int64_t i = index * descriptor->n_j; int64_t j = index * descriptor->n_j * descriptor->n_transf; int64_t k = index * n_k * descriptor->n_transf; - setpts(plan, descriptor->n_j, &(x[j]), y_index(y, j), z_index(z, j), 0, + setpts(plan, descriptor->n_j, &(x[i]), y_index(y, i), z_index(z, i), 0, NULL, NULL, NULL);
Summary
- CUDA debugging experience is pretty good these days
- CUDA versions of debugger and sanitizer, two linchpin tools for software development, are useful and mature
- NVIDIA Systems useful for a bird's-eye view of streams, threads, processes, and communication
- Was able to find and fix several CUDA bugs in finufft and jax-finufft with these tools
GPU tests passing on Jenkins!
CUDA Debugging Tools
By Lehman Garrison
CUDA Debugging Tools
- 95