Daniel Magee
Kyle E. Niemeyer
School of Mechanical, Industrial, & Manufacturing Engineering
Oregon State University
January 11 - AIAA SciTech 2017 - Grapevine, TX
Heterogeneous clusters are becoming more common
GPGPUs offer significant parallelism in a workstation
Important aspects:
"The swept rule for breaking the latency barrier in time advancing PDEs", Journal of Computational Physics, 2015
[Figure: space-time diagram with axes x and t]
How to effectively decompose the space-time domain for parallel processing?
[Figure: two space-time diagrams with axes x and t]
Ghost Region
| | Latency | Bandwidth |
|---|---|---|
| Analogy | Fixed cost | Variable cost |
| Best case | 700 ns | ~0.1 ns per double |
Node 0
Node 1
"The swept rule for breaking the latency barrier in time advancing PDEs", Journal of Computational Physics, 2015
Do as much work as possible on the data in the fastest memory before writing to the slowest memory.
[Figure: space-time diagram with axes x and t]
[Figure: GPU execution hierarchy, physical versus abstract. Labels: Streaming Multiprocessor (SM), Block, Warp (32 threads), Thread]
| Memory | Accessible to | Lifetime | Speed |
|---|---|---|---|
| Global | All threads | Application | Slowest |
| Shared | Block | Kernel | |
| Register | Warp | Kernel | Fastest |
Advance n (node length) timesteps with 2 communications
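A rough way to see the benefit is to count messages. The sketch below is host-only C++ and rests on one assumption that is not stated on the slide: a classic decomposition exchanges ghost regions once per timestep. The swept count uses the slide's figure of 2 communications per n timesteps. The function names are hypothetical.

```cpp
// Classic decomposition: one ghost-region exchange per timestep (assumption).
inline long classicMessages(long timesteps)
{
    return timesteps;
}

// Swept decomposition: 2 communications for every n timesteps, where n is
// the node length. A partial final cycle is counted as a full one.
inline long sweptMessages(long timesteps, long n)
{
    const long cycles = (timesteps + n - 1) / n;  // ceiling division
    return 2 * cycles;
}
```

For example, 1024 timesteps with n = 256 would take 1024 exchanges under the classic assumption and 8 under the swept count.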
To save every point requires:
If n=256 at double precision
[Figure: two space-time diagrams with axes x and t]
Now a triangle with base 256 can be stored in two rows of 256.
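The storage saving can be checked with a little arithmetic. The sketch below is host-only C++ and assumes each timestep drops one point from each side of the row (a three-point stencil), so a base of n points yields rows of n, n-2, ..., 2 points. That assumption and the function names are illustrative and not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>

// Points in a triangle with an n-point base whose rows shrink by one point
// on each side per timestep (three-point stencil assumed).
inline std::size_t trianglePoints(std::size_t n)
{
    assert(n % 2 == 0);  // the row widths n, n-2, ..., 2 need an even base
    std::size_t total = 0;
    for (std::size_t width = n; width >= 2; width -= 2) total += width;
    return total;
}

// Points needed when the triangle is folded into two rows of n.
inline std::size_t foldedPoints(std::size_t n)
{
    return 2 * n;
}

// Storage in bytes at double precision.
inline std::size_t bytesAsDouble(std::size_t points)
{
    return points * sizeof(double);
}
```

Under that assumption, n = 256 gives 16,512 points (132,096 bytes) when every point is saved, against 512 points (4,096 bytes) for the folded layout.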
This value is required for this calculation.
Extend the stencil in space, not time.
A flatter triangle with no intermediate spatial steps can be folded as in the previous example.
SharedGPU:
Hybrid:
Register:
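__global__ void classicKS(const REAL *ks_in, REAL *ks_out, bool finally)
{
    // Global thread ID: one thread per grid point
    int gid = blockDim.x * blockIdx.x + threadIdx.x;
    // Highest global index, also used as a wrap-around bit mask.
    // The mask is only valid when the total thread count is a power of 2.
    int lastidx = (blockDim.x * gridDim.x) - 1;
    int gidz[5];

    // Periodic indices of the five stencil points gid-2 ... gid+2
    #pragma unroll
    for (int k = -2; k < 3; k++) gidz[k + 2] = (gid + k) & lastidx;

    if (finally)
    {
        // Final step accumulates into the output array
        ks_out[gid] += finalStep(ks_in, gidz);
    }
    else
    {
        // Predictor step overwrites the output array
        ks_out[gid] = predictorStep(ks_in, gidz);
    }
}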
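The `& lastidx` trick in the kernel gives periodic boundaries without a modulo or a branch, but only when the total number of threads is a power of 2. The sketch below copies that index arithmetic into a host-only C++ function so it can be checked without a GPU. The name `stencilIndices` is hypothetical, and the result for negative operands relies on two's-complement integers.

```cpp
#include <array>
#include <cassert>

// Host-side copy of the kernel's periodic stencil indexing.
// totalThreads plays the role of blockDim.x * gridDim.x.
inline std::array<int, 5> stencilIndices(int gid, int totalThreads)
{
    // The mask trick requires a power-of-2 size.
    assert(totalThreads > 0 && (totalThreads & (totalThreads - 1)) == 0);
    const int lastidx = totalThreads - 1;
    std::array<int, 5> gidz{};
    for (int k = -2; k < 3; k++) gidz[k + 2] = (gid + k) & lastidx;
    return gidz;
}
```

With 8 points, the stencil for point 0 wraps to indices 6, 7, 0, 1, 2, and the stencil for point 7 wraps to 5, 6, 7, 0, 1.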
Problem sizes 2048–1,048,576 by powers of 2
Block sizes 32–1024 by powers of 2
Double precision
Results are the best block size for each scenario.
GPU: Nvidia Tesla K40c
745 MHz | 15 SM
CPU: Intel Xeon E5-2630
2.4 GHz | 8 cores | 16 threads
"The swept rule for breaking the latency barrier in time advancing PDEs", Journal of Computational Physics, 2015
MPI program from Alhubail and Wang:
Code available at github.com/Niemeyer-Research-Group/1DsweptCUDA