Conjugate Gradient Method
for solving
Ax = b
PHPC project
Nicolas Stucki
August 20th, 2020
Theoretical Analysis
Algorithm
Θ(1)
Θ(log n): parallel reduction
Θ(m): in parallel over n
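These Θ annotations presumably tag the operations in the CG pseudocode: scalar work, dot-product reductions, and the row-parallel matrix-vector product. As a reference point, a minimal sequential CG sketch in plain C++ with the corresponding costs marked in comments (variable names are illustrative, not the project's):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sequential conjugate gradient for a dense SPD matrix A (row-major).
std::vector<double> conjugate_gradient(const std::vector<double>& A,
                                       const std::vector<double>& b,
                                       std::size_t n, double tol = 1e-10) {
    std::vector<double> x(n, 0.0), r = b, p = b, q(n);
    double rr = 0.0;
    for (std::size_t i = 0; i < n; ++i) rr += r[i] * r[i];    // Θ(n) dot product
    for (std::size_t k = 0; k < n && std::sqrt(rr) > tol; ++k) {
        for (std::size_t i = 0; i < n; ++i) {                 // q = A p: Θ(n) per row,
            double s = 0.0;                                   // parallelizable over n rows
            for (std::size_t j = 0; j < n; ++j) s += A[i * n + j] * p[j];
            q[i] = s;
        }
        double pq = 0.0;
        for (std::size_t i = 0; i < n; ++i) pq += p[i] * q[i]; // dot product: Θ(log n)
                                                               // as a parallel reduction
        double alpha = rr / pq;                                // Θ(1) scalar update
        for (std::size_t i = 0; i < n; ++i) {                  // Θ(n) vector updates
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
        }
        double rr_new = 0.0;
        for (std::size_t i = 0; i < n; ++i) rr_new += r[i] * r[i];
        double beta = rr_new / rr;                             // Θ(1) scalar update
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```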
Complexity
(k = number of iterations, p = number of parallel tasks)

Sequential:
FLOPs: Θ(k·n^2) total, exactly (2k + 2)·n^2 + (9·k − 3)·n + (k − 2)
Memory: Θ(n^2) total

Parallel:
FLOPs: Θ(k·n^2/p) for p ≤ n; Θ(k·(n^2/p + log_2(p/n))) for n < p ≤ n^2
Memory: Θ(n^2/p) per parallel task
Synchronization: Θ(k·n) for p ≤ n
Concurrency: two Θ(n) operations, two Θ(1) operations, asynchronous memory copies
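One plausible accounting for the exact FLOP count, assuming one matrix-vector product (about 2n^2 FLOPs) plus a handful of Θ(n) dot products and vector updates per iteration:

(2k + 2)·n^2 + (9k − 3)·n + (k − 2) = k·(2n^2 + 9n + 1) + (2n^2 − 3n − 2)

where k·(2n^2 + 9n + 1) is the per-iteration cost and the remainder comes from setup (e.g. r_0 = b − A·x_0).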
Speedup
Amdahl's law (fixed problem size): ✘
Gustafson's law (scaled problem size):
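In their standard forms, with f the parallel fraction and p the number of processors:

S_Amdahl(p) = 1 / ((1 − f) + f/p)
S_Gustafson(p) = (1 − f) + f·p

Amdahl caps the speedup at 1/(1 − f) for a fixed n, while Gustafson's scaled-workload view grows linearly in p, the relevant regime when n scales with the machine.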
MPI
MPI implementation
Use p nodes
Split matrix and vector into p parts
Initialization
Matrix: MPI_Iscatter
Vectors: MPI_Ibcast
Iterations:
Sync: MPI_Alltoall
Reduce: MPI_Allreduce
Rest done sequentially
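A minimal sketch of this pattern, assuming a row-block split of A with n divisible by the number of ranks; the deck names the MPI calls, but all variable and function names here are illustrative:

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// One setup step plus the per-iteration communication of the MPI CG.
void cg_mpi_pattern(const std::vector<double>& A,   // full A, significant on root
                    std::vector<double>& p_local,   // this rank's n/size entries of p
                    int n, int size) {
    int rows = n / size;
    std::vector<double> A_rows(rows * n), p_full(n), q_local(rows);

    // Initialization: non-blocking scatter of the matrix rows.
    // (Vectors are broadcast the same way at init with MPI_Ibcast.)
    MPI_Request req;
    MPI_Iscatter(A.data(), rows * n, MPI_DOUBLE, A_rows.data(), rows * n,
                 MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    // Sync step: each rank replicates its block so that MPI_Alltoall leaves
    // the fully assembled p on every rank (an allgather expressed via Alltoall).
    std::vector<double> send(n);
    for (int r = 0; r < size; ++r)
        std::copy(p_local.begin(), p_local.end(), send.begin() + r * rows);
    MPI_Alltoall(send.data(), rows, MPI_DOUBLE, p_full.data(), rows,
                 MPI_DOUBLE, MPI_COMM_WORLD);

    // Local part of q = A p over this rank's rows.
    for (int i = 0; i < rows; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += A_rows[i * n + j] * p_full[j];
        q_local[i] = s;
    }

    // Reduce step: global dot product from local partial sums.
    double dot_local = 0.0, dot = 0.0;
    for (int i = 0; i < rows; ++i) dot_local += p_local[i] * q_local[i];
    MPI_Allreduce(&dot_local, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    // The remaining scalar and vector updates are done sequentially per rank.
}
```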
Also experimented with SIMD
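The deck does not say which SIMD flavor was tried; as one possibility, a dot product vectorized with AVX2/FMA intrinsics (the instruction set is an assumption):

```cpp
#include <immintrin.h>
#include <cstddef>

// AVX2 + FMA dot product, 4 doubles per vector lane; n need not be a multiple of 4.
double dot_avx2(const double* a, const double* b, std::size_t n) {
    __m256d acc = _mm256_setzero_pd();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(a + i), _mm256_loadu_pd(b + i), acc);
    double buf[4];
    _mm256_storeu_pd(buf, acc);                 // horizontal sum of the 4 lanes
    double sum = buf[0] + buf[1] + buf[2] + buf[3];
    for (; i < n; ++i) sum += a[i] * b[i];      // scalar tail
    return sum;
}
```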
MPI benchmarks
MPI + SIMD (on one machine)
MPI speedup
MPI performance
CUDA
Initialization
Initialize GPU
Copy all memory to GPU
Iterations
Asynchronously start all operations
Sync with GPU once per iteration
1 double copied from GPU to CPU
End
Copy result to CPU
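A sketch of that host-side structure; the kernel names and launch configuration are assumptions, and error handling is omitted:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernels, assumed defined elsewhere (names are illustrative).
__global__ void matvec_kernel(const double* A, const double* p, double* q, int n);
__global__ void dot_kernel(const double* a, const double* b, double* out, int n);
__global__ void update_kernel(double* x, double* r, double* p, const double* q,
                              double* scalars, int n);

// Host-side shape of the loop described above: launch all per-iteration
// kernels asynchronously, synchronize once per iteration, and copy back
// a single double to test convergence on the CPU.
void cg_cuda_loop(double* d_A, double* d_x, double* d_r, double* d_p,
                  double* d_q, double* d_scalars, double* h_x,
                  int n, double tol, int max_it) {
    dim3 block(256), grid((n + 255) / 256);
    double rr = 0.0;
    for (int k = 0; k < max_it; ++k) {
        matvec_kernel<<<grid, block>>>(d_A, d_p, d_q, n);          // q = A p
        dot_kernel<<<grid, block>>>(d_p, d_q, &d_scalars[0], n);   // p·q
        update_kernel<<<grid, block>>>(d_x, d_r, d_p, d_q, d_scalars, n);
        // The only per-iteration transfer: one double back to the CPU.
        // On the default stream this memcpy is also the once-per-iteration sync.
        // (update_kernel is assumed to leave the new ||r||^2 in d_scalars[1].)
        cudaMemcpy(&rr, &d_scalars[1], sizeof(double), cudaMemcpyDeviceToHost);
        if (rr < tol * tol) break;
    }
    cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);  // End
}
```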
CUDA implementation
Tiled matrix multiplication
Using CUDA blocks and grids
Aligned atomic addition operations
Dot products
Using standard parallel reduction (sketched after this list)
Scalars on single threaded kernel
Use 2 streams
Async memory copies
Kernel concurrency
Kernel atomicity and events
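A minimal version of that dot-product reduction: a shared-memory tree reduction per block, combined across blocks with one atomic add. This is the standard pattern the slide names, not necessarily the project's exact kernel; it assumes a launch with 256 threads per block and a result zeroed beforehand:

```cuda
__global__ void dot_kernel(const double* a, const double* b, double* result, int n) {
    __shared__ double cache[256];   // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double local = 0.0;
    // Grid-stride loop: each thread accumulates its share of the products.
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        local += a[i] * b[i];
    cache[threadIdx.x] = local;
    __syncthreads();
    // Standard tree reduction within the block's shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    // One atomic add per block combines the partial sums. double atomicAdd
    // needs compute capability 6.0+ (the GTX 1080 Ti below is 6.1).
    if (threadIdx.x == 0) atomicAdd(result, cache[0]);
}
```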
CUDA benchmarks
MPI + CUDA Budget
Matrix of size n = 10^6:
Sequential: 156 minutes
CUDA with one GeForce GTX 1080 Ti: 8.7 seconds, if the matrix would fit in memory
91 distributed GPUs:
107 ms with an efficiency of 0.9
9.6 seconds with an efficiency of 0.1
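One way to read the 107 ms figure, assuming time ≈ single-GPU time / (number of GPUs × efficiency):

8.7 s / (91 × 0.9) ≈ 107 ms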
End