Presentation 1
What are we doing here?
What do we mean by parallel numerical methods?
Numerically solving linear systems and friends?
Parallel Numerical Algorithms
Editors: Keyes, David E.; Sameh, Ahmed; Venkatakrishnan, V.
Published 1997
This book is sorta old...
One dimensional boundary value problem
Model Problem
\(-u^{\prime\prime}(x) + \sigma u(x) = f(x)\)
for \(0 < x < 1\), \(\sigma > 0\), with \(u(0) = u(1) = 0\)
We want to find a function \(u(x)\) satisfying the above.
One dimensional boundary value problem
Grid:
\(h := \frac{1}{N},\ x_j = jh,\ j = 0, 1, \cdots, N\)
\(x_0, x_1, x_2, \cdots, x_j, \cdots, x_N\)
Let \(f_i = f(x_i)\) and \(v_i \approx u(x_i)\); we want to recover the \(v_i\)
Taylor Series Approximation
\(u(x_{i+1}) = u(x_i) + hu^\prime(x_i) + \frac{h^2}{2}u^{\prime\prime}(x_i) + \frac{h^3}{6}u^{\prime\prime\prime}(x_i) + O(h^4)\)
\(u(x_{i-1}) = u(x_i) - hu^\prime(x_i) + \frac{h^2}{2}u^{\prime\prime}(x_i) - \frac{h^3}{6}u^{\prime\prime\prime}(x_i) + O(h^4)\)
Add these two together, then solve for \(u^{\prime\prime}\). This gives:
\(u^{\prime\prime}(x_i) = \frac{u(x_{i+1})-2u(x_i) + u(x_{i-1})}{h^2} + O(h^2)\)
Linear System
We approximate the original BVP with the finite difference scheme
\(\frac{-v_{i-1}+2v_i - v_{i+1}}{h^2} + \sigma v_i = f_i\)
With
\(v_0 = v_N = 0\)
for \(i=1,2,\cdots, N-1\)
Obtain an approximate solution by solving \(A\bm{v} = \bm{f}\) with:
\(A = \frac{1}{h^2}\begin{pmatrix} 2+\sigma h^2 & -1 & & \\ -1 & 2+\sigma h^2 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 2+\sigma h^2 \end{pmatrix}, \quad \bm{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_{N-1} \end{pmatrix}, \quad \bm{f} = \begin{pmatrix} f_1 \\ \vdots \\ f_{N-1} \end{pmatrix}\)
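Because the system is tridiagonal, it can be solved in \(O(N)\) operations. A minimal C sketch using the Thomas algorithm (LU factorization specialized to tridiagonal matrices; the function name and calling convention are our own):

```c
#include <stdlib.h>

/* Solve the model-problem system
 *   (-v[i-1] + 2 v[i] - v[i+1]) / h^2 + sigma * v[i] = f[i],
 * i = 1..N-1, with v[0] = v[N] = 0, via the Thomas algorithm.
 * f has length N+1; only the interior values f[1..N-1] are used.
 * Returns a newly allocated array v of length N+1. */
double *solve_model_problem(const double *f, double sigma, int N) {
    double h = 1.0 / N;
    double diag = 2.0 / (h * h) + sigma;     /* main diagonal of A  */
    double off  = -1.0 / (h * h);            /* sub/super diagonals */
    int n = N - 1;                           /* number of unknowns  */
    double *c = malloc(n * sizeof(double));  /* modified superdiag  */
    double *d = malloc(n * sizeof(double));  /* modified rhs        */
    double *v = calloc(N + 1, sizeof(double));

    /* Forward sweep: eliminate the subdiagonal. */
    c[0] = off / diag;
    d[0] = f[1] / diag;
    for (int i = 1; i < n; i++) {
        double m = diag - off * c[i - 1];
        c[i] = off / m;
        d[i] = (f[i + 1] - off * d[i - 1]) / m;
    }
    /* Back substitution; v[0] and v[N] stay zero. */
    v[n] = d[n - 1];
    for (int i = n - 1; i >= 1; i--)
        v[i] = d[i - 1] - c[i - 1] * v[i + 1];
    free(c);
    free(d);
    return v;
}
```

A handy sanity check: with \(\sigma = 0\) and \(f \equiv 1\), the exact solution is \(u(x) = x(1-x)/2\), and the scheme reproduces it exactly because the truncation error involves \(u^{\prime\prime\prime\prime} = 0\).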
We're concerned with (computationally) solving problems of the form
\(Ax=b\)
Most other problems can be restated or approximated in terms of solving a linear equation (or applying/inverting a linear operator)
Dense Matrix: Most elements in a matrix are non-zero
Sparse Matrix: Most elements in a matrix are zero
Direct Methods: Directly solving or inverting a problem
Iterative Methods: Applying an operation multiple times until convergence to a solution
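As a concrete example of an iterative method, here is a minimal Jacobi iteration sketch in C (the function name and fixed iteration count are our own choices; Jacobi converges, for instance, when \(A\) is strictly diagonally dominant):

```c
#include <stdlib.h>

/* Jacobi iteration for A x = b (A is n x n, row-major):
 *   x[i] <- (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i],
 * repeated until (one hopes) convergence. */
void jacobi(const double *A, const double *b, double *x, int n, int iters) {
    double *xn = malloc(n * sizeof(double));
    for (int it = 0; it < iters; it++) {
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < n; j++)
                if (j != i)
                    s -= A[i * n + j] * x[j];
            xn[i] = s / A[i * n + i];
        }
        for (int i = 0; i < n; i++)
            x[i] = xn[i];
    }
    free(xn);
}
```

Note each sweep only *applies* the operator; nothing is factored, which is why iterative methods shine on large sparse problems.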
Computation
How do we actually compute these things?
Basic Linear Algebra Subprograms (BLAS)
Level 1: Scalar-vector operations
Level 2: Matrix-vector operations
Level 3: Matrix-matrix operations
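The three levels, sketched as naive C loops (real BLAS routines such as `daxpy`, `dgemv`, and `dgemm` are heavily optimized; these versions only show the shape of each level):

```c
/* Level 1 (vector-vector): y <- a*x + y, as in BLAS daxpy. O(n) work. */
void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Level 2 (matrix-vector): y <- A*x, as in BLAS dgemv. O(n^2) work. */
void gemv(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
}

/* Level 3 (matrix-matrix): C <- A*B, as in BLAS dgemm. O(n^3) work. */
void gemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            C[i * n + j] = 0.0;
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
        }
}
```

Level 3 routines do \(O(n^3)\) work on \(O(n^2)\) data, which is why high-performance libraries push as much computation as possible into them.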
Let's focus on solving equations
\(Ax=b\)
So let's compute \(A^{-1}\), then compute \(x = A^{-1}b\)?
Why is this bad?
\(PA=LU\)
Remember, from Linear algebra?
For a given \(N\times N\) matrix \(A\), the LU factorization of \(A\) is
Where \(L\) is a lower triangular matrix, and \(U\) is an upper triangular matrix
Why is this better than directly solving for \(A^{-1}\)?
\(PA=LU\)
Solving \(Ax = b\)
Since \(U\) is upper triangular, we can solve a system with \(U\) by back substitution in \(O(n^2)\) operations; likewise for \(L\) by forward substitution.
Solving \(P^TLUx = b\)
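The two triangular solves, sketched in C (function names are our own; given \(PA = LU\), solving \(Ax = b\) becomes: permute \(b\), forward-substitute with \(L\), back-substitute with \(U\)):

```c
/* Forward substitution: solve L y = b, L lower triangular (row-major). */
void forward_sub(int n, const double *L, const double *b, double *y) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= L[i * n + j] * y[j];
        y[i] = s / L[i * n + i];
    }
}

/* Back substitution: solve U x = y, U upper triangular (row-major). */
void back_sub(int n, const double *U, const double *y, double *x) {
    for (int i = n - 1; i >= 0; i--) {
        double s = y[i];
        for (int j = i + 1; j < n; j++)
            s -= U[i * n + j] * x[j];
        x[i] = s / U[i * n + i];
    }
}
```

Each solve costs \(O(n^2)\); factoring once and reusing \(L\) and \(U\) for many right-hand sides is the usual payoff over forming \(A^{-1}\).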
\(A=LL^*\)
For a given \(N\times N\) Hermitian positive definite matrix \(A\), the Cholesky factorization of \(A\) is
Where \(L\) is a lower triangular matrix, and \(L^*\) is the conjugate transpose of \(L\).
\(A=QR\)
For a given \(N\times N\) real matrix \(A\), the QR factorization of \(A\) is
Where \(Q\) is an orthogonal matrix, and \(R\) is an upper triangular matrix.
\(A=U\Sigma V^*\)
For a given \(M\times N\) matrix \(A\), the singular value decomposition (SVD) of \(A\) is
Where \(U\) is an \(M\times M\) unitary matrix, \(\Sigma\) is an \(M\times N\) diagonal matrix of singular values, and \(V^*\) is the conjugate transpose of an \(N\times N\) unitary matrix \(V\).
\(A = Q\Lambda Q^{-1}\)
For a given \(N\times N\) matrix \(A\) with linearly independent eigenvectors, the spectral factorization of \(A\) is
Where \(Q\) is the matrix of eigenvectors, and \(\Lambda\) is the diagonal matrix of eigenvalues.
Cholesky
#include <math.h>
#include <stdlib.h>

/* Compute the Cholesky factor L of A (n x n, row-major, symmetric
 * positive definite), so that A = L * L^T.
 * Returns a newly allocated n x n lower triangular matrix. */
double *cholesky(double *A, int n) {
    double *L = (double *)calloc(n * n, sizeof(double));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i + 1; j++) {
            /* Dot product of the partial rows i and j of L. */
            double s = 0;
            for (int k = 0; k < j; k++)
                s += L[i * n + k] * L[j * n + k];
            L[i * n + j] = (i == j)
                ? sqrt(A[i * n + i] - s)
                : (A[i * n + j] - s) / L[j * n + j];
        }
    return L;
}
Serial Implementations
What type of architecture is your CPU? GPU?
Algorithms need to be adapted to fit the architecture of the machine they execute upon
Granularity
Coarse Grain Parallelism:
A program is split into large tasks. For example, block matrix decompositions.
Pros, cons?
What types of architectures?
Granularity
Medium Grain Parallelism:
A program is split into medium tasks. Higher throughput, but higher communication and synchronization cost.
Pros, cons?
What type of architectures?
Granularity
Fine Grain Parallelism:
A program is split into small tasks. Again, higher throughput, but higher communication and synchronization cost.
Pros, cons?
What type of architectures?
Granularity
void main()
{
switch (Processor_ID)
{
case 1: Compute element 1; break;
case 2: Compute element 2; break;
case 3: Compute element 3; break;
.
.
.
.
case 100: Compute element 100;
break;
}
}
void main()
{
switch (Processor_ID)
{
case 1: Compute elements 1-4; break;
case 2: Compute elements 5-8; break;
case 3: Compute elements 9-12; break;
.
.
case 25: Compute elements 97-100;
break;
}
}
void main()
{
switch (Processor_ID)
{
case 1: Compute elements 1-50;
break;
case 2: Compute elements 51-100;
break;
}
}
Fine
Medium
Coarse
Levels of Parallelism
Instruction-level parallelism is a measure of how many of the instructions in a computer program can be executed simultaneously.
What types of architectures?
SIMD, Pipelining, OOO execution/register renaming, speculative execution...
Levels of Parallelism
Loop-level parallelism is a form of parallelism concerned with extracting parallel tasks from loops. The opportunity often arises in programs where data is stored in random-access data structures such as arrays.
Types of architectures?
When can you do this?
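A common way to exploit loop-level parallelism in C is OpenMP. A minimal sketch (compile with `-fopenmp`; without it the pragmas are ignored and the loops run serially with the same result; the iterations must be independent for this to be valid):

```c
/* Loop-level parallelism: each iteration is independent, so the
 * iterations can be divided among threads. */
void scaled_add(int n, double a, const double *x, const double *y, double *z) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}

/* A reduction needs explicit handling to avoid a data race on sum. */
double dot(int n, const double *x, const double *y) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```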
Levels of Parallelism
Similarly, coarser levels exist, such as subroutine-level and program-level parallelism
Types of architectures?
Achieving best performance
How do we optimize for performance (what type of granularity and level of parallelism)?
Cholesky
double *cholesky(double *A, int n) {
    double *L = (double *)calloc(n * n, sizeof(double));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i + 1; j++) {
            double s = 0;
            for (int k = 0; k < j; k++)
                s += L[i * n + k] * L[j * n + k];
            L[i * n + j] = (i == j)
                ? sqrt(A[i * n + i] - s)
                : (A[i * n + j] - s) / L[j * n + j];
        }
    return L;
}