Iterative Methods, Conjugate Gradient, and Preconditioning
Let's retread some ground to get a better feel for some of the algorithms we've investigated so far
Saving space and time
Set of triplets
i | j | value |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
2 | 2 | 1 |
3 | 3 | 3 |
4 | 3 | 4 |
4 | 4 | 6 |
4 | 5 | 7 |
5 | 5 | 1 |
Bad for iterating over rows/columns
Fast lookup; \(O(m)\) storage for \(m\) nonzeros
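A minimal sketch of triplet storage and a matrix-vector product over it (plain Python/NumPy, assuming \(i\) indexes rows and \(j\) columns, converted to 0-based indices):

import numpy as np

# Triplet (coordinate) storage: one (i, j, value) entry per nonzero,
# taken from the table above and shifted to 0-based indices.
triplets = [(0, 0, 1), (1, 0, 2), (1, 1, 1), (2, 2, 3),
            (3, 2, 4), (3, 3, 6), (3, 4, 7), (4, 4, 1)]

def triplet_matvec(n, triplets, x):
    # Multiply by scanning every stored nonzero once: O(m) work.
    Ax = np.zeros(n)
    for i, j, v in triplets:
        Ax[i] += v * x[j]
    return Ax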
Saving space and time
Linked Representation
(Diagram: the nonzero values 1, 2, 1, 4, 3, 6, 7, 1 stored as a chain of linked nodes.)
Saving space and time
Compressed Sparse Row
Stores the nonzero entries in three arrays
A=[1 2 1 3 4 6 7 1]
IA=[0 2 3 5 6 8]
JA=[0 1 1 2 3 3 3 4]
A = the nonzero values of the matrix, stored row by row
IA = row pointers: IA[i] is the index in A where row i starts, so IA[i+1] - IA[i] is the number of nonzeros in row i
JA = the column index of each element in A
Saving space and time
Compressed Sparse Row
A=[1 2 1 3 4 6 7 1]
IA=[0 2 3 5 6 8]
JA=[0 1 1 2 3 3 3 4]
import numpy as np

def sparse_matrix_mult(n, A, IA, JA, x):
    # Multiply a CSR matrix (A, IA, JA) by a dense vector x.
    Ax = np.zeros_like(x)
    a_idx = 0
    for i in range(n):
        num_elements = IA[i + 1] - IA[i]   # nonzeros in row i
        for _ in range(num_elements):
            Ax[i] += x[JA[a_idx]] * A[a_idx]
            a_idx += 1
    return Ax
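A quick sanity check of the routine above on the example arrays (the all-ones vector is an arbitrary choice):

import numpy as np

A  = np.array([1, 2, 1, 3, 4, 6, 7, 1], dtype=float)
IA = np.array([0, 2, 3, 5, 6, 8])
JA = np.array([0, 1, 1, 2, 3, 3, 3, 4])
x  = np.ones(5)

print(sparse_matrix_mult(5, A, IA, JA, x))   # -> [3. 1. 7. 6. 8.]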
Saving space and time
Related Friends
Saving space and time
An aside
What if you have no explicit matrix \(A\), but only a procedure that calculates \(Ax\) for a given \(x\)?
Can massively outperform direct methods, especially when convergence is fast
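A minimal sketch of such a matrix-free operator: the 1D discrete Laplacian applied without ever forming \(A\) (the choice of operator is purely illustrative):

import numpy as np

# Apply the tridiagonal operator with 2 on the diagonal and -1 off it,
# without storing any matrix.
def apply_A(x):
    y = 2.0 * x
    y[:-1] -= x[1:]
    y[1:] -= x[:-1]
    return y

x = np.random.rand(100)
y = apply_A(x)   # iterative methods only ever need products like this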
Stationary methods: Jacobi, Gauss-Seidel
Non-stationary methods: Krylov subspace methods
The simplest method for computing one eigenvalue-eigenvector pair is power iteration, which in effect multiplies an initial starting vector by successively higher powers of the matrix
for k in range(iterations):
    y = A.dot(x)
    x = y / np.linalg.norm(y, np.inf)   # normalize by the infinity norm
Iterate on vector
Why do these methods work? Also, when?
\(x^{new} = Tx^{old} + c\)
T never changes
\(x^*_i = \frac{b_i - \sum_{j\ne i} a_{i,j}x_j}{a_{ii}}\)
Beginning with an initial guess, both methods compute the next iterate by solving for each component of \(x\) in terms of other components of \(x\)
If D, L, and U are the diagonal, strictly lower triangular, and strictly upper triangular portions of A, then the Jacobi method can be written
\(x^* = D^{-1}(b-(L+U)x)\)
\(x^*_i = \frac{b_i - \sum_{j<i} a_{i,j}x^*_j -\sum_{j>i} a_{i,j}x_j}{a_{ii}}\)
\(x^* = (L+D)^{-1}(b-Ux)\)
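Minimal sketches of one Jacobi sweep and one Gauss-Seidel sweep, written directly from the componentwise formulas above (dense A, purely for illustration):

import numpy as np

def jacobi_step(A, b, x):
    # x* = D^{-1}(b - (L+U)x); A.dot(x) - D*x is exactly (L+U)x
    D = np.diag(A)
    return (b - (A.dot(x) - D * x)) / D

def gauss_seidel_step(A, b, x):
    # Each component uses the entries already updated in this sweep.
    x_new = x.copy()
    for i in range(len(b)):
        s = A[i, :i].dot(x_new[:i]) + A[i, i+1:].dot(x[i+1:])
        x_new[i] = (b[i] - s) / A[i, i]
    return x_new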
Convergence
Take a random \(100 \times 100\) matrix \(A\) and a random right-hand side \(b\), and solve \(Ax=b\)
(Plot: \(\log \|Ax-b\|\) versus iterations for Jacobi and Gauss-Seidel.)
A first introduction, sorta
Change the problem slightly so that it's easier to solve.
\(x^*_i =(1-\omega)x_i + \omega\left(\frac{b_i - \sum_{j<i} a_{i,j}x^*_j -\sum_{j>i} a_{i,j}x_j}{a_{ii}}\right)\)
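A sketch of one SOR sweep, blending the Gauss-Seidel update with the old value via \(\omega\) as in the formula above:

import numpy as np

def sor_step(A, b, x, omega):
    x_new = x.copy()
    for i in range(len(b)):
        s = A[i, :i].dot(x_new[:i]) + A[i, i+1:].dot(x[i+1:])
        gs = (b[i] - s) / A[i, i]                  # plain Gauss-Seidel update
        x_new[i] = (1 - omega) * x[i] + omega * gs
    return x_new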
(Plot: \(\log \|Ax-b\|\) versus iterations for Gauss-Seidel and Gauss-Seidel SOR with \(\omega=0.95\).)
If A is an \(n \times n\) symmetric positive definite matrix, we minimize the quadratic form
\(\displaystyle\argmin_x \frac{1}{2}x^TAx - x^Tb\)
which attains its minimum when \(Ax=b\)
Steepest descent: \(x^*=x - \alpha\nabla(\frac{1}{2}x^TAx - x^Tb) = x + \alpha r\), where \(r:= b - Ax\)
Optimal line search parameter: \(\alpha = r^Ts/(s^TAs)\) for search direction \(s\) (for steepest descent, \(s = r\))
import numpy as np

def steepest_descent(A, b, max_iters=1000):
    x = np.zeros(A.shape[0])
    for k in range(max_iters):
        r = b - A.dot(x)               # residual = negative gradient
        s = r.dot(r)
        if s < 1e-10:                  # converged
            break
        a = s / r.dot(A.dot(r))        # optimal step length along r
        x = x + a * r
    print("%d iterations" % k)
    return x, r
Example
Problem: steepest descent keeps choosing the same search directions, zig-zagging toward the solution
Fixing steepest descent by choosing gradients in new, orthogonal directions
General idea
Seek solutions in the space \( K_n = \operatorname{span}\{b, Ab, A^2b, \cdots, A^{n-1}b\}\)
That is, at step \(n\) look for a solution \(x_n\) such that
\(\displaystyle\min_{x_n \in K_n} \|b-Ax_n\|\)
Why find a solution in \(K_n\)?
General idea
\(0 = q(A)\)
\(0 = \displaystyle\sum_{i=0}^m a_iA^i\)
\(0 = a_0I + \displaystyle\sum_{i=1}^m a_iA^i\), with \(a_0 \ne 0\) since \(A\) is nonsingular
\(I = - \displaystyle\sum_{i=1}^m \frac{a_i}{a_0}A^i\)
\(A^{-1} = - \displaystyle\sum_{i=1}^m \frac{a_i}{a_0}A^{i-1}\)
\(A^{-1}b = - \displaystyle\sum_{i=1}^m \frac{a_i}{a_0}A^{i-1}b\)
\(x = A^{-1}b \in K_m\)
\(q(\cdot)\) is the minimal polynomial for A
\(m\) depends on the structure of the eigenvalues of \(A\)
Fixing SD by choosing gradients in new, unique directions
Each successive \(p_i\) is chosen to be \(A\)-conjugate (i.e. \(p_i^TAp_j = 0\)) to each of \(\{p_0, \cdots, p_{i-1}\}\)
As the algorithm progresses, \(p_i\) and \(r_i\) span the same Krylov subspace.
\(x_k\) can be regarded as the projection of \(x\) onto \(K_k\).
If \(m \ll n\), then we converge in few iterations. If we allow \(k\to n\) then we have a direct method.
import numpy as np

def conjgrad(A, x, b):
    r = b - A.dot(x)                    # initial residual
    p = r.copy()                        # first search direction
    rsold = r.dot(r)
    e = []                              # residual norm history
    for i in range(len(b)):
        Ap = A.dot(p)
        alpha = rsold / p.dot(Ap)       # optimal step along p
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.dot(r)
        p = r + (rsnew / rsold) * p     # next direction, conjugate to the previous ones
        rsold = rsnew
        e += [np.sqrt(rsnew)]
    return x, e
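A quick usage check of conjgrad on a small random SPD system (\(B^TB + I\) is symmetric positive definite by construction):

import numpy as np

np.random.seed(0)
B = np.random.rand(50, 50)
A = B.T.dot(B) + np.eye(50)
b = np.random.rand(50)

x, e = conjgrad(A, np.zeros(50), b)
print(np.linalg.norm(A.dot(x) - b))   # residual norm should be small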
(Plot: \(\log \|Ax-b\|\) versus iterations for steepest descent and conjugate gradient.)
What if A isn't positive definite? What if A isn't symmetric?
Again, for an \(m\times m\) matrix \(A\), seek solutions in the space \( K_n = \operatorname{span}\{b, Ab, A^2b, \cdots, A^{n-1}b\}\)
That is, at step \(n\) look for a solution \(x_n\) such that
\(\displaystyle\min_{x_n \in K_n} \|b-Ax_n\|\)
Iteratively build up an orthogonal basis for \(K_n\) (QR decomposition) which allows us to easily solve for a given \(x_n\)
Use Arnoldi iteration to build \(Q_n\) and \(H\) such that \(AQ_n = Q_{n+1}H\) (H is Hessenberg, i.e. almost upper triangular)
Start with an arbitrary vector \(q_1\) with norm 1.
For k = 2, 3, …: orthogonalize \(Aq_{k-1}\) against \(q_1, \cdots, q_{k-1}\), normalize, and take the result as \(q_k\) (see the sketch below)
Can be built up incrementally
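A minimal sketch of Arnoldi iteration (dense NumPy, no breakdown handling), producing \(Q_{n+1}\) and the \((n+1)\times n\) Hessenberg matrix \(H\):

import numpy as np

def arnoldi(A, b, n):
    m = len(b)
    Q = np.zeros((m, n + 1))
    H = np.zeros((n + 1, n))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(n):
        v = A.dot(Q[:, k])
        for j in range(k + 1):                # orthogonalize against previous q's
            H[j, k] = Q[:, j].dot(v)
            v = v - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        Q[:, k + 1] = v / H[k + 1, k]         # assumes H[k+1, k] != 0 (no breakdown)
    return Q, H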
Since \(x_n\in K_n\) and \(Q_n\) is a basis for \(K_n\), we can write \(x_n = Q_ny_n\), so
\(\|Ax_n - b\| = \|Q_{n+1}^TAx_n - Q_{n+1}^Tb\| = \|Q_{n+1}^TAQ_ny_n - Q_{n+1}^Tb\| = \|H_ny_n - \beta e_1\|\)
with \(\beta = \|b-Ax_0\|\), the norm of the initial residual (here \(\beta = \|b\|\), since \(q_1 = b/\|b\|\))
So the algorithm is as follows: on the \(n\)th iteration, extend the Arnoldi basis by one vector, solve the small least-squares problem \(\min_{y_n}\|H_ny_n - \beta e_1\|\), and set \(x_n = Q_ny_n\)
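A sketch of the resulting GMRES step, reusing arnoldi above, taking \(x_0 = 0\) so that \(\beta = \|b\|\), and letting NumPy handle the small least-squares problem:

import numpy as np

def gmres_step(A, b, n):
    Q, H = arnoldi(A, b, n)                       # A Q_n = Q_{n+1} H
    beta = np.linalg.norm(b)                      # initial residual norm (x_0 = 0)
    rhs = np.zeros(n + 1)
    rhs[0] = beta                                 # beta * e_1
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)   # min ||H y - beta e_1||
    return Q[:, :n].dot(y)                        # x_n = Q_n y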
Make a given problem easier
Instead of solving \(Ax = b\), we solve the related problem \(AM^{-1}y = b\) for \(y\), then compute \(x = M^{-1}y\) (with \(M\) chosen so that \(AM^{-1}\) is better conditioned)
Incomplete LU factorization
Since the LU factors of a sparse matrix may be dense, we seek a sparse approximation to the LU factorization.
Start by noting all the non-zero elements of an \(n \times n\) sparse matrix \(A\).
\(G(A) = \{(i,j) | A_{ij} \ne 0\}\)
Then choose a sparsity pattern \(S\) such that
\(S\subset \{1, \cdots, n\}^2, \ \ G(A) \subset S, \ \ \ \{(i,i) | i\in\{1\cdots n\}\} \subset S\),
A decomposition of the form \(A=LU-R\) where \(L\) and \(U\) are zero outside the sparsity pattern \(S\), and the residual \(R\) is zero on \(S\)
Incomplete Cholesky factorization
A sparse approximation of the Cholesky factorization.
Recall the Cholesky factorization of a positive definite matrix \(A\) is \(A\) = \(LL^*\) where \(L\) is a lower triangular matrix.
The incomplete Cholesky factorization is a sparse lower triangular matrix \(M\) that is close to \(L\). The corresponding preconditioner is \(MM^*\).
Use the algorithm for finding the exact Cholesky decomposition, except that any entry is set to zero if the corresponding entry in A is also zero.
import numpy as np

def ichol(A):
    # Zero-fill incomplete Cholesky: keep only entries where A is nonzero.
    n = A.shape[0]
    L = np.tril(A).astype(float)
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        for i in range(k + 1, n):
            if L[i, k] != 0:
                L[i, k] /= L[k, k]
        for j in range(k + 1, n):
            for i in range(j, n):
                if L[i, j] != 0:
                    L[i, j] -= L[i, k] * L[j, k]
    return L
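Applying the resulting preconditioner \(M = LL^T\) amounts to two triangular solves; a dense sketch (a real implementation would exploit the sparsity of L):

import numpy as np

def apply_ic_preconditioner(L, r):
    y = np.linalg.solve(L, r)        # forward solve:  L y = r
    return np.linalg.solve(L.T, y)   # backward solve: L^T z = y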
Symmetric SOR
If the original matrix can be split into its diagonal and strictly lower and upper triangular parts as \(A=D+L+L^T\), then the SSOR preconditioner matrix is defined as
\(M(\omega) = \frac{\omega}{2-\omega}\left(\frac{1}{\omega} D + L\right)D^{-1}\left(\frac{1}{\omega} D + L\right)^T\)
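A sketch that forms \(M(\omega)\) explicitly with dense NumPy, purely to make the formula concrete (in practice the preconditioner is applied via triangular solves, never formed):

import numpy as np

def ssor_preconditioner(A, omega):
    D = np.diag(np.diag(A))          # diagonal part
    L = np.tril(A, k=-1)             # strictly lower triangular part
    K = D / omega + L
    return (omega / (2 - omega)) * K.dot(np.linalg.inv(D)).dot(K.T)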
What makes a good preconditioner?
Many iterative methods have the smoothing property where oscillatory modes of the error are eliminated effectively, but smooth modes are damped slowly.
So we transform the problem so that we forcibly dampen lower-frequency error modes. That is, we solve a similar problem on coarser grids, then use that as an estimate for our initial guess.
After a few iterations
Model Problem
One dimensional boundary value problem: \(-u''(x) + \sigma u(x) = f(x)\) for \(0 < x < 1\), \(\sigma > 0\), with \(u(0)=u(1)=0\)
We want to find a function \(u(x)\) satisfying the above.
One dimensional boundary value problem
Grid:
\(h:=\frac{1}{N}, x_j = jh, j = 0,1, \cdots N\)
(Diagram: the grid points \(x_0, x_1, x_2, \cdots, x_j, \cdots, x_N\), spaced \(h\) apart.)
Let \(f_i = f(x_i)\) and \(v_i \approx u(x_i)\); we want to recover the \(v_i\)
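Assuming the model problem above and second-order central differences, the discrete equations are
\(\displaystyle\frac{-v_{j-1} + 2v_j - v_{j+1}}{h^2} + \sigma v_j = f_j, \qquad j = 1, \cdots, N-1, \quad v_0 = v_N = 0\)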
(Diagram: the fine grid \(\Omega_h\) with points \(x_0, x_1, \cdots, x_N\), and the coarse grid \(\Omega_{2h}\) containing every other point.)
If we call \(e = x - \tilde{x}\), that is, \(e\) is the error between the ground truth and an approximation, then \(A(\tilde{x} + e) = b\)
It follows that \(Ae = r\), where \(r := b - A\tilde{x}\) is the residual, so solve for \(e\) approximately, then add that to \(\tilde{x}\).
Given an approximation to \(x\), calculate the residual \(r\) then project onto a coarser grid (restriction).
Solve for \(e\) on coarser grid, with a coarser approximate operator to \(A\).
Interpolate \(e\) on the finer grid, and perform residual correction (prolongation).
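A minimal sketch of one such two-grid correction cycle in 1D (Jacobi smoothing, restriction by injection, linear interpolation back; all simple illustrative choices):

import numpy as np

def two_grid_cycle(A_fine, A_coarse, b, x, n_smooth=3):
    # Fine grid unknowns: x_1 .. x_{N-1}; coarse grid: x_2, x_4, ..., x_{N-2}.
    D = np.diag(A_fine)
    for _ in range(n_smooth):                  # smooth: damp oscillatory error modes
        x = x + (b - A_fine.dot(x)) / D
    r = b - A_fine.dot(x)                      # fine-grid residual
    r_c = r[1::2]                              # restriction: inject onto coarse grid
    e_c = np.linalg.solve(A_coarse, r_c)       # solve A e = r on the coarse grid
    e = np.zeros_like(x)                       # prolongation: linear interpolation
    e[1::2] = e_c
    e[2:-1:2] = 0.5 * (e_c[:-1] + e_c[1:])
    e[0], e[-1] = 0.5 * e_c[0], 0.5 * e_c[-1]
    return x + e                               # residual correction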