Associate Professor Mikael Mortensen
Department of Mathematics
Division of Mechanics
University of Oslo
PCCFD at KAUST 24/5 - 2017
Part I: Triply periodic domains/Fourier - parallel scaling
Part II: Automating the Spectral Galerkin method
Started out as a toy project for
M. Mortensen and Hans Petter Langtangen
3D + time and HUGE resolution
Serious number crunching - extreme computing
We knew Direct Numerical Simulations (DNS) of turbulence were performed with low-level Fortran or C
From scratch?
(Not just wrapping a Fortran solver)
Using only numpy/scipy/mpi4py?
Can we make this solver run on thousands of cores ?
With billions of unknowns?
As fast as C?
Triply periodic domain
Fourier space
Equations:
Navier Stokes in spectral (Fourier) space:
Pseudo-spectral treatment of nonlinearity
Some timestepping routine (e.g., 4th order Runge-Kutta)
Classical pseudo-spectral Navier-Stokes solver
The kind that has been used to break records since the day of the first supercomputer
Numpy is designed around structured arrays that can be reshaped, resized, sliced, transposed, etc.
Most array operations are performed in C and are very fast
(1) Structured grid
(2) 3D parallel FFT
Numpy wraps FFTpack
pyFFTW wraps FFTW
No MPI implementation
But MPI for Python (mpi4py, Lisandro Dalcin) wraps almost the entire MPI library and is as efficient as low-level C
A few computational details
A computational box (mesh) needs to be distributed between a (possibly) large number of processors
Mesh size:
Pencil
Slab
CPUs
CPUs
in a nutshell
real FFT in z, complex in y
No MPI required
Transpose and distribute
Then FFT in last direction
4 operations, four lines of Python code!
in a nutshell
rfft(u, axis=2)
Transform
fft(u, axis=0)
Transform
fft(u, axis=1)
7 operations
FFT z-dir
transpose
communicate
FFT y-dir
transpose
communicate
FFT x-dir
7 lines of code!
Enstrophy colored by pressure
which turns out to be very easy, and
100 lines of compact Python code is enough to implement
How about speed and parallel scaling?
At first it scaled poorly on my laptop
Then it scaled somewhat better on Norway's largest computer
But then Hans Petter contacted the Davids
Shaheen Blue Gene P supercomputer at KAUST Supercomputing Laboratory
Testing the 100 lines of pure Python code
C++(slab) 20-30 % faster than Python (slab)
Good strong scaling
mesh =
Actually, the numpy ufuncs used to compute the cross product is the main reason
Moving cross product to Cython or Numba
@jit(float[:,:,:,:](float[:,:,:,:], float[:,:,:,:], float[:,:,:,:]), nopython=True)
def cross(a, b, c):
for i in xrange(a.shape[1]):
for j in xrange(a.shape[2]):
for k in xrange(a.shape[3]):
a0 = a[0,i,j,k]
a1 = a[1,i,j,k]
a2 = a[2,i,j,k]
b0 = b[0,i,j,k]
b1 = b[1,i,j,k]
b2 = b[2,i,j,k]
c[0,i,j,k] = a1*b2 - a2*b1
c[1,i,j,k] = a2*b0 - a0*b2
c[2,i,j,k] = a0*b1 - a1*b0
return c
Either Cython or Numba will all do the trick. A few lines more
Numba and Cython perform approximately the same
and as fast as C++!
Approximately 5 times faster than original
More than matches C++ on the Blue Gene P!
mesh size = per task
65536 tasks!
mesh =
per task
Word of warning
2048 nodes, or 65536 tasks, take about 13 minutes to load Python and get started
Should have been using larger mesh
mpiFFT4py
mpi4py-fft
https://bitbucket.org/mpi4py/mpi4py-fft
With Lisandro Dalcin at KAUST
Completely generic implementation ~ 500 lines of Python code
SIAM CSE 2017
Automating the spectral Galerkin method
for tensor product domains
and non-periodic
basis functions
Prof. Jie Shen
Purdue University
meets chebfun
Solution
Expansion coefficients
(unknowns)
Test functions
Poisson example
1) Create a basis that satisfies boundary conditions, e.g.,
(Legendre polynomials)
2) Transform PDE to variational form
Trial function
Test function
Shen's modified basis leads to very sparse matrices
from fenics import *
mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, 'CG', 1)
# Define variational problem
u = TrialFunction(V)
v = TestFunction(V)
f = Constant(-6.0)
a = inner(grad(u), grad(v))*dx
L = f*v*dx
Solve equations with code very similar to mathematics
However
How to automate?
What do we know about
automating the solution
of PDE'S?
Python package (~3k loc) for automating the spectral Galerkin method
for high performance computing on tensor product grids
A FEniCS like interface with a simple form language
Extremely fast, order optimal, direct solvers for Poisson, Helmholtz and Biharmonic problems (sparse diagonal arrays)
MPI - invisible to the user mpi4py-fft (Lisandro Dalcin)
Fourier - Chebyshev - Legendre
modified basis functions
Chebfun
Spectral collocation, MATLAB, structured, no MPI
Semtex
Spectral element, MPI, C++/Fortran, 2D unstructured (1 Fourier)
Quite low-level, nowhere near as easily configured as FEniCS for new types of equations.
Very good for Navier-Stokes
Nek5000 (Nekbox)
Spectral element, Fortran, MPI, 3D unstructured
Nektar++
Spectral hp-element, C++, 3D unstructured
Nothing really for fast development of new spectral solvers aimed at high performance computing on supercomputers
import numpy as np
from shenfun import TestFunction, TrialFunction, inner, div, grad, chebyshev
N = 20
V = chebyshev.ShenDirichletBasis(N, plan=True)
v = TestFunction(V)
u = TrialFunction(V)
# Assemble stiffness matrix
A = inner(v, div(grad(u)))
# Some consistent right hand side
f = Function(V, False)
f[:] = -2
# Assemble right hand side
f_hat = inner(v, f)
# Solve equations and transform spectral solution to real space
f_hat = A.solve(f_hat)
fj = V.backward(f_hat)
x = V.mesh(N)
assert np.allclose(fj, 1-x**2)
Very similar to FEniCS!
just with spectral accuracy
(1) Create basis
(2) Variational forms
(3) Solve system
Combinations of Fourier and Chebyshev/Legendre bases
Any Cartesian combination of spaces (ultimately)
With solutions of the form:
Unknown expansion coefficients
Combinations of Fourier and Chebyshev/Legendre bases
from shenfun import *
from mpi4py import MPI
comm = MPI.COMM_WORLD
N = (20, 30)
V0 = chebyshev.ShenDirichletBasis(N[0])
V1 = fourier.R2CBasis(N[1])
T = TensorProductSpace(comm, (V0, V1))
v = TestFunction(T)
u = TrialFunction(T)
A = inner(v, div(grad(u)))
# Or 3D
V2 = fourier.C2CBasis(40)
T3 = TensorProductSpace(comm, (V0, V2, V1))
v = TestFunction(T3)
u = TrialFunction(T3)
A = inner(v, div(grad(u)))
MPI under the hood
using mpi4py-fft
on tensor product grids like:
to make implementation very similar to mathematics
Mathamatical operators act on test, trial, or functions with data
V = TensorProductSpace(comm, (V0, V1, V2))
# Form arguments:
v = TestFunction(V)
u = TrialFunction(V)
u_ = Function(V, False)
# Forms
du = grad(u_)
d2u = div(du)
cu = curl(du)
dudx = Dx(u_, 0)
Inner products and projections
A = inner(v, u) # Bilinear form, assembles to matrices
f = inner(v, u_) # Linear form, assembles to vector
project(div(grad(u_)), V) # projection to V
(not in FEniCS)
u_ = V.backward(u_hat, u_)
Any spectral function is defined as
3D:
1D:
This is computed on the tensor product grid as a backwards transform,
e.g., a discrete Fast Fourier inverse transform
A non-periodic basis uses fast Chebyshev(Legendre) transforms
by L2-projections
u_hat = V.forward(u_, u_hat)
which is achieved using a forward transform
Implemented with FFTs or fast Chebyshev transforms
VectorTensorProductSpace
import numpy as np
from shenfun import *
N = 16
K0 = C2CBasis(N)
K1 = C2CBasis(N)
K2 = R2CBasis(N)
T = TensorProductSpace(comm, (K0, K1, K2))
# Scalar function u_
u_ = Function(T, False)
u_hat = Function(T)
u_[:] = np.random.random(u_.shape)
u_hat = T.forward(u_, u_hat)
# Now define vector space to allow computation of grad(u_)
TV = VectorTensorProductSpace([T]*3)
du_hat = project(grad(u_), TV, uh_hat=u_hat)
are all treated with pseudospectral techniques
(1) Compute inverse
u = T.backward(u_hat)
with dealiasing using
3/2-rule (padding) or 2/3-rule
(3) Transform back to spectral space
w = u*u
(2) Compute nonlinearity
w_hat = T.forward(w)
T = TensorProductSpace(comm, (V0, V1, V2))
example
2D periodic (x, y) \( \in [-50, 50]^2 \)
Use Fourier in both directions
Weak form, only spatial discretization:
example
2D periodic (x, y) \( \in [-50, 50]^2 \)
import sympy
from shenfun.fourier.bases import R2CBasis, C2CBasis
from shenfun import inner, grad, TestFunction, TrialFunction, \
Function, TensorProductSpace
# Use sympy to set up initial condition
x, y = sympy.symbols("x,y")
ue = (x + y)*sympy.exp(-0.03*(x**2+y**2))
N = (200, 200)
K0 = C2CBasis(N[0], domain=(-50., 50.))
K1 = C2CBasis(N[1], domain=(-50., 50.))
T = TensorProductSpace(comm, (K0, K1))
u = TrialFunction(T)
v = TestFunction(T)
U = Function(T, False)
def compute_rhs(rhs, u_hat, U, T):
rhs = inner(grad(v), -grad(U), output_array=rhs, uh_hat=u_hat)
rhs /= (2*np.pi)**2
rhs += u_hat
rhs -= T.forward((1+1.5j)*U*abs(U)**2, w0)
return rhs
# And, e.g., Runge-Kutta integrator
#
#
Key code:
Turbulent channel flow
3D tensor product space
2 Fourier bases +
Modified Chebyshev bases for wall normal direction
Kim and Moin modified version of Navier-Stokes:
is convection
Wall normal velocity
Wall normal vorticity
Turbulent channel flow
First fully spectral DNS with
3D Tensor product mesh:
Shaheen II at KAUST
http://github.com/spectralDNS
Before I go I just want to point you attention towards a conference organised in the memory of my brilliant colleague
International Conference on Computational Science and Engineering, Simula October 23-25, 2017