Associate Professor Mikael Mortensen
Department of Mathematics, Division of Mechanics, University of Oslo
I3MS-seminar RWTH Aachen University 6/11 - 2017
Python for high performance scientific computing
Shenfun
A software for automating the spectral Galerkin method
Github pushes by language (2017):
Python is often compared to Matlab, because a lot of the coding done in either is similar
Matlab | Python/Numpy | Operation |
---|---|---|
find(a>0.5) | nonzero(a>0.5) | find the indices where (a > 0.5) |
zeros((3,4,5)) | zeros(3,4,5) | Create array of shape (3,4,5) |
eye(3) | eye(3) | 3x3 identity matrix |
Some similarities and differences:
If you want to get started with scientific computing in Python
Comes pre-loaded with (almost) everything needed to do scientific computing
More than 1,500,000 downloads
https://anaconda.org
Using Fourier transformed Navier Stokes equations
Direct Numerical Simulations (DNS) of turbulence are traditionally performed with low-level Fortran or C
From scratch?
(Not just wrapping a Fortran solver)
Using only standard Python modules numpy/scipy/mpi4py?
Can this solver be made to run on thousands of cores ?
With billions of unknowns?
As fast as C/fortran?
Triply periodic domain
Fourier space
Equations:
Navier Stokes in spectral (Fourier) space:
Pseudo-spectral treatment of nonlinearity
Some timestepping routine (e.g., 4th order Runge-Kutta)
Classical pseudo-spectral Navier-Stokes solver
The kind that has been used to break records since the day of the first supercomputer
Numpy is designed around structured arrays that can be reshaped, resized, sliced, transposed, etc.
Most array operations are performed in C and are very fast
(1) Structured grid
(2) 3D parallel FFT
Numpy wraps FFTpack
pyFFTW wraps FFTW
MPI ?
But MPI for Python ( mpi4py, Lisandro Dalcin) wraps almost the entire MPI library and is as efficient as low-level C
A few computational details
A computational box (mesh) needs to be distributed between a large (possibly thousands) number of processors
Mesh size:
Pencil
Slab
CPUs
CPUs
in a nutshell
real FFT in z, complex in y
No MPI required
Transpose and distribute
Then FFT in last direction
4 operations, four lines of Python code!
in a nutshell
rfft(u, axis=2)
Transform
fft(u, axis=0)
Transform
fft(u, axis=1)
7 operations
FFT z-dir
transpose
communicate
FFT y-dir
transpose
communicate
FFT x-dir
7 lines of code!
Enstrophy colored by pressure
and
100 lines of compact Python code is enough to implement
How about speed and parallel scaling?
Can we compete with low-level C/C++ or Fortran on supercomputers?
Shaheen Blue Gene P supercomputer at KAUST Supercomputing Laboratory
Testing the 100 lines of pure Python code in parallel
C++(slab) 20-30 % faster than Python (slab)
Good strong scaling
mesh =
Actually, the numpy ufuncs used to compute the cross product is the main reason
Luckily Python has several options for optimizing slow code
Moving cross product to Cython or Numba
def cross(a, b, c):
c[0] = a[1]*b[2]-a[2]*b[1]
c[1] = a[2]*b[0]-a[0]*b[2]
c[2] = a[0]*b[1]-a[1]*b[0]
return c
u = np.ones((3, N, N, N))
w = np.ones((3, N, N, N))
v = np.zeros((3, N, N, N))
v = cross(u, w, v)
Numpy has to create intermediate work arrays on the fly
@jit(float[:,:,:,:](float[:,:,:,:], float[:,:,:,:], float[:,:,:,:]), nopython=True)
def cross(a, b, c):
for i in xrange(a.shape[1]):
for j in xrange(a.shape[2]):
for k in xrange(a.shape[3]):
a0 = a[0,i,j,k]
a1 = a[1,i,j,k]
a2 = a[2,i,j,k]
b0 = b[0,i,j,k]
b1 = b[1,i,j,k]
b2 = b[2,i,j,k]
c[0,i,j,k] = a1*b2 - a2*b1
c[1,i,j,k] = a2*b0 - a0*b2
c[2,i,j,k] = a0*b1 - a1*b0
return c
Either Cython or Numba will do the trick. A few lines more
Numba and Cython perform approximately the same
and as fast as C++!
Approximately 5 times faster than original
More than matches C++ on the Blue Gene P!
mesh size = per task
65536 tasks!
mesh =
per task
Word of warning
2048 nodes, or 65536 tasks, take about 13 minutes to load Python and get started
Python can be used for high performance supercomputing and it can be used for DNS with Fourier
mpi4py-fft
https://bitbucket.org/mpi4py/mpi4py-fft
With Lisandro Dalcin at KAUST
~ 500 lines of Python code
With the newly released
we can do any transform from FFTW in parallel
Highly configurable and fast
Automating the spectral Galerkin method
on non-periodic domains, with different types
of boundary conditions (Dirichlet/Neumann)
Not just periodic Fourier
With a highly configurable library for any fast transforms
we can now implement the spectral Galerkin method
Galerkin meets orthogonal
global spectral basis functions, e.g.
Solution
Expansion coefficients
(unknowns)
Global basis functions
We use it to solve PDEs using the method of weighted residuals
Weighted inner products:
Choose global basis (test) functions satisfying the correct boundary conditions. Usually modified Chebyshev or Legendre polynomials, or Fourier exponentials
Transform PDEs to weak variational forms using the method of weighted residuals
Assemble and solve linear system of equations
Poisson example
1) Create a basis that satisfies boundary conditions, e.g.,
(Legendre polynomial order k+2)
2) Transform PDE to variational form using test function
Trial function
Test function
With quadrature notation
Insert to get linear algebra problem:
With
from fenics import *
mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, 'CG', 1)
# Define variational problem
u = TrialFunction(V)
v = TestFunction(V)
f = Constant(-6.0)
a = inner(grad(u), grad(v))*dx
L = f*v*dx
Finite accuracy (not spectral)
How?
https://github.com/spectralDNS/shenfun
Python package (~3k loc) for automating the spectral Galerkin
method for high performance computing on tensor product grids
import numpy as np
from shenfun import TestFunction, TrialFunction, inner, div, grad, chebyshev
N = 20
V = chebyshev.ShenDirichletBasis(N, plan=True)
v = TestFunction(V)
u = TrialFunction(V)
# Assemble stiffness matrix
A = inner(v, div(grad(u)))
# Some consistent right hand side
f = Function(V, False)
f[:] = -2
# Assemble right hand side
f_hat = inner(v, f)
# Solve equations and transform spectral solution to real space
f_hat = A.solve(f_hat)
fj = V.backward(f_hat)
x = V.mesh(N)
assert np.allclose(fj, 1-x**2)
Very similar to FEniCS!
just with spectral accuracy
(1) Create basis
(2) Variational forms
(3) Solve system
Cartesian products of Fourier and Chebyshev/Legendre bases
Any Cartesian combination of spaces (ultimately)
With solutions of the form:
Unknown expansion coefficients
Implementation
from shenfun import *
from mpi4py import MPI
comm = MPI.COMM_WORLD
N = (20, 30)
# Create two bases
V0 = chebyshev.ShenDirichletBasis(N[0])
V1 = fourier.R2CBasis(N[1])
# Create tensor product space
T = TensorProductSpace(comm, (V0, V1))
v = TestFunction(T)
u = TrialFunction(T)
A = inner(v, div(grad(u)))
# Or 3D
V2 = fourier.C2CBasis(40)
T3 = TensorProductSpace(comm, (V0, V2, V1))
v = TestFunction(T3)
u = TrialFunction(T3)
A = inner(v, div(grad(u)))
MPI under the hood
using mpi4py-fft
on tensor product grids like:
Slab
Pencil
influenced heavily by UFL/FEniCS
Mathematical operators act on test, trial, or functions with data
V = TensorProductSpace(comm, (V0, V1, V2))
# Form arguments:
v = TestFunction(V)
u = TrialFunction(V)
u_ = Function(V, False)
# Forms
du = grad(u_)
d2u = div(du)
cu = curl(du)
dudx = Dx(u_, 0)
Inner products and projections
A = inner(v, u) # Bilinear form, assembles to matrices
f = inner(v, u_) # Linear form, assembles to vector
project(div(grad(u_)), V) # projection to V
u_hat = V.forward(u, u_hat) u = V.backward(u_hat, u)
Move back and forth between and using fast transforms
Periodic basis using fast Fourier transforms
Non-periodic basis using fast Chebyshev transforms (not for Legendre)
all treated with pseudo-spectral techniques
(1) Compute inverse
u = T.backward(u_hat)
with dealiasing using
3/2-rule (padding) or 2/3-rule
(3) Transform back to spectral space
w = u*u
(2) Compute nonlinearity
w_hat = T.forward(w)
T = TensorProductSpace(comm, (V0, V1, V2))
Fourier: A is diagonal
Modified Legendre: A is diagonal
Modified Chebyshev: A sparse - solved as tridiagonal matrix
2 dimensions
Matrix form:
Sparse matrices A (stiffness) and B (mass) are 1D matrices of basis (1/2)
Very fast direct solvers implemented in parallel using matrix decomposition method
Linear algebra:
in more detail
from shenfun import *
from mpi4py import MPI
comm = MPI.COMM_WORLD
N = (64, 64)
K1 = R2CBasis(N[0])
SD = chebyshev.bases.ShenDirichletBasis(N[1])
T = TensorProductSpace(comm, (K1, SD))
u = TrialFunction(T)
v = TestFunction(T)
Assemble matrix
Trial function
Test function
Combination of Chebyshev polynomials (or Legendre) that satisfies boundary conditions
Jie Shen's modified basis satisfies bcs and leads to very sparse matrices
Assemble matrices
1D stiffness
1D mass
Matrix form
Assemble matrices in shenfun
from shenfun import *
from mpi4py import MPI
comm = MPI.COMM_WORLD
N = (64, 64)
K1 = R2CBasis(N[0])
SD = chebyshev.bases.ShenDirichletBasis(N[1])
T = TensorProductSpace(comm, (K1, SD))
u = TrialFunction(T)
v = TestFunction(T)
mat = inner(div(grad(u)), v)
Creates mat - a dictionary holding two sparse matrices: mass and stiffness. Fourier contribution are scales to matrices
example
2D periodic (x, y) \( \in [-50, 50]^2 \)
Use Fourier in both directions
Weak form, only spatial discretization:
example
2D periodic (x, y) \( \in [-50, 50]^2 \)
import sympy
from shenfun.fourier.bases import R2CBasis, C2CBasis
from shenfun import inner, grad, TestFunction, TrialFunction, \
Function, TensorProductSpace
# Use sympy to set up initial condition
x, y = sympy.symbols("x,y")
ue = (x + y)*sympy.exp(-0.03*(x**2+y**2))
N = (200, 200)
K0 = C2CBasis(N[0], domain=(-50., 50.))
K1 = C2CBasis(N[1], domain=(-50., 50.))
T = TensorProductSpace(comm, (K0, K1))
u = TrialFunction(T)
v = TestFunction(T)
U = Function(T, False)
def compute_rhs(rhs, u_hat, U, T):
rhs = inner(grad(v), -grad(U), output_array=rhs, uh_hat=u_hat)
rhs /= (2*np.pi)**2
rhs += u_hat
rhs -= T.forward((1+1.5j)*U*abs(U)**2, w0)
return rhs
# And, e.g., Runge-Kutta integrator
#
#
Key code:
Turbulent channel flow
3D tensor product space
2 Fourier bases +
Modified Chebyshev bases for wall normal direction
Kim and Moin modified version of Navier-Stokes:
is convection
Wall normal velocity
Wall normal vorticity
Turbulent channel flow
First fully spectral DNS with
3D Tensor product mesh:
Shaheen II at KAUST
http://github.com/spectralDNS/shenfun
conda install -c conda-forge -c spectralDNS shenfun