The spectralDNS project

Associate Professor Mikael Mortensen

Department of Mathematics

Division of Mechanics

University of Oslo

 

PCCFD at KAUST 24/5 - 2017

Part I: Triply periodic domains/Fourier - parallel scaling

 

Part II: Automating the Spectral Galerkin method

The spectralDNS project

Started out as a toy project for

M. Mortensen and Hans Petter Langtangen

 

3D + time and HUGE resolution

Serious number crunching - extreme computing

We knew Direct Numerical Simulations (DNS) of turbulence were performed with low-level Fortran or C

Would it be possible to write a competitive, state of the art turbulence solver in Python?

From scratch?

(Not just wrapping a Fortran solver)

Using only numpy/scipy/mpi4py?

Can we make this solver run on thousands of cores ?

With billions of unknowns?

As fast as C?

What does it take to build a spectral DNS solver?

Triply periodic domain

Fourier space

[0, 2 \pi]^3

Equations:

What does it take to build a spectral DNS solver?

Navier Stokes in spectral (Fourier) space:

Pseudo-spectral treatment of nonlinearity

Some timestepping routine (e.g., 4th order Runge-Kutta)

Classical pseudo-spectral Navier-Stokes solver

The kind that has been used to break records since the day of the first supercomputer

What does it really take to implement a spectral DNS solver?

Numpy is designed around structured arrays that can be reshaped, resized, sliced, transposed, etc. 

Most array operations are performed in C and are very fast

  (1)  Structured grid

  (2)  3D parallel FFT

Numpy wraps FFTpack

pyFFTW wraps FFTW

No MPI implementation

But MPI for Python (mpi4py, Lisandro Dalcin) wraps almost  the entire MPI library and is as efficient as low-level C

Implementation

A few computational details

A computational box (mesh) needs to be distributed between a (possibly) large number of processors 

N^3, N \sim 1000 - 10000

Mesh size:

Pencil 

Slab

\le N^2

CPUs

\le N

CPUs

3D FFT with MPI

in a nutshell

real FFT in z, complex in y

 

No MPI required

Transpose and distribute

Then FFT in last direction

  1. 2D FFTs in Y-Z planes (local)
  2. transpose operation
  3. distribute data (MPI Alltoall)
  4. FFT in X-direction

What does it really take?

4 operations, four lines of Python code!

Pencil decomposition

in a nutshell

rfft(u, axis=2)

Transform

fft(u, axis=0)

Transform

 

fft(u, axis=1)

7 operations

  1. FFT z-dir​

  2. transpose

  3. communicate

  4. FFT y-dir

  5. transpose

  6. communicate

  7. FFT x-dir

7 lines of code!

With parallel 3D FFT we may develop efficient DNS solvers 

Enstrophy colored by pressure

which turns out to be very easy, and

100 lines of compact Python code is enough to implement

Ok, so it was relatively easy to implement

How about speed and parallel scaling?

At first it scaled poorly on my laptop

Then it scaled somewhat better on Norway's largest computer

But then Hans Petter contacted the Davids

Parallell scaling

Shaheen Blue Gene P supercomputer at KAUST Supercomputing Laboratory

Testing the 100 lines of pure Python code

C++(slab) 20-30 % faster than Python (slab)

Good strong scaling

mesh = 

2048^3
512^3

Why slower than C++?

Actually, the numpy ufuncs used to compute the cross product is the main reason

 

Moving cross product to Cython or Numba

Hardcoding the cross product into one single loop

@jit(float[:,:,:,:](float[:,:,:,:], float[:,:,:,:], float[:,:,:,:]), nopython=True)
def cross(a, b, c):
    for i in xrange(a.shape[1]):
        for j in xrange(a.shape[2]):
            for k in xrange(a.shape[3]):
                a0 = a[0,i,j,k]
                a1 = a[1,i,j,k]
                a2 = a[2,i,j,k]
                b0 = b[0,i,j,k]
                b1 = b[1,i,j,k]
                b2 = b[2,i,j,k]
                c[0,i,j,k] = a1*b2 - a2*b1
                c[1,i,j,k] = a2*b0 - a0*b2
                c[2,i,j,k] = a0*b1 - a1*b0
    return c

Either Cython or Numba will all do the trick. A few lines more

Numba and Cython perform approximately the same

and as fast as C++!

Approximately 5 times faster than original

Solver optimized with Cython

More than matches C++ on the Blue Gene P!

And later on Shaheen II 

64^3

mesh size =         per task

65536 tasks!

mesh =    

per task

Word of warning

2048 nodes, or 65536 tasks, take about 13 minutes to load Python and get started

Should have been using larger mesh

0.5 \cdot 64^3

Publicly available packages for parallel FFT in

mpi4py-fft

https://bitbucket.org/mpi4py/mpi4py-fft

With Lisandro Dalcin at KAUST

Completely generic implementation ~ 500 lines of  Python code

Faster and more accurate pseudospectral DNS through adaptive high-order time integration

  • Higher-order RK pairs reliably outperform RK4
  • Step-size adaptivity can be a big win and is never disadvantageous
  • Method of choice: Bogacki-Shampine 5/4 (BS5) with adaptivity
  • David Ketcheson
  • Nathanael Schilling
  • Matteo Parsani
  • Mikael Mortensen

SIAM CSE 2017

Part II:

Evolving from just Fourier and Navier Stokes

 

 

Shenfun

Automating the spectral Galerkin method

for tensor product domains

and non-periodic

basis functions

Prof. Jie Shen

Purdue University

meets chebfun

The Spectral Galerkin method

  1. Choose global basis (test) functions satisfying boundary conditions. We use modified Chebyshev or Legendre polynomials (J. Shen), or Fourier exponentials
  2. Transform PDEs to weak variational forms using the method of weighted residuals
  3. Solve system of equations
u(x) = \displaystyle\sum_{k=0}^{N-1} \hat{u}_k \phi_k(x)

Solution

Expansion coefficients

(unknowns)

Test functions

Spectral Galerkin

Poisson example

\nabla^2 u(x) = f(x) \quad \text{for} \quad x \in [-1, 1]
u(\pm 1) = 0

1) Create a basis that satisfies boundary conditions, e.g.,

\phi_k(x) = L_k(x) - L_{k+2}(x) \quad V = \text{span}\{\phi_k\}_{k=0}^{N-3}
\displaystyle \int_{-1}^{1}v \nabla^2 u \, w dx = \int_{-1}^{1} vf w dx
\text{where } w(x) \text{ is a weight function and } v \text{ is a test function}

(Legendre polynomials)

2) Transform PDE to variational form

3) Solve linear system of equations

\text{Find } u \in V, \text{such that}
\left(v, \nabla^2 u \right)_w^N = \left(v, f \right)_w^N \quad \forall v \in V
A_{kl} \hat{u}_l = \tilde{f}_k
\hat{u}_k = A_{kl}^{-1} \tilde{f}_l
\text{Leading to a linear algebra problem:}
\text{With quadrature notation }
\left( v, u\right)_w^N \approx \displaystyle \int_{-1}^{1} vu w dx
u(x) = \displaystyle\sum_{k=0}^{N-1} \hat{u}_k \phi_k(x)
v = \phi_l

Trial function

Test function

\text{Diagonal (!)} A_{kl}

Shen's modified basis leads to very sparse matrices

3 steps - Sounds easy?

Let's automate!

from fenics import *

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, 'CG', 1)

# Define variational problem
u = TrialFunction(V)
v = TestFunction(V)
f = Constant(-6.0)
a = inner(grad(u), grad(v))*dx
L = f*v*dx

Solve equations with code very similar to mathematics

However

  • Finite accuracy (not spectral)
  • Not optimal for simple tensor product grids or DNS

How to automate?

What do we know about

automating the solution

of PDE'S?

Shenfun

Python package (~3k loc) for automating the spectral Galerkin method

for high performance computing on tensor product grids

A FEniCS like interface with a simple form language

Extremely fast, order optimal, direct solvers for Poisson, Helmholtz and Biharmonic problems (sparse diagonal arrays)

MPI - invisible to the user mpi4py-fft (Lisandro Dalcin)

Fourier - Chebyshev - Legendre

modified basis functions

What else is there?

Chebfun

Spectral collocation, MATLAB, structured, no MPI

Semtex 

Spectral element, MPI, C++/Fortran, 2D unstructured (1 Fourier)

Quite low-level, nowhere near as easily configured as FEniCS for new types of equations.

Very good for Navier-Stokes

Nek5000 (Nekbox)

Spectral element, Fortran, MPI, 3D unstructured

Nektar++

Spectral hp-element, C++, 3D unstructured

Nothing really for fast development of new spectral solvers aimed at high performance computing on supercomputers

Shenfun implementation of 1D Poisson problem

import numpy as np
from shenfun import TestFunction, TrialFunction, inner, div, grad, chebyshev

N = 20
V = chebyshev.ShenDirichletBasis(N, plan=True)
v = TestFunction(V)
u = TrialFunction(V)

# Assemble stiffness matrix
A = inner(v, div(grad(u)))

# Some consistent right hand side
f = Function(V, False)
f[:] = -2

# Assemble right hand side
f_hat = inner(v, f)

# Solve equations and transform spectral solution to real space
f_hat = A.solve(f_hat)
fj = V.backward(f_hat)

x = V.mesh(N)
assert np.allclose(fj, 1-x**2)

Very similar to FEniCS!

just with spectral accuracy

(1) Create basis

(2) Variational forms

(3) Solve system

2D/3D/4D..  Tensor product spaces

Combinations of Fourier and Chebyshev/Legendre bases

u(x,y,z,t) = \displaystyle \sum_{k}\sum_{l}\sum_{m} \hat{u}_{k,l,m}(t) \phi_k(x)\phi_l(y)\phi_m(z)
W(x,y,z) = V(x) \times V(y) \times V(z)

Any Cartesian combination of spaces (ultimately)

With solutions of the form:

Unknown expansion coefficients

Tensor product spaces

Combinations of Fourier and Chebyshev/Legendre bases

from shenfun import *
from mpi4py import MPI

comm = MPI.COMM_WORLD
N = (20, 30)
V0 = chebyshev.ShenDirichletBasis(N[0])
V1 = fourier.R2CBasis(N[1])
T = TensorProductSpace(comm, (V0, V1))
v = TestFunction(T)
u = TrialFunction(T)

A = inner(v, div(grad(u)))

# Or 3D
V2 = fourier.C2CBasis(40)
T3 = TensorProductSpace(comm, (V0, V2, V1))
v = TestFunction(T3)
u = TrialFunction(T3)

A = inner(v, div(grad(u)))

MPI under the hood

using mpi4py-fft

on tensor product grids like:

A simple form language

to make implementation very similar to mathematics

Mathamatical operators act on test, trial, or functions with data 

V = TensorProductSpace(comm, (V0, V1, V2))

# Form arguments:
v = TestFunction(V)
u = TrialFunction(V)
u_ = Function(V, False)

# Forms
du = grad(u_)
d2u = div(du) 
cu = curl(du)
dudx = Dx(u_, 0)
 

Inner products and projections

A = inner(v, u)    # Bilinear form, assembles to matrices
f = inner(v, u_)   # Linear form, assembles to vector
project(div(grad(u_)), V)  # projection to V

Fast transforms

(not in FEniCS)

u_ = V.backward(u_hat, u_)

Any spectral function is defined as

u(x) = \displaystyle\sum_{k=0}^{N-1} \hat{u}_k \phi_k(x)
u(x,y,z) = \displaystyle\sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \sum_{m=0}^{N-1} \hat{u}_{klm} \phi_k(x)\phi_l(y)\phi_m(z)

3D:

1D:

This is computed on the tensor product grid as a backwards transform,

e.g., a discrete Fast Fourier inverse transform

A non-periodic basis uses fast Chebyshev(Legendre) transforms

Fast forward transforms

by L2-projections

u_hat = V.forward(u_, u_hat)
(u_h, v_h) = (Iu, v_h)\quad \forall \, v_h \in V_h
\text{Find } u_h \in V_h, \text{such that}

which is achieved using a forward transform

Implemented with FFTs or fast Chebyshev transforms

Vector spaces

VectorTensorProductSpace

import numpy as np
from shenfun import *

N = 16
K0 = C2CBasis(N)
K1 = C2CBasis(N)
K2 = R2CBasis(N)
T = TensorProductSpace(comm, (K0, K1, K2))

# Scalar function u_
u_ = Function(T, False)
u_hat = Function(T)
u_[:] = np.random.random(u_.shape)
u_hat = T.forward(u_, u_hat)

# Now define vector space to allow computation of grad(u_)
TV = VectorTensorProductSpace([T]*3)
du_hat = project(grad(u_), TV, uh_hat=u_hat)

Nonlinearities

are all treated with pseudospectral techniques

u \times \nabla \times u
u \cdot u
u | u |^2

(1) Compute inverse

u = T.backward(u_hat)

with dealiasing using

3/2-rule (padding) or 2/3-rule

(3) Transform back to spectral space

w = u*u

(2) Compute nonlinearity

w_hat = T.forward(w)
T = TensorProductSpace(comm, (V0, V1, V2))

Complex Ginzburg-Landau

example

\frac{\partial u}{\partial t} = \nabla^2 u + u -(1+1.5i)u |u|^2


2D periodic (x, y) \( \in [-50, 50]^2 \)

Use Fourier in both directions

Weak form, only spatial discretization:

\frac{\partial (v, u)^N}{\partial t} = -(\nabla v, \nabla u)^N + (v, u -(1+1.5i)u |u|^2)^N
(2\pi)^2\frac{\partial \hat{u}}{\partial t} = -(\nabla v, \nabla u)^N + (v, u -(1+1.5i)u |u|^2)^N

Complex Ginzburg-Landau

example

2D periodic (x, y) \( \in [-50, 50]^2 \)

import sympy
from shenfun.fourier.bases import R2CBasis, C2CBasis
from shenfun import inner, grad, TestFunction, TrialFunction, \
    Function, TensorProductSpace

# Use sympy to set up initial condition
x, y = sympy.symbols("x,y")
ue = (x + y)*sympy.exp(-0.03*(x**2+y**2))

N = (200, 200)
K0 = C2CBasis(N[0], domain=(-50., 50.))
K1 = C2CBasis(N[1], domain=(-50., 50.))
T = TensorProductSpace(comm, (K0, K1))
u = TrialFunction(T)
v = TestFunction(T)
U = Function(T, False)

def compute_rhs(rhs, u_hat, U, T):
    rhs = inner(grad(v), -grad(U), output_array=rhs, uh_hat=u_hat)
    rhs /= (2*np.pi)**2
    rhs += u_hat
    rhs -= T.forward((1+1.5j)*U*abs(U)**2, w0)
    return rhs

# And, e.g., Runge-Kutta integrator
#
#

Key code:

(2\pi)^2\frac{\partial \hat{u}}{\partial t} = -(\nabla v, \nabla u)^N + (v, u -(1+1.5i)u |u|^2)^N

Navier Stokes 

Turbulent channel flow

3D tensor product space

2 Fourier bases +

Modified Chebyshev bases for wall normal direction

Kim and Moin modified version of Navier-Stokes:

\mathcal{H}

is convection

Wall normal velocity

Wall normal vorticity

Navier Stokes 

Turbulent channel flow

512\times 1024\times 1024

First fully spectral DNS with 

3D Tensor product mesh:

Shaheen II at KAUST

\text{Re}_{\tau} > 2000

Thank you for your time! 

http://github.com/spectralDNS

Before I go I just want to point you attention towards a conference organised in the memory of my brilliant colleague

 

International Conference on Computational Science and Engineering, Simula October 23-25, 2017

Questions?

Made with Slides.com