Maintainable HPC with Python/C++


Dr. Srijith Rajamohan

Pros

  • Easy to prototype algorithms
  • Compact code
  • Plethora of libraries
  • Standard (almost) for Data Science
  • Mature

Cons

  • Native (pure) Python code is slow
  • Parallelization is still second-class
  • The ecosystem for debugging is still poor compared to C/C++

Python for Scientific Computing

Pros

  • Mature for Scientific Computing workloads
  • Well-established ecosystem of libraries, tools and best practices
  • Many paradigms for parallelization

Cons

  • Slower to develop; prototyping is not as easy as in Python/Julia/MATLAB
  • Steep learning curve and can seem inaccessible
  • Verbose code makes maintenance and debugging difficult

C/C++ for Scientific Computing

  • The Python scientific toolkit of optimized libraries - NumPy, SciPy, Pandas, DAAL4Py, etc. - ideal for Data Science
    • These help to mitigate some of Python's speed problems
    • Limitations still exist for scalable computing
  • Dask for out-of-core computing (see the sketch after this list)
  • RAPIDS toolkit for efficient GPU computing
  • TensorFlow, PyTorch, MXNet, Chainer, etc. for Deep Learning
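
A minimal Dask sketch of the out-of-core model (the array size and chunking are arbitrary choices for illustration):

import dask.array as da

# Build a 20,000 x 20,000 array lazily, in 1,000 x 1,000 chunks,
# so the full array never has to fit in memory at once
x = da.random.random((20000, 20000), chunks=(1000, 1000))

# Operations only build a task graph; nothing is computed yet
y = (x + x.T).mean(axis=0)

# compute() executes the graph, streaming chunks through memory
print(y.compute().shape)  # (20000,)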

Scientific Computing Landscape in Python

  • PyData stack - NumPy, SciPy, Pandas, etc.
    • Can provide competitive performance while substantially reducing development time
    • Optimized libraries written in C/C++/Fortran for fast computation
  • Benchmark of productivity languages vs. performance languages at https://hal.inria.fr/hal-02879767/document
  • Can be harder to scale
    • Use high-level languages to direct control flow
    • Perform the scalable computation in the performance languages

PyData - Productivity vs. Performance

  •  Xarray - labeled arrays based on the NetCDF data model - out-of-core execution through its Dask integration (see the sketch below)
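
A minimal xarray sketch (the dimension names and data are invented for illustration):

import numpy as np
import xarray as xr

# Dimensions are named, so reductions are written by name, not axis index
temps = xr.DataArray(np.random.rand(4, 3),
                     dims=("time", "city"),
                     coords={"city": ["Oslo", "Lima", "Pune"]})
print(temps.mean(dim="time"))

# With chunks=..., xarray delegates computation to Dask for out-of-core work
# ds = xr.open_dataset("data.nc", chunks={"time": 100})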


Xarray

  • CuPy - GPU acceleration in Python (see the sketch below)
  • cuDNN and NCCL - use the binaries provided by NVIDIA


GPU Computing using Python

conda install -c conda-forge cupy
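
A minimal CuPy sketch (assumes a CUDA-capable GPU and the package installed as above):

import numpy as np
import cupy as cp

x_cpu = np.random.rand(1000, 1000)
x_gpu = cp.asarray(x_cpu)            # copy the array to GPU memory

# The NumPy-style API runs on the GPU
y_gpu = cp.dot(x_gpu, x_gpu.T) + cp.sin(x_gpu)

y_cpu = cp.asnumpy(y_gpu)            # copy the result back to the host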

Topics


  • C++ vs. Python for Scientific Computing
  • PyData stack: Productivity vs. Performance
  • Numba
  • Eigen
  • Xtensor
  • Numba is a Just-In-Time (JIT) compiler that lets you accelerate Python functions (a minimal sketch follows this list)
    • Functions are compiled to machine code
      • Use the @jit decorator to configure this
    • Can break free from the GIL by passing 'parallel=True' to the @jit decorator
      • Threading via TBB or OpenMP; control the thread count with set_num_threads(N)
    • Works best with NumPy functions
    • Use prange instead of range to parallelize loops
      • Avoid race conditions when writing to the same location under 'parallel=True'
    • Intel SVML optimized functions - enabled with 'fastmath=True'
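
A minimal sketch of these options (the row-norm kernel is invented for illustration; @njit is shorthand for @jit(nopython=True)):

import numpy as np
from numba import njit, prange

# parallel=True lifts the GIL for the prange loop; fastmath=True permits
# relaxed floating point and SVML-optimized math functions
@njit(parallel=True, fastmath=True)
def row_norms(a):
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):      # prange, not range, to parallelize
        s = 0.0
        for j in range(a.shape[1]):
            s += a[i, j] * a[i, j]
        out[i] = np.sqrt(s)           # each i writes its own slot: no race
    return out

print(row_norms(np.random.rand(10000, 100))[:3])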

Accelerating Python with Numba

  • Numba relies on compiling your Python code
  • Two modes for Numba-compiled code
    • 'nopython' mode and 'object' mode
    • 'nopython' mode has no Python interpreter involvement, resulting in faster execution
    • 'object' mode only helps with loop-lifting, i.e. running just the inner loops in 'nopython' mode
  • 'nopython' mode can fail for a number of reasons, falling back to 'object' mode (see the sketch after this list)
    • Using Python features in your code that are not supported by Numba
    • Numba relies on type inference to figure out what type each variable is; it falls back to 'object' mode when it cannot
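
A small sketch contrasting the two modes (the failing case is contrived for illustration):

from numba import jit

# nopython=True raises a TypingError at compile time instead of silently
# falling back to the slower object mode
@jit(nopython=True)
def arith_sum(n):
    s = 0
    for i in range(n):
        s += i
    return s

print(arith_sum(1_000_000))

# Something Numba cannot type, e.g. a function returning the built-in
# 'open', would fail under nopython=True and need object mode:
# @jit(nopython=True)
# def bad():
#     return open        # untypable Python object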

Limitations of Numba

  • Make any function behave like a NumPy ufunc with @vectorize
    • Eager compilation uses decorators with signatures
    • Ufuncs gain features such as reduce, accumulate and broadcasting
    • Supports single-threaded or multi-threaded CPUs, and GPUs, via the 'target' option

Vectorize

from numba import vectorize, float64

@vectorize([float64(float64, float64)],
           nopython=True, target='cpu')
def f(x, y):
    return x + y
  • Ufuncs let you write a function that operates on scalars and apply it across whole arrays
  • Numba then vectorizes it
  • Create generalized NumPy ufuncs with @guvectorize
    • Similar to @vectorize, but operates on multiple elements at a time - think window functions
    • The result is returned through an output argument

GUVectorize

from numba import guvectorize, int64

@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res):
    # Add the scalar y to every element of x; the result is written
    # into the output argument res
    for i in range(x.shape[0]):
        res[i] = x[i] + y
import numpy as np
from numba import guvectorize, prange

# Caller: dum_var only exists to bind the output dimension '(ndim)'
dum_var = np.array([0.0, 0.0])
res = numba_svd_derivative(j, self.nrow, self.ncol, self.projection_components,
                           self.u, self.v, self.Xorig, self.X, d_sq, dum_var)
Derivative[:,:,:] = res

@guvectorize(
    ["u8, u8, u8, u8, f8[:,:], f8[:,:], f8[:,:], f8[:,:], f8[:], f8[:], f8[:,:,:]"],
    "(), (), (), (), (nrow, nrow), (ncol, ncol), (nrow, ncol), (nrow, ncol), (nrow), (ndim) -> (ncol, nrow, ndim)",
    nopython=True, target="cpu")
def numba_svd_derivative(j, nrow, ncol, ncomp, u, v, Xorig, X, d_sq, dum_var, result):
    jval = j
    for j in prange(0, ncol):
        contrib = np.zeros((ncol, ncomp))
        loopindex = min(nrow, ncol)
        for i in range(0, nrow):
            sigma_v = np.zeros((ncol, ncomp))
            ...
        contrib = np.dot(v, contrib)
        dX_dw = np.zeros((nrow, ncol))
        dX_dw[:, j] = Xorig[:, j]
        res_elem1 = np.dot(dX_dw, v[:, 0:2])
        res_elem2 = np.dot(X, contrib[:, 0:2])
        result[j, :, :] = res_elem1 + res_elem2

Accelerating SVD with Numba

  • Speedup of the Numba GUVectorize function vs. regular Python vs. Numba object mode
  • Data size: 48 x 72


Speedup with Numba GUVectorize

Python (s)    Numba object mode (s)     GUVectorize (s)
0.68          Same as regular Python    0.008
  • One of the most popular C++ libraries for numerical computing
  • Expressive and intuitive API
  • Broadcasting and lazy evaluation
  • Fast!
  • Used by many frameworks (hence fairly reliable), such as
    • TensorFlow
    • Stan
    • Ceres
    • IFOpt, etc.

Eigen - Numerical Computing in C++

  • Header-only library, which alleviates compilation headaches

Eigen 

clang++ -I/usr/local/include/eigen3/ test_eigen.cxx -o test_eigen
  • Fast vector and matrix operations (see the sketch after this list)
    • Fixed-size containers specified at compile time - up to size 4
      • Matrix2d, Matrix3d, Matrix4d
      • Vector2d, Vector3d, Vector4d
    • Operations on matrices and vectors
      • E.g. mat.sum(), mat.prod(), mat.trace()
    • Block operations on slices of matrices and vectors
      • E.g. mat.block(i,j,a,b), mat.row(i), vec.tail(n)
    • Coefficient-wise operations via the .array() interface
      • E.g. mat.array().abs(), mat.array().exp(), mat.array().log()
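
A minimal Eigen sketch of these operations (values are arbitrary):

#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;

int main()
{
  Matrix3d mat = Matrix3d::Random();          // fixed-size 3x3 matrix
  Vector3d vec(1.0, 2.0, 3.0);

  std::cout << mat.sum() << " " << mat.trace() << "\n";    // reductions
  std::cout << mat.block(0, 1, 2, 2) << "\n";              // 2x2 block at (0,1)
  std::cout << mat.row(1) << "\n" << vec.tail(2) << "\n";  // slices

  // Coefficient-wise operations via the array expression
  std::cout << mat.array().abs().exp().sum() << "\n";
  return 0;
}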

Eigen - Python Productivity in C++?

  • Solving systems of equations (small and medium data) - see the sketch after this list
    • A variety of algorithms for LU, QR, eigenvalue/eigenvector and SVD decompositions
    • Benchmarks of popular algorithms
    • Dense and sparse matrices
      • Direct and iterative solvers for sparse matrices
      • E.g. SuperLU, CG, BiCGSTAB
      • Some of them require external libraries, e.g. SuperLU, PaStiX (some of which are GPL-licensed)
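
A minimal sketch of a dense solve and an SVD in Eigen (sizes are arbitrary):

#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;

int main()
{
  // Solve Ax = b with column-pivoting Householder QR, a robust default
  MatrixXd A = MatrixXd::Random(50, 50);
  VectorXd b = VectorXd::Random(50);
  VectorXd x = A.colPivHouseholderQr().solve(b);
  std::cout << "residual: " << (A * x - b).norm() << "\n";

  // SVD with thin U and V factors
  JacobiSVD<MatrixXd> svd(A, ComputeThinU | ComputeThinV);
  std::cout << "sigma_max: " << svd.singularValues()(0) << "\n";
  return 0;
}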


Linear Algebra 

typedef struct {
  int highdim;
  int lowdim;
  int numrows;
} params;

params parameters;
MatrixXd w_ext;
MatrixXd X, z, X_t, z_t, z_product;

X = load_csv<MatrixXd>("X.csv"); // templatized function
z = load_csv<MatrixXd>("z.csv"); // templatized function
parameters.highdim = X.cols();
parameters.lowdim  = z.cols();
parameters.numrows = X.rows();

X_t = X.transpose();
z_t = z.transpose();
z_product.noalias() = z * z_t;   // noalias() avoids a temporary
...

Eigen for MDS Optimization

double nlopt_unconstrained_function_form1(const double *wd)
{
  ...

  // Loop assignment: scale each row of X by the weight vector
  for(i = 0; i < parameters.numrows; i++)
  {
    term1.row(i) = X.row(i).array() * w_ext.row(0).array();
  }

  //term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();

  amat.noalias() = term1 * X_t;
  res.noalias() = amat - z_product;
  return (res.array() * res.array()).sum();
}

NLopt Function

double nlopt_unconstrained_function_form1(const double *wd)
{
  ...

  /*for(i = 0; i < parameters.numrows; i++)
  {
    term1.row(i) = X.row(i).array() * w_ext.row(0).array();
  }*/

  // Optimized version: one vectorized expression instead of the row loop
  term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();

  amat.noalias() = term1 * X_t;
  res.noalias() = amat - z_product;
  return (res.array() * res.array()).sum();
}

NLopt Function - Optimized

void nlopt_unconstrained_gradient_form1(const double *wd, double *df_dw)
{
  ...
  term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();
  amat.noalias() = term1 * X_t;
  res.noalias() = amat - z_product;
  sub_term = 2.0 * res;

  // Parallelize the outer loop; loop temporaries are private to each
  // thread to avoid data races
  #pragma omp parallel for private(i, j, weight, intermediate)
  for(i = 0; i < parameters.highdim; i++)
  {
    weight = wd[i];
    intermediate = MatrixXd::Zero(parameters.numrows, parameters.numrows);
    for(j = 0; j < parameters.numrows; j++)
    {
      intermediate.row(j) = X(j,i) * X.col(i);
    }
    df_dw[i] = (sub_term.array() * intermediate.array()).sum();
  }
}

NLopt Gradient

MDS - Comparison of C++/Eigen and Python

Dataset    Python (s)   C++ (s)     Speedup
70 x 100   0.0896       0.00388     23x
40 x 45    0.01861      0.000624    29x
20 x 25    0.0032       0.000099    32x
  • Xtensor
    • C++ library designed to feel like NumPy
    • Computing on multi-dimensional arrays
    • NumPy-style broadcasting and lazy evaluation (see the sketch after this list)
  • Xtensor-python
    • Python bindings via the pybind11 C++ library
    • Can use NumPy data structures in C++ through the Python Buffer Protocol
Xtensor

Compile Xtensor C++

g++ -I /PATH_TO_ANACONDA/envs/xtensor_env/include -std=c++17 test.cpp -o test

export CPLUS_INCLUDE_PATH=PATH_TO_ANACONDA/envs/xtensor_env/include/:PATH_TO_EIGEN/eigen3/

g++ -std=c++14 `python3-config --ldflags` -shared -fPIC -Wl,-undefined,dynamic_lookup \
    xtensor_python_test.cpp -o xtensor_python_test.so
  • Compile C++ Xtensor code into a shared object library
  • Note the flags needed

Python caller

Interfacing your C++ Code with Python

import xtensor_python_test as xt

# Call function from C++ library by passing numpy objects
derivative = xt.imds_wrapper(X,z,np.sqrt(weights),u,s,v)
PYBIND11_MODULE(xtensor_python_test, m)
{
    xt::import_numpy();
    m.doc() = "Test module for xtensor python bindings";
    m.def("imds_wrapper", imds_wrapper, "Wrapper for IMDS in C++");
}

Declare the entry point in C++

xt::xarray<double> imds_wrapper(xt::pyarray<double>& X, xt::pyarray<double>& z, xt::pyarray<double>& weights)
{
  Matrix_SVD obj(X, weights, 0, 2);
  obj.Compute_SVD_derivative();
  // ... (the return of the computed derivative is elided on the slide)
}

Xtensor Wrapper in C++

void Optimized_Compute_SVD_derivative()
{
  // xt::zeros is lazy; assigning to xt::xarray materializes a writable array
  xt::xarray<double> contrib = xt::zeros<double>({ncol, projection_components});
  xt::xarray<double> sigma_v = xt::zeros<double>({ncol, projection_components});
  auto Derivative = xt::xarray<double>::from_shape({nrow, ncol, 2});
  ...
}
void Optimized_Compute_SVD_derivative()
{
  ...
  contrib.setZero(ncol*ncol, projection_components); // Make one big matrix for all 'ncol' weights
  for(i = 0; i < nrow; i++)
  {
    sigma_v.setZero(ncol*ncol, projection_components); // Make one big sigma_v matrix for all 'ncol' weights
    for(k = 0; k < k_upper; k++)
    {
      for(l = 0; l < projection_components; l++)
      {
        if(k == l) continue;
        // Block of size 'ncol' x 1 (a single column), starting at (k*ncol, l)
        sigma_v.block(k*ncol, l, ncol, 1) = -1.0 * Xorig.row(i).transpose().array() *
            (( u(i,k)*v.col(l) + u(i,l)*v.col(k) ) / ( d_sq(k) - d_sq(l) )).array();
      }
    }
    contrib += sigma_v;
  }
  ...
  // Strided Map reinterprets a slice of 'contrib' as an
  // ncol x projection_components matrix without copying
  Map<Matrix<double,Dynamic,Dynamic>,0,Stride<Dynamic,Dynamic> > M4(contrib.data() + j,
      ncol, projection_components, Stride<Dynamic,Dynamic>(contrib.rows(), ncol));
  res_elem1 = dX_dw * v.leftCols<2>() + X_V * M4;
}

Eigen Derivative of SVD


void Optimized_Compute_SVD_derivative()
{
  ...
  xt::xarray<double> temp = xt::tile(s_in, {1, nrow});
  temp.reshape({nrow, s_size});
  u_in = xt::linalg::dot(u_in, temp);
  contrib.fill(0);
  for(i = 0; i < nrow; i++)
  {
    sigma_v.fill(0); // Make one big sigma_v matrix for all 'ncol' weights
    for(k = 0; k < k_upper; k++)
    {
      for(l = 0; l < projection_components; l++)
      {
        if(k == l) continue;
        auto lhs = xt::view(arr1, xt::range(k*ncol, k*ncol + ncol), xt::range(l, l + 1));
        auto rhs_term1 = -1.0 * xt::transpose(xt::row(Xorig, i));
        auto rhs_term2 = (u(i, k) * xt::col(v, l) + u(i, l) * xt::col(v, k))
                         / (d_sq(k) - d_sq(l));
        lhs = xt::linalg::dot(rhs_term1, rhs_term2);
        ...

Xtensor Derivative of SVD

  • Problem - computing the derivative of the SVD in weighted MDS
  • Data size: 48 x 72
  • Initialization - 0.019845 s
  • Computation of the derivative - 0.459772 s

Xtensor - Problem Specification

  • Eigen
    • Used widely, standalone or wrapped within other packages, e.g. TensorFlow
    • Wide variety of algorithms
    • Fast
  • Xtensor
    • Similar to NumPy
    • Python bindings make it easy to transfer data structures between C++ and Python
    • Not widely adopted

Xtensor vs. Eigen
