Maintainable HPC with Python/C++

Automatic animations between code

Dr. Srijith Rajamohan

Pros

  • Easy to prototype algorithms
  • Compact code
  • Plethora of libraries
  • Standard(almost) for Data Science
  • Mature

Cons

  • Native code is slow
  • Parallelization is still second class
  • Ecosystem for debugging is still poor compared to C/C++

Python for Scientific Computing

Pros

  • Mature for Scientific Computing workloads
  • Well-established ecosytem of libraries, tools and best practices
  • Many paradigms for parallelization

Cons

  • Slow to develop, prototyping not as easy as Python/Julia/Matlab
  • Steep learning curve and can seem inaccessible
  • Verbose code making maintainability and debugging difficult

C/C++ for Scientific Computing

  • The Python scientific toolkit of optimized libraries in NumPy, SciPy, Pandas, DAAL4Py etc. - ideal for Data Science
    • These help to mitigate some of Python's speed problem
    • Limitations still exist for scalable computing
  • Dask for out-of-core computing
  • RAPIDS toolkit for efficient GPU computing
  • TensorFlow, Pytorch, MXNet, Chainer etc. for Deep Learning 

Scientific Computing Landscape in Python

  • PyData stack - NumPy,SciPy,Pandas etc. 
    • Can provide competitive performance while substantially reducing development time
    • Optimized libraries written in C/C++/Fortran for fast computation
  • Benchmark of productivity languages vs. performance languages at https://hal.inria.fr/hal-02879767/document
  • Can be harder to scale
    • Use high-level languages to direct control flow
    • Perform the scalable computation in the performance languages

PyData - Productivity vs. Performance

  •  Xarray - uses NetCDF - out-of-core performance extension with Dask

     

Xarray

  • CuPy - GPU acceleration in Python 
  • cuDNN and NCCL - use the binary provided by Nvidia

    

GPU Computing using Python

conda install -c conda-forge cupy

Topics

 

  • C++ vs. Python for Scientific Computing
  • PyData stack: Productivity vs. Performance
  • Numba
  • Eigen
  • Xtensor
  • Numba is a Just-In-Time (JIT) compiler that allows you to accelerate Python functions 
    • Functions compiled to machine code
      • Use a @jit decorator to configure this
    • Can break free from the GIL using option 'parallel=True' to the @jit decorator
      • Threading via TBB or OMP, use set_num_threads(N)
    • Works best with NumPy functions 
    • Use a prange instead of range to parallelize loops
      • Avoid race conditions when writing to the same location - 'parallel=True'
    • Intel SVML, optimized functions - 'fastmath=True'

Accelerating Python with Numba

  • Numba relies on compiling your Python code
  • Two modes for Numba-compiled code
    • 'nopython' mode and 'object' mode
    • 'nopython' mode has no Python interpreter involvement resulting in faster execution
    • 'object' mode only helps with loop-lifting or only running the inner loop in 'nopython' mode
  • 'nopython' mode can fail for a number of reasons 
    • Using Python features in your code that are not supported by Numba
    • Numba relies on type inference to figure out what type each variables is, it falls back to object mode when it can't
    • It can fall back to 'object' mode

Limitations of Numba

  • Make any function behave like NumPy ufuncs with Vectorize()
    • Eager compilation uses decorators with signatures
    • Ufuncs get features such as reduce, accumulate and broadcast 
    • Supports single-threaded or multi-threaded CPUs, GPUs with 'target'

Vectorize

@vectorize([float64(float64, float64)], 
           nopython=True, target='cpu')
def f(x, y):
  return x + y
  • Ufuncs operate on vectors by allowing you to write a function that operates on scalars
  • Numba then vectorizes them
  • Create NumPy ufuncs with GUVectorize()
    • Similar to vectorize but operate on multiple elements at a time, think window function
    • Return the result through an argument parameter

GUVectorize

@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res):
    for i in range(x.shape[0]):
        res[i] = x[i] + y
dum_var = np.array([0.0,0.0]) 
res = numba_svd_derivative(j, self.nrow, self.ncol, self.projection_components, \
                           self.u, self.v, self.Xorig, self.X, d_sq, dum_var)
Derivative[:,:,:] = res
@guvectorize(
["u8, u8, u8, u8, f8[:,:], f8[:,:], f8[:,:], f8[:,:], f8[:], f8[:], f8[:,:,:]"], 
  "(), (), (), (), (nrow, nrow), (ncol, ncol), (nrow, ncol), (nrow, ncol), (nrow), (ndim)
  -> (ncol, nrow, ndim)", nopython=True,  target="cpu")
def numba_svd_derivative(j, nrow, ncol, ncomp, u, v, Xorig, X, d_sq, dum_var, result):
            jval = j
            for j in prange(0, ncol):
                contrib = np.zeros((ncol, ncomp))
                loopindex = min(nrow, ncol)
                for i in range(0,nrow):
                    sigma_v = np.zeros((ncol, ncomp))
                    ...
                contrib = np.dot(v,contrib)
                dX_dw = np.zeros((nrow,ncol))
                dX_dw[:,j] = Xorig[:,j]
                res_elem1 = np.dot(dX_dw, v[:,0:2])
                res_elem2 = np.dot(X, contrib[:,0:2])
                result[j,:,:] = res_elem1 + res_elem2

Accelerating SVD with Numba

  • Speedup of Numba GUVectorize function vs. regular Python vs. object mode Numba 
  • Data size: 48 x 72

 

Speedup with Numba GUVectorize

Python(s) Numba Object Mode(s) GUVectorize
0.68 Same as regular Python 0.008
  • One of the most popular C++ libraries for numerical computing
  • Expressive and intuitive API
  • Broadcasting and lazy evaluation
  • Fast!
  • Used by a lot of frameworks (hence fairly reliable) such as
    • Tensorflow
    • Stan
    • Ceres
    • IFOpt etc.

Eigen - Numerical Computing in C++

  • Header-only library, alleviates compilation headaches

Eigen 

clang++ -I/usr/local/include/eigen3/ test_eigen.cxx -o test_eigen
  •  Fast vector and matrix operations 
    • Compile-time-spec containers - up to 4 dimensions 
      • Matrix2d, Matrix3d, Matrix4d
      • Vector2d, Vector3d, Vector4d
    • Operations on matrices and vectors
      • E.g. mat.sum(), mat.prod(), mat.trace() 
    • Block operations on slices of matrices, vectors
      • ​E.g. mat.block(i,j,a,b), mat.row(i), vec.tail(n)
    •  Coefficient-wise operations such as
      • E.g. mat.abs(), mat.exp(), mat.log() 

Eigen - Python Productivity in C++?

  • Solutions of systems of equations (small, medium data)
    • A variety of algorithms for LU, QR, Eigenvector/Eigenvalue, SVD decompositions
    • Benchmarks of popular algorithms
    • Dense and sparse matrices
      • Direct and iterative solvers for sparse matrices 
      • E.g. SuperLU, CG, BICGSTAB
      • Some of them require external libraries such as GPL, PaStiX

    

Linear Algebra 

typedef struct{
int highdim;
int lowdim;
int numrows;
} params;

params parameters;
MatrixXd w_ext;

X = load_csv<MatrixXd>("X.csv"); // templatized function
z = load_csv<MatrixXd>("z.csv"); // templatized function
parameters.highdim = X.cols();
parameters.lowdim = z.cols();
parameters.numrows = X.rows();

X_t = X.transpose();
z_t = z.transpose();
z_product.noalias() = z*z_t;
...

Eigen for MDS Optimization

double nlopt_unconstrained_function_form1(const double *wd)
{

...
  
 // Loop assignment
  for(i = 0; i < parameters.numrows ; i++)
    {
       
        term1.row(i) = X.row(i).array() * w_ext.row(0).array();        
    }
  
  //term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();

  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  return (res.array()*res.array()).sum();
  
}

Nlopt Function 

double nlopt_unconstrained_function_form1(const double *wd)
{

...
  
  /*for(i = 0; i < parameters.numrows ; i++)
    {
       term1.row(i) = X.row(i).array() * w_ext.row(0).array();        
    }*/
  
  // Optimized version
  term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();

  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  return (res.array()*res.array()).sum();
  
}

Nlopt Function - Optimized

void nlopt_unconstrained_gradient_form1(const double *wd, double *df_dw)
{

  ...
  term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();
  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  sub_term = 2.0 * res;

  #pragma omp parallel
  {  
  for(i = 0 ; i < parameters.highdim; i++)
  {
     weight = wd[i];
     intermediate = MatrixXd::Zero(parameters.numrows,parameters.numrows);
     for(j = 0; j < parameters.numrows; j++)
     {
        intermediate.row(j) = X(j,i) * X.col(i);
     }
     df_dw[i] = (sub_term.array() * intermediate.array()).sum();  
  }}}

Nlopt Gradient

MDS - Comparison of C++/Eigen and Python

Dataset Python(s) C++(s) Speedup
70 x 100 0.0896 0.00388 23x
40 x 45 0.01861 0.000624 29x
20 x 25 0.0032 0.000099 32x
  • Xtensor
    • C++ library designed to be similar to Numpy
    • Computing on multi-dimensional arrays
    • Numpy-style broadcasting and lazy evaluation 
  • Xtensor-python

    • Python bindings from the Pybind11 C++ library

    • Can use the NumPy data structures in C++ using the Python Buffer Protocol

Xtensor

Compile Xtensor C++

g++ -I /PATH_TO_ANACONDA/envs/xtensor_env/include -std=c++17 test.cpp -o test
export CPLUS_INCLUDE_PATH=PATH_TO_ANACONDA/envs/xtensor_env/include/:
PATH_TO_EIGEN/eigen3/

g++ -std=c++14 `python3-config --ldflags` -shared -fPIC -Wl,-undefined,dynamic_lookup 
xtensor_python_test.cpp -o xtensor_python_test.so
  • Compile C++ Xtensor code into a shared object library
  • Note the flags needed

Python caller

Interfacing your C++ Code with Python

import xtensor_python_test as xt

# Call function from C++ library by passing numpy objects
derivative = xt.imds_wrapper(X,z,np.sqrt(weights),u,s,v)
PYBIND11_MODULE(xtensor_python_test, m)
{
    xt::import_numpy();
    m.doc() = "Test module for xtensor python bindings";
    m.def("imds_wrapper", imds_wrapper, "Wrapper for IMDS in C++");
}

Declare the entry point in C++

xt::xarray<double> imds_wrapper(xt::pyarray<double>& X, xt::pyarray<double>& z, xt::pyarray<double>& weights)
{
  Matrix_SVD obj(X, weights, 0, 2);
  obj.Compute_SVD_derivative();
}

Xtensor Wrapper in C++

void Optimized_Compute_SVD_derivative()
{

auto contrib = xt::zeros<double>({ncol, projection_components});
auto sigma_v = xt::zeros<double>({ncol, projection_components});
auto Derivative = xt::xarray<double>::from_shape({nrow, ncol, 2});
...

};
void Optimized_Compute_SVD_derivative()
{
	...		
contrib.setZero(ncol*ncol, projection_components); // Make one big matrix for all 'ncol' weights
for(i = 0; i < nrow; i++)
{
	sigma_v.setZero(ncol*ncol,projection_components); // Make one big sigma_v matrix for all 'ncol' weights
	for(k = 0; k < k_upper; k++)
	{
		for(l = 0; l < projection_components; l++)
		{
		   if(k == l) continue;
		# block of size 'ncol' x one (or single column), starting at k*ncol, 1
		   sigma_v.block(k*ncol, l, ncol, 1) = -1.0 * Xorig.row(i).transpose().array() *
            (( u(i,k)*v.col(l) + u(i,l)*v.col(k) ) / ( d_sq(k) - d_sq(l))).array();
		}
	}
	contrib += sigma_v;
}
	...
Map<Matrix<double,Dynamic,Dynamic>,0,Stride<Dynamic,Dynamic> > M4(contrib.data() + j, 
ncol, projection_components, Stride<Dynamic,Dynamic>(contrib.rows(),ncol));
res_elem1 = dX_dw * v.leftCols<2>() + X_V * M4;  }}};

Eigen Derivative of SVD

void Optimized_Compute_SVD_derivative()
{
	...		
contrib.setZero(ncol*ncol, projection_components); // Make one big matrix for all 'ncol' weights
for(i = 0; i < nrow; i++)
{
	sigma_v.setZero(ncol*ncol,projection_components); // Make one big sigma_v matrix for all 'ncol' weights
	for(k = 0; k < k_upper; k++)
	{
		for(l = 0; l < projection_components; l++)
		{
			if(k == l) continue;
			sigma_v.block(k*ncol, l, ncol, 1) = -1.0 * Xorig.row(i).transpose().array() *(( u(i,k)*v.col(l) + u(i,l)*v.col(k) ) / ( d_sq(k) - d_sq(l))).array();
		}
	}
	contrib += sigma_v;
}
	...
Map<Matrix<double,Dynamic,Dynamic>,0,Stride<Dynamic,Dynamic> > M4(contrib.data() + j, 
ncol, projection_components, Stride<Dynamic,Dynamic>(contrib.rows(),ncol));
res_elem1 = dX_dw * v.leftCols<2>() + X_V * M4;
}}};

Eigen Derivative of SVD

void Optimized_Compute_SVD_derivative()
{
	...		
auto temp = xt::tile(s_in,(1,nrow));
temp = temp.reshape(nrow, s_size);
u_in = xt::linalg::dot(u_in, temp);
contrib.fill(0);
for(i = 0; i < nrow; i++)
   {
  	 sigma_v.fill(0); // Make one big sigma_v matrix for all 'ncol' weights
  	 for(k = 0; k < k_upper; k++)
  	 {
  		for(l = 0; l < projection_components; l++)
  		{
  		 if(k == l) continue;
         lhs = xt::view(arr1, xt::range(k*ncol, k*ncol + ncol), xt::range(l, l + 1))
         rhs_term1 = -1.0 * xt::transpose(xt::row(Xorig, i))
         rhs_term2 = u(i, k) *  xt::col(v, l) + u(i, l) * xt::col(v, k) 
         / ( d_sq(k) - d_sq(l) )
         lhs = xt::linalg::dot(rhs_term1, rhs_term2)
            ...

Xtensor Derivative of SVD

  • Problem - Computation of Derivative of SVD of weighted MDS
  • Data size of 48x72
  • Initialization - 0.019845s
  • Computation of Derivative - 0.459772s

Xtensor - Problem Specification

  • Eigen 
    • Used widely as standalone or wrapped within our packages, e.g. Tensorflow
    • Wide variety of algorithms
    • Fast
  • Xtensor 
    • Similar to numpy
    • Python bindings, easy to transfer data structures between C++ and Python
    • Not widely adopted 

Xtensor vs. Eigen

MaintainableHPC

By sjster

MaintainableHPC

We've added support for incredibly smooth animations between code blocks. See it in action in this deck!

  • 168