Maintainable HPC and Data Science with Python/C++

Automatic animations between code

Dr. Srijith Rajamohan

 

Topics

 

  • C++ vs. Python for Scientific Computing
  • PyData stack: Productivity vs. Performance
  • Numba
  • Eigen
  • Xtensor

Pros

  • Easy to prototype algorithms
  • Compact code
  • Plethora of mature libraries
  • Standard (almost) for Data Science

Cons

  • Native code is slow
  • Parallelization is still second class
  • Ecosystem for debugging is still poor compared to C/C++

Python for Scientific Computing

Pros

  • Mature for Scientific Computing workloads
  • Well-established ecosytem of libraries, tools and best practices
  • Many paradigms for parallelization

Cons

  • Slow to develop, prototyping not as easy as Python/Julia/Matlab
  • Steep learning curve and can seem inaccessible
  • Verbose code making maintainability and debugging difficult

C/C++ for Scientific Computing

  • The Python scientific toolkit of optimized libraries in NumPy, SciPy, Pandas, DAAL4Py etc. - ideal for Data Science
    • These help to mitigate some of Python's speed problem
    • Limitations still exist for scalable computing
  • PySpark for distributed computing
  • Dask for out-of-core computing
  • RAPIDS toolkit for efficient GPU computing
  • TensorFlow, Pytorch, MXNet, Chainer etc. for Deep Learning 

Scientific Computing Landscape in Python

  • PyData stack - NumPy,SciPy,Pandas etc. 
    • Can provide competitive performance while substantially reducing development time
    • Optimized libraries written in C/C++/Fortran for fast computation
  • Benchmark of productivity languages vs. performance languages at https://hal.inria.fr/hal-02879767/document
  • Can be harder to scale
    • Use high-level languages to direct control flow
    • Perform the scalable computation in the performance languages

PyData - Productivity vs. Performance

  • Numba is a Just-In-Time (JIT) compiler that allows you to accelerate Python functions
    • Functions compiled to machine code
      • Use a @jit decorator to configure this
    • Works best with NumPy functions
    • Can break free from the GIL using option 'parallel=True' to the @jit decorator
      • Threading via TBB or OMP, use set_num_threads(N)
      • Use a prange instead of range to parallelize loops
      • Avoid race conditions when writing to the same location - 'parallel=True'
    • Intel SVML, optimized functions - 'fastmath=True'

Accelerating Python with Numba

  • Numba relies on compiling your Python code
  • Two modes for Numba-compiled code
    • 'nopython' mode and 'object' mode
    • 'nopython' mode has no Python interpreter involvement resulting in faster execution
    • 'object' mode only helps with loop-lifting or only running the inner loop in 'nopython' mode
  • 'nopython' mode can fail for a number of reasons
    • Using Python features in your code that are not supported by Numba
    • Numba relies on type inference to figure out what type each variables is, it falls back to object mode when it can't

Compilation in Numba

  • Make any function behave like NumPy ufuncs with Vectorize()
    • Eager compilation uses decorators with signatures
    • Ufuncs get features such as reduce, accumulate and broadcast 
    • Supports single-threaded or multi-threaded CPUs, GPUs with 'target'

Vectorize

@vectorize([float64(float64, float64)], 
           nopython=True, target='cpu')
def f(x, y):
  return x + y
  • Ufuncs operate on vectors by allowing you to write a function that operates on scalars
  • Numba then vectorizes them
  • Create NumPy ufuncs with GUVectorize()
    • Similar to vectorize but operate on multiple elements at a time, think window function
    • Return the result through an argument parameter

GUVectorize

@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res):
    for i in range(x.shape[0]):
        res[i] = x[i] + y

Optimization in Multidimensional Scaling

\textbf{Stress}^2 = \sum_{i=0}^{nrows} \sum_{j=i+1}^{nrows} \biggl( \delta_{HD}( X_i [W] , X_j [W]) - \delta_{LD}( Z_i , Z_j) \biggr)^2
\delta_{HD} (X_i, X_j, W) = (X_{i} [W]) \cdot (X_{j} [W])^T \\ \\

Optimize for the stress function for weights in the Backward MDS step

\delta_{LD} (Z_i, Z_j) = Z_{i} \cdot Z_{j}^T
dum_var = np.array([0.0,0.0]) 
res = numba_svd_derivative(j, self.nrow, self.ncol, self.projection_components, \
                           self.u, self.v, self.Xorig, self.X, d_sq, dum_var)
Derivative[:,:,:] = res
@guvectorize(
["u8, u8, u8, u8, f8[:,:], f8[:,:], f8[:,:], f8[:,:], f8[:], f8[:], f8[:,:,:]"], 
  "(), (), (), (), (nrow, nrow), (ncol, ncol), (nrow, ncol), (nrow, ncol), (nrow), (ndim)
  -> (ncol, nrow, ndim)", nopython=True,  target="cpu")
def numba_svd_derivative(j, nrow, ncol, ncomp, u, v, Xorig, X, d_sq, dum_var, result):
            jval = j
            for j in prange(0, ncol):
                contrib = np.zeros((ncol, ncomp))
                loopindex = min(nrow, ncol)
                for i in range(0,nrow):
                    sigma_v = np.zeros((ncol, ncomp))
                    ...
                contrib = np.dot(v,contrib)
                dX_dw = np.zeros((nrow,ncol))
                dX_dw[:,j] = Xorig[:,j]
                res_elem1 = np.dot(dX_dw, v[:,0:2])
                res_elem2 = np.dot(X, contrib[:,0:2])
                result[j,:,:] = res_elem1 + res_elem2

Accelerate SVD-based Solution with Numba

  • Speed comparison for data of size 48x72:
    • Numba GUVectorize function
      • 'nopython' mode
      • 'object' mode 
    • Regular Python

Speedup with Numba GUVectorize

Python  Numba Object Mode Numba Nopython Mode
0.68 Same as regular Python 0.008

Runtimes of the functions(s)

  • Eigen 
    • Used widely as standalone or wrapped within our packages, e.g. Tensorflow
    • Wide variety of algorithms
    • Fast
  • Xtensor 
    • Similar to numpy
    • Python bindings, easy to transfer data structures between C++ and Python
    • Not widely adopted 

Xtensor vs. Eigen

  • One of the most popular C++ libraries for numerical computing
  • Expressive and intuitive API
  • Broadcasting and lazy evaluation
  • Used by a lot of frameworks (hence fairly reliable) such as
    • Tensorflow
    • Stan
    • Ceres
    • IFOpt etc.

Eigen - Numerical Computing in C++

  • Header-only library, alleviates compilation headaches

Eigen 

clang++ -I /usr/local/include/eigen3/ test_eigen.cxx -o test_eigen
  •  Fast vector and matrix operations 
    • Compile-time-spec containers - up to 4 dimensions 
      • Matrix2d, Matrix3d, Matrix4d
      • Vector2d, Vector3d, Vector4d
    • Operations on matrices and vectors
      • E.g. mat.sum(), mat.prod(), mat.trace() 
    • Block operations on slices of matrices, vectors
      • ​E.g. mat.block(i,j,a,b), mat.row(i), vec.tail(n)
    • Coefficient-wise operations such as
      • E.g. mat.abs(), mat.exp(), mat.log() 

Eigen - Python Productivity in C++?

  • Solutions of systems of equations (small, medium data)
    • A variety of algorithms for LU, QR, Eigenvector/Eigenvalue, SVD decompositions
    • Benchmarks of popular algorithms
    • Dense and sparse matrices
      • Direct and iterative solvers for sparse matrices 
      • E.g. SuperLU, CG, BICGSTAB
      • Some of them require external libraries such as GPL, PaStiX

    

Linear Algebra 

typedef struct{
int highdim;
int lowdim;
int numrows;
} params;

params parameters;
MatrixXd w_ext, X, X_t, z, z_t;

X = load_csv<MatrixXd>("X.csv"); // templatized function
z = load_csv<MatrixXd>("z.csv"); // templatized function
parameters.highdim = X.cols();
parameters.lowdim = z.cols();
parameters.numrows = X.rows();

X_t = X.transpose();
z_t = z.transpose();
z_product.noalias() = z*z_t;
...

MDS - Read Data in Eigen 

double nlopt_unconstrained_function_form1(const double *wd)
{

...
  
 // Loop assignment
  for(i = 0; i < parameters.numrows ; i++)
    {
       
        term1.row(i) = X.row(i).array() * w_ext.row(0).array();        
    }
  
  //term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();

  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  return (res.array()*res.array()).sum();
  
}

MDS - Nlopt Function 

double nlopt_unconstrained_function_form1(const double *wd)
{

...
  
  /*for(i = 0; i < parameters.numrows ; i++)
    {
       term1.row(i) = X.row(i).array() * w_ext.row(0).array();        
    }*/
  
  // Optimized version
  term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();

  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  return (res.array()*res.array()).sum();
  
}

Nlopt Function - Optimized

void nlopt_unconstrained_gradient_form1(const double *wd, double *df_dw)
{

  ...
  term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();
  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  sub_term = 2.0 * res;

  #pragma omp parallel
  {  
  for(i = 0 ; i < parameters.highdim; i++)
  {
     weight = wd[i];
     intermediate = MatrixXd::Zero(parameters.numrows,parameters.numrows);
     for(j = 0; j < parameters.numrows; j++)
     {
        intermediate.row(j) = X(j,i) * X.col(i);
     }
     df_dw[i] = (sub_term.array() * intermediate.array()).sum();  
  }}}

Nlopt Gradient

void Optimized_Compute_SVD_derivative()
{
	...		
contrib.setZero(ncol*ncol, projection_components); // Make one big matrix for all 'ncol' weights
for(i = 0; i < nrow; i++)
{
	sigma_v.setZero(ncol*ncol,projection_components); // Make one big sigma_v matrix for all 'ncol' weights
	for(k = 0; k < k_upper; k++)
	{
		for(l = 0; l < projection_components; l++)
		{
		   if(k == l) continue;
		# block of size 'ncol' x one (or single column), starting at k*ncol, 1
		   sigma_v.block(k*ncol, l, ncol, 1) = -1.0 * Xorig.row(i).transpose().array() *
            (( u(i,k)*v.col(l) + u(i,l)*v.col(k) ) / ( d_sq(k) - d_sq(l))).array();
		}
	}
	contrib += sigma_v;
}
	...
Map<Matrix<double,Dynamic,Dynamic>,0,Stride<Dynamic,Dynamic> > M4(contrib.data() + j, 
ncol, projection_components, Stride<Dynamic,Dynamic>(contrib.rows(),ncol));
res_elem1 = dX_dw * v.leftCols<2>() + X_V * M4;  }}};

Memory Map

void Optimized_Compute_SVD_derivative()
{
	...		
contrib.setZero(ncol*ncol, projection_components); // Make one big matrix for all 'ncol' weights
for(i = 0; i < nrow; i++)
{
	sigma_v.setZero(ncol*ncol,projection_components); // Make one big sigma_v matrix for all 'ncol' weights
	for(k = 0; k < k_upper; k++)
	{
		for(l = 0; l < projection_components; l++)
		{
			if(k == l) continue;
			sigma_v.block(k*ncol, l, ncol, 1) = -1.0 * Xorig.row(i).transpose().array() *(( u(i,k)*v.col(l) + u(i,l)*v.col(k) ) / ( d_sq(k) - d_sq(l))).array();
		}
	}
	contrib += sigma_v;
}
	...
Map<Matrix<double,Dynamic,Dynamic>,0,Stride<Dynamic,Dynamic> > M4(contrib.data() + j, 
ncol, projection_components, Stride<Dynamic,Dynamic>(contrib.rows(),ncol));
res_elem1 = dX_dw * v.leftCols<2>() + X_V * M4;
}}};

Memory Map

MDS - Comparison of C++/Eigen and Python

Dataset Python(s) C++(s) Speedup
70 x 100 0.0896 0.00388 23x
40 x 45 0.01861 0.000624 29x
20 x 25 0.0032 0.000099 32x
  • Xtensor
    • C++ library designed to be similar to Numpy
    • Computing on multi-dimensional arrays
    • Numpy-style broadcasting and lazy evaluation 
  • Xtensor-python

    • Python bindings from the Pybind11 C++ library

    • Can use the NumPy data structures in C++ using the Python Buffer Protocol

Xtensor

Compile Xtensor C++

g++ -I /PATH_TO_ANACONDA/envs/xtensor_env/include -std=c++14 test.cpp -o test
export CPLUS_INCLUDE_PATH=PATH_TO_ANACONDA/envs/xtensor_env/include/:
PATH_TO_EIGEN/eigen3/

g++ -std=c++14 `python3-config --ldflags` -shared -fPIC -Wl,-undefined,dynamic_lookup 
xtensor_python_test.cpp -o xtensor_python_test.so
  • To use from Xtensor-python
    • Compile C++ Xtensor code into a shared object library
    • Note the flags needed
  • To compile C++ Xtensor code

Python caller

Interfacing your C++ Code with Python

import xtensor_python_test as xt

# Call function from C++ library by passing numpy objects
derivative = xt.imds_wrapper(X,z,np.sqrt(weights),u,s,v)
PYBIND11_MODULE(xtensor_python_test, m)
{
    // Must be called before any other lines
    xt::import_numpy();
    m.doc() = "Test module for xtensor python bindings";
    m.def("imds_wrapper", imds_wrapper, "Wrapper for IMDS in C++");
}

Declare the entry point in C++

xt::xarray<double> imds_wrapper(xt::pyarray<double>& X, 
xt::pyarray<double>& z, xt::pyarray<double>& weights)
{
  Matrix_SVD obj(X, weights, 0, 2);
  obj.Compute_SVD_derivative();
}

Xtensor Wrapper in C++

void Optimized_Compute_SVD_derivative()
{

auto contrib = xt::zeros<double>({ncol, projection_components});
auto sigma_v = xt::zeros<double>({ncol, projection_components});
auto Derivative = xt::xarray<double>::from_shape({nrow, ncol, 2});
...

};
void Optimized_Compute_SVD_derivative()
{
	...		
auto temp = xt::tile(s_in,(1,nrow));
temp = temp.reshape(nrow, s_size);
u_in = xt::linalg::dot(u_in, temp);
contrib.fill(0);
for(i = 0; i < nrow; i++)
   {
  	 sigma_v.fill(0); // Make one big sigma_v matrix for all 'ncol' weights
  	 for(k = 0; k < k_upper; k++)
  	 {
  		for(l = 0; l < projection_components; l++)
  		{
  		 if(k == l) continue;
         lhs = xt::view(arr1, xt::range(k*ncol, k*ncol + ncol), xt::range(l, l + 1))
         rhs_term1 = -1.0 * xt::transpose(xt::row(Xorig, i))
         rhs_term2 = u(i, k) *  xt::col(v, l) + u(i, l) * xt::col(v, k) 
         / ( d_sq(k) - d_sq(l) )
         lhs = xt::linalg::dot(rhs_term1, rhs_term2)
            ...

Xtensor Derivative of SVD

  • Computation of Derivative of SVD of weighted MDS
  • Data size of 48x72
  • Less than 4% of time in the initialization routine

Data Transfer Overhead

  • Native Python is slow, but we already knew that!
  • Profile your code
  • Scale up code through incremental complexity
    • Use optimized packages (most people already do this)
    • Numba 
    • Eigen or Xtensor
    • Eigen or Xtensor with multithreading 

Conclusion

MaintainableHPC

By sjster

MaintainableHPC

We've added support for incredibly smooth animations between code blocks. See it in action in this deck!

  • 7,099