Maintainable HPC and Data Science with Python/C++

Automatic animations between code

Dr. Srijith Rajamohan

Topics

C++ vs. Python for Scientific Computing
PyData stack: Productivity vs. Performance
Numba
Eigen
Xtensor

Pros

Easy to prototype algorithms
Compact code
Plethora of mature libraries
Standard (almost) for Data Science

Cons

Native code is slow
Parallelization is still second class
Ecosystem for debugging is still poor compared to C/C++

Python for Scientific Computing

Pros

Mature for Scientific Computing workloads
Well-established ecosytem of libraries, tools and best practices
Many paradigms for parallelization

Cons

Slow to develop, prototyping not as easy as Python/Julia/Matlab
Steep learning curve and can seem inaccessible
Verbose code making maintainability and debugging difficult

C/C++ for Scientific Computing

The Python scientific toolkit of optimized libraries in NumPy, SciPy, Pandas, DAAL4Py etc. - ideal for Data Science
- These help to mitigate some of Python's speed problem
- Limitations still exist for scalable computing
PySpark for distributed computing
Dask for out-of-core computing
RAPIDS toolkit for efficient GPU computing
TensorFlow, Pytorch, MXNet, Chainer etc. for Deep Learning

Scientific Computing Landscape in Python

PyData stack - NumPy,SciPy,Pandas etc.
- Can provide competitive performance while substantially reducing development time
- Optimized libraries written in C/C++/Fortran for fast computation
Benchmark of productivity languages vs. performance languages at https://hal.inria.fr/hal-02879767/document
Can be harder to scale
- Use high-level languages to direct control flow
- Perform the scalable computation in the performance languages

PyData - Productivity vs. Performance

Numba is a Just-In-Time (JIT) compiler that allows you to accelerate Python functions
- Functions compiled to machine code
  - Use a @jit decorator to configure this
- Works best with NumPy functions
- Can break free from the GIL using option 'parallel=True' to the @jit decorator
  - Threading via TBB or OMP, use set_num_threads(N)
  - Use a prange instead of range to parallelize loops
  - Avoid race conditions when writing to the same location - 'parallel=True'
- Intel SVML, optimized functions - 'fastmath=True'

Accelerating Python with Numba

Numba relies on compiling your Python code
Two modes for Numba-compiled code
- 'nopython' mode and 'object' mode
- 'nopython' mode has no Python interpreter involvement resulting in faster execution
- 'object' mode only helps with loop-lifting or only running the inner loop in 'nopython' mode
'nopython' mode can fail for a number of reasons
- Using Python features in your code that are not supported by Numba
- Numba relies on type inference to figure out what type each variables is, it falls back to object mode when it can't

Compilation in Numba

Make any function behave like NumPy ufuncs with Vectorize()
- Eager compilation uses decorators with signatures
- Ufuncs get features such as reduce, accumulate and broadcast
- Supports single-threaded or multi-threaded CPUs, GPUs with 'target'

Vectorize

@vectorize([float64(float64, float64)], 
           nopython=True, target='cpu')
def f(x, y):
  return x + y

Ufuncs operate on vectors by allowing you to write a function that operates on scalars
Numba then vectorizes them

Create NumPy ufuncs with GUVectorize()
- Similar to vectorize but operate on multiple elements at a time, think window function
- Return the result through an argument parameter

GUVectorize

@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res):
    for i in range(x.shape[0]):
        res[i] = x[i] + y

Optimization in Multidimensional Scaling

\textbf{Stress}^2 = \sum_{i=0}^{nrows} \sum_{j=i+1}^{nrows} \biggl( \delta_{HD}( X_i [W] , X_j [W]) - \delta_{LD}( Z_i , Z_j) \biggr)^2

\delta_{HD} (X_i, X_j, W) = (X_{i} [W]) \cdot (X_{j} [W])^T \\ \\

Optimize for the stress function for weights in the Backward MDS step

\delta_{LD} (Z_i, Z_j) = Z_{i} \cdot Z_{j}^T

dum_var = np.array([0.0,0.0]) 
res = numba_svd_derivative(j, self.nrow, self.ncol, self.projection_components, \
                           self.u, self.v, self.Xorig, self.X, d_sq, dum_var)
Derivative[:,:,:] = res
@guvectorize(
["u8, u8, u8, u8, f8[:,:], f8[:,:], f8[:,:], f8[:,:], f8[:], f8[:], f8[:,:,:]"], 
  "(), (), (), (), (nrow, nrow), (ncol, ncol), (nrow, ncol), (nrow, ncol), (nrow), (ndim)
  -> (ncol, nrow, ndim)", nopython=True,  target="cpu")
def numba_svd_derivative(j, nrow, ncol, ncomp, u, v, Xorig, X, d_sq, dum_var, result):
            jval = j
            for j in prange(0, ncol):
                contrib = np.zeros((ncol, ncomp))
                loopindex = min(nrow, ncol)
                for i in range(0,nrow):
                    sigma_v = np.zeros((ncol, ncomp))
                    ...
                contrib = np.dot(v,contrib)
                dX_dw = np.zeros((nrow,ncol))
                dX_dw[:,j] = Xorig[:,j]
                res_elem1 = np.dot(dX_dw, v[:,0:2])
                res_elem2 = np.dot(X, contrib[:,0:2])
                result[j,:,:] = res_elem1 + res_elem2

Accelerate SVD-based Solution with Numba

Speed comparison for data of size 48x72:
- Numba GUVectorize function
  - 'nopython' mode
  - 'object' mode
- Regular Python

Speedup with Numba GUVectorize

Python	Numba Object Mode	Numba Nopython Mode
0.68	Same as regular Python	0.008

Runtimes of the functions(s)

Eigen
- Used widely as standalone or wrapped within our packages, e.g. Tensorflow
- Wide variety of algorithms
- Fast
Xtensor
- Similar to numpy
- Python bindings, easy to transfer data structures between C++ and Python
- Not widely adopted

Xtensor vs. Eigen

One of the most popular C++ libraries for numerical computing
Expressive and intuitive API
Broadcasting and lazy evaluation
Used by a lot of frameworks (hence fairly reliable) such as
- Tensorflow
- Stan
- Ceres
- IFOpt etc.

Eigen - Numerical Computing in C++

Header-only library, alleviates compilation headaches

Eigen

clang++ -I /usr/local/include/eigen3/ test_eigen.cxx -o test_eigen

Fast vector and matrix operations
- Compile-time-spec containers - up to 4 dimensions
  - Matrix2d, Matrix3d, Matrix4d
  - Vector2d, Vector3d, Vector4d
- Operations on matrices and vectors
  - E.g. mat.sum(), mat.prod(), mat.trace()
- Block operations on slices of matrices, vectors
  - E.g. mat.block(i,j,a,b), mat.row(i), vec.tail(n)
- Coefficient-wise operations such as
  - E.g. mat.abs(), mat.exp(), mat.log()

Eigen - Python Productivity in C++?

Solutions of systems of equations (small, medium data)
- A variety of algorithms for LU, QR, Eigenvector/Eigenvalue, SVD decompositions
- Benchmarks of popular algorithms
- Dense and sparse matrices
  - Direct and iterative solvers for sparse matrices
  - E.g. SuperLU, CG, BICGSTAB
  - Some of them require external libraries such as GPL, PaStiX

Linear Algebra

typedef struct{
int highdim;
int lowdim;
int numrows;
} params;

params parameters;
MatrixXd w_ext, X, X_t, z, z_t;

X = load_csv<MatrixXd>("X.csv"); // templatized function
z = load_csv<MatrixXd>("z.csv"); // templatized function
parameters.highdim = X.cols();
parameters.lowdim = z.cols();
parameters.numrows = X.rows();

X_t = X.transpose();
z_t = z.transpose();
z_product.noalias() = z*z_t;
...

MDS - Read Data in Eigen

double nlopt_unconstrained_function_form1(const double *wd)
{

...
  
 // Loop assignment
  for(i = 0; i < parameters.numrows ; i++)
    {
       
        term1.row(i) = X.row(i).array() * w_ext.row(0).array();        
    }
  
  //term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();

  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  return (res.array()*res.array()).sum();
  
}

MDS - Nlopt Function

double nlopt_unconstrained_function_form1(const double *wd)
{

...
  
  /*for(i = 0; i < parameters.numrows ; i++)
    {
       term1.row(i) = X.row(i).array() * w_ext.row(0).array();        
    }*/
  
  // Optimized version
  term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();

  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  return (res.array()*res.array()).sum();
  
}

Nlopt Function - Optimized

void nlopt_unconstrained_gradient_form1(const double *wd, double *df_dw)
{

  ...
  term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();
  amat.noalias() = term1*X_t;
  res.noalias() = amat - z_product;
  sub_term = 2.0 * res;

  #pragma omp parallel
  {  
  for(i = 0 ; i < parameters.highdim; i++)
  {
     weight = wd[i];
     intermediate = MatrixXd::Zero(parameters.numrows,parameters.numrows);
     for(j = 0; j < parameters.numrows; j++)
     {
        intermediate.row(j) = X(j,i) * X.col(i);
     }
     df_dw[i] = (sub_term.array() * intermediate.array()).sum();  
  }}}

Nlopt Gradient

void Optimized_Compute_SVD_derivative()
{
	...		
contrib.setZero(ncol*ncol, projection_components); // Make one big matrix for all 'ncol' weights
for(i = 0; i < nrow; i++)
{
	sigma_v.setZero(ncol*ncol,projection_components); // Make one big sigma_v matrix for all 'ncol' weights
	for(k = 0; k < k_upper; k++)
	{
		for(l = 0; l < projection_components; l++)
		{
		   if(k == l) continue;
		# block of size 'ncol' x one (or single column), starting at k*ncol, 1
		   sigma_v.block(k*ncol, l, ncol, 1) = -1.0 * Xorig.row(i).transpose().array() *
            (( u(i,k)*v.col(l) + u(i,l)*v.col(k) ) / ( d_sq(k) - d_sq(l))).array();
		}
	}
	contrib += sigma_v;
}
	...
Map<Matrix<double,Dynamic,Dynamic>,0,Stride<Dynamic,Dynamic> > M4(contrib.data() + j, 
ncol, projection_components, Stride<Dynamic,Dynamic>(contrib.rows(),ncol));
res_elem1 = dX_dw * v.leftCols<2>() + X_V * M4;  }}};

Memory Map

void Optimized_Compute_SVD_derivative()
{
	...		
contrib.setZero(ncol*ncol, projection_components); // Make one big matrix for all 'ncol' weights
for(i = 0; i < nrow; i++)
{
	sigma_v.setZero(ncol*ncol,projection_components); // Make one big sigma_v matrix for all 'ncol' weights
	for(k = 0; k < k_upper; k++)
	{
		for(l = 0; l < projection_components; l++)
		{
			if(k == l) continue;
			sigma_v.block(k*ncol, l, ncol, 1) = -1.0 * Xorig.row(i).transpose().array() *(( u(i,k)*v.col(l) + u(i,l)*v.col(k) ) / ( d_sq(k) - d_sq(l))).array();
		}
	}
	contrib += sigma_v;
}
	...
Map<Matrix<double,Dynamic,Dynamic>,0,Stride<Dynamic,Dynamic> > M4(contrib.data() + j, 
ncol, projection_components, Stride<Dynamic,Dynamic>(contrib.rows(),ncol));
res_elem1 = dX_dw * v.leftCols<2>() + X_V * M4;
}}};

Memory Map

MDS - Comparison of C++/Eigen and Python

Dataset	Python(s)	C++(s)	Speedup
70 x 100	0.0896	0.00388	23x
40 x 45	0.01861	0.000624	29x
20 x 25	0.0032	0.000099	32x

Xtensor
- C++ library designed to be similar to Numpy
- Computing on multi-dimensional arrays
- Numpy-style broadcasting and lazy evaluation
Xtensor-python
- Python bindings from the Pybind11 C++ library
- Can use the NumPy data structures in C++ using the Python Buffer Protocol

Xtensor

Compile Xtensor C++

g++ -I /PATH_TO_ANACONDA/envs/xtensor_env/include -std=c++14 test.cpp -o test

export CPLUS_INCLUDE_PATH=PATH_TO_ANACONDA/envs/xtensor_env/include/:
PATH_TO_EIGEN/eigen3/

g++ -std=c++14 `python3-config --ldflags` -shared -fPIC -Wl,-undefined,dynamic_lookup 
xtensor_python_test.cpp -o xtensor_python_test.so

To use from Xtensor-python
- Compile C++ Xtensor code into a shared object library
- Note the flags needed

To compile C++ Xtensor code

Python caller

Interfacing your C++ Code with Python

import xtensor_python_test as xt

# Call function from C++ library by passing numpy objects
derivative = xt.imds_wrapper(X,z,np.sqrt(weights),u,s,v)

PYBIND11_MODULE(xtensor_python_test, m)
{
    // Must be called before any other lines
    xt::import_numpy();
    m.doc() = "Test module for xtensor python bindings";
    m.def("imds_wrapper", imds_wrapper, "Wrapper for IMDS in C++");
}

Declare the entry point in C++

xt::xarray<double> imds_wrapper(xt::pyarray<double>& X, 
xt::pyarray<double>& z, xt::pyarray<double>& weights)
{
  Matrix_SVD obj(X, weights, 0, 2);
  obj.Compute_SVD_derivative();
}

Xtensor Wrapper in C++

void Optimized_Compute_SVD_derivative()
{

auto contrib = xt::zeros<double>({ncol, projection_components});
auto sigma_v = xt::zeros<double>({ncol, projection_components});
auto Derivative = xt::xarray<double>::from_shape({nrow, ncol, 2});
...

};

void Optimized_Compute_SVD_derivative()
{
	...		
auto temp = xt::tile(s_in,(1,nrow));
temp = temp.reshape(nrow, s_size);
u_in = xt::linalg::dot(u_in, temp);
contrib.fill(0);
for(i = 0; i < nrow; i++)
   {
  	 sigma_v.fill(0); // Make one big sigma_v matrix for all 'ncol' weights
  	 for(k = 0; k < k_upper; k++)
  	 {
  		for(l = 0; l < projection_components; l++)
  		{
  		 if(k == l) continue;
         lhs = xt::view(arr1, xt::range(k*ncol, k*ncol + ncol), xt::range(l, l + 1))
         rhs_term1 = -1.0 * xt::transpose(xt::row(Xorig, i))
         rhs_term2 = u(i, k) *  xt::col(v, l) + u(i, l) * xt::col(v, k) 
         / ( d_sq(k) - d_sq(l) )
         lhs = xt::linalg::dot(rhs_term1, rhs_term2)
            ...

Xtensor Derivative of SVD

Computation of Derivative of SVD of weighted MDS
Data size of 48x72
Less than 4% of time in the initialization routine

Data Transfer Overhead

Native Python is slow, but we already knew that!
Profile your code
Scale up code through incremental complexity
- Use optimized packages (most people already do this)
- Numba
- Eigen or Xtensor
- Eigen or Xtensor with multithreading

Conclusion

MaintainableHPC

By sjster

MaintainableHPC

We've added support for incredibly smooth animations between code blocks. See it in action in this deck!

7,441