Maintainable HPC and Data Science with Python/C++
Dr. Srijith Rajamohan
Topics
- C++ vs. Python for Scientific Computing
- PyData stack: Productivity vs. Performance
- Numba
- Eigen
- Xtensor
Pros
- Easy to prototype algorithms
- Compact code
- Plethora of mature libraries
- Standard (almost) for Data Science
Cons
- Native Python code is slow
- Parallelization is still a second-class citizen
- The ecosystem for debugging is still poor compared to C/C++
Python for Scientific Computing
Pros
- Mature for Scientific Computing workloads
- Well-established ecosystem of libraries, tools and best practices
- Many paradigms for parallelization
Cons
- Slow to develop; prototyping is not as easy as in Python/Julia/MATLAB
- Steep learning curve; can seem inaccessible
- Verbose code makes maintenance and debugging difficult
C/C++ for Scientific Computing
- The Python scientific toolkit of optimized libraries (NumPy, SciPy, Pandas, DAAL4Py etc.) is ideal for Data Science
- These help mitigate some of Python's speed problems
- Limitations still exist for scalable computing
- PySpark for distributed computing
- Dask for out-of-core computing
- RAPIDS toolkit for efficient GPU computing
- TensorFlow, PyTorch, MXNet, Chainer etc. for Deep Learning
Scientific Computing Landscape in Python
- PyData stack - NumPy, SciPy, Pandas etc.
- Can provide competitive performance while substantially reducing development time
- Optimized libraries written in C/C++/Fortran for fast computation
- Benchmark of productivity languages vs. performance languages at https://hal.inria.fr/hal-02879767/document
- Can be harder to scale
- Use high-level languages to direct control flow
- Perform the scalable computation in the performance languages (see the sketch below)
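As a minimal illustration of this split (the function names and data below are illustrative, not from the talk), Python can direct the control flow while the heavy arithmetic is delegated to a compiled kernel:

import numpy as np

def dot_pure_python(a, b):
    # Both control flow and arithmetic run in the interpreter: slow
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_delegated(a, b):
    # Python only directs control flow; the reduction runs in NumPy's
    # compiled (C) kernel
    return np.dot(a, b)

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
assert np.isclose(dot_pure_python(a, b), dot_delegated(a, b))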
PyData - Productivity vs. Performance
- Numba is a Just-In-Time (JIT) compiler that allows you to accelerate Python functions
- Functions are compiled to machine code
  - Use the @jit decorator to configure this
  - Works best with NumPy functions
- Can break free from the GIL by passing 'parallel=True' to the @jit decorator (see the sketch below)
  - Threading via TBB or OMP; use set_num_threads(N)
  - Use prange instead of range to parallelize loops
  - Avoid race conditions when writing to the same location with 'parallel=True'
  - Intel SVML optimized functions with 'fastmath=True'
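A minimal sketch of these options together (the function and data are illustrative, not from the talk):

import numpy as np
from numba import njit, prange, set_num_threads

@njit(parallel=True, fastmath=True)  # parallel=True releases the GIL for this loop
def row_norms(a):
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):     # prange parallelizes the outer loop
        s = 0.0
        for j in range(a.shape[1]):
            s += a[i, j] * a[i, j]
        out[i] = np.sqrt(s)          # each iteration writes its own slot: no race
    return out

set_num_threads(4)                   # cap the number of threads
print(row_norms(np.random.rand(1000, 100))[:5])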
Accelerating Python with Numba
- Numba relies on compiling your Python code
- Two modes for Numba-compiled code
  - 'nopython' mode and 'object' mode
  - 'nopython' mode has no Python interpreter involvement, resulting in faster execution
  - 'object' mode only helps with loop-lifting, i.e. running just the inner loop in 'nopython' mode
- 'nopython' mode can fail for a number of reasons (see the sketch below)
  - Using Python features in your code that are not supported by Numba
  - Numba relies on type inference to figure out what type each variable is; it falls back to object mode when it can't
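A minimal sketch of the two modes (illustrative, not from the talk); 'forceobj=True' forces object mode so the two can be compared:

import numpy as np
from numba import jit

# 'nopython' mode: raises a TypingError at compile time if Numba
# cannot infer every type or hits an unsupported Python feature
@jit(nopython=True)
def mean_nopython(a):
    s = 0.0
    for x in a:
        s += x
    return s / a.size

# object mode: runs on interpreter objects, but can still loop-lift
# inner loops into 'nopython' mode
@jit(forceobj=True)
def mean_object(a):
    s = 0.0
    for x in a:
        s += x
    return s / a.size

x = np.arange(10.0)
print(mean_nopython(x), mean_object(x))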
Compilation in Numba
- Make any function behave like a NumPy ufunc with @vectorize
- Eager compilation uses decorators with signatures
- Ufuncs get features such as reduce, accumulate and broadcast
- Supports single-threaded or multi-threaded CPUs, and GPUs, via 'target'
Vectorize
from numba import vectorize, float64

@vectorize([float64(float64, float64)],
           nopython=True, target='cpu')
def f(x, y):
    return x + y
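Since ufuncs gain reduce, accumulate and broadcasting, a short usage sketch for f (values illustrative):

import numpy as np

a = np.arange(4.0)      # [0., 1., 2., 3.]
print(f(a, a))          # elementwise: [0. 2. 4. 6.]
print(f(a, 10.0))       # broadcasts the scalar: [10. 11. 12. 13.]
print(f.reduce(a))      # ufunc reduce: ((0 + 1) + 2) + 3 = 6.0
print(f.accumulate(a))  # running result: [0. 1. 3. 6.]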
- Ufuncs operate on vectors while allowing you to write a function that operates on scalars
- Numba then vectorizes it
- Create NumPy generalized ufuncs with @guvectorize
  - Similar to @vectorize but operates on multiple elements at a time; think window function
  - Return the result through an output argument
GUVectorize
from numba import guvectorize, int64

@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res):
    for i in range(x.shape[0]):
        res[i] = x[i] + y
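A usage sketch for g (values illustrative): the kernel operates on one '(n)' window, and NumPy broadcasts over any extra leading dimensions:

import numpy as np

x = np.arange(5, dtype=np.int64)
print(g(x, 10))       # [10 11 12 13 14]

X = np.arange(6, dtype=np.int64).reshape(2, 3)
y = np.array([100, 200], dtype=np.int64)
print(g(X, y))        # row i gets y[i]: [[100 101 102], [203 204 205]]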
Optimization in Multidimensional Scaling
Optimize the stress function for the weights in the backward MDS step:

\textbf{Stress}^2 = \sum_{i=0}^{nrows} \sum_{j=i+1}^{nrows} \biggl( \delta_{HD}( X_i, X_j, W ) - \delta_{LD}( Z_i , Z_j ) \biggr)^2

\delta_{HD} (X_i, X_j, W) = (X_{i} [W]) \cdot (X_{j} [W])^T

\delta_{LD} (Z_i, Z_j) = Z_{i} \cdot Z_{j}^T
dum_var = np.array([0.0, 0.0])
res = numba_svd_derivative(j, self.nrow, self.ncol, self.projection_components, \
                           self.u, self.v, self.Xorig, self.X, d_sq, dum_var)
Derivative[:,:,:] = res
@guvectorize(
    ["u8, u8, u8, u8, f8[:,:], f8[:,:], f8[:,:], f8[:,:], f8[:], f8[:], f8[:,:,:]"],
    "(), (), (), (), (nrow, nrow), (ncol, ncol), (nrow, ncol), (nrow, ncol), (nrow), (ndim) -> (ncol, nrow, ndim)",
    nopython=True, target="cpu")
def numba_svd_derivative(j, nrow, ncol, ncomp, u, v, Xorig, X, d_sq, dum_var, result):
    jval = j
    for j in prange(0, ncol):
        contrib = np.zeros((ncol, ncomp))
        loopindex = min(nrow, ncol)
        for i in range(0, nrow):
            sigma_v = np.zeros((ncol, ncomp))
            ...
        contrib = np.dot(v, contrib)
        dX_dw = np.zeros((nrow, ncol))
        dX_dw[:, j] = Xorig[:, j]
        res_elem1 = np.dot(dX_dw, v[:, 0:2])
        res_elem2 = np.dot(X, contrib[:, 0:2])
        result[j, :, :] = res_elem1 + res_elem2
Accelerate SVD-based Solution with Numba
- Speed comparison for data of size 48x72:
  - Numba GUVectorize function in 'nopython' mode
  - Numba GUVectorize function in 'object' mode
  - Regular Python
Speedup with Numba GUVectorize
Python | Numba Object Mode | Numba Nopython Mode |
---|---|---|
0.68 | Same as regular Python | 0.008 |
Runtimes of the function(s), in seconds
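For context, a sketch of how such runtimes can be measured (this helper is illustrative, not the benchmark used for the table); the warm-up call keeps JIT compilation time out of the measurement:

import time

def best_runtime(fn, *args, repeats=10):
    fn(*args)                    # warm-up: exclude JIT compilation
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best                  # best-of-N wall-clock time in seconds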
- Eigen
  - Used widely as standalone or wrapped within other packages, e.g. TensorFlow
  - Wide variety of algorithms
  - Fast
- Xtensor
  - Similar to NumPy
  - Python bindings; easy to transfer data structures between C++ and Python
  - Not widely adopted
Xtensor vs. Eigen
- One of the most popular C++ libraries for numerical computing
- Expressive and intuitive API
- Broadcasting and lazy evaluation
- Used by many frameworks (hence fairly reliable), such as
  - TensorFlow
  - Stan
  - Ceres
  - IFOpt etc.
Eigen - Numerical Computing in C++
- Header-only library, alleviates compilation headaches
Eigen
clang++ -I /usr/local/include/eigen3/ test_eigen.cxx -o test_eigen
- Fast vector and matrix operations
- Fixed-size containers specified at compile time, up to size 4
  - Matrix2d, Matrix3d, Matrix4d
  - Vector2d, Vector3d, Vector4d
- Operations on matrices and vectors
  - E.g. mat.sum(), mat.prod(), mat.trace()
- Block operations on slices of matrices and vectors
  - E.g. mat.block(i,j,a,b), mat.row(i), vec.tail(n)
- Coefficient-wise operations
  - E.g. mat.abs(), mat.exp(), mat.log()
Eigen - Python Productivity in C++?
- Solving systems of equations (small and medium data)
- A variety of algorithms for LU, QR, Eigenvector/Eigenvalue, SVD decompositions
- Benchmarks of popular algorithms
- Dense and sparse matrices
- Direct and iterative solvers for sparse matrices
- E.g. SuperLU, CG, BICGSTAB
- Some of them require external libraries (e.g. PaStiX), some under GPL-style licenses
Linear Algebra
typedef struct{
int highdim;
int lowdim;
int numrows;
} params;
params parameters;
MatrixXd w_ext, X, X_t, z, z_t, z_product;
X = load_csv<MatrixXd>("X.csv"); // templatized function
z = load_csv<MatrixXd>("z.csv"); // templatized function
parameters.highdim = X.cols();
parameters.lowdim = z.cols();
parameters.numrows = X.rows();
X_t = X.transpose();
z_t = z.transpose();
z_product.noalias() = z*z_t;
...
MDS - Read Data in Eigen
double nlopt_unconstrained_function_form1(const double *wd)
{
...
// Loop assignment
for(i = 0; i < parameters.numrows ; i++)
{
term1.row(i) = X.row(i).array() * w_ext.row(0).array();
}
//term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();
amat.noalias() = term1*X_t;
res.noalias() = amat - z_product;
return (res.array()*res.array()).sum();
}
MDS - Nlopt Function
double nlopt_unconstrained_function_form1(const double *wd)
{
...
/*for(i = 0; i < parameters.numrows ; i++)
{
term1.row(i) = X.row(i).array() * w_ext.row(0).array();
}*/
// Optimized version
term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();
amat.noalias() = term1*X_t;
res.noalias() = amat - z_product;
return (res.array()*res.array()).sum();
}
Nlopt Function - Optimized
void nlopt_unconstrained_gradient_form1(const double *wd, double *df_dw)
{
    ...
    term1 = X.array() * w_ext.row(0).replicate(parameters.numrows,1).array();
    amat.noalias() = term1*X_t;
    res.noalias() = amat - z_product;
    sub_term = 2.0 * res;
    // Parallelize the outer loop; loop-local variables are made private
    // to each thread to avoid races
    #pragma omp parallel for private(j, weight, intermediate)
    for(i = 0 ; i < parameters.highdim; i++)
    {
        weight = wd[i];
        intermediate = MatrixXd::Zero(parameters.numrows,parameters.numrows);
        for(j = 0; j < parameters.numrows; j++)
        {
            intermediate.row(j) = X(j,i) * X.col(i);
        }
        df_dw[i] = (sub_term.array() * intermediate.array()).sum();
    }
}
Nlopt Gradient
void Optimized_Compute_SVD_derivative()
{
...
contrib.setZero(ncol*ncol, projection_components); // Make one big matrix for all 'ncol' weights
for(i = 0; i < nrow; i++)
{
sigma_v.setZero(ncol*ncol,projection_components); // Make one big sigma_v matrix for all 'ncol' weights
for(k = 0; k < k_upper; k++)
{
for(l = 0; l < projection_components; l++)
{
if(k == l) continue;
// block of size 'ncol' x 1 (a single column), starting at (k*ncol, l)
sigma_v.block(k*ncol, l, ncol, 1) = -1.0 * Xorig.row(i).transpose().array() *
(( u(i,k)*v.col(l) + u(i,l)*v.col(k) ) / ( d_sq(k) - d_sq(l))).array();
}
}
contrib += sigma_v;
}
...
Map<Matrix<double,Dynamic,Dynamic>,0,Stride<Dynamic,Dynamic> > M4(contrib.data() + j,
ncol, projection_components, Stride<Dynamic,Dynamic>(contrib.rows(),ncol));
res_elem1 = dX_dw * v.leftCols<2>() + X_V * M4;
}
Memory Map
MDS - Comparison of C++/Eigen and Python
Dataset | Python(s) | C++(s) | Speedup |
---|---|---|---|
70 x 100 | 0.0896 | 0.00388 | 23x |
40 x 45 | 0.01861 | 0.000624 | 29x |
20 x 25 | 0.0032 | 0.000099 | 32x |
- Xtensor
  - C++ library designed to be similar to NumPy
  - Computing on multi-dimensional arrays
  - NumPy-style broadcasting and lazy evaluation
- Xtensor-python
  - Python bindings via the pybind11 C++ library
  - Can use NumPy data structures in C++ through the Python buffer protocol
Xtensor
Compile Xtensor C++
g++ -I /PATH_TO_ANACONDA/envs/xtensor_env/include -std=c++14 test.cpp -o test
export CPLUS_INCLUDE_PATH=PATH_TO_ANACONDA/envs/xtensor_env/include/:PATH_TO_EIGEN/eigen3/
g++ -std=c++14 `python3-config --ldflags` -shared -fPIC \
    -Wl,-undefined,dynamic_lookup xtensor_python_test.cpp -o xtensor_python_test.so
- To compile C++ Xtensor code, add the include paths as shown above
- To use it from Xtensor-python
  - Compile the C++ Xtensor code into a shared object library
  - Note the flags needed
Python caller
Interfacing your C++ Code with Python
import xtensor_python_test as xt
# Call function from C++ library by passing numpy objects
derivative = xt.imds_wrapper(X,z,np.sqrt(weights),u,s,v)
PYBIND11_MODULE(xtensor_python_test, m)
{
    // Must be called before any other lines
    xt::import_numpy();
    m.doc() = "Test module for xtensor python bindings";
    m.def("imds_wrapper", imds_wrapper, "Wrapper for IMDS in C++");
}
Declare the entry point in C++
xt::xarray<double> imds_wrapper(xt::pyarray<double>& X,
                                xt::pyarray<double>& z, xt::pyarray<double>& weights)
{
    Matrix_SVD obj(X, weights, 0, 2);
    obj.Compute_SVD_derivative();
    ...  // return of the computed derivative elided
}
Xtensor Wrapper in C++
void Optimized_Compute_SVD_derivative()
{
    // xt::zeros is lazy: assign into concrete xarrays so that fill()
    // can be called on them later
    xt::xarray<double> contrib = xt::zeros<double>({ncol, projection_components});
    xt::xarray<double> sigma_v = xt::zeros<double>({ncol, projection_components});
    auto Derivative = xt::xarray<double>::from_shape({nrow, ncol, 2});
    ...
}
void Optimized_Compute_SVD_derivative()
{
    ...
    // tile the singular values, then reshape in place
    xt::xarray<double> temp = xt::tile(s_in, {1, nrow});
    temp.reshape({nrow, s_size});
    u_in = xt::linalg::dot(u_in, temp);
    contrib.fill(0);
    for(i = 0; i < nrow; i++)
    {
        sigma_v.fill(0); // Make one big sigma_v matrix for all 'ncol' weights
        for(k = 0; k < k_upper; k++)
        {
            for(l = 0; l < projection_components; l++)
            {
                if(k == l) continue;
                auto lhs = xt::view(arr1, xt::range(k*ncol, k*ncol + ncol), xt::range(l, l + 1));
                auto rhs_term1 = -1.0 * xt::transpose(xt::row(Xorig, i));
                auto rhs_term2 = (u(i, k) * xt::col(v, l) + u(i, l) * xt::col(v, k))
                                 / (d_sq(k) - d_sq(l));
                lhs = rhs_term1 * rhs_term2; // coefficient-wise, as in the Eigen version
                ...
Xtensor Derivative of SVD
- Computation of the derivative of the SVD in weighted MDS
- Data size of 48x72
- Less than 4% of the time is spent in the initialization routine
Data Transfer Overhead
- Native Python is slow, but we already knew that!
- Profile your code
- Scale up code through incremental complexity
- Use optimized packages (most people already do this)
- Numba
- Eigen or Xtensor
- Eigen or Xtensor with multithreading
Conclusion