Python For Quants
Performance Pandas
May 6, 2016
Jeff Reback
@jreback
Jeff Reback
- former quant
- currently working on projects at Continuum
- core committer to pandas for the last 3 years
- managing pandas since 2013
Why Pandas?
- Vectorization for the masses
- Extract Transform Load
- Munging & data prep is a big part of the pipeline
How Pandas?
- Fast and Efficient DataFrame
- Interoperability with ecosystem
- Database-like
- User friendly API
The PyData Stack
- automatic data alignment
- rolling, expanding, and EWM operations
- timeseries ops: fillna, dropna
- resampling & ordered merges
- timezone handling
- date offsets & holiday support
- intelligent interactive indexing
Why Pandas in Finance?
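A few of the features listed above, as a minimal sketch. The two price series and their dates are made up for illustration; only standard pandas calls are used.

```python
import pandas as pd

# Two hypothetical daily series with different, partly overlapping dates.
idx_a = pd.date_range("2016-01-04", periods=5, freq="D")
idx_b = pd.date_range("2016-01-06", periods=5, freq="D")
a = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=idx_a)
b = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx_b)

# Automatic alignment: addition matches on index labels, not position.
# Dates present in only one series produce NaN.
total = a + b

# Missing-data handling: drop the unmatched dates, or fill them instead.
overlap = total.dropna()
filled = a.add(b, fill_value=0.0)

# Rolling and exponentially weighted statistics.
roll_mean = a.rolling(window=3).mean()
ewm_mean = a.ewm(span=3).mean()

# Resampling: aggregate the daily series into 2-day buckets.
two_day = a.resample("2D").sum()
```

The point of the sketch is that none of these steps needs an explicit loop or a manual join on dates.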
What do we care about when writing code?
Constraints
- implementation time
- runtime
- resource utilization
Objectives
- feature set
- readability counts
- maintenance is a virtue
- tests & docs
Why do we care about performance?
- numpy
- bottleneck
- numexpr
- dask
- numba
- libpandas
- DyND
Computation Backends
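One way a backend shows up in user code, as a hedged sketch: `DataFrame.eval` can hand a whole expression to numexpr when that package is installed, and falls back to a plain Python engine otherwise. The column names and sizes below are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100_000, 3)), columns=["a", "b", "c"])

# Plain pandas/numpy: each operator allocates an intermediate array.
plain = df["a"] + df["b"] * df["c"]

# Same expression through DataFrame.eval. With numexpr installed, pandas can
# evaluate the whole expression in one pass; without it, pandas falls back
# to its Python engine and the result is unchanged.
evaluated = df.eval("a + b * c")
```

Whether this is faster depends on frame size and on numexpr being available, so it is worth timing on real data before relying on it.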
- dtype segregation
- block memory layout
- computation backends
- cython for critical parts
- hashtable for indexing
What drives pandas?
- algo
- idioms
- built-in / vectorization
- pandas/numpy
- bottleneck/numexpr
- cython
- ad-hoc cython/numba
How to make pandas fast
- apply across the rows
- itertuples/iterrows
- iterative updating
How to make pandas fast
slow
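The slow patterns above next to the vectorized equivalent, as a small sketch on a made-up frame. All three produce the same values; only the last avoids per-row Python work.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Slow: a Python-level function call for every row.
by_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Slow: explicit iteration, updating the result one element at a time.
by_loop = pd.Series(np.nan, index=df.index)
for row in df.itertuples():
    by_loop.loc[row.Index] = row.price * row.qty

# Fast: one vectorized operation over whole columns.
vectorized = df["price"] * df["qty"]
```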
Do's
- have the correct dtypes
- pd.concat
- Categoricals
- .apply across columns
Don'ts
- repeated insertions
- micro optimize
- use loops / re-invent the wheel
- .apply across rows
- .applymap
- nest groupby.apply()
- inplace=True
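A short sketch of a few of these do's and don'ts side by side. The per-day chunks and the symbol names are hypothetical.

```python
import pandas as pd

# Hypothetical per-day chunks of trade data.
chunks = [
    pd.DataFrame({"symbol": ["AAA", "BBB"], "qty": [day, day * 2]})
    for day in range(1, 4)
]

# Don't: grow a frame by repeated insertion; every step copies all the data.
grown = chunks[0]
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk], ignore_index=True)

# Do: collect the pieces in a list and concatenate once.
combined = pd.concat(chunks, ignore_index=True)

# Do: use a Categorical for a low-cardinality string column.
combined["symbol"] = combined["symbol"].astype("category")

# Do: apply down columns (axis=0), so the function runs once per column
# rather than once per row.
col_range = combined[["qty"]].apply(lambda col: col.max() - col.min())
```

Both approaches build the same frame; the single `pd.concat` just avoids copying the accumulated data on every iteration.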
- conversions
- categoricals
- iterators
Memory Considerations
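A rough sketch of the three points above. Reading "conversions" as dtype conversions is an assumption, and the frame is synthetic; an in-memory buffer stands in for a CSV file on disk.

```python
import io

import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({
    "symbol": np.tile(["AAA", "BBB", "CCC", "DDD"], n // 4),
    "qty": np.arange(n, dtype="int64") % 100,
})

before = df.memory_usage(deep=True).sum()

# Conversions: store small integers in a narrower dtype.
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")
# Categoricals: store each distinct string once, plus small integer codes.
df["symbol"] = df["symbol"].astype("category")

after = df.memory_usage(deep=True).sum()

# Iterators: process a CSV in fixed-size pieces instead of loading it whole.
buffer = io.StringIO(df.to_csv(index=False))
total_qty = 0
for piece in pd.read_csv(buffer, chunksize=2_500):
    total_qty += int(piece["qty"].sum())
```

Comparing `before` and `after` shows the saving for this particular frame; the ratio on real data depends on cardinality and value ranges.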
- HDF5
- bcolz
- CSV
- SQL
- JSON
- pickle
- msgpack
- feather
I/O & Serialization
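One difference between the text and binary formats in this list, sketched with CSV and pickle only because both work without extra dependencies. The frame is made up; the stdlib `pickle` module is used directly here, which is the mechanism pandas' own pickle I/O builds on.

```python
import io
import pickle

import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2016-05-06", "2016-05-07"]),
    "symbol": pd.Categorical(["AAA", "BBB"]),
    "px": [1.5, 2.5],
})

# Text format: CSV is portable, but dtypes are not stored in the file.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Dtypes have to be restored explicitly on read.
from_csv_typed = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["when"], dtype={"symbol": "category"}
)

# Binary format: pickle round-trips the frame with its dtypes intact.
raw = pickle.dumps(df)
from_pickle = pickle.loads(raw)
```

The trade-off is portability against fidelity: CSV is readable anywhere, while pickle is Python-specific and should only be loaded from trusted sources.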
Global-Interpreter-Lock
dask
- Collections build task graphs
- Schedulers execute task graphs
- Graph specification = uniting interface
- A generalization of RDDs
(((A + 1) * 2) ** 3)
(B - B.mean(axis=0)) + (B.T / B.std())
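To make "collections build task graphs, schedulers execute task graphs" concrete, here is a toy pure-Python sketch of the first expression, `(((A + 1) * 2) ** 3)`, with `A = 1`. It follows the dict-of-tuples graph format that dask documents, but the `get` function is a hypothetical stand-in written for this example, not dask's scheduler, and the key names are invented.

```python
from operator import add, mul, pow

# A task graph as a plain dict: each key maps either to a literal value or
# to a task tuple of the form (function, argument, ...). String arguments
# that name other keys are dependencies.
graph = {
    "A": 1,
    "plus_one": (add, "A", 1),
    "times_two": (mul, "plus_one", 2),
    "cubed": (pow, "times_two", 3),
}


def get(graph, key, cache=None):
    """Toy single-threaded scheduler: resolve `key` depth-first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Replace references to other keys with their computed values.
        resolved = [
            get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = task
    cache[key] = result
    return result


result = get(graph, "cubed")  # ((1 + 1) * 2) ** 3
```

Because the graph is just data, the same dict could be handed to a threaded or distributed scheduler without changing how it was built, which is the sense in which the graph specification is the uniting interface.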
How to contribute
This Talk