Python For Quants
Performance Pandas
May 6, 2016
Jeff Reback
@jreback
Jeff Reback
- former quant
- currently working on projects at Continuum
- core committer to pandas for the last 3 years
- managing pandas since 2013
@jreback
Why Pandas?
- Vectorization for the masses
- Extract Transform Load
- Munging & data prep is a big part of the pipeline
How Pandas?
- Fast and Efficient DataFrame
- Interoperability with ecosystem
- Database-like
- User friendly API
The PyData Stack
Why Pandas in Finance?
- automatic data alignment
- rolling, expanding, and EWM operations
- timeseries ops: fillna, dropna
- resampling & ordered merges
- timezone handling
- date offsets & holiday support
- intelligent interactive indexing
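A minimal sketch of a few of the time-series features listed for finance, assuming a recent pandas. The price series is made up purely for illustration, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices on business days (made-up numbers).
idx = pd.date_range("2016-01-04", periods=10, freq="B")
prices = pd.Series(np.arange(100.0, 110.0), index=idx, name="price")

# Rolling, expanding, and exponentially weighted statistics.
rolling_mean = prices.rolling(window=3).mean()
expanding_max = prices.expanding().max()
ewm_mean = prices.ewm(span=3).mean()

# Automatic data alignment: the operands cover different dates,
# so the result is NaN wherever one side has no observation.
spread = prices.iloc[:8] - prices.iloc[2:]

# Resampling to weekly frequency, keeping the last observation of each week.
weekly_last = prices.resample("W").last()

# Date offsets: five business days after the first timestamp.
next_week = idx[0] + pd.offsets.BDay(5)

# Timezone handling: localize naive timestamps, then convert.
ny = prices.tz_localize("UTC").tz_convert("America/New_York")
```

None of these calls needs an explicit loop; the index carries the dates, and pandas lines the values up on it.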
- implementation time
- runtime
- resource utilization
What do we care about when writing code?
- feature set
- readability counts
- maintenance is a virtue
- tests & docs
Constraints
Objectives
Why do we care about performance?
What drives pandas?
- dtype segregation
- block memory layout
- computation backends
Computation Backends
- numpy
- bottleneck
- numexpr
- dask
- numba
- libpandas
- DyND
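As a rough illustration of how a backend sits behind the same user-facing expression, the sketch below evaluates one formula twice: once as ordinary vectorized arithmetic and once through `DataFrame.eval`. The column names are hypothetical. `eval` uses numexpr only when that package is installed; otherwise it falls back to a plain Python engine, so the result is the same either way.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5.0), "b": np.arange(5.0) * 2})

# Plain vectorized arithmetic, evaluated by numpy one operation at a time.
plain = (df["a"] + 1) * df["b"]

# The same expression as a string. DataFrame.eval hands it to numexpr
# when available and otherwise evaluates it with the Python engine.
evaluated = df.eval("(a + 1) * b")
```

On a frame this small there is nothing to gain; the string form tends to pay off only on large frames, where avoiding intermediate arrays matters.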
What drives pandas?
- dtype segregation
- block memory layout
- computation backends
- cython for critical parts
- hashtable for indexing
How to make pandas fast
- algo
- idioms
- built-in / vectorization
- pandas/numpy
- bottleneck/numexpr
- cython
- ad-hoc cython/numba
How to make pandas fast
slow
- apply across the rows
- itertuples/iterrows
- iterative updating
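The slow patterns named on this slide can be put next to a vectorized equivalent. This is only a sketch: the trade data is made up, and `notional` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up trade data, for illustration only.
df = pd.DataFrame({"price": [10.0, 11.0, 12.5, 9.0], "qty": [100, 200, 50, 400]})

# Slow: apply across the rows calls a Python function once per row.
notional_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Slow: iterating with itertuples and updating the frame one cell at a time.
slow = df.copy()
slow["notional"] = np.nan
for row in slow.itertuples():
    slow.loc[row.Index, "notional"] = row.price * row.qty

# Fast: one vectorized expression over whole columns.
notional_vec = df["price"] * df["qty"]
```

All three produce the same numbers; only the vectorized form pushes the loop down into compiled code.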
Do's
- have the correct dtypes
- pd.concat
- Categoricals
- .apply across columns
Don'ts
- repeated insertions
- micro optimize
- use loops / re-invent the wheel
- .apply across rows
- .applymap
- nest groupby.apply()
- inplace=True
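A short sketch of several of the Do's, assuming records that arrive as strings. The tickers `AAA` and `BBB` and all values are invented for the example.

```python
import pandas as pd

# Made-up records that arrive as strings, e.g. from a text file.
raw = pd.DataFrame({
    "date": ["2016-05-02", "2016-05-03", "2016-05-04"],
    "ticker": ["AAA", "BBB", "AAA"],
    "price": ["10.0", "20.0", "12.0"],
})

# Do: have the correct dtypes before computing anything.
typed = raw.assign(
    date=pd.to_datetime(raw["date"]),
    ticker=raw["ticker"].astype("category"),
    price=pd.to_numeric(raw["price"]),
)

# Don't: grow a frame by repeated insertion inside a loop.
# Do: collect the pieces in a list and call pd.concat once.
pieces = []
for t in ["AAA", "BBB"]:
    pieces.append(typed[typed["ticker"] == t])
combined = pd.concat(pieces, ignore_index=True)

# Do: apply across columns (one call per column, not one per row).
col_ranges = typed[["price"]].apply(lambda col: col.max() - col.min())

# Prefer a built-in aggregation over a nested groupby.apply().
mean_by_ticker = typed.groupby("ticker", observed=True)["price"].mean()
```

Each `pd.concat` call copies its inputs, so calling it once on a list avoids the repeated copying that makes incremental insertion slow.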
Memory Considerations
- conversions
- categoricals
- iterators
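The three memory points can be sketched as follows. The frame is synthetic, and an in-memory buffer stands in for a large CSV file on disk; both are assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({
    "side": np.tile(["BUY", "SELL"], n // 2),  # low-cardinality strings
    "qty": np.arange(n, dtype="int64"),
})

# Conversions: downcast to the smallest integer type that holds the values.
qty_small = pd.to_numeric(df["qty"], downcast="integer")

# Categoricals: store each distinct string once plus small integer codes.
side_str_bytes = df["side"].memory_usage(deep=True)
side_cat_bytes = df["side"].astype("category").memory_usage(deep=True)

# Iterators: process a CSV in chunks instead of loading it whole.
buf = io.StringIO(df.to_csv(index=False))
total_qty = 0
for chunk in pd.read_csv(buf, chunksize=2_500):
    total_qty += int(chunk["qty"].sum())
```

Comparing `side_str_bytes` with `side_cat_bytes` shows the saving from the categorical; the chunked loop never holds more than 2,500 rows of the file at a time.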
I/O & Serialization
- HDF5
- bcolz
- CSV
- SQL
- JSON
- pickle
- msgpack
- feather
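One practical difference between these formats is whether dtypes survive a round trip. The sketch below compares only CSV and pickle, since both work with the standard library and pandas alone; the data is made up, and in-memory buffers stand in for files.

```python
import io
import pickle

import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2016-05-06", "2016-05-09"]),
    "side": pd.Categorical(["BUY", "SELL"]),
    "qty": [100, 250],
})

# CSV is text: dtypes are re-inferred on read, so dates and categoricals
# come back as plain strings unless the reader is told otherwise.
csv_back = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# Telling the reader what to expect restores the intended dtypes.
csv_typed = pd.read_csv(
    io.StringIO(df.to_csv(index=False)),
    parse_dates=["when"],
    dtype={"side": "category"},
)

# pickle is a binary format and round-trips the dtypes as they are.
pkl_back = pickle.loads(pickle.dumps(df))
```

Pickle should only be used with trusted data, since loading a pickle can execute arbitrary code.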
Global-Interpreter-Lock
dask
- Collections build task graphs
- Schedulers execute task graphs
- Graph specification = uniting interface
- A generalization of RDDs
(((A + 1) * 2) ** 3)
(B - B.mean(axis=0)) + (B.T / B.std())
How to contribute
This Talk
@jreback