Python For Quants
Performance Pandas
May 6, 2016
Jeff Reback
@jreback
Jeff Reback
- former quant
- currently working on projects at Continuum
- core committer to pandas for the last 3 years
- have managed pandas since 2013
@jreback
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/2243966/MIT_Logo_2.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/2243968/deutsche_bank_logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/2243972/Anaconda_Logo_0702_0.png)
Why Pandas?
- Vectorization for the masses
- Extract Transform Load
- Munging & data prep is a big part of the pipeline
How Pandas?
- Fast and Efficient DataFrame
- Interoperability with ecosystem
- Database-like
- User friendly API
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/2556567/PyData_Stack.png)
The PyData Stack
- automatic data alignment
- rolling, expanding, and EWM operations
- timeseries ops: fillna, dropna
- resampling & ordered merges
- timezone handling
- date offsets & holiday support
- intelligent interactive indexing
Why Pandas in Finance?
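A few of the features listed above can be sketched in a handful of lines. The two price series and their dates are made up for illustration; only the pandas calls themselves come from the library.

```python
import pandas as pd

# Two hypothetical series on overlapping but different business-day ranges.
idx_a = pd.date_range("2016-05-02", periods=5, freq="B")
idx_b = pd.date_range("2016-05-04", periods=5, freq="B")
a = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=idx_a)
b = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx_b)

# Automatic data alignment: arithmetic matches on index labels, not position.
# Dates present in only one series come out as NaN.
spread = a - b

# Time-series cleanup: drop the dates that had no match.
spread_clean = spread.dropna()

# Rolling, expanding, and exponentially weighted statistics.
roll_mean = a.rolling(window=3).mean()
expanding_max = a.expanding().max()
ewm_mean = a.ewm(span=3).mean()

# Resampling: aggregate the daily data to weekly frequency.
weekly_last = a.resample("W").last()
```

The alignment step is the one that tends to matter most in practice: no explicit join is written, yet the subtraction only pairs values that share a date.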
Constraints
- implementation time
- runtime
- resource utilization
Objectives
- feature set
- readability counts
- maintenance is a virtue
- tests & docs
What do we care about when writing code?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/2567087/the_general_problem.png)
Why do we care about performance?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/1503141/apples.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/1503171/Internal_Class_Hierarchy_Landscape_-_New_Page.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/1870585/PyData_Flow_2_-_New_Page.png)
- numpy
- bottleneck
- numexpr
- dask
- numba
- libpandas
- DyND
Computation Backends
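As a rough illustration of how a backend plugs in, the sketch below evaluates the same expression two ways. The frame and column names are invented for the example. When `engine` is left unset, `DataFrame.eval` uses numexpr if it is installed and otherwise falls back to plain Python evaluation, so the snippet should run either way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=["a", "b", "c"])

# Plain pandas/numpy evaluation: each intermediate result is materialized.
plain = df["a"] + df["b"] * df["c"]

# Expression evaluation: the whole expression is handed to the eval engine
# (numexpr when available, Python otherwise).
via_eval = df.eval("a + b * c")

# pandas exposes options that report whether the optional accelerators
# are switched on for its internal computations.
uses_numexpr = pd.get_option("compute.use_numexpr")
uses_bottleneck = pd.get_option("compute.use_bottleneck")
```

Any benefit from the eval route generally shows up only on large frames; on small inputs the parsing overhead can outweigh it.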
- dtype segregation
- block memory layout
- computation backends
- cython for critical parts
- hashtable for indexing
What drives pandas?
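The effect of dtype segregation can be seen from the public API without touching internals. This is a minimal sketch with a made-up frame; it only shows observable behaviour, not the block layout itself.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [101.5, 99.25, 100.0],
        "qty": [10, 20, 30],
        "ticker": ["AAA", "BBB", "CCC"],
    },
    index=["t1", "t2", "t3"],
)

# Each column keeps its own dtype.
dtypes = df.dtypes

# Selecting by dtype picks out the numeric columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Column access keeps the native dtype of that column.
qty = df["qty"]

# A single row spans several dtypes, so it is returned as an object Series.
row = df.loc["t2"]

# Label lookups are resolved through the index.
pos = df.index.get_loc("t2")
```

This is one reason column-wise operations are cheap while row-wise access across mixed dtypes is comparatively expensive.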
- algo
- idioms
- built-in / vectorization
- pandas/numpy
- bottleneck/numexpr
- cython
- ad-hoc cython/numba
How to make pandas fast
- apply across the rows
- itertuples/iterrows
- iterative updating
How to make pandas fast
slow
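The three slow patterns above can be put next to a vectorized equivalent. The data and the column names `price` and `qty` are invented for the sketch; all four variants compute the same notional value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    {
        "price": rng.uniform(90, 110, size=1_000),
        "qty": rng.integers(1, 100, size=1_000),
    }
)

# Slow: a Python function called once per row.
notional_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Slow: explicit iteration combined with iterative updating of the result.
notional_loop = pd.Series(0.0, index=df.index)
for i, row in df.iterrows():
    notional_loop.loc[i] = row["price"] * row["qty"]

# Slow: itertuples avoids building a Series per row but is still a Python loop.
notional_tuples = pd.Series(
    [t.price * t.qty for t in df.itertuples(index=False)], index=df.index
)

# Fast: one vectorized expression over whole columns.
notional_vec = df["price"] * df["qty"]
```

Timing these with `%timeit` in an interactive session is a simple way to see the gap on your own data.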
Do's
- have the correct dtypes
- pd.concat
- Categoricals
- .apply across columns
Don'ts
- repeated insertions
- micro optimize
- use loops / re-invent the wheel
- .apply across rows
- .applymap
- nest groupby.apply()
- inplace=True
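A small sketch of a few of these points, assuming hypothetical chunks of data that arrive one at a time. It contrasts growing a frame step by step with a single `pd.concat`, fixes a wrongly typed column, and applies a function per column.

```python
import pandas as pd

# Hypothetical chunks arriving one at a time.
chunks = [
    pd.DataFrame({"day": [d] * 3, "value": [d * 10 + i for i in range(3)]})
    for d in range(4)
]

# Don't: grow a frame inside the loop. Every step copies all rows so far.
grown = chunks[0]
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk], ignore_index=True)

# Do: collect the pieces in a list and concatenate once.
combined = pd.concat(chunks, ignore_index=True)

# Do: have the correct dtypes. Numbers stored as strings are converted once,
# up front, so later arithmetic runs on a numeric column.
raw = pd.DataFrame({"px": ["1.5", "2.5", "3.0"]})
raw["px"] = pd.to_numeric(raw["px"])

# Do: .apply across columns (one call per column, not per row or element).
col_ranges = combined.apply(lambda col: col.max() - col.min())
```

Both concatenation routes give the same frame; the difference is only in how much copying happens along the way.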
- conversions
- categoricals
- iterators
Memory Considerations
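The three memory points above can be illustrated as follows. The ticker symbols, sizes, and chunk size are arbitrary choices for the sketch, and the CSV is kept in memory only to make the example self-contained.

```python
import io

import numpy as np
import pandas as pd

n = 10_000
tickers = np.array(["AAPL", "MSFT", "GOOG", "IBM"])
df = pd.DataFrame(
    {
        "ticker": tickers[np.arange(n) % 4],
        "qty": np.arange(n, dtype="int64") % 100,
    }
)

# Conversions: values in 0..99 fit in 8 bits, so a narrower dtype saves memory.
qty_small = df["qty"].astype("int8")

# Categoricals: a low-cardinality string column is stored as small integer
# codes plus a short table of categories.
ticker_cat = df["ticker"].astype("category")
bytes_plain = df["ticker"].memory_usage(deep=True)
bytes_cat = ticker_cat.memory_usage(deep=True)

# Iterators: process a file chunk by chunk instead of loading it all at once.
csv_text = df.to_csv(index=False)
total_qty = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2_500):
    total_qty += int(chunk["qty"].sum())
```

Categoricals pay off when the number of distinct values is small relative to the number of rows; for near-unique strings they can cost more than they save.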
- HDF5
- bcolz
- CSV
- SQL
- JSON
- pickle
- msgpack
- feather
I/O & Serialization
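Two of the formats listed can be round-tripped with nothing beyond pandas and the standard library, which makes the trade-off between text and binary formats visible. The frame is invented for the sketch; the other formats on the list need optional dependencies and are not shown.

```python
import io
import pickle

import pandas as pd

df = pd.DataFrame(
    {
        "ts": pd.date_range("2016-05-06", periods=3, freq="D"),
        "px": [1.5, 2.5, 3.5],
    }
)

# CSV: portable text, but dtypes are not stored. Dates come back as strings
# unless the reader is told which columns to parse.
csv_text = df.to_csv(index=False)
from_csv_raw = pd.read_csv(io.StringIO(csv_text))
from_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["ts"])

# pickle: Python-specific binary format that preserves dtypes exactly.
payload = pickle.dumps(df)
from_pickle = pickle.loads(payload)
```

Pickle should only be loaded from trusted sources, since unpickling can execute arbitrary code.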
Global-Interpreter-Lock
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/1767175/pandas-release-the-gil-timings.png)
dask
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/2247888/collections-schedulers.png)
- Collections build task graphs
- Schedulers execute task graphs
- Graph specification = uniting interface
- A generalization of RDDs
(((A + 1) * 2) ** 3)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/2248870/embarrassing.gif)
(B - B.mean(axis=0)) + (B.T / B.std())
![](https://s3.amazonaws.com/media-p.slid.es/uploads/202361/images/2248872/normalized-b.gif)
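The split between collections, task graphs, and schedulers from the dask bullets above can be sketched in pure Python. This is not dask's implementation: the graph is a plain dict loosely modeled on the key-to-task style, and `toy_get`, `inc`, `double`, and `cube` are hypothetical names made up for the example. It evaluates `((A + 1) * 2) ** 3` independently on each chunk of `A`.

```python
def inc(chunk):
    return [x + 1 for x in chunk]


def double(chunk):
    return [x * 2 for x in chunk]


def cube(chunk):
    return [x ** 3 for x in chunk]


# "Collection" step: build a task graph for ((A + 1) * 2) ** 3, one chain
# of tasks per chunk of A. Keys name results; tuples are (function, *args),
# and string arguments that are keys refer to other tasks.
graph = {"A-0": [1, 2], "A-1": [3, 4]}
for i in range(2):
    graph[f"inc-{i}"] = (inc, f"A-{i}")
    graph[f"dbl-{i}"] = (double, f"inc-{i}")
    graph[f"cube-{i}"] = (cube, f"dbl-{i}")


def toy_get(graph, key, cache=None):
    """Hypothetical single-threaded scheduler: resolve dependencies recursively."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        resolved = [
            toy_get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        # Plain data stored directly in the graph.
        result = task
    cache[key] = result
    return result


# "Scheduler" step: execute the graph for each output chunk.
result = [toy_get(graph, f"cube-{i}") for i in range(2)]
```

Because the two chains share no keys, a real scheduler is free to run them in parallel, which is the point of separating graph construction from execution.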
How to contribute
This Talk
@jreback