Python For Quants

Performance Pandas

May 6, 2016

Jeff Reback

@jreback

https://github.com/jreback/PandasTalks/tree/master/performance/may_2016

Jeff Reback

former quant
currently working on projects at Continuum
core committer to pandas for the last 3 years
managing pandas since 2013

Why Pandas?

Vectorization for the masses
Extract Transform Load
Munging & data prep is a big part of the pipeline

How Pandas?

Fast and Efficient DataFrame
Interoperability with ecosystem
Database-like
User friendly API

The PyData Stack

automatic data alignment
rolling, expanding, and EWM operations
timeseries ops: fillna, dropna
resampling & ordered merges
timezone handling
date offsets & holiday support
intelligent interactive indexing

Why Pandas in Finance?
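
A minimal sketch of a few of the features listed on this slide (alignment, dropna/fillna, rolling, resampling). The two series and their dates are made-up illustration data, not from the talk.

```python
import numpy as np
import pandas as pd

# Two daily series with partially overlapping dates (made-up data).
idx_a = pd.date_range("2016-05-02", periods=5, freq="D")
idx_b = pd.date_range("2016-05-04", periods=5, freq="D")
a = pd.Series(np.arange(5, dtype="float64"), index=idx_a)
b = pd.Series(np.arange(5, dtype="float64") * 10, index=idx_b)

# Automatic alignment: arithmetic matches on index labels, not position.
# Dates present in only one series come out as NaN.
total = a + b

# Time-series cleanup: drop the unmatched dates, or fill them.
clean = total.dropna()
filled = total.fillna(0.0)

# Rolling window and resampling on the aligned result.
rolling_mean = clean.rolling(window=2).mean()
two_day_sum = filled.resample("2D").sum()
```

Only three dates overlap, so `clean` has three rows; the other four dates in `total` are NaN.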

implementation time
runtime
resource utilization

What do we care about when writing code?

feature set
readability counts
maintenance is a virtue
tests & docs

Constraints

Objectives

Why do we care about performance?

numpy
bottleneck
numexpr
dask
numba
libpandas
DyND

Computation Backends

http://slides.com/jeffreback/ds4ds-pandas#/

dtype segregation
block memory layout
computation backends
cython for critical parts
hashtable for indexing

What drives pandas?
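
The dtype and indexing points on this slide can be observed through the public API. A small sketch, assuming a toy frame with hypothetical ticker symbols; it does not touch pandas internals such as the block layout.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.array([101.5, 99.0, 100.25]),
    "volume": np.array([200, 150, 300], dtype="int64"),
    "ticker": ["AAA", "BBB", "CCC"],  # hypothetical symbols
})

# Each column carries its own dtype.
dtypes = df.dtypes.astype(str).to_dict()

# Selecting by dtype yields homogeneous numeric data to compute on.
numeric = df.select_dtypes(include=[np.number])
col_means = numeric.mean()

# Label-based lookup on a unique index.
by_ticker = df.set_index("ticker")
bbb_price = by_ticker.loc["BBB", "price"]
```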

algo
idioms
built-in / vectorization
- pandas/numpy
- bottleneck/numexpr
- cython
ad-hoc cython/numba

How to make pandas fast
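
A sketch of the "built-in / vectorization" level, assuming random illustration data and two hypothetical helper functions. Both compute the same number; timing them (for example with `%timeit`) is left to the reader.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "qty": rng.randint(1, 100, size=1000),
    "price": rng.rand(1000),
})


def notional_loop(frame):
    """Python-level loop: one interpreter round trip per row."""
    total = 0.0
    for q, p in zip(frame["qty"], frame["price"]):
        total += q * p
    return total


def notional_vectorized(frame):
    """Vectorized: the multiply and the sum run in compiled code."""
    return (frame["qty"] * frame["price"]).sum()


loop_result = notional_loop(df)
vec_result = notional_vectorized(df)
```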

apply across the rows
itertuples/iterrows
iterative updating

How to make pandas fast

slow
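
The slow patterns named on this slide, each next to a vectorized equivalent. A sketch on a tiny made-up frame; column names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Slow: apply across the rows calls a Python function once per row.
slow_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Faster: column arithmetic handles all rows in one operation.
fast_sum = df["a"] + df["b"]

# Slow: iterative updating, one scalar assignment per row.
updated = df.copy()
updated["c"] = 0.0
for i in updated.index:
    updated.loc[i, "c"] = updated.loc[i, "a"] * 2

# Faster: compute the whole column at once.
assigned = df.assign(c=df["a"] * 2)
```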

Do's

have the correct dtypes
pd.concat
Categoricals
.apply across columns
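
A sketch of the Do's above, assuming string-typed input such as might come from a loosely parsed file. Column names and values are made up.

```python
import pandas as pd

raw = pd.DataFrame({
    "px": ["1.5", "2.5", "3.0"],
    "date": ["2016-05-04", "2016-05-05", "2016-05-06"],
    "side": ["buy", "sell", "buy"],
})

# Have the correct dtypes: numbers as numbers, dates as datetimes,
# low-cardinality strings as Categoricals.
typed = raw.assign(
    px=pd.to_numeric(raw["px"]),
    date=pd.to_datetime(raw["date"]),
    side=raw["side"].astype("category"),
)

# pd.concat: combine all the pieces in a single call.
pieces = [typed, typed]
combined = pd.concat(pieces, ignore_index=True)

# .apply across columns: the function is called once per column.
col_range = combined[["px"]].apply(lambda col: col.max() - col.min())
```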

Don'ts

repeated insertions
micro optimize
use loops / re-invent the wheel
.apply across rows
.applymap
nest groupby.apply()
inplace=True
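
A sketch of the "repeated insertions" item, assuming rows arrive as a list of dicts (made-up data). Both frames end up with the same values; the second avoids growing the frame step by step.

```python
import pandas as pd

records = [{"t": i, "x": i * i} for i in range(5)]

# Don't: grow a frame one row at a time.
grown = pd.DataFrame([records[0]])
for rec in records[1:]:
    grown.loc[len(grown)] = [rec["t"], rec["x"]]

# Do: collect the rows first, then build the frame in a single call.
built = pd.DataFrame(records)
```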

conversions
categoricals
iterators

Memory Considerations
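
A sketch of the three items on this slide, assuming a small in-memory CSV stands in for a large file. The column names and contents are made up.

```python
import io

import pandas as pd

# Made-up CSV text: 1000 rows, two distinct symbols.
rows = ["{},{}".format("AAA" if i % 2 else "BBB", i) for i in range(1000)]
csv_text = "sym,qty\n" + "\n".join(rows)

# Iterators: read in chunks so only one chunk is in memory at a time.
chunked_total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    chunked_total += chunk["qty"].sum()

df = pd.read_csv(io.StringIO(csv_text))

# Conversions: downcast to the smallest integer type that fits the data.
small_qty = pd.to_numeric(df["qty"], downcast="integer")

# Categoricals: store each distinct string once, plus small integer codes.
sym_cat = df["sym"].astype("category")
str_bytes = df["sym"].memory_usage(deep=True)
cat_bytes = sym_cat.memory_usage(deep=True)
```

`memory_usage(deep=True)` is the simplest way to compare the footprint before and after a conversion.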

HDF5
bcolz
CSV
SQL
JSON
pickle
msgpack
feather

I/O & Serialization
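
A round-trip sketch for two of the formats listed, CSV and pickle, chosen because they need no optional dependencies. File names are hypothetical. Note that `to_msgpack` was removed in later pandas releases.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "frame.csv")  # hypothetical file names
    pkl_path = os.path.join(tmp, "frame.pkl")

    # Text format: portable, but dtypes are re-inferred on read.
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # Binary format: preserves dtypes, Python-only.
    df.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)
```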

http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/

http://odo.readthedocs.org/en/latest/

Global Interpreter Lock (GIL)

http://continuum.io/blog/pandas-releasing-the-gil

dask

Collections build task graphs
Schedulers execute task graphs
Graph specification = uniting interface
A generalization of RDDs

(((A + 1) * 2) ** 3)

(B - B.mean(axis=0)) + (B.T / B.std())
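
A toy sketch of the task-graph idea for `((A + 1) * 2) ** 3`: a plain dict maps keys to tasks, and a tiny recursive function executes it. The key names and the `get` helper below are illustrative assumptions, written in the spirit of dask's graph specification; dask's actual schedulers are far more capable (caching, parallelism, out-of-core execution).

```python
from operator import add, mul, pow

# A task is a tuple whose first element is a callable; the remaining
# elements are arguments, which may be literals or keys of other tasks.
graph = {
    "A": 1,
    "step1": (add, "A", 1),
    "step2": (mul, "step1", 2),
    "result": (pow, "step2", 3),
}


def get(graph, key):
    """Evaluate `key` by recursively evaluating the tasks it depends on."""
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, args = value[0], value[1:]
        resolved = [
            get(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return value


answer = get(graph, "result")  # ((1 + 1) * 2) ** 3
```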

https://dask.readthedocs.org/en/latest/

http://blaze.pydata.org/blog/2016/02/17/dask-distributed-1/

http://matthewrocklin.com/blog/work/2015/01/06/Towards-OOC-Scheduling

How to contribute

https://github.com/pydata/pandas/issues

https://github.com/jreback/PandasTalks/tree/master/performance/may_2016

This Talk

@jreback
