Python For Quants

Performance Pandas

May 6, 2016

Jeff Reback

@jreback

Jeff Reback

  • former quant
  • currently working on projects at Continuum
  • core committer to pandas for the last 3 years
  • managing pandas since 2013

Why Pandas?

  • Vectorization for the masses
  • Extract Transform Load
  • Munging & data prep is a big part of the pipeline

How Pandas?

  • Fast and Efficient DataFrame
  • Interoperability with ecosystem
  • Database-like
  • User friendly API

The PyData Stack

Why Pandas in Finance?

  • automatic data alignment
  • rolling, expanding, and EWM operations
  • timeseries ops: fillna, dropna
  • resampling & ordered merges
  • timezone handling
  • date offsets & holiday support
  • intelligent interactive indexing
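Two of the features listed above, automatic data alignment and rolling operations, can be sketched in a few lines. The series names and dates here are made up for illustration; they are not from the talk.

```python
import pandas as pd

# Two hypothetical daily series whose date ranges only partly overlap.
a = pd.Series(range(5), index=pd.date_range("2016-05-02", periods=5, freq="D"), dtype="float64")
b = pd.Series(range(10, 15), index=pd.date_range("2016-05-04", periods=5, freq="D"), dtype="float64")

# Arithmetic aligns on the index labels first: the result covers the union
# of both indexes and is NaN wherever only one side has a value.
total = a + b

# Timeseries ops: fill the gaps left by alignment.
filled = total.fillna(0.0)

# Rolling operation: 3-day moving average (the first two entries are NaN).
rolled = a.rolling(window=3).mean()
```

No manual reindexing or looping over dates is needed; the index carries the alignment.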

What do we care about when writing code?

Constraints

  • implementation time
  • runtime
  • resource utilization

Objectives

  • feature set
  • readability counts
  • maintenance is a virtue
  • tests & docs

Why do we care about performance?

What drives pandas?

  • dtype segregation
  • block memory layout
  • computation backends
  • cython for critical parts
  • hashtable for indexing

Computation Backends

  • numpy
  • bottleneck
  • numexpr
  • dask
  • numba
  • libpandas
  • DyND

How to make pandas fast

  • algo
  • idioms
  • built-in / vectorization
    • pandas/numpy
    • bottleneck/numexpr
    • cython
  • ad-hoc cython/numba
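As one illustration of leaning on a built-in backend such as numexpr, here is a minimal sketch using `DataFrame.eval`. The column names and data are assumptions for the example; whether it is actually faster depends on data size and on numexpr being installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=["a", "b", "c"])

# Plain pandas/numpy: the intermediate b * c is materialized as a temporary.
direct = df["a"] + df["b"] * df["c"]

# DataFrame.eval passes the whole expression to numexpr when it is installed
# and falls back to a pure-Python engine otherwise.
evaluated = df.eval("a + b * c")
```

On small frames the parsing overhead can outweigh any gain, so this is worth measuring rather than assuming.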

How to make pandas fast

slow

  • apply across the rows
  • itertuples/iterrows
  • iterative updating
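A rough sketch of why the patterns marked slow hurt: row-wise `apply` and `iterrows` call into Python once per row, while a column expression runs as one vectorized operation. The frame and column names below are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(10, 20, size=1000),
    "qty": rng.integers(1, 100, size=1000),
})

# Slow: one Python function call (and one Series construction) per row.
via_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Slow: explicit Python loop over rows.
via_iterrows = pd.Series(
    [row["price"] * row["qty"] for _, row in df.iterrows()],
    index=df.index,
)

# Fast: a single vectorized multiplication over whole columns.
vectorized = df["price"] * df["qty"]
```

All three produce the same values; only the vectorized form scales well as the row count grows.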

Do's

  • have the correct dtypes

  • pd.concat

  • Categoricals

  • .apply across columns
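The first and third items above (correct dtypes, Categoricals) can be illustrated with a small memory comparison. This is a sketch on synthetic data; the ticker symbols and column names are assumptions, and actual savings depend on the data.

```python
import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    # Few distinct values repeated many times: a good Categorical candidate.
    "ticker": rng.choice(["AAPL", "MSFT", "GOOG"], size=n),
    # Numbers that arrived as strings, e.g. from a loosely typed file.
    "qty": [str(i) for i in range(n)],
})
before = int(df.memory_usage(deep=True).sum())

# Store the repeated strings once and keep small integer codes per row.
df["ticker"] = df["ticker"].astype("category")
# Convert numeric text into a real numeric dtype.
df["qty"] = pd.to_numeric(df["qty"])
after = int(df.memory_usage(deep=True).sum())
```

Besides the smaller footprint, numeric and categorical columns let pandas use its fast paths for arithmetic and grouping.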

Don'ts

  • repeated insertions

  • micro optimize

  • use loops / re-invent the wheel

  • .apply across rows

  • .applymap

  • nest groupby.apply()

  • inplace=True
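To make the "repeated insertions" item concrete, here is a sketch contrasting growing a frame piece by piece with a single `pd.concat`. The tiny frames are placeholders for whatever pieces arrive in a real pipeline.

```python
import pandas as pd

# Hypothetical pieces, e.g. one small frame per file or per message.
pieces = [pd.DataFrame({"id": [i], "value": [i * 0.5]}) for i in range(100)]

# Don't: grow the frame step by step; every step copies all rows so far.
grown = pieces[0]
for piece in pieces[1:]:
    grown = pd.concat([grown, piece], ignore_index=True)

# Do: collect the pieces in a list and concatenate once.
combined = pd.concat(pieces, ignore_index=True)
```

The results are identical, but the first form does quadratic copying while the second copies each row once.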

Memory Considerations

  • conversions
  • categoricals
  • iterators

I/O & Serialization

  • HDF5
  • bcolz
  • CSV
  • SQL
  • JSON
  • pickle
  • msgpack
  • feather
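Iterators and CSV meet in chunked reading: a file too large for memory can be processed piece by piece. A minimal sketch, using an in-memory string as a stand-in for a large file (the column names are invented):

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_text = "symbol,qty\n" + "\n".join(f"SYM{i % 3},{i}" for i in range(1000))

total_qty = 0
rows_seen = 0
# chunksize makes read_csv return an iterator of DataFrames
# instead of loading everything at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total_qty += int(chunk["qty"].sum())
    rows_seen += len(chunk)
```

Only one chunk is held in memory at a time, so peak usage is set by `chunksize` rather than by the file size.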

Global-Interpreter-Lock

dask

  • Collections build task graphs
  • Schedulers execute task graphs
  • Graph specification = uniting interface
  • A generalization of RDDs

(((A + 1) * 2) ** 3)

(B - B.mean(axis=0)) + (B.T / B.std())
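The idea that collections build task graphs and schedulers execute them can be sketched without dask itself. The graph below follows the style of the dask graph specification (a dict whose values are data or tuples of a callable plus arguments) for the expression `(((A + 1) * 2) ** 3)`. The `get` function is a toy written for this example, not dask's scheduler, and the key names are made up.

```python
from operator import add, mul, pow

# Task graph: each key maps to literal data or to a task tuple
# (callable, *args), where string args may refer to other keys.
graph = {
    "A": 5,
    "plus": (add, "A", 1),
    "times": (mul, "plus", 2),
    "power": (pow, "times", 3),
}


def get(graph, key, cache=None):
    """Toy scheduler: resolve dependencies recursively, computing each key once."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        resolved = [
            get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        value = func(*resolved)
    cache[key] = value
    return value


result = get(graph, "power")
```

Because the graph is plain data, different schedulers (threaded, multiprocess, distributed) can execute the same graph, which is the sense in which the specification acts as a uniting interface.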

How to contribute

This Talk
