Data Analysis with Pandas

February 19, 2016

Jeff Reback

@jreback

Jeff Reback

  • former quant
  • currently working on projects at Continuum
  • core committer to pandas for the last 3 years
  • managing pandas since 2013

@jreback

  • What is Pandas?
  • Why do we use it?
  • Why do we use it in Finance?
  • Architecture
  • How to fully utilize pandas
  • I need even more!
  • What's in the future

Overview

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Support for many different data types and manipulations including: floating point & integers, boolean, datetime & time delta, categorical & text data.
  • Easy handling of missing data
  • Powerful groupby, window functions, resampling & aggregations
  • Intelligent, data-dependent slicing, indexing and data subsetting
  • Many different IO connectors, including: SQL, Excel, CSV, HDF5, BigQuery, Stata, SAS, JSON

What is Pandas?
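A minimal sketch of the features listed above, assuming a small made-up trade table (the column names and values are illustrative, not from the talk). It shows heterogeneously-typed columns, a missing value, a groupby aggregation, and boolean subsetting.

```python
import numpy as np
import pandas as pd

# Hypothetical trade data: text, float, integer and datetime columns,
# with one missing price.
df = pd.DataFrame(
    {
        "symbol": ["AAA", "BBB", "AAA", "BBB"],
        "price": [10.0, 20.0, np.nan, 22.0],
        "shares": [100, 200, 150, 50],
        "when": pd.to_datetime(
            ["2016-02-16", "2016-02-16", "2016-02-17", "2016-02-17"]
        ),
    }
)

# Each column keeps its own dtype.
print(df.dtypes)

# Reductions skip missing values by default, so the NaN price is ignored.
mean_price = df.groupby("symbol")["price"].mean()
print(mean_price)

# Data-dependent subsetting with a boolean mask.
large = df[df["shares"] >= 150]
print(large)
```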

  • Vectorization for the masses
  • ETL
  • Fast and Efficient DataFrame
  • Interoperability with ecosystem
  • Database-like
  • User friendly API
  • Munging & data prep is a big part of the pipeline

Why Pandas?

  • automatic data alignment
  • rolling, expanding, and EWM operations
  • timeseries ops: fillna, dropna
  • resampling & ordered merges
  • timezone handling
  • date offsets & holiday support
  • intelligent interactive indexing

Why Pandas in Finance?
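The time-series features above can be sketched on a toy daily series. The dates and values are invented for illustration, and the calls shown are one reasonable way to exercise each bullet rather than the examples used in the talk.

```python
import pandas as pd

idx = pd.date_range("2016-02-15", periods=5, freq="D")
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
b = pd.Series([10.0, 20.0, 30.0], index=idx[2:])

# Automatic data alignment: labels missing from one operand produce NaN.
total = a + b

# Rolling and exponentially weighted windows.
roll = a.rolling(window=2).mean()
ewm = a.ewm(span=2).mean()

# Missing-data handling on the aligned result.
filled = total.ffill()
dropped = total.dropna()

# Resampling into two-day bins.
two_day = a.resample("2D").sum()

# Timezone handling: localize to UTC, then convert.
ny = a.tz_localize("UTC").tz_convert("America/New_York")

# Date offsets: the next business day after Friday 2016-02-19.
next_bday = pd.Timestamp("2016-02-19") + pd.offsets.BDay(1)
print(total, roll, two_day, next_bday, sep="\n")
```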

  • HDF5
  • bcolz
  • CSV
  • SQL
  • JSON
  • pickle
  • msgpack

I/O & Serialization
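A small round-trip sketch for three of the formats above. It sticks to CSV, JSON and pickle because they need no optional dependencies; the frame contents and the file name `frame.pkl` are made up for the example.

```python
import io
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"symbol": ["AAA", "BBB"], "price": [10.5, 20.25]})

# CSV: write to a string, read it back from an in-memory buffer.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: one record per row.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

# pickle: binary round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    df.to_pickle(path)
    from_pickle = pd.read_pickle(path)

print(from_csv, from_json, from_pickle, sep="\n")
```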

  • dtype segregation
  • block memory layout
  • computation backends
  • cython for critical parts
  • hashtable for indexing

What drives pandas?
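A conceptual illustration of dtype segregation and hash-based indexing using only public API. It does not inspect the internal block storage; it only groups columns by dtype to suggest how same-typed columns can be stored together, which is an assumption about what the slide intends.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, 2.0],
        "b": np.array([3, 4], dtype="int64"),
        "c": [5.0, 6.0],
        "d": ["x", "y"],
    }
)

# Group column names by dtype, mimicking how same-typed columns
# can share one homogeneous chunk of memory.
by_dtype = {}
for name, dtype in df.dtypes.items():
    by_dtype.setdefault(str(dtype), []).append(name)
print(by_dtype)

# Columns of a single dtype convert to one homogeneous 2-D NumPy array.
floats = df.select_dtypes(include="float64").to_numpy()

# Label lookup on an Index maps a label to its integer position.
idx = pd.Index(["x", "y", "z"])
pos = idx.get_loc("y")
print(floats.dtype, pos)
```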

  • algo
  • idioms
  • built-in / vectorization
    • pandas/numpy
    • bottleneck/numexpr
    • cython
  • ad-hoc cython/numba

How to make pandas fast

  • apply across the rows
  • itertuples/iterrows
  • iterative updating

How to make pandas fast

slow
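To make the fast/slow contrast concrete, here is a sketch that computes the same per-row product three ways. The column names are invented; all three give identical results, but only the first stays in vectorized code, while the other two call into Python once per row.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "shares": [1, 2, 3]})

# Fast: built-in vectorized arithmetic on whole columns.
fast = df["price"] * df["shares"]

# Slow: .apply across the rows builds a Series for every row.
slow_apply = df.apply(lambda row: row["price"] * row["shares"], axis=1)

# Slow: explicit Python loop over itertuples.
slow_loop = pd.Series(
    [t.price * t.shares for t in df.itertuples(index=False)],
    index=df.index,
)

print(fast.tolist())
```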

Do's

  • have the correct dtypes

  • pd.concat

  • Categoricals

  • Use idioms & builtin

  • .apply across columns

Don'ts

  • repeated insertions

  • micro optimize

  • use loops / re-invent the wheel

  • .apply across rows

  • .applymap

  • nest groupby.apply()

  • inplace=True
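A few of the Do's above, sketched on made-up data: concatenating once instead of inserting repeatedly, fixing dtypes, using Categoricals for low-cardinality text, and applying a function per column rather than per row.

```python
import pandas as pd

# Do: collect the pieces in a list and call pd.concat once,
# rather than growing a frame with repeated insertions.
pieces = [pd.DataFrame({"day": [d], "value": [d * 1.5]}) for d in range(3)]
combined = pd.concat(pieces, ignore_index=True)

# Do: have the correct dtypes (numbers stored as text become numeric).
raw = pd.DataFrame({"qty": ["1", "2", "3"], "side": ["buy", "sell", "buy"]})
raw["qty"] = pd.to_numeric(raw["qty"])

# Do: Categoricals for repeated text values.
raw["side"] = raw["side"].astype("category")

# Do: .apply across columns (one call per column, not one per row).
col_ranges = combined.apply(lambda col: col.max() - col.min())

print(combined, raw.dtypes, col_ranges, sep="\n")
```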

Global Interpreter Lock

I need even more!

  • out-of-core
  • parallelism
  • DAG semantics

I need even more!
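One way to approximate out-of-core processing in plain pandas is to read a file in chunks and combine partial aggregates. This is a sketch under that assumption, with an in-memory string standing in for a large file; it is not how Dask schedules work.

```python
import io

import pandas as pd

# Stand-in for a file too large to load at once.
csv_text = "symbol,shares\nAAA,100\nBBB,200\nAAA,50\nBBB,25\nAAA,10\n"

# Aggregate each chunk separately so only one chunk is in memory at a time.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partials.append(chunk.groupby("symbol")["shares"].sum())

# Combine the per-chunk sums into the final totals.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```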

Dask

I need even more!


(((A + 1) * 2) ** 3)

I need even more!

(B - B.mean(axis=0)) + (B.T / B.std())

I need even more!
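The two array expressions above are presumably the kind that Dask turns into task graphs. The sketch below evaluates them eagerly with NumPy on small invented arrays and names each intermediate, since each one would correspond to a node in such a graph. Dask arrays accept the same NumPy-style syntax but defer the work until a result is requested.

```python
import numpy as np

# (((A + 1) * 2) ** 3), one step at a time.
A = np.arange(4.0)      # [0, 1, 2, 3]
step1 = A + 1           # [1, 2, 3, 4]
step2 = step1 * 2       # [2, 4, 6, 8]
result_a = step2 ** 3   # [8, 64, 216, 512]

# (B - B.mean(axis=0)) + (B.T / B.std()); B is square so the shapes match.
B = np.array([[1.0, 2.0], [3.0, 4.0]])
centered = B - B.mean(axis=0)   # subtract each column's mean
scaled = B.T / B.std()          # transpose divided by the overall std
result_b = centered + scaled

print(result_a)
print(result_b)
```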

  • Panel Deprecation
  • IntervalIndex
  • libpandas

What's in the future
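Written with hindsight: IntervalIndex has since shipped in pandas, and a MultiIndex DataFrame is a common replacement for Panel. The sketch below assumes a pandas release that includes `pd.interval_range`; the labels and prices are invented.

```python
import pandas as pd

# IntervalIndex: look up a value by the interval that contains a point.
bins = pd.interval_range(start=0, end=3)   # (0, 1], (1, 2], (2, 3]
labels = pd.Series(["low", "mid", "high"], index=bins)
hit = labels.loc[1.5]

# Panel-style 3-D data as a DataFrame with a two-level MultiIndex.
index = pd.MultiIndex.from_product(
    [["AAA", "BBB"], pd.to_datetime(["2016-02-18", "2016-02-19"])],
    names=["symbol", "date"],
)
panel_like = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Selecting one symbol returns its date-indexed slice.
aaa = panel_like.loc["AAA"]
print(hit)
print(aaa)
```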

How to contribute

This Talk
