Data Analysis with Pandas
February 19, 2016
Jeff Reback
@jreback
Jeff Reback
- former quant
- currently working on projects at Continuum
- core committer to pandas for the last 3 years
- managing pandas since 2013
@jreback
- What is Pandas?
- Why do we use it?
- Why do we use it in Finance?
- Architecture
- How to fully utilize pandas
- I need even more!
- What's in the future
Overview
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Support for many different data types and manipulations including: floating point & integers, boolean, datetime & time delta, categorical & text data.
- Easy handling of missing data
- Powerful groupby, window functions, resampling & aggregations
- Intelligent, data-dependent slicing, indexing and data subsetting
- Many different I/O connectors, including: SQL, Excel, CSV, HDF5, BigQuery, Stata, SAS, JSON
What is Pandas?
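A minimal sketch of the points on this slide, assuming a small made-up trades table (the column names and values are hypothetical, not from the talk). It shows heterogeneously typed columns, a missing value being skipped in an aggregation, and data-dependent subsetting with a boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical trade records; every name and value here is made up.
trades = pd.DataFrame(
    {
        "symbol": pd.Categorical(["AAPL", "MSFT", "AAPL", "MSFT"]),
        "price": [101.5, 52.0, np.nan, 53.0],  # float column with one missing value
        "qty": [100, 200, 150, 50],            # integer column
        "filled": [True, True, False, True],   # boolean column
        "when": pd.to_datetime(
            ["2016-02-18", "2016-02-18", "2016-02-19", "2016-02-19"]
        ),
    }
)

# Each column keeps its own dtype.
print(trades.dtypes)

# Reductions skip missing values by default, so the NaN price is ignored.
mean_price = trades.groupby("symbol", observed=True)["price"].mean()

# A boolean mask selects rows based on the data itself.
filled_on_19th = trades[trades["filled"] & (trades["when"] == "2016-02-19")]
```

Here `mean_price` is computed from the three non-missing prices only, and `filled_on_19th` keeps the single row that satisfies both conditions.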
- Vectorization for the masses
- ETL
- Fast and Efficient DataFrame
- Interoperability with ecosystem
- Database-like
- User friendly API
- Munging & data prep is a big part of the pipeline
Why Pandas?
- automatic data alignment
- rolling, expanding, and EWM operations
- timeseries ops: fillna, dropna
- resampling & ordered merges
- timezone handling
- date offsets & holiday support
- intelligent interactive indexing
Why Pandas in Finance?
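The time-series features listed above can be sketched on two short, made-up daily series (the values and dates are assumptions for illustration). This is only a quick tour of alignment, window functions, resampling, timezone conversion, and business-day offsets, not a recommended workflow.

```python
import pandas as pd

# Two hypothetical daily series on overlapping but unequal date ranges.
a = pd.Series([10.0, 11.0, 12.0],
              index=pd.date_range("2016-02-15", periods=3, freq="D"))
b = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("2016-02-16", periods=3, freq="D"))

# Automatic alignment: values are matched on dates; unmatched dates give NaN.
total = a + b

# Rolling, expanding, and exponentially weighted windows.
rolling_mean = a.rolling(window=2).mean()
expanding_sum = a.expanding().sum()
ewm_mean = a.ewm(span=2).mean()

# Missing-data helpers on the aligned result.
total_filled = total.fillna(0.0)
total_dropped = total.dropna()

# Resample the daily series into two-day bins.
two_day = a.resample("2D").sum()

# Timezone handling: mark the index as UTC, then convert to New York time.
eastern = a.tz_localize("UTC").tz_convert("America/New_York")

# Date offsets: 2016-02-19 is a Friday, so one business day later is Monday.
next_business_day = pd.Timestamp("2016-02-19") + pd.offsets.BDay(1)
```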
- HDF5
- bcolz
- CSV
- SQL
- JSON
- pickle
- msgpack
I/O & Serialization
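As a rough illustration of two of the formats above, the sketch below round-trips a small made-up frame through CSV (via an in-memory buffer) and pickle (via a temporary file). The frame contents and the file name are assumptions; HDF5, bcolz, and msgpack are left out because they need extra packages.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical frame with a named datetime index.
df = pd.DataFrame(
    {"price": [101.5, 52.0], "qty": [100, 200]},
    index=pd.Index(pd.to_datetime(["2016-02-18", "2016-02-19"]), name="date"),
)

# CSV is text, so dtypes are inferred again on read; dates must be parsed explicitly.
csv_text = df.to_csv()
from_csv = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# Pickle stores the object as-is, so dtypes and index come back unchanged.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")  # hypothetical file name
    df.to_pickle(path)
    from_pickle = pd.read_pickle(path)
```

The design trade-off is the usual one: CSV is portable and human-readable but loses type information, while pickle is exact but specific to Python.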
- dtype segregation
- block memory layout
- computation backends
- cython for critical parts
- hashtable for indexing
What drives pandas?
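The internal block layout is not part of the public API, but two of the ideas above are visible from the outside. The sketch below, on a made-up frame, uses `select_dtypes` to pull out columns by dtype and `Index.get_loc` to do a label lookup, which pandas resolves through the index rather than by scanning rows.

```python
import pandas as pd

# Hypothetical frame with float, integer, and string columns.
df = pd.DataFrame(
    {
        "price": [101.5, 52.0, 53.0],
        "qty": [100, 200, 50],
        "symbol": ["AAPL", "MSFT", "MSFT"],
    },
    index=["t1", "t2", "t3"],
)

# Columns can be selected by dtype, mirroring how pandas groups them internally.
numeric = df.select_dtypes(include="number")
floats = df.select_dtypes(include="float64")

# A label lookup returns the integer position of the label in the index.
position = df.index.get_loc("t2")

# .loc uses the same index lookup to fetch the row.
row = df.loc["t2"]
```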
- algo
- idioms
- built-in / vectorization
- pandas/numpy
- bottleneck/numexpr
- cython
- ad-hoc cython/numba
How to make pandas fast
- apply across the rows
- itertuples/iterrows
- iterative updating
How to make pandas slow
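To make the contrast concrete, here is a small sketch that computes the same column three ways on made-up random data: with `itertuples`, with `.apply` across rows, and with a vectorized expression. All three give the same numbers; the first two run a Python-level loop over every row, while the last one runs in compiled code.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 random prices and quantities.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"price": rng.random(1000), "qty": rng.integers(1, 100, 1000)}
)

# Slow: iterate over rows in Python.
slow_tuples = pd.Series(
    [row.price * row.qty for row in df.itertuples()], index=df.index
)

# Slow: .apply across rows builds a Series for every row.
slow_apply = df.apply(lambda r: r["price"] * r["qty"], axis=1)

# Fast: one vectorized operation over whole columns.
fast = df["price"] * df["qty"]
```

Timing the three (for example with `%timeit` in IPython) is the easiest way to see the gap on your own data.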
Do's
- have the correct dtypes
- pd.concat
- Categoricals
- Use idioms & builtin
- .apply across columns
Don'ts
- repeated insertions
- micro optimize
- use loops / re-invent the wheel
- .apply across rows
- .applymap
- nest groupby.apply()
- inplace=True
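A few of the do's above, sketched on made-up data (the chunk contents and column names are assumptions): concatenate once instead of inserting repeatedly, store low-cardinality strings as categoricals, convert text to a numeric dtype, and use a built-in aggregation rather than a row-wise `.apply`.

```python
import pandas as pd

# Hypothetical pieces of data arriving one chunk at a time.
chunks = [
    pd.DataFrame({"symbol": ["AAPL", "MSFT"], "qty": [i, i + 1]})
    for i in range(3)
]

# Do: collect the pieces in a list and concatenate once,
# instead of growing a frame with repeated insertions.
combined = pd.concat(chunks, ignore_index=True)

# Do: store repeated strings as a categorical.
combined["symbol"] = combined["symbol"].astype("category")

# Do: have the correct dtypes; numbers read as text should be converted.
raw = pd.Series(["1", "2", "3"])
numbers = pd.to_numeric(raw)

# Do: use a built-in aggregation rather than a Python loop or row-wise .apply.
total_by_symbol = combined.groupby("symbol", observed=True)["qty"].sum()
```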
Global Interpreter Lock
I need even more!
- out-of-core
- parallelism
- DAG semantics
I need even more!
Dask
I need even more!
(((A + 1) * 2) ** 3)
I need even more!
(B - B.mean(axis=0)) + (B.T / B.std())
I need even more!
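The two expressions on these slides presumably accompanied task-graph pictures that did not survive extraction. As an eager stand-in, and not a Dask example, the sketch below evaluates both with plain NumPy on small made-up arrays. `A` and `B` are assumed inputs; `B` is taken to be square so that `B` and `B.T` have matching shapes.

```python
import numpy as np

# Hypothetical 1-D input for the first expression.
A = np.arange(4.0)  # [0., 1., 2., 3.]

# Each operator produces a full intermediate array before the next one runs.
result_a = ((A + 1) * 2) ** 3

# Hypothetical square 2-D input for the second expression.
B = np.array([[1.0, 2.0], [3.0, 5.0]])

# Column means are broadcast across rows; B.std() is a single scalar.
centered = B - B.mean(axis=0)
scaled_transpose = B.T / B.std()
result_b = centered + scaled_transpose
```

A lazy system can instead record these steps as a graph of tasks and decide later how to schedule them, which is the motivation for the out-of-core and parallelism bullets above.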
- Panel Deprecation
- IntervalIndex
- libpandas
What's in the future
How to contribute
This Talk
@jreback