Data Analysis with Pandas

https://github.com/jreback/PandasTalks/tree/master/february_2016/

February 19, 2016

Jeff Reback

@jreback

Jeff Reback

former quant
currently working on projects at Continuum
core committer to pandas for the last 3 years
managing pandas since 2013

@jreback

What is Pandas?
Why do we use it?
Why do we use it in Finance?
Architecture
How to fully utilize pandas
I need even more!
What's in the future

Overview

Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
Support for many different data types and manipulations, including: floating point & integer, boolean, datetime & timedelta, categorical & text data
Easy handling of missing data
Powerful groupby, window functions, resampling & aggregations
Intelligent, data-dependent slicing, indexing and data subsetting
Many different I/O connectors, including: SQL, Excel, CSV, HDF5, BigQuery, Stata, SAS, JSON

What is Pandas?
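
The features listed on this slide can be illustrated with a minimal sketch. The trade records and column names below are hypothetical and chosen only for illustration; they are not from the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical trade records; names and values are illustrative only.
trades = pd.DataFrame({
    "symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "price": [100.0, np.nan, 102.0, 51.0],   # one missing price
    "qty": [10, 20, 30, 40],
    "when": pd.to_datetime(
        ["2016-02-18", "2016-02-18", "2016-02-19", "2016-02-19"]
    ),
})

# Heterogeneously-typed columns: each column keeps its own dtype.
dtypes = trades.dtypes

# Missing data: aggregations skip NaN by default.
mean_price = trades.groupby("symbol")["price"].mean()

# Data-dependent subsetting with a boolean mask.
big = trades[trades["qty"] > 15]
```

Here the MSFT mean is computed from the single non-missing price, without any explicit NaN handling.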

Vectorization for the masses
ETL
Fast and Efficient DataFrame
Interoperability with ecosystem
Database-like
User friendly API
Munging & data prep is a big part of the pipeline

Why Pandas?

automatic data alignment
rolling, expanding, and EWM operations
timeseries ops: fillna, dropna
resampling & ordered merges
timezone handling
date offsets & holiday support
intelligent interactive indexing

Why Pandas in Finance?
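
A small sketch of several items from this slide, assuming a hypothetical daily series (the values and dates are illustrative, not from the talk):

```python
import pandas as pd

# Five daily observations starting Monday 2016-02-15, in UTC.
idx = pd.date_range("2016-02-15", periods=5, freq="D", tz="UTC")
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
b = pd.Series([10.0, 20.0, 30.0], index=idx[2:])  # starts two days later

# Automatic data alignment: labels are matched, gaps become NaN.
total = a + b
filled = a.add(b, fill_value=0.0)

# Rolling and resampling operations.
roll = a.rolling(window=2).mean()
two_day = a.resample("2D").sum()

# Timezone handling: same instants, different wall-clock time.
ny = a.tz_convert("America/New_York")

# Date offsets: next business day after Friday 2016-02-19.
next_bday = pd.Timestamp("2016-02-19") + pd.offsets.BDay(1)
```

Note that `a + b` never requires the two series to be the same length; alignment on the index does the bookkeeping.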

HDF5
bcolz
CSV
SQL
JSON
pickle
msgpack

I/O & Serialization
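
As a rough sketch of how a frame moves through some of these formats, the example below round-trips a small, made-up DataFrame through CSV, JSON and pickle entirely in memory. HDF5, bcolz, SQL and msgpack are left out because they need extra dependencies.

```python
import io
import pickle

import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# CSV: text format, dtypes are re-inferred on read.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: the "split" orientation stores columns, index and data separately.
json_text = df.to_json(orient="split")
from_json = pd.read_json(io.StringIO(json_text), orient="split")

# pickle: binary, preserves dtypes exactly, Python-only.
from_pickle = pickle.loads(pickle.dumps(df))
```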

http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/

http://odo.readthedocs.org/en/latest/

dtype segregation
block memory layout
computation backends
cython for critical parts
hashtable for indexing

What drives pandas?
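
One visible consequence of dtype segregation can be sketched with a small, hypothetical frame: each column has its own dtype, so pulling out a single row forces the values into one common dtype. This illustrates the idea only and does not inspect the internal block layout.

```python
import pandas as pd

# Illustrative frame with integer, float and text columns.
df = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5], "s": ["x", "y"]})

# Column access returns data in its native dtype.
float_col = df["f"]

# Row access must find one dtype that fits every column:
# mixing numbers and text falls back to object.
row = df.iloc[0]

# With numeric columns only, the row is upcast to float64 instead.
numeric_row = df[["i", "f"]].iloc[0]
```

This is one reason column-wise operations tend to be cheaper than row-wise ones.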

algo
idioms
built-in / vectorization
- pandas/numpy
- bottleneck/numexpr
- cython
ad-hoc cython/numba

How to make pandas fast

apply across the rows
itertuples/iterrows
iterative updating

How to make pandas slow
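
To make the contrast concrete, here is a sketch that computes the same column three ways on made-up data. All three give identical results; the first two go through Python one row at a time, while the last is a single vectorized operation.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "price": np.arange(1.0, 1001.0),
    "qty": np.arange(1000),
})

# Slow: apply across the rows builds a Series per row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Slow: explicit iteration over rows.
loop = pd.Series([r.price * r.qty for r in df.itertuples(index=False)])

# Fast: one vectorized multiplication over whole columns.
fast = df["price"] * df["qty"]
```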

Do's

have the correct dtypes
pd.concat
Categoricals
Use idioms & builtin
.apply across columns
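
A brief sketch of the first few items, assuming a hypothetical frame that arrived with everything stored as text:

```python
import pandas as pd

# Illustrative raw data: numbers and labels both stored as strings.
raw = pd.DataFrame({
    "qty": ["1", "2", "3"] * 1000,
    "side": ["buy", "sell", "buy"] * 1000,
})

# Correct dtypes: numeric text becomes integers,
# low-cardinality text becomes a categorical.
typed = raw.assign(
    qty=pd.to_numeric(raw["qty"]),
    side=raw["side"].astype("category"),
)

# Categoricals store small integer codes plus one copy of each label.
text_bytes = raw["side"].memory_usage(deep=True)
cat_bytes = typed["side"].memory_usage(deep=True)

# .apply across columns: the function receives a whole column at a time.
col_range = typed[["qty"]].apply(lambda col: col.max() - col.min())
```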

Don'ts

repeated insertions
micro-optimize
use loops / re-invent the wheel
.apply across rows
.applymap
nest groupby.apply()
inplace=True
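
The "repeated insertions" item pairs with `pd.concat` from the Do's list. A minimal sketch on made-up chunks: growing a frame inside a loop copies the accumulated data on every step, whereas collecting the pieces and concatenating once copies it a single time.

```python
import pandas as pd

# Illustrative chunks, e.g. one per input file.
chunks = [pd.DataFrame({"x": range(i, i + 3)}) for i in range(0, 9, 3)]

# Don't: grow the frame step by step (re-copies everything each time).
grown = chunks[0]
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk], ignore_index=True)

# Do: collect the pieces and concatenate once.
combined = pd.concat(chunks, ignore_index=True)

# Instead of inplace=True, assign the returned object.
cleaned = combined.dropna()
```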

Global Interpreter Lock

I need even more!

http://continuum.io/blog/pandas-releasing-the-gil
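
When an operation releases the GIL, plain Python threads can run it on several frames at once. The sketch below assumes `groupby(...).sum()` is such an operation and uses made-up data; it only shows the pattern and checks that threaded and serial results agree. Any speedup depends on the pandas version and the machine.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Four independent, illustrative frames.
rng = np.random.default_rng(0)
frames = [
    pd.DataFrame({
        "key": rng.integers(0, 10, 10_000),
        "val": rng.random(10_000),
    })
    for _ in range(4)
]

def group_sum(frame):
    # Hypothetical helper: one grouped aggregation per frame.
    return frame.groupby("key")["val"].sum()

# Run the aggregations on a thread pool, then serially for comparison.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(group_sum, frames))
serial = [group_sum(frame) for frame in frames]
```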

out-of-core
parallelism
DAG semantics

I need even more!

Dask

I need even more!

(((A + 1) * 2) ** 3)

I need even more!

(B - B.mean(axis=0)) + (B.T / B.std())

I need even more!
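
The two expressions on these slides can be evaluated eagerly with NumPy to see what they compute. The array shapes below are assumptions (`B` is taken to be square so that `B` and `B.T` have matching shapes). Dask's array interface mirrors NumPy, so the same expressions there would build a task graph to be executed later rather than computing a result immediately.

```python
import numpy as np

# Illustrative inputs; the shapes are assumptions, not from the talk.
A = np.arange(6.0).reshape(2, 3)
B = np.arange(9.0).reshape(3, 3)

# Elementwise chain: each step depends only on the previous one.
expr_a = ((A + 1) * 2) ** 3

# Mixes a column-wise reduction, a transpose and a global reduction.
expr_b = (B - B.mean(axis=0)) + (B.T / B.std())
```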

https://dask.readthedocs.org/en/latest/

http://blaze.pydata.org/blog/2016/02/17/dask-distributed-1/

http://matthewrocklin.com/blog/work/2015/01/06/Towards-OOC-Scheduling

Panel Deprecation
IntervalIndex
libpandas

What's in the future

How to contribute

http://pandas.pydata.org/pandas-docs/stable/contributing.html

https://github.com/jreback/PandasTalks/tree/master/february_2016

This Talk
