Data Analysis with Pandas

https://github.com/jreback/PandasTalks/tree/master/february_2016/

February 19, 2016

Jeff Reback

@jreback

Jeff Reback

former quant
currently working on projects at Continuum
core committer to pandas for the last 3 years
managing pandas since 2013


What is Pandas?
Why do we use it?
Why do we use it in Finance?
Architecture
How to fully utilize pandas
I need even more!
What's in the future

Overview

Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
Support for many different data types and manipulations including: floating point & integers, boolean, datetime & time delta, categorical & text data.
Easy handling of missing data
Powerful groupby, window functions, resampling & aggregations
Intelligent, data-dependent slicing, indexing and data subsetting
Many different I/O connectors, including: SQL, Excel, CSV, HDF5, BigQuery, Stata, SAS, JSON

What is Pandas?
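A minimal sketch of the points above: mixed column types in one table, a missing value, a groupby aggregation, and data-dependent selection. The frame `trades` and its column names and values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical trade records; names and values are illustrative only.
trades = pd.DataFrame(
    {
        "symbol": pd.Categorical(["AAA", "BBB", "AAA", "BBB"]),  # categorical
        "price": [10.0, np.nan, 11.0, 20.0],                     # float, one missing
        "qty": [100, 200, 150, 50],                              # integer
        "filled": [True, False, True, True],                     # boolean
        "ts": pd.to_datetime(
            ["2016-02-19 09:30", "2016-02-19 09:31",
             "2016-02-19 09:32", "2016-02-19 09:33"]
        ),                                                       # datetime
    }
)

# Each column keeps its own dtype.
print(trades.dtypes)

# Missing values are skipped by default in reductions.
mean_price = trades.groupby("symbol", observed=True)["price"].mean()

# Data-dependent subsetting with a boolean mask.
filled_aaa = trades[(trades["symbol"] == "AAA") & trades["filled"]]
```

The NaN in `price` does not need special handling here: the group mean for "BBB" is computed from the one value that is present.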

Vectorization for the masses
ETL
Fast and Efficient DataFrame
Interoperability with ecosystem
Database-like
User friendly API
Munging & data prep is a big part of the pipeline

Why Pandas?

automatic data alignment
rolling, expanding, and EWM operations
timeseries ops: fillna, dropna
resampling & ordered merges
timezone handling
date offsets & holiday support
intelligent interactive indexing

Why Pandas in Finance?
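Several of the time series features listed above can be sketched on a small daily series. The series `a` and `b` and their values are made up for illustration; only standard pandas calls are used.

```python
import pandas as pd

idx = pd.date_range("2016-02-15", periods=5, freq="D")
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
b = pd.Series([10.0, 20.0, 30.0], index=idx[2:])  # starts two days later

# Automatic alignment: labels present on only one side become NaN.
total = a + b

# Rolling, expanding, and exponentially weighted windows.
roll = a.rolling(window=2).mean()
expd = a.expanding().sum()
ewm = a.ewm(span=3).mean()

# Typical missing-data handling after alignment.
filled = total.fillna(0.0)
dropped = total.dropna()

# Resample the daily data into 2-day bins.
two_day = a.resample("2D").sum()

# Time zone handling: localize to UTC, then convert.
ny = a.tz_localize("UTC").tz_convert("America/New_York")

# Date offsets: the next business day after a Friday is the following Monday.
next_bday = pd.Timestamp("2016-02-19") + pd.offsets.BDay(1)
```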

HDF5
bcolz
CSV
SQL
JSON
pickle
msgpack

I/O & Serialization

http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/

http://odo.readthedocs.org/en/latest/
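A small round-trip sketch for three of the formats listed above (CSV, JSON, pickle), using in-memory buffers so that no files are written. HDF5, bcolz, SQL, and msgpack need extra dependencies and are left out; the frame `df` is hypothetical.

```python
import io
import pickle

import pandas as pd

df = pd.DataFrame({"sym": ["AAA", "BBB"], "px": [1.5, 2.5]})

# CSV: text out, text back in.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: one record per row.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

# pickle: binary round trip through the standard library module.
from_pickle = pickle.loads(pickle.dumps(df))
```

Text formats such as CSV and JSON have to re-infer dtypes on read, while pickle preserves them, which is one reason the binary formats tend to be faster.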

dtype segregation
block memory layout
computation backends
cython for critical parts
hashtable for indexing

What drives pandas?

algo
idioms
built-in / vectorization
- pandas/numpy
- bottleneck/numexpr
- cython
ad-hoc cython/numba

How to make pandas fast

apply across the rows
itertuples/iterrows
iterative updating

How to make pandas fast

slow
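To illustrate why the row-wise patterns above are slow, the sketch below computes the same result three ways. The frame and its columns `px` and `qty` are hypothetical; all three produce identical values, but only the last one stays inside vectorized code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"px": np.arange(1.0, 6.0), "qty": np.arange(10, 60, 10)})

# Slow: Python-level loop over rows.
notional_loop = [row.px * row.qty for row in df.itertuples(index=False)]

# Slow: .apply across the rows calls the function once per row.
notional_apply = df.apply(lambda r: r["px"] * r["qty"], axis=1)

# Fast: one vectorized operation over whole columns.
notional_vec = df["px"] * df["qty"]
```

On a five-row frame the difference is invisible; the gap grows with the number of rows because the first two variants pay Python call overhead per row.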

Do's

have the correct dtypes
pd.concat
Categoricals
use idioms & built-ins
.apply across columns

Don'ts

repeated insertions
micro-optimize
use loops / re-invent the wheel
.apply across rows
.applymap
nest groupby.apply()
inplace=True
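A few of the Do's can be sketched together: build the frame once with pd.concat instead of inserting repeatedly, fix dtypes early, use categoricals for repetitive text, and keep .apply column-wise. All names and values below are hypothetical.

```python
import pandas as pd

# Do: collect the pieces and concatenate once, rather than growing a frame
# with repeated insertions.
pieces = [pd.DataFrame({"k": [i], "v": [str(i * 10)]}) for i in range(3)]
df = pd.concat(pieces, ignore_index=True)

# Do: have the correct dtypes. "v" arrived as text, so convert it once.
df["v"] = pd.to_numeric(df["v"])

# Do: use categoricals for low-cardinality text.
side = pd.Series(["buy", "sell", "buy", "buy"] * 1000)
side_cat = side.astype("category")
smaller = side_cat.memory_usage(deep=True) < side.memory_usage(deep=True)

# Do: .apply across columns (one function call per column, not per row).
col_range = df.apply(lambda col: col.max() - col.min())
```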

Global Interpreter Lock

I need even more!

http://continuum.io/blog/pandas-releasing-the-gil

out-of-core
parallelism
DAG semantics

I need even more!
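Before reaching for another library, a simple form of out-of-core processing is possible in pandas itself by reading in chunks and combining partial results. The sketch below is an illustration of that idea, not the approach taken in the talk; the in-memory CSV stands in for a file too large to load at once.

```python
import io

import pandas as pd

# Stand-in for a large file: 10 rows alternating between keys "a" and "b".
csv_text = "key,val\n" + "\n".join(f"{'ab'[i % 2]},{i}" for i in range(10))

# Aggregate each chunk separately so only one chunk is in memory at a time.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partials.append(chunk.groupby("key")["val"].sum())

# Combine the per-chunk sums into the final result.
total = pd.concat(partials).groupby(level=0).sum()
```

This works because a sum can be split into partial sums. Doing the bookkeeping by hand for every operation, and running the chunks in parallel, is the kind of work a task scheduler automates.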

Dask

I need even more!

(((A + 1) * 2) ** 3)

I need even more!

(B - B.mean(axis=0)) + (B.T / B.std())

I need even more!
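The two expressions above are ordinary array arithmetic. As a hedged illustration, the sketch below evaluates them eagerly with NumPy on small made-up arrays; dask.array offers a NumPy-like interface that builds a task graph for such expressions and evaluates it only when asked. The shapes and values of `A` and `B` are assumptions (the second expression needs a square `B` so that `B` and `B.T` line up).

```python
import numpy as np

# Hypothetical inputs; the slides do not specify shapes or values.
A = np.arange(6, dtype=float).reshape(2, 3)
B = np.array([[1.0, 2.0], [3.0, 4.0]])

# Elementwise chain: each step depends only on the previous one.
expr_a = ((A + 1) * 2) ** 3

# Mixes a column-wise reduction, a transpose, and a global reduction.
expr_b = (B - B.mean(axis=0)) + (B.T / B.std())
```

The first expression is a straight chain of elementwise steps, so each block of `A` can be processed independently. The second needs reductions over the whole array before the final addition, which is where a dependency graph becomes useful.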

https://dask.readthedocs.org/en/latest/

http://blaze.pydata.org/blog/2016/02/17/dask-distributed-1/

http://matthewrocklin.com/blog/work/2015/01/06/Towards-OOC-Scheduling

Panel Deprecation
IntervalIndex
libpandas

What's in the future

How to contribute

http://pandas.pydata.org/pandas-docs/stable/contributing.html

https://github.com/jreback/PandasTalks/tree/master/february_2016

This Talk
