Jumping over data land mines with blaze

about me

MA Psychology
- Computational Neuroscience
Core pandas dev
Blaze et al @ContinuumIO

Motivation

NumPy and Pandas are limited to memory
And they have great APIs
Let's bring those APIs to more complex technologies

Approach

Blaze is an interface
- It doesn't implement any computation on its own
It doesn't replace databases or pandas
- It sits on top of them
- Like a compiler for read only analytics queries
It makes complex technologies more accessible

WHERE does BLAZe fit in to pydata?

pieces of blaze

Expressions + TYPES

>>> from blaze import symbol, discover, compute
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Alice', 'Bob', 'Forrest', 'Bubba'],
...                    'amount': [10, 20, 30, 40]})
...
>>> t = symbol('t', discover(df))
>>> t.amount.sum()
sum(t.amount)

>>> compute(t.amount.sum(), df)
100
>>> compute(t.amount.sum(), odo(df, list))
100
>>> compute(t.amount.sum(), odo(df, np.ndarray))
100

compute recipes

demo time!

Blaze also lets you Do it yourself

Who's heard of the q language?

q)x:"racecar"
q)n:count x
q)all{[x;n;i]x[i]=x[n-i+1]}[x;n]each til _:[n%2]+1
1b

Check if a string is a palindrome

q)-1 x
racecar
-1
q)1 x
racecar1

Print to stdout, with and without a newline

Um, integers are callable?

How about:

1 divided by cat

q)1 % "cat"
0.01010101 0.01030928 0.00862069

However, KDB is fast

so....

Ditch Q,
Keep KDB+

kdbpy: Q without the WAT, via blaze

KDB+ is a database sold by Kx Systems.
- Free 32-bit version available for download on their website.
Column store*.
Makes big things feel small and huge things feel doable.
Heavily used in the financial world.

Why KDB+/Q?

*It's a little more nuanced than that

It's a backend for blaze
It generates q code from python code
That code is run by a q interpreter

What is kdbpy?

To the notebook!

How does Q compare to other blaze backends?

NYC Taxi Trip Data

≈16 GB (trip dataset only)

partitioned in KDB+ on date (year.month.day)

blaze (bcolz + pandas + multiprocessing)

The computation

group by on
- passenger count
- medallion
- hack license
sum on
- trip time
- trip distance

The queries

# trip time
avg_trip_time = trip.trip_time_in_secs.mean()
by(trip.medallion, avg_trip_time=avg_trip_time)
by(trip.passenger_count, avg_trip_time=avg_trip_time)
by(trip.hack_license, avg_trip_time=avg_trip_time)

The hardware

two machines
- 32 cores, 250GB RAM, ubuntu
- 8 cores, 16GB RAM, osx

Beef vs. Mac 'n Cheese vs. Pandas

How pe-q-ular...

Questions

Is this a fair comparison?
- bcolz splits each column into chunks that fit in cache
- kdb writes a directory of columns per value in the partition column
kdb is using symbols instead of strings
- requires an index column for partitions
  - can take a long time to sort
- strings are not very efficient

How does the blaze version work?

bcolz +
pandas +
multiprocessing

bcolz

Column store
- directory per column
Column chunked to fit in cache
numexpr in certain places
- reductions
- arithmetic
transparent reading from disk

pandas

fast, in-memory analytics

Multiprocessing

compute each chunk in separate process

Storage

Compute

Parallelization

pray to the demo gods

graphlab integration

Thanks!

docs: http://blaze.pydata.org