Jumping over data land mines with blaze

about me

  • MA Psychology
    • Computational Neuroscience
  • Core pandas dev
  • Blaze et al @ContinuumIO

Motivation

  • NumPy and Pandas are limited to memory
  • And they have great APIs
  • Let's bring those APIs to more complex technologies

Approach

  • Blaze is an interface
    • It doesn't implement any computation on its own
  • It doesn't replace databases or pandas
    • It sits on top of them
    • Like a compiler for read only analytics queries
  • It makes complex technologies more accessible

WHERE does BLAZe fit in to pydata?

pieces of blaze

Expressions + TYPES

>>> from blaze import symbol, discover, compute
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Alice', 'Bob', 'Forrest', 'Bubba'],
...                    'amount': [10, 20, 30, 40]})
...
>>> t = symbol('t', discover(df))
>>> t.amount.sum()
sum(t.amount)
>>> compute(t.amount.sum(), df)
100
>>> compute(t.amount.sum(), odo(df, list))
100
>>> compute(t.amount.sum(), odo(df, np.ndarray))
100

compute recipes

demo time!

Blaze also lets you Do it yourself

Who's heard of the q language?

q)x:"racecar"
q)n:count x
q)all{[x;n;i]x[i]=x[n-i+1]}[x;n]each til _:[n%2]+1
1b

Check if a string is a palindrome

q)-1 x
racecar
-1
q)1 x
racecar1

Print to stdout, with and without a newline

Um, integers are callable?

How about:

1 divided by cat

q)1 % "cat"
0.01010101 0.01030928 0.00862069

However, KDB is fast

so....

Ditch Q,
Keep KDB+

kdbpy: Q without the WAT, via blaze

  • KDB+ is a database sold by Kx Systems.
    • Free 32-bit version available for download on their website.
  • Column store*.
  • Makes big things feel small and huge things feel doable.
  • Heavily used in the financial world.

Why KDB+/Q?

*It's a little more nuanced than that

  • It's a backend for blaze

  • It generates q code from python code

  • That code is run by a q interpreter

What is kdbpy?

To the notebook!

How does Q compare to other blaze backends?

NYC Taxi Trip Data

≈16 GB  (trip dataset only)

partitioned in KDB+ on date (year.month.day)

vs

blaze (bcolz + pandas + multiprocessing)

The computation

  • group by on

    • passenger count

    • medallion

    • hack license

  • sum on

    • trip time

    • trip distance

The queries

# trip time
avg_trip_time = trip.trip_time_in_secs.mean()
by(trip.medallion, avg_trip_time=avg_trip_time)
by(trip.passenger_count, avg_trip_time=avg_trip_time)
by(trip.hack_license, avg_trip_time=avg_trip_time)

The hardware

  • two machines
    • 32 cores, 250GB RAM, ubuntu
    • 8 cores, 16GB RAM, osx

Beef vs. Mac 'n Cheese vs. Pandas

How pe-q-ular...

Questions

  • Is this a fair comparison?
    • bcolz splits each column into chunks that fit in cache
    • kdb writes a directory of columns per value in the partition column
  • kdb is using symbols instead of strings
    • requires an index column for partitions
      • can take a long time to sort
    • ​strings are not very efficient

How does the blaze version work?

bcolz +
pandas +
multiprocessing

bcolz

  • Column store
    • directory per column
  • Column chunked to fit in cache
  • numexpr in certain places
    • reductions
    • arithmetic
  • transparent reading from disk

pandas

  • fast, in-memory analytics

Multiprocessing

  • compute each chunk in separate process

Storage

Compute

Parallelization

pray to the demo gods

graphlab integration

Thanks!

PyData Dallas 2015

By Phillip Cloud

PyData Dallas 2015

Blazing through data land mines with Python

  • 1,422