Jumping over data land mines with blaze

Recap of @mrocklin's talk

Blaze is good at:

Separating computation from data
Migration between different data sources
Providing an well-known API for computing over different data sources (think pandas)

Separating computation from data

>>> from blaze import Symbol, discover, compute
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['Alice', 'Bob', 'Forrest', 'Bubba'],
...                    'amount': [10, 20, 30, 40]})
...
>>> t = Symbol('t', discover(df))
>>> t.amount.sum()
sum(t.amount)
>>> compute(t.amount.sum(), df)
100
>>> compute(t.amount.sum(), df.values.tolist())
100
>>> compute(t.amount.sum(), df.to_records(index=False))
100

Migrating between different data sources

>>> from blaze import into, drop
>>> import numpy as np
>>> into(list, df)
[(10, 'Alice'), (20, 'Bob'), (30, 'Forrest'), (40, 'Bubba')]

>>> into(np.ndarray, df)
rec.array([(10, 'Alice'), (20, 'Bob'), (30, 'Forrest'), (40, 'Bubba')],
      dtype=[('amount', '<i8'), ('name', 'O')])

>>> into('sqlite:///db.db::t', df)
<blaze.data.sql.SQL at 0x108278fd0>

>>> drop(_)  # remove the database for the next example
>>> result = into(pd.DataFrame, into('sqlite:///db.db::t', df))
>>> result
   amount     name
0      10    Alice
1      20      Bob
2      30  Forrest
3      40    Bubba
>>> type(result)
pandas.core.frame.DataFrame

A comfortable API

>>> from blaze import Data
>>> d = Data(df)
>>> d.amount.dshape
dshape("4 * int64")
>>> d.amount.
d.amount.count             d.amount.max               d.amount.shape
d.amount.count_values      d.amount.mean              d.amount.sort
d.amount.distinct          d.amount.min               d.amount.std
d.amount.dshape            d.amount.ndim              d.amount.sum
d.amount.fields            d.amount.nelements         d.amount.truncate
d.amount.head              d.amount.nrows             d.amount.utcfromtimestamp
d.amount.isidentical       d.amount.nunique           d.amount.var
d.amount.label             d.amount.relabel
d.amount.map               d.amount.schema
>>> d.name.dshape
dshape("4 * string")
>>> d.name.
d.name.count         d.name.head          d.name.max           d.name.nunique
d.name.count_values  d.name.isidentical   d.name.min           d.name.relabel
d.name.distinct      d.name.label         d.name.ndim          d.name.schema
d.name.dshape        d.name.like          d.name.nelements     d.name.shape
d.name.fields        d.name.map           d.name.nrows         d.name.sort

Blaze also lets you DIY

Who's heard of the q language?

q)x:"racecar"
q)n:count x
q)all{[x;n;i]x[i]=x[n-i+1]}[x;n]each til _:[n%2]+1
1b

Check if a string is a palindrome

q)-1 x
racecar
-1
q)1 x
racecar1

Print to stdout, with and without a newline

Um, integers are callable?

How about:

1 divided by cat

q)1 % "cat"
0.01010101 0.01030928 0.00862069

However, KDB is fast

so....

Ditch Q,
Keep KDB+

kdbpy: Q without the WAT, via blaze

KDB+ is a database sold by Kx Systems.
- Free 32-bit version available for download on their website.
Column store*.
Makes big things feel small and huge things feel doable.
Heavily used in the financial world.

Why KDB+/Q?

*It's a little more nuanced than that

It's a backend for blaze
It generates q code from python code
That code is run by a q interpreter

What is kdbpy?

To the notebook!

It's a WIP

Only partial support for KDB's partitioned and splayed tables
- Q code is different for different kinds of tables
  - We really want a single API to drive the different kinds
No profiling on really huge things yet
Code generation is overly redundant in certain cases

How does Q compare to other blaze backends?

NYC Taxi Trip Data

≈16 GB (trip dataset only)

partitioned in KDB+ on date (year.month.day)

blaze

The computation

group by on
- passenger count
- medallion
- hack license
sum on
- trip time
- trip distance
cartesian product of the above

The hardware

two machines
- 32 cores, 250GB RAM, ubuntu
- 8 cores, 16GB RAM, osx

Beef vs. Mac 'n Cheese vs. Pandas

How pe-q-ular...

Questions

Did I partition the dataset optimally in KDB?
Are my queries optimal?
Is this a fair comparison?
- bcolz splits each column into chunks that fit in cache
- kdb writes a directory of columns per value in the partition column
kdb is using symbols instead of strings
- symbols are much faster in kdb
- requires an index column for partitions
  - can take a long time to sort
- strings are not very efficient
- symbols are really a categorical type

How does the blaze version work?

bcolz +
pandas +
multiprocessing

bcolz

Column store
- directory per column
Column chunked to fit in cache
numexpr in certain places
- sum
- selection
- arithmetic
transparent reading from disk

pandas

fast, in-memory analytics

Multiprocessing

compute each chunk in separate process

Storage

Compute

Parallelization

Usage Example

>>> from blaze import by, compute, Data
>>> from multiprocessing import Pool
>>> data = Data('trip.bcolz')
>>> pool = Pool()  # default to the number of logical cores
>>> expr = by(data.passenger_count, avg=data.trip_time_in_secs.mean())
>>> result = compute(expr, map=pool.map)  # chunksize defaults to 2 ** 20
>>> result
    passenger_count         avg
0                 0  122.071500
1                 1  806.607092
2                 2  852.223353
3                 3  850.614843
4                 4  885.621065
5                 5  763.933618
6                 6  760.465655
7                 7  428.485714
8                 8  527.920000
9                 9  506.230769
10              129  240.000000
11              208   55.384615
12              255  990.000000
>>> timeit compute(expr, map=pool.map)
1 loops, best of 3: 5.67 s per loop

The guts

chunk = symbol('chunk', chunksize * leaf.schema)
(chunk, chunk_expr), (agg, agg_expr) = split(leaf, expr, chunk=chunk)

data_parts = partitions(data, chunksize=(chunksize,))

parts = list(map(curry(compute_chunk, data, chunk, chunk_expr),
                       data_parts))

# concatenate parts into intermediate

result = compute(agg_expr, {agg: intermediate})

Split an expression into chunks
Get the chunked expression and the agg
run Pool.map across each chunk
compute the aggregate function over the intermediate

THANKS!

Matt Rocklin
Hugo Shi
Jeff Reback
Everyone @ ContinuumIO

Jumping over data land mines with blaze

Recap of @mrocklin's talk

Blaze is good at:

Separating computation from data

Migration between different data sources

Providing an well-known API for computing over different data sources (think pandas)

Separating computation from data

Migrating between different data sources

A comfortable API

Blaze also lets you DIY

Who's heard of the q language?

How about:

However, KDB is fast

Ditch Q, Keep KDB+

kdbpy: Q without the WAT, via blaze

It's a backend for blaze

It generates q code from python code

That code is run by a q interpreter

To the notebook!

It's a WIP

How does Q compare to other blaze backends?

NYC Taxi Trip Data

The computation

The hardware

Beef vs. Mac 'n Cheese vs. Pandas

How pe-q-ular...

Questions

How does the blaze version work?

bcolz + pandas + multiprocessing

bcolz

pandas

Multiprocessing

Usage Example

The guts

THANKS!

Ditch Q,
Keep KDB+

bcolz +
pandas +
multiprocessing