Blaze (not) indepth
numpy + pandas for datasets larger than memory
Blaze in Pictures
IDEAL: pandaslike workflow, BUT not in pandas
 Spark
 Impala
 Other potentiallylargedata systems ...
Pandas
select avg(account) from df group by name
df.groupby('name').account.mean()
SQL
SPARK
Different systems express things differently
df.groupBy(df.name).agg({'mean': 'amount'})
LET blaze drive it for you
To the notebook!
into the rabbit hole!
Expressions
Ask questions about your data
Expressions
Blaze expressions describe our data. They consist of symbols and operations on those symbols
>>> from blaze import symbol
>>> t = symbol('t', '1000000 * {name: string, amount: float64}')
datashape = shape + type info
symbol name
"t is a one million row table with a string column called 'name' and a float64 column called 'amount'"
MOar ExpressioNS!
>>> by(t.name, avg=t.amount.mean(), sum=t.amount.sum())
Splitapplycombine
Join
>>> join(s, t, on_left='name', on_right='alias')
Many more...
 Arithmetic (can use numba here)
 Reductions (nunique, count, etc.)
 We take requests!
Data
Bits and Bytes
resources

resource
 Regex dispatcher
 resource('*.csv') > CSV object
 resource('postgresql://...') > sqlalchemy table
 resource('foo.bcolz') > ctable or carray
 Lets you get to your data quickly
we also need to be able to go between different systems
Migrations with odo
 go from a thing of type B > thing of type A
 e.g., numpy array to sqlalchemy table
 get the least cost conversion path from B > A
 uses networkx
 alleviates us having to write every single conversion from A <> B
 Difficult to test all conversions
numpy arrays
Dataframes
Generic iterators
Performance?
Yes, please
Currently, we can express simple parallelism by chunking
how does this work?
Chunking
SUPPOSE WE HAVE A LARGE ARRAY OF INTEGERS
x = np.array([5, 3, 1, ... <one trillion numbers>, ... 12, 5, 10])
A trillion numbers
How do we compute the sum?
x.sum()
Define the problem in Blaze
>>> from blaze import symbol
>>> x = symbol('x', '1000000000 * int')
>>> x.sum()
sum by chunking
size = 1000000
chunk = x[size * i:size * (i + 1)]
aggregate[i] = chunk.sum()
aggregate.sum()
>>> from blaze.expr.split import split
>>> split(x, x.sum())
((chunk, sum(chunk)),
(aggregate, sum(aggregate)))
Sum of aggregated results
Sum of each chunk
Count by chunking
size = 1000000
chunk = x[size * i:size * (i + 1)]
aggregate[i] = chunk.count()
aggregate.sum()
>>> from blaze.expr.split import split
>>> split(x, x.count())
((chunk, count(chunk)),
(aggregate, sum(aggregate)))
Sum of aggregated results
Count of each chunk
mean by chunking
size = 1000000
chunk = x[size * i:size * (i + 1)]
aggregate.total[i] = chunk.sum()
aggregate.n[i] = chunk.count()
aggregate.total.sum() / aggregate.n.sum()
>>> from blaze.expr.split import split
>>> split(x, x.mean())
((chunk, summary(count=count(chunk), total=sum(chunk))),
(aggregate, sum(aggregate.total)) / sum(aggregate.count))
Sum the total and count then divide
Sum and count of each chunk
number of occurrences by chunking
size = 1000000
chunk = x[size * i:size * (i + 1)]
by(chunk, freq=chunk.count())
by(aggregate, freq=aggregate.freq.sum())
>>> from blaze.expr.split import split
>>> split(x, by(x, freq=x.count())
((chunk, by(chunk, freq=count(chunk))),
(aggregate, by(aggregate.chunk, freq=sum(aggregate.freq))))
Splitapplycombine on concatenation of results
Splitapplycombine on each chunk
ndimensional reductions
>>> points = symbol('points', '10000 * 10000 * 10000 * {x: int, y: int}')
>>> expr = (points.x + points.y).var(axis=0)
>>> split(points, expr, chunk=chunk)
((chunk,
summary(n = count( chunk.x + chunk.y ),
x = sum( chunk.x + chunk.y ),
x2 = sum((chunk.x + chunk.y) ** 2))),
(aggregate,
(sum(aggregate.x2) / (sum(aggregate.n)))
 ((sum(aggregate.x) / (sum(aggregate.n))) ** 2)))
Variance of x + y
Chunk: a cube of a billion elements
Data: a 10000 by 10000 by 10000 array of (x,y) coordinates
>>> chunk = symbol('chunk', '1000 * 1000 * 1000 * {x: int, y: int}')
This works on many things people want to with pandas
...except sort and joins
nyc taxi dataset notebook
Thanks!
 ContinuumIO: http://continuum.io
 Blaze team: http://blaze.pydata.org
 Alex Rubinsteyn, hammerlab and Mount Sinai for hosting
Questions?
Interpreter structure
compute core:

before execution
 optimize
 expression optimizations
 pre_compute
 beginning of the pipeline
 optimize
 execution
 compute_down
 operate on the whole expression
 pre_compute
 something has changed type
 compute_up
 Individual node in the expression
 compute_down

After execution
 post_compute
 most of the time this doesn't do anything
 SQL backend is notable
 post_compute
compute core:
 pre_compute all leaves
 optimize
 compute_down if the implementation exists
 bottom up traversal of the expression tree until we change data types significantly or we've reached the root node
 optimize and pre_compute
 go to 3
 post_compute
 manipulate the data before execution
pre_compute :: Expr, Data > Data
 manipulate the expression before execution
optimize :: Expr, Data > Expr
 do something with the entire expression before calling compute_up
compute_down :: Expr, Data > Data
 compute a single node in our expression tree
compute_up :: Expr, Data > Data
 do something after we've traversed the tree
post_compute :: Expr, Data > Data
 run the interpreter
compute :: Expr, Data > Data
compute core:
SOME NICE DOCS
 how compute works: http://blaze.pydata.org/docs/dev/exprcomputedev.html
 pipeline: http://blaze.pydata.org/docs/dev/computation.html
pytables backend Example
>>> @dispatch(Selection, tb.Table)
... def compute_up(expr, data):
... s = eval_str(expr.predicate) # Produce string like 'amount < 0'
... return data.read_where(s) # Use PyTables read_where method
>>> @dispatch(Head, tb.Table)
... def compute_up(expr, data):
... return data[:expr.n] # PyTables supports standard indexing
>>> @dispatch(Field, tb.Table)
... def compute_up(expr, data):
... return data.col(expr._name) # Use the PyTables .col method
Blaze indepth
By Phillip Cloud
Blaze indepth
 1,692