select(df, a, b) %>% group_by(b) %>% summarize(u=mean(a))
"What do I want to compute?"
Blaze expressions describe data.
They consist of symbols and operations on those symbols.
>>> from blaze import symbol
>>> t = symbol('t', '1000000 * {name: string, amount: float64}')

name + shape + type information
Group By
>>> by(t.name, avg=t.amount.mean(), sum=t.amount.sum())

Join
>>> join(s, t, on_left='name', on_right='alias')

Many more...
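The core idea is that an expression is just a tree of symbols and operations, built up lazily. A minimal sketch of such a tree in plain Python (illustrative only — these are hypothetical classes, not Blaze's actual `Expr` hierarchy):

```python
# Hypothetical mini expression tree -- a sketch, not Blaze's real API.
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __repr__(self):
        return f"{self.op}({', '.join(map(repr, self.args))})"

class Symbol(Expr):
    def __init__(self, name, dshape):
        super().__init__('symbol', name, dshape)

# Roughly: t = symbol('t', ...); by(t.name, avg=t.amount.mean())
t = Symbol('t', '1000000 * {name: string, amount: float64}')
expr = Expr('by', Expr('field', t, 'name'),
            Expr('mean', Expr('field', t, 'amount')))
# The tree *describes* the computation; no data has been touched yet.
```

Because the tree carries no data, the same expression can later be handed to any backend that knows how to interpret each node.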
"How do I compute expression X on backend Y?"
import pandas as pd
from multipledispatch import dispatch

@dispatch(Join, pd.DataFrame, pd.DataFrame)
def compute_up(expr, lhs, rhs):
    # call the pandas join implementation
    return pd.merge(lhs, rhs, left_on=expr.on_left, right_on=expr.on_right)
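The right `compute_up` is chosen from the runtime types of the expression node and the data. A self-contained sketch of that type-based dispatch using a plain dict (an illustration of the idea, not the `multipledispatch` library Blaze actually uses):

```python
# Minimal type-keyed dispatch table -- a sketch of what @dispatch does.
_registry = {}

def dispatch(*types):
    def register(func):
        _registry[types] = func
        return func
    return register

def compute_up(expr, *data):
    key = (type(expr),) + tuple(type(d) for d in data)
    return _registry[key](expr, *data)

class Join:
    pass

@dispatch(Join, list, list)
def _(expr, lhs, rhs):
    # toy "join": pair up rows positionally
    return list(zip(lhs, rhs))

print(compute_up(Join(), [1, 2], ['a', 'b']))  # [(1, 'a'), (2, 'b')]
```

Registering a new `(expression, backend)` pair is all it takes to teach the system a new backend; nothing about the expression side changes.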
@dispatch(Join, pyspark.sql.DataFrame, pyspark.sql.DataFrame)
def compute_up(expr, lhs, rhs):
    # call the Spark SQL join implementation
    return lhs.join(rhs, lhs[expr.on_left] == rhs[expr.on_right])

list » tuple
>>> odo([1, 2, 3], tuple)
(1, 2, 3)
>>> odo('hive://hostname/default::users_csv',
... 'hive://hostname/default::users_parquet',
... stored_as='PARQUET', external=False)
<an eternity later ... sqlalchemy.Table repr>

df.to_csv('/path/to/file.csv')

load data
local infile '/path/to/file.csv'
into table mytable;

>>> odo(df,
...     'hive://hostname/default::tablename')

boto.get_bucket().get_contents_to_filename()
pandas.read_json()
DataFrame.to_csv()
copy t from '/path/to/file.csv'
with
    delimiter ','
    header TRUE

>>> odo('s3://mybucket/path/to/data.json',
...     'postgresql://user:passwd@localhost:port/db::data')

Each node is a type (DataFrame, list, sqlalchemy.Table, etc...)
Each edge is a conversion function
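That node/edge idea can be sketched with a toy graph: nodes are types, edges are converter functions, and a conversion is a path through the graph found by search. This is hypothetical illustration code, not odo's internals (odo additionally weights edges by cost):

```python
from collections import deque

# Toy conversion graph: nodes are types, edges are converter functions.
edges = {
    (list, tuple): tuple,
    (tuple, set): set,
}

def convert_path(value, target):
    # BFS from type(value) to target, then apply converters along the path.
    start = type(value)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node is target:
            for func in path:
                value = func(value)
            return value
        for (src, dst), func in edges.items():
            if src is node and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [func]))
    raise ValueError("no conversion path")

print(convert_path([1, 2, 3], set))  # {1, 2, 3}
```

Adding one edge (one converter) connects a new type to every type already reachable in the graph, which is why registering a single function below is enough.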
import pandas as pd
from odo import convert
from pyspark.sql import DataFrame as SparkDataFrame

@convert.register(pd.DataFrame, SparkDataFrame, cost=1.0)
def frame_to_frame(spark_frame, **kwargs):
    return spark_frame.toPandas()

conda install blaze odo
pip install blaze odo