Blaze: an interface to all the things
about me
- MA Psychology
- Computational Neuroscience
- Pandas core dev
- blaze, odo, et al @ContinuumIO
How do i interact with all the things?
lists of things
In [79]: gb = toolz.groupby('passenger_count',
....: (x for x in dicts if x['passenger_count'] < 6))
In [80]: {k: sum(y['fare_amount'] for y in v) / len(v) for k, v in gb.items()}
Out[80]:
{0: 37.166666666666664,
1: 12.068150330050502,
2: 11.974952289774347,
3: 11.876616991857428,
4: 12.034261638063919,
5: 11.938438450392496}
databases
In [91]: d = sa.Table('nyc',
....: sa.MetaData(bind=sa.create_engine('postgresql://localhost')),
....: autoload=True)
In [92]: sel = sa.select([d.c.passenger_count, sa.func.avg(d.c.fare_amount)])
In [93]: result = sel.where(d.c.passenger_count < 6).group_by(d.c.passenger_count)
In [94]: result.execute().fetchall()
Out[94]:
[(4, 12.8474578145854),
(5, 12.4340166757418),
(1, 12.1857106698121),
(2, 13.0082269224776),
(3, 12.620676158505),
(0, 23.5708381330686)]
kdb
In [4]: from qpython.qconnection import QConnection
In [5]: c = QConnection(port=5000, host='localhost', pandas=True)
In [6]: c.open()
In [7]: r = c.sync('select avg trip_time_in_secs by passenger_count from trip')
In [8]: r.head()
Out[8]:
trip_time_in_secs
passenger_count
0 122.071500
1 806.607092
2 852.223353
3 850.614843
4 885.621065
Yay string programming!
spark
In [21]: sdf[sdf.passenger_count < 6].groupby('passenger_count').agg({'fare_amount': 'avg'}).collect()
Out[21]:
[Row(passenger_count=0, AVG(fare_amount)=50.0),
Row(passenger_count=1, AVG(fare_amount)=12.085692890743172),
Row(passenger_count=2, AVG(fare_amount)=11.904175317185697),
Row(passenger_count=3, AVG(fare_amount)=11.670360110803324),
Row(passenger_count=4, AVG(fare_amount)=12.041448058761805),
Row(passenger_count=5, AVG(fare_amount)=12.00365975807073)]
Dataframes
In [37]: df[df.passenger_count < 6].groupby('passenger_count').fare_amount.mean()
Out[37]:
passenger_count
0 37.166667
1 12.068150
2 11.974952
3 11.876617
4 12.034262
5 11.938438
Name: fare_amount, dtype: float64
let's drive them all!
blaze =
exprs+
all the things
10,000 foot view
expressions
- Separate data from computation
- Write down what you want to compute without worrying about how it will be computed
use case: drive Q from Python
Q overview
Who's hearD of q?
for something other than a binary store.
Who's used q?
the language
- Written by Arthur Whitney
- Derived from and written in K
- K is similar to APL
- Tons of operators
- Parses from right to left
- Unless you're in qsql land
q demo!
Pray to the q gods and the demo gods
The database
- Column store
- Handles null values for all types
- Nice support for slowly changing string values (symbols)
- Excellent time series support
- Support for different storage formats that scale to huge tables
- Matrix math
The GOOD
- Column store
- Extremely fast queries
- Concise querying via qsql
- Great time series support
- Nice foreign key syntax
The BAD
- Hardly anyone knows q
- Operators > Names
- Cryptic error messages
- Large programs are unreadable by anyone except the q gods
- q gods never make errors
- Strange API breaks
- dev to sdev
- Hard to apply traditional database knowledge
THE ugly
q)x:"racecar"
q)n:count x
q)ispal:all{[x;n;i]x[i]=x[n-i+1]}[x;n]each til _:[n%2]+1
q)ispal x
1b
q)1 % "cat"
0.01010101 0.01030928 0.00862069
q)-1["foo"]
foo
-1
q)1["bar"]
bar1
please q gods, give me whitespace
kDBPY
the stack
- A SQLAlchemy dialect
- That can be driven from Blaze
- With odo for moving data around
- Using qpython to talk to a q process
sqlalchemy Dialect
class QDialect(Dialect):
def visit_select(self, select, **kwargs):
# ...
def visit_insert(self, insert, **kwargs):
# ...
# and so on
- Generate (some variant of SQL) using the visitor pattern
sqlalchemy Dialect
>>> import sqlalchemy as sa
>>> engine = sa.create_engine('q://localhost/path/to/db')
>>> meta = sa.MetaData(bind=engine)
>>> t = sa.Table('t', meta, autoload=True)
>>> expr = sa.select([t.c.name, sa.func.avg(t.c.amount)]).group_by(t.c.name)
>>> result = conn.execute().fetchall() # <- a pandas DataFrame
- Build up expressions
demo time!
wish/bug list
- DDL for splayed and partitioned tables not yet ready for primetime
- Foreign key support in blaze expressions
- Very close to merging
- Better time series support all around
- Transparent insert to special table types
- Python 3 support
- Streaming results from q
- Open source it!
thanks!
Blaze: An Interface to All the Thigns
By Phillip Cloud
Blaze: An Interface to All the Thigns
- 1,635