Ibis
How to get rid of stringly typed analytics
Important legal information
The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time.
Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Copyright © 2018 TWO SIGMA INVESTMENTS, LP. All rights reserved
ABOUT ME
NOT A FAN oF STRINGLY TYPED CODE
EXPLAIN PLAN
- What's an ibis?
- Why ibis?
- Demo
- Design
- Execution
- Questions
UPSHOT
- SQL + Python for analytics
String programming
What's an ibis?
An ibis
Thanks @tswast!
Another ibis
ibis
- Python library for analytic SQL
- Inspired by dplyr
- "Pandonic"?
- "Ibisish"?
- "Birdlike"?
Example
In [32]: import ibis
In [33]: ibis.options.interactive = True
In [34]: con = ibis.sqlite.connect('crunchbase.db')
In [35]: con.list_tables()
Out[35]: ['acquisitions', 'companies', 'investments', 'rounds']
In [36]: t = con.table('companies')
In [37]: t.name.nunique()
Out[37]: 51063
In [38]: t.region.nunique()
Out[38]: 1034
In [39]: (t.groupby('region')
...: .aggregate(topk=t.funding_rounds.count())
...: .sort_by(ibis.desc('topk'))
...: .head())
...:
Out[39]:
region topk
0 SF Bay Area 7165
1 None 5071
2 New York City 2724
3 Boston 1906
4 London 1679
Why IBIS?
SQl & Productivity
SELECT *,
acq_ipos / num_investments::float AS acq_rate
FROM (
SELECT
CASE
WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
ELSE i.investor_name
END AS "Investor name",
COUNT(DISTINCT c.permalink) AS num_investments,
COUNT(
DISTINCT
CASE
WHEN c.status IN ('ipo', 'acquired') THEN
c.permalink
ELSE NULL
END
) AS acq_ipos
FROM crunchbase_companies
LEFT JOIN crunchbase_investments
ON c.permalink = i.company_permalink
GROUP BY 1
ORDER BY 2 DESC
) a
SELECT *,
acq_ipos / num_investments::float AS acq_rate
FROM (
SELECT
CASE
WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
ELSE i.investor_name
END AS "Investor name",
COUNT(DISTINCT c.permalink) AS num_investments,
COUNT(
DISTINCT
CASE
WHEN c.status IN ('ipo', 'acquired') THEN
c.permalink
ELSE NULL
END
) AS acq_ipos
FROM crunchbase_companies AS c
LEFT JOIN crunchbase_investments AS i
ON c.permalink = i.company_permalink
GROUP BY 1
ORDER BY 2 DESC
) a
Ask me if you want access to the data for this query
SQl & Productivity
Lines of SQL Written
Productivity
WHY SQL?
WHY SQL?
SQL
PYTHON
PYTHON
-
Good
- Easy to learn/use
- Rich ecosystem
-
Not so good
- Analytics limited in scale
- SQL interaction
MAKE SQL PRODUCTIVE FROM PYTHON
"I knoW, I'll just generate STRINGS"
"STRINGLY" Typed
"""
SELECT *,
acq_ipos/num_investments::FLOAT AS acquitision_rate
FROM (
SELECT CASE WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
ELSE i.investor_name END "Investor name",
{agg}(DISTINCT c.permalink) AS number_of_investments,
COUNT(DISTINCT
{case_with_conditions}
AS {my_name}
FROM {}
LEFT JOIN {}
ON c.permalink = i.company_permalink
GROUP BY 1
ORDER BY 2 DESC
) a
""".format(
generate_left_part_of_join(use_templating='obviously'),
generate_right_join_v2(sometimes_do_not_use_templating='what else would I do'),
agg=next(agg_generator()),
case_with_conditions='CASE {} END'.format(
'WHEN {} THEN {}'.format(k, v) for k, v in conditions
),
my_name='acqusitions or whatever'
)
"I knoW, I'll just generate STRINGS"
"I knoW, I'll just generate STRINGS write CoNCISE PYTHON code that type checks and eventually generates strings"
recap
- Python
- Fail fastlyish
- Convenience
- Performance
DEMO time
Potential drawbacks
- Performance
- Abstraction
- Features
DESIGN
DESIGN PRinCIPLES
- Separate API from operation
- Check types ASAP
t.string_col + 1.0
API
Impl
class Expr:
def __init__(self, impl):
self.impl = impl
def op(self):
return self.impl
class StringValue(Expr):
def concat(self, other):
return ops.Concat(
self, other
).to_expr()
class Node:
def to_expr(self):
output_type = self.output_type()
return output_type(self)
class Concat(ops.Node):
left = Arg(rules.string)
right = Arg(rules.string)
def output_type(self):
return rules.shape_like(
self.args,
dtype=dt.string
)
- Exprs are the API
- Based on element type, shape
- Used for type checking
- Implementation details
- Know how to make exprs
- Used for compilation
DESIGN - SEPARATION
DESIGN - operations
- Nodes (Ops) are implementation
- Holds arguments
- Determines the Expr type
class Repeat(ValueOp):
arg = Arg(rlz.string)
times = Arg(rlz.integer)
def output_type(self):
return rlz.shape_like(
self.args, dt.string
)
class StringJoin(ValueOp):
sep = Arg(rlz.string)
arg = Arg(
rlz.list_of(
rlz.string, min_length=1
)
)
def output_type(self):
return rlz.shape_like(
tuple(self.flat_args()),
dt.string
)
Execution phases
- Expression graph construction
- Compilation
- Execution
EXPReSSION GRAPH
t.bigint_col.bitwise_and(where=(t.bigint_col == 10) | (t.bigint_col == 40))
COmpilation
t.bigint_col.bitwise_and(where=(t.bigint_col == 10) | (t.bigint_col == 40))
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR t.bigint_col = 40)
BITWISE_AND({}.bigint_col) WHERE ({})
BITWISE_AND({}) WHERE ({})
BITWISE_AND(t.bigint_col) WHERE ({})
BITWISE_AND(t.bigint_col) WHERE ({} OR {})
BITWISE_AND(t.bigint_col) WHERE ({} = {} OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = {} OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR t.bigint_col = {})
Execution
- Give string to driver
- Convert results into pandas things
lots to do
-
Inline SQL: expr.sql("SELECT * FROM t")
- Returning something other than DataFrames
- Usability
- TPC-xBB
- Roadmap for 2022 is in the works!
QUESTIONS?
Ibis
By Phillip Cloud
Ibis
Ibis is a Python library for making it easy to work with different flavors of SQL engines, with a focus on big data systems such as Apache Impala, Google BigQuery and others.
- 1,406