How to get rid of stringly typed analytics
The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time.
Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Copyright © 2018 TWO SIGMA INVESTMENTS, LP. All rights reserved
EXPLAIN PLAN
Thanks @tswast!
In [32]: import ibis
In [33]: ibis.options.interactive = True
In [34]: con = ibis.sqlite.connect('crunchbase.db')
In [35]: con.list_tables()
Out[35]: ['acquisitions', 'companies', 'investments', 'rounds']
In [36]: t = con.table('companies')
In [37]: t.name.nunique()
Out[37]: 51063
In [38]: t.region.nunique()
Out[38]: 1034
In [39]: (t.groupby('region')
...: .aggregate(topk=t.funding_rounds.count())
...: .sort_by(ibis.desc('topk'))
...: .head())
...:
Out[39]:
region topk
0 SF Bay Area 7165
1 None 5071
2 New York City 2724
3 Boston 1906
4 London 1679
SELECT *,
acq_ipos / num_investments::float AS acq_rate
FROM (
SELECT
CASE
WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
ELSE i.investor_name
END AS "Investor name",
COUNT(DISTINCT c.permalink) AS num_investments,
COUNT(
DISTINCT
CASE
WHEN c.status IN ('ipo', 'acquired') THEN
c.permalink
ELSE NULL
END
) AS acq_ipos
FROM crunchbase_companies
LEFT JOIN crunchbase_investments
ON c.permalink = i.company_permalink
GROUP BY 1
ORDER BY 2 DESC
) a
SELECT *,
acq_ipos / num_investments::float AS acq_rate
FROM (
SELECT
CASE
WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
ELSE i.investor_name
END AS "Investor name",
COUNT(DISTINCT c.permalink) AS num_investments,
COUNT(
DISTINCT
CASE
WHEN c.status IN ('ipo', 'acquired') THEN
c.permalink
ELSE NULL
END
) AS acq_ipos
FROM crunchbase_companies AS c
LEFT JOIN crunchbase_investments AS i
ON c.permalink = i.company_permalink
GROUP BY 1
ORDER BY 2 DESC
) a
Ask me if you want access to the data for this query
Lines of SQL Written
Productivity
SQL
"""
SELECT *,
acq_ipos/num_investments::FLOAT AS acquitision_rate
FROM (
SELECT CASE WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
ELSE i.investor_name END "Investor name",
{agg}(DISTINCT c.permalink) AS number_of_investments,
COUNT(DISTINCT
{case_with_conditions}
AS {my_name}
FROM {}
LEFT JOIN {}
ON c.permalink = i.company_permalink
GROUP BY 1
ORDER BY 2 DESC
) a
""".format(
generate_left_part_of_join(use_templating='obviously'),
generate_right_join_v2(sometimes_do_not_use_templating='what else would I do'),
agg=next(agg_generator()),
case_with_conditions='CASE {} END'.format(
'WHEN {} THEN {}'.format(k, v) for k, v in conditions
),
my_name='acqusitions or whatever'
)
t.string_col + 1.0
class Expr:
def __init__(self, impl):
self.impl = impl
def op(self):
return self.impl
class StringValue(Expr):
def concat(self, other):
return ops.Concat(
self, other
).to_expr()
class Node:
def to_expr(self):
output_type = self.output_type()
return output_type(self)
class Concat(ops.Node):
left = Arg(rules.string)
right = Arg(rules.string)
def output_type(self):
return rules.shape_like(
self.args,
dtype=dt.string
)
class Repeat(ValueOp):
arg = Arg(rlz.string)
times = Arg(rlz.integer)
def output_type(self):
return rlz.shape_like(
self.args, dt.string
)
class StringJoin(ValueOp):
sep = Arg(rlz.string)
arg = Arg(
rlz.list_of(
rlz.string, min_length=1
)
)
def output_type(self):
return rlz.shape_like(
tuple(self.flat_args()),
dt.string
)
t.bigint_col.bitwise_and(where=(t.bigint_col == 10) | (t.bigint_col == 40))
t.bigint_col.bitwise_and(where=(t.bigint_col == 10) | (t.bigint_col == 40))
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR t.bigint_col = 40)
BITWISE_AND({}.bigint_col) WHERE ({})
BITWISE_AND({}) WHERE ({})
BITWISE_AND(t.bigint_col) WHERE ({})
BITWISE_AND(t.bigint_col) WHERE ({} OR {})
BITWISE_AND(t.bigint_col) WHERE ({} = {} OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = {} OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR t.bigint_col = {})
Inline SQL: expr.sql("SELECT * FROM t")