Ibis

How to get rid of stringly typed analytics

Important legal information

The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time.

Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws.  Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.

Copyright © 2018 TWO SIGMA INVESTMENTS, LP. All rights reserved

ABOUT ME

NOT A FAN oF STRINGLY TYPED CODE

EXPLAIN PLAN
  • What's an ibis?
  • Why ibis?
  • Demo
  • Design
  • Execution
  • Questions

UPSHOT

  • SQL + Python for analytics
  • String programming

What's an ibis?

An ibis

Thanks @tswast!

Another ibis

ibis

  • Python library for analytic SQL
  • Inspired by dplyr
  • "Pandonic"?
  • "Ibisish"?
  • "Birdlike"?

Example

In [32]: import ibis

In [33]: ibis.options.interactive = True

In [34]: con = ibis.sqlite.connect('crunchbase.db')

In [35]: con.list_tables()
Out[35]: ['acquisitions', 'companies', 'investments', 'rounds']

In [36]: t = con.table('companies')

In [37]: t.name.nunique()
Out[37]: 51063

In [38]: t.region.nunique()
Out[38]: 1034

In [39]: (t.groupby('region')
    ...:   .aggregate(topk=t.funding_rounds.count())
    ...:   .sort_by(ibis.desc('topk'))
    ...:   .head())
    ...:
Out[39]:
          region  topk
0    SF Bay Area  7165
1           None  5071
2  New York City  2724
3         Boston  1906
4         London  1679

Why IBIS?

SQl & Productivity

SELECT *,
       acq_ipos / num_investments::float AS acq_rate
FROM (
  SELECT
    CASE
      WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
      ELSE i.investor_name
    END AS "Investor name",
    COUNT(DISTINCT c.permalink) AS num_investments,
    COUNT(
      DISTINCT
        CASE
          WHEN c.status IN ('ipo', 'acquired') THEN
            c.permalink
          ELSE NULL
        END
    ) AS acq_ipos
  FROM crunchbase_companies
  LEFT JOIN crunchbase_investments
    ON c.permalink = i.company_permalink
  GROUP BY 1
  ORDER BY 2 DESC
) a
SELECT *,
       acq_ipos / num_investments::float AS acq_rate
FROM (
  SELECT
    CASE
      WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
      ELSE i.investor_name
    END AS "Investor name",
    COUNT(DISTINCT c.permalink) AS num_investments,
    COUNT(
      DISTINCT
        CASE
          WHEN c.status IN ('ipo', 'acquired') THEN
            c.permalink
          ELSE NULL
        END
    ) AS acq_ipos
  FROM crunchbase_companies AS c
  LEFT JOIN crunchbase_investments AS i
    ON c.permalink = i.company_permalink
  GROUP BY 1
  ORDER BY 2 DESC
) a

Ask me if you want access to the data for this query

SQl & Productivity

Lines of SQL Written

Productivity

WHY SQL?

WHY SQL?

SQL

PYTHON

PYTHON

  • Good
    • Easy to learn/use
    • Rich ecosystem
  • Not so good
    • Analytics limited in scale
    • SQL interaction

MAKE SQL PRODUCTIVE FROM PYTHON

"I knoW, I'll just generate STRINGS"

"STRINGLY" Typed

"""
SELECT *,
       acq_ipos/num_investments::FLOAT AS acquitision_rate
FROM (
  SELECT CASE WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
              ELSE i.investor_name END "Investor name",
          {agg}(DISTINCT c.permalink) AS number_of_investments,
          COUNT(DISTINCT
          {case_with_conditions}
          AS {my_name}
  FROM {}
  LEFT JOIN {}
  ON c.permalink = i.company_permalink
  GROUP BY 1
  ORDER BY 2 DESC
) a
""".format(
    generate_left_part_of_join(use_templating='obviously'),
    generate_right_join_v2(sometimes_do_not_use_templating='what else would I do'),
    agg=next(agg_generator()),
    case_with_conditions='CASE {} END'.format(
        'WHEN {} THEN {}'.format(k, v) for k, v in conditions
    ),
    my_name='acqusitions or whatever'
)

"I knoW, I'll just generate STRINGS"

"I knoW, I'll just generate STRINGS write CoNCISE PYTHON code that type checks and eventually generates strings"

recap

  • Python
  • Fail fastlyish
  • Convenience
  • Performance

DEMO time

Potential drawbacks

  • Performance
  • Abstraction
  • Features

DESIGN

DESIGN PRinCIPLES

  • Separate API from operation
  • Check types ASAP
t.string_col + 1.0

API

Impl

class Expr:
    def __init__(self, impl):
        self.impl = impl

    def op(self):
        return self.impl


class StringValue(Expr):
    def concat(self, other):
        return ops.Concat(
            self, other
        ).to_expr()
class Node:
    def to_expr(self):
        output_type = self.output_type()
        return output_type(self)


class Concat(ops.Node):
    left = Arg(rules.string)
    right = Arg(rules.string)
    
    def output_type(self):
        return rules.shape_like(
            self.args,
            dtype=dt.string
        )
  • Exprs are the API
  • Based on element type, shape
  • Used for type checking
  • Implementation details
  • Know how to make exprs
  • Used for compilation

DESIGN - SEPARATION

DESIGN - operations

  • Nodes (Ops) are implementation
  • Holds arguments
  • Determines the Expr type
class Repeat(ValueOp):
    arg = Arg(rlz.string)
    times = Arg(rlz.integer)

    def output_type(self):
        return rlz.shape_like(
            self.args, dt.string
        )
class StringJoin(ValueOp):
    sep = Arg(rlz.string)
    arg = Arg(
        rlz.list_of(
            rlz.string, min_length=1
        )
    )

    def output_type(self):
        return rlz.shape_like(
            tuple(self.flat_args()),
            dt.string
        )

Execution phases

  • Expression graph construction
  • Compilation
  • Execution

EXPReSSION GRAPH

t.bigint_col.bitwise_and(where=(t.bigint_col == 10) | (t.bigint_col == 40))

COmpilation

t.bigint_col.bitwise_and(where=(t.bigint_col == 10) | (t.bigint_col == 40))
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR t.bigint_col = 40)
BITWISE_AND({}.bigint_col) WHERE ({})
BITWISE_AND({}) WHERE ({})
BITWISE_AND(t.bigint_col) WHERE ({})
BITWISE_AND(t.bigint_col) WHERE ({} OR {})
BITWISE_AND(t.bigint_col) WHERE ({} = {} OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = {} OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR {})
BITWISE_AND(t.bigint_col) WHERE (t.bigint_col = 10 OR t.bigint_col = {})

Execution

  • Give string to driver
  • Convert results into pandas things

lots to do

  • Inline SQL: expr.sql("SELECT * FROM t")
  • Returning something other than DataFrames
  • Usability
  • TPC-xBB
  • Roadmap for 2022 is in the works!

QUESTIONS?

Made with Slides.com