StupiDB

Software that isn't too smart

How does a DB work?

How does SELECT work?

How do I learn SELECT?

Learning SELECT from (PostgreSQL) source

Knowledge

Decades of CS and domain knowledge baked into systems like PostgreSQL.

  • Compilers (algebra)
  • Query optimization (join)
  • Data structures (B+ Trees)
  • Mathematical logic (model)

What's the dumbest DB look like?

What is SELECT, really?

[(f[1](row), ..., f[n](row)) for row in rows]

where

f :: row -> column

SELECT

Yes.

 

StupiDB uses __iter__ to build relational algebra

Iterators turn out to be a very effective strategy here

Isn't that "just" a generator?

StupiDB

  • Relation = Iterable[Mapping[str, Any]]
  • Categories of relations
    • Projection
    • Selection (row filtering)
    • Join
    • Group By (sort of)
  • APIs for building relations

API

>>> from pprint import pprint
>>> from stupidb.api import aggregate, group_by, mean, table
>>> rows = [
...     dict(name="Bob", age=30, timezone="America/New_York"),
...     dict(name="Susan", age=20, timezone="America/New_York"),
...     dict(name="Joe", age=41, timezone="America/Los_Angeles"),
...     dict(name="Alice", age=39, timezone="America/Los_Angeles"),
... ]
>>> average_age_by_timezone = (
...     table(rows) >> group_by(tz=lambda r: r.timezone)
...                 >> aggregate(avg_age=mean(lambda r: r.age))
... )
>>> pprint(list(average_age_by_timezone), width=79)
[Row({'tz': 'America/New_York', 'avg_age': 25.0}),
 Row({'tz': 'America/Los_Angeles', 'avg_age': 40.0})]

Window Function API

>>> from stupidb.api import Window, over, mean, select, table
>>> from datetime import date, timedelta
>>> today = date(2019, 2, 9)
>>> days = timedelta(days=1)
>>> rows = [
...     {"name": "Alice", "balance": 400, "date": today},
...     {"name": "Alice", "balance": 300, "date": today + 1 * days},
...     {"name": "Alice", "balance": 100, "date": today + 2 * days},
...     {"name": "Bob", "balance": -150, "date": today - 4 * days},
...     {"name": "Bob", "balance": 200, "date": today - 3 * days},
... ]
>>> t = table(rows)
>>> window = Window.range(
...     partition_by=[lambda r: r.name],
...     order_by=[lambda r: r.date],
...     preceding=lambda r: 2 * days  # two days behind + the current row
... )
>>> avg_balance_per_person = table(rows) >> select(
...     name=lambda r: r.name,
...     avg_balance=mean(lambda r: r.balance) >> over(window),
...     balance=lambda r: r.balance,
...     date=lambda r: r.date,
... ) >> order_by(lambda r: r.name, lambda r: r.date)
>>> pprint(list(avg_balance_per_person), width=79)  # noqa: E501
[Row({'name': 'Alice', 'balance': 400, 'date': datetime.date(2019, 2, 9), 'avg_balance': 400.0}),
 Row({'name': 'Alice', 'balance': 300, 'date': datetime.date(2019, 2, 10), 'avg_balance': 350.0}),
 Row({'name': 'Alice', 'balance': 100, 'date': datetime.date(2019, 2, 11), 'avg_balance': 266.667}),
 Row({'name': 'Bob', 'balance': -150, 'date': datetime.date(2019, 2, 5), 'avg_balance': -150.0}),
 Row({'name': 'Bob', 'balance': 200, 'date': datetime.date(2019, 2, 6), 'avg_balance': 25.0})]

Window Functions

  • Many aspects of software engineering involved
    • API design
    • Data structures
    • Algorithms
      • E.g., range bounds
  • Four algorithms for aggregation
    • Segment trees
    • Removable cumulative
    • "Append-only"
    • Naive
O\left(n\log{n}\right)
O\left(n^{2}\right)\text{*}
O\left(n^{2}\right)
O\left(n^{2}\right)

Segment Trees

Characteristics

  1. O(log N) query
  2. O(N log N) construction
  3. O(fanout ** height) space

Conclusions

  • Databases are hard
  • Feel great about writing things from scratch*
  • Build for stupidly bad performance*
Made with Slides.com