StupiDB
Software that isn't too smart
How does a DB work?
How does SELECT work?
How do I learn SELECT?
Learning SELECT from (PostgreSQL) source
Knowledge
Decades of CS and domain knowledge baked into systems like PostgreSQL.
- Compilers (algebra)
- Query optimization (join)
- Data structures (B+ Trees)
- Mathematical logic (model)
What's the dumbest DB look like?
What is SELECT, really?
[(f[1](row), ..., f[n](row)) for row in rows]
where
f :: row -> column
SELECT
Yes.
StupiDB uses __iter__ to build relational algebra
Iterators turn out to be a very effective strategy here
Isn't that "just" a generator?
StupiDB
-
Relation = Iterable[Mapping[str, Any]]
- Categories of relations
- Projection
- Selection (row filtering)
- Join
- Group By (sort of)
- APIs for building relations
API
>>> from pprint import pprint
>>> from stupidb.api import aggregate, group_by, mean, table
>>> rows = [
... dict(name="Bob", age=30, timezone="America/New_York"),
... dict(name="Susan", age=20, timezone="America/New_York"),
... dict(name="Joe", age=41, timezone="America/Los_Angeles"),
... dict(name="Alice", age=39, timezone="America/Los_Angeles"),
... ]
>>> average_age_by_timezone = (
... table(rows) >> group_by(tz=lambda r: r.timezone)
... >> aggregate(avg_age=mean(lambda r: r.age))
... )
>>> pprint(list(average_age_by_timezone), width=79)
[Row({'tz': 'America/New_York', 'avg_age': 25.0}),
Row({'tz': 'America/Los_Angeles', 'avg_age': 40.0})]
Window Function API
>>> from stupidb.api import Window, over, mean, select, table
>>> from datetime import date, timedelta
>>> today = date(2019, 2, 9)
>>> days = timedelta(days=1)
>>> rows = [
... {"name": "Alice", "balance": 400, "date": today},
... {"name": "Alice", "balance": 300, "date": today + 1 * days},
... {"name": "Alice", "balance": 100, "date": today + 2 * days},
... {"name": "Bob", "balance": -150, "date": today - 4 * days},
... {"name": "Bob", "balance": 200, "date": today - 3 * days},
... ]
>>> t = table(rows)
>>> window = Window.range(
... partition_by=[lambda r: r.name],
... order_by=[lambda r: r.date],
... preceding=lambda r: 2 * days # two days behind + the current row
... )
>>> avg_balance_per_person = table(rows) >> select(
... name=lambda r: r.name,
... avg_balance=mean(lambda r: r.balance) >> over(window),
... balance=lambda r: r.balance,
... date=lambda r: r.date,
... ) >> order_by(lambda r: r.name, lambda r: r.date)
>>> pprint(list(avg_balance_per_person), width=79) # noqa: E501
[Row({'name': 'Alice', 'balance': 400, 'date': datetime.date(2019, 2, 9), 'avg_balance': 400.0}),
Row({'name': 'Alice', 'balance': 300, 'date': datetime.date(2019, 2, 10), 'avg_balance': 350.0}),
Row({'name': 'Alice', 'balance': 100, 'date': datetime.date(2019, 2, 11), 'avg_balance': 266.667}),
Row({'name': 'Bob', 'balance': -150, 'date': datetime.date(2019, 2, 5), 'avg_balance': -150.0}),
Row({'name': 'Bob', 'balance': 200, 'date': datetime.date(2019, 2, 6), 'avg_balance': 25.0})]
Window Functions
- Many aspects of software engineering involved
- API design
- Data structures
- Algorithms
- E.g., range bounds
-
Four algorithms for aggregation
- Segment trees
- Removable cumulative
- "Append-only"
- Naive
O\left(n\log{n}\right)
O\left(n^{2}\right)\text{*}
O\left(n^{2}\right)
O\left(n^{2}\right)
Segment Trees
Characteristics
- O(log N) query
- O(N log N) construction
- O(fanout ** height) space
Conclusions
- Databases are hard
- Feel great about writing things from scratch*
- Build for stupidly bad performance*
StupiDB
By Phillip Cloud
StupiDB
- 1,268