e.g. user ID, document ID
what happened and when?
trails of events
History is lost in destructive updates
Individual events are lost in aggregation
Expensive to query
An embedded library like SQLite, not a server like Postgres or Redis
>>> 1 + 2
>>> a + b
>>> a() + b()
Which one of these is easiest to reason about?
*nix shell is an unbeatably productive development environment
Example: Run a task that accesses all AdRoll deliveries over 30 days on a single d2.8xlarge (tens of billions of events)
Download TrailDBs from S3
Read TrailDBs to memory
Process data with 16 cores
Upload results to S3
TrailDB is designed for the future
A TrailDB is an ordered set of trails.
A TrailDB, one file
A trail is a list of events, ordered by time, identified by a unique key
An event is a set of fields. The first field is always time.
A field has a set of possible values. The first value is always NULL.
The combination of a field and one of its values is called an item.
An item is represented as a 64-bit integer.
So, a TrailDB is a big bunch of integers and some metadata.
A file like this can be encoded and queried very efficiently.
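The item idea above can be sketched in a few lines of Python. This is a toy model only: the bit layout (field index in the low 8 bits, value index above it), the helper names, and the sample lexicons are assumptions made for illustration, not TrailDB's actual encoding.

```python
# Toy model of fields, values and items. The bit layout is an assumption
# for illustration, not TrailDB's actual on-disk encoding.
FIELD_BITS = 8                       # hypothetical: low 8 bits hold the field index
FIELD_MASK = (1 << FIELD_BITS) - 1

def make_item(field, value):
    """Pack a field index and a value index into one 64-bit integer."""
    item = (value << FIELD_BITS) | field
    assert 0 <= field <= FIELD_MASK and 0 <= item < 2**64
    return item

def item_field(item):
    return item & FIELD_MASK

def item_value(item):
    return item >> FIELD_BITS

# The first field is always time. Each other field keeps a list of its
# possible values, and index 0 is reserved for NULL.
fields = ["time", "action", "page"]
lexicons = {
    "action": [None, "page_open", "submit"],
    "page": [None, "signup", "form2"],
}

def encode(field_name, value):
    """Turn a (field, value) pair of strings into an integer item."""
    return make_item(fields.index(field_name), lexicons[field_name].index(value))

def decode(item):
    """Turn an integer item back into a (field, value) pair."""
    name = fields[item_field(item)]
    return name, lexicons[name][item_value(item)]

item = encode("action", "submit")
print(item, decode(item))
```

Comparing or counting such integers avoids touching strings at all, which is one reason working with items directly is cheap.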
Look up an individual trail given its ID
Iterate over all trail IDs
Iterate over events of a trail
Utilities to convert between fields, values, items and strings.
The API actively encourages working with items, which is fast.
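The shape of these operations can be sketched with a small in-memory stand-in. The class ToyTrailDB and its method names are hypothetical and are not the real TrailDB library API; events are modelled as (timestamp, items) tuples where the items are opaque integers.

```python
import bisect

class ToyTrailDB:
    """Hypothetical in-memory stand-in for a TrailDB, not the real library API."""

    def __init__(self, trails):
        # trails: dict mapping a trail key -> list of (timestamp, items) events
        self._keys = sorted(trails)  # trails are kept ordered by key
        self._trails = [
            sorted(trails[key], key=lambda event: event[0])  # events ordered by time
            for key in self._keys
        ]

    def get_trail_id(self, key):
        """Look up the position of a trail given its key (binary search)."""
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            raise KeyError(key)
        return i

    def trail_keys(self):
        """Iterate over all trail keys in order."""
        return iter(self._keys)

    def events(self, trail_id):
        """Iterate over the events of one trail in time order."""
        return iter(self._trails[trail_id])

tdb = ToyTrailDB({
    "user3435": [(1454923800, (257,)), (1454923792, (257, 258))],
    "user243": [(1454923791, (513, 514))],
})

for key in tdb.trail_keys():
    for timestamp, items in tdb.events(tdb.get_trail_id(key)):
        print(key, timestamp, items)
```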
user3435, 1454923792, page_open, signup
user243, 1454923791, submit, form2
user9076, 1454923802, search, landing
Construct a new TrailDB based on an unordered stream of events
Merge two existing TrailDBs into a new TrailDB
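Both operations can be sketched in plain Python, assuming each event arrives as a (key, timestamp, values...) tuple like the sample rows above. The function names construct and merge are hypothetical, the result is an ordinary dict rather than a TrailDB file, and the fourth event in the stream is an invented extra that shows out-of-order input.

```python
from collections import defaultdict

def by_time(event):
    return event[0]

def construct(events):
    """Group an unordered stream of (key, time, *values) events into trails."""
    trails = defaultdict(list)
    for key, timestamp, *values in events:
        trails[key].append((timestamp, tuple(values)))
    # Order trails by key and the events within each trail by time.
    return {key: sorted(trails[key], key=by_time) for key in sorted(trails)}

def merge(a, b):
    """Merge two trail collections into a new one, leaving the inputs untouched."""
    merged = defaultdict(list)
    for db in (a, b):
        for key, trail in db.items():
            merged[key].extend(trail)
    return {key: sorted(merged[key], key=by_time) for key in sorted(merged)}

stream = [
    ("user3435", 1454923792, "page_open", "signup"),
    ("user243", 1454923791, "submit", "form2"),
    ("user9076", 1454923802, "search", "landing"),
    ("user3435", 1454923790, "search", "landing"),  # hypothetical out-of-order event
]

db = construct(stream)
for key, trail in db.items():
    print(key, trail)
```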
Internally, TrailDB combines several compression techniques to store the data in as little space as possible.
In contrast to gzip, you need to decompress only what you need.
Compression is not only about space but about speed too
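The contrast with gzip can be illustrated by compressing each trail on its own, so that reading one trail never touches the others. The sketch below delta-encodes timestamps and uses zlib per trail; this scheme is an assumption chosen for illustration and is not TrailDB's actual compression. The function names and the timestamps after the first one are hypothetical.

```python
import struct
import zlib

def pack_trail(timestamps):
    """Delta-encode sorted timestamps, then compress this trail on its own."""
    deltas = timestamps[:1] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    return zlib.compress(struct.pack(f"<{len(deltas)}Q", *deltas))

def unpack_trail(blob):
    """Decompress one trail and undo the delta encoding."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}Q", raw)
    timestamps, total = [], 0
    for delta in deltas:
        total += delta
        timestamps.append(total)
    return timestamps

trails = {
    "user3435": [1454923792, 1454923800, 1454923860],  # later values are made up
    "user243": [1454923791, 1454923795],
}

# One compressed blob per trail instead of one gzip stream over everything.
store = {key: pack_trail(ts) for key, ts in trails.items()}

# Only the requested trail is decompressed.
print(unpack_trail(store["user243"]))
```

Small deltas between consecutive timestamps are far more repetitive than the raw values, which is why this kind of encoding tends to compress well on long trails.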
It can go fast, if you need speed
Outer beauty attracts, but inner beauty captivates
* YMMV - TrailDB can handle more than 1TB but you will likely need more than one server.