e.g. user ID, document ID
what happened and when?
trails of events
History is lost in destructive updates
Individual events are lost in aggregation
Expensive to query
all-events.tdb
1.5 GB
like SQLite, not like Postgres or Redis
Which one of the following is easiest to reason about?
>>> 1 + 2
>>> a + b
>>> a() + b()
*nix shell is an unbeatably productive development environment
Example: Run a task that accesses all AdRoll deliveries over 30 days (tens of billions of events) on a single d2.8xlarge:
Download TrailDBs from S3
Read TrailDBs to memory
Process data with 16 cores
Upload results to S3
TrailDB is designed for the future
[Figure: trails ID1 to ID5 drawn as rows of events along a time axis]
A TrailDB is an ordered set of trails.
a traildb, one file
A trail is a list of events, ordered by time, identified by a unique key.
[Figure: a trail for User3214 with events on 2016-01-02, 2016-01-03, and 2016-01-07]
An event is a set of fields. The first field is always time.
[Figure: an event with field 0 (time) = 2016-01-02, field 1 (type) = page_open, field 2 (page_id) = signup, field 3 (country) = Sweden]
A field has a set of possible values. The first value is always NULL.
[Figure: a field, field 1 (type), with value 0 = NULL, value 1 = page_open, value 2 = button_click, value 3 = submit]
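The value numbering of a field can be sketched in Python. This is an illustration only: the `FieldLexicon` class and its methods are hypothetical names, not part of the TrailDB API. The one property taken from the slide is that value 0 is reserved for NULL.

```python
class FieldLexicon:
    """Toy model of one field: maps each distinct value to a small integer."""

    def __init__(self, name):
        self.name = name
        self.values = [None]       # value 0 is always NULL
        self.index = {None: 0}

    def add(self, value):
        """Return the index of value, assigning the next free index if it is new."""
        if value not in self.index:
            self.index[value] = len(self.values)
            self.values.append(value)
        return self.index[value]

    def lookup(self, idx):
        """Return the value stored at index idx."""
        return self.values[idx]


type_field = FieldLexicon("type")
for v in ["page_open", "button_click", "submit", "page_open"]:
    type_field.add(v)          # the repeated "page_open" reuses index 1
```

Repeated values cost nothing extra in the lexicon, so an event only needs to store the small index.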
The combination of a field and one of its values is called an item.
An item is represented as a 64-bit integer.
[Figure: an item, the field Country paired with the value Sweden]
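One way to picture an item is as a field index and a value index packed into a single integer. The sketch below assumes a made-up layout (field index in the low 8 bits, value index above it); TrailDB's real bit layout may differ, and `make_item`, `item_field`, and `item_value` are hypothetical helper names.

```python
FIELD_BITS = 8                        # assumed width of the field index
FIELD_MASK = (1 << FIELD_BITS) - 1
ITEM_MAX = (1 << 64) - 1              # an item must fit in 64 bits


def make_item(field, value):
    """Pack a field index and a value index into one integer."""
    if not 0 <= field <= FIELD_MASK or value < 0:
        raise ValueError("field or value index out of range")
    item = (value << FIELD_BITS) | field
    if item > ITEM_MAX:
        raise ValueError("item does not fit in 64 bits")
    return item


def item_field(item):
    """Extract the field index from an item."""
    return item & FIELD_MASK


def item_value(item):
    """Extract the value index from an item."""
    return item >> FIELD_BITS


# Field 3 is country; value index 1 is assumed to stand for "Sweden".
country_sweden = make_item(field=3, value=1)
```

Because the pair is one integer, two items can be compared with a single integer comparison.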
So, a TrailDB is a big bunch of integers and some metadata.
A file like this can be encoded and queried very efficiently.
[Figure: a traildb, one file, holding trails ID1 to ID5 and metadata]
Lookup an individual trail given its ID
Iterate over all trail IDs
Iterator
Iterate over events of a trail
Cursor
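The three read operations (look up a trail by ID, iterate over all trail IDs, iterate over the events of a trail) can be modelled with a toy in-memory class. `ToyTrailDB`, `trail_ids`, and `cursor` are hypothetical names for illustration, not the TrailDB API, and the sample events are invented.

```python
class ToyTrailDB:
    """In-memory stand-in: an ordered set of trails, each sorted by time."""

    def __init__(self, trails):
        # trails: dict mapping trail ID -> list of (time, type) events
        self._ids = sorted(trails)
        self._trails = {k: sorted(v) for k, v in trails.items()}

    def trail_ids(self):
        """Iterator over all trail IDs, in order."""
        return iter(self._ids)

    def cursor(self, trail_id):
        """Cursor over the events of one trail; raises KeyError if unknown."""
        return iter(self._trails[trail_id])


db = ToyTrailDB({
    "ID2": [(5, "submit")],
    "ID1": [(9, "submit"), (3, "page_open")],
})

# First event of every trail, visiting trails in ID order.
first_events = [next(db.cursor(tid)) for tid in db.trail_ids()]
```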
[Figure: the integers 35767 and 234 shown alongside the string form "Country": "Sweden"]
Utilities to convert between fields, values, items and strings.
The API actively encourages working with items, which is fast.
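A rough sketch of why working with items is fast: the string is translated to an integer once, outside the loop, and the loop itself compares integers only. The field names, lexicon contents, bit layout, and helper names below are assumptions made for illustration, not TrailDB's own.

```python
FIELD_NAMES = ["time", "type", "page_id", "country"]
LEXICONS = {3: [None, "Sweden", "Finland"]}   # hypothetical values; index 0 is NULL


def string_to_item(field_name, value):
    """Translate a (field name, value) pair of strings into an integer item."""
    field = FIELD_NAMES.index(field_name)
    return (LEXICONS[field].index(value) << 8) | field   # assumed 8-bit field index


def item_to_string(item):
    """Translate an integer item back into (field name, value)."""
    field, value = item & 0xFF, item >> 8
    return FIELD_NAMES[field], LEXICONS[field][value]


# Encoded events: each is a list of items (only the country item is shown).
events = [[string_to_item("country", c)] for c in ["Sweden", "Finland", "Sweden"]]

sweden = string_to_item("country", "Sweden")        # one string lookup, done once
count = sum(1 for ev in events if sweden in ev)     # integer comparisons only
```

Strings are needed only at the edges, when a query is written and when a result is printed.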
user3435, 1454923792, page_open, signup
user243, 1454923791, submit, form2
user9076, 1454923802, search, landing
Construct a new TrailDB based on an unordered stream of events
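Construction can be sketched as grouping events by key and sorting each group by time. The first three events are the ones listed above; the fourth is a hypothetical extra added to show an out-of-order arrival. `build_trails` is an illustrative name, not a TrailDB function.

```python
from collections import defaultdict

raw_events = [
    ("user3435", 1454923792, "page_open", "signup"),
    ("user243", 1454923791, "submit", "form2"),
    ("user9076", 1454923802, "search", "landing"),
    ("user243", 1454923700, "page_open", "form2"),   # hypothetical, arrives late
]


def build_trails(events):
    """Group an unordered event stream into per-key trails sorted by time."""
    trails = defaultdict(list)
    for key, time, *fields in events:
        trails[key].append((time, *fields))
    # Order the trails by key and the events of each trail by time.
    return {key: sorted(trails[key]) for key in sorted(trails)}


trails = build_trails(raw_events)
```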
Merge two existing TrailDBs into a new TrailDB
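A merge can be pictured as taking the union of the keys and interleaving the events of trails that share a key. The sketch below assumes each input trail is already sorted by time, which is what `heapq.merge` requires; `merge_trails` and the sample data are made up for illustration.

```python
import heapq


def merge_trails(a, b):
    """Merge two dicts of time-sorted trails into a new dict, leaving inputs untouched."""
    merged = {}
    for key in sorted(set(a) | set(b)):
        # heapq.merge interleaves two sorted sequences without re-sorting them.
        merged[key] = list(heapq.merge(a.get(key, []), b.get(key, [])))
    return merged


db1 = {"user3435": [(1, "page_open")], "user243": [(2, "submit"), (9, "search")]}
db2 = {"user243": [(5, "page_open")]}
merged = merge_trails(db1, db2)
```

The inputs are not modified; the result is a new structure, mirroring the slide's "into a new TrailDB".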
Internally, TrailDB uses a number of different compression techniques to store the data in as little space as possible.
In contrast to gzip, you decompress only what you need.
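The contrast with gzip can be illustrated by compressing each trail separately and keeping an index of offsets, so that reading one trail touches only its own bytes. This is not TrailDB's actual encoding, only a sketch of the idea using `zlib`; `pack` and `read_trail` are hypothetical names.

```python
import zlib


def pack(trails):
    """Compress each trail on its own; return the blob and an offset index."""
    blob, index = b"", {}
    for key, events in trails.items():
        chunk = zlib.compress("\n".join(events).encode())
        index[key] = (len(blob), len(chunk))   # where this trail's bytes live
        blob += chunk
    return blob, index


def read_trail(blob, index, key):
    """Decompress a single trail without touching the rest of the blob."""
    offset, size = index[key]
    return zlib.decompress(blob[offset:offset + size]).decode().split("\n")


blob, index = pack({"ID1": ["page_open", "submit"], "ID2": ["search"]})
only_id2 = read_trail(blob, index, "ID2")
```

A single gzip stream over the whole file would instead have to be decompressed from the beginning to reach the same trail.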
Compression is not only about space but about speed too
It can go fast, if you need speed
Outer beauty attracts, but inner beauty captivates
* YMMV: TrailDB can handle more than 1 TB, but you will likely need more than one server.
Ville Tuulos
Oleg Avdeev
Jared Flatow
Steven Wright
Mikko Juola
Benoit Rostykus
Asif Imran
Bryan Galvin
Jyri Tuulos
Chris Evans
Knut Nesheim
Martin Scholl