Introduction to TrailDB
Ville Tuulos
Sr. Principal Engineer
ville@adroll.com
TrailDB is
an efficient tool
for storing and querying
series of events.
Events like this, generated by user actions
Or, events generated automatically
That is, any kind of event
Simple Data Model
Primary Key: e.g. user ID, document ID
Events: what happened and when?
History: trails of events
Comparison
Primary Key + Events → Relational DB: history is lost in destructive updates
Primary Key + History → Time-Series DB: individual events are lost in aggregation
History + Events → Log files: expensive to query
Sure,
storing and querying
series of events
is doable using existing tools
so...
Why do we need a new tool?
1) Developer Productivity
2) Prepare for the Future
3) Focus and Simplicity
Ok, so what is TrailDB exactly?
TDB
all-events.tdb
1.5 GB
Simple: It is a read-only file
Simple: It is a library
like SQLite, not like Postgres or Redis
Create
Read
Update
Delete
Simple → Productive
Which one of the following is easiest to reason about?
>>> 1 + 2
>>> a + b
>>> a() + b()
Immutable data FTW!
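A minimal sketch of why the third expression is the hardest to reason about. The names `counter`, `a` and `b` are hypothetical and only illustrate hidden mutable state: once functions can change shared data, the result depends on evaluation order.

```python
# Hypothetical example: two functions that share mutable state.
counter = {"n": 0}

def a():
    counter["n"] += 1            # side effect: mutates shared state
    return counter["n"]

def b():
    return counter["n"] * 10     # result depends on whether a() already ran

first = a() + b()    # a() runs first and returns 1, then b() returns 10
second = b() + a()   # b() sees n == 1 and returns 10, then a() returns 2
```

With immutable data, as in `1 + 2`, the order of evaluation cannot change the answer.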
Polyglot → Productive
Use the right tool for the job
Python: Batteries included
R: Robust statistics
D: Performance & Expressivity
C: Performance & Low-level access
Haskell: Blow your mind
more to come!
Single-Server → Productive
*nix shell is an unbeatably productive development environment
Example: Run a task that accesses all AdRoll deliveries over 30 days on a single d2.8xlarge (tens of billions of events)
1) Download TrailDBs from S3
2) Read TrailDBs into memory
3) Process data with 16 cores
4) Upload results to S3
Rethink distributed computing
TrailDB is designed for the future
Sounds great!
How does it work?
[Figure: events plotted over time on five trails, ID1 to ID5]
TrailDB deconstructed
A TrailDB is an ordered set of trails.
[Figure: a TrailDB, one file]
Trail deconstructed
A trail is a list of events, ordered by time, identified by a unique key
[Figure: a trail for User3214 with events at t = 2016-01-02, 2016-01-03 and 2016-01-07]
Event deconstructed
An event is a set of fields. The first field is always time.
[Figure: an event with four fields: field 0 = time, field 1 = type (page_open), field 2 = page_id (signup), field 3 = country (Sweden)]
Field deconstructed
A field has a set of possible values. The first value is always NULL.
[Figure: a field (field 1, type) with its values: value 0 = NULL, value 1 = page_open, value 2 = button_click, value 3 = submit]
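A minimal sketch of a field as a lexicon of values, assuming a simple in-memory list. The `Field` class and its methods are hypothetical and are not the TrailDB API; the only property taken from the slide is that value 0 is always NULL.

```python
class Field:
    """Toy lexicon for one field: maps each distinct value to a small integer."""

    def __init__(self, name):
        self.name = name
        self.values = [None]          # value 0 is always NULL
        self.index = {None: 0}

    def value_id(self, value):
        """Return the integer for `value`, adding it to the lexicon if it is new."""
        if value not in self.index:
            self.index[value] = len(self.values)
            self.values.append(value)
        return self.index[value]

    def value(self, value_id):
        """Return the original value for an integer id."""
        return self.values[value_id]

event_type = Field("type")
ids = [event_type.value_id(v)
       for v in ["page_open", "button_click", "submit", "page_open"]]
```

Repeated values such as `page_open` map to the same small integer, which is what makes the later integer encoding compact.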
Item deconstructed
The combination of a field and one of its values is called an item.
An item is represented as a 64-bit integer.
[Figure: an item, e.g. the field Country combined with the value Sweden]
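A minimal sketch of packing a field and a value into one integer. The bit layout here (field index in the low 8 bits, value index in the remaining 56 bits) is an assumption made for illustration; the slides only state that an item is a 64-bit integer.

```python
FIELD_BITS = 8                       # assumed width of the field index
FIELD_MASK = (1 << FIELD_BITS) - 1

def make_item(field, value):
    """Pack a field index and a value index into one 64-bit integer."""
    if not 0 <= field <= FIELD_MASK:
        raise ValueError("field index out of range")
    if not 0 <= value < (1 << (64 - FIELD_BITS)):
        raise ValueError("value index out of range")
    return (value << FIELD_BITS) | field

def item_field(item):
    """Extract the field index from an item."""
    return item & FIELD_MASK

def item_value(item):
    """Extract the value index from an item."""
    return item >> FIELD_BITS

# Field 3 (country) combined with value 1 of that field.
item = make_item(field=3, value=1)
```

Comparing two items is then a single integer comparison, with no string handling involved.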
TrailDB reconstructed
So, a TrailDB is a big bunch of integers and some metadata.
A file like this can be encoded and queried very efficiently.
[Figure: a TrailDB, one file: trails ID1 to ID5 plus metadata]
Makes sense.
How to use it?
Supported Read Operations
Lookup an individual trail given its ID
Supported Read Operations
Iterate over all trail IDs
Iterator
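A minimal sketch of the first two read operations on a toy in-memory model. `ToyTrailDB` and its method names are hypothetical and are not the TrailDB API; the sketch only assumes that trails are kept in key order, so a lookup can use binary search.

```python
from bisect import bisect_left

class ToyTrailDB:
    """Read-only toy model: trails stored in key order."""

    def __init__(self, trails):
        # trails: dict mapping key -> list of (timestamp, fields) tuples
        self._keys = sorted(trails)
        self._trails = [tuple(sorted(trails[k], key=lambda e: e[0]))
                        for k in self._keys]

    def trail_id(self, key):
        """Look up the position of a trail by its key with binary search."""
        i = bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            raise KeyError(key)
        return i

    def keys(self):
        """Iterate over all trail keys in order."""
        return iter(self._keys)

    def trail(self, trail_id):
        """Return the time-ordered events of one trail."""
        return self._trails[trail_id]

db = ToyTrailDB({
    "user3435": [(1454923792, ("page_open", "signup"))],
    "user243": [(1454923791, ("submit", "form2"))],
    "user9076": [(1454923802, ("search", "landing"))],
})
all_keys = list(db.keys())
events = db.trail(db.trail_id("user243"))
```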
Supported Read Operations
Iterate over events of a trail
Cursor
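A minimal sketch of a cursor as a lazy iterator over one trail, written as a Python generator. The `cursor` function, the lexicon layout and the second event are illustrative assumptions, not the TrailDB API; the point is that an event is decoded only when the caller asks for it.

```python
def cursor(encoded_trail, lexicons):
    """Lazily yield decoded events from a trail of integer-encoded events.

    encoded_trail: iterable of (timestamp, [value_id, ...]) pairs
    lexicons: one list of values per field; index 0 stands for NULL
    """
    for timestamp, value_ids in encoded_trail:
        # Decoding happens only when the caller requests the next event.
        yield timestamp, [lex[v] for lex, v in zip(lexicons, value_ids)]

lexicons = [
    [None, "page_open", "button_click", "submit"],   # field "type"
    [None, "signup", "form2"],                       # field "page_id"
]
# Illustrative trail: the second event is made up for this sketch.
trail = [(1454923792, [1, 1]), (1454923800, [3, 2])]

events = cursor(trail, lexicons)
first = next(events)   # only the first event has been decoded so far
```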
[Figure: the integers 35767 and 234 shown next to the pair "Country": "Sweden"]
Supported Read Operations
Utilities to convert between fields, values, items and strings.
The API actively encourages working with items, which is fast.
user3435, 1454923792, page_open, signup
user243, 1454923791, submit, form2
user9076, 1454923802, search, landing
[Figure: events arranged into trails for user3435, user243 and user9076]
Supported Write Operations
Construct a new TrailDB based on an unordered stream of events
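A minimal sketch of this write operation: group an unordered stream of events by key, then order each trail by time. `build_trails` is a hypothetical helper operating on plain Python data, not the TrailDB constructor; the last event in the stream is made up to show out-of-order input.

```python
from collections import defaultdict

def build_trails(event_stream):
    """Group unordered (key, timestamp, *fields) events into time-ordered trails."""
    trails = defaultdict(list)
    for key, timestamp, *fields in event_stream:
        trails[key].append((timestamp, tuple(fields)))
    # Order trails by key and the events inside each trail by time.
    return {key: sorted(trails[key], key=lambda e: e[0])
            for key in sorted(trails)}

stream = [
    ("user3435", 1454923792, "page_open", "signup"),
    ("user243", 1454923791, "submit", "form2"),
    ("user9076", 1454923802, "search", "landing"),
    ("user243", 1454923700, "page_open", "form2"),   # hypothetical late arrival
]
trails = build_trails(stream)
```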
[Figure: events on trails for user3435 and user243]
Supported Write Operations
Merge two existing TrailDBs into a new TrailDB
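A minimal sketch of a merge on plain Python data, assuming each input maps a key to a time-ordered list of events. `merge_trails` is a hypothetical helper, not the TrailDB API, and the timestamps are illustrative. Both inputs are left untouched and a new mapping is returned.

```python
from heapq import merge

def merge_trails(db_a, db_b):
    """Merge two key -> time-ordered event list mappings into a new mapping."""
    merged = {}
    for key in sorted(set(db_a) | set(db_b)):
        # heapq.merge combines two already sorted lists without re-sorting them.
        merged[key] = list(merge(db_a.get(key, []), db_b.get(key, []),
                                 key=lambda e: e[0]))
    return merged

# Illustrative inputs with small made-up timestamps.
db_a = {"user243": [(1, ("submit",))], "user3435": [(2, ("page_open",))]}
db_b = {"user243": [(0, ("page_open",)), (5, ("search",))]}
merged = merge_trails(db_a, db_b)
```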
[Figure: events on a trail for user243; events on trails for user3435, user243 and user9076]
That is simple!
Does it really move mountains?
Secret sauce: Compression
Internally, TrailDB uses a number of different compression techniques to store the data in as little space as possible.
Unlike with gzip, you decompress only the data you need.
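A minimal sketch of the "decompress only what you need" idea. The slides do not say which techniques TrailDB uses, so zlib and JSON here are stand-ins chosen for illustration, and `compress_trails` and `read_trail` are hypothetical helpers. Each trail is compressed as its own block, so reading one trail never touches the others.

```python
import json
import zlib

def compress_trails(trails):
    """Compress every trail as its own block so each can be read independently."""
    return {key: zlib.compress(json.dumps(events).encode("utf-8"))
            for key, events in trails.items()}

def read_trail(blocks, key):
    """Decompress only the block that holds the requested trail."""
    return json.loads(zlib.decompress(blocks[key]).decode("utf-8"))

trails = {
    "user3435": [[1454923792, "page_open", "signup"]],
    "user243": [[1454923791, "submit", "form2"]],
}
blocks = compress_trails(trails)
one_trail = read_trail(blocks, "user243")
```

A single gzipped file, by contrast, has to be decompressed from the start up to the point of interest.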
Secret sauce: Compression
Compression is not only about space but about speed too
Thoroughly Performance-Oriented
It can go fast, if you need speed
Core TrailDB is implemented in C
All read operations are lazy with no memory allocations
Cache friendly: switch between 32-bit and 64-bit items
Multicore/NUMA friendly
Actively leverages OS virtual memory
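A minimal sketch of lazy reads backed by OS virtual memory, using Python's `mmap`. The file layout (a flat array of little-endian 64-bit integers) and the helper names are assumptions for illustration, not TrailDB's file format. The file is mapped rather than loaded, and the OS pages in only the parts that are touched.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<Q")   # one little-endian unsigned 64-bit integer

def write_items(path, items):
    """Write items as a flat array of 64-bit integers."""
    with open(path, "wb") as f:
        for item in items:
            f.write(ITEM.pack(item))

def read_item(path, index):
    """Map the file and decode a single item without reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return ITEM.unpack_from(m, index * ITEM.size)[0]

path = os.path.join(tempfile.mkdtemp(), "items.bin")
write_items(path, [259, 515, 771])
third = read_item(path, 2)
```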
Happy Codebase
Outer beauty attracts, but inner beauty captivates
Small codebase, minimal dependencies
90%+ test coverage
Battle-hardened: 1.5 years in serious use
Takes backwards compatibility very seriously
Friendly community
A growing set of tools built on top of TrailDB
Take-Home Message
If you are building an app or a script that needs to store and query series of events, and the raw gzipped data is less than 1TB*,
you should consider using TrailDB
(probably on a single server).
It can make you more productive and your application faster and more robust.
* YMMV - TrailDB can handle more than 1TB but you will likely need more than one server.
Get started at
traildb.io
v.0.1 Contributors
Ville Tuulos
Oleg Avdeev
Jared Flatow
Steven Wright
Mikko Juola
Benoit Rostykus
Asif Imran
Bryan Galvin
Jyri Tuulos
Chris Evans
Knut Nesheim
Martin Scholl
Thank you
Try TrailDB today!