Introduction to TrailDB

Ville Tuulos

Sr. Principal Engineer

ville@adroll.com

TrailDB is

an efficient tool

for storing and querying

series of events.

Primary Key

e.g. user ID, document ID

Events

what happened and when?

History

trails of events

Simple Data Model

Primary Key + Events → Relational DB

History is lost in destructive updates

Comparison

Primary Key + History → Time-Series DB

Individual events are lost in aggregation

History + Events → Log files

Expensive to query

Sure,

storing and querying

series of events

is doable using existing tools

so...

Why

do we need a new tool?

1) Developer Productivity

http://www.slideshare.net/AmazonWebServices/dvo209-jaws-a-scalable-serverless-framework/

2) Prepare for the Future

Simple: It is a library

like SQLite, not like Postgres or Redis

Create

Read

Update

Delete

Simple → Productive

Which of the following is easiest to reason about?

>>> 1 + 2

>>> a + b

>>> a() + b()

https://en.wikipedia.org/wiki/Referential_transparency

Immutable data FTW!
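
The three expressions can be made concrete with a small sketch. The names `impure`, `pure`, and `frozen` are made up for illustration; the point is only that a call which depends on hidden, changing state is harder to reason about than one that depends on immutable data.

```python
# Illustrative only: all names below are hypothetical.

# 1) Literals: the value is obvious from the text itself.
literal_sum = 1 + 2

# 2) Variables: the value depends on what a and b are bound to right now.
a, b = 1, 2
variable_sum = a + b

# 3) Function calls: the value may depend on hidden, changing state.
_state = {"calls": 0}

def impure():
    """Return a different number on every call (not referentially transparent)."""
    _state["calls"] += 1
    return _state["calls"]

first = impure() + impure()    # 1 + 2
second = impure() + impure()   # 3 + 4, same expression, different result

# With immutable data, the same call always gives the same answer.
frozen = (1, 2)

def pure():
    """Depend only on immutable input, so every call returns the same value."""
    return sum(frozen)
```

Because `frozen` can never change, `pure()` can be replaced by its value anywhere in a program, which is the property an immutable file gives to everything that reads it.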

Polyglot → Productive

Use the right tool for the job

Python: Batteries included

R: Robust statistics

D: Performance & Expressivity

C: Performance & Low-level access

Haskell: Blow your mind

more to come!

Single-Server → Productive

*nix shell is an unbeatably productive development environment

Example: Run a task that accesses all AdRoll deliveries over 30 days on a single d2.8xlarge (tens of billions of events)

Download TrailDBs from S3

Read TrailDBs into memory

Process data with 16 cores

Upload results to S3
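
The "process data" step might look roughly like the sketch below. It is a toy under stated assumptions: the shards are small in-memory lists standing in for downloaded TrailDB files, and `count_event_types`, `run_task`, and `parallel_map` are hypothetical names, not part of TrailDB or of the pipeline described here.

```python
from collections import Counter

def count_event_types(shard):
    """Process one shard: count its events by type."""
    return Counter(event_type for _timestamp, event_type in shard)

def run_task(shards, parallel_map=map):
    """Apply the per-shard function with parallel_map and combine the partial results."""
    total = Counter()
    for partial in parallel_map(count_event_types, shards):
        total.update(partial)
    return total

# Three tiny in-memory "shards" stand in for files downloaded from S3.
shards = [
    [(1, "page_open"), (2, "submit")],
    [(3, "page_open")],
    [(4, "search"), (5, "page_open")],
]
result = run_task(shards)
```

On a multi-core server, passing the `map` method of a `multiprocessing.Pool` as `parallel_map` would spread the shards over worker processes without changing the rest of the code.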

http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html

Rethink distributed computing

TrailDB is designed for the future

Sounds great!

How does it work?

[Diagram: trails ID1 to ID5 drawn as rows along a time axis; each circle (o) is an event.]

TrailDB deconstructed

A TrailDB is an ordered set of trails.

[Diagram: a traildb, one file; time axis t, labeled 2016-01-02.]

Trail deconstructed

A trail is a list of events, ordered by time, identified by a unique key

[Diagram: a trail for User3214; time axis t, labeled 2016-01-02, 2016-01-03, and 2016-01-07.]

Event deconstructed

An event is a set of fields. The first field is always time.

an event:

field 0 (time)

field 1 (type): page_open

field 2 (page_id): signup

field 3 (country): Sweden

Field deconstructed

A field has a set of possible values. The first value is always NULL.

a field: field 1 (type)

value 0: NULL

value 1: page_open

value 2: button_click

value 3: submit
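
A field with a fixed set of values can be modeled as a small lookup table. The sketch below is a toy, not TrailDB's implementation; `FieldLexicon` and its methods are hypothetical names. It only shows the idea that each distinct value gets an integer index and that index 0 is reserved for NULL.

```python
class FieldLexicon:
    """Toy model of one field: maps each distinct value to a small integer."""

    def __init__(self, name):
        self.name = name
        self.values = [None]       # value 0 is always NULL (None in Python)
        self._index = {None: 0}

    def encode(self, value):
        """Return the index of value, adding it to the lexicon if it is new."""
        if value not in self._index:
            self._index[value] = len(self.values)
            self.values.append(value)
        return self._index[value]

    def decode(self, index):
        """Return the value stored at index."""
        return self.values[index]

event_type = FieldLexicon("type")
codes = [event_type.encode(v)
         for v in ["page_open", "button_click", "submit", "page_open"]]
```

Repeated values such as `page_open` map to the same index, so a long trail stores small integers instead of repeated strings.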

Item deconstructed

The combination of a field and one of its values is called an item.

An item is represented as a 64-bit integer.

an item: Country = Sweden
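
One way to picture an item is as a field index and a value index packed into a single integer. The bit layout below (8 bits for the field, the rest for the value) is an assumption made for this sketch and is not claimed to match TrailDB's actual encoding; `make_item`, `item_field`, and `item_value` are hypothetical helpers.

```python
FIELD_BITS = 8                        # assumed width of the field index
FIELD_MASK = (1 << FIELD_BITS) - 1
MAX_ITEM = (1 << 64) - 1              # largest unsigned 64-bit integer

def make_item(field, value):
    """Pack a field index and a value index into one unsigned 64-bit integer."""
    if not 0 <= field <= FIELD_MASK:
        raise ValueError("field index out of range")
    if not 0 <= value <= (MAX_ITEM >> FIELD_BITS):
        raise ValueError("value index out of range")
    return (value << FIELD_BITS) | field

def item_field(item):
    """Extract the field index from an item."""
    return item & FIELD_MASK

def item_value(item):
    """Extract the value index from an item."""
    return item >> FIELD_BITS

# Hypothetical indexes: field 3 is "country" and value 5 is "Sweden".
sweden = make_item(field=3, value=5)
```

Checking whether an event has `Country = Sweden` then becomes a single integer comparison against `sweden`, with no string handling involved.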

TrailDB reconstructed

So, a TrailDB is a big bunch of integers and some metadata.

A file like this can be encoded and queried very efficiently.

[Diagram: a traildb, one file, containing trails ID1 to ID5 and metadata.]

Makes sense.

How to use it?

Supported Read Operations

Look up an individual trail given its ID

Supported Read Operations

Iterate over all trail IDs

[Diagram: an iterator stepping through trail IDs ID1 to ID5.]

Supported Read Operations

Iterate over events of a trail

Cursor

35767

234

"Country": "Sweden"

Supported Read Operations

Utilities to convert between fields, values, items and strings.

The API actively encourages working with items, which is fast.
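
The read operations can be sketched with a small in-memory stand-in. This is not the TrailDB API: `ToyTrailDB`, `lookup`, `ids`, and `cursor` are hypothetical names, and the timestamps and item integers are made-up sample data.

```python
class ToyTrailDB:
    """Read-only, in-memory stand-in for a TrailDB (not the real API)."""

    def __init__(self, trails):
        # trails: dict mapping trail ID -> list of (timestamp, item) tuples
        self._ids = tuple(sorted(trails))
        self._trails = {key: tuple(trails[key]) for key in self._ids}

    def lookup(self, trail_id):
        """Look up an individual trail given its ID."""
        return self._trails[trail_id]

    def ids(self):
        """Iterate over all trail IDs in order."""
        return iter(self._ids)

    def cursor(self, trail_id):
        """Lazily iterate over the events of one trail."""
        for event in self._trails[trail_id]:
            yield event

db = ToyTrailDB({
    "ID2": [(1454923791, 1283)],
    "ID1": [(1454923792, 259), (1454923802, 515)],
})
```

A caller would typically walk `db.ids()` and open a cursor per trail, comparing the integer items directly and converting to strings only for display.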

user3435, 1454923792, page_open, signup

user243, 1454923791, submit, form2

user9076, 1454923802, search, landing

[Diagram: trails user3435, user243, and user9076.]

Supported Write Operations

Construct a new TrailDB based on an unordered stream of events
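
Conceptually, construction groups events by primary key and orders each group by time. The sketch below does this with plain dictionaries; `build_trails` is a hypothetical helper, and the fourth event is invented to show an out-of-order arrival.

```python
from collections import defaultdict

def build_trails(events):
    """Group (key, timestamp, *fields) tuples by key and sort each group by time."""
    grouped = defaultdict(list)
    for key, timestamp, *fields in events:
        grouped[key].append((timestamp, *fields))
    return {key: sorted(grouped[key], key=lambda e: e[0])
            for key in sorted(grouped)}

stream = [
    ("user3435", 1454923792, "page_open", "signup"),
    ("user243", 1454923791, "submit", "form2"),
    ("user9076", 1454923802, "search", "landing"),
    ("user243", 1454923780, "page_open", "form2"),   # invented, arrives late
]
trails = build_trails(stream)
```

The input order does not matter: the late event for `user243` still ends up first in that user's trail.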

[Diagram: trails user3435 and user243.]

Supported Write Operations

Merge two existing TrailDBs into a new TrailDB

[Diagram: a TrailDB with trail user243, and a TrailDB with trails user3435, user243, and user9076.]
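
A merge can be pictured as combining trails that share a key while keeping each trail ordered by time. The sketch below works on plain dictionaries and leaves both inputs untouched; `merge_trails` is a hypothetical helper and the sample events are made up.

```python
import heapq

def merge_trails(first, second):
    """Merge two {key: time-ordered events} dicts into a new dict."""
    merged = {}
    for key in sorted(set(first) | set(second)):
        # heapq.merge combines two already-sorted sequences in one pass.
        merged[key] = list(heapq.merge(first.get(key, []),
                                       second.get(key, []),
                                       key=lambda e: e[0]))
    return merged

db_a = {"user3435": [(1454923792, "page_open")],
        "user243": [(1454923791, "submit")]}
db_b = {"user243": [(1454923780, "page_open")],
        "user9076": [(1454923802, "search")]}
db_c = merge_trails(db_a, db_b)
```

Producing a new result instead of updating `db_a` in place mirrors the immutable, create-then-read style described earlier.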

That is simple!

Does it really move mountains?

[Diagram: trails ID1, ID2, and ID3, drawn twice.]

Secret sauce: Compression

Internally, TrailDB combines several compression techniques to store the data in as little space as possible.

Unlike a gzipped file, which must be decompressed from the start, a TrailDB lets you decompress only the parts you read.

http://tuulos.github.io/pydata-2014/
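
The "decompress only what you need" property can be illustrated by compressing each trail separately. The sketch uses `zlib` and JSON purely as stand-ins; TrailDB's own encoding is different and is not shown here, and `pack` and `read_trail` are hypothetical helpers.

```python
import json
import zlib

def pack(trails):
    """Compress each trail separately so that it can be read on its own."""
    return {key: zlib.compress(json.dumps(events).encode("utf-8"))
            for key, events in trails.items()}

def read_trail(packed, key):
    """Decompress only the requested trail; all others stay compressed."""
    return json.loads(zlib.decompress(packed[key]).decode("utf-8"))

packed = pack({
    "user243": [[1454923791, "submit", "form2"]],
    "user3435": [[1454923792, "page_open", "signup"]],
})
one_trail = read_trail(packed, "user243")
```

Reading `user243` never touches the bytes of `user3435`, whereas a single gzip stream over all trails would have to be decompressed up to the point of interest.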

Secret sauce: Compression

Compression is not only about space but about speed too

Thoroughly Performance-Oriented

It can go fast, if you need speed

Core TrailDB is implemented in C

All read operations are lazy with no memory allocations

Cache friendly: Switch between 32/64 bit items

Multicore/NUMA friendly

Actively leverages OS virtual memory
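
Lazy reads over memory-mapped data can be sketched as follows. This is a Python illustration of the idea only, assuming a made-up file of 64-bit integers; it does not reproduce TrailDB's file format or its allocation-free C implementation, and `write_items` and `iter_items` are hypothetical helpers.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<Q")   # one unsigned 64-bit little-endian integer

def write_items(path, items):
    """Write items to path as consecutive 64-bit integers."""
    with open(path, "wb") as f:
        for item in items:
            f.write(ITEM.pack(item))

def iter_items(path):
    """Yield items one at a time from a memory-mapped file."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        for offset in range(0, len(view), ITEM.size):
            yield ITEM.unpack_from(view, offset)[0]

path = os.path.join(tempfile.mkdtemp(), "items.bin")
write_items(path, [1283, 259, 515])

first_two = []
for item in iter_items(path):
    first_two.append(item)
    if len(first_two) == 2:
        break                # the third item is never decoded
```

The operating system pages the file in on demand, so a reader that stops early only pays for the pages it actually touched.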

Happy Codebase

Outer beauty attracts, but inner beauty captivates

Small codebase, minimal dependencies

90%+ test coverage

Battle-hardened: 1.5 years in serious use

Takes backwards compatibility very seriously

Friendly community

A growing set of tools built on top of TrailDB

Take-Home Message

If you are building an app or a script that needs to store and query series of events, and the raw gzipped data is less than 1TB*,

you should consider using TrailDB

(probably on a single server).

It can make you more productive and your application faster and more robust.

* YMMV - TrailDB can handle more than 1TB but you will likely need more than one server.

Get started at

traildb.io

v.0.1 Contributors

Ville Tuulos

Oleg Avdeev

Jared Flatow

Steven Wright

Mikko Juola

Benoit Rostykus

Asif Imran

Bryan Galvin

Jyri Tuulos

Chris Evans

Knut Nesheim

Martin Scholl

Events like this, generated by user actions

Or, events generated automatically

That is, any kind of event

3) Focus and Simplicity

Ok, so what is TrailDB exactly?

TDB

Simple: It is a read-only file
