TrailDB Tutorial

PyData SF 2016

Ville Tuulos

Head of Data

ville@adroll.com

slides.com/villetuulos/traildb-tutorial-pydata-2016

Material

t raildb.io

gitter.im/traildb/traildb

TrailDB is

an efficient tool

for storing and querying

series of events.

AdRoll's

use case

Why TrailDB?

Primary Key

e.g. user ID, document ID

Events

what happened and when?

History

trails of events

Simple Data Model

Primary Key + Events → Relational DB

History is lost in destructive updates

Comparison

Primary Key + History → Time-Series DB

Individual events are lost in aggregation

History + Events → Log files

Expensive to query

Sure,

storing and querying

series of events

is doable using existing tools

so...

Why

we need a new tool?

1) Developer Productivity

http://www.slideshare.net/AmazonWebServices/dvo209-jaws-a-scalable-serverless-framework/

2) Prepare for the Future

ID1

ID2

ID3

Text

ID1

ID2

ID3

Secret sauce: Compression

Internally, TrailDB uses a number of different compression techniques to condense the data to the minimal amount of space.

In contrast to gzip, you need to decompress only what you need.

http://tuulos.github.io/pydata-2014/

Secret sauce: Compression

Compression is not only about space but about speed too

Ok, so what is TrailDB exactly?

Simple: It is a library

like SQLite, not like Postgres or Redis

Create

Read

Update

Delete

Simple → Productive

>>> 1 + 2

>>> a + b

>>> a() + b()

which one of the following is easiest to reason about?

https://en.wikipedia.org/wiki/Referential_transparency

Immutable data FTW!

Polyglot → Productive

Use the right tool for the job

Python: Batteries included

R: Robust statistics

D: Performance & Expressivity

C: Performance & Low-level access

Haskell: Blow your mind

Go: Easy concurrency

Sounds great!

Let's try it in practice

Install TrailDB library on OS X

1. brew install traildb

2. git clone https://github.com/traildb/traildb

Install TrailDB library on Linux

1. apt-get install libarchive-dev libjudy-dev pkg-config

(or yum install judy-devel libarchive-devel pkg-config)

2. git clone https://github.com/traildb/traildb

3. cd traildb

4. ./waf configure

5. ./waf install

Old libJudy workaround

1. curl -O http://traildb.io/data/libJudy.so.1.0.3

2. cp libJudy.so.1.0.3 /usr/lib/libJudy.so.1

OS X (if compiling from source)

Linux

1. brew install homebrew/boneyard/judy

Let's test it!

1. cd traildb

2. tdb dump -i test

Smoke test with a test TrailDB

event

o

Time

Trails

ID1

ID2

ID3

ID4

ID5

What do you see

A TrailDB is an ordered set of trails.

a traildb, one file

1. curl -O http://traildb.io/data/pydata-tutorial.tdb

2. tdb dump -i pydata-tutorial.tdb

Test with a more exciting TrailDB

t

2016-01-02

Trail deconstructed

A trail is a list of events, ordered by time, identified by a unique key

t

2016-01-03

t

2016-01-07

User3214

a trail

t

2016-01-02

Event deconstructed

An event is a set of fields. The first field is always time.

page_open

signup

Sweden

an event

field 0

time

field 1

type

field 2

page_id

field 3

country

Let's create a TrailDB

user3435, 1454923792, page_open, signup

user243, 1454923791, submit, form2

user9076, 1454923802, search, landing

event

o

user3435

user243

user9076

How to make a TrailDB

Construct a new TrailDB based on an unordered stream of events

1. cat > test.csv

uuid time field1 field2
2 10 first_event second_event
3 9 another_event yet_another

2. tdb make -i test.csv --csv-header -o mytest

3. tdb dump -i mytest

Convert a CSV to a TrailDB

Finally, Python

1. git clone https://github.com/traildb/traildb-python

2. cd traildb-python

3. python setup.py install

Install Python bindings

from traildb import TrailDBConstructor, TrailDB
from uuid import uuid4
from datetime import datetime

cons = TrailDBConstructor('tiny', ['username', 'action'])

for i in range(3):
    uuid = uuid4().hex
    username = 'user%d' % i
    for day, action in enumerate(['open', 'save', 'close']):
        cons.add(uuid, datetime(2016, i + 1, day + 1), (username, action))

cons.finalize()

Create a TrailDB in Python

traildb-python/examples/tutorial_simple_traildb.py

from traildb import TrailDB

for uuid, trail in TrailDB('tiny').trails():
    print uuid, list(trail)

Read a TrailDB in Python

traildb-python/examples/tutorial_simple_traildb.py

event

o

ID1

ID2

ID3

ID4

ID5

Supported Read Operations

Iterate over all trail IDs

Iterator

event

o

ID1

ID2

ID3

ID4

ID5

Supported Read Operations

Iterate over events of a trail

Cursor

event

o

ID1

ID2

ID3

ID4

ID5

Supported Read Operations

Lookup an individual trail given its ID

from traildb import TrailDB

tdb = TrailDB('tiny')

# Get a trail based on a trail id (O(1) operation)
print list(tdb[0])

# Get a trail based on a UUID (O(log(N)) operation)
uuid = tdb.get_uuid(1)
if uuid in tdb:
    print list(tdb[uuid])

Get individual trails

from traildb import TrailDB

tdb = TrailDB('tiny')

# Field names
print tdb.fields

# Number of trails
print len(tdb)

# Minimum and maximum timestamps in this tdb
print tdb.time_range(parsetime=True)

# Number of uniques values in a field
print tdb.lexicon_size('username')

# Iterator of unique values in a field
print list(tdb.lexicon('username'))

Metadata

event

o

ID1

ID2

ID3

ID4

ID5

Event Filters

Iterate over a subset of events in a trail

Cursor with a filter

from traildb import TrailDB

for uuid, trail in TrailDB('tiny').trails(event_filter=[('action', 'open')]):
    print uuid, list(trail)

Simple Key-Value Filter

from traildb import TrailDB

query = [('action', 'open'), ('action', 'close')]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
    print uuid, list(trail)

action=open OR action=close

from traildb import TrailDB

query = [[('action', 'open')], [('username', 'user0')]]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
    print uuid, list(trail)

action=open AND user=user0

from traildb import TrailDB

query = [[('action', 'open')], [('username', 'user0', True)]]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
    print uuid, list(trail)

action=open AND user != user0

Putting it all together

TrailDB Visualization

1. Install Anaconda

2. conda install pandas datashader

Install Anaconda and DataShader

import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd

def get_events(tdb):
    query = [('title', 'Prince (musician)')]
    for i in range(len(tdb)):
        events = list(tdb.trail(i, event_filter=query))
        if events:
            yield events[0].time, events 

def get_dataframe():
    tdb = TrailDB('pydata-tutorial.tdb')
    base = tdb.min_timestamp() 
    types = []
    xs = []
    ys = []
    for y, (first_ts, events) in enumerate(get_events(tdb)):
        for event in events:
            xs.append(int(event.time - base) / (24 * 3600))
            ys.append(y)
            types.append('user' if event.user else 'anon')
    data = pd.DataFrame({'x': xs, 'y': ys})
    data['type'] = pd.Series(types, dtype='category')
    return data

cnv = ds.Canvas(400, 300)
agg = cnv.points(get_dataframe(), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'blue'}
img=tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'white')
with open('prince.png', 'w') as f:
	f.write(img.to_bytesio().getvalue())

traildb-python/examples/datashader_example.py

import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd

def get_events(tdb):
    query = [('title', 'Prince (musician)')]
    for i in range(len(tdb)):
        events = list(tdb.trail(i, event_filter=query))
        if events:
            yield events[0].time, events 

def get_dataframe():
    tdb = TrailDB('pydata-tutorial.tdb')
    base = tdb.min_timestamp() 
    types = []
    xs = []
    ys = []
    for y, (first_ts, events) in enumerate(sorted(get_events(tdb), reverse=True)):
        for event in events:
            xs.append(int(event.time - base) / (24 * 3600))
            ys.append(y)
            types.append('user' if event.user else 'anon')
    data = pd.DataFrame({'x': xs, 'y': ys})
    data['type'] = pd.Series(types, dtype='category')
    return data

cnv = ds.Canvas(400, 300)
agg = cnv.points(get_dataframe(), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'blue'}
img=tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'white')
with open('prince.png', 'w') as f:
	f.write(img.to_bytesio().getvalue())

traildb-python/examples/datashader_example.py

Sort rows by the time of first edit

TrailDB 201

Performance

Strings are expensive. Integers FTW

A field has a set of possible values. The first value is always NULL.

NULL

page_open

button_click

submit

a field

field 1

type

value 0

value 1

value 2

value 3

Items

The combination of a field and one of its values is called an item.

An item is represented as a 64-bit integer.

an item

Sweden

Country

35767

234

"Country": "Sweden"

Use integer items when you can

Utilities to convert between fields, values, items and strings.

TrailDB internals

So, a TrailDB is a big bunch of integers and some metadata.

A file like this can be encoded and queried very efficiently.

ID1

ID2

ID3

ID4

ID5

Text

a traildb, one file

Metadata

from traildb import TrailDB
from collections import Counter
import timeit

def string_top():
    tdb = TrailDB('pydata-tutorial')
    return Counter(event.title for uuid, trail in tdb.trails()
                               for event in trail).most_common(5)

def item_top():
    tdb = TrailDB('pydata-tutorial')
    stats = Counter(event.title for uuid, trail in tdb.trails(rawitems=True)
                                for event in trail)
    return [(tdb.get_item_value(item), f) for item, f in stats.most_common(5)]

print 'string_top', timeit.timeit(string_top, number=3)
print 'item_top', timeit.timeit(item_top, number=3)

string_top 4.75762200356
item_top 2.09803509712

traildb-python/examples/top_items.py

Use raw items,

don't translate to strings

For more performance tips, see

traildb.io/docs/technical_overview/#performance-best-practices

Architecture

Sharding Data

by time

Users

Time

TrailDB

Day 1

TrailDB

Day 2

TrailDB

Day 3

Sharding Data

by UUIDs

Users

Time

TrailDB

Users 1

TrailDB

Users 2

Sharding Computation

You can go far with a single beefy server

*nix shell is an unbeatably productive development environment

Example: Run a task that accesses all AdRoll deliveries over 30 days on a single d2.8xlarge (tens of billions of events)

←Download TrailDBs from S3

←Read TrailDBs to memory

Process data with 16 cores

Upload results to S3 →

AWS S3

TrailDB

Producer

TrailDB

Consumer A

TrailDB

Consumer B

TrailDB

Consumer C

Master

Data Store

TrailDB is designed for Cloud Environments

http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html

Simple, Elastic, Data-Graph, Dockeritized

Batch Job Engine

Join the project!

Ideas

Better integration with Pandas and Jupyter for data exploration

A set of generally applicable visualizations

Super fast Python processing using Numba

A toolkit of functions for anomaly detection, frequent episodes etc.

Test Python 3 support

More language bindings (NodeJS, PHP, Rust, Julia)

Play with public data sets (e.g. githubarchive.org)

Help to fix issues in our GitHub issue tracker

Use TrailDB in production

Happy Codebase

Outer beauty attracts, but inner beauty captivates

TrailDB Tutorial

Ville Tuulos

Head of Data

ville@adroll.com

Material

TrailDB is

an efficient tool

for storing and querying

series of events.

Events like this, generated by user actions

Or, events generated automatically

That is, any kinds of events

AdRoll's

use case

Lots of Data

Events over time

Grouped by user

Users

Time

Gray: Page view

Red: Ads

Green: 3rd party data

Same data

sorted by account

Patterns!

Users

Time

Zooming In

Event

TrailDB

Why TrailDB?

Primary Key

Events

History

Simple Data Model

Primary Key + Events → Relational DB

Comparison

Primary Key + History → Time-Series DB

History + Events → Log files

Sure,

storing and querying

series of events

is doable using existing tools

so...

Why

we need a new tool?

1) Developer Productivity

2) Prepare for the Future

3) Focus and Simplicity

Secret sauce: Compression

Secret sauce: Compression

Ok, so what is TrailDB exactly?

TDB

Simple: It is a read-only file

Simple: It is a library

Create

Read

Update

Delete

Simple → Productive

Immutable data FTW!

Polyglot → Productive

Use the right tool for the job

Python: Batteries included

R: Robust statistics

D: Performance & Expressivity

C: Performance & Low-level access

Haskell: Blow your mind

Go: Easy concurrency

Sounds great!

Let's try it in practice

Install TrailDB library on OS X

Install TrailDB library on Linux

Old libJudy workaround

OS X (if compiling from source)

Linux

Let's test it!

Smoke test with a test TrailDB

What do you see

Test with a more exciting TrailDB