Processing Trillions of Events with TrailDB

SF Data Mining Meetup 7/19/2016

Ville Tuulos

Head of Data @ AdRoll

ville@adroll.com

TrailDB is

an efficient tool

for storing and querying

series of events.

Why

Lots of Data:

Time

events over time

Users

grouped by user

red: ads

gray: page views

green: 3rd-party data

Same Data

Sorted by Account

Full of patterns!

These users don't sleep: fraud?

A very engaging site

Active only during business hours

Evening campaigns

New Prospecting campaign launched→

[Figure: one row of events per user, User 1 to User 4; x-axis: Time, y-axis: Users]

Trails

Zooming in

TrailDB

This is TrailDB

Event

What

Primary Key

e.g. user ID, document ID

Events

what happened and when?

History

trails of events

Simple Data Model

Primary Key + Events → Relational DB

History is lost in destructive updates

Comparison

Primary Key + History → Time-Series DB

Individual events are lost in aggregation

History + Events → Log files

Expensive to query
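The three trade-offs above can be illustrated with a minimal sketch in plain Python. The event data and variable names are made up for illustration and nothing here uses the TrailDB API: the same raw events are stored as an overwritten row (history lost), as a per-key count (individual events lost), and as time-ordered trails grouped by primary key (nothing lost).

```python
from collections import defaultdict

# Hypothetical raw events in arrival order: (primary key, timestamp, action).
raw_events = [
    ("user1", 30, "save"),
    ("user0", 10, "open"),
    ("user1", 20, "open"),
    ("user0", 40, "close"),
]

# Relational-style row per key: each update overwrites the previous one,
# so only the latest action survives and the history is lost.
latest = {}
for key, ts, action in sorted(raw_events, key=lambda e: e[1]):
    latest[key] = action

# Time-series-style aggregate: a count per key survives,
# but the individual events do not.
counts = defaultdict(int)
for key, _, _ in raw_events:
    counts[key] += 1

# Trail-style grouping: every event is kept, grouped by key and sorted by time.
trails = defaultdict(list)
for key, ts, action in raw_events:
    trails[key].append((ts, action))
for events in trails.values():
    events.sort()

print(latest)
print(dict(counts))
print(dict(trails))
```

Only the last structure can still answer "what did user1 do, and in what order?" without going back to the raw log.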


Secret sauce: Compression

Internally, TrailDB combines several compression techniques to store the data in as little space as possible.

Unlike with gzip, you decompress only the data you actually read.

http://tuulos.github.io/pydata-2014/
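The slides do not say which techniques TrailDB uses, so the following is only a sketch of two generic ideas that suit event data: delta-encoding timestamps and dictionary-encoding repeated field values. The function names are hypothetical and this is not TrailDB's on-disk format. Because each trail is encoded on its own, a reader could decode one trail without touching the others.

```python
def delta_encode(timestamps):
    """Keep the first timestamp, then store each difference to the previous one."""
    deltas = []
    prev = 0
    for ts in timestamps:
        deltas.append(ts - prev)
        prev = ts
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    timestamps = []
    total = 0
    for d in deltas:
        total += d
        timestamps.append(total)
    return timestamps


def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    codes = {}
    encoded = []
    for v in values:
        # A new value gets the next free code; a known value reuses its code.
        encoded.append(codes.setdefault(v, len(codes)))
    return codes, encoded


# One trail: daily events, so the deltas are small and repetitive.
times = [1451635200, 1451721600, 1451808000]
deltas = delta_encode(times)
codes, encoded = dictionary_encode(["open", "save", "save", "close"])

print(deltas)
print(codes, encoded)
```

Small, repetitive integers like these are much easier to pack tightly than full timestamps and strings.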

Secret sauce: Compression

Compression is not only about space but about speed too

TDB

all-events.tdb

1.5 GB

Simple: It is a read-only file

Simple: It is a library

Like SQLite, not like Postgres or Redis.

Implemented in C for maximum performance.

Create

Read

Update and Delete: not supported, since a finalized TrailDB is a read-only file

Polyglot → Productive

Use the right tool for the job

Python: Batteries included

R: Robust statistics

D: Performance & Expressivity

C: Performance & Low-level access

Haskell: Blow your mind

more to come!

Clean API

Read

- Find a trail given a UUID or Trail ID

- Iterate over a trail

- Handle events efficiently

- Filter a subset of events

Create

- Create a new TrailDB based on raw events

- Merge an existing TrailDB into a new one

How



from traildb import TrailDBConstructor, TrailDB
from uuid import uuid4
from datetime import datetime

# Define fields (schema)
cons = TrailDBConstructor('tiny', ['username', 'action'])

for i in range(3):
    uuid = uuid4().hex
    username = 'user%d' % i
    for day, action in enumerate(['open', 'save', 'close']):
        # Add events to TrailDB
        cons.add(uuid, datetime(2016, i + 1, day + 1), (username, action))

# Finalize TrailDB
cons.finalize()

Create TrailDB in Python


from traildb import TrailDB

for uuid, trail in TrailDB('tiny').trails():
    # print() call form, so the snippet also parses under Python 3
    print(uuid, list(trail))

Read TrailDB in Python


2ec4c2917e0f45c79accd43d385f4a5c
    [event(time=1456819200L, username='user2', action='open'),
     event(time=1456905600L, username='user2', action='save'),
     event(time=1456992000L, username='user2', action='close')]

275d264fddde498f8a134f4518afcb6b
    [event(time=1454313600L, username='user1', action='open'),
     event(time=1454400000L, username='user1', action='save'),
     event(time=1454486400L, username='user1', action='close')]

bed2864b4f51445daaa1d0a0d67bb5b5
    [event(time=1451635200L, username='user0', action='open'),
     event(time=1451721600L, username='user0', action='save'),
     event(time=1451808000L, username='user0', action='close')]

Output
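The `time` values in the output are seconds since the Unix epoch. A small sketch of decoding them, assuming the example ran on a machine set to US Pacific time (an inference from the values, not something the slides state): the naive `datetime(2016, 1, 1)` added for user0 shows up as 1451635200, which is midnight at UTC-8.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset (Pacific Standard Time), inferred from the values.
PST = timezone(timedelta(hours=-8))

# First event of user0, user1 and user2 in the output above.
timestamps = [1451635200, 1454313600, 1456819200]

for ts in timestamps:
    as_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
    as_pst = datetime.fromtimestamp(ts, tz=PST)
    print(ts, as_utc.isoformat(), as_pst.isoformat())
```

Under that assumption the three values decode to midnight on January 1, February 1 and March 1, 2016, matching the `datetime(2016, i + 1, day + 1)` calls in the construction script.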

Try this!

Stunning TrailDB visualizations on your laptop in 5 easy steps

1. Install TrailDB

brew install traildb or http://traildb.io/docs/getting_started/

2. Install Python bindings for TrailDB

git clone https://github.com/traildb/traildb-python

3. Install the Datashader package for Python

https://github.com/bokeh/datashader

4. Download a sample of Wikipedia edit history as a TrailDB

http://traildb.io/data/wikipedia-history-small.tdb

or download the full history, 663M events, 5.8GB:

http://traildb.io/data/wikipedia-history.tdb


import sys

from traildb import TrailDB
from itertools import islice

import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd

def get_trails(path):
    x = []
    y = []
    types = []
    for i, trail in enumerate(islice(TrailDB(path), 1000)):
        for event in trail:
            x.append(event.time / (24 * 60 * 60))
            y.append(i)
            types.append('user' if event.user else 'anon')
    df = pd.DataFrame({'x': x, 'y': y})
    df['type'] = pd.Series(types, dtype='category')
    return df

cnv = ds.Canvas(400, 300)
agg = cnv.points(get_trails(sys.argv[1]), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'yellow'}
img = tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'black')
# PNG data is binary, so open the output file in binary mode
with open('output.png', 'wb') as f:
    f.write(img.to_bytesio().getvalue())

5. Run the script above and modify it for deeper insights

Anonymous edits (red dots) stop in 2010

Prince dies in April 2016

event.title == "Prince (musician)"

sort users by first edit date

Most edits about Prince's death are by new users
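The two modifications named above (filtering on `event.title` and sorting users by first edit date) can be sketched without TrailDB installed. The `Event` tuple and the trails below are made-up stand-ins for what the script reads from the Wikipedia TrailDB, so treat the names and data as assumptions.

```python
from collections import namedtuple

# Hypothetical stand-in for a TrailDB event; field names follow the slides.
Event = namedtuple("Event", ["time", "user", "title"])

# Made-up trails, one list of events per editor (not real Wikipedia data).
trails = {
    "editor_a": [Event(100, "alice", "Minneapolis"),
                 Event(300, "alice", "Prince (musician)")],
    "editor_b": [Event(50, "", "Purple Rain (album)")],
    "editor_c": [Event(290, "carol", "Prince (musician)"),
                 Event(310, "carol", "Prince (musician)")],
}


def first_edit(events):
    """Timestamp of the editor's earliest event."""
    return min(e.time for e in events)


def matching(events, title):
    """Keep only the events that touch the given page title."""
    return [e for e in events if e.title == title]


# Sort editors by first edit date, so newer editors get a larger y coordinate.
ordered = sorted(trails, key=lambda key: first_edit(trails[key]))

# One (y, time) point per matching edit, ready to plot.
rows = [(y, e.time)
        for y, key in enumerate(ordered)
        for e in matching(trails[key], "Prince (musician)")]

print(ordered)
print(rows)
```

With this ordering, edits by recently joined editors cluster at the top of the plot, which is how a pattern like "most edits are by new users" becomes visible.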

Happy Codebase

Designed for cloud computing

High-performance C, many bindings

Small codebase, minimal dependencies

90%+ test coverage

Battle-hardened: 1.5 years in serious use

Takes backwards compatibility very seriously

Friendly community

A growing set of tools built on top of TrailDB

Get started at

traildb.io

Thank you

Try TrailDB today!
