Processing Trillions of Events with TrailDB
SF Data Mining Meetup 7/19/2016
Ville Tuulos
Head of Data @ AdRoll
ville@adroll.com
TrailDB is
an efficient tool
for storing and querying
series of events.
Why
Lots of Data:
Time
events over time
Users
grouped by user
red: ads
gray: page views
green: 3rd-party data
Same Data
Sorted by Account
Full of patterns!
These users don't sleep: fraud?
A very engaging site
Active only during business hours
Evening campaigns
New Prospecting campaign launched→
User 1
User 2
User 3
User 4
Time
Users
Trails
Zooming in
TrailDB
This is TrailDB
Event
What
Primary Key
e.g. user ID, document ID
Events
what happened and when?
History
trails of events
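As a rough mental model, a trail can be pictured as a time-ordered list of events stored under its primary key. The sketch below is plain Python, not TrailDB's storage format; the names `Event`, `add_event` and the sample keys are made up for illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical event type: a timestamp plus one named field.
Event = namedtuple('Event', ['time', 'action'])

# primary key -> trail (list of events kept in time order)
trails = defaultdict(list)

def add_event(key, time, action):
    """Append an event to the trail of `key`, keeping the trail sorted by time."""
    trails[key].append(Event(time, action))
    trails[key].sort(key=lambda e: e.time)

# Events may arrive out of order; the trail still ends up time-ordered.
add_event('user-1', 30, 'close')
add_event('user-1', 10, 'open')
add_event('user-2', 20, 'open')
add_event('user-1', 20, 'save')

# The history of one key is its full sequence of events; nothing is overwritten.
history = [e.action for e in trails['user-1']]
print(history)
```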
Simple Data Model
Primary Key + Events → Relational DB
History is lost in destructive updates
Comparison
Primary Key + History → Time-Series DB
Individual events are lost in aggregation
History + Events → Log files
Expensive to query
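The comparison can be made concrete with a toy event stream. This is only an illustrative sketch in plain Python with invented data: each store is imitated by a simple in-memory structure, which is of course far simpler than a real database.

```python
from collections import Counter, defaultdict

# Hypothetical raw events: (user, day, action)
events = [
    ('alice', 1, 'open'),
    ('alice', 1, 'save'),
    ('bob',   1, 'open'),
    ('alice', 2, 'close'),
]

# Relational style: one row per key, updated in place (history is overwritten).
latest = {}
for user, day, action in events:
    latest[user] = action

# Time-series style: one aggregate per time bucket (individual events are gone).
per_day = Counter(day for _, day, _ in events)

# Log-file style: everything is kept, but finding one user means a full scan.
alice_from_log = [e for e in events if e[0] == 'alice']

# Trail style: every event is kept, grouped by key, in time order.
trails = defaultdict(list)
for user, day, action in sorted(events, key=lambda e: e[1]):
    trails[user].append((day, action))

print(latest)
print(dict(per_day))
print(trails['alice'])
```

Only the trail structure answers "what did alice do, in order?" without scanning events that belong to other users.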
Secret sauce: Compression
Internally, TrailDB combines several compression techniques to store the data in as little space as possible.
Unlike gzip, it lets you decompress only the parts you need.
Secret sauce: Compression
Compression is not only about space but about speed too
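The slides do not spell out which techniques are used, so the following is only a sketch of two generic ones, dictionary encoding of field values and delta encoding of timestamps, applied trail by trail. It is not TrailDB's on-disk format; the function names and the sample data are assumptions made for illustration. The point it tries to show is that a single trail can be decoded without touching any other.

```python
# Hypothetical input: primary key -> list of (timestamp, action)
raw = {
    'user0': [(1451635200, 'open'), (1451721600, 'save'), (1451808000, 'close')],
    'user1': [(1454313600, 'open'), (1454400000, 'save'), (1454486400, 'close')],
}

def encode(raw):
    """Dictionary-encode field values and delta-encode timestamps, per trail."""
    lexicon = []   # small integer id -> original string
    ids = {}       # original string -> small integer id
    encoded = {}
    for key, trail in raw.items():
        prev = 0
        out = []
        for time, value in trail:
            if value not in ids:
                ids[value] = len(lexicon)
                lexicon.append(value)
            # Store the gap to the previous event instead of the full timestamp.
            out.append((time - prev, ids[value]))
            prev = time
        encoded[key] = out
    return lexicon, encoded

def decode_trail(lexicon, encoded, key):
    """Decode a single trail; the other trails are never read."""
    time = 0
    trail = []
    for delta, value_id in encoded[key]:
        time += delta
        trail.append((time, lexicon[value_id]))
    return trail

lexicon, encoded = encode(raw)
print(encoded['user1'])
print(decode_trail(lexicon, encoded, 'user1'))
```

After the first event, each entry shrinks to a small gap (86400 seconds here) and a small integer id, both of which need far fewer bits than the original values.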
TDB
all-events.tdb
1.5 GB
Simple: It is a read-only file
Simple: It is a library
Like SQLite, not like Postgres or Redis.
Implemented in C for maximum performance.
Create
Read
Update
Delete
Polyglot → Productive
Use the right tool for the job
Python: Batteries included
R: Robust statistics
D: Performance & Expressivity
C: Performance & Low-level access
Haskell: Blow your mind
more to come!
Clean API
Read
- Find a trail given a UUID or Trail ID
- Iterate over a trail
- Handle events efficiently
- Filter a subset of events
Create
- Create a new TrailDB based on raw events
- Merge an existing TrailDB into a new one
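The read operations combine naturally in a few lines. The sketch below assumes only an iterable of (uuid, trail) pairs, which is the shape returned by `TrailDB(...).trails()` in the Python example in this deck. The helper `first_matching_events` and the stand-in data are hypothetical, and the filtering is done in plain Python rather than with any built-in TrailDB filter.

```python
from collections import namedtuple

def first_matching_events(trails, field, value):
    """For each trail, yield (uuid, first event whose `field` equals `value`)."""
    for uuid, trail in trails:
        for event in trail:
            if getattr(event, field) == value:
                yield uuid, event
                break  # stop reading this trail as soon as a match is found

# Stand-in data (hypothetical) so the sketch runs without a TrailDB file.
Event = namedtuple('event', ['time', 'username', 'action'])
fake_trails = [
    ('uuid-a', [Event(1, 'user0', 'open'), Event(2, 'user0', 'save')]),
    ('uuid-b', [Event(3, 'user1', 'open')]),
]

saves = list(first_matching_events(fake_trails, 'action', 'save'))
print(saves)
```

With a real file, `fake_trails` would be replaced by `TrailDB('tiny').trails()`.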
How
from traildb import TrailDBConstructor, TrailDB
from uuid import uuid4
from datetime import datetime

# Define fields (schema)
cons = TrailDBConstructor('tiny', ['username', 'action'])
for i in range(3):
    uuid = uuid4().hex
    username = 'user%d' % i
    for day, action in enumerate(['open', 'save', 'close']):
        # Add events to TrailDB
        cons.add(uuid, datetime(2016, i + 1, day + 1), (username, action))
# Finalize TrailDB
cons.finalize()
Create TrailDB in Python
from traildb import TrailDB

for uuid, trail in TrailDB('tiny').trails():
    print uuid, list(trail)
Read TrailDB in Python
2ec4c2917e0f45c79accd43d385f4a5c
[event(time=1456819200L, username='user2', action='open'),
event(time=1456905600L, username='user2', action='save'),
event(time=1456992000L, username='user2', action='close')]
275d264fddde498f8a134f4518afcb6b
[event(time=1454313600L, username='user1', action='open'),
event(time=1454400000L, username='user1', action='save'),
event(time=1454486400L, username='user1', action='close')]
bed2864b4f51445daaa1d0a0d67bb5b5
[event(time=1451635200L, username='user0', action='open'),
event(time=1451721600L, username='user0', action='save'),
event(time=1451808000L, username='user0', action='close')]
Output
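The `time` values in the output are Unix timestamps in seconds. A quick way to check them against the dates passed to `cons.add` is to decode the first timestamp of each trail as UTC. This is a small sketch using only the standard library; the list `timestamps` is copied from the output above.

```python
from datetime import datetime, timezone

# First timestamp of each trail in the output above
timestamps = [1456819200, 1454313600, 1451635200]

# Decode each value as a timezone-aware UTC datetime
decoded = [
    datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    for ts in timestamps
]

for ts, iso in zip(timestamps, decoded):
    print(ts, iso)
```

The three values decode to 08:00 UTC on the first day of March, February and January 2016. The 8-hour offset is consistent with the naive `datetime` objects having been created on a machine set to US Pacific time, although the slides do not say so.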
Try this!
Stunning TrailDB visualizations on your laptop in 5 easy steps
1. Install TrailDB
brew install traildb or http://traildb.io/docs/getting_started/
2. Install Python bindings for TrailDB
git clone https://github.com/traildb/traildb-python
3. Install DataShader Package for Python
4. Download a sample of Wikipedia edit history as a TrailDB
or download the full history, 663M events, 5.8GB:
import sys
from traildb import TrailDB
from itertools import islice
import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd

def get_trails(path):
    # Collect one point per event: x = day number, y = trail index
    x = []
    y = []
    types = []
    # Only look at the first 1000 trails
    for i, trail in enumerate(islice(TrailDB(path), 1000)):
        for event in trail:
            x.append(event.time / (24 * 60 * 60))
            y.append(i)
            # An empty user field is treated as an anonymous edit
            types.append('user' if event.user else 'anon')
    df = pd.DataFrame({'x': x, 'y': y})
    df['type'] = pd.Series(types, dtype='category')
    return df

cnv = ds.Canvas(400, 300)
agg = cnv.points(get_trails(sys.argv[1]), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'yellow'}
img = tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'black')
# Write the PNG in binary mode
with open('output.png', 'wb') as f:
    f.write(img.to_bytesio().getvalue())
5. Run this script and modify it for deeper insights
Anonymous edits (red dots) stop in 2010
Prince dies in April 2016
event.title == "Prince (musician)"
sort users by first edit date
Most edits about Prince's death are by new users
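The reordering behind this view can be sketched without any plotting. The example assumes each trail is already sorted by time, as TrailDB trails are, and uses a hypothetical `Edit` tuple and invented sample data; `rows_by_first_edit` is an illustrative helper, not part of any library.

```python
from collections import namedtuple

Edit = namedtuple('Edit', ['time', 'title'])  # hypothetical event shape

def rows_by_first_edit(trails, title):
    """Return (user, times of edits to `title`) rows, ordered by each user's first edit."""
    # Skip empty trails, then order users by the time of their very first event.
    ordered = sorted(
        ((user, trail) for user, trail in trails.items() if trail),
        key=lambda item: item[1][0].time,
    )
    return [
        (user, [e.time for e in trail if e.title == title])
        for user, trail in ordered
    ]

# Invented sample: one long-time editor, one bystander, one newcomer.
trails = {
    'veteran': [Edit(100, 'Other'), Edit(900, 'Prince (musician)')],
    'newcomer': [Edit(890, 'Prince (musician)'), Edit(905, 'Prince (musician)')],
    'bystander': [Edit(500, 'Other')],
}

rows = rows_by_first_edit(trails, 'Prince (musician)')
for row_index, (user, times) in enumerate(rows):
    print(row_index, user, times)
```

Plotting the row index on the y axis and the edit times on the x axis puts newer users further along the axis, so a cluster of matching edits in the last rows indicates activity from recently created accounts.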
Happy Codebase
Designed for cloud computing
High-performance C, many bindings
Small codebase, minimal dependencies
90%+ test coverage
Battle-hardened: 1.5 years in serious use
Takes backwards compatibility very seriously
Friendly community
A growing set of tools built on top of TrailDB
Thank you
Try TrailDB today!