TrailDB Tutorial
PyData SF 2016
Ville Tuulos
Head of Data
ville@adroll.com
Material
TrailDB is
an efficient tool
for storing and querying
series of events.
Events like this, generated by user actions
Or, events generated automatically
That is, any kinds of events
AdRoll's
use case
Lots of Data
Events over time
Grouped by user
Users
Time
Gray: Page view
Red: Ads
Green: 3rd party data
Same data
sorted by account
Patterns!
an engaging site
evening campaigns
new campaign started
these users never sleep
Users
Time
Zooming In
Event
TrailDB
Why TrailDB?
Primary Key
e.g. user ID, document ID
Events
what happened and when?
History
trails of events
Simple Data Model
Primary Key + Events → Relational DB
History is lost in destructive updates
Comparison
Primary Key + History → Time-Series DB
Individual events are lost in aggregation
History + Events → Log files
Expensive to query
Sure,
storing and querying
series of events
is doable using existing tools
so...
Why
we need a new tool?
1) Developer Productivity
2) Prepare for the Future
3) Focus and Simplicity
ID1
ID2
ID3
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
ID1
ID2
ID3
Secret sauce: Compression
Internally, TrailDB uses a number of different compression techniques to condense the data to the minimal amount of space.
In contrast to gzip, you need to decompress only what you need.
Secret sauce: Compression
Compression is not only about space but about speed too
Ok, so what is TrailDB exactly?
TDB
all-events.tdb
1.5 GB
Simple: It is a read-only file
Simple: It is a library
like SQLite, not like Postgres or Redis
Create
Read
Update
Delete
Simple → Productive
>>> 1 + 2
>>> a + b
>>> a() + b()
which one of the following is easiest to reason about?
Immutable data FTW!
Polyglot → Productive
Use the right tool for the job
Python: Batteries included
R: Robust statistics
D: Performance & Expressivity
C: Performance & Low-level access
Haskell: Blow your mind
Go: Easy concurrency
Sounds great!
Let's try it in practice
Install TrailDB library on OS X
1. brew install traildb
2. git clone https://github.com/traildb/traildb
Install TrailDB library on Linux
1. apt-get install libarchive-dev libjudy-dev pkg-config
(or yum install judy-devel libarchive-devel pkg-config)
2. git clone https://github.com/traildb/traildb
3. cd traildb
4. ./waf configure
5. ./waf install
Old libJudy workaround
1. curl -O http://traildb.io/data/libJudy.so.1.0.3
2. cp libJudy.so.1.0.3 /usr/lib/libJudy.so.1
OS X (if compiling from source)
Linux
1. brew install homebrew/boneyard/judy
Let's test it!
1. cd traildb
2. tdb dump -i test
Smoke test with a test TrailDB
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
Time
Trails
ID1
ID2
ID3
ID4
ID5
What do you see
A TrailDB is an ordered set of trails.
a traildb, one file
1. curl -O http://traildb.io/data/pydata-tutorial.tdb
2. tdb dump -i pydata-tutorial.tdb
Test with a more exciting TrailDB
t
2016-01-02
Trail deconstructed
A trail is a list of events, ordered by time, identified by a unique key
t
2016-01-03
t
2016-01-07
User3214
a trail
t
2016-01-02
Event deconstructed
An event is a set of fields. The first field is always time.
page_open
signup
Sweden
an event
field 0
time
field 1
type
field 2
page_id
field 3
country
Let's create a TrailDB
user3435, 1454923792, page_open, signup
user243, 1454923791, submit, form2
user9076, 1454923802, search, landing
event
o
o
o
o
o
o
o
o
o
o
o
user3435
user243
user9076
How to make a TrailDB
Construct a new TrailDB based on an unordered stream of events
1. cat > test.csv
uuid time field1 field2
2 10 first_event second_event
3 9 another_event yet_another
2. tdb make -i test.csv --csv-header -o mytest
3. tdb dump -i mytest
Convert a CSV to a TrailDB
Finally, Python
1. git clone https://github.com/traildb/traildb-python
2. cd traildb-python
3. python setup.py install
Install Python bindings
from traildb import TrailDBConstructor, TrailDB
from uuid import uuid4
from datetime import datetime
cons = TrailDBConstructor('tiny', ['username', 'action'])
for i in range(3):
uuid = uuid4().hex
username = 'user%d' % i
for day, action in enumerate(['open', 'save', 'close']):
cons.add(uuid, datetime(2016, i + 1, day + 1), (username, action))
cons.finalize()
Create a TrailDB in Python
traildb-python/examples/tutorial_simple_traildb.py
from traildb import TrailDB
for uuid, trail in TrailDB('tiny').trails():
print uuid, list(trail)
Read a TrailDB in Python
traildb-python/examples/tutorial_simple_traildb.py
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
ID1
ID2
ID3
ID4
ID5
Supported Read Operations
Iterate over all trail IDs
Iterator
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
ID1
ID2
ID3
ID4
ID5
Supported Read Operations
Iterate over events of a trail
Cursor
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
ID1
ID2
ID3
ID4
ID5
Supported Read Operations
Lookup an individual trail given its ID
from traildb import TrailDB
tdb = TrailDB('tiny')
# Get a trail based on a trail id (O(1) operation)
print list(tdb[0])
# Get a trail based on a UUID (O(log(N)) operation)
uuid = tdb.get_uuid(1)
if uuid in tdb:
print list(tdb[uuid])
Get individual trails
from traildb import TrailDB
tdb = TrailDB('tiny')
# Field names
print tdb.fields
# Number of trails
print len(tdb)
# Minimum and maximum timestamps in this tdb
print tdb.time_range(parsetime=True)
# Number of uniques values in a field
print tdb.lexicon_size('username')
# Iterator of unique values in a field
print list(tdb.lexicon('username'))
Metadata
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
ID1
ID2
ID3
ID4
ID5
Event Filters
Iterate over a subset of events in a trail
Cursor with a filter
from traildb import TrailDB
for uuid, trail in TrailDB('tiny').trails(event_filter=[('action', 'open')]):
print uuid, list(trail)
Simple Key-Value Filter
from traildb import TrailDB
query = [('action', 'open'), ('action', 'close')]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
print uuid, list(trail)
action=open OR action=close
from traildb import TrailDB
query = [[('action', 'open')], [('username', 'user0')]]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
print uuid, list(trail)
action=open AND user=user0
from traildb import TrailDB
query = [[('action', 'open')], [('username', 'user0', True)]]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
print uuid, list(trail)
action=open AND user != user0
Putting it all together
TrailDB Visualization
1. Install Anaconda
2. conda install pandas datashader
Install Anaconda and DataShader
import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd
def get_events(tdb):
query = [('title', 'Prince (musician)')]
for i in range(len(tdb)):
events = list(tdb.trail(i, event_filter=query))
if events:
yield events[0].time, events
def get_dataframe():
tdb = TrailDB('pydata-tutorial.tdb')
base = tdb.min_timestamp()
types = []
xs = []
ys = []
for y, (first_ts, events) in enumerate(get_events(tdb)):
for event in events:
xs.append(int(event.time - base) / (24 * 3600))
ys.append(y)
types.append('user' if event.user else 'anon')
data = pd.DataFrame({'x': xs, 'y': ys})
data['type'] = pd.Series(types, dtype='category')
return data
cnv = ds.Canvas(400, 300)
agg = cnv.points(get_dataframe(), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'blue'}
img=tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'white')
with open('prince.png', 'w') as f:
f.write(img.to_bytesio().getvalue())
traildb-python/examples/datashader_example.py
Users
Time
import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd
def get_events(tdb):
query = [('title', 'Prince (musician)')]
for i in range(len(tdb)):
events = list(tdb.trail(i, event_filter=query))
if events:
yield events[0].time, events
def get_dataframe():
tdb = TrailDB('pydata-tutorial.tdb')
base = tdb.min_timestamp()
types = []
xs = []
ys = []
for y, (first_ts, events) in enumerate(sorted(get_events(tdb), reverse=True)):
for event in events:
xs.append(int(event.time - base) / (24 * 3600))
ys.append(y)
types.append('user' if event.user else 'anon')
data = pd.DataFrame({'x': xs, 'y': ys})
data['type'] = pd.Series(types, dtype='category')
return data
cnv = ds.Canvas(400, 300)
agg = cnv.points(get_dataframe(), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'blue'}
img=tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'white')
with open('prince.png', 'w') as f:
f.write(img.to_bytesio().getvalue())
traildb-python/examples/datashader_example.py
Sort rows by the time of first edit
Users
Time
TrailDB 201
Performance
Strings are expensive. Integers FTW
A field has a set of possible values. The first value is always NULL.
NULL
page_open
button_click
submit
a field
field 1
type
value 0
value 1
value 2
value 3
Items
The combination of a field and one of its values is called an item.
An item is represented as a 64-bit integer.
an item
Sweden
Country
35767
234
"Country": "Sweden"
Use integer items when you can
Utilities to convert between fields, values, items and strings.
TrailDB internals
So, a TrailDB is a big bunch of integers and some metadata.
A file like this can be encoded and queried very efficiently.
ID1
ID2
ID3
ID4
ID5
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
a traildb, one file
Metadata
from traildb import TrailDB
from collections import Counter
import timeit
def string_top():
tdb = TrailDB('pydata-tutorial')
return Counter(event.title for uuid, trail in tdb.trails()
for event in trail).most_common(5)
def item_top():
tdb = TrailDB('pydata-tutorial')
stats = Counter(event.title for uuid, trail in tdb.trails(rawitems=True)
for event in trail)
return [(tdb.get_item_value(item), f) for item, f in stats.most_common(5)]
print 'string_top', timeit.timeit(string_top, number=3)
print 'item_top', timeit.timeit(item_top, number=3)
string_top 4.75762200356
item_top 2.09803509712
traildb-python/examples/top_items.py
Use raw items,
don't translate to strings
For more performance tips, see
traildb.io/docs/technical_overview/#performance-best-practices
Architecture
Sharding Data
by time
Users
Time
TrailDB
Day 1
TrailDB
Day 2
TrailDB
Day 3
Sharding Data
by UUIDs
Users
Time
TrailDB
Users 1
TrailDB
Users 2
Sharding Computation
You can go far with a single beefy server
*nix shell is an unbeatably productive development environment
Example: Run a task that accesses all AdRoll deliveries over 30 days on a single d2.8xlarge (tens of billions of events)
←Download TrailDBs from S3
←Read TrailDBs to memory
Process data with 16 cores
Upload results to S3 →
AWS S3
TrailDB
Producer
TrailDB
Consumer A
TrailDB
Consumer B
TrailDB
Consumer C
Master
Data Store
TrailDB is designed for Cloud Environments
Simple, Elastic, Data-Graph, Dockeritized
Batch Job Engine
Join the project!
Ideas
Better integration with Pandas and Jupyter for data exploration
A set of generally applicable visualizations
Super fast Python processing using Numba
A toolkit of functions for anomaly detection, frequent episodes etc.
Test Python 3 support
More language bindings (NodeJS, PHP, Rust, Julia)
Play with public data sets (e.g. githubarchive.org)
Help to fix issues in our GitHub issue tracker
Use TrailDB in production
Happy Codebase
Outer beauty attracts, but inner beauty captivates
Small codebase, minimal dependencies
90%+ test coverage
Battle-hardened: 1.5 years in serious use
Takes backwards compatibility very seriously
Friendly community
A growing set of tools built on top of TrailDB
Thank you!
is hiring!
TrailDB Tutorial - PyData 2016
By Ville Tuulos
TrailDB Tutorial - PyData 2016
- 8,137