PyData SF 2016
an engaging site
evening campaigns
new campaign started
these users never sleep
e.g. user ID, document ID
what happened and when?
trails of events
History is lost in destructive updates
Individual events are lost in aggregation
Expensive to query
ID1
ID2
ID3
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
ID1
ID2
ID3
Internally, TrailDB uses a number of different compression techniques to condense the data to the minimal amount of space.
In contrast to gzip, you need to decompress only what you need.
Compression is not only about space but about speed too
all-events.tdb
1.5 GB
like SQLite, not like Postgres or Redis
>>> 1 + 2
>>> a + b
>>> a() + b()
which one of the following is easiest to reason about?
1. brew install traildb
2. git clone https://github.com/traildb/traildb
1. apt-get install libarchive-dev libjudy-dev pkg-config
(or yum install judy-devel libarchive-devel pkg-config)
2. git clone https://github.com/traildb/traildb
3. cd traildb
4. ./waf configure
5. ./waf install
1. curl -O http://traildb.io/data/libJudy.so.1.0.3
2. cp libJudy.so.1.0.3 /usr/lib/libJudy.so.1
1. brew install homebrew/boneyard/judy
1. cd traildb
2. tdb dump -i test
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
Time
Trails
ID1
ID2
ID3
ID4
ID5
A TrailDB is an ordered set of trails.
a traildb, one file
1. curl -O http://traildb.io/data/pydata-tutorial.tdb
2. tdb dump -i pydata-tutorial.tdb
2016-01-02
A trail is a list of events, ordered by time, identified by a unique key
2016-01-03
2016-01-07
User3214
a trail
2016-01-02
An event is a set of fields. The first field is always time.
page_open
signup
Sweden
an event
field 0
time
field 1
type
field 2
page_id
field 3
country
user3435, 1454923792, page_open, signup
user243, 1454923791, submit, form2
user9076, 1454923802, search, landing
event
o
o
o
o
o
o
o
o
o
o
o
user3435
user243
user9076
Construct a new TrailDB based on an unordered stream of events
1. cat > test.csv
uuid time field1 field2
2 10 first_event second_event
3 9 another_event yet_another
2. tdb make -i test.csv --csv-header -o mytest
3. tdb dump -i mytest
1. git clone https://github.com/traildb/traildb-python
2. cd traildb-python
3. python setup.py install
from traildb import TrailDBConstructor, TrailDB
from uuid import uuid4
from datetime import datetime
cons = TrailDBConstructor('tiny', ['username', 'action'])
for i in range(3):
uuid = uuid4().hex
username = 'user%d' % i
for day, action in enumerate(['open', 'save', 'close']):
cons.add(uuid, datetime(2016, i + 1, day + 1), (username, action))
cons.finalize()
traildb-python/examples/tutorial_simple_traildb.py
from traildb import TrailDB
for uuid, trail in TrailDB('tiny').trails():
print uuid, list(trail)
traildb-python/examples/tutorial_simple_traildb.py
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
ID1
ID2
ID3
ID4
ID5
Iterate over all trail IDs
Iterator
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
ID1
ID2
ID3
ID4
ID5
Iterate over events of a trail
Cursor
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
ID1
ID2
ID3
ID4
ID5
Lookup an individual trail given its ID
from traildb import TrailDB
tdb = TrailDB('tiny')
# Get a trail based on a trail id (O(1) operation)
print list(tdb[0])
# Get a trail based on a UUID (O(log(N)) operation)
uuid = tdb.get_uuid(1)
if uuid in tdb:
print list(tdb[uuid])
from traildb import TrailDB
tdb = TrailDB('tiny')
# Field names
print tdb.fields
# Number of trails
print len(tdb)
# Minimum and maximum timestamps in this tdb
print tdb.time_range(parsetime=True)
# Number of uniques values in a field
print tdb.lexicon_size('username')
# Iterator of unique values in a field
print list(tdb.lexicon('username'))
event
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
ID1
ID2
ID3
ID4
ID5
Cursor with a filter
from traildb import TrailDB
for uuid, trail in TrailDB('tiny').trails(event_filter=[('action', 'open')]):
print uuid, list(trail)
from traildb import TrailDB
query = [('action', 'open'), ('action', 'close')]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
print uuid, list(trail)
from traildb import TrailDB
query = [[('action', 'open')], [('username', 'user0')]]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
print uuid, list(trail)
from traildb import TrailDB
query = [[('action', 'open')], [('username', 'user0', True)]]
for uuid, trail in TrailDB('tiny').trails(event_filter=query):
print uuid, list(trail)
1. Install Anaconda
2. conda install pandas datashader
import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd
def get_events(tdb):
query = [('title', 'Prince (musician)')]
for i in range(len(tdb)):
events = list(tdb.trail(i, event_filter=query))
if events:
yield events[0].time, events
def get_dataframe():
tdb = TrailDB('pydata-tutorial.tdb')
base = tdb.min_timestamp()
types = []
xs = []
ys = []
for y, (first_ts, events) in enumerate(get_events(tdb)):
for event in events:
xs.append(int(event.time - base) / (24 * 3600))
ys.append(y)
types.append('user' if event.user else 'anon')
data = pd.DataFrame({'x': xs, 'y': ys})
data['type'] = pd.Series(types, dtype='category')
return data
cnv = ds.Canvas(400, 300)
agg = cnv.points(get_dataframe(), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'blue'}
img=tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'white')
with open('prince.png', 'w') as f:
f.write(img.to_bytesio().getvalue())
traildb-python/examples/datashader_example.py
import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd
def get_events(tdb):
query = [('title', 'Prince (musician)')]
for i in range(len(tdb)):
events = list(tdb.trail(i, event_filter=query))
if events:
yield events[0].time, events
def get_dataframe():
tdb = TrailDB('pydata-tutorial.tdb')
base = tdb.min_timestamp()
types = []
xs = []
ys = []
for y, (first_ts, events) in enumerate(sorted(get_events(tdb), reverse=True)):
for event in events:
xs.append(int(event.time - base) / (24 * 3600))
ys.append(y)
types.append('user' if event.user else 'anon')
data = pd.DataFrame({'x': xs, 'y': ys})
data['type'] = pd.Series(types, dtype='category')
return data
cnv = ds.Canvas(400, 300)
agg = cnv.points(get_dataframe(), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'blue'}
img=tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'white')
with open('prince.png', 'w') as f:
f.write(img.to_bytesio().getvalue())
traildb-python/examples/datashader_example.py
Sort rows by the time of first edit
A field has a set of possible values. The first value is always NULL.
NULL
page_open
button_click
submit
a field
field 1
type
value 0
value 1
value 2
value 3
The combination of a field and one of its values is called an item.
An item is represented as a 64-bit integer.
an item
Sweden
Country
35767
234
"Country": "Sweden"
Utilities to convert between fields, values, items and strings.
So, a TrailDB is a big bunch of integers and some metadata.
A file like this can be encoded and queried very efficiently.
ID1
ID2
ID3
ID4
ID5
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
Text
a traildb, one file
Metadata
from traildb import TrailDB
from collections import Counter
import timeit
def string_top():
tdb = TrailDB('pydata-tutorial')
return Counter(event.title for uuid, trail in tdb.trails()
for event in trail).most_common(5)
def item_top():
tdb = TrailDB('pydata-tutorial')
stats = Counter(event.title for uuid, trail in tdb.trails(rawitems=True)
for event in trail)
return [(tdb.get_item_value(item), f) for item, f in stats.most_common(5)]
print 'string_top', timeit.timeit(string_top, number=3)
print 'item_top', timeit.timeit(item_top, number=3)
string_top 4.75762200356
item_top 2.09803509712
traildb-python/examples/top_items.py
Use raw items,
don't translate to strings
traildb.io/docs/technical_overview/#performance-best-practices
TrailDB
Day 1
TrailDB
Day 2
TrailDB
Day 3
TrailDB
Users 1
TrailDB
Users 2
*nix shell is an unbeatably productive development environment
Example: Run a task that accesses all AdRoll deliveries over 30 days on a single d2.8xlarge (tens of billions of events)
←Download TrailDBs from S3
←Read TrailDBs to memory
Process data with 16 cores
Upload results to S3 →
TrailDB
Producer
TrailDB
Consumer A
TrailDB
Consumer B
TrailDB
Consumer C
Master
Data Store
Better integration with Pandas and Jupyter for data exploration
A set of generally applicable visualizations
Super fast Python processing using Numba
A toolkit of functions for anomaly detection, frequent episodes etc.
Test Python 3 support
More language bindings (NodeJS, PHP, Rust, Julia)
Play with public data sets (e.g. githubarchive.org)
Help to fix issues in our GitHub issue tracker
Use TrailDB in production
Outer beauty attracts, but inner beauty captivates