SF Data Mining Meetup 7/19/2016
red: ads
gray: page views
green: 3rd party data
These users don't sleep: fraud?
A very engaging site
Active only during business hours
Evening campaigns
New Prospecting campaign launched→
[Figure labels: User 1, User 2, User 3, User 4; axes: Time, Users]
Trails
Event
e.g. user ID, document ID
what happened and when?
trails of events
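A minimal sketch of this data model in plain Python (not TrailDB itself): each event carries a key, a timestamp and a field, and a trail is the time-ordered list of events that share a key. The names `Event` and `build_trails` and the sample data are made up for illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical event record: a key (e.g. user ID), a timestamp, and one field.
Event = namedtuple('Event', ['key', 'time', 'action'])

def build_trails(events):
    """Group events by key and order each group by time."""
    trails = defaultdict(list)
    for event in events:
        trails[event.key].append(event)
    for trail in trails.values():
        trail.sort(key=lambda e: e.time)
    return dict(trails)

# Made-up sample events, deliberately out of order.
events = [
    Event('user2', 30, 'close'),
    Event('user1', 10, 'open'),
    Event('user2', 20, 'open'),
    Event('user1', 40, 'save'),
]

trails = build_trails(events)
for key in sorted(trails):
    print(key, [(e.time, e.action) for e in trails[key]])
```

Each trail answers "what happened and when?" for one key without scanning the events of other keys.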
History is lost in destructive updates
Individual events are lost in aggregation
Expensive to query
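The first two points can be illustrated with a small plain-Python sketch. The event data is made up, and the dict, counter and list only stand in for a mutable database row, an aggregate table and an event log.

```python
from collections import Counter

# Made-up events for one user: (timestamp, action).
events = [(1, 'open'), (2, 'save'), (3, 'close')]

# Destructive update: each write overwrites the previous state.
state = {}
for time, action in events:
    state['last_action'] = action

# Aggregation: counts survive, order and timing do not.
counts = Counter(action for _, action in events)

# Event log: append-only, so the full history stays available.
log = list(events)

print(state)
print(dict(counts))
print(log)
```

Only the log can still answer a question such as "what did the user do before saving?".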
[Figure labels: ID1, ID2, ID3]
Internally, TrailDB uses several compression techniques to store the data in as little space as possible.
In contrast to gzip, you decompress only the data you need.
Compression is not only about space but about speed too.
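To illustrate the contrast with gzip, here is a rough sketch in plain Python. It does not reflect TrailDB's actual internal format: `zlib` and JSON stand in for the real encoding, and the trails are made up. It only shows the general idea that compressing each trail separately lets a reader decompress one trail without touching the rest.

```python
import json
import zlib

# Made-up trails: key -> list of (timestamp, action) events.
trails = {
    'ID1': [(1, 'open'), (2, 'save'), (3, 'close')],
    'ID2': [(5, 'open'), (9, 'close')],
    'ID3': [(4, 'open'), (6, 'save')],
}

def encode(obj):
    """Serialize to JSON and compress (stand-in for a real encoding)."""
    return zlib.compress(json.dumps(obj).encode('utf-8'))

def decode(blob):
    """Decompress and parse a blob produced by encode()."""
    return json.loads(zlib.decompress(blob).decode('utf-8'))

# gzip-style: one blob, so reading ID2 means decompressing everything.
whole = encode(trails)
id2_from_whole = decode(whole)['ID2']

# Per-trail: one blob per key, so reading ID2 touches only its own blob.
per_trail = {key: encode(events) for key, events in trails.items()}
id2_direct = decode(per_trail['ID2'])

print(id2_from_whole == id2_direct)
```

Less data to decompress per query is one way compression can help speed as well as space.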
all-events.tdb
1.5 GB
Like SQLite, not like Postgres or Redis.
Implemented in C for maximum performance.
from traildb import TrailDBConstructor, TrailDB
from uuid import uuid4
from datetime import datetime

# Define fields (schema)
cons = TrailDBConstructor('tiny', ['username', 'action'])

for i in range(3):
    uuid = uuid4().hex
    username = 'user%d' % i
    for day, action in enumerate(['open', 'save', 'close']):
        # Add events to TrailDB
        cons.add(uuid, datetime(2016, i + 1, day + 1), (username, action))

# Finalize TrailDB
cons.finalize()
from traildb import TrailDB

# Python 2 syntax, matching the output shown
for uuid, trail in TrailDB('tiny').trails():
    print uuid, list(trail)
2ec4c2917e0f45c79accd43d385f4a5c
[event(time=1456819200L, username='user2', action='open'),
event(time=1456905600L, username='user2', action='save'),
event(time=1456992000L, username='user2', action='close')]
275d264fddde498f8a134f4518afcb6b
[event(time=1454313600L, username='user1', action='open'),
event(time=1454400000L, username='user1', action='save'),
event(time=1454486400L, username='user1', action='close')]
bed2864b4f51445daaa1d0a0d67bb5b5
[event(time=1451635200L, username='user0', action='open'),
event(time=1451721600L, username='user0', action='save'),
event(time=1451808000L, username='user0', action='close')]
Stunning TrailDB visualizations on your laptop in 5 easy steps
1. Install TrailDB
brew install traildb or http://traildb.io/docs/getting_started/
2. Install Python bindings for TrailDB
git clone https://github.com/traildb/traildb-python
3. Install the Datashader package for Python
4. Download a sample of Wikipedia edit history as a TrailDB
or download the full history, 663M events, 5.8GB:
import sys
from traildb import TrailDB
from itertools import islice
import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd

def get_trails(path):
    # One point per event: x is the event day, y is the trail index
    x = []
    y = []
    types = []
    # Read only the first 1000 trails
    for i, trail in enumerate(islice(TrailDB(path), 1000)):
        for event in trail:
            x.append(event.time / (24 * 60 * 60))
            y.append(i)
            types.append('user' if event.user else 'anon')
    df = pd.DataFrame({'x': x, 'y': y})
    df['type'] = pd.Series(types, dtype='category')
    return df

cnv = ds.Canvas(400, 300)
agg = cnv.points(get_trails(sys.argv[1]), 'x', 'y', ds.count_cat('type'))
colors = {'anon': 'red', 'user': 'yellow'}
img = tf.set_background(tf.colorize(agg, colors, how='eq_hist'), 'black')
# PNG data is binary, so open the file in binary mode
with open('output.png', 'wb') as f:
    f.write(img.to_bytesio().getvalue())
5. Run the script above and modify it for deeper insights
Anonymous edits (red dots) stop in 2010
Prince dies in April 2016
sort users by first edit date
Most edits about Prince's death are by new users
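One possible way to get the "sort users by first edit date" ordering is sketched below in plain Python. It assumes the trails have already been read into lists of event timestamps; the helper name `sort_by_first_event` and the sample timestamps are made up.

```python
# Made-up trails: each is a list of event timestamps (seconds since epoch).
trails = [
    [1460000000, 1461200000],
    [1262304000, 1300000000],
    [1400000000],
]

def sort_by_first_event(trails):
    """Order non-empty trails by the timestamp of their first event."""
    return sorted((t for t in trails if t), key=lambda t: t[0])

# The row index i would then serve as the y coordinate of each trail.
for i, trail in enumerate(sort_by_first_event(trails)):
    print(i, trail[0] // (24 * 60 * 60))
```

With rows ordered this way, users whose first edit is recent end up grouped together in the plot.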