A weekend of Analytic(s) Hacking in MongoDB
-- or --
Cultural Learnings of Mongodbia for Make Benefit Glorious Nation of Analytics Computation
System A: ELK
Pros
- Aggregates any sort of structured data
- Filters on any set of criteria
- Panels provide a great WYSIWYG experience
- Scalability is simple and configuration-based
- Personally benchmarked to hundreds of events per second via a test suite
- Out-of-the-box solution
Cons/Whatifs
- Repeatability: what if I want to recompute a certain subset of my data? (Music / Agreements)
- ES does not promise to be a reliable data store
- LS agents transmit via UDP
- Moving parts: different things can go down, network packets can drop, etc.
Protocols
- Binary protocols: MySQL Client/Server Protocol, JNI, RMI, raw socket connections, Thrift, Google Protocol Buffers
- HTTP-based: REST, SOAP
- WebSockets
- File IO
{
    "tags": {
        "album": "5403b0c2f5f5680200b45d89",
        "song": "5403b0c2f5f5680200b45d91",
        "locatorname": null,
        "eventtype": "startTrackPlayback",
        "artist": "5403b0c2f5f5680200b45d87",
        "locatortype": "buttonClick",
        "user": null,
        "device": "windows-unknown",
        "browser": "chrome"
    },
    "timestamp": 1427223676.554661,
    "host": "localhost",
    "message": null,
    "time_taken": null,
    "client": "UNKNOWN",
    "exc_info": null,
    "logger": "analytic",
    "type": "realtime_api",
    "levelname": "INFO"
}
Data now
- Goes to a file instead of a network connection.
- Conforms to the JSON Lines standard: http://jsonlines.org/ (see the sketch after this list).
- This serves as an interface.
- Logstash has a plugin that can read JSON Lines.
- System A is happy.
- My process has all it takes to be repeatable.
- I'm not a slave to my technology. I know file IO works like it has worked for my forefathers. Files can be rotated, and rsync also works.
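A minimal sketch of the writer side, assuming a plain file appender (the field names follow the event shown above; the path is illustrative):

import json
import time

def log_event(tags, path='/var/log/analytics/events.log'):
    # One JSON document per line (http://jsonlines.org/), appended to a plain file.
    # Logstash can tail this with its JSON-lines support, and any offline
    # reprocessing script can replay it.
    record = {
        "tags": tags,
        "timestamp": time.time(),
        "host": "localhost",
        "type": "realtime_api",
        "levelname": "INFO",
    }
    with open(path, 'a') as fh:
        fh.write(json.dumps(record) + "\n")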
Requirements
- Events: plays, favorites, etc.
- Reporting granularity: day, week, month, year, all-time
- Concepts: Songs, Artists, Albums, Labels
- Top-N & time series
All hail MongoDB
Assumptions
Collections
- stats_counters
- stats_counters_week
- stats_counters_month
- stats_counters_year
- stats_counters_alltime
- Granularity: (Day, Week, Month, Year, All)
- Criteria: (Song, Album, Label, Artist, User)
{
    "_id": ObjectId("55942e2a1a704cc8c3151b90"),
    "criteria": "user",
    "criteria_value": "5534dce2be48510300d4e5f0",
    "day": 5,
    "event": "playlistSavePopupOpen",
    "label": "123",
    "month": 5,
    "year": 2015,
    "counter": 32,
    "ymd": 20150505
}
This ...
def stash_aggregation(song_record):
    # / / / / / / / / / / / / / /
    # \ \ \ In case you're wondering this is a zip line \ \ \
    # / / / / / / / / / / / / / /
    # (inner_criteria_value, event, ymd, year, month, day, counter, label and the
    #  song_* variables are pulled out of song_record; that extraction is elided here.)
    for criteria in ('album', 'label', 'artist'):
        search_criteria = {'criteria': criteria, 'criteria_value': inner_criteria_value,
                           'event': event, 'ymd': ymd}
        counter_object = db.stats_counters.find_one(search_criteria)
        if counter_object is None:
            inserted_id = db.stats_counters.insert({"year": year, "month": month, "day": day,
                                                    "label": label, "ymd": ymd,
                                                    "criteria": criteria,
                                                    "criteria_value": inner_criteria_value,
                                                    "event": event, "counter": 0})
        else:
            inserted_id = counter_object['_id']
        db.stats_counters.update({"_id": inserted_id}, {"$inc": {"counter": counter}})

    (iso_year, iso_week, iso_day) = datetime.date(year, month, day).isocalendar()
    ingest_week(year, iso_week, event, counter, song_id, song_label, song_album, song_artist)
    ingest_month(year, month, event, counter, song_id, song_label, song_album, song_artist)
    ingest_year(year, event, counter, song_id, song_label, song_album, song_artist)
    ingest_alltime(event, counter, song_id, song_label, song_album, song_artist)
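For context, a sketch of the driving loop that could feed this function, assuming the JSON Lines file described earlier (file name and field extraction are illustrative):

import json

def replay_log(path='events.log'):
    # Re-reading the file is what makes the pipeline repeatable: every line is
    # one event, so recomputing a period is just replaying the relevant lines.
    with open(path) as fh:
        for line in fh:
            record = json.loads(line)
            stash_aggregation(record['tags'])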
And more ...
def ingest_week(year, iso_week, event, counter, song, song_label, song_album, song_artist):
    search_criteria = {'event': event, 'year': year, 'iso_week': iso_week}
    for (criteria, value) in (('song', song), ('album', song_album),
                              ('label', song_label), ('artist', song_artist)):
        search_criteria["criteria"] = criteria
        search_criteria["criteria_value"] = value
        counter_object = db.stats_counters_week.find_one(search_criteria)
        if counter_object is None:
            insertion_object = dict(search_criteria)
            insertion_object["counter"] = 0
            inserted_id = db.stats_counters_week.insert(insertion_object)
        else:
            inserted_id = counter_object['_id']
        db.stats_counters_week.update({"_id": inserted_id}, {"$inc": {"counter": counter}})
Per event
- 5 granularities (Day, Week, Month, Year, All)
- 4 criteria (Song, Album, Label, Artist)
- 5 x 4 = 20 inserts per event
The Results
- Performance is so bad that I didn't have the patience to measure it.
Google Knows
- MongoDB is constrained to one core per connection, so the client should be threaded.
- The Python driver sucks.
You want to borrow your neighbour's hose so you can wash your car. But you remember that last week you broke their rake, so you need to go to the hardware store to buy a new one. But that means driving to the hardware store, so you have to look for your keys. You eventually find your keys inside a tear in a cushion, but you can't leave the cushion torn, because the dog will destroy it if they find a little tear. The cushion needs a little more stuffing before it can be repaired, but it's a special cushion filled with exotic Tibetan yak hair.
The next thing you know, you’re standing on a hillside in Tibet shaving a yak. And all you wanted to do was wash your car.
Assumptions v2.0
- C is fast
- Hence a C program that implements the above logic will be faster.
Assumptions v3.0
- C is fast
- MongoDB's native import utilities must be blazingly fast.
- Hence a C program that spits insert queries to a file, which can then be imported via
mongoimport --db dbname --collection colname --type json --file fname
File Writing
Not as fast as anticipated, but still a notable speedup.
mongoimport
- Fails with a parse error.
- Aha... it's because my file is large. I need to use --jsonArray.
- But now this is too slow.
- Used the Linux split command to break the file into roughly 1 MB chunks. Now I have hundreds of files and no way of monitoring progress (a monitored import loop is sketched below).
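One way to get the progress monitoring back, sketched under the assumption that the chunk files follow a predictable naming pattern:

import glob
import subprocess

# Import the chunks one at a time and count them as we go.
# (File names are illustrative; add --jsonArray if the chunks are JSON arrays
# rather than one document per line.)
chunks = sorted(glob.glob('inserts_chunk_*.json'))
for i, chunk in enumerate(chunks, 1):
    subprocess.check_call(['mongoimport', '--db', 'dbname', '--collection', 'colname',
                           '--type', 'json', '--file', chunk])
    print('%d/%d chunks imported' % (i, len(chunks)))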
ingest_week(year, iso_week, event, counter, song_id, song_label, song_album, song_artist)
ingest_month(year, month, event, counter, song_id, song_label, song_album, song_artist)
ingest_year(year, event, counter, song_id, song_label, song_album, song_artist)
ingest_alltime(event, counter, song_id, song_label, song_album, song_artist)
Easy wins, anyone?
Let's use the aggregation framework for these.
And also for these
for criteria in ("album", "artist", "label"):
# ...
pass
This is what an aggregation looks like
aggregation_result = db.stats_counters.aggregate([
    {"$match": {"event": event, "criteria": criteria, "valid": True}},
    {"$match": {"ymd": {"$gte": ymd_from, "$lte": ymd_to}}},
    {"$group": {"_id": "$criteria_value", "sum": {"$sum": "$counter"}}}
])
for result in aggregation_result['result']:
    # print repr(result)
    insertion_object = {"counter": result['sum'], "criteria": criteria,
                        "criteria_value": result["_id"], "event": event,
                        "year": year, "week": week, "label": "None"}
    inserted_id = db.stats_counters_week.insert(insertion_object)
- Almost a 20x speedup (for this part of the code).
- The aggregations are fast now, but overall it's still 6 hours for 100K records.
- But why do the inserts suck so badly?
- Maybe it's a ...
counter_object = db.stats_counters.find_one(search_criteria)
inserted_id = None
if counter_object is None:
    inserted_id = db.stats_counters.insert({"year": year, "month": month, "day": day,
                                            "label": None,
                                            "ymd": int("%02d%02d%02d" % (year, month, day)),
                                            "criteria": criteria, "criteria_value": criteria_value,
                                            "event": event, "valid": True, "counter": count})
else:
    inserted_id = counter_object['_id']
db.stats_counters.update(
    {'event': '%s', 'criteria': '%s', 'criteria_value': '%s', 'ymd': %s},
    {$set: {'year': %s,
            'month': %s,
            'day': %s,
            'label': '%s'},
     $inc: {'counter': %s}},
    {upsert: true})
Changing This (the find-then-insert pattern above)
To This (the single upsert)
Voilà
30 Minutes
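For reference, the same upsert expressed through PyMongo would look roughly like this (a sketch, not taken from the slides; field names follow the documents above):

def upsert_counter(db, event, criteria, criteria_value, ymd, year, month, day, label, count):
    # One round trip: create the document if it's missing, bump the counter either way.
    db.stats_counters.update(
        {'event': event, 'criteria': criteria,
         'criteria_value': criteria_value, 'ymd': ymd},
        {'$set': {'year': year, 'month': month, 'day': day, 'label': label},
         '$inc': {'counter': count}},
        upsert=True)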
Code Sober & RTFM
Now I can process 100K events, a day's worth of data, in half an hour. That's good, but if I discover a bug and want to recompute 90 days' worth of data, that's roughly 2 days of processing.
All Hail Pandas?
"""
A data frame is a table, or two-dimensional array-like structure,\
in which each column contains measurements on one variable, and each row contains one case.
Data frames are implemented natively have a minimal memory footprint and are blazing fast
Below is a dataframe containing #event_type,user,song,artist,album,date
"""
import time
import pandas as pd

df = pd.DataFrame()  # containing event_type, user, song, artist, album, date
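# (Not shown in the slides: one plausible way to populate df is straight from
#  the JSON Lines log; the file name here is illustrative.)
# import json
# records = [json.loads(line)['tags'] for line in open('events.log')]
# df = pd.DataFrame(records)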
"""
Code to store the aggregation results into mongo
"""
for criteria in ('song', 'artist', 'album'):
    checkpoint = time.time()
    event_counts = df.groupby(['$event', '$' + criteria])['$' + criteria].agg(['count'])
    checkpoint = time.time()
    for row in event_counts.iterrows():
        event = row[0][0]
        criteria_value = row[0][1]
        count = int(row[1])
        insert_into_mongo(event, criteria, criteria_value, count, date_list)
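insert_into_mongo itself isn't shown in the slides; a minimal sketch of what it might do, assuming it reuses the upsert pattern from earlier and that date_list carries (year, month, day):

def insert_into_mongo(event, criteria, criteria_value, count, date_list):
    # Assumed shape of date_list: the (year, month, day) being processed.
    year, month, day = date_list
    db.stats_counters.update(
        {'event': event, 'criteria': criteria, 'criteria_value': criteria_value,
         'ymd': int('%04d%02d%02d' % (year, month, day))},
        {'$set': {'year': year, 'month': month, 'day': day},
         '$inc': {'counter': count}},
        upsert=True)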
Time for the process
5 Seconds
Without delving into the UI side of things
TLDR
- No one tool does everything.
- MongoDB aggregations are used at the display layer to report from the precomputed results (a top-N sketch follows below).
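To illustrate that last point, here is a sketch (not from the slides; the event name and week values are illustrative) of a top-N query over the precomputed weekly counters:

# Top 10 songs by plays (here assumed to be the startTrackPlayback event)
# for one ISO week, straight from the precomputed weekly counters.
top_songs = db.stats_counters_week.aggregate([
    {'$match': {'event': 'startTrackPlayback', 'criteria': 'song',
                'year': 2015, 'iso_week': 34}},
    {'$group': {'_id': '$criteria_value', 'plays': {'$sum': '$counter'}}},
    {'$sort': {'plays': -1}},
    {'$limit': 10},
])
# Older PyMongo returns a dict with a 'result' key; newer versions return a cursor.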
Questions?
http://slides.com/iqbaltalaatbhatti/mo#/
About Me
https://twitter.com/iqqi84
https://au.linkedin.com/pub/iqbal-bhatti/14/63/493
CTO & Cofounder Patari[Pakistans largest Music Streaming Portal]
iqbal@patari.pk
CTO, Active Capital IT [software consultancy working in the cataloging, artwork & telecom sectors]
italaat@acit.com
A weekend of Analytics Hacking in MongoDB
By Iqbal Talaat Bhatti
Talk first given at the Sydney MongoDB Users Group on 25 August 2015.