Scaling Facebook Analytics Infrastructure with MongoDB & node.js


Matt Insler, West Coast Director of Engineering, Unified Social

mattinsler.com

github.com/mattinsler

www.unifiedsocial.com

Street Cred:


  • Defense - I could tell you, but then I'd have to...
  • Banking - MongoDB version 0.4?
  • PageLever - First engineering hire
  • Unified - Head of West Coast Dev

Mongo & node in production:




Analytics → Minilytics → Now

Facebook Analytics Data

(what we're working with)

  • Page-level data published once a day at 2 AM PT; it lags the current date by 2-7 days
  • Post-level data published every 10-15 minutes
  • Polling is the only option
  • API limits are roughly 600 calls per 600 seconds per API token (see the throttling sketch below)
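
For a feel of what polling under that cap looks like, here's a minimal node.js sketch (not the production code) that paces calls to stay under ~600 per 600 seconds for a single token; fetchInsights is a hypothetical stand-in for the real Graph API request:

const WINDOW_MS = 600 * 1000;   // 600-second rate-limit window
const MAX_CALLS = 600;          // ~600 calls per window per token
const sent = [];                // timestamps of recent calls for this token

async function throttled(fn) {
  const now = Date.now();
  while (sent.length && now - sent[0] > WINDOW_MS) sent.shift();
  if (sent.length >= MAX_CALLS) {
    const waitMs = WINDOW_MS - (now - sent[0]);
    await new Promise((resolve) => setTimeout(resolve, waitMs));
    sent.shift();               // the oldest call has now aged out of the window
  }
  sent.push(Date.now());
  return fn();
}

async function fetchInsights(pageId) {
  // hypothetical placeholder for the real call to graph.facebook.com
  return { pageId, data: [] };
}

// usage: poll a batch of Pages without tripping the limit
// for (const id of pageIds) await throttled(() => fetchInsights(id));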

Our first take, circa 2011:


  • Monolithic Rails App
  • MongoDB 1.8 on MongoHQ cloud
  • queue_classic for jobs
  • TONS of workers to pull down Page data
    • 1 worker per page at a time
  • ~1,000 Pages per hour


First Take - Problems:


  • Little visibility into errors or problems (our fault)
  • Just ~1,000 Pages per hour (Ruby's fault)
  • Data structure: KISS principle
  • Dots in keys (our fault)

{
  fb_page_id: '...',
  metric: 'page_fans',
  period: 'lifetime',
  data: {
    '2013-01-01': 0,
    '2013-01-02': 3,
    ...
  }
}
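
For context, reading one of those documents with the node.js MongoDB driver looks roughly like this (a sketch only; the db and collection names are assumptions, not ours). Note that each day's value is a key under `data`, so every sync adds keys and the document keeps growing:

const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const metrics = client.db('analytics').collection('metrics');

  // one document per Page + metric + period
  const doc = await metrics.findOne({
    fb_page_id: '...', metric: 'page_fans', period: 'lifetime'
  });
  console.log(doc && doc.data);   // { '2013-01-01': 0, '2013-01-02': 3, ... }

  await client.close();
}

main().catch(console.error);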

But it got the job done.

Then we started building new apps:


The Big Kahuna:


Requirements:


  • Collect, store and update data in "real-time", despite the lack of a true real-time API
  • Lots of data & lots of calls (posts, likes, comments, shares)
  • Run custom logic for derived metrics
 

Thousands of Facebook Pages
×
Dozens of API requests per Page
×
Every 10-15 minutes
×
24/7
= HARD.
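
To put rough, assumed numbers on that: 5,000 Pages × 24 API calls per Page is 120,000 calls per sync pass, and a pass every 10-15 minutes works out to roughly 130-200 calls per second, sustained around the clock.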

Time for BIG changes:

  • New sync system in node.js - a complete rewrite
  • New durable queuing system (home-grown)
    • Queue introspection
    • Error visibility
    • Operation statistics
    • Retry to deal with the 10-15% error rate from Facebook
    • Heroku auto-scaling (home-grown)
  • Write / upsert-only methodology (see the sketch after this list)
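
A minimal sketch of those last two ideas: retry around a flaky Facebook call, then an upsert-only write so a re-run job can't clobber anything. fetchMetric and the collection handle are hypothetical placeholders, not the real queue or sync code:

async function withRetry(fn, attempts) {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      console.error('attempt', i, 'failed:', err.message);
      if (i === attempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, 1000 * i));  // simple backoff
    }
  }
}

async function fetchMetric(pageId, metric, day) {
  // hypothetical placeholder for the real Graph API call
  return 0;
}

async function syncPageMetric(metrics, pageId, metric, day) {
  const value = await withRetry(() => fetchMetric(pageId, metric, day), 3);
  // upsert-only: no read-modify-write, so retries and concurrent
  // workers can't step on each other
  await metrics.updateOne(
    { fb_page_id: pageId, metric: metric, period: 'lifetime' },
    { $set: { ['data.' + day]: value } },
    { upsert: true }
  );
}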

Wins


  • Loads of concurrency
  • 1,000 Pages/hour => 1,000 Pages/minute
  • Fewer servers, less $$
  • 4k+ writes/sec to Mongo

Oh shit, mongo


  • Single documents for time-series data
    • Documents grow and get reallocated, leaving huge holes in memory (oops - compaction is a problem in Mongo too, not just SQL)
  • Global write lock (CPU-bound)
  • Write queue length > 50 on average, and up to 600 at times (see the check below)
  • Not a ton of room to take on more data/concurrency
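
For reference, that write-queue number comes straight out of serverStatus; a quick sketch (not our monitoring code) of checking it from node.js, with a placeholder connection URL:

const { MongoClient } = require('mongodb');

async function writeQueueLength(url) {
  const client = await MongoClient.connect(url);
  const status = await client.db('admin').command({ serverStatus: 1 });
  await client.close();
  // writers queued up behind the global lock
  return status.globalLock.currentQueue.writers;
}

// usage: writeQueueLength('mongodb://localhost:27017').then(console.log);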

Fixing mongo (thanks MongoHQ)


  • Time-series data laid out as single documents
  • Moved to bare-metal boxes
  • Sharded 8 DBs on 2 boxes for 8x write throughput (rough layout sketched below)
    • 4 masters / 4 slaves per box
  • Moved to SSDs (10x write throughput)
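
Roughly what that topology looks like from the mongos shell. The shard names, ports, and shard key here are illustrative assumptions, not the exact production setup:

// run against mongos
sh.addShard('rs1/box1:27018')
sh.addShard('rs2/box1:27019')
// ...and so on for the other six shards across the two boxes
sh.enableSharding('analytics')
sh.shardCollection('analytics.metrics', { fb_page_id: 1, metric: 1 })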

Lessons

  • You can scale up for less than scaling out
  • Sharding on a single box can fix write throughput
  • Sharding makes backups difficult to deal with

Oh shit, node


  • I/O concurrency is not the same as CPU concurrency
    • CPU-intensive tasks still block the event loop
  • RAM limitations on your hosting environment
    • 10MB of JSON on the wire can easily be 100MB once parsed (see the sketch after this list)
  • Log and catch errors everywhere
  • Beware of DB server connection issues
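
A quick way to see the JSON-vs-parsed-size point for yourself - an illustrative experiment, and the exact numbers will vary by node version:

const rows = [];
for (let i = 0; i < 200000; i++) {
  rows.push({ fb_page_id: String(i), metric: 'page_fans', value: i });
}
const json = JSON.stringify(rows);

const before = process.memoryUsage().heapUsed;
const parsed = JSON.parse(json);
const after = process.memoryUsage().heapUsed;

console.log('JSON bytes:', json.length);
console.log('rows parsed:', parsed.length);
console.log('approximate heap growth from parsing:', after - before, 'bytes');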

My friend node


  • Make small jobs that mainly do I/O
  • Limit work done per worker to fit in RAM
  • Auto-scale to do more work when necessary
  • Run jobs in child processes that catch 'uncaughtException' and log it (see the sketch below)
  • Monitor DB connection errors and restart automatically
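
A minimal sketch of those worker-side safety nets. The job body and restart policy are placeholders, and the heartbeat event name is from the current official node.js driver:

process.on('uncaughtException', (err) => {
  console.error('uncaught exception in worker:', err.stack);
  process.exit(1);   // die loudly; the parent / platform respawns the worker
});

const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  client.on('serverHeartbeatFailed', (event) => {
    console.error('mongo heartbeat failed:', event.failure);
    process.exit(1);   // restart rather than limp along with a bad connection
  });

  // ...do one small, mostly-I/O job here, sized to fit in RAM...

  await client.close();
}

main().catch((err) => {
  console.error('worker failed:', err);
  process.exit(1);
});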

So much more


  • Customer base is expanding
  • Data collected is expanding
  • So many social networks, so little time



Have ideas?

Want to help?



Unified Social

www.unifiedsocial.com
careers@unifiedsocial.com


Matt Insler

mattinsler.com
github.com/mattinsler
matt@unifiedsocial.com