Lucía: Daddy, 2014
Alex Fernández
Chief Senior Data Scientist at MediaSmart Mobile
Serve Mobile Ads
Real Time Bidding (RTB)
Performance campaigns
85 K+ requests / second
10 M+ impressions / day
40+ servers
20+ countries
Image source: Stupid Zombies 2
We help pay for your entertainment
How big is "big data"?
Check
Inventory
3+ billion bid offers per day
About 100 billion per month
Ask simple questions
On a budget!
We are a startup
Sent over RabbitMQ
Overwhelmed our systems
Stored in Couchbase
Too slow to query
Stored in Redis
Too much information
Visualize one month inventory
Run most queries in < 1 sec
Development: a few weeks
Budget: < 1% of income
~ $1K / month budget
At BSD'13
Well, most of the time
MetaMarkets
Hadoop
Cassandra
Amazon Redshift
Google BigQuery
Amazon Kinesis
Amazon Data Pipeline
Columnar database
"Petabyte-scale"
Cheap! All you can eat!
160 GB SSD: $700 / month
PostgreSQL Interface
Full scans are not evil anymore!
but the usual access pattern
Constant look-up times
1 second per 4 M values per node
Does not cache subqueries well
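The rule of thumb above lends itself to a quick back-of-the-envelope check. A sketch; only the ~4 M values/second/node figure comes from the slide, the example numbers are made up:

```python
# Rough full-scan estimate, using the rule of thumb quoted above:
# ~1 second per 4 M values per node.
VALUES_PER_SECOND_PER_NODE = 4_000_000

def scan_seconds(rows, columns_touched, nodes):
    """Estimated full-scan time: values read / aggregate throughput."""
    return (rows * columns_touched) / (VALUES_PER_SECOND_PER_NODE * nodes)

# e.g. 200 M rows, 1 column scanned, 2 nodes
print(scan_seconds(200_000_000, 1, 2))  # → 25.0
```

Being columnar, the database only pays for the columns a query actually touches, which is why full scans stay affordable.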
Goya: Interior of a Prison, 1793
region
day
hour
ad size
language
country code
operating system
...
3 billion events / day
Each event characterized by 24 fields
Each field can take 2~20 values
Can simply count events with fields
Recommended by Amazon
Type tables, main data with foreign keys
Data loading becomes very hard
Each combination of values is stored only once
Takes ~half the space, > 2x the querying time
Several refs: hits a bottleneck
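A minimal sketch of the normalized layout Amazon recommends: each field value is interned into a per-field type table, and event rows keep only integer foreign keys. Names are illustrative, not the actual schema:

```python
# Sketch: intern field values into per-field "type tables",
# so each event row stores only small integer foreign keys.
type_tables = {}  # field -> {value: id}

def intern(field, value):
    """Return the id of `value` in the type table for `field`, creating it if new."""
    table = type_tables.setdefault(field, {})
    return table.setdefault(value, len(table))

event = {"country_code": "ES", "os": "android", "ad_size": "320x50"}
row = [intern(f, v) for f, v in sorted(event.items())]
# Every field of every event needs a lookup-or-insert against its type table,
# which is what makes bulk loading so hard with this layout.
```

The space saving comes from the small integer keys; the loading pain comes from the lookup-or-insert step on every field.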
Data lines are loaded as is
Very simple code
Easy loading, no foreign keys
Redshift behaves surprisingly well
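The flat approach maps naturally onto Redshift's bulk loader. A hypothetical helper that renders a `COPY` statement; the table name, S3 path, and credentials string are placeholders, not the real ones:

```python
# Sketch: render a Redshift COPY statement to bulk-load flat data lines
# straight from S3. Table name, S3 path and credentials are placeholders.
def copy_statement(table, s3_path, credentials):
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"CREDENTIALS '{credentials}' "
        "DELIMITER '\\t' GZIP;"
    )

sql = copy_statement("raw_events", "s3://some-bucket/2014/05/01/", "aws_iam_role=...")
print(sql)
```

With no foreign keys to resolve, loading is a single bulk statement per batch: hence "very simple code".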
Group events by fields
Pure exponential:
2 values/field: 2^24 ≈ 16 M refs
3 values/field: 3^24 ≈ 282 billion refs
Actual values:
Hourly: ~200 M events → ~2 M refs
Daily: 3.5 billion events → 8 M refs
That is convenient!
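The collapse from the exponential bound to a few million refs is easy to reproduce with skewed data: when a few values per field dominate, events pile up on the same combinations. A synthetic sketch; the distribution weights are made up:

```python
# Sketch: count unique field combinations ("refs") in skewed synthetic events.
# Worst case for 24 fields: 2**24 = 16,777,216 combos at 2 values/field,
# 3**24 ~ 282 billion at 3 values/field. Real traffic is heavily skewed.
import random
from collections import Counter

random.seed(0)
FIELDS = 24

events = [
    tuple(random.choices(range(3), weights=[90, 9, 1], k=FIELDS))
    for _ in range(100_000)
]
refs = Counter(events)  # combination -> number of events

print(len(events), len(refs))  # far fewer refs than raw events
```

Whether this skew follows a normal distribution or something else is exactly the open question discussed next.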
Close, but no cigar
Thanks to an answer on
Cross Validated
There are many possible values
with different frequencies
Normal distribution?
Quite a few papers about it
None that clarify our case
Initially visible fields
Accumulated on their own
A whole day → 20K values!
A new accumulation process
Bids in real time:
Inventory analysis:
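One way to handle the visible fields is to accumulate each one on its own, per value, across the day, instead of keeping it inside every combination. A sketch of such an accumulation step; field names are illustrative:

```python
# Sketch: fold hourly batches into per-field, per-value daily totals,
# so visible fields get their own accumulated counts.
from collections import defaultdict

daily = defaultdict(lambda: defaultdict(int))  # field -> value -> count

def accumulate(batch):
    """Fold one hourly batch of events into the daily per-field totals."""
    for event in batch:
        for field, value in event.items():
            daily[field][value] += 1

accumulate([{"os": "android", "country_code": "ES"},
            {"os": "ios", "country_code": "ES"}])
print(dict(daily["country_code"]))  # → {'ES': 2}
```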
Bidding Data
Fewer events (10~60 M / day)
Additional fields (50% more)
Does not aggregate as nicely
24 + 12 fields
Many values to accumulate:
bids sent, bid price, bids won...
Actual aggregation values:
10 M → 2 M unique events
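Aggregating the bidding stream means summing several metrics per unique field combination, not just counting events. A sketch, with bid prices kept as integer micro-currency to avoid float drift; the metric names follow the slide, everything else is illustrative:

```python
# Sketch: one aggregated row per unique field combination,
# summing bids sent, bid price (in micros) and bids won.
from collections import defaultdict

totals = defaultdict(lambda: [0, 0, 0])  # key -> [bids_sent, bid_price, bids_won]

def add_bid(key, price_micros, won):
    row = totals[key]
    row[0] += 1              # bids sent
    row[1] += price_micros   # accumulated bid price
    row[2] += 1 if won else 0

add_bid(("ES", "android"), 400_000, True)
add_bid(("ES", "android"), 250_000, False)
# 2 raw bid events -> 1 unique aggregated row
```

The extra fields and the summed metrics are why this stream does not aggregate as nicely as the inventory data.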
We want MetaMarkets!
We want all possible fields,
just in case
All fields should go equally fast
Actual quotes from our users
Amazon Data Pipeline: not so hot
Big data is challenging
Near the operating limits of current tech
Performance should be a key concern in user-facing systems
Good engineering requires many iterations
Test everything
Aggregate everything
Build APIs to isolate systems
Deliver fast, iterate several times