Ville Tuulos
Sr. Principal Engineer
ville@adroll.com
Petabyte-Scale Data Workflows with Docker, Luigi, and Spot Instances
[Diagram: Everyone > People who visited the site > People who converted]
Challenge:
Build a Product that Finds New Customers for Any Company
[Diagram: the matrix - Billions of Cookies (People who Converted, People who Visited the Site, Everyone Else) x Millions of Features]
Core: The Matrix
Just a Machine Learning Problem
Machine Learning
Customer-Facing Metrics
Internal Analytics
A Modern Data-Driven Product
has multiple facets
Build Matrix
Parse Data
Compute Metrics
Build Model
Apply Model
Analytics Task 1
Analytics Task 2
A Modern Data-Driven Product
is powered by pipelines of data-processing tasks
[Timeline: MVP → Beta → EMEA Launch; Model v.0 → v.1 → v.2 → v.3; Metrics v.0 → v.1 → v.2; Ad-hoc Analytics → Deeper Analytics → Real BI; BI for Sales]
A Modern Data-Driven Product
is developed iteratively
          | Machine Learning | Analytics       | BI           |
People    | Low-Level Devs   | Data Scientists | Analysts     |
Languages | C, Python        | R, SQL          | SQL, Tableau |
Scale     | PB               | GB              | MB           |
A Modern Data-Driven Product
is developed by multiple people with different skillsets
Fast: Speed of development
Cheap: Cost of operation
Good: Robustness
A Modern Data-Driven Product
is constrained by eternal truths of product development
Data Ecosystems
The Easiest Way to Fast, Good, and Cheap?
AWS as a Data Ecosystem
Develop differentiating tech as fast as possible
Outsource everything else to AWS
No quiet, reverent cathedral-building here - rather, the Linux community seemed to resemble a great babbling bazaar of differing agendas and approaches out of which a coherent and stable system could seemingly emerge only by a succession of miracles.
Eric S. Raymond - The Cathedral and The Bazaar
Data Ecosystems
The Cathedral or The Bazaar?
AdRoll + AWS = Bazaar
Do whatever makes you productive with infinite, non-opinionated resources
Fast
1. How to package and deploy tasks?
2. How to manage resources and execute tasks?
3. How to orchestrate multiple dependent tasks?
with minimal overhead and loss of productivity
Managing Complexity
The Bazaar does not equate to anarchy and chaos
Packaging & Deployment
Build Matrix (C, libJudy, libCmph) → build_matrix image
Compute Metrics (Python > 2.7, numpy, Numba, LLVM) → comp_metrics image
Analytics Task (R > 3.2.0, dplyr, doMC, data.table) → do_analytics image
Docker Repository
Old New Idea: Containerize Batch Jobs
Containers - Keeping It Simple
Leverage existing skills and battle-hardened libraries while avoiding complexity, surprises, and brittleness.
Fast
Good
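A minimal sketch of the packaging step, in Python for illustration (the image names, per-task directories, and use of the docker CLI are assumptions, not AdRoll's actual tooling): each task is built into its own image with its own dependency stack baked in, then pushed to the private registry the scheduler pulls from.

# Sketch only: build each batch task into its own Docker image and push it.
import subprocess

REGISTRY = "docker:5000"      # registry host used in the Luigi example later on
TASKS = ["build_matrix", "comp_metrics", "do_analytics"]

def build_and_push(task, tag="latest"):
    image = "%s/%s:%s" % (REGISTRY, task, tag)
    # each task directory is assumed to carry a Dockerfile pinning its own
    # stack (C + libJudy, Python + numpy/Numba, R + dplyr, ...)
    subprocess.check_call(["docker", "build", "-t", image, "./%s" % task])
    subprocess.check_call(["docker", "push", image])
    return image

for task in TASKS:
    print(build_and_push(task))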
Task Execution
Quentin, An Elastic Scheduler
[Diagram: many Log Parser containers reading from S3]
Some tasks are distributed to maximize throughput between S3 and EC2.
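A minimal sketch of the throughput trick (bucket, prefix, and worker count are made up): the input is split across many S3 objects, and each parser instance reads them concurrently to keep the S3-to-EC2 pipe full.

# Sketch only: fan S3 reads out across threads on one instance.
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
BUCKET = "example-raw-logs"      # hypothetical bucket
PREFIX = "logs/2015-10-01/"      # hypothetical key prefix

def fetch(key):
    # read one log shard; a real parser would stream and parse it
    return key, len(s3.get_object(Bucket=BUCKET, Key=key)["Body"].read())

pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]

with ThreadPoolExecutor(max_workers=32) as pool:
    for key, size in pool.map(fetch, keys):
        print(key, size)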
[Diagram: a single Build Model task reading from S3]
Some tasks run on a single large instance:
bit-level optimized C code, memory-mapping 200GB of data in a single process.
Not everything has to be distributed.
Sometimes the most meaningful implementation is a good old Linux process that just happens to run on a large instance.
Fast
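The real task is bit-level optimized C; as a hedged illustration of the same idea in Python (the file name and the popcount loop are made up), memory-mapping lets one process scan far more data than fits in RAM, with the OS paging it in on demand.

# Sketch only: memory-map a large read-only file and scan it in one process.
import mmap

with open("matrix.bin", "rb") as f:                        # hypothetical file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        ones = 0
        chunk = 1 << 20                                    # 1 MB at a time
        for off in range(0, len(m), chunk):
            ones += sum(bin(b).count("1") for b in m[off:off + chunk])
        print("set bits:", ones)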
Task Scheduling & Resource Management
You can do this with ECS
[Diagram: Job Queue (Log Parser 34/128, Analytics Task 1/1, Build Model 2/2, Metrics 15/32) feeding Auto-Scaling Group A (r3.4xlarge, spot), Group B (d2.8xlarge), and Group C (m3.4xlarge, spot)]
Task Scheduling & Resource Management
Hopefully you will be able to do this with ECS
[Diagram: the same Job Queue and Auto-Scaling Groups, now with CloudWatch Metrics reporting Queue Size to Auto-Scaling Policies]
Maximize Resource Utilization with Elasticity
We run instances only when tasks need them.
This pattern works well with spot instances.
The spot market can be a tricky beast:
Prepare to change instance types on the fly.
Cheap
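A rough sketch of the glue between the queue and the auto-scaling groups (namespace, metric, and dimension names are assumptions): the scheduler periodically publishes queue depth as a custom CloudWatch metric, and each group has a scaling policy keyed on it, so spot capacity follows the workload.

# Sketch only: publish the job queue depth so auto-scaling policies can act on it.
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_queue_depth(queue_name, depth):
    cloudwatch.put_metric_data(
        Namespace="BatchScheduler",                  # hypothetical namespace
        MetricData=[{
            "MetricName": "QueueSize",
            "Dimensions": [{"Name": "Queue", "Value": queue_name}],
            "Value": depth,
            "Unit": "Count",
        }],
    )

report_queue_depth("log-parser", 34)                 # e.g. called on a timer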
Task Orchestration
Luigi & S3
Build Matrix
Parse Data
Compute Metrics
Build Model
Apply Model
Analytics Task 1
Analytics Task 2
Tasks Form a Dependency Graph
Data Forms a Dependency Graph
Parsed Data
Matrix
Model
Metrics
import json
import luigi
import luigi.s3

# BlobTask and scheduler come from elsewhere in the pipeline codebase.
class PivotRunner(luigi.Task):
    blob_path = luigi.Parameter()
    out_path = luigi.Parameter()
    segments = luigi.Parameter()

    def requires(self):
        # upstream task that produces the input blob
        return BlobTask(blob_path=self.blob_path)

    def output(self):
        # the task's state is an immutable file in S3
        return luigi.s3.S3Target(self.out_path)

    def run(self):
        # describe the containerized job and hand it to the scheduler
        q = {
            "cmdline": ["pivot %s {%s}" % (self.out_path, self.segments)],
            "image": "docker:5000/pivot:latest",
            "caps": "type=r3.4xlarge"
        }
        scheduler.run_queries('pivot', [json.dumps(q)], max_retries=1)
Dependencies are modeled as Luigi tasks
Make the Data Pipeline and its State Explicit
Model dependencies between inputs and outputs,
not between tasks.
State is stored in immutable files in S3.
Easy to understand, access, and troubleshoot:
Perfect data fabric.
Good
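For instance, a hypothetical downstream task (not from the deck) only has to require PivotRunner and open its S3 target; the dependency is between files in S3, so re-running the pipeline redoes only the steps whose outputs are missing.

# Hypothetical follow-up task that consumes PivotRunner's output from S3.
class PivotSummary(luigi.Task):
    blob_path = luigi.Parameter()
    pivot_path = luigi.Parameter()
    out_path = luigi.Parameter()
    segments = luigi.Parameter()

    def requires(self):
        return PivotRunner(blob_path=self.blob_path,
                           out_path=self.pivot_path,
                           segments=self.segments)

    def output(self):
        return luigi.s3.S3Target(self.out_path)

    def run(self):
        with self.input().open('r') as src, self.output().open('w') as dst:
            for line in src:
                dst.write(line)      # a real task would aggregate or summarize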
Job Scheduler
Auto-Scaling Groups
EC2 (Spot) Instances
S3
Putting It All Together
Conclusion
AdRoll Prospecting
- Implemented in C, Bash, Python, Lua, R, JavaScript, Erlang, and a custom DSL
- Powered by a graph of ~50 Luigi tasks
- Updates ~20,000 models daily, ingesting hundreds of TBs of raw data
- At peak, tasks consume over 20TB of RAM
- Hundreds of the largest EC2 spot instances launched and killed daily
- Billions of cookies analyzed daily to power the customer-facing dashboard
- Beta used successfully by 100+ customers
- Implemented by a team of 6 engineers, on schedule
Thank You!
AdRoll is hiring
ville@adroll.com