People who Converted
People who Visited the Site
Everyone Else
Millions of Features
Billions of Cookies
Machine Learning
Customer-Facing Metrics
Internal Analytics
Build Matrix
Parse Data
Compute Metrics
Build Model
Apply Model
Analytics Task 1
Analytics Task 2
Ad-hoc Analytics
MVP
Beta
EMEA
Launch
Deeper Analytics
Real BI
Metrics v.2
Model v.3
BI for Sales
Metrics v.1
Metrics v.0
Model v.2
Model v.1
Model v.0
|           | Machine Learning | Analytics       | BI           |
|-----------|------------------|-----------------|--------------|
| People    | Low-Level Devs   | Data Scientists | Analysts     |
| Languages | C, Python        | R, SQL          | SQL, Tableau |
| Scale     | PB               | GB              | MB           |
Speed of development
Cost of operation
Robustness
No quiet, reverent cathedral-building here - rather, the Linux community seemed to resemble a great babbling bazaar of differing agendas and approaches out of which a coherent and stable system could seemingly emerge only by a succession of miracles.
Eric S. Raymond - The Cathedral and the Bazaar
Do whatever makes you productive with infinite, non-opinionated resources
1. How to package and deploy tasks?
2. How to manage resources and execute tasks?
3. How to orchestrate multiple dependent tasks?
with minimal overhead and loss of productivity
Build Matrix
C, libJudy, libCmph
Compute Metrics
Python ≥ 2.7, numpy, Numba, LLVM
Analytics Task
R ≥ 3.2.0, dplyr, doMC, data.table
build_matrix
comp_metrics
do_analytics
Docker Repository
Leverage existing skills and battle-hardened libraries while avoiding complexity, surprises, and brittleness.
Log Parser
Log Parser
Log Parser
S3
to maximize throughput between S3 and EC2
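One way to get that throughput is to run many parser workers concurrently; a minimal Python sketch, with a hypothetical `download_key` callable standing in for the actual S3 fetch:

```python
# Fan log files out to parallel workers. Many concurrent requests hide
# per-object S3 latency and help saturate the EC2 instance's network link.
from concurrent.futures import ThreadPoolExecutor

def parse_logs(keys, download_key, num_workers=16):
    # download_key is a hypothetical callable that fetches and parses
    # one S3 key; results come back in input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(download_key, keys))
```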
Build Model
S3
bit-level optimized C code,
memory-mapping 200GB of data in a single process
Sometimes the most meaningful implementation is
a good old Linux process
that just happens to run on a large instance.
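The production implementation is C, but the memory-mapping idea can be sketched in Python: the kernel pages data in on demand, so one process can address far more data than it ever holds resident at once.

```python
# Sketch: scan a large file through a read-only memory map. Pages are
# faulted in lazily by the kernel, so nothing close to the full file
# needs to fit in RAM. byte_sum is an illustrative stand-in computation.
import mmap

def byte_sum(path, chunk=1 << 20):
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map yields bytes without an explicit read() loop.
            for off in range(0, len(mm), chunk):
                total += sum(mm[off:off + chunk])
    return total
```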
Log Parser 34/128
Analytics Task 1/1
Build Model 2/2
Metrics 15/32
Job Queue
Auto-Scaling Group A
Auto-Scaling Group B
Auto-Scaling Group C
r3.4xlarge, spot
d2.8xlarge
m3.4xlarge, spot

CloudWatch Metrics
Queue Size
Auto-Scaling Policies
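The feedback loop boils down to a function from queue depth to desired capacity; the formula below is an illustrative assumption, not the actual policy:

```python
# Sketch of a scaling decision driven by a queue-size metric: roughly
# one instance per batch of queued tasks, capped at a budget limit.
import math

def desired_instances(queue_size, tasks_per_instance, max_instances):
    # An empty queue scales the group to zero - instances run only
    # when tasks need them.
    want = math.ceil(queue_size / tasks_per_instance)
    return min(int(want), max_instances)
```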
We run instances only when tasks need them.
This pattern works well with spot instances.
The spot market can be a tricky beast:
Prepare to change instance types on the fly.
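A sketch of that fallback, with `request_spot` as a hypothetical callable that returns True when a spot request for the given type is fulfilled:

```python
# Walk a ranked list of acceptable instance types until the spot
# market can actually deliver one.
def launch_with_fallback(request_spot,
                         candidates=("r3.4xlarge", "m3.4xlarge", "d2.8xlarge")):
    for itype in candidates:
        if request_spot(itype):
            return itype
    raise RuntimeError("no spot capacity for any candidate type")
```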
Build Matrix
Parse Data
Compute Metrics
Build Model
Apply Model
Analytics Task 1
Analytics Task 2
Parsed Data
Matrix
Model
Metrics
import json

import luigi
import luigi.s3

class PivotRunner(luigi.Task):
    blob_path = luigi.Parameter()
    out_path = luigi.Parameter()
    segments = luigi.Parameter()

    def requires(self):
        return BlobTask(blob_path=self.blob_path)

    def output(self):
        return luigi.s3.S3Target(self.out_path)

    def run(self):
        q = {
            "cmdline": ["pivot %s {%s}" % (self.out_path, self.segments)],
            "image": "docker:5000/pivot:latest",
            "caps": "type=r3.4xlarge",
        }
        scheduler.run_queries('pivot', [json.dumps(q)], max_retries=1)
Model dependencies between inputs and outputs,
not between tasks.
State is stored in immutable files in S3.
Easy to understand, access, and troubleshoot:
Perfect data fabric.
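The same file-as-state idea, reduced to a stdlib sketch: a step is complete exactly when its output file exists, so re-running the workflow redoes only the missing pieces. (In Luigi, the `output()` target's existence check plays this role for S3.)

```python
# Idempotent step: the output file doubles as the completion record.
# If it exists, the step is a no-op on re-runs - restart the whole
# workflow freely and only unfinished work executes.
import os

def run_if_missing(out_path, produce):
    if not os.path.exists(out_path):
        tmp = out_path + ".tmp"
        with open(tmp, "w") as f:
            f.write(produce())
        os.rename(tmp, out_path)  # atomic publish, like an S3 PUT
    with open(out_path) as f:
        return f.read()
```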
Job Scheduler
Auto-Scaling Groups
EC2 (Spot) Instances
S3
ville@adroll.com