Ville Tuulos

Sr. Principal Engineer

ville@adroll.com

Petabyte-Scale Data Workflows with Docker, Luigi, and Spot Instances

[Diagram: People who converted · People who visited the site · Everyone]

Challenge:

Build a Product that Finds New Customers for Any Company

[Diagram: People who Converted · People who Visited the Site · Everyone Else]

[Matrix: Billions of Cookies × Millions of Features]

Core: The Matrix

Just a Machine Learning Problem

Machine Learning

Customer-Facing Metrics

Internal Analytics

A Modern Data-Driven Product

has multiple facets

[Pipeline diagram: Parse Data, Build Matrix, Compute Metrics, Build Model, Apply Model, Analytics Task 1, Analytics Task 2]

A Modern Data-Driven Product

is powered by pipelines of data-processing tasks

[Timeline diagram: Ad-hoc Analytics, MVP, Beta, EMEA, Launch, Deeper Analytics, Real BI, BI for Sales; Metrics v.0 → v.1 → v.2; Model v.0 → v.1 → v.2 → v.3]

A Modern Data-Driven Product

is developed iteratively

            Machine Learning   Analytics         BI
People      Low-Level Devs     Data Scientists   Analysts
Languages   C, Python          R, SQL            SQL, Tableau
Scale       PB                 GB                MB

A Modern Data-Driven Product

is developed by multiple people with different skillsets

Fast: Speed of development

Cheap: Cost of operation

Good: Robustness

A Modern Data-Driven Product

is constrained by eternal truths of product development

Data Ecosystems

The Easiest Way to Fast, Good, and Cheap?

AWS as a Data Ecosystem

Develop differentiating tech as fast as possible

Outsource everything else to AWS

No quiet, reverent cathedral-building here - rather, the Linux community seemed to resemble a great babbling bazaar of differing agendas and approaches out of which a coherent and stable system could seemingly emerge only by a succession of miracles.

Eric S. Raymond - The Cathedral and The Bazaar

Data Ecosystems

The Cathedral or The Bazaar?

AdRoll + AWS = Bazaar

Do whatever makes you productive with infinite, non-opinionated resources

Fast

1. How to package and deploy tasks?

2. How to manage resources and execute tasks?

3. How to orchestrate multiple dependent tasks?

 

with minimal overhead and loss of productivity

Managing Complexity

The Bazaar does not equate to anarchy and chaos

Packaging & Deployment

[Diagram: each task, with its own stack, is packaged as a Docker image and pushed to a Docker Repository]

  • Build Matrix (C, libJudy, libCmph) → build_matrix
  • Compute Metrics (Python > 2.7, numpy, Numba, LLVM) → comp_metrics
  • Analytics Task (R > 3.2.0, dplyr, doMC, data.table) → do_analytics

Old New Idea: Containerize Batch Jobs

 

Containers - Keeping It Simple

Leverage existing skills and battle-hardened libraries while avoiding complexity, surprises, and brittleness.

Fast

Good
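As a hedged illustration of what a containerized batch job looks like in practice (not AdRoll's actual scheduler code), the Docker SDK for Python can pull a task image and run it to completion. The image tag mirrors the do_analytics example above and the command is hypothetical.

# Sketch only: running one containerized batch task with the Docker SDK
# for Python (docker-py). Image tag and command are illustrative; the real
# system submits such jobs through the Quentin scheduler instead.
import docker

client = docker.from_env()

# Pull the task image from the private registry and run it to completion.
logs = client.containers.run(
    image="docker:5000/do_analytics:latest",   # assumed image tag
    command="Rscript /app/analytics.R",        # hypothetical entry point
    remove=True,                               # clean up the container afterwards
)
print(logs.decode())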

Task Execution

Quentin, An Elastic Scheduler

[Diagram: many parallel Log Parser tasks exchanging data with S3]

Some tasks are distributed

to maximize throughput between S3 and EC2
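The throughput argument is mostly about parallelism: a single S3 connection rarely saturates an instance's NIC, so each distributed parser fetches many objects concurrently. A minimal boto3 sketch, with a made-up bucket and prefix:

# Sketch: maximizing S3 -> EC2 throughput by downloading many objects in
# parallel. Bucket, prefix, and local path are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "example-log-bucket"   # hypothetical
PREFIX = "logs/2015-11-01/"     # hypothetical

s3 = boto3.client("s3")

def fetch(key):
    local = os.path.join("/mnt/data", os.path.basename(key))
    s3.download_file(BUCKET, key, local)
    return local

keys = [obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
        for obj in page.get("Contents", [])]

# Many concurrent connections keep the instance's network busy.
with ThreadPoolExecutor(max_workers=32) as pool:
    for path in pool.map(fetch, keys):
        pass  # hand each downloaded file to the parser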

[Diagram: a single Build Model task backed by S3]

Some tasks run on a single large instance

bit-level optimized C code, memory-mapping 200GB of data in a single process

Not everything has to be distributed

 

Sometimes the most meaningful implementation is a good old Linux process that just happens to run on a large instance.

Fast
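To illustrate the memory-mapping idea (the production step is bit-level optimized C, not Python), a hedged sketch: map the file into the process's address space and let the kernel page data in on demand, so one process can work over far more data than it could comfortably read() at once. The file path is a placeholder.

# Sketch of the memory-mapping idea in Python; the real Build Model step is
# optimized C. The file path is hypothetical.
import mmap

with open("/mnt/data/matrix.bin", "rb") as f:        # imagine a ~200GB file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The OS pages in only the regions we touch; no giant read() needed.
        header = mm[:16]
        # ... scan or random-access the rest of the mapping here ...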

Task Scheduling & Resource Management

You can do this with ECS

[Diagram: a Job Queue (Log Parser 34/128, Analytics Task 1/1, Build Model 2/2, Metrics 15/32) feeding Auto-Scaling Groups A, B, and C, backed by r3.4xlarge (spot), d2.8xlarge, and m3.4xlarge (spot) instances]

Task Scheduling & Resource Management

Hopefully you will be able to do this with ECS

[Same diagram, with the Job Queue's Queue Size published as CloudWatch Metrics that drive Auto-Scaling Policies for the three groups]

Maximize Resource Utilization with Elasticity

We run instances only when tasks need them.

This pattern works well with spot instances.

 

The spot market can be a tricky beast:

Prepare to change instance types on the fly.

 

Cheap
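A hedged sketch of the feedback loop above, assuming a boto3-based scheduler component: the job queue's depth is published as a custom CloudWatch metric, and each auto-scaling group carries a policy that tracks it. The namespace, metric, and dimension names are illustrative, not AdRoll's.

# Sketch: publishing job queue depth as a custom CloudWatch metric so
# auto-scaling policies can react to it. All names are illustrative.
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_queue_size(queue_name, size):
    cloudwatch.put_metric_data(
        Namespace="BatchScheduler",              # hypothetical namespace
        MetricData=[{
            "MetricName": "QueueSize",
            "Dimensions": [{"Name": "Queue", "Value": queue_name}],
            "Value": size,
            "Unit": "Count",
        }],
    )

# e.g. called once a minute by the scheduler:
report_queue_size("log_parser", 94)   # 128 wanted, 34 running

An alarm-driven step-scaling or target-tracking policy on each auto-scaling group can then grow or shrink the fleet as the queue deepens or drains.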

Task Orchestration

Luigi & S3

[Pipeline diagram: Parse Data, Build Matrix, Compute Metrics, Build Model, Apply Model, Analytics Task 1, Analytics Task 2]

Tasks Form a Dependency Graph 

[The same pipeline diagram, annotated with the data artifacts that flow between tasks: Parsed Data, Matrix, Model, Metrics]

Data, Not Tasks, Forms a Dependency Graph

import json

import luigi
import luigi.s3

# BlobTask and the scheduler client (the Quentin API) are AdRoll-internal
# and defined elsewhere.

class PivotRunner(luigi.Task):
    blob_path = luigi.Parameter()
    out_path = luigi.Parameter()
    segments = luigi.Parameter()

    def requires(self):
        # Upstream dependency: the input blob must exist first.
        return BlobTask(blob_path=self.blob_path)

    def output(self):
        # The task is considered done once this S3 object exists.
        return luigi.s3.S3Target(self.out_path)

    def run(self):
        # Submit a containerized job to the scheduler (Quentin).
        q = {
            "cmdline": ["pivot %s {%s}" % (self.out_path, self.segments)],
            "image": "docker:5000/pivot:latest",
            "caps": "type=r3.4xlarge"
        }
        scheduler.run_queries('pivot', [json.dumps(q)], max_retries=1)

Dependencies are modeled as luigi.Task subclasses

Make the Data Pipeline and its State Explicit

Model dependencies between inputs and outputs,

not between tasks.

 

State is stored in immutable files in S3.

Easy to understand, access, and troubleshoot:

Perfect data fabric.

Good
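To make the inputs-and-outputs point concrete, here is a hypothetical downstream task (not from the talk): it requires PivotRunner, so Luigi schedules it only once the pivot output exists in S3, and it reads that object through self.input(). Class name and paths are made up.

# Sketch: a downstream task that depends on PivotRunner's S3 output
# rather than on the task itself. Names and paths are illustrative.
import luigi
import luigi.s3


class PivotReport(luigi.Task):
    blob_path = luigi.Parameter()
    out_path = luigi.Parameter()
    segments = luigi.Parameter()

    def requires(self):
        # Runs only after the pivot output exists in S3.
        return PivotRunner(blob_path=self.blob_path,
                           out_path=self.out_path,
                           segments=self.segments)

    def output(self):
        return luigi.s3.S3Target(self.out_path + ".report")

    def run(self):
        with self.input().open("r") as src, self.output().open("w") as dst:
            # Trivial "report": count the lines of the pivot output.
            dst.write("%d\n" % sum(1 for _ in src))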

[Architecture diagram: Job Scheduler → Auto-Scaling Groups → EC2 (Spot) Instances, with S3 as the shared data layer]

Putting It All Together

Conclusion

AdRoll Prospecting

  • Implemented in C, Bash, Python, Lua, R, JavaScript, Erlang, and a custom DSL
  • Powered by a graph of ~50 Luigi tasks
  • Updates ~20,000 models daily, ingesting hundreds of TBs of raw data
  • At peak, tasks consume over 20TB of RAM
  • Hundreds of the largest EC2 spot instances launched and killed daily
  • Billions of cookies analyzed daily to power the customer-facing dashboard
  • Beta used successfully by 100+ customers
  • Implemented by a team of 6 engineers, on schedule

Thank You!

AdRoll is hiring

 

ville@adroll.com
