Human Centric Machine Learning Infrastructure @ Netflix

Ville Tuulos

QCon SF, November 2018

Meet Alex, a new chief data scientist at Caveman Cupcakes

You are hired!

We need a dynamic pricing model.

Optimal pricing model

Great job! The model works perfectly!

Could you predict churn too?

Optimal pricing model

Optimal churn model

Alex's model

Good job again!

Promising results!

Can you include a causal attribution model for marketing?

Optimal pricing model

Optimal churn model

Alex's model

Attribution model

Are you sure these results make sense?

Take two

Meet the new data science team at Caveman Cupcakes

You are hired!

Pricing model

Churn model

Attribution model

the human is the bottleneck

the human is the bottleneck

VS

Build

Data Warehouse

Compute Resources

Job Scheduler

Versioning

Collaboration Tools

Model Deployment

Feature Engineering

ML Libraries

How much the data scientist cares

How much infrastructure is needed

Build

Deploy

No plan survives contact with the enemy

No model survives contact with reality

Our ML infra supports two human activities: building and deploying data science workflows.

Screenplay Analysis Using NLP

Fraud Detection

Title Portfolio Optimization

Estimate Word-of-Mouth Effects

Incremental Impact of Marketing

Classify Support Tickets

Predict Quality of Network

Content Valuation

Cluster Tweets

Intelligent Infrastructure

Machine Translation

Optimal CDN Caching

Predict Churn

Content Tagging

Optimize Production Schedules

ML infrastructure stack, top to bottom:

models: ML Libraries (R, XGBoost, TF etc.)

prototyping: Notebooks (Nteract)

compute: Job Scheduler (Meson), Compute Resources (Titus)

data: Query Engine (Spark), Data Lake (S3)

Bad Old Days

A data scientist built an NLP model in Python. Easy and fun!

How to run at scale?

Custom Titus executor.

How to schedule the model to update daily? Learn about the job scheduler.

How to access data at scale?

Slow!

How to expose the model to a custom UI? Custom web backend.

Time to production: 4 months

How to monitor models in production?

How to iterate on a new version without breaking the production version?

How to let another data scientist iterate on her version of the model safely?

How to debug yesterday's failed production run?

How to backfill historical data?

How to make this faster?

ML infrastructure stack with Metaflow, top to bottom:

ML Wrapping: Metaflow

models: ML Libraries (R, XGBoost, TF etc.)

prototyping: Notebooks (Nteract)

compute: Job Scheduler (Meson), Compute Resources (Titus)

data: Query Engine (Spark), Data Lake (S3)

Metaflow

Build









def compute(input):
    output = my_model(input)
    return output

output

input

compute

How to get started?

# python myscript.py

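from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

  @step
  def start(self):
    # Branch: steps a and b can run in parallel
    self.next(self.a, self.b)

  @step
  def a(self):
    self.next(self.join)

  @step
  def b(self):
    self.next(self.join)

  @step
  def join(self, inputs):
    # inputs gives access to the results of both branches
    self.next(self.end)

  @step
  def end(self):
    # Final step of the flow, matching "end" in the diagram
    pass

if __name__ == '__main__':
  MyFlow()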

How to structure my code?

Flow diagram: start -> A, B -> join -> end

# python myscript.py run
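One way to read the flow above is as a directed acyclic graph in which each `self.next(...)` call declares outgoing edges. The sketch below is not Metaflow code. It uses Python's standard `graphlib` module (Python 3.9+) and a hand-written `TRANSITIONS` dict, both chosen here purely for illustration, to derive an execution order that respects the branch and the join.

```python
from graphlib import TopologicalSorter

# Hypothetical mirror of the self.next() calls in MyFlow: step -> next steps.
TRANSITIONS = {
    "start": ["a", "b"],
    "a": ["join"],
    "b": ["join"],
    "join": ["end"],
    "end": [],
}

def execution_order(transitions):
    """Return the steps in an order where every step follows its parents."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    parents = {step: set() for step in transitions}
    for step, next_steps in transitions.items():
        for nxt in next_steps:
            parents[nxt].add(step)
    return list(TopologicalSorter(parents).static_order())

order = execution_order(TRANSITIONS)
print(order)
```

Steps `a` and `b` have no ordering constraint between them, which is what lets a scheduler run them in parallel.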

metaflow("MyFlow") %>%
  step(
    step = "start",
    next_step = c("a", "b")
  ) %>%
  step(
    step = "A",
    r_function = r_function(a_func),
    next_step = "join"
  ) %>%
  step(
    step = "B",
    r_function = r_function(b_func),
    next_step = "join"
  ) %>%
  step(
    step = "Join",
    r_function = r_function(join,
                 join_step = TRUE),

How to deal with models written in R?

Flow diagram: start -> A, B -> join -> end

# Rscript myscript.R

Metaflow adoption at Netflix

134 projects on Metaflow as of November 2018

How to prototype and test my code locally?

Flow diagram: start (x=0) -> A (x+=2), B (x+=3) -> join (max(A.x, B.x)) -> end

# python myscript.py resume B



@step
def start(self):
  self.x = 0
  self.next(self.a, self.b)

@step
def a(self):
  self.x += 2 
  self.next(self.join)
  
@step
def b(self):
  self.x += 3 
  self.next(self.join)
  
@step
def join(self, inputs):
  self.out = max(i.x for i in inputs)
  self.next(self.end)
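To make the data flow concrete, here is a minimal plain-Python sketch of the same graph. It is not Metaflow itself. Each branch receives its own copy of the state, the join sees both copies, and a `resume_from` argument re-runs only the named step while reusing cached results for the others. The `run` function, the `cache` dict, and `resume_from` are illustrative assumptions, not Metaflow's API.

```python
import copy

def start(state):
    state["x"] = 0

def a(state):
    state["x"] += 2

def b(state):
    state["x"] += 3

def join(inputs):
    # The join sees one state per incoming branch.
    return {"out": max(s["x"] for s in inputs)}

def run(cache=None, resume_from=None):
    """Run start -> (a, b) -> join, reusing cached step results on resume."""
    cache = {} if cache is None else cache
    executed = []

    def run_step(name, func, parent_state):
        # Reuse a cached result unless this is the step being resumed.
        if name in cache and name != resume_from:
            return cache[name]
        state = copy.deepcopy(parent_state)  # each branch gets its own copy
        func(state)
        cache[name] = state
        executed.append(name)
        return state

    s0 = run_step("start", start, {})
    branches = [run_step("a", a, s0), run_step("b", b, s0)]
    result = join(branches)  # cheap, so it is always recomputed in this sketch
    return result, executed, cache

result, executed, cache = run()
print(result, executed)

# Resume from step b: start and a come from the cache, only b runs again.
resumed_result, resumed_steps, _ = run(cache=cache, resume_from="b")
print(resumed_result, resumed_steps)
```

Because the branches work on copies, the increment in `a` never leaks into `b`, and the join picks the larger of 2 and 3.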

How to get access to more CPUs, GPUs, or memory?

Flow diagram: start -> A, B -> join -> end









@titus(cpu=16, gpu=1)
@step
def a(self):
  tensorflow.train()
  self.next(self.join)
  
@titus(memory=200000)
@step
def b(self):
  massive_dataframe_operation()
  self.next(self.join)

A: 16 cores, 1 GPU

B: 200 GB RAM

# python myscript.py run
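A decorator such as `@titus` can be thought of as attaching resource requirements to a step so that a scheduler can read them later. The sketch below shows that general pattern in plain Python. The `resources` decorator and the `requirements` attribute are hypothetical names chosen for illustration and say nothing about how Metaflow or Titus implement it.

```python
import functools

def resources(**requirements):
    """Attach resource requirements to a function without changing its behavior."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        wrapper.requirements = dict(requirements)
        return wrapper
    return decorator

@resources(cpu=16, gpu=1)
def train():
    return "trained"

# Assumption: memory is expressed in megabytes, so 200000 is roughly 200 GB.
@resources(memory=200000)
def transform():
    return "transformed"

# A scheduler could inspect the metadata before deciding where to run each step.
for step_func in (train, transform):
    print(step_func.__name__, step_func.requirements, step_func())
```

Keeping the requirements next to the step means the same script can run locally or remotely without a separate configuration file.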

How to distribute work over many parallel jobs?

Flow diagram: start -> A (foreach) -> join -> end






@step
def start(self):
  self.grid = ['x', 'y', 'z']
  self.next(self.a, foreach='grid')

@titus(memory=10000)
@step
def a(self):
  self.x = ord(self.input)
  self.next(self.join)
  
@step
def join(self, inputs):
  self.out = max(i.x for i in inputs)
  self.next(self.end)
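The `foreach` pattern can be sketched outside Metaflow with the standard library: each item of the grid is handled by an independent task and the join reduces over the results. A thread pool stands in for separate remote jobs here, which is an assumption made only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def a(item):
    # Same work as step a above: one task per grid item.
    return ord(item)

def join(results):
    return max(results)

grid = ["x", "y", "z"]
with ThreadPoolExecutor(max_workers=len(grid)) as pool:
    xs = list(pool.map(a, grid))  # map preserves the order of the grid

out = join(xs)
print(xs, out)
```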

40% of projects run steps outside their dev environment.

How quickly do they start using Titus?








from metaflow import Table

@titus(memory=200000, network=20000)
@step
def b(self):
   # Load data from S3 to a dataframe
   # at 10Gbps
   df = Table('vtuulos', 'input_table')
   self.next(self.end)

How to access large amounts of input data?

Flow diagram: start -> A, B -> join -> end, with B reading its input from S3

Case Study: Marketing Cost per Incremental Watcher

1. Build a separate model for every new title with marketing spend. 

Parallel foreach.

2. Load and prepare input data for each model.

Download Parquet directly from S3.

Total amount of model input data: 890GB.

3. Fit a model.

Train each model on an instance with 400GB of RAM, 16 cores.

The model is written in R.

4. Share updated results.

Collect results of individual models, write to a table.

Results shown on a Tableau dashboard.

Deploy





from metaflow import Flow, namespace

# Access Savin's runs
namespace('user:savin')
run = Flow('MyFlow').latest_run
print(run.id) # = 234 
print(run.tags) # = ['unsampled_model']

# Access David's runs
namespace('user:david')
run = Flow('MyFlow').latest_run
print(run.id) # = 184 
print(run.tags) # = ['sampled_model']

# Access everyone's runs
namespace(None)
run = Flow('MyFlow').latest_run
print(run.id) # = 184 
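Conceptually, a namespace acts as a filter over a shared history of runs. The following plain-Python sketch illustrates that idea with an in-memory list. The `Run` dataclass, the `RUNS` data, and the `latest_run` helper are made up for this example and are not Metaflow's client API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Run:
    run_id: int
    user: str
    tags: tuple

# Hypothetical run history shared by everyone, ordered oldest to newest.
RUNS = [
    Run(101, "savin", ("unsampled_model",)),
    Run(102, "david", ("sampled_model",)),
    Run(103, "savin", ("unsampled_model",)),
]

def latest_run(namespace: Optional[str]) -> Run:
    """Return the newest run visible in the namespace; None means everyone's runs."""
    visible = [r for r in RUNS if namespace is None or namespace == f"user:{r.user}"]
    if not visible:
        raise LookupError(f"no runs in namespace {namespace!r}")
    return visible[-1]

print(latest_run("user:savin").run_id)
print(latest_run("user:david").run_id)
print(latest_run(None).run_id)
```

Filtering at read time means nobody's results are overwritten: every run stays in the history, and the namespace only decides which ones a given user sees by default.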

How to version my results and access results by others?

Flow diagram: start -> A, B -> join -> end

david: sampled_model

savin: unsampled_model

How to deploy my workflow to production?

Flow diagram: start -> A, B -> join -> end

# python myscript.py meson create

26% of projects get deployed to the production scheduler.

How quickly does the first deployment happen?

How to monitor models and examine results?

Flow diagram: start (x=0) -> A (x+=2), B (x+=3) -> join (max(A.x, B.x)) -> end

How to deploy results as a microservice?

Flow diagram: start (x=0) -> A (x+=2), B (x+=3) -> join (max(A.x, B.x)) -> end, with results served by Metaflow hosting







from metaflow import WebServiceSpec
from metaflow import endpoint

class MyWebService(WebServiceSpec):

    @endpoint
    def show_data(self, request_dict):
        # TODO: real-time predict here
        result = self.artifacts.flow.x
        return {'result': result}

# curl http://host/show_data

{"result": 3}
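As a rough picture of what a hosting layer does with a class like `MyWebService`, the sketch below registers handler functions by name and dispatches a request path to them, returning JSON. The `endpoint` decorator, the `ENDPOINTS` registry, the `ARTIFACTS` dict, and the `handle` function are hypothetical stand-ins written for this example, not the Metaflow hosting API.

```python
import json

ENDPOINTS = {}

# Hypothetical stand-in for artifacts produced by a finished flow run.
ARTIFACTS = {"x": 3}

def endpoint(func):
    """Register a handler under the URL path that matches its name."""
    ENDPOINTS["/" + func.__name__] = func
    return func

@endpoint
def show_data(request_dict):
    # A real service might run a prediction here instead of a lookup.
    return {"result": ARTIFACTS["x"]}

def handle(path, request_dict=None):
    """Dispatch a request to its handler and serialize the response."""
    handler = ENDPOINTS.get(path)
    if handler is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(handler(request_dict or {}))

print(handle("/show_data"))
```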

Case Study: Launch Date Schedule Optimization

1. Batch optimize launch date schedules for new titles daily. 

Batch optimization deployed on Meson.

2. Serve results through a custom UI.

Results deployed on Metaflow Hosting.

3. Support arbitrary what-if scenarios in the custom UI.

Run optimizer in real-time in a custom web endpoint.

Metaflow

diverse problems

diverse people

help people build

help people deploy

diverse models

happy people, healthy business

thank you!

 

@vtuulos

vtuulos@netflix.com

Photo Credits

Bruno Coldiori

https://www.flickr.com/photos/br1dotcom/8900102170/

https://www.maxpixel.net/Isolated-Animal-Hundeportrait-Dog-Nature-3234285
