Modeling Pipeline

what?
why?
how?

What?

The modeling pipeline is a tool

that you can use to streamline

your model building workflow.

Why?

Your time is valuable; the pipeline saves you time. 

If we all use the same methods to construct our models, comparative evaluation becomes much easier. 

We will all have the same language to talk about our models.

How?

Input:

  • Configuration file

Output:

  • Analytics dataset in Hive
  • Training dataset in Hive
  • Holdout dataset in Hive
  • PMML model in MySQL
  • Scored holdout in Redshift

Configuration File

  • Features, filters, and target variables correspond to those mapped in hypercube framework
  • The core table is the hub of your star schema

Analytics Dataset

  • The analytics dataset is built by creating a helper table with visitor or user IDs
  • These are mapped to either a training or a holdout label
  • You define the ratio of IDs in each group

Analytics Dataset

  • A feature table is created with the features you defined, filtered as specified
  • Features are binned, as specified in hypercube maps
  • This feature table is joined with the helper table

Training/

Holdout Datasets

  • These datasets are a subset of the analytics dataset
  • Numerator and denominator columns are added
INSERT OVERWRITE TABLE model_epmi02_training_prod 
select cast(numerator as float)/cast(denominator as float)*cast(1000 as float) target_variable, 
cast(denominator as float)/cast(1000 as float) weight, 
course_epmv,course_rpmv,course_interest,subcat_interest,persona 
from (select course_epmv,course_rpmv,course_interest,subcat_interest,persona, 
sum(enrolled) numerator, 
sum(impressions) denominator 
from dm_dataset_epmi02_prod where push_flag=0 and search_flag=0 and dataset='training' 
group by 
course_epmv,course_rpmv,course_interest,subcat_interest,persona) x;

PMML Model

  • PMML is a modeling language that is generalizable and can be used to describe any model type
  • Using your training dataset, a PMML model is built
  • Currently, the model type is a decision tree
  • This model is store in MySQL table variant_configs

Scored Holdout

  • This table contains all of your features, and a final score (probability rate) for each record
  • This table is uploaded to Redshift for you to analyze in Tableau

Next Steps

  • --PMML flag so you can point to a model made outside of the pipeline
  • Support for other model types
  • Support for complex filters

Now 
Build
One

https://udemywiki.atlassian.net/wiki/display/ENG/Modeling+Pipeline

Modeling Pipeline

By marswilliams

Modeling Pipeline

  • 526