Modeling Pipeline

what?
why?
how?

What?

The modeling pipeline is a tool that you can use to streamline your model-building workflow.

Why?

Your time is valuable; the pipeline saves you time.

If we all use the same methods to construct our models, comparative evaluation becomes much easier.

We will all have the same language to talk about our models.

How?

Input:

Configuration file

Output:

Analytics dataset in Hive
Training dataset in Hive
Holdout dataset in Hive
PMML model in MySQL
Scored holdout in Redshift

Configuration File

Features, filters, and target variables correspond to those mapped in the hypercube framework
The core table is the hub of your star schema

Analytics Dataset

The analytics dataset is built by creating a helper table with visitor or user IDs
These are mapped to either a training or a holdout label
You define the ratio of IDs in each group
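The ID-to-label mapping above can be illustrated with a minimal Python sketch. It assumes a hash-based assignment so that the same ID always lands in the same group. The slides do not say how the pipeline actually assigns IDs, so the function name `assign_split`, the `training_ratio` parameter, and the sample IDs are all hypothetical.

```python
import hashlib


def assign_split(visitor_id: str, training_ratio: float = 0.8) -> str:
    """Deterministically label an ID as 'training' or 'holdout'.

    Hypothetical helper: hashing is an assumption, not the pipeline's
    documented method.
    """
    if not 0.0 <= training_ratio <= 1.0:
        raise ValueError("training_ratio must be between 0 and 1")
    digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "training" if bucket < training_ratio else "holdout"


# Build a small helper table: one label per visitor or user ID.
helper_table = {vid: assign_split(vid) for vid in ("v1001", "v1002", "v1003")}
```

Because the label depends only on the ID and the ratio, rebuilding the helper table gives the same split each time, which keeps holdout records out of training across runs.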

Analytics Dataset (continued)

A feature table is created with the features you defined, filtered as specified
Features are binned as specified in the hypercube maps
This feature table is joined with the helper table
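As a rough illustration of the binning and join steps above, the following Python sketch bins one feature and attaches each row's training/holdout label. The bin edges, the sample rows, and the `bin_value` helper are hypothetical. In the pipeline the bins come from the hypercube maps and the join runs in Hive.

```python
from bisect import bisect_right


def bin_value(value, edges):
    """Return the bin index for value, given ascending bin boundaries.

    Values below edges[0] fall in bin 0; values at or above edges[-1]
    fall in the last bin, len(edges).
    """
    return bisect_right(edges, value)


# Hypothetical bin boundaries for one feature.
course_epmv_edges = [1.0, 5.0, 20.0]

# Hypothetical feature rows, already filtered.
feature_rows = [
    {"visitor_id": "v1001", "course_epmv": 0.4},
    {"visitor_id": "v1002", "course_epmv": 7.5},
    {"visitor_id": "v1003", "course_epmv": 25.0},
]

# Helper table: ID -> dataset label (v1003 is deliberately missing).
helper_table = {"v1001": "training", "v1002": "holdout"}

# Inner join on visitor_id: rows without a label are dropped.
analytics_rows = [
    {
        "visitor_id": row["visitor_id"],
        "course_epmv": bin_value(row["course_epmv"], course_epmv_edges),
        "dataset": helper_table[row["visitor_id"]],
    }
    for row in feature_rows
    if row["visitor_id"] in helper_table
]
```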

Training/Holdout Datasets

These datasets are subsets of the analytics dataset
Numerator and denominator columns are added and used to compute the target variable and the weight

-- Aggregate enrollments and impressions per feature combination, then derive
-- the target (enrollments per 1,000 impressions) and the weight.
INSERT OVERWRITE TABLE model_epmi02_training_prod
SELECT
  CAST(numerator AS FLOAT) / CAST(denominator AS FLOAT) * CAST(1000 AS FLOAT) AS target_variable,
  CAST(denominator AS FLOAT) / CAST(1000 AS FLOAT) AS weight,
  course_epmv, course_rpmv, course_interest, subcat_interest, persona
FROM (
  SELECT
    course_epmv, course_rpmv, course_interest, subcat_interest, persona,
    SUM(enrolled) AS numerator,
    SUM(impressions) AS denominator
  FROM dm_dataset_epmi02_prod
  WHERE push_flag = 0 AND search_flag = 0 AND dataset = 'training'
  GROUP BY
    course_epmv, course_rpmv, course_interest, subcat_interest, persona
) x;
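To make the arithmetic in the query above concrete, here is a small Python sketch of the same aggregation. It is an illustration only, not part of the pipeline. The `build_training_rows` function and the sample rows are hypothetical, and the rows are assumed to be already filtered. A group with zero impressions gets no target here, on the assumption that Hive yields NULL for a division by zero.

```python
from collections import defaultdict


def build_training_rows(rows, feature_names):
    """Group rows by feature values, then compute target and weight."""
    totals = defaultdict(lambda: [0, 0])  # key -> [numerator, denominator]
    for row in rows:
        key = tuple(row[name] for name in feature_names)
        totals[key][0] += row["enrolled"]
        totals[key][1] += row["impressions"]

    result = []
    for key, (numerator, denominator) in totals.items():
        # Target: enrollments per 1,000 impressions (None if no impressions).
        target = numerator / denominator * 1000 if denominator else None
        out = dict(zip(feature_names, key))
        out["target_variable"] = target
        # Weight: impressions in thousands.
        out["weight"] = denominator / 1000
        result.append(out)
    return result


# Hypothetical, already-filtered rows with a single feature.
sample_rows = [
    {"persona": "a", "enrolled": 1, "impressions": 1000},
    {"persona": "a", "enrolled": 2, "impressions": 3000},
    {"persona": "b", "enrolled": 0, "impressions": 500},
]
training_rows = build_training_rows(sample_rows, ["persona"])
```

For persona "a", 3 enrollments over 4,000 impressions give a target of about 0.75 enrollments per 1,000 impressions and a weight of 4, so groups with more impressions count for more when the model is fit.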

PMML Model

PMML (Predictive Model Markup Language) is an XML-based standard that can describe many types of models
A PMML model is built from your training dataset
Currently, the model type is a decision tree
The model is stored in the MySQL table variant_configs

Scored Holdout

This table contains all of your features, and a final score (probability rate) for each record
This table is uploaded to Redshift for you to analyze in Tableau

Next Steps

Now Build One

marswilliams
