Modeling Pipeline
What?
Why?
How?

What?
The modeling pipeline is a tool that you can use to streamline your model-building workflow.
Why?
Your time is valuable; the pipeline saves you time.
If we all use the same methods to construct our models, comparative evaluation becomes much easier.
We will all have the same language to talk about our models.
How?
Input:
- Configuration file
Output:
- Analytics dataset in Hive
- Training dataset in Hive
- Holdout dataset in Hive
- PMML model in MySQL
- Scored holdout in Redshift
Configuration File

- Features, filters, and target variables correspond to those mapped in the hypercube framework
- The core table is the hub of your star schema
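As a rough illustration of what such a configuration might hold, here is a Python sketch. The key names (`core_table`, `features`, `filters`, `target`, `training_ratio`), the placeholder table name, the 0.8 ratio, and the `validate_config` helper are all hypothetical; the pipeline's real file format is not shown in these slides. The feature, filter, and target column names are borrowed from the training query later in the deck.

```python
# Hypothetical configuration layout. Key names are assumptions made for this
# sketch, not the pipeline's real schema.
EXAMPLE_CONFIG = {
    "core_table": "my_core_table",  # placeholder: the hub of your star schema
    "features": [
        "course_epmv", "course_rpmv", "course_interest",
        "subcat_interest", "persona",
    ],
    "filters": {"push_flag": 0, "search_flag": 0},
    "target": {"numerator": "enrolled", "denominator": "impressions"},
    "training_ratio": 0.8,  # assumed share of IDs labeled as training
}

REQUIRED_KEYS = {"core_table", "features", "filters", "target", "training_ratio"}


def validate_config(config):
    """Return a list of problems found in a config dict (empty means OK)."""
    problems = []
    missing = REQUIRED_KEYS - set(config)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if not config.get("features"):
        problems.append("at least one feature is required")
    ratio = config.get("training_ratio")
    if not isinstance(ratio, (int, float)) or not 0 < ratio < 1:
        problems.append("training_ratio must be strictly between 0 and 1")
    return problems
```

Checking the configuration before any table is built is one way to fail fast on a typo instead of partway through a long Hive job.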
Analytics Dataset
- The analytics dataset is built by creating a helper table with visitor or user IDs
- These are mapped to either a training or a holdout label
- You define the ratio of IDs in each group
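The slides do not say how IDs are assigned to each group. One common approach, sketched below as an assumption, is to hash each ID into a number between 0 and 1 and compare it to the ratio, so the same ID always gets the same label. The function names and the `holdout` label string are hypothetical; `training` matches the value used in the training query later in the deck.

```python
import hashlib


def assign_dataset(visitor_id, training_ratio):
    """Map an ID to 'training' or 'holdout', deterministically."""
    digest = hashlib.md5(str(visitor_id).encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 16 ** 8
    return "training" if bucket < training_ratio else "holdout"


def build_helper_table(visitor_ids, training_ratio=0.8):
    """Return (id, label) pairs, one per visitor or user ID."""
    return [(vid, assign_dataset(vid, training_ratio)) for vid in visitor_ids]
```

Because the label depends only on the ID, rerunning the split does not move a visitor from one group to the other.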
Analytics Dataset (continued)
- A feature table is created with the features you defined, filtered as specified
- Features are binned, as specified in the hypercube maps
- This feature table is joined with the helper table
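The binning and join steps above can be sketched in Python as follows. The bin edges, bin labels, and function names are hypothetical; in the pipeline the bins come from the hypercube maps and the join runs in Hive. The join is assumed here to be an inner join on the ID.

```python
from bisect import bisect_right


def bin_value(value, edges, labels):
    """Return the label of the bin containing value.

    edges are sorted boundaries; a value equal to an edge goes to the upper bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


def build_feature_table(raw_rows, bin_specs):
    """Replace each raw feature value with its bin label."""
    table = []
    for row in raw_rows:
        binned = {"visitor_id": row["visitor_id"]}
        for feature, (edges, labels) in bin_specs.items():
            binned[feature] = bin_value(row[feature], edges, labels)
        table.append(binned)
    return table


def join_with_helper(feature_table, helper_table):
    """Inner-join feature rows to (id, label) pairs on visitor_id."""
    dataset_by_id = dict(helper_table)
    joined = []
    for row in feature_table:
        label = dataset_by_id.get(row["visitor_id"])
        if label is not None:
            joined.append({**row, "dataset": label})
    return joined


# Hypothetical bin map: below 1.0 is "low", 1.0 up to 5.0 is "medium", else "high".
BIN_SPECS = {"course_rpmv": ([1.0, 5.0], ["low", "medium", "high"])}
```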
Training/Holdout Datasets
- These datasets are a subset of the analytics dataset
- Numerator and denominator columns are added
-- Aggregate the training rows per feature combination, then derive the
-- target (enrollments per 1,000 impressions) and a weight (impressions / 1,000).
INSERT OVERWRITE TABLE model_epmi02_training_prod
SELECT
  CAST(numerator AS FLOAT) / CAST(denominator AS FLOAT) * CAST(1000 AS FLOAT) AS target_variable,
  CAST(denominator AS FLOAT) / CAST(1000 AS FLOAT) AS weight,
  course_epmv, course_rpmv, course_interest, subcat_interest, persona
FROM (
  SELECT
    course_epmv, course_rpmv, course_interest, subcat_interest, persona,
    SUM(enrolled) AS numerator,
    SUM(impressions) AS denominator
  FROM dm_dataset_epmi02_prod
  WHERE push_flag = 0 AND search_flag = 0 AND dataset = 'training'
  GROUP BY course_epmv, course_rpmv, course_interest, subcat_interest, persona
) x;
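To make the arithmetic in the query concrete, here is a small Python sketch of the same aggregation over in-memory rows. The function name is hypothetical, and dropping groups with zero impressions is an assumption of this sketch rather than something the query does explicitly.

```python
from collections import defaultdict

FEATURES = ("course_epmv", "course_rpmv", "course_interest",
            "subcat_interest", "persona")


def build_training_rows(rows):
    """Group training rows by feature values and derive target and weight."""
    totals = defaultdict(lambda: [0, 0])  # key -> [numerator, denominator]
    for row in rows:
        # Same filter as the WHERE clause in the query.
        if (row["push_flag"] != 0 or row["search_flag"] != 0
                or row["dataset"] != "training"):
            continue
        key = tuple(row[f] for f in FEATURES)
        totals[key][0] += row["enrolled"]
        totals[key][1] += row["impressions"]

    out = []
    for key, (numerator, denominator) in totals.items():
        if denominator == 0:
            continue  # assumption: a group with no impressions has no usable rate
        record = dict(zip(FEATURES, key))
        record["target_variable"] = numerator / denominator * 1000
        record["weight"] = denominator / 1000
        out.append(record)
    return out
```

For example, a feature combination with 3 enrollments over 2,000 impressions gets a target of 3 / 2000 × 1000 = 1.5 enrollments per 1,000 impressions and a weight of 2000 / 1000 = 2.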
PMML Model
- PMML (Predictive Model Markup Language) is an XML-based standard for describing models; it is general enough to represent many model types
- Using your training dataset, a PMML model is built
- Currently, the model type is a decision tree
- This model is stored in the MySQL table variant_configs
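For readers who have not seen PMML, the sketch below builds a tiny one-split regression tree as a PMML document using Python's standard library. It only illustrates the shape of the format: the function name, the threshold, and the scores are hypothetical, the output is not validated against the PMML schema, and the pipeline's actual models are fitted from the training dataset rather than written by hand.

```python
import xml.etree.ElementTree as ET


def build_tree_pmml(feature, threshold, low_score, high_score):
    """Return a one-split regression TreeModel as a PMML string."""
    pmml = ET.Element("PMML", {"version": "4.2",
                               "xmlns": "http://www.dmg.org/PMML-4_2"})
    ET.SubElement(pmml, "Header", {"description": "one-split example tree"})

    # Declare the input feature and the target.
    fields = ET.SubElement(pmml, "DataDictionary", {"numberOfFields": "2"})
    for name in (feature, "target_variable"):
        ET.SubElement(fields, "DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})

    tree = ET.SubElement(pmml, "TreeModel", {"functionName": "regression"})
    schema = ET.SubElement(tree, "MiningSchema")
    ET.SubElement(schema, "MiningField", {"name": feature})
    ET.SubElement(schema, "MiningField",
                  {"name": "target_variable", "usageType": "target"})

    # Root node matches everything; its two children carry the split.
    root = ET.SubElement(tree, "Node", {"score": str(low_score)})
    ET.SubElement(root, "True")
    low = ET.SubElement(root, "Node", {"score": str(low_score)})
    ET.SubElement(low, "SimplePredicate",
                  {"field": feature, "operator": "lessThan", "value": str(threshold)})
    high = ET.SubElement(root, "Node", {"score": str(high_score)})
    ET.SubElement(high, "SimplePredicate",
                  {"field": feature, "operator": "greaterOrEqual", "value": str(threshold)})

    return ET.tostring(pmml, encoding="unicode")
```

The returned string is plain XML, so it can be stored as text in a database column.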
Scored Holdout
- This table contains all of your features, and a final score (probability rate) for each record
- This table is uploaded to Redshift for you to analyze in Tableau
Next Steps
- A --PMML flag so you can point to a model built outside of the pipeline
- Support for other model types
- Support for complex filters
Now Build One
https://udemywiki.atlassian.net/wiki/display/ENG/Modeling+Pipeline
Modeling Pipeline
By marswilliams