Feature Pipeline
Building the Foundation
Overview
[Architecture diagram. Pipeline components: user-site interaction, logging, raw data tables, feature tables, summary tables, data mining tables, score tables, the course feature table, hypercubes, and the metacube. Supporting systems: middleware, MySQL, Redshift, Hive, Redis, Elasticsearch, Tableau, and R.]
"A 'feature' in machine learning is a variable that explains the behavior of a response variable."
- Larry Wai
"A response variable measures a key metric that we want to affect for the business, like revenue or NPS."
- Larry Wai
"Features are used to predict whether or not a particular item is relevant to the user, as measured by response variables."
- Larry Wai
Goals
- Feature tables are used to create predictive models to make course recommendations or improve search results
Features
- We can update our model and increase its predictive power by adding more relevant features
- Over time, we may learn that some features are more or less valuable than once thought
Feature tables, each classified in the diagram as either a Core Table or a Helper Table:
- enrollment_funnel
- course_consumption
- course_rating_mult
- user_persona
- course_performance
- user_creation
- course_interest
- subcat_interest
- todate_course_rating
- course_list_price
- user_consumption
- user_activity
- impression_funnel
- visitor_path
- course_free_mult
- course_feature
- course_metadata
Impression Funnel
This table is the backbone of many feature tables and also serves as the core join table for several data mining tables and hypercubes.
CREATE TABLE impression_funnel (
  userid bigint,
  visitorid bigint,
  courseid bigint,
  push_flag int,
  search_flag int,
  markedasseen_flag int,
  islanded_flag int,
  enrolled int,
  booking float,
  revenue float,
  minconsumed_1wk int,
  nps_flag_1wk int,
  nps_1wk int,
  context_type string,
  query string,
  context string,
  subcontext string,
  normalized_query string
)
PARTITIONED BY (datestamp date);
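A hypothetical example of how this backbone table can be queried, using only the columns defined above (the datestamp value is illustrative, not taken from the pipeline):
-- Daily enrollment rate and revenue by context type.
SELECT context_type,
  COUNT(*) AS impressions,
  SUM(enrolled) AS enrollments,
  SUM(enrolled) / COUNT(*) AS enrollment_rate,
  SUM(revenue) AS total_revenue
FROM impression_funnel
WHERE datestamp = '2016-01-01'
GROUP BY context_type;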
User Persona
This table provides a way to analyze enrollment and revenue in relation to user persona. A user persona summarizes a user's enrollment, consumption, and transaction patterns.
A feature hypercube is created from this table, which in turn feeds the experiment hypercube.
CREATE TABLE user_persona (
userid bigint,
persona varchar(14),
past_enrollments int,
past_bookings int,
past_minconsumed int,
past_bookings_per_enrollment int,
past_minconsumed_per_enrollment int )
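As a sketch of the analysis this table enables (enrollment and revenue by persona), the two tables defined in this deck could be joined as follows; the query is illustrative, not taken from the pipeline:
-- Enrollment and revenue broken down by user persona.
SELECT p.persona,
  SUM(f.enrolled) AS enrollments,
  SUM(f.revenue) AS total_revenue
FROM impression_funnel f
JOIN user_persona p ON f.userid = p.userid
GROUP BY p.persona;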
How Are These Tables Built?
Hive Daily ETL
This class handles updates to partitioned tables, which are typically partitioned by date.
Hive ETL
This class handles updates to full-refresh tables, which don't require historical data persistence.
Extract
- Select from external data source
- Unload to S3 bucket
Transform
- Create external table in Hive
- Read from S3 bucket
Load
- Select from external tables and Hive tables
- Transform data
- Can use Python and Java UDFs
- Insert into Hive summary table (see the sketch after this list)
Upload
- Upload summary table to Redshift for consumers using R, Tableau, or Chartio
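A minimal HiveQL sketch of the extract-transform-load pattern above. The table names, columns, and S3 path are illustrative assumptions, not the pipeline's actual definitions:
-- Extract/Transform: expose raw data unloaded to S3 as a Hive external table.
-- (raw_enrollments and the S3 location are hypothetical.)
CREATE EXTERNAL TABLE raw_enrollments (
  userid bigint,
  courseid bigint,
  amount float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://example-bucket/raw_enrollments/';

-- Load: a date-partitioned summary table, refreshed one datestamp at a time.
CREATE TABLE enrollment_summary (
  userid bigint,
  courseid bigint,
  bookings double
)
PARTITIONED BY (datestamp date);

INSERT OVERWRITE TABLE enrollment_summary PARTITION (datestamp = '2016-01-01')
SELECT userid,
  courseid,
  SUM(amount) AS bookings
FROM raw_enrollments
GROUP BY userid, courseid;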
Visitor Course Score Workflow
Crafting Recommendations
Helper Tables
- Features for existing visitors (those who return within 14 days)
- Features for new visitors (first time visitors and those who return after the 14-day window)
Score Tables
- Scores for 3k courses for existing visitors
- Top scores for existing visitors
CREATE TABLE visitor_course_tmp_scores (
visitorid bigint,
courseid bigint,
epmi float,
rpe float,
cpe float,
npe float
)
Sample rows (visitorid, courseid, epmi, rpe, cpe, npe):
-1  249454  6.864902   1.0195512   4.507095  50.436234
-1  620966  6.864902   8.059692   12.811697  45.934177
-1  492808  6.864902   1.0195512   7.3605323 45.934177
-1  619578  6.864902  12.884531   12.811697  56.168747
INSERT INTO TABLE visitor_course_tmp_scores
SELECT visitorid,
courseid,
pscore('epmi01',
'course_epmv',cast(course_epmv AS string),
'course_rpmv',cast(course_rpmv AS string),
'course_interest',cast(course_interest AS string),
'course_subcat_interest',cast(course_subcat_interest AS string),
'persona',cast(persona AS string)),
pscore('rpe03',
'course_bpe',cast(course_bpe AS string)),
pscore('cpe01',
'avg_minconsumed_1wk',cast(avg_minconsumed_1wk AS string),
'course_bpe',cast(course_bpe AS string),
'persona',cast(persona AS string)),
pscore('npe01',
'avg_npsbp_1wk',cast(avg_npsbp_1wk AS string),
'avg_minconsumed_1wk',cast(avg_minconsumed_1wk AS string),
'persona',cast(persona AS string))
FROM visitor_course_features_new_visitors;
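The pscore calls above are backed by the Java UDF shown next. Before running the insert, the function would be registered in the Hive session along these lines; the jar path is an illustrative assumption:
-- Register the Java UDF as pscore (jar path is hypothetical).
ADD JAR /path/to/predictivemodel.jar;
CREATE TEMPORARY FUNCTION pscore AS 'com.udemy.predictivemodel.PredictiveModelUdf';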
package com.udemy.predictivemodel;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PredictiveModelUdf extends UDF {

    // Cache of loaded PMML models, keyed by model id, so each model is parsed once per JVM.
    private HashMap<String, PmmlModel> pmmlModelMap = new HashMap<String, PmmlModel>();

    public void loadPmmlModel(String modelId) throws IOException {
        pmmlModelMap.put(modelId, new PmmlModel(modelId));
    }

    // Called from HiveQL as pscore(modelId, featureId1, value1, featureId2, value2, ...).
    public double evaluate(String modelId, String... featureIDValues) throws Exception {
        if (!pmmlModelMap.containsKey(modelId)) {
            loadPmmlModel(modelId);
        }
        PmmlModel pmmlModel = pmmlModelMap.get(modelId);
        HashMap<String, Object> scoringFeatureMap = pmmlModel.getScoringFeatureMap();
        // Feature ids and values arrive as alternating varargs; copy them into the scoring map.
        for (int i = 0; i < featureIDValues.length; i += 2) {
            String featureID = featureIDValues[i];
            String featureValue = featureIDValues[i + 1];
            scoringFeatureMap.put(featureID, featureValue);
        }
        try {
            return pmmlModel.getScore(scoringFeatureMap);
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new Exception(
                "[PredictiveModelUdf.evaluate ArrayIndexOutOfBoundsException - featureIDValues="
                + Arrays.toString(featureIDValues) + "]: " + e.getMessage());
        } catch (Exception e) {
            throw new Exception("[PredictiveModelUdf.evaluate Other Exception]: " + e);
        }
    }
}