user-site interaction
logging
summary tables
feature tables
data mining tables
raw data tables
hypercubes
metacube
Tableau
score tables
course feature table
Redis
middleware
MySQL
Redshift
Hive
Elasticsearch
R
A 'feature' in machine learning is a variable that explains the behavior of a response variable.
- Larry Wai
A response variable measures a key metric that we want to affect for the business, like revenue or NPS.
- Larry Wai
Features are used to predict whether a particular item is relevant to the user, as measured by response variables.
- Larry Wai
enrollment_funnel
course_consumption
course_rating_mult
user_persona
course_performance
user_creation
course_interest
subcat_interest
todate_course_rating
course_list_price
user_consumption
user_activity
impression_funnel
Core Table
Helper Table
visitor_path
course_free_mult
course_feature
course_metadata
This table is the backbone of many feature tables and also serves as the core join table for several data mining tables and hypercubes.
CREATE TABLE impression_funnel (
userid bigint, visitorid bigint,
courseid bigint, push_flag int,
search_flag int, markedasseen_flag int,
islanded_flag int, enrolled int,
booking float, revenue float,
minconsumed_1wk int, nps_flag_1wk int,
nps_1wk int, context_type string,
query string, context string,
subcontext string, normalized_query string )
PARTITIONED BY (datestamp date)
This table provides a way to analyze enrollment and revenue in relation to user persona. A user persona summarizes a user's enrollment, consumption, and transaction patterns.
A feature hypercube is built from this table and in turn feeds the experiment hypercube.
CREATE TABLE user_persona (
userid bigint,
persona varchar(14),
past_enrollments int,
past_bookings int,
past_minconsumed int,
past_bookings_per_enrollment int,
past_minconsumed_per_enrollment int )
This class handles updates to partitioned tables, which are typically partitioned by date.
This class handles updates to full refresh tables, which don't require historical data persistence.
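The difference between the two update styles can be sketched in Java as a pair of HiveQL statement builders. This is only an illustration under stated assumptions, not the actual classes: the names TableUpdateSketch, partitionedUpdateSql, and fullRefreshSql, the staging tables, and the example date are all hypothetical. Only the datestamp partition column comes from the impression_funnel DDL.

```java
import java.time.LocalDate;

// Hypothetical sketch: builds the HiveQL a table updater might issue.
final class TableUpdateSketch {

    // Partitioned update: rewrite only the partition for one date,
    // leaving the data for every other date in place.
    static String partitionedUpdateSql(String table, String source, LocalDate date) {
        return "INSERT OVERWRITE TABLE " + table
             + " PARTITION (datestamp='" + date + "')"
             + " SELECT * FROM " + source;
    }

    // Full refresh: replace the entire table contents; no history is kept.
    static String fullRefreshSql(String table, String source) {
        return "INSERT OVERWRITE TABLE " + table
             + " SELECT * FROM " + source;
    }

    public static void main(String[] args) {
        // Table and staging names here are examples only.
        System.out.println(partitionedUpdateSql(
            "impression_funnel", "impression_funnel_staging", LocalDate.of(2015, 1, 31)));
        System.out.println(fullRefreshSql("user_persona", "user_persona_staging"));
    }
}
```

In the partitioned case the source is assumed to hold only the non-partition columns for the given date, so rerunning the update for one day is safe and does not touch other days.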
CREATE TABLE visitor_course_tmp_scores (
visitorid bigint,
courseid bigint,
epmi float,
rpe float,
cpe float,
npe float
)
-1 249454 6.864902 1.0195512 4.507095 50.436234
-1 620966 6.864902 8.059692 12.811697 45.934177
-1 492808 6.864902 1.0195512 7.3605323 45.934177
-1 619578 6.864902 12.884531 12.811697 56.168747
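To make the column order of the sample rows explicit, the following Java sketch parses one row into named fields. It assumes the rows are whitespace-delimited exactly as shown and follow the DDL column order (visitorid, courseid, epmi, rpe, cpe, npe). The class name ScoreRowSketch is hypothetical.

```java
// Hypothetical holder for one row of visitor_course_tmp_scores.
final class ScoreRowSketch {
    final long visitorId;
    final long courseId;
    final double epmi;
    final double rpe;
    final double cpe;
    final double npe;

    ScoreRowSketch(long visitorId, long courseId,
                   double epmi, double rpe, double cpe, double npe) {
        this.visitorId = visitorId;
        this.courseId = courseId;
        this.epmi = epmi;
        this.rpe = rpe;
        this.cpe = cpe;
        this.npe = npe;
    }

    // Parses one whitespace-delimited row in DDL column order.
    static ScoreRowSketch parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 columns, got " + f.length);
        }
        return new ScoreRowSketch(
            Long.parseLong(f[0]), Long.parseLong(f[1]),
            Double.parseDouble(f[2]), Double.parseDouble(f[3]),
            Double.parseDouble(f[4]), Double.parseDouble(f[5]));
    }

    public static void main(String[] args) {
        ScoreRowSketch row = parse("-1 249454 6.864902 1.0195512 4.507095 50.436234");
        System.out.println("course " + row.courseId + " epmi=" + row.epmi);
    }
}
```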
INSERT INTO TABLE visitor_course_tmp_scores
SELECT visitorid,
courseid,
pscore('epmi01',
'course_epmv',cast(course_epmv AS string),
'course_rpmv',cast(course_rpmv AS string),
'course_interest',cast(course_interest AS string),
'course_subcat_interest',cast(course_subcat_interest AS string),
'persona',cast(persona AS string)),
pscore('rpe03',
'course_bpe',cast(course_bpe AS string)),
pscore('cpe01',
'avg_minconsumed_1wk',cast(avg_minconsumed_1wk AS string),
'course_bpe',cast(course_bpe AS string),
'persona',cast(persona AS string)),
pscore('npe01',
'avg_npsbp_1wk',cast(avg_npsbp_1wk AS string),
'avg_minconsumed_1wk',cast(avg_minconsumed_1wk AS string),
'persona',cast(persona AS string))
FROM visitor_course_features_new_visitors;
package com.udemy.predictivemodel;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

// Hive UDF that scores one row against the PMML model named by modelId.
public class PredictiveModelUdf extends UDF {
    // Models are loaded on first use and cached by model ID.
    private HashMap<String, PmmlModel> pmmlModelMap = new HashMap<String, PmmlModel>();

    public void loadPmmlModel(String modelId) throws IOException {
        pmmlModelMap.put(modelId, new PmmlModel(modelId));
    }

    // featureIDValues is a flat list of featureID, featureValue pairs.
    public double evaluate(String modelId, String... featureIDValues) throws Exception {
        // An odd count would overrun the array in the pairing loop below.
        if (featureIDValues.length % 2 != 0) {
            throw new IllegalArgumentException(
                "[PredictiveModelUdf.evaluate]: expected featureID/value pairs, got "
                + Arrays.toString(featureIDValues));
        }
        if (!pmmlModelMap.containsKey(modelId)) {
            loadPmmlModel(modelId);
        }
        PmmlModel pmmlModel = pmmlModelMap.get(modelId);
        HashMap<String, Object> scoringFeatureMap = pmmlModel.getScoringFeatureMap();
        for (int i = 0; i < featureIDValues.length; i += 2) {
            String featureID = featureIDValues[i];
            String featureValue = featureIDValues[i + 1];
            scoringFeatureMap.put(featureID, featureValue);
        }
        try {
            return pmmlModel.getScore(scoringFeatureMap);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Include the arguments so the failing row can be identified.
            throw new Exception(
                "[PredictiveModelUdf.evaluate ArrayIndexOutOfBoundsException - featureIDValues="
                + Arrays.toString(featureIDValues) + "]: " + e.getMessage(), e);
        } catch (Exception e) {
            // Keep the original exception as the cause.
            throw new Exception("[PredictiveModelUdf.evaluate Other Exception]: " + e, e);
        }
    }
}
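The pairing loop in evaluate can be tried without Hive or PmmlModel on the classpath. The sketch below is a stand-alone approximation of that one step, not part of the UDF: FeaturePairsSketch and toFeatureMap are hypothetical names, and the feature value in main is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone helper mirroring the pairing loop in evaluate().
final class FeaturePairsSketch {

    // Turns a flat (id, value, id, value, ...) argument list into a map,
    // rejecting lists that end with a dangling feature ID.
    static Map<String, Object> toFeatureMap(String... featureIDValues) {
        if (featureIDValues.length % 2 != 0) {
            throw new IllegalArgumentException(
                "expected featureID/value pairs, got "
                + featureIDValues.length + " arguments");
        }
        Map<String, Object> features = new LinkedHashMap<>();
        for (int i = 0; i < featureIDValues.length; i += 2) {
            features.put(featureIDValues[i], featureIDValues[i + 1]);
        }
        return features;
    }

    public static void main(String[] args) {
        // Same argument shape as pscore('rpe03', 'course_bpe', ...).
        System.out.println(toFeatureMap("course_bpe", "12.5"));
    }
}
```

Values stay as strings here, matching the cast(... AS string) calls in the INSERT statement. Any numeric conversion is assumed to happen inside the model's scoring code.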