Feature Pipeline

Building the Foundation

Overview

[Architecture diagram: user-site interaction, logging, raw data tables, summary tables, feature tables, data mining tables, score tables, course feature table, hypercubes, metacube; storage and serving: MySQL, Hive, Redshift, Redis, Elasticsearch, middleware; analysis: Tableau, R]

"A 'feature' in machine learning is a variable that explains the behavior of a response variable."

- Larry Wai

"A response variable measures a key metric that we want to affect for the business, like revenue or NPS."

- Larry Wai

"Features are used to predict whether or not a particular item is relevant to the user, as measured by response variables."

- Larry Wai

Goals

  • Feature tables are used to create predictive models to make course recommendations or improve search results

Features

  • We can update our model and increase its predictive power if we add more relevant features
  • Over time, we may learn that some features are more or less valuable than once thought

[Diagram of feature tables, each marked as a core or helper table: enrollment_funnel, course_consumption, course_rating_mult, user_persona, course_performance, user_creation, course_interest, subcat_interest, todate_course_rating, course_list_price, user_consumption, user_activity, impression_funnel, visitor_path, course_free_mult, course_feature, course_metadata (legend: Core Table, Helper Table)]

Impression Funnel

This table is the backbone of many feature tables and serves as the core join table for several data mining tables and hypercubes.

CREATE TABLE impression_funnel (
    userid bigint,
    visitorid bigint,
    courseid bigint,
    push_flag int,
    search_flag int,
    markedasseen_flag int,
    islanded_flag int,
    enrolled int,
    booking float,
    revenue float,
    minconsumed_1wk int,
    nps_flag_1wk int,
    nps_1wk int,
    context_type string,
    query string,
    context string,
    subcontext string,
    normalized_query string )
PARTITIONED BY (datestamp date);
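Downstream feature tables aggregate these per-impression rows. As an illustrative sketch (not the actual production HiveQL), here is how a per-course enrollment-rate feature could be derived from funnel-style rows; the field names mirror the DDL above, but the aggregation itself is a stand-in:

```python
# Hypothetical sketch: aggregate impression_funnel-style rows into a
# per-course enrollment-rate feature. Field names mirror the DDL above;
# the aggregation logic is illustrative, not the production job.
from collections import defaultdict

def enrollment_rate_by_course(rows):
    """rows: iterable of dicts with 'courseid' and 'enrolled' (0/1)."""
    impressions = defaultdict(int)
    enrollments = defaultdict(int)
    for row in rows:
        impressions[row["courseid"]] += 1
        enrollments[row["courseid"]] += row["enrolled"]
    return {cid: enrollments[cid] / impressions[cid] for cid in impressions}

rows = [
    {"courseid": 101, "enrolled": 1},
    {"courseid": 101, "enrolled": 0},
    {"courseid": 202, "enrolled": 1},
]
print(enrollment_rate_by_course(rows))  # {101: 0.5, 202: 1.0}
```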

User Persona

This table provides a way to analyze enrollment and revenue in relation to user persona. A user persona summarizes the enrollment, consumption, and transaction patterns of a user's behavior.

A feature hypercube is created from this table and is then used to build the experiment hypercube.

CREATE TABLE user_persona (
    userid bigint,
    persona varchar(14),
    past_enrollments int,
    past_bookings int,
    past_minconsumed int,
    past_bookings_per_enrollment int,
    past_minconsumed_per_enrollment int )
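The two `*_per_enrollment` columns are derived from the raw totals. A hedged sketch of that derivation (the guard for zero enrollments and the integer division matching the `int` column types are assumptions, not Udemy's actual rules):

```python
# Hypothetical sketch of the derived user_persona columns. Integer
# division matches the int column types in the DDL above; the
# zero-enrollment guard is an assumption about edge-case handling.
def persona_features(past_enrollments, past_bookings, past_minconsumed):
    def per_enrollment(total):
        # Guard against divide-by-zero for users with no past enrollments.
        return total // past_enrollments if past_enrollments else 0
    return {
        "past_enrollments": past_enrollments,
        "past_bookings": past_bookings,
        "past_minconsumed": past_minconsumed,
        "past_bookings_per_enrollment": per_enrollment(past_bookings),
        "past_minconsumed_per_enrollment": per_enrollment(past_minconsumed),
    }

print(persona_features(4, 80, 600))
```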

How Are These Tables Built?

Hive Daily ETL

This class handles updates to partitioned tables, which are typically partitioned by date.

Hive ETL

This class handles updates to full refresh tables, which don't require historical data persistence.
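The difference between the two classes comes down to how the target table is overwritten. A minimal sketch of the contrast, assuming hypothetical function names and table names (the generated HiveQL is illustrative, not the production code):

```python
# Hypothetical sketch contrasting the two update strategies: the daily ETL
# overwrites a single date partition (preserving history), while the
# full-refresh ETL rewrites the whole table. Function and table names are
# invented for illustration.
def daily_update_sql(table, source, datestamp):
    # Overwrite only one date partition; older partitions are untouched.
    return (f"INSERT OVERWRITE TABLE {table} "
            f"PARTITION (datestamp='{datestamp}') "
            f"SELECT * FROM {source} WHERE datestamp='{datestamp}'")

def full_refresh_sql(table, source):
    # Rewrite the entire table; no historical data to persist.
    return f"INSERT OVERWRITE TABLE {table} SELECT * FROM {source}"

print(daily_update_sql("impression_funnel", "staging_funnel", "2015-06-01"))
print(full_refresh_sql("user_persona", "staging_persona"))
```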

Extract

  • Select from external data source
  • Unload to s3 bucket

Transform

  • Create external table in Hive
    • Read from s3 bucket

Load

  • Select from external tables and Hive tables
  • Transform data
    • Can use Python and Java UDFs
  • Insert into Hive summary table

Upload

  • Upload summary table to Redshift for consumers using R, Tableau, Chartio
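The four steps above can be sketched end to end with in-memory stand-ins for S3, Hive, and Redshift; this is an illustrative skeleton of the flow, not the real orchestration code:

```python
# Hypothetical end-to-end sketch of the Extract -> Transform -> Load ->
# Upload flow, using plain dicts as stand-ins for S3, Hive, and Redshift.
def extract(source_rows, s3):
    s3["bucket/raw"] = list(source_rows)          # unload to an S3 bucket

def transform(s3, hive):
    # An external Hive table simply points at the data in S3.
    hive["external_raw"] = s3["bucket/raw"]

def load(hive):
    # Select from the external table, transform (Python/Java UDFs would
    # plug in here), and insert into a Hive summary table.
    hive["summary"] = [{"courseid": r["courseid"], "enrolled": r["enrolled"]}
                       for r in hive["external_raw"]]

def upload(hive, redshift):
    redshift["summary"] = hive["summary"]         # for R / Tableau / Chartio

s3, hive, redshift = {}, {}, {}
extract([{"courseid": 101, "enrolled": 1, "extra": "x"}], s3)
transform(s3, hive)
load(hive)
upload(hive, redshift)
print(redshift["summary"])  # [{'courseid': 101, 'enrolled': 1}]
```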

Visitor Course Score Workflow

Crafting Recommendations

Helper Tables

  • Features for existing visitors (those who return within 14 days)
  • Features for new visitors (first time visitors and those who return after the 14-day window)
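The 14-day split above can be sketched as a simple classification rule; the function name and the exact boundary handling (inclusive 14 days) are assumptions for illustration:

```python
# Hypothetical sketch of the 14-day visitor split: a visitor is "existing"
# if they return within 14 days of their last visit; first-time visitors
# and those returning after the window are treated as new.
from datetime import date

def visitor_segment(last_visit, current_visit, window_days=14):
    if last_visit is None:                        # first-time visitor
        return "new"
    gap = (current_visit - last_visit).days
    return "existing" if gap <= window_days else "new"

print(visitor_segment(date(2015, 6, 1), date(2015, 6, 10)))   # existing
print(visitor_segment(date(2015, 5, 1), date(2015, 6, 10)))   # new
print(visitor_segment(None, date(2015, 6, 10)))               # new
```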

Score Tables

  • Scores for 3k courses for existing visitors
  • Top scores for existing visitors
CREATE TABLE visitor_course_tmp_scores (
    visitorid bigint,
    courseid bigint,
    epmi float,
    rpe float,
    cpe float,
    npe float
);

Sample rows:

-1    249454    6.864902    1.0195512    4.507095    50.436234
-1    620966    6.864902    8.059692    12.811697    45.934177
-1    492808    6.864902    1.0195512    7.3605323    45.934177
-1    619578    6.864902    12.884531    12.811697    56.168747
INSERT INTO TABLE visitor_course_tmp_scores
SELECT visitorid,
       courseid,
       pscore('epmi01',
            'course_epmv',cast(course_epmv AS string),
            'course_rpmv',cast(course_rpmv AS string),
            'course_interest',cast(course_interest AS string),
            'course_subcat_interest',cast(course_subcat_interest AS string),
            'persona',cast(persona AS string)),
       pscore('rpe03',
            'course_bpe',cast(course_bpe AS string)),
       pscore('cpe01',
            'avg_minconsumed_1wk',cast(avg_minconsumed_1wk AS string),
            'course_bpe',cast(course_bpe AS string),
            'persona',cast(persona AS string)),
       pscore('npe01',
            'avg_npsbp_1wk',cast(avg_npsbp_1wk AS string),
            'avg_minconsumed_1wk',cast(avg_minconsumed_1wk AS string),
            'persona',cast(persona AS string))
FROM visitor_course_features_new_visitors;
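The "top scores" table is then derived by ranking each visitor's scored courses and keeping the best N. A hedged sketch of that ranking step (production does this in HiveQL; the single combined score here is an illustrative simplification):

```python
# Hypothetical sketch of deriving the top-scores table: group scored rows
# by visitor, rank by score, and keep the top N course ids. The combined
# score is an illustrative simplification of the epmi/rpe/cpe/npe columns.
from collections import defaultdict

def top_courses(score_rows, n=2):
    """score_rows: (visitorid, courseid, score) tuples."""
    by_visitor = defaultdict(list)
    for visitorid, courseid, score in score_rows:
        by_visitor[visitorid].append((score, courseid))
    return {v: [cid for _, cid in sorted(rows, reverse=True)[:n]]
            for v, rows in by_visitor.items()}

rows = [(-1, 249454, 50.4), (-1, 620966, 45.9), (-1, 619578, 56.2)]
print(top_courses(rows))  # {-1: [619578, 249454]}
```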
package com.udemy.predictivemodel;

import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.hive.ql.exec.UDF;

public class PredictiveModelUdf extends UDF {

    // Cache of loaded PMML models, keyed by model id.
    private HashMap<String, PmmlModel> pmmlModelMap = new HashMap<String, PmmlModel>();

    public void loadPmmlModel(String modelId) throws IOException {
        pmmlModelMap.put(modelId, new PmmlModel(modelId));
    }

    // Arguments after modelId are alternating feature id / feature value pairs,
    // matching the pscore(...) calls in the HiveQL above.
    public double evaluate(String modelId, String... featureIDValues) throws Exception {
        if (!pmmlModelMap.containsKey(modelId)) {
            loadPmmlModel(modelId);
        }
        PmmlModel pmmlModel = pmmlModelMap.get(modelId);
        HashMap<String, Object> scoringFeatureMap = pmmlModel.getScoringFeatureMap();
        // Bound on i + 1 guards against an odd number of varargs.
        for (int i = 0; i + 1 < featureIDValues.length; i += 2) {
            String featureID = featureIDValues[i];
            String featureValue = featureIDValues[i + 1];
            scoringFeatureMap.put(featureID, featureValue);
        }
        try {
            return pmmlModel.getScore(scoringFeatureMap);
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new Exception(
                "[PredictiveModelUdf.evaluate ArrayIndexOutOfBoundsException]: "
                + e.getMessage());
        } catch (Exception e) {
            throw new Exception("[PredictiveModelUdf.evaluate Other Exception]: " + e);
        }
    }
}

deck

By marswilliams

An overview of the feature pipeline workflow.