Data Mining Pipeline

Making Meaning from Data

Tables
Hypercubes
Metacube

Tables

Goals

Data mining tables aggregate information from many feature tables to provide insight or answer a question
Data mining tables are used to create hypercubes for analyzing experiments

Features

Data mining tables do not affect the recommendation system in production
Data mining tables are used for analysis of features during a sliding 90-day timeframe

Tables in Production

dm_course_consumption

dm_user_persona

dm_course_interest

dm_visitor_experiment

dm_query_course_performance

dm_query_performance

dm_top_query_performance

dm-course_performance

dm_search_position

dm_visitor_enrollment

dm_visitor_engagement

dm_subcat_interest

dm_course_performance

This table provides a way to analyze course performance across several metrics over a 90-day time period. A feature hypercube is created from this table, which is then used for the experiment hypercube.

CREATE TABLE dm_course_performance (

courseid bigint,

num_visitors int,

course_epmv int,

course_bpmv int,

course rpmv int,

course_bpe int,

course_rpe int )

PARTITIONED BY (datestamp date)

Extract
Transform
Load

Title Text

Hypercubes

Goals

Hypercubes allow slicing, or dimension reduction
Hypercubes allow drill-up/drill down; you can analyze one dimension or increase complexity by examining interaction between many dimensions
Hypercubes allow roll up; you can summarize across one dimension

Features

Hypercubes can be configured on the fly (CLI) or in a more permanent fashion (hypercubes.py)
Hypercubes are self-building
Hypercube queries are optimized for performance
Hypercubes are shareable and can be combined

Experiment Hypercubes

course_interest

purchaser_type

success_subcontext

course_enrollment

avg_minconsumed_1wk

persona

featured_subcontext

course_epmv

course_bpe

context_type

avg_npsbp_1wk

subcat_interest

totals

denominator

Hypercube Components

Hypercube map
Hypercube core
Feature Hypercube
Experiment Hypercube

feature

experiment hypercube

hypercube map

feature_map['course_epmv'] = {
    'table': 'dm_course_performance',
    'field': 'course_epmv',
    'type': 'int',
    'binning': {
        'null_value': '8',
        'values': ['1', '4', '8', '16', '32'],
        'default_value': '64'

}}
feature_map['course_bpmv'] = {
    'table': 'dm_course_performance',
    'field': 'course_bpmv',
    'type': 'int',
    'binning': {
        'null_value': '4',
        'values': ['1', '2', '4', '8', '16', '32'],
'default_value': '64'

}}

CREATE TABLE dm_hypercube_re_19074_course_epmv_visitorid_prod (
variant varchar(255),
  days_since_last_visit int,
  path_type varchar(64),
  path_topology varchar(64),
  datestamp date,
  visitorid bigint,
  course_epmv int,
  impressions int,
  views int,
  enrolled int,
  revenue float,
  minconsumed_1wk int,
  nps_flag_1wk int,
  nps_1wk int
)

INSERT INTO dm_hypercube_re_19074_course_epmv_visitorid_prod
SELECT variant, days_since_last_visit, path_type, path_topology, datestamp, visitorid, course_epmv, sum(impressions), sum(views), sum(enrolled), sum(revenue), sum(minconsumed_1wk), sum(nps_flag_1wk), sum(nps_1wk)
FROM (SELECT
vv.variant variant,
CASE WHEN vg.days_since_last_visit IS NULL THEN 0
WHEN vg.days_since_last_visit<=7 THEN 7
WHEN vg.days_since_last_visit<=14 THEN 14
WHEN vg.days_since_last_visit<=30 THEN 30
WHEN vg.days_since_last_visit<=91 THEN 91
ELSE 0 END days_since_last_visit,
CASE WHEN vp.path_type IS NULL THEN 'other'
ELSE vp.path_type END path_type,
CASE WHEN vp.path_topology IS NULL THEN 'other'
ELSE vp.path_topology END path_topology,
mf.datestamp datestamp, mf.visitorid visitorid,
CASE WHEN cp.course_epmv IS NULL THEN 8
WHEN cp.course_epmv<=1 THEN 1
WHEN cp.course_epmv<=4 THEN 4
WHEN cp.course_epmv<=8 THEN 8
WHEN cp.course_epmv<=16 THEN 16
WHEN cp.course_epmv<=32 THEN 32
ELSE 64 END course_epmv,
markedasseen_flag impressions, islanded_flag views, enrolled enrolled, booking revenue,
minconsumed_1wk minconsumed_1wk, nps_flag_1wk nps_flag_1wk, nps_1wk nps_1wk
FROM impression_funnel mf
LEFT OUTER JOIN dm_hypercube_visitor_variant_re_19074_prod vv ON
vv.visitorid=mf.visitorid AND vv.datestamp=mf.datestamp
LEFT OUTER JOIN dm_visitor_engagement vg ON
vg.visitorid=mf.visitorid AND vg.datestamp=mf.datestamp
LEFT OUTER JOIN visitor_path vp ON
vp.visitorid=mf.visitorid AND vp.datestamp=mf.datestamp
LEFT OUTER JOIN dm_course_performance cp ON
cp.courseid=mf.courseid AND cp.datestamp=mf.datestamp
WHERE mf.datestamp BETWEEN '2021-01-01' AND '2022-01-01'
) x
GROUP BY variant, days_since_last_visit, path_type, path_topology, datestamp, visitorid, course_epmv;

Metacube

Overview

The metacube combines all of the experiment hypercubes into one table for uploading to redshift, for use in Tableau, R, & Chartio.

experiment

hypercube

experiment

hypercube

experiment

hypercube

experiment

hypercube

metacube

CREATE TABLE dm_recommendation_experiment_metacube_prod (
dimension_name string,
dimension_value string,
visitor_days int,
  variant varchar(255),
  days_since_last_visit int,
  path_type varchar(64),
  path_topology varchar(64),
  datestamp date,
  impressions int,
  views int,
  enrolled int,
  revenue float,
  minconsumed_1wk int,
  nps_flag_1wk int,
  nps_1wk int,
  impressions_2 int,
  views_2 int,
  enrolled_2 int,
  revenue_2 float,
  minconsumed_1wk_2 int,
  nps_flag_1wk_2 int,
  nps_1wk_2 int
)
PARTITIONED BY (experiment bigint, component_id varchar(10), hashing_variable varchar(10));

Experiment Workbook

deck

By marswilliams

Data Mining Pipeline

Making Meaning from Data

Tables Hypercubes Metacube

Tables

Goals

Features

Tables in Production

dm_course_performance

ExtractTransformLoad

Title Text

Hypercubes

Goals

Features

Experiment Hypercubes

Hypercube Components

Metacube

Overview

Experiment Workbook

deck

More from marswilliams

Tables
Hypercubes
Metacube

Extract
Transform
Load