Data Mining Overview

Tables
Hypercubes
Metacube

Tables

Goals

  • Data mining tables aggregate information from many feature tables to provide insight or answer a question
  • Data mining tables are used to create hypercubes for analyzing experiments

Features

  • Data mining tables do not affect the recommendation system in production
  • Data mining tables are used for analysis of features during a sliding 90-day timeframe

Extract

Transform

Load

Extract

Transform

Load

Load to Redshift

Workflow Objects

  • Inherits from:
    • hive_workflow
    • hive_daily_workflow
  • Option to upload to Redshift
  • Custom validation logic

Successes

  • Easy to create a workflow
  • Configurable
  • Upstream validation
  • Terminal validation

Regrets

  • Latency in extract phase
  • Source of truth is Redshift
  • String transform in load to Redshift
  • Formatting difficulties
  • Many workflow objects
  • Duplicated tests

The Future

Table Name

Schedule

Frequency

Dependencies

Validation

Hypercubes

Goals

  • Hypercubes allow slicing, or dimension reduction
  • Hypercubes allow drill-up/drill down; you can analyze one dimension or increase complexity by examining interaction between many dimensions
  • Hypercubes allow roll up; you can summarize across one dimension

Features

  • Hypercubes can be configured on the fly (CLI) or in a more permanent fashion (hypercubes.py)
  • Hypercubes are self-building
  • Hypercube queries are optimized for performance
  • Hypercubes are shareable and can be combined

Hypercube Components

  • Hypercube map
  • Hypercube core
  • Feature Hypercube
  • Experiment Hypercube

Successes

  • Configurable
  • Componetized

Regrets

  • High latency building multiple hypercubes, even with multiprocessing

The Future

Component

Experiment

Hashing

Measures

Features

Denominators

Metacube

Overview

The metacube combines all of the experiment hypercubes into one table for uploading to redshift, for use in Tableau, R, & Chartio.

Experiment Workbook

Data Mining Overview

By marswilliams

Data Mining Overview

  • 467