From Look to Book

A Day in the Life of an Ecommerce Data Scientist

About Me

Garrett Eastham

AI + Ecommerce Focus
CS @ Stanford
Background in Web Analytics
Career in Product Management
Prior: Bazaarvoice, RetailMeNot

Founder & Chief Data Scientist

edgecase

About Edgecase

edgecase

Product Data Enrichment

Enable better online search and navigation experiences by providing rich, structured product attribute information at scale
Leverage a combination of human and machine technology

Today's Agenda

How Did I Get Here?

The Data Science Lifecycle

The Ecommerce Dataset

Example Problem: Categorize Search Queries

General Q&A

How Did I Get Here?

Stanford - Human Computer Interaction

HeyElroy - decision assistance UI

Preference extraction

How Did I Get Here?

Stanford - Human Computer Interaction

Bazaarvoice - Product Manager

Web analytics and the crazy world of "proving" ROI

Developed the ViewStream project

Founding PM for project "Magpie"

How Did I Get Here?

Stanford - Human Computer Interaction

Bazaarvoice - Product Manager

Edgecase - CEO

Left BV in spring 2012 with ~$20k in savings

Built advisory board, first offering, and signed 2 customers

Raised seed round ($750k) and then venture round ($3.5m) to get company off the ground in 2013

How Did I Get Here?

Stanford - Human Computer Interaction

Bazaarvoice - Product Manager

Edgecase - CEO

Edgecase - CPO

Embraced the need for a pivoted product strategy

Rebuilt entire platform & kept existing customer commitments

Hired product leadership to replace me

How Did I Get Here?

Stanford - Human Computer Interaction

Bazaarvoice - Product Manager

Edgecase - CEO

Edgecase - CPO

Edgecase - Data Science

FINALLY get to work full time on data science

Ramped up on latest technologies (chose Spark)

Size of data (10 GB+ / day) made it necessary to also learn data engineering (AWS)

Data Science @ Edgecase

Team of 1 / Off-Stack / 6 mo. - 1 yr IP Timeline

Both define AND execute on research objectives

Wear all "data" hats - Executive, Scientist, Product Manager, Analyst

Data Science @ Edgecase

Machine Curation

Taxonomy Development

ROI Measurement

Predictive Analytics

The Data Science Lifecycle

Preparation

Dissemination

Analysis

Reflection

Discovery

Data Prep

Model Planning

Model Building

Communicate Results

Business Questions

Business Operations

Operationalize

The Data Science Lifecycle

Preparation

Dissemination

Analysis

Reflection

82%

5%

9%

4%

Source: CrowdFlower 2016 Data Science Report

The Ecommerce Dataset

Homepage

The Ecommerce Dataset

New Customers

Existing Customers

Search Engine (organic / paid)

Homepage

The Ecommerce Dataset

$$$

Page 1

Page 2

Page 3

Page 4

Click-Stream Logs

483782 - PageView - ... - 10.1.1.123 - ... - http://...

849283 - PageView - ... - 10.1.1.001 - ... - http://...

948272 - PageView - ... - 10.4.1.293 - ... - http://...

483782 - PageView - ... - 10.1.1.123 - ... - http://...

948272 - PageView - ... - 10.4.1.293 - ... - http://...

The Problem

Free-Text Queries

red lace shoes

red shoelace

lace-up tennis shoes

black velcro lace shoes

lace up Toms running

Shoes

Running Shoes

Lace-Up Running Shoes

Input

Output

Product Taxonomies

Provides structure and "meaning" to product catalog
Maps to merchant workflow / P&L
Standardizes reporting
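
The taxonomy in the example above (Shoes > Running Shoes > Lace-Up Running Shoes) can be sketched as a small parent-pointer tree. This is only an illustration in plain Scala: the `Category` type, the numeric IDs, and the `path` helper are assumptions made for the sketch, not part of any Edgecase system.

```scala
object TaxonomySketch {
  // A category node; `parent` is None for the root of the taxonomy.
  final case class Category(id: Int, name: String, parent: Option[Int])

  // Hypothetical IDs; the names come from the example slide.
  val categories: Map[Int, Category] = Seq(
    Category(1, "Shoes", None),
    Category(2, "Running Shoes", Some(1)),
    Category(3, "Lace-Up Running Shoes", Some(2))
  ).map(c => c.id -> c).toMap

  // Walk from a category up to the root and return the names root-first.
  def path(id: Int): List[String] = {
    @annotation.tailrec
    def loop(current: Option[Int], acc: List[String]): List[String] =
      current.flatMap(categories.get) match {
        case Some(c) => loop(c.parent, c.name :: acc)
        case None    => acc
      }
    loop(Some(id), Nil)
  }
}
```

Calling `TaxonomySketch.path(3)` walks up from the most specific category, which is the kind of roll-up a merchant report would need.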

Where Do We Start?

Discovery

Data Prep

Model Planning

Model Building

Communicate Results

Operationalize

Answer Two Key Questions

What data will we need?
Where does that data live?

Let's Look at the Logs

// Import Spark libraries
import org.apache.spark.sql._
import sqlContext._

// Load raw Snowplow logs
val raw_snowplow_logs = sqlContext.read.load("path/to/raw/logs")

// Show the schema
raw_snowplow_logs.printSchema()

/* == Output ==
|-- app_id: string (nullable = true)
|-- doc_height: long (nullable = true)
|-- doc_width: long (nullable = true)
|-- domain_sessionid: string (nullable = true)
|-- domain_sessionidx: long (nullable = true)
|-- dvce_tstamp: string (nullable = true)
|-- event: string (nullable = true)
|-- page_referrer: string (nullable = true)
|-- page_title: string (nullable = true)
|-- page_url: string (nullable = true)
 */

Design a Relevant Schema

Queries

Session ID
Query Terms
Timestamp

Product Views

Session ID
Product ID
Timestamp

Transactions

Session ID
Transaction ID
Timestamp

The QPT Model

Design, Test and Run ETL

// Query object
case class Query(
  sessionId: String,        // The session ID of the user originating the query
  terms: String,            // The free-text search query
  timestamp: String         // The timestamp of the originating query event
)

// ProductView object
case class ProductView(
  sessionId: String,        // The session ID of the user originating the pageview
  productId: String,        // The external product ID of the originating product page
  timestamp: String         // The timestamp of the originating pageview event
)

// Transaction object
case class Transaction(
  sessionId: String,        // The session ID of the user originating the transaction
  transactionId: String,    // The transaction ID of the converted session
  timestamp: String         // The timestamp of the transaction event
)
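
To make the QPT model concrete, here is a minimal sketch of how the three record types link together through the session ID. It uses plain Scala collections rather than Spark so it runs anywhere; the `QptRow` type and the `link` function are hypothetical names introduced for illustration, and the sketch ignores event ordering within a session.

```scala
object QptSketch {
  // Same three shapes as the QPT model, redeclared so the sketch is self-contained.
  final case class Query(sessionId: String, terms: String, timestamp: String)
  final case class ProductView(sessionId: String, productId: String, timestamp: String)
  final case class Transaction(sessionId: String, transactionId: String, timestamp: String)

  // One row per (query, product viewed in the same session), flagged if the session converted.
  final case class QptRow(terms: String, productId: String, converted: Boolean)

  def link(qs: Seq[Query], ps: Seq[ProductView], ts: Seq[Transaction]): Seq[QptRow] = {
    val viewsBySession = ps.groupBy(_.sessionId)
    val convertedSessions = ts.map(_.sessionId).toSet
    for {
      q <- qs
      p <- viewsBySession.getOrElse(q.sessionId, Seq.empty)
    } yield QptRow(q.terms, p.productId, convertedSessions.contains(q.sessionId))
  }
}
```

Queries from sessions with no product views produce no rows, which mirrors an inner join on the session ID.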

Design, Test and Run ETL

import scala.util.Try
import sqlContext.implicits._   // enables .toDF() on RDDs of case classes

// Set up logs for SQL access
raw_snowplow_logs.registerTempTable("snowplow")

// Query for relevant event metadata from logs
val events = sqlContext.sql("""SELECT sessionId, terms, timestamp
                               FROM snowplow
                               WHERE pagetype = 'search'""")

// Map events onto the Query construct, dropping rows that fail to convert
val queries = events.map(r => Try(Query(r(0).toString,
                                        r(1).toString,
                                        r(2).toString)))
                    .filter(_.isSuccess)
                    .map(_.get)
                    .toDF()

// Repeat for product views and transactions
val product_views = ...
val transactions = ...
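
The `Try(...)` / `filter(_.isSuccess)` pattern above is what keeps a single malformed log row from failing the whole job. A minimal sketch of the same idea without Spark follows; modelling a raw row as `Seq[Any]` is an assumption standing in for a Spark `Row`, and `toQuery` / `parseAll` are hypothetical helper names.

```scala
import scala.util.Try

object SafeParseSketch {
  final case class Query(sessionId: String, terms: String, timestamp: String)

  // Conversion fails (inside the Try) on null fields or rows with too few columns.
  def toQuery(row: Seq[Any]): Try[Query] =
    Try(Query(row(0).toString, row(1).toString, row(2).toString))

  // Keep only the rows that converted cleanly.
  def parseAll(rows: Seq[Seq[Any]]): Seq[Query] =
    rows.flatMap(r => toQuery(r).toOption)
}
```

Silently dropping bad rows is convenient for exploration; for a production ETL it is usually worth counting the failures as well.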

Analyze Data Set

// Register data model tables
queries.registerTempTable("queries")
product_views.registerTempTable("product_views")
transactions.registerTempTable("transactions")

// Q: How many people query each day?
// (groups by calendar day; assumes timestamp is in a format to_date can parse)
sqlContext.sql("""SELECT to_date(timestamp) AS day, count(distinct sessionId) AS total_searchers
                  FROM queries
                  GROUP BY to_date(timestamp)""").show()

// Q: Do people actually purchase after making a query?
sqlContext.sql("""SELECT count(distinct q.sessionId) AS total_searchers_that_purchase
                  FROM queries q
                  JOIN transactions t
                  ON q.sessionId = t.sessionId""").show()
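
The two questions above can also be checked on a small in-memory sample before running them on the full logs. The following is a rough sketch in plain Scala; it assumes timestamps begin with a `yyyy-MM-dd` date, and the function names are made up for this example.

```scala
object SearchStatsSketch {
  final case class Query(sessionId: String, terms: String, timestamp: String)

  // Distinct searching sessions per day; assumes timestamps start with "yyyy-MM-dd".
  def searchersPerDay(qs: Seq[Query]): Map[String, Int] =
    qs.groupBy(_.timestamp.take(10)).map { case (day, rows) =>
      day -> rows.map(_.sessionId).distinct.size
    }

  // Share of searching sessions that also appear among the purchasing sessions.
  def searchToPurchaseRate(qs: Seq[Query], purchasingSessions: Set[String]): Double = {
    val searchers = qs.map(_.sessionId).toSet
    if (searchers.isEmpty) 0.0
    else searchers.count(purchasingSessions).toDouble / searchers.size
  }
}
```

Dividing by the number of distinct searchers turns the raw count from the second SQL query into a rate, which is easier to compare across days.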

Selecting a Model to Train

Facebook FastText

Very fast, local (single-machine) training of Word2Vec-style embeddings
Can use character n-grams (subword information), which makes it great for "small", highly similar documents (e.g., search terms)
Hierarchical softmax keeps training fast across a very "wide" set of categories, out of the box
Simple command line interface

red lace shoes

r-e-d l-a-c-e s-h-o-e-s
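
To illustrate the character-level view of a query, here is a minimal sketch of character n-gram extraction with `<` and `>` word-boundary markers, which is how fastText describes its subword features. This is not fastText's own code: `charNgrams` and `queryNgrams` are hypothetical helpers, and the default n-gram length of 3 is chosen only for the example (in fastText the lengths are set with the `-minn` and `-maxn` options).

```scala
object CharNgramSketch {
  // Character n-grams of one word, with '<' and '>' marking the word boundaries.
  def charNgrams(word: String, minN: Int, maxN: Int): Seq[String] = {
    val padded = "<" + word + ">"
    for {
      n <- minN to maxN
      if n <= padded.length
      gram <- padded.sliding(n)
    } yield gram
  }

  // N-grams for every whitespace-separated term in a query.
  def queryNgrams(query: String, minN: Int = 3, maxN: Int = 3): Seq[String] =
    query.trim.split("\\s+").toSeq.filter(_.nonEmpty).flatMap(charNgrams(_, minN, maxN))
}
```

Because "lace" and "shoelace" share n-grams such as `lac` and `ace`, near-duplicate queries end up with overlapping features even when the exact words differ.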

Setup Training and Test Data

// Load external product ID ==> category ID mapping CSV
sqlContext.read.format("com.databricks.spark.csv")
           .option("header", "true").option("inferSchema", "true")
           .load("path/to/category_mapping.csv")
           .registerTempTable("categories")

// Join Queries ==> Product Views ==> Categories
val queries_to_categories = sqlContext.sql("""SELECT c.categoryId, q.terms
                                              FROM queries q
                                              JOIN product_views p
                                                  ON q.sessionId = p.sessionId
                                              JOIN categories c
                                                  ON p.productId = c.productId""")

// Generate FastText formatted string
// -- FastText example row: __label__123 red lace shoes
val fast_text_strings = queries_to_categories.map(r => "__label__"
                                                    + r(0).toString 
                                                    + " " + r(1).toString)

// Create test / train splits
val splits = fast_text_strings.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0)
val test = splits(1)
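
The label formatting and the 90/10 split can be tried out on a handful of rows without a cluster. The sketch below is plain Scala with hypothetical helper names; unlike Spark's `randomSplit`, which samples each row independently and so yields only approximately 90/10, this version shuffles once and cuts at an exact index.

```scala
import scala.util.Random

object TrainingDataSketch {
  // One fastText supervised line: "__label__<categoryId> <query terms>".
  def toFastTextLine(categoryId: String, terms: String): String =
    "__label__" + categoryId + " " + terms.trim

  // Shuffle with a fixed seed, then cut at the requested fraction.
  def split(lines: Seq[String], trainFraction: Double, seed: Long): (Seq[String], Seq[String]) = {
    val shuffled = new Random(seed).shuffle(lines)
    val cut = math.round(lines.size * trainFraction).toInt
    shuffled.splitAt(cut)
  }
}
```

Fixing the seed keeps the split reproducible, so accuracy numbers from different training runs are measured on the same held-out rows.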

Train Model and Test Accuracy

// Run supervised training from FastText CLI
$ ./fasttext supervised -input queries_train.txt -output query_to_category

// Test the accuracy of the model - top 1
$ ./fasttext test query_to_category.bin queries_test.txt 1

// Test the accuracy of the model - top 5
$ ./fasttext test query_to_category.bin queries_test.txt 5
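
The `test` command reports precision and recall at k (P@k and R@k). As a rough sketch of what those numbers mean for single-label data like ours, assuming each test row has exactly one true category: precision divides the hits by the number of predictions made, while recall divides by the number of test rows, so with k = 5 it is the recall figure that corresponds to "top 5 accuracy". The function below is an illustration, not fastText's implementation.

```scala
object PrecisionRecallSketch {
  // Each example: (true label, ranked predicted labels with the best first).
  def precisionRecallAtK(examples: Seq[(String, Seq[String])], k: Int): (Double, Double) = {
    val topK = examples.map { case (truth, ranked) => (truth, ranked.take(k)) }
    val correct = topK.count { case (truth, preds) => preds.contains(truth) }
    val predicted = topK.map(_._2.size).sum
    val precision = if (predicted == 0) 0.0 else correct.toDouble / predicted
    val recall = if (examples.isEmpty) 0.0 else correct.toDouble / examples.size
    (precision, recall)
  }
}
```

With one true label per row, P@5 can never exceed 0.2, so a low P@5 next to a high R@5 is expected rather than a sign of a bad model.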

How could we improve the model?

Improving the Model

Intuition: Not all product views are created equal.

$$$

Page 1

Page 2

Page 3

Page 4

Some draw more attention than others.

The best convert.

Solution: Update our model to incorporate product dwell time and shopper conversion behavior.

Update Training Data

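// Adding dwell time requires creating and updating localized variables within session
// Assumed column positions: r(0) = session ID, r(1) = timestamp, r(2) = product ID
// convertToMilliseconds is a helper assumed to be defined elsewhere
val product_views = events.map(r => (r(0).toString, r)).groupByKey().flatMap { case (sessionId, rows) =>
  // Intra-session variables
  val productDwellTimes = scala.collection.mutable.Map[String, Long]()
  var lastVisitedProductHash = ""
  var lastVisitedProductTime = 0L

  // Sort events by timestamp
  rows.toList.sortBy(y => convertToMilliseconds(y(1).toString)).foreach { y =>
    val timestamp = convertToMilliseconds(y(1).toString)

    // Credit the elapsed time to the previously visited product
    if(lastVisitedProductHash != "") {
      productDwellTimes(lastVisitedProductHash) =
        productDwellTimes.getOrElse(lastVisitedProductHash, 0L) + (timestamp - lastVisitedProductTime)
    }

    // Remember the current product for the next event
    lastVisitedProductHash = y(2).toString
    lastVisitedProductTime = timestamp
  }

  // Emit (session ID, product ID, dwell time in ms)
  productDwellTimes.map { case (productId, dwell) => (sessionId, productId, dwell) }
}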
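
The core of the dwell-time calculation is easier to see on a single session in plain Scala. In this sketch, dwell time for a product is taken to be the gap until the next page view in the same session; the `View` type and `dwellTimes` function are assumptions for illustration, and the last view in a session gets no dwell time because nothing follows it.

```scala
object DwellTimeSketch {
  // A product page view within one session; timestampMs is epoch milliseconds.
  final case class View(productId: String, timestampMs: Long)

  // Dwell time per product = summed time until the next view in the same session.
  def dwellTimes(views: Seq[View]): Map[String, Long] = {
    val sorted = views.sortBy(_.timestampMs)
    sorted.zip(sorted.drop(1)).foldLeft(Map.empty[String, Long]) {
      case (acc, (current, next)) =>
        val elapsed = next.timestampMs - current.timestampMs
        acc.updated(current.productId, acc.getOrElse(current.productId, 0L) + elapsed)
    }
  }
}
```

Pairing each view with its successor through `zip` avoids the mutable "last visited" bookkeeping, at the cost of materializing the sorted session in memory.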

Update Training Data

// Create copies
// Columns used: r(0) = category ID, r(1) = query terms, r(2) = dwell time, r(3) = converted flag
val queries_to_categories = mapQueriesToCategories()
val fast_text_formatted_strings = queries_to_categories.flatMap(r => {
  // Determine how many copies of each search to emit
  val search_dwelltime = r(2).toString.toDouble
  val number_of_copies = 5
  var copies = if(search_dwelltime > 0 && search_dwelltime < max_dwelltime)
                 (search_dwelltime / max_dwelltime * number_of_copies).toInt + 1
               else number_of_copies

  // Very short dwell times (below 10) count only once
  if(search_dwelltime < 10) copies = 1

  // Boost searches that led to a conversion
  if(r(3).toString.toBoolean) copies += 3

  // Emit exactly `copies` labeled rows
  (1 to copies).map(x => "__label__" + r(0).toString + " " + r(1).toString)
})

// Create test / train splits
val splits = fast_text_formatted_strings.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0)
val test = splits(1)
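
The weighting rule is simple enough to pull out into a pure function and check by hand. This is a sketch of the same logic with the constants from the slide (5 base copies, a dwell threshold of 10, +3 for a conversion) exposed as parameters; the names are hypothetical, and dwell time is in whatever unit the upstream data uses.

```scala
object CopyWeightSketch {
  // Number of training copies for one (query, category) row.
  def copiesFor(dwell: Double, maxDwell: Double, converted: Boolean,
                baseCopies: Int = 5, minDwell: Double = 10.0, conversionBonus: Int = 3): Int = {
    // Scale copies with dwell time; missing or out-of-range values get the base count.
    val scaled =
      if (dwell > 0 && dwell < maxDwell) (dwell / maxDwell * baseCopies).toInt + 1
      else baseCopies
    // Very short dwell times count only once.
    val byDwell = if (dwell < minDwell) 1 else scaled
    // Conversions always add a fixed bonus.
    if (converted) byDwell + conversionBonus else byDwell
  }

  // Repeat a formatted training line the given number of times.
  def repeat(line: String, copies: Int): Seq[String] = Seq.fill(copies)(line)
}
```

For example, with a maximum dwell time of 100, a dwell time of 50 yields 3 copies, and 6 if the session converted. Duplicating rows is a blunt but effective way to weight examples when the training tool has no per-example weight option.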

Re-train and Test Model

// Run supervised training from FastText CLI
$ ./fasttext supervised -input queries_train.txt -output query_to_category

// Test the accuracy of the model - top 1
$ ./fasttext test query_to_category.bin queries_test.txt 1

// Test the accuracy of the model - top 5
$ ./fasttext test query_to_category.bin queries_test.txt 5

Questions?

General Assembly Presentation - January 10, 2016

By Garrett Eastham
