From Look to Book

A Day in the Life of an Ecommerce Data Scientist

About Me

Garrett Eastham

  • AI + Ecommerce Focus
  • CS @ Stanford
  • Background in Web Analytics
  • Career in Product Management
  • Prior: Bazaarvoice, RetailMeNot

Founder & Chief Data Scientist

edgecase

About Edgecase

edgecase

Product Data Enrichment

  • Enable better online search and navigation experiences by providing rich, structured product attribute information at scale
  • Leverage a combination of human and machine technology

Today's Agenda

How Did I Get Here?

The Data Science Lifecycle

The Ecommerce Dataset

Example Problem: Categorize Search Queries

General Q&A

How Did I Get Here?

Stanford - Human Computer Interaction

  • HeyElroy - decision assistance UI

 

  • Preference extraction

Bazaarvoice - Product Manager

  • Web analytics and the crazy world of "proving" ROI

 

  • Developed the ViewStream project

 

  • Founding PM for project "Magpie"

Edgecase - CEO

  • Left BV in spring 2012 with ~$20k in savings

 

  • Built advisory board, first offering, and signed 2 customers

 

  • Raised seed round ($750k) and then venture round ($3.5m) to get company off the ground in 2013

Edgecase - CPO

  • Embraced the need for a pivoted product strategy

 

  • Rebuilt entire platform & kept existing customer commitments

 

  • Hired product leadership to replace me

Edgecase - Data Science

  • FINALLY get to work full time on data science

 

  • Ramped up on latest technologies (chose Spark)

 

  • Size of data (10 GB+ / day) made it necessary to also learn data engineering (AWS)

Data Science @ Edgecase

Team of 1 / Off-Stack / 6 mo. - 1 yr IP Timeline

Both define AND execute on research objectives

Wear all "data" hats - Executive, Scientist, Product Manager, Analyst

Data Science @ Edgecase

Machine Curation

Taxonomy Development

ROI Measurement

Predictive Analytics

The Data Science Lifecycle

Preparation

Dissemination

Analysis

Reflection

Discovery

Data Prep

Model Planning

Model Building

Communicate Results

Business Questions

Business Operations

Operationalize

The Data Science Lifecycle

Preparation

Dissemination

Analysis

Reflection

82%

5%

9%

4%

The Ecommerce Dataset

Homepage

Category

Search

Product

Transaction

Executive

Analyst

Merchants

Product Manager

  • Did site conversion increase?
  • Is traffic increasing?
  • What's the conversion of new vs. returning traffic?
  • How's the A/B test performing?
  • What's the sell through rate?
  • Is new inventory being found?
  • What does our conversion funnel look like?
  • Do shoppers like the new UX?

The Ecommerce Dataset

New Customers

Existing Customers

Search Engine (organic / paid)

Homepage

Category

Search

Product

Transaction

Paid Advertising

Email Promotion

Retargeting Campaign

The Ecommerce Dataset

$$$

Page 1

Page 2

Page 3

Page 4

Click-Stream Logs

483782 - PageView - ... - 10.1.1.123 - ... - http://...

849283 - PageView - ... - 10.1.1.001 - ... - http://...

948272 - PageView - ... - 10.4.1.293 - ... - http://...

483782 - PageView - ... - 10.1.1.123 - ... - http://...

948272 - PageView - ... - 10.4.1.293 - ... - http://...

The Problem

Free-Text Queries

red lace shoes

red shoelace

lace-up tennis shoes

black velcro lace shoes

lace up Toms running

Shoes

Running Shoes

Lace-Up Running Shoes

Input

Output

Product Taxonomies

  • Provides structure and "meaning" to product catalog
  • Maps to merchant workflow / P&L
  • Standardizes reporting
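To make the target of the classification concrete, here is a minimal sketch of a taxonomy stored as parent links, with a helper that returns the root-to-leaf path for a category. The names `Category`, `taxonomy`, and `pathTo` are hypothetical; only the three category names come from the slide, and a real catalog taxonomy would be far larger.

```scala
// Hypothetical representation: each category points at its parent (None = root)
case class Category(name: String, parent: Option[String])

val taxonomy: Map[String, Category] = List(
  Category("Shoes", None),
  Category("Running Shoes", Some("Shoes")),
  Category("Lace-Up Running Shoes", Some("Running Shoes"))
).map(c => c.name -> c).toMap

// Walk parent links to build the root-to-leaf path for a category name.
// Unknown names yield an empty path.
def pathTo(name: String): List[String] =
  taxonomy.get(name) match {
    case None    => Nil
    case Some(c) => c.parent.map(pathTo).getOrElse(Nil) :+ c.name
  }
```

The classifier's job is then to map a free-text query such as "lace up Toms running" to one node in this structure.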

Where Do We Start?

Discovery

Data Prep

Model Planning

Model Building

Communicate Results

Operationalize

Answer Two Key Questions

  1. What data will we need?
  2. Where does that data live?

Let's Look at the Logs

// Import spark libraries
import org.apache.spark.sql._
import sqlContext.implicits._

// Load raw Snowplow logs (read.load uses the default data source, Parquet unless configured otherwise)
val raw_snowplow_logs = sqlContext.read.load("path/to/raw/logs")

// Show the schema
raw_snowplow_logs.printSchema()

/* == Output ==
|-- app_id: string (nullable = true)
|-- doc_height: long (nullable = true)
|-- doc_width: long (nullable = true)
|-- domain_sessionid: string (nullable = true)
|-- domain_sessionidx: long (nullable = true)
|-- dvce_tstamp: string (nullable = true)
|-- event: string (nullable = true)
|-- page_referrer: string (nullable = true)
|-- page_title: string (nullable = true)
|-- page_url: string (nullable = true)
 */

Design a Relevant Schema

Queries

  • Session ID
  • Query Terms
  • Timestamp

Product Views

  • Session ID
  • Product ID
  • Timestamp

Transactions

  • Session ID
  • Transaction ID
  • Timestamp

The QPT Model

Design, Test and Run ETL

// Query object
case class Query(
  sessionId: String,        // The session ID of the user originating the query
  terms: String,            // The free-text search query
  timestamp: String         // The timestamp of the originating query event
)

// ProductView object
case class ProductView(
  sessionId: String,        // The session ID of the user originating the pageview
  productId: String,        // The external product ID of the originating product page
  timestamp: String         // The timestamp of the originating pageview event
)

// Transaction object
case class Transaction(
  sessionId: String,        // The session ID of the user originating the transaction
  transactionId: String,    // The transaction ID of the converted session
  timestamp: String         // The timestamp of the originating transaction event
)

Design, Test and Run ETL

import scala.util.Try

// Setup logs for SQL access
raw_snowplow_logs.registerTempTable("snowplow")

// Query for relevant event meta-data from logs
val events = sqlContext.sql("""SELECT sessionId, terms, timestamp
                               FROM snowplow
                               WHERE pagetype = 'search'""")

// Map events onto query construct, dropping rows that fail to parse
val queries = events.map(r => Try(Query(r(0).toString,
                                        r(1).toString,
                                        r(2).toString)))
                    .filter(_.isSuccess)
                    .map(_.get)

// Repeat for product views and transactions
val product_views = ...
val transactions = ...
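The `Try` / filter / get pattern above can be tried without a cluster. The following is a minimal sketch on plain Scala collections, assuming hypothetical in-memory rows rather than real Snowplow data; it uses `toOption` with `flatMap`, which is equivalent to filtering on `isSuccess` and then calling `get`.

```scala
import scala.util.Try

// Same shape as the Query case class in the slides
case class Query(sessionId: String, terms: String, timestamp: String)

// Hypothetical raw rows; a null or missing field makes parsing throw
val rows: Seq[Seq[Any]] = Seq(
  Seq("s1", "red lace shoes", "2016-01-01 10:00:00"),
  Seq("s2", null, "2016-01-01 10:05:00"),   // null terms -> NullPointerException
  Seq("s3", "red shoelace")                 // missing timestamp -> IndexOutOfBoundsException
)

// Try(...) captures the failure; toOption + flatMap keeps only the successes
val parsed: Seq[Query] =
  rows.flatMap(r => Try(Query(r(0).toString, r(1).toString, r(2).toString)).toOption)
```

Silently dropping bad rows is convenient for a first pass, but it is worth counting the failures so that a schema change upstream does not quietly empty the dataset.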

Analyze Data Set

// Register data model tables (toDF needs the SQLContext implicits in scope)
import sqlContext.implicits._
queries.toDF().registerTempTable("queries")
product_views.toDF().registerTempTable("product_views")
transactions.toDF().registerTempTable("transactions")

// Q: How many people query each day?
// (group by calendar day, not raw timestamp; assumes timestamps parse as "yyyy-MM-dd ...")
sqlContext.sql("""SELECT to_date(timestamp) AS day, count(distinct sessionId) AS total_searchers
                  FROM queries
                  GROUP BY to_date(timestamp)""").show()

// Q: Do people actually purchase after making a query?
sqlContext.sql("""SELECT count(distinct q.sessionId) AS total_searchers_that_purchase
                  FROM queries q
                  JOIN transactions t
                  ON q.sessionId = t.sessionId""").show()
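For intuition about what these two queries compute, here is a minimal sketch of the same logic on plain Scala collections. The sample sessions and transactions are hypothetical, and the day is taken as the first 10 characters of the timestamp, which assumes a "yyyy-MM-dd ..." format.

```scala
case class Query(sessionId: String, terms: String, timestamp: String)
case class Transaction(sessionId: String, transactionId: String, timestamp: String)

// Hypothetical sample data
val queries = Seq(
  Query("s1", "red lace shoes", "2016-01-01 10:00:00"),
  Query("s1", "red shoelace", "2016-01-01 10:02:00"),
  Query("s2", "lace-up tennis shoes", "2016-01-01 11:00:00"),
  Query("s3", "black velcro lace shoes", "2016-01-02 09:00:00")
)
val transactions = Seq(
  Transaction("s1", "t100", "2016-01-01 10:30:00"),
  Transaction("s4", "t101", "2016-01-02 12:00:00")
)

// Q1: distinct searchers per day
val searchersPerDay: Map[String, Int] =
  queries.groupBy(_.timestamp.take(10)).map { case (day, qs) =>
    day -> qs.map(_.sessionId).distinct.size
  }

// Q2: sessions that both searched and purchased (the join on sessionId)
val searchers = queries.map(_.sessionId).toSet
val purchasers = transactions.map(_.sessionId).toSet
val searchersThatPurchase = (searchers intersect purchasers).size
val searchToPurchaseRate = searchersThatPurchase.toDouble / searchers.size
```

Note that, like the SQL join, this does not check that the purchase happened after the query; that would require comparing timestamps within each session.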

Selecting a Model to Train

Facebook FastText

  • Very fast, localized Word2Vec training
  • Leverages character n-grams (subword information) - makes it great for "small", highly similar documents (e.g., search terms)
  • Hierarchical classification (hierarchical softmax) - enables training over very "wide" label sets out of the box
  • Simple command line interface

red lace shoes  vs  r-e-d  l-a-c-e  s-h-o-e-s
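The word-level vs. character-level contrast can be sketched as follows. This is an illustrative character n-gram extractor, not fastText's actual implementation: the `<` and `>` boundary markers follow the convention described for fastText's subword model, while `charNgrams`, `subwordFeatures`, and the n-gram size of 3 are assumptions for the example. In the fastText CLI, n-gram lengths are controlled by `-minn` and `-maxn`; check their defaults for supervised mode, since subword features may be off unless set.

```scala
// Character n-grams of a single word, with '<' and '>' marking word boundaries
def charNgrams(word: String, minN: Int, maxN: Int): Seq[String] = {
  val padded = "<" + word + ">"
  for {
    n <- minN to maxN
    if n <= padded.length
    gram <- padded.sliding(n).toSeq
  } yield gram
}

// Token-level view vs. subword view of the same query
def subwordFeatures(query: String, minN: Int = 3, maxN: Int = 3): Map[String, Seq[String]] =
  query.split("\\s+").filter(_.nonEmpty).map(w => w -> charNgrams(w, minN, maxN)).toMap
```

With this view, "shoes" and "shoelace" share the n-grams `<sh`, `sho`, and `hoe`, so near-duplicate queries like "red lace shoes" and "red shoelace" end up with overlapping features even though their tokens differ.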

Setup Training and Test Data

// Load external product ID ==> category ID mapping CSV
sqlContext.read.format("com.databricks.spark.csv")
           .option("header", "true").option("inferSchema", "true")
           .load("path/to/category_mapping.csv")
           .registerTempTable("categories")

// Join Queries ==> Product Views ==> Categories
val queries_to_categories = sqlContext.sql("""SELECT c.categoryId, q.terms
                                              FROM queries q
                                              JOIN product_views p
                                                  ON q.sessionId = p.sessionId
                                              JOIN categories c
                                                  ON p.productId = c.productId""")

// Generate FastText formatted string
// -- FastText example row: __label__123 red lace shoes
val fast_text_strings = queries_to_categories.map(r => "__label__"
                                                    + r(0).toString
                                                    + " " + r(1).toString)

// Create test / train splits
val splits = fast_text_strings.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0)
val test = splits(1)
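The formatting and splitting steps can be sketched without Spark. This is a minimal version on plain collections; `fastTextLine` and `trainTestSplit` are hypothetical helper names. Unlike Spark's `randomSplit`, which samples each row independently and so produces only approximately 90/10 sizes, this sketch shuffles once and cuts at an exact index.

```scala
import scala.util.Random

// One training row in fastText's supervised format: "__label__<category> <text>"
def fastTextLine(categoryId: String, terms: String): String =
  "__label__" + categoryId + " " + terms.trim

// Shuffle with a fixed seed, then cut at the requested training fraction
def trainTestSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val shuffled = new Random(seed).shuffle(rows)
  shuffled.splitAt(math.round(rows.size * trainFraction).toInt)
}
```

A fixed seed keeps the split reproducible, which matters when comparing accuracy before and after a change to the training data.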

Train Model and Test Accuracy

# Run supervised training from FastText CLI
$ ./fasttext supervised -input queries_train.txt -output query_to_category

# Test the accuracy of the model - top 1
$ ./fasttext test query_to_category.bin queries_test.txt 1

# Test the accuracy of the model - top 5
$ ./fasttext test query_to_category.bin queries_test.txt 5
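The `test` command reports precision and recall at k (P@k and R@k). As a rough sketch of what those numbers mean, assuming the usual definitions (hits divided by predictions made, and hits divided by true labels), the helper below computes both from ranked predictions. `precisionRecallAtK` and the sample labels are hypothetical and not part of fastText.

```scala
// P@k and R@k over a test set.
// Each example is (set of true labels, predicted labels ranked best-first).
def precisionRecallAtK(examples: Seq[(Set[String], Seq[String])], k: Int): (Double, Double) = {
  val hits      = examples.map { case (truth, ranked) => ranked.take(k).count(truth.contains) }.sum
  val predicted = examples.map { case (_, ranked) => ranked.take(k).size }.sum
  val relevant  = examples.map(_._1.size).sum
  (hits.toDouble / predicted, hits.toDouble / relevant)
}

// Hypothetical test set: two queries, each with one true category
val examples = Seq(
  (Set("shoes"), Seq("shoes", "laces", "socks")),
  (Set("laces"), Seq("shoes", "laces", "socks"))
)
```

With one true label per query, P@1 is plain accuracy, and P@5 can be at most 0.2 even for a perfect model, so R@5 is usually the more informative number for the top-5 run.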

How could we improve the model?

Improving the Model

Intuition: Not all product views are created equal.

$$$

Page 1

Page 2

Page 3

Page 4

Some draw more attention than others.

The best convert.

Solution: Update our model to incorporate product dwell time and shopper conversion behavior.

Update Training Data

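// Adding dwell time requires creating and updating localized variables within session
// Assumes each row in events is a product pageview laid out as (sessionId, timestamp, productId)
// and that convertToMilliseconds (defined elsewhere) parses a timestamp string into a Long
val product_views = events.map(r => (r(0).toString, r)).groupByKey().flatMap { case (sessionId, rows) =>
  // Intra-session variables
  val productDwellTimes = scala.collection.mutable.Map[String, Long]()
  var lastVisitedProductHash = ""
  var lastVisitedProductTime = 0L

  // Sort events by time stamp
  rows.toList.sortBy(y => convertToMilliseconds(y(1).toString)).foreach { y =>
    val timestamp = convertToMilliseconds(y(1).toString)

    // Update dwell time of the previously visited product
    if (lastVisitedProductHash != "") {
      val dwell = timestamp - lastVisitedProductTime
      productDwellTimes(lastVisitedProductHash) =
        productDwellTimes.getOrElse(lastVisitedProductHash, 0L) + dwell
    }

    // Remember the product currently being viewed
    lastVisitedProductHash = y(2).toString
    lastVisitedProductTime = timestamp
  }

  // Emit (sessionId, productId, dwellTimeMs) for each product in the session
  productDwellTimes.map { case (productId, dwell) => (sessionId, productId, dwell) }
}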
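The dwell-time idea is easier to see on plain collections. The sketch below assumes a hypothetical `View` record with epoch-millisecond timestamps and defines dwell time as the gap until the next pageview in the same session; the last view in a session has no successor, so it gets no dwell time here. That is a simplifying assumption, not necessarily how the production pipeline handled it.

```scala
// Hypothetical product pageview: session, product, and epoch milliseconds
case class View(sessionId: String, productId: String, timestampMs: Long)

// Total dwell time per (session, product), summing repeat visits
def dwellTimes(views: Seq[View]): Map[(String, String), Long] =
  views.groupBy(_.sessionId).toSeq.flatMap { case (sessionId, vs) =>
    // Pair each view with the next one in time order
    vs.sortBy(_.timestampMs).sliding(2).collect {
      case Seq(a, b) => ((sessionId, a.productId), b.timestampMs - a.timestampMs)
    }
  }.groupBy(_._1).map { case (key, gaps) => key -> gaps.map(_._2).sum }

// Hypothetical session: p1 is viewed twice (30s, then 5s), p2 once (10s)
val views = Seq(
  View("s1", "p1", 0L), View("s1", "p2", 30000L),
  View("s1", "p1", 40000L), View("s1", "p3", 45000L),
  View("s2", "p9", 1000L)
)
```

Using `sliding(2)` avoids the mutable "last visited" bookkeeping, at the cost of materializing each session's views in memory.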

Update Training Data

// Create copies
// Assumes each row is (categoryId, terms, dwellTime, converted) and that
// mapQueriesToCategories() and max_dwelltime are defined elsewhere
val queries_to_categories = mapQueriesToCategories()
val fast_text_formatted_strings = queries_to_categories.flatMap(r => {
  // Determine how many copies of search bucket to provide
  val search_dwelltime = r(2).toString.toDouble
  val number_of_copies = 5
  var copies = if(search_dwelltime > 0 && search_dwelltime < max_dwelltime)
                 (search_dwelltime / max_dwelltime * number_of_copies).toInt + 1
               else number_of_copies

  // Check if dwell time was below the minimum threshold
  if(search_dwelltime < 10) copies = 1

  // Check if there was a conversion
  if(r(3).toString.toBoolean) copies += 3

  // Create labels (1 to copies yields exactly `copies` rows)
  (1 to copies).map(x => "__label__" + r(0).toString + " " + r(1).toString)
})

// Create test / train splits
val splits = fast_text_formatted_strings.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0)
val test = splits(1)
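The weighting rule is easier to reason about, and to unit test, as a pure function. The sketch below mirrors the rules in the code above; `copiesFor` is a hypothetical name, the defaults (5 copies, a minimum dwell of 10, a conversion bonus of 3) are the constants from the slide, and the maximum dwell time is passed in because its value is not given in the deck.

```scala
// Number of training copies for one (query, category) pair
def copiesFor(dwell: Double, converted: Boolean, maxDwell: Double,
              maxCopies: Int = 5, minDwell: Double = 10, conversionBonus: Int = 3): Int = {
  // Scale copies linearly with dwell time, capped at maxCopies
  val scaled =
    if (dwell > 0 && dwell < maxDwell) (dwell / maxDwell * maxCopies).toInt + 1
    else maxCopies
  // Very short views count once, regardless of scaling
  val base = if (dwell < minDwell) 1 else scaled
  // Conversions add a fixed bonus on top
  if (converted) base + conversionBonus else base
}
```

For example, with a hypothetical `maxDwell` of 100, a dwell of 50 without a purchase yields 3 copies, and the same view followed by a purchase yields 6. Duplicating rows is a simple way to weight examples when the trainer has no per-example weight option.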

Re-train and Test Model

# Run supervised training from FastText CLI
$ ./fasttext supervised -input queries_train.txt -output query_to_category

# Test the accuracy of the model - top 1
$ ./fasttext test query_to_category.bin queries_test.txt 1

# Test the accuracy of the model - top 5
$ ./fasttext test query_to_category.bin queries_test.txt 5

Questions?
