A Day in the Life of an Ecommerce Data Scientist
Founder & Chief Data Scientist
edgecase
Stanford - Human Computer Interaction
Bazaarvoice - Product Manager
Edgecase - CEO
Edgecase - CPO
Edgecase - Data Science
Team of 1 / Off-Stack / 6 mo. - 1 yr IP Timeline
Both define AND execute on research objectives
Wear all "data" hats - Executive, Scientist, Product Manager, Analyst
Preparation
Dissemination
Analysis
Reflection
Discovery
Data Prep
Model Planning
Model Building
Communicate Results
Business Questions
Business Operations
Operationalize
Homepage
Category
Search
Product
Transaction
Executive
Analyst
Merchants
Product Manager
New Customers
Existing Customers
Search Engine (organic / paid)
Homepage
Category
Search
Product
Transaction
Paid Advertising
Email Promotion
Retargeting Campaign
$$$
Page 1
Page 2
Page 3
Page 4
Click-Stream Logs
483782 - PageView - ... - 10.1.1.123 - ... - http://...
849283 - PageView - ... - 10.1.1.001 - ... - http://...
948272 - PageView - ... - 10.4.1.293 - ... - http://...
483782 - PageView - ... - 10.1.1.123 - ... - http://...
948272 - PageView - ... - 10.4.1.293 - ... - http://...
Free-Text Queries
red lace shoes
red shoelace
lace-up tennis shoes
black velcro lace shoes
lace up Toms running
Shoes
Running Shoes
Lace-Up Running Shoes
Input
Output
Discovery
Data Prep
Model Planning
Model Building
Communicate Results
Operationalize
Answer Two Key Questions
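// Import Spark libraries (assumes a spark-shell session where sqlContext is already defined)
import org.apache.spark.sql._
import sqlContext._
import sqlContext.implicits._
import scala.util.Try
// Load raw Snowplow logs
val raw_snowplow_logs = sqlContext.read.load("path/to/raw/logs")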
// Show the schema
raw_snowplow_logs.printSchema()
/* == Output ==
|-- app_id: string (nullable = true)
|-- doc_height: long (nullable = true)
|-- doc_width: long (nullable = true)
|-- domain_sessionid: string (nullable = true)
|-- domain_sessionidx: long (nullable = true)
|-- dvce_tstamp: string (nullable = true)
|-- event: string (nullable = true)
|-- page_referrer: string (nullable = true)
|-- page_title: string (nullable = true)
|-- page_url: string (nullable = true)
*/
The QPT Model
Queries
Product Views
Transactions
// Query object
case class Query(
  sessionId: String, // The session ID of the user originating the query
  terms: String,     // The free-text search query
  timestamp: String  // The timestamp of the originating query event
)
// ProductView object
case class ProductView(
  sessionId: String, // The session ID of the user originating the pageview
  productId: String, // The external product ID of the originating product page
  timestamp: String  // The timestamp of the originating pageview event
)
// Transaction object
case class Transaction(
  sessionId: String,     // The session ID of the user originating the transaction
  transactionId: String, // The transaction ID of the converted session
  timestamp: String      // The timestamp of the originating transaction event
)
// Needed for Try below
import scala.util.Try
// Set up logs for SQL access
raw_snowplow_logs.registerTempTable("snowplow")
// Query for relevant event metadata from logs
val events = sqlContext.sql("""SELECT sessionId, terms, timestamp
  FROM snowplow
  WHERE pagetype = 'search'""")
// Map events onto the Query construct, dropping rows that fail to convert
val queries = events.map(r => Try(Query(r(0).toString,
                                        r(1).toString,
                                        r(2).toString)))
  .filter(_.isSuccess)
  .map(_.get)
// Repeat for product views and transactions
val product_views = ...
val transactions = ...
// Register data model tables (toDF() requires the sqlContext implicits)
import sqlContext.implicits._
queries.toDF().registerTempTable("queries")
product_views.toDF().registerTempTable("product_views")
transactions.toDF().registerTempTable("transactions")
// Q: How many people query each day?
// Assumes timestamp is a string that to_date can parse, e.g. "yyyy-MM-dd HH:mm:ss"
sqlContext.sql("""SELECT to_date(timestamp) AS day, count(DISTINCT sessionId) AS total_searchers
  FROM queries
  GROUP BY to_date(timestamp)""")
// Q: Do people actually purchase after making a query?
sqlContext.sql("""SELECT count(DISTINCT q.sessionId) AS total_searchers_that_purchase
  FROM queries q
  JOIN transactions t
  ON q.sessionId = t.sessionId""")
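The two questions above can also be sanity-checked on a small in-memory sample without Spark. The following is a minimal sketch using plain Scala collections; the `SearchEvent` and `Purchase` case classes are hypothetical stand-ins for the `queries` and `transactions` tables, and the sketch assumes timestamps begin with an ISO date.

```scala
object SearchConversion {
  // Hypothetical, simplified stand-ins for the Query and Transaction records
  final case class SearchEvent(sessionId: String, terms: String, timestamp: String)
  final case class Purchase(sessionId: String, transactionId: String, timestamp: String)

  // Distinct searching sessions per calendar day.
  // Assumption: timestamps start with an ISO date, e.g. "2016-03-01 10:15:00".
  def searchersPerDay(queries: Seq[SearchEvent]): Map[String, Int] =
    queries
      .groupBy(_.timestamp.take(10))
      .map { case (day, qs) => day -> qs.map(_.sessionId).distinct.size }

  // Distinct sessions that both searched and purchased
  // (the in-memory equivalent of the JOIN on sessionId)
  def searchersThatPurchase(queries: Seq[SearchEvent], transactions: Seq[Purchase]): Int = {
    val buyers = transactions.map(_.sessionId).toSet
    queries.map(_.sessionId).distinct.count(buyers.contains)
  }
}
```

Like the SQL version, this counts a session if it contains both a query and a transaction, regardless of which came first.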
Facebook FastText
red lace shoes
vs
r-e-d l-a-c-e s-h-o-e-s
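The contrast on this slide is between treating a query as whole words and breaking it into character-level pieces. FastText can optionally enrich word tokens with character n-grams (controlled by its `-minn` and `-maxn` options), using `<` and `>` as word-boundary markers. The following is a minimal sketch of that idea only, not FastText's own implementation; the object and function names are hypothetical.

```scala
object CharNgrams {
  // Character n-grams of a single word, with '<' and '>' marking word boundaries
  def ngrams(word: String, minN: Int, maxN: Int): Seq[String] = {
    val padded = "<" + word + ">"
    for {
      n <- minN to maxN
      // sliding can return a shorter final window when the word is short, so filter by length
      gram <- padded.sliding(n).toSeq if gram.length == n
    } yield gram
  }

  // Word-level tokens for a whole query
  def wordTokens(query: String): Seq[String] = query.trim.split("\\s+").toSeq

  // Character-level n-grams for every word in a query
  def subwordTokens(query: String, minN: Int, maxN: Int): Seq[String] =
    wordTokens(query).flatMap(ngrams(_, minN, maxN))
}
```

With trigrams, "shoelace" and "lace" share pieces such as `lac` and `ace`, which is why subword features can help with short, misspelled, or compound search queries.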
// Load external product ID ==> category ID mapping CSV
sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").option("inferSchema", "true")
  .load("path/to/category_mapping.csv")
  .registerTempTable("categories")
// Join Queries ==> Product Views ==> Categories
val queries_to_categories = sqlContext.sql("""SELECT c.categoryId, q.terms
  FROM queries q
  JOIN product_views p
  ON q.sessionId = p.sessionId
  JOIN categories c
  ON p.productId = c.productId""")
// Generate FastText formatted strings
// -- FastText example row: __label__123 red lace shoes
val fast_text_strings = queries_to_categories.map(r => "__label__" +
  r(0).toString +
  " " + r(1).toString)
// Create test / train splits
val splits = fast_text_strings.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0)
val test = splits(1)
# Run supervised training from the FastText CLI
$ ./fasttext supervised -input queries_train.txt -output query_to_category
# Report precision and recall of the model at top 1
$ ./fasttext test query_to_category.bin queries_test.txt 1
# Report precision and recall of the model at top 5
$ ./fasttext test query_to_category.bin queries_test.txt 5
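The `test` command reports precision and recall at the given k. The following is a minimal sketch of how those two numbers can be computed when each query has exactly one true category; the `Example` type and function names are hypothetical and are not part of FastText.

```scala
object TopKEval {
  // One evaluated query: the true category and the model's ranked predictions
  final case class Example(trueLabel: String, ranked: Seq[String])

  // Number of examples whose true label appears in the top k predictions
  private def hits(examples: Seq[Example], k: Int): Int =
    examples.count(e => e.ranked.take(k).contains(e.trueLabel))

  // Precision@k: correct predictions divided by all predictions made within the top k
  def precisionAtK(examples: Seq[Example], k: Int): Double = {
    val made = examples.map(_.ranked.take(k).size).sum
    if (made == 0) 0.0 else hits(examples, k).toDouble / made
  }

  // Recall@k: correct predictions divided by the number of true labels
  // (assumption: exactly one true label per example)
  def recallAtK(examples: Seq[Example], k: Int): Double =
    if (examples.isEmpty) 0.0 else hits(examples, k).toDouble / examples.size
}
```

With a single true label per query, precision and recall coincide at k = 1, while at k = 5 recall answers the more useful question: how often the right category shows up anywhere in the top five.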
Intuition: Not all product views are created equal.
$$$
Page 1
Page 2
Page 3
Page 4
Some draw more attention than others.
The best convert.
Solution: Update our model to incorporate product dwell time and shopper conversion behavior.
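// Adding dwell time requires creating and updating localized variables within each session
// Assumptions: each event row is (sessionId, timestamp, productId), productId is null on
// non-product pages, and convertToMilliseconds(String): Long is defined elsewhere
val product_views = events.map(r => (r(0).toString, r)).groupByKey().flatMap { case (sessionId, rows) =>
  // Intra-session variables
  val productDwellTimes = scala.collection.mutable.Map[String, Long]()
  var lastVisitedProductHash = ""
  var lastVisitedProductTime = 0L
  // Sort events by timestamp
  rows.toList.sortBy(y => convertToMilliseconds(y(1).toString)).foreach { y =>
    val timestamp = convertToMilliseconds(y(1).toString)
    // Update dwell time of the previously visited product
    if (lastVisitedProductHash != "") {
      productDwellTimes(lastVisitedProductHash) =
        productDwellTimes.getOrElse(lastVisitedProductHash, 0L) + (timestamp - lastVisitedProductTime)
    }
    // Remember the current product (cleared to "" on non-product pages) and its visit time
    lastVisitedProductHash = Option(y(2)).map(_.toString).getOrElse("")
    lastVisitedProductTime = timestamp
  }
  // Emit one (sessionId, productId, dwellTime) record per product viewed in the session;
  // the last page of a session has no following event, so it receives no dwell time
  productDwellTimes.map { case (productId, dwell) => (sessionId, productId, dwell) }
}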
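The dwell-time idea is easier to verify outside Spark. The following is a minimal sketch using plain Scala collections, where dwell time for a product is measured as the gap between its pageview and the next event in the same session. The `PageEvent` type and its fields are hypothetical, and the talk's own definition of dwell time may differ.

```scala
object DwellTime {
  // Hypothetical page event: productId is None for non-product pages
  final case class PageEvent(sessionId: String, timestampMs: Long, productId: Option[String])

  // Total dwell time per (sessionId, productId):
  // time from each product view to the next event in the same session
  def dwellTimes(events: Seq[PageEvent]): Map[(String, String), Long] =
    events.groupBy(_.sessionId).toSeq.flatMap { case (sessionId, sessionEvents) =>
      val sorted = sessionEvents.sortBy(_.timestampMs)
      // Pair each event with its successor; the final event has no successor
      sorted.zip(sorted.drop(1)).collect {
        case (PageEvent(_, start, Some(productId)), next) =>
          ((sessionId, productId), next.timestampMs - start)
      }
    }.groupBy(_._1).map { case (key, dwells) => key -> dwells.map(_._2).sum }
}
```

Repeat views of the same product within a session are summed, and a product viewed as the last page of a session contributes nothing because there is no later event to measure against.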
// Create copies
// Assumes rows are (categoryId, terms, dwell time, converted) and that
// max_dwelltime is defined elsewhere
val queries_to_categories = mapQueriesToCategories()
val fast_text_formatted_strings = queries_to_categories.flatMap(r => {
  // Determine how many copies of search bucket to provide
  val search_dwelltime = r(2).toString.toDouble
  val number_of_copies = 5
  var copies = if (search_dwelltime > 0 && search_dwelltime < max_dwelltime)
      (search_dwelltime / max_dwelltime * number_of_copies).toInt + 1
    else number_of_copies
  // Check if dwell time was below the minimum threshold
  if (search_dwelltime < 10) copies = 1
  // Check if there was a conversion
  if (r(3).toString.toBoolean) copies += 3
  // Create labels: emit exactly `copies` training rows
  (1 to copies).map(x => "__label__" + r(0).toString + " " + r(1).toString)
})
// Create test / train splits
val splits = fast_text_formatted_strings.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0)
val test = splits(1)
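The weighting rule above can be isolated into a pure function, which makes it easy to check how many training rows a given view produces. The following is a minimal sketch: the defaults 5, 10 and 3 mirror the constants in the slide, while the function names and the `maxDwell` value used in any call are hypothetical.

```scala
object TrainingWeights {
  // Number of times a (category, query) pair is repeated in the training file.
  // maxDwell has no value in the original; callers must supply their own.
  def copies(dwell: Double, converted: Boolean, maxDwell: Double,
             maxCopies: Int = 5, minDwell: Double = 10.0, conversionBonus: Int = 3): Int = {
    val byDwell =
      if (dwell < minDwell) 1                                             // barely looked at
      else if (dwell < maxDwell) (dwell / maxDwell * maxCopies).toInt + 1 // scaled by attention
      else maxCopies                                                      // capped at the maximum
    if (converted) byDwell + conversionBonus else byDwell
  }

  // FastText training lines: "__label__<categoryId> <query terms>"
  def trainingLines(categoryId: String, terms: String, n: Int): Seq[String] =
    Seq.fill(n)(s"__label__$categoryId $terms")
}
```

Repeating rows is a simple way to weight examples when the training tool has no per-example weight: a view that held attention or led to a purchase simply appears more often in the training file.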
# Re-run supervised training from the FastText CLI on the weighted training set
$ ./fasttext supervised -input queries_train.txt -output query_to_category
# Report precision and recall of the model at top 1
$ ./fasttext test query_to_category.bin queries_test.txt 1
# Report precision and recall of the model at top 5
$ ./fasttext test query_to_category.bin queries_test.txt 5