From Look to Book
A Day in the Life of an Ecommerce Data Scientist
About Me
Garrett Eastham
- AI + Ecommerce Focus
- CS @ Stanford
- Background in Web Analytics
- Career in Product Management
- Prior: Bazaarvoice, RetailMeNot
Founder & Chief Data Scientist


edgecase
About Edgecase


Product Data Enrichment
- Enable better online search and navigation experiences by providing rich, structured product attribute information at scale
- Leverage combination of human and machine technology
Today's Agenda
How Did I Get Here?
The Data Science Lifecycle
The Ecommerce Dataset
Example Problem: Categorize Search Queries
General Q&A
How Did I Get Here?
Stanford - Human Computer Interaction
- HeyElroy - decision assistance UI
- Preference extraction
How Did I Get Here?
Stanford - Human Computer Interaction
Bazaarvoice - Product Manager
- Web analytics and the crazy world of "proving" ROI
- Developed the ViewStream project
- Founding PM for project "Magpie"
How Did I Get Here?
Stanford - Human Computer Interaction
Bazaarvoice - Product Manager
Edgecase - CEO
- Left BV in spring 2012 with ~$20k in savings
- Built advisory board, first offering, and signed 2 customers
- Raised seed round ($750k) and then venture round ($3.5m) to get company off the ground in 2013
How Did I Get Here?
Stanford - Human Computer Interaction
Bazaarvoice - Product Manager
Edgecase - CEO
Edgecase - CPO
- Embraced the need for a pivoted product strategy
- Rebuilt entire platform & kept existing customer commitments
- Hired product leadership to replace me
How Did I Get Here?
Stanford - Human Computer Interaction
Bazaarvoice - Product Manager
Edgecase - CEO
Edgecase - CPO
Edgecase - Data Science
- FINALLY get to work full time on data science
- Ramped up on latest technologies (chose Spark)
- Size of data (10 GB+ / day) made it necessary to also learn data engineering (AWS)
Data Science @ Edgecase
Team of 1 / Off-Stack / 6 mo. - 1 yr IP Timeline
Both define AND execute on research objectives
Wear all "data" hats - Executive, Scientist, Product Manager, Analyst
Data Science @ Edgecase
Machine Curation
Taxonomy Development
ROI Measurement
Predictive Analytics
The Data Science Lifecycle
Preparation
Dissemination
Analysis
Reflection
Discovery
Data Prep
Model Planning
Model Building
Communicate Results
Business Questions
Business Operations
Operationalize
The Data Science Lifecycle
Preparation
Dissemination
Analysis
Reflection
82%
5%
9%
4%
The Ecommerce Dataset
Homepage
Category
Search
Product
Transaction
Executive
Analyst
Merchants
Product Manager
- Did site conversion increase?
- Is traffic increasing?
- What's the conversion of new vs. returning traffic?
- How's the A/B test performing?
- What's the sell through rate?
- Is new inventory being found?
- What does our conversion funnel look like?
- Do shoppers like the new UX?
The Ecommerce Dataset
New Customers
Existing Customers
Search Engine (organic / paid)
Homepage
Category
Search
Product
Transaction
Paid Advertising
Email Promotion
Retargeting Campaign
The Ecommerce Dataset
$$$
Page 1
Page 2
Page 3
Page 4
Click-Stream Logs
483782 - PageView - ... - 10.1.1.123 - ... - http://...
849283 - PageView - ... - 10.1.1.001 - ... - http://...
948272 - PageView - ... - 10.4.1.293 - ... - http://...
483782 - PageView - ... - 10.1.1.123 - ... - http://...
948272 - PageView - ... - 10.4.1.293 - ... - http://...
The Problem
Free-Text Queries
red lace shoes
red shoelace
lace-up tennis shoes
black velcro lace shoes
lace up Toms running
Shoes
Running Shoes
Lace-Up Running Shoes
Input
Output
Product Taxonomies
- Provides structure and "meaning" to product catalog
- Maps to merchant workflow / P&L
- Standardizes reporting
Where Do We Start?
Discovery
Data Prep
Model Planning
Model Building
Communicate Results
Operationalize
Answer Two Key Questions
- What data will we need?
- Where does that data live?
Let's Look at the Logs
// Import Spark SQL libraries (assumes an existing sqlContext, e.g. in spark-shell)
import org.apache.spark.sql._
import sqlContext._
// Load raw Snowplow logs
val raw_snowplow_logs = sqlContext.read.load("path/to/raw/logs")
// Show the schema
raw_snowplow_logs.printSchema()
/* == Output ==
|-- app_id: string (nullable = true)
|-- doc_height: long (nullable = true)
|-- doc_width: long (nullable = true)
|-- domain_sessionid: string (nullable = true)
|-- domain_sessionidx: long (nullable = true)
|-- dvce_tstamp: string (nullable = true)
|-- event: string (nullable = true)
|-- page_referrer: string (nullable = true)
|-- page_title: string (nullable = true)
|-- page_url: string (nullable = true)
*/
Design a Relevant Schema
Queries
- Session ID
- Query Terms
- Timestamp
Product Views
- Session ID
- Product ID
- Timestamp
Transactions
- Session ID
- Transaction ID
- Timestamp
The QPT Model
Design, Test and Run ETL
// Query object
case class Query(
  sessionId: String, // The session ID of the user originating the query
  terms: String,     // The free-text search query
  timestamp: String  // The timestamp of the originating query event
)
// ProductView object
case class ProductView(
  sessionId: String, // The session ID of the user originating the pageview
  productId: String, // The external product ID of the originating product page
  timestamp: String  // The timestamp of the originating pageview event
)
// Transaction object
case class Transaction(
  sessionId: String,     // The session ID of the user originating the transaction
  transactionId: String, // The transaction ID of the converted session
  timestamp: String      // The timestamp of the originating transaction event
)
Design, Test and Run ETL
// Needed for the Try wrapper below
import scala.util.Try
// Set up logs for SQL access
raw_snowplow_logs.registerTempTable("snowplow")
// Query for relevant event metadata from logs
val events = sqlContext.sql("""SELECT sessionId, terms, timestamp
  FROM snowplow
  WHERE pagetype = 'search'""")
// Map events onto the Query construct, dropping rows that fail to parse
val queries = events.map(r => Try(Query(r(0).toString,
                                        r(1).toString,
                                        r(2).toString)))
  .filter(_.isSuccess)
  .map(_.get)
// Repeat for product views and transactions
val product_views = ...
val transactions = ...
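The Try / filter / map pattern in this ETL step can be tried out without a cluster. The following is a minimal sketch on a plain Scala collection, not the pipeline from the talk: the `parseQuery` and `parseAll` helpers and the `rawRows` sample data are hypothetical and exist only for illustration.

```scala
import scala.util.Try

object QueryEtlSketch {
  // Same shape as the Query case class from the talk
  case class Query(sessionId: String, terms: String, timestamp: String)

  // Hypothetical helper: build a Query from a raw row.
  // Rows that are too short or contain nulls end up as a Failure.
  def parseQuery(row: Seq[String]): Try[Query] =
    Try(Query(row(0).toString, row(1).toString, row(2).toString))

  // Keep only the rows that parsed successfully
  def parseAll(rows: Seq[Seq[String]]): Seq[Query] =
    rows.map(parseQuery).filter(_.isSuccess).map(_.get)

  // Made-up sample rows: the second is too short, the third has a null field
  val rawRows: Seq[Seq[String]] = Seq(
    Seq("s1", "red lace shoes", "2016-01-10 09:00:00"),
    Seq("s2"),
    Seq("s3", null, "2016-01-10 09:05:00")
  )

  val queries: Seq[Query] = parseAll(rawRows)
}
```

Wrapping each row in Try trades visibility for robustness: malformed rows disappear silently, so it may be worth counting the failures before discarding them.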
Analyze Data Set
// Register data model tables (toDF() comes from sqlContext.implicits._)
import sqlContext.implicits._
queries.toDF().registerTempTable("queries")
product_views.toDF().registerTempTable("product_views")
transactions.toDF().registerTempTable("transactions")
// Q: How many people query each day?
// (assumes the timestamp strings are in a format to_date can parse)
sqlContext.sql("""SELECT to_date(timestamp) as day, count(distinct(sessionId)) as total_searchers
  FROM queries
  GROUP BY to_date(timestamp)""")
// Q: Do people actually purchase after making a query?
sqlContext.sql("""SELECT count(distinct(q.sessionId)) as total_searchers_that_purchase
  FROM queries q
  JOIN transactions t
  ON q.sessionId = t.sessionId""")
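The two questions on this slide can also be checked on a small in-memory sample before running them on the full logs. This is a rough sketch under stated assumptions: the sample sessions are invented, the function names are hypothetical, and timestamps are assumed to start with a `yyyy-MM-dd` date.

```scala
object SearchFunnelSketch {
  case class Query(sessionId: String, terms: String, timestamp: String)
  case class Transaction(sessionId: String, transactionId: String, timestamp: String)

  // Distinct searching sessions per day (day = first 10 characters of the timestamp)
  def searchersPerDay(queries: Seq[Query]): Map[String, Int] =
    queries.groupBy(_.timestamp.take(10)).map { case (day, qs) =>
      day -> qs.map(_.sessionId).distinct.size
    }

  // Share of searching sessions that also contain a transaction
  def searchToPurchaseRate(queries: Seq[Query], transactions: Seq[Transaction]): Double = {
    val searchers = queries.map(_.sessionId).toSet
    val buyers = transactions.map(_.sessionId).toSet
    if (searchers.isEmpty) 0.0
    else searchers.intersect(buyers).size.toDouble / searchers.size
  }

  // Invented sample data
  val queries = Seq(
    Query("s1", "red lace shoes", "2016-01-09 10:00:00"),
    Query("s1", "red shoelace", "2016-01-09 10:02:00"),
    Query("s2", "lace up Toms running", "2016-01-09 11:00:00"),
    Query("s3", "black velcro lace shoes", "2016-01-10 08:30:00"),
    Query("s4", "lace-up tennis shoes", "2016-01-10 09:00:00")
  )
  val transactions = Seq(Transaction("s2", "t100", "2016-01-09 11:20:00"))
}
```

Counting distinct sessions rather than rows matters here: session `s1` searched twice on the same day but should only count as one searcher.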
Selecting a Model to Train
Facebook FastText
- Very fast, localized Word2Vec training
- Leverages character-level n-grams (subword information), which makes it great for "small", highly similar documents (e.g., search terms)
- Hierarchical softmax option - supports very "wide" label sets (many categories) out of the box
- Simple command line interface
red lace shoes vs. r-e-d l-a-c-e s-h-o-e-s
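To illustrate why character-level features help with short, similar queries, here is a small sketch of character n-gram extraction. It is an illustration of the general idea only, not FastText's implementation (which hashes n-grams over a configurable range of lengths); the function names and the fixed n-gram length of 3 used in the example are assumptions.

```scala
object CharNgramSketch {
  // Character n-grams of one word, padded with '<' and '>' as word-boundary markers
  def charNgrams(word: String, n: Int): Seq[String] =
    ("<" + word + ">").sliding(n).toSeq

  // N-grams for every whitespace-separated token in a query
  def queryNgrams(query: String, n: Int): Seq[String] =
    query.trim.split("\\s+").toSeq.filter(_.nonEmpty).flatMap(charNgrams(_, n))

  // Jaccard overlap between the n-gram sets of two queries
  def overlap(a: String, b: String, n: Int): Double = {
    val x = queryNgrams(a, n).toSet
    val y = queryNgrams(b, n).toSet
    if (x.isEmpty && y.isEmpty) 0.0
    else x.intersect(y).size.toDouble / x.union(y).size
  }
}
```

With trigrams, "red lace shoes" and "red shoelace" share many subword pieces even though "shoelace" never appears as a whole word in the first query, while "red lace shoes" and "black velcro" share almost none.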
Setup Training and Test Data
// Load external product ID ==> category ID mapping CSV
sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").option("inferSchema", "true")
  .load("path/to/category_mapping.csv")
  .registerTempTable("categories")
// Join Queries ==> Product Views ==> Categories
val queries_to_categories = sqlContext.sql("""SELECT c.categoryId, q.terms
  FROM queries q
  JOIN product_views p
  ON q.sessionId = p.sessionId
  JOIN categories c
  ON p.productId = c.productId""")
// Generate FastText formatted string
// -- FastText example row: __label__123 red lace shoes
val fast_text_strings = queries_to_categories.map(r => "__label__"
  + r(0).toString
  + " " + r(1).toString)
// Create test / train splits
val splits = fast_text_strings.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0)
val test = splits(1)
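The formatting and splitting steps above can be sketched locally as follows. This is a minimal illustration, not the Spark code from the talk: `toFastTextLine` and `trainTestSplit` are hypothetical helpers, and lower-casing the query terms is an added preprocessing assumption.

```scala
import scala.util.Random

object FastTextFormatSketch {
  // One training line in FastText's supervised format: "__label__<category> <query terms>"
  // (lower-casing is an assumption, not something the original slide does)
  def toFastTextLine(categoryId: String, terms: String): String =
    "__label__" + categoryId + " " + terms.trim.toLowerCase

  // Seeded shuffle, then cut at the requested training fraction
  def trainTestSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(rows)
    shuffled.splitAt(math.round(rows.size * trainFraction).toInt)
  }
}
```

Unlike this local version, Spark's `randomSplit` samples each row independently, so the resulting split sizes are only approximately 90% and 10%.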
Train Model and Test Accuracy
# Run supervised training from the FastText CLI
$ ./fasttext supervised -input queries_train.txt -output query_to_category
# Evaluate the model on the held-out queries - top 1 (reports precision and recall at 1)
$ ./fasttext test query_to_category.bin queries_test.txt 1
# Evaluate the model on the held-out queries - top 5
$ ./fasttext test query_to_category.bin queries_test.txt 5
How could we improve the model?
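To make the top-1 and top-5 evaluation concrete, the sketch below computes a top-k hit rate: the fraction of test queries whose true category appears among the first k predictions. The function name and sample data are hypothetical. When every query has exactly one label, this should correspond to the recall-at-k figure reported by the CLI, but treat that mapping as an assumption rather than a statement about FastText's internals.

```scala
object TopKSketch {
  // Fraction of examples whose true label appears among the first k ranked predictions
  def hitRateAtK(trueLabels: Seq[String], rankedPredictions: Seq[Seq[String]], k: Int): Double = {
    require(trueLabels.size == rankedPredictions.size, "one prediction list per example")
    if (trueLabels.isEmpty) 0.0
    else {
      val hits = trueLabels.zip(rankedPredictions).count { case (truth, preds) =>
        preds.take(k).contains(truth)
      }
      hits.toDouble / trueLabels.size
    }
  }

  // Invented example: three test queries, each with a ranked list of predicted categories
  val truth = Seq("shoes", "running", "laces")
  val predictions = Seq(
    Seq("shoes", "running"),
    Seq("shoes", "running"),
    Seq("shoes", "running")
  )
}
```

Comparing the k = 1 and k = 5 numbers shows how often the right category is "almost" predicted, which is useful when the downstream experience can present several candidate categories.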
Improving the Model
Intuition: Not all product views are created equal.
$$$
Page 1
Page 2
Page 3
Page 4
Some draw more attention than others.
The best convert.
Solution: Update our model to incorporate product dwell time and shopper conversion behavior.
Update Training Data
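// Adding dwell time requires creating and updating localized variables within each session
// Assumptions (not shown on the original slide): events holds product pageviews with
// r(0) = session ID, r(1) = timestamp, r(2) = product ID, and convertToMilliseconds
// is a helper defined elsewhere that returns a Long
val product_views = events.map(r => (r(0).toString, r)).groupByKey().flatMap { case (sessionId, rows) =>
  // Intra-session variables
  val productDwellTimes = scala.collection.mutable.Map[String, Long]()
  var lastVisitedProductHash = ""
  var lastVisitedProductTime = 0L
  // Sort events by timestamp
  rows.toList.sortBy(y => convertToMilliseconds(y(1).toString)).foreach { y =>
    val timestamp = convertToMilliseconds(y(1).toString)
    // Update dwell time of the previously visited product
    if (lastVisitedProductHash != "") {
      productDwellTimes(lastVisitedProductHash) =
        productDwellTimes.getOrElse(lastVisitedProductHash, 0L) + (timestamp - lastVisitedProductTime)
    }
    // Remember the current product for the next event
    lastVisitedProductHash = y(2).toString
    lastVisitedProductTime = timestamp
  }
  // Emit one (sessionId, productId, dwellTimeMs) record per product with a measured dwell time
  productDwellTimes.map { case (productId, dwell) => (sessionId, productId, dwell) }
}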
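The dwell-time idea can be checked on a tiny in-memory example. The sketch below is one possible formulation, not the code from the talk: the `View` record, the millisecond timestamps, and the choice to give the last view of a session no dwell time are all assumptions.

```scala
object DwellTimeSketch {
  // Hypothetical event record: one product page view at a time in milliseconds
  case class View(sessionId: String, productId: String, timestampMs: Long)

  // Dwell time per (session, product): time from a view until the next view in the same
  // session, summed over repeat visits. The final view of a session has no following
  // event, so it contributes no dwell time.
  def dwellTimes(views: Seq[View]): Map[(String, String), Long] =
    views.groupBy(_.sessionId).toSeq.flatMap { case (sessionId, sessionViews) =>
      val sorted = sessionViews.sortBy(_.timestampMs)
      sorted.zip(sorted.drop(1)).map { case (current, next) =>
        ((sessionId, current.productId), next.timestampMs - current.timestampMs)
      }
    }.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2).sum }

  // Invented sample: session s1 visits p1 twice, session s2 has a single view
  val views = Seq(
    View("s1", "p1", 0L),
    View("s1", "p2", 30000L),
    View("s1", "p1", 45000L),
    View("s1", "p3", 50000L),
    View("s2", "p9", 1000L)
  )
}
```

Pairing each view with its successor avoids mutable per-session state, at the cost of ignoring the last page of every session; a session timeout or an explicit exit event would be needed to measure that one.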
Update Training Data
// Create weighted copies of each training row
// (mapQueriesToCategories and max_dwelltime are assumed to be defined elsewhere)
val queries_to_categories = mapQueriesToCategories()
val fast_text_formatted_strings = queries_to_categories.flatMap(r => {
  // Determine how many copies of search bucket to provide
  val search_dwelltime = r(2).toString.toDouble
  val number_of_copies = 5
  var copies = if(search_dwelltime > 0 && search_dwelltime < max_dwelltime)
    (search_dwelltime / max_dwelltime * number_of_copies).toInt + 1
  else number_of_copies
  // Very short dwell times (below 10) earn only a single copy
  if(search_dwelltime < 10) copies = 1
  // Conversions earn three extra copies
  if(r(3).toString.toBoolean) copies += 3
  // Emit one labeled row per copy
  (1 to copies).map(x => "__label__" + r(0).toString + " " + r(1).toString)
})
// Create test / train splits
val splits = fast_text_formatted_strings.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0)
val test = splits(1)
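The copy-weighting rule above is easier to reason about as a pure function. The sketch below restates it under stated assumptions: the parameter names, the default values lifted from the slide (5 base copies, a threshold of 10, a bonus of 3), and the example maximum dwell time of 120 are illustrative, and the maximum is assumed to be larger than the threshold.

```scala
object CopyWeightSketch {
  // Number of training copies for one (query, category) row
  def copiesFor(dwellTime: Double, converted: Boolean, maxDwellTime: Double,
                baseCopies: Int = 5, minDwellTime: Double = 10.0, conversionBonus: Int = 3): Int = {
    val fromDwell =
      if (dwellTime < minDwellTime) 1                                   // barely looked at
      else if (dwellTime < maxDwellTime)
        (dwellTime / maxDwellTime * baseCopies).toInt + 1               // scaled by dwell time
      else baseCopies                                                    // capped at the base count
    if (converted) fromDwell + conversionBonus else fromDwell
  }

  // Repeat the FastText-formatted row once per copy
  def weightedLines(categoryId: String, terms: String, copies: Int): Seq[String] =
    Seq.fill(copies)("__label__" + categoryId + " " + terms)
}
```

For example, with a maximum dwell time of 120, a view with dwell time 60 yields 3 copies, or 6 if the session converted. Duplicating rows is a simple way to weight examples when the training tool has no per-example weight option, though it also inflates the test split if the duplication happens before splitting.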
Re-train and Test Model
# Re-run supervised training from the FastText CLI on the updated training data
$ ./fasttext supervised -input queries_train.txt -output query_to_category
# Evaluate the model on the held-out queries - top 1 (reports precision and recall at 1)
$ ./fasttext test query_to_category.bin queries_test.txt 1
# Evaluate the model on the held-out queries - top 5
$ ./fasttext test query_to_category.bin queries_test.txt 5
Questions?
General Assembly Presentation - January 10, 2016
By Garrett Eastham