Co-Founder, Digdeep Digital
November 11, 2014
Be Predictive, Not Predictable
We are a global partner of Snowplow Analytics, an event analytics platform
{
  /**
   * User-level Iglu schema using data layer variables
   */
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for user trackPageView",
  "self": {
    "vendor": "com.au.MYSITE",
    "name": "user",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "event_id": {
      "type": "string"
    },
    "CookieId": {
      "type": "string"
    },
    "DeviceId": {
      "type": "string"
    },
    "expLabs": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["event_id", "CookieId", "DeviceId", "expLabs"],
  "additionalProperties": false
}

{
  schema: "iglu:com.au.MYSITE/user/jsonschema/1-0-0",
  data: {
    event_id: 'xyz123abc',
    CookieId: 'ahwiwcobwcob',
    DeviceId: 'bmeoiheorubverovbev',
    expLabs: ['A', 'B', 'C', 'D', 'E']
  }
}

Event-level data plus a self-describing JSON schema, pushing custom context data on every request.
We also push a custom key across all platforms so we can join third-party systems.
Redshift FTW!
An A/B test involves testing two versions of a web page (the control and the variation) with live traffic, and measuring the effect each version has on your conversion rate.
https://blog.bigcommerce.com/10-ecommerce-ab-tests/
AKA: Split Testing, randomised control design, RBT, hypothesis testing, t (z) test, ANOVA, MANOVA, online control experiments, variant testing, parametric, non parametric, Spearman, Kendall Tau, Chi Square, Wilson Binomial, Wilcoxon, Mann-Whitney, McNemar's Test, and so on...
Excellence in any undertaking can be found in details that most people barely notice, but those who know know. - President Abraham Lincoln
"I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail."
It is important to distinguish between what A/B testing tools can do for you and where you need to draw the line in the sand.
Especially when you make strong associations regarding your impact on the bottom line.
<script src"//cdn.optimizely.com/js/111111111.js"></script>chisq.test(matrix(c(A,B,C,D), ncol=2, byrow=T))Let us assume we are an Australian wide online retailer, selling beds and accessories. As a business we have decided to be part of an online sale event, let's call it Click Frenzy.
100 SKUs of our inventory will be part of this sale and, as a business, we want to measure the impact of this sales initiative.
In support of this mega sale, we also decide to promote it within our own domain, emphasising the sale messaging across our website.
Lastly, to ensure we get the best click for our buck, we further decide to run an A/B test on the Buy Now messaging, introducing a variant during the sale period.
- It is a record month in sales
- Click Frenzy worked!
- Green button is clear winner
Comparing the difference between the buttons for a statistically significant sample, we are 95% confident we are AWESOME!
(this happens more often than you think)
At best, it is a large assumption that an interaction event can claim the purchase event as its own.
Using some non-parametric test as a measure of success is a leap of faith when your product page change is assumed to impact revenue.
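For concreteness, here is a minimal sketch of the kind of test the chisq.test call on the earlier slide implies, reusing the click and impression counts from the Wilson interval code that follows; tabulating them as clicks vs. non-clicks per variant is my assumption about the layout, not something stated in the deck.

# Buy Now clicks vs. non-clicks for variants A and B
# (counts taken from the binconf example below)
clicks      <- c(1835, 1735)
impressions <- c(333389, 243302)
tab <- matrix(c(clicks, impressions - clicks), ncol = 2)
chisq.test(tab)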
library(Hmisc)
library(ggplot2)
SE_Binom <- data.frame(binconf(c(1835,1735), c(333389,243302), method="wilson") *100 )
SE_Binom$Experiment <- c("A", "B")
SE_Binom
#Buy Now Click through rate
qplot(ymin=Lower, ymax=Upper, x=Experiment, data=SE_Binom, geom="errorbar") +
labs(title="A/B Test", y="Buy Now CTR %") + geom_point(aes(y=PointEst)) +
theme(axis.title.x = element_text(vjust=0.5, face="bold", colour="#000000", size=20),
axis.text.x = element_text(vjust=0.5, size=16),axis.title.y = element_text(vjust=0.5,
face="bold", colour="#000000", size=20),axis.text.y = element_text(vjust=0.5, size=16),
plot.title = element_text(vjust=1.5, face="bold",colour="#000000", size=20))

Might as well get this dude as your spokesman for insights.
SKU
Price
Units
Competitor Price
Weather
Dummy Vars
Seasonality
Day of week
Month
Intervention
...
Forecast Model
Product A == ARIMA(1,0,0)×(0,1,1)
Product B == ARIMA(1,1,1)×(0,1,1)
Product C == ARIMA(0,1,1)×(0,0,1)
...
Whatever the process, your variable selection will expose a model per SKU (a sketch of fitting one follows below).
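As a rough illustration (not the deck's actual models), this is how a per-SKU model with external regressors can be fitted in R with the forecast package; the data, the regressor names and the ARIMA order that auto.arima settles on are all simulated and illustrative.

library(forecast)
set.seed(42)

# Simulated daily units sold for one SKU, with a few of the regressors listed above
n <- 365
xreg <- cbind(price            = 100 + rnorm(n, 0, 5),
              competitor_price = 95  + rnorm(n, 0, 5),
              click_frenzy     = as.numeric(seq_len(n) %in% 300:302))
units <- ts(20 + 5 * xreg[, "click_frenzy"] + arima.sim(list(ar = 0.5), n),
            frequency = 7)

# auto.arima chooses the ARIMA order per SKU; xreg carries the business drivers
fit <- auto.arima(units, xreg = xreg)
summary(fit)                          # coefficients plus the in-sample error we try to reduce
forecast(fit, xreg = xreg[1:14, ])    # 14-day-ahead forecast given assumed future regressors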
We know that our forecasting model can predict with an overall X% error. That means, if your objective is to increase revenue, your goal is to reduce prediction error.
For any A/B test, you have a rolling start when addressing experimental questions, because a set of efficient predictors per SKU has already been uncovered.
Remember: "I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail."
It is a bold claim that a button change has a direct relationship with revenue when so much of the sales driver is already known from a better model.
This is a good thing though, because working together, you can test your way to better predictors.
Alongside SKU, price, and dummy variables like day of week, seasonality, weather, and Click Frenzy sale / not sale, we also introduce the Buy Now variant as a predictor (sketched below), ...
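Sketching the same idea with made-up numbers: the Buy Now variant enters the per-SKU model as one more dummy regressor, and its coefficient relative to its standard error is what tells you whether the button change adds anything beyond the drivers the model already knows about.

library(forecast)
set.seed(1)

n <- 120
# Hypothetical regressors for one SKU: the sale flag plus Buy Now variant exposure
xreg <- cbind(click_frenzy    = rep(c(0, 1), c(100, 20)),
              buy_now_variant = rbinom(n, 1, 0.5))
units <- ts(20 + 3 * xreg[, "click_frenzy"] + arima.sim(list(ar = 0.4), n),
            frequency = 7)

fit <- auto.arima(units, xreg = xreg)
summary(fit)   # the buy_now_variant row answers "did the button do anything?" for this SKU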
[Chart: Product A sample / Product B sample]
Product A == ARIMA(1,0,0)×(0,1,1)
Product B == ARIMA(1,1,1)×(0,1,1)
Product C == ARIMA(0,1,1)×(0,0,1)
...
Blindly comparing Product A at $100 to other SKUs varying from $20 to $2000 will fail if you look only at an overall group comparison.
If our forecasting model is by SKU, blocking means our experiments become concurrent tests: every SKU is a mini experiment in itself (see the sketch below).
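To show the blocking mechanics only (the counts are simulated and the choice of a two-proportion test is mine, not necessarily the deck's), here is a per-SKU loop that gates on sample size and tests each SKU separately, mirroring the "mini experiment per SKU" idea behind the results that follow.

set.seed(7)
skus <- paste0("SKU_", 1:100)

results <- do.call(rbind, lapply(skus, function(sku) {
  exposure <- rpois(2, lambda = sample(c(50, 5000), 1))        # A and B impressions
  clicks   <- rbinom(2, exposure, prob = c(0.010, 0.012))
  if (any(exposure < 1000)) {
    return(data.frame(sku = sku, enough_sample = FALSE, p_value = NA))
  }
  data.frame(sku = sku, enough_sample = TRUE,
             p_value = prop.test(clicks, exposure)$p.value)    # two-proportion test
}))

sum(!results$enough_sample)                  # SKUs we cannot test at all
sum(results$p_value < 0.05, na.rm = TRUE)    # SKUs where the variant looks different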
Of the 100 SKUs introduced to the Click Frenzy sale, only 20 performed significantly better as a result of Click Frenzy.
60 out of 100 SKUs did not have enough sample to test the button variation; of those that could be tested, only 1 SKU performed significantly better.
[Chart: Product A sample / Product B sample]
In a setting heavily driven by sales, A/B testing is a complementary tool in the toolbox.
Rather than model users, we flip it: the test becomes a product input.
Practicality is the order of the day: support what is already known.
Incremental testing, long-term objectives. The value lies in your contribution to the long-term cause. What is your worth to a billion-dollar organisation when you reduce error by 1% over the course of a year?
100 failed tests mean nothing when you are fast to act and sensible about your goals.
"My point is not that infinite scroll is stupid. It may be great on your website. But we should have done a better job of understanding the people using our website" – Dan McKinley, Principal Engineer at Etsy