A/B Tests:

Doing Science on Your Website

Mike Sherov

A: Head of Engineering / B: Head of Architecture

A: Behance Team / B: Adobe

A Word of Caution

Netflix tests everything. They're very proud that they A/B test interactions, offerings, pricing, everything. It's almost enough to get you to believe that rigorous testing is the key to success.

Except they didn't test the model of renting DVDs by mail for a monthly fee.

And they didn't test the model of having an innovative corporate culture.

And they didn't test the idea of betting the company on a switch to online delivery.

The three biggest assets of the company weren't tested, because they couldn't be.
Sure, go ahead and test what's testable. But the real victories come when you have the guts to launch the untestable.

- Seth Godin

A Word of Caution pt. 2

If I had asked people what they wanted, they would have said faster horses.

- Not Henry Ford

Why Testing?

Features may be unoptimized
Features may be superfluous
Features may have a negative effect

Why A/B Testing?

Surveys capture expressed prefs, not revealed prefs
Pre/post testing is vulnerable to global effects like seasonality
Pre/post testing will often miss small optimizations, which is what many A/B tests measure

Testing Lifecycle

User Research

Hypothesize

Create A/B Test

Measure

Analyze

Draw Conclusions

Validate Learnings & Clean Up

User Research

Investigate existing user metrics on site
Find user issues from help desk
Conduct in-person interviews to identify problem areas

Hypothesize

Success Metric:
- What statistic you are trying to move
- An (educated) guess of how much it will move
Hypothesis:
- Why you think it will move
- Supporting user research / stats
Supporting Metrics:
- What other stats may move as a result
- What negative impact can this change have

Given a change you'd like to make, note...

Create A/B Test

"Logged in only" uses user id, and will be consistent for the user across computers
"Logged in / logged out" (LILO) uses a cookie, and will be consistent for a user across the logged-out and logged-in experience, but not across computers or if the cookie is lost or deleted

logged in only OR logged in / logged out?
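One way the two bucketing modes could be implemented is to hash a stable identifier into a variant. This is a minimal sketch, not the presenter's implementation: the function names, the "user:" / "cookie:" key prefixes, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

def assign_variant(test_name: str, unit_id: str, variants=("A", "B")) -> str:
    """Deterministically map a user id or cookie value to a variant."""
    # Salting with the test name (an assumed convention) lets the same
    # person land in different buckets across different tests.
    digest = hashlib.sha256(f"{test_name}:{unit_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def bucketing_key(user_id, cookie_id, logged_in_only: bool):
    """Pick the identifier to bucket on; None means 'not in the test'."""
    if logged_in_only:
        # The user id follows the user across computers.
        return None if user_id is None else f"user:{user_id}"
    # The cookie survives login/logout, but not a new computer or a cleared cookie.
    return f"cookie:{cookie_id}"
```

Because the assignment is a pure function of the key, no per-user state needs to be stored; the trade-off is that changing the test name or the variant list reshuffles everyone.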

Create A/B Test

Make sure you are measuring correctly!

Verify with the analytics team!
Measure numerator and denominator for the success metric. e.g. clicks/views
Ensure the supporting metrics are being logged as well. The stats may already exist, check first!

Measure

"X users saw A this hour, how many achieved the goal?"
This buckets timeslices into distinct observations!
Determine success ratio per observation for both A and B

Collect Observations
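The hourly bucketing described above can be sketched as follows. The event tuple layout (hour, variant, views, clicks) and the sample numbers are hypothetical, chosen only to show how numerator and denominator turn into one success ratio per timeslice.

```python
from collections import defaultdict

def hourly_success_ratios(events):
    """events: iterable of (hour, variant, views, clicks) tuples.

    Returns {variant: [ratio for each hour, in hour order]}.
    """
    totals = defaultdict(lambda: [0, 0])  # (hour, variant) -> [views, clicks]
    for hour, variant, views, clicks in events:
        totals[(hour, variant)][0] += views
        totals[(hour, variant)][1] += clicks

    observations = defaultdict(list)
    for (hour, variant), (views, clicks) in sorted(totals.items()):
        if views > 0:  # an hour with no denominator is not an observation
            observations[variant].append(clicks / views)
    return dict(observations)

# Hypothetical log rows: two hours, with one extra partial row for A in hour 1.
sample_events = [
    (0, "A", 100, 10), (0, "B", 100, 12),
    (1, "A", 200, 18), (1, "B", 200, 26),
    (1, "A", 50, 7),
]
ratios = hourly_success_ratios(sample_events)
```

Each list in `ratios` is then one population of observations, ready to be compared between A and B.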

Analyze

When a statistic is significant, it simply means that you are very sure that the statistic is reliable. It doesn't mean the finding is important or that it has any decision-making utility.

Achieve Statistical Significance

http://www.statpac.com/surveys/statistical-significance.htm

Analyze

The p-value measures statistical significance: it is the probability of seeing a difference at least as large as the one observed if the null hypothesis were true (the null hypothesis being that there is no real difference between the populations, and any observed difference is due to sampling or experimental error)
< 0.05 is your cutoff! A p-value above 0.05 means the result can't be distinguished from chance
If the effect is real, collecting more samples will lower the p-value
In general, the larger the effect, the quicker it'll achieve significance

p-Value

Analyze

A t-test takes the means and variances of the observations from the two populations and estimates (via the p-value) how likely a difference this large would be from random chance alone
A two-tailed test hypothesizes that the effect can be either positive or negative. (Almost no one uses a one-tailed t-test unless it's literally impossible for the effect to go in both directions)

Two tailed T Tests
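A two-tailed test on the per-hour success ratios could look like the sketch below. Two assumptions are swapped in here: it uses Welch's variant of the t-test (which does not require equal variances), and it approximates the t distribution with a normal distribution so that only the standard library is needed. That approximation is only reasonable with many observations; with SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` gives the exact p-value.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t_test(a, b):
    """Two-tailed Welch t-test on two lists of per-observation ratios.

    Returns (t_statistic, degrees_of_freedom, approximate_p_value).
    Assumes each list has at least two values and nonzero variance.
    """
    se_a = variance(a) / len(a)  # squared standard error of mean(a)
    se_b = variance(b) / len(b)
    t = (mean(b) - mean(a)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite degrees of freedom
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    # Normal approximation to the t distribution (an assumption, see above).
    # Doubling the one-sided tail is what makes the test two-tailed.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p
```

The sign of `t` tells you the direction of the effect (positive when B beats A), which is why a two-tailed test can catch a change that hurts as well as one that helps.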

Draw Conclusions

Results are often counterintuitive, or something that seems insignificant turns out to matter. Make sure results are reviewed by at least one other person
Can you validate the numbers in more than one way? Do sums add up, etc?

Ensure Rigor

Draw Conclusions

Are spammers affecting the results of the test?
Is a lack of translations in test vs. control affecting the results?
Do the results change if you slice the data by recency cohorts?

Investigate Anomalies
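Slicing by cohort can be as simple as recomputing the success ratio per (cohort, variant) pair and checking whether the winner flips. The sketch below assumes a hypothetical row layout of (cohort, variant, users, conversions); the cohort labels and numbers are invented for illustration.

```python
from collections import defaultdict

def conversion_by_cohort(rows):
    """rows: iterable of (cohort, variant, users, conversions) tuples.

    Returns {(cohort, variant): conversion_rate}.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [users, conversions]
    for cohort, variant, users, conversions in rows:
        totals[(cohort, variant)][0] += users
        totals[(cohort, variant)][1] += conversions
    return {key: conv / users for key, (users, conv) in totals.items() if users}

# Hypothetical data: B wins for new users but loses for established ones.
rows = [
    ("joined < 30 days", "A", 1000, 50), ("joined < 30 days", "B", 1000, 80),
    ("joined >= 30 days", "A", 4000, 400), ("joined >= 30 days", "B", 4000, 380),
]
rates = conversion_by_cohort(rows)
```

If the direction of the effect differs between cohorts, as in this made-up data, the overall number is hiding something worth a caveat in the conclusions.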

Draw Conclusions

How confident are you?
What was the change?
What was the effect (relatively & absolutely)?
Over what timeframe?
Any caveats?

Declare a winner, write your winning sentence

e.g. With a 95% confidence level, there is a 10% increase in X over 7 days worldwide, and a 35% increase in the United States if the user did Y. This applies to users who did Z, with large outliers removed.

The absolute increase is from 0.92 to 1.24 per user in the USA, and from 1.37 to 1.51 worldwide
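The relative and absolute figures in the example sentence are tied together by simple arithmetic, sketched here with the slide's own numbers. The helper name `lift` is an assumption for illustration.

```python
def lift(before: float, after: float):
    """Return (absolute_change, relative_change) between two per-user rates."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative

usa_abs, usa_rel = lift(0.92, 1.24)
world_abs, world_rel = lift(1.37, 1.51)
print(f"USA: +{usa_abs:.2f} per user ({usa_rel:.0%})")        # USA: +0.32 per user (35%)
print(f"Worldwide: +{world_abs:.2f} per user ({world_rel:.0%})")  # Worldwide: +0.14 per user (10%)
```

Reporting both matters because a large relative change on a small base can still be a small absolute change.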

Draw Conclusions

Note the time the feature was locked in
Lock in at 100%

Lock it in!

Validate Learnings

Was the hypothesis confirmed? If not, why?
Are there any more tests we want to follow up with?
Any surprising generalizations we can make about the user experience for future designs / tests?

What did you learn from the test?

Cleanup

Remove feature flag if done testing
Remove any other clean up code
Publish results internally to stakeholders

Leave the code as it was

What Questions Do You Have?