Architecting great experiments
@kylerush
Head of Optimization, Optimizely
means
write it down
or
tweet it
Significance
The risk of encountering a false positive.
95% confidence (5% significance) = out of 100 A/A tests, about 5 will inaccurately report a difference.
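One way to see the "5 out of 100 A/A tests" claim is to simulate it. The sketch below is illustrative only: the function names, the 10% conversion rate, the branch size, and the pooled one-tailed z-test are all assumptions, not something the slides specify.

```python
import random
from statistics import NormalDist


def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """p-value for 'B converts better than A' (pooled two-proportion z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no variation at all, so nothing can be detected
    z = (conv_b / n_b - conv_a / n_a) / se
    return 1 - NormalDist().cdf(z)


def aa_false_positive_rate(rate=0.10, visitors=1000, runs=1000,
                           alpha=0.05, seed=1):
    """Share of simulated A/A tests that wrongly report a winner."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        # Both branches share the same true conversion rate (an A/A test).
        conv_a = sum(rng.random() < rate for _ in range(visitors))
        conv_b = sum(rng.random() < rate for _ in range(visitors))
        if one_sided_p_value(conv_a, visitors, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / runs


print(aa_false_positive_rate())  # expected to land near 0.05
```

The exact share varies with the seed, but it should hover around the chosen significance level.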
Power
The chance of detecting a real effect; 1 - power is the risk of encountering a false negative.
80% = out of 100 A/B tests with a real winner,
about 20 winners will not be reported.
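As a rough illustration of how power depends on sample size, the sketch below approximates the power of a one-tailed two-proportion z-test. The function name `detection_power` and the unpooled normal approximation are assumptions made for this example; the MDE is treated as a relative lift over the baseline.

```python
from statistics import NormalDist


def detection_power(baseline, mde, visitors_per_branch, alpha=0.05):
    """Approximate power of a one-tailed two-proportion z-test.

    baseline: control conversion rate, e.g. 0.04
    mde: relative lift to detect, e.g. 0.06 for +6%
    """
    nd = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + mde)
    # Standard error of the difference between the two observed rates.
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / visitors_per_branch) ** 0.5
    z_alpha = nd.inv_cdf(1 - alpha)
    return nd.cdf((p2 - p1) / se - z_alpha)


# With a 4% baseline and a 6% relative MDE, power climbs as traffic grows.
for n in (20_000, 40_000, 83_230):
    print(n, round(detection_power(0.04, 0.06, n), 2))
```

Under these assumptions, roughly 83,000 to 85,000 visitors per branch gives power close to 80%.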
One-tail vs. two-tail
One-tail = is the variation better?
Two-tail = is the variation
better OR worse?
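The practical difference between the two shows up in the threshold a result has to clear. A minimal sketch, assuming a normal (z) test and a hypothetical helper named `critical_z`:

```python
from statistics import NormalDist


def critical_z(alpha=0.05, tails=1):
    """z-score a result must exceed to be called significant."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha between the 'better' and 'worse' sides.
    return NormalDist().inv_cdf(1 - alpha / tails)


print(round(critical_z(0.05, tails=1), 3))  # 1.645
print(round(critical_z(0.05, tails=2), 3))  # 1.96
```

The one-tailed bar is lower, which is why one-tailed tests need fewer visitors, at the cost of not being able to report that a variation is worse.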
MDE
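Minimum detectable effect: the smallest relative change from the baseline conversion rate that the experiment is sized to detect.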
Successful testing strategies
are based around the minimum
detectable effect (MDE) variable.
Sample size
How many subjects
are in your experiment.
Always calculate the required
sample size with a sample size calculator
before starting an A/B test.
bit.ly/VUBti8
sample size calculator
bit.ly/SWR3YC
Example
Absolute lowest MDE
CONVERSION RATE: 4%
MDE: 1%
POWER: 80%
SIGNIFICANCE: 5%
TAILS: ONE
2,972,435
Visitors per branch
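For readers who want to see what a sample size calculator does with these inputs, here is a minimal sketch using a common normal-approximation formula. It is an assumption that this resembles the linked calculators; they may use a different formula, so the result will be close to the slide's figure rather than identical. The function name `visitors_per_branch` is hypothetical.

```python
import math
from statistics import NormalDist


def visitors_per_branch(baseline, mde, power=0.80, alpha=0.05, tails=1):
    """Visitors needed per branch for a two-proportion z-test.

    baseline: control conversion rate, e.g. 0.04
    mde: relative minimum detectable effect, e.g. 0.01 for +1%
    """
    nd = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + mde)
    z_alpha = nd.inv_cdf(1 - alpha / tails)
    z_beta = nd.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Inputs from the example: 4% baseline, 1% MDE, 80% power, 5%, one tail.
print(visitors_per_branch(0.04, 0.01))  # close to 3 million
```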
example
Focus on time
1 month = 170,000 unique visitors
CONVERSION RATE: 4%
MDE: 6%
POWER: 80%
SIGNIFICANCE: 5%
TAILS: ONE
83,230
Visitors per branch
example
Small startup
1 month = 3,000 unique visitors
CONVERSION RATE: 4%
MDE: 45%
POWER: 80%
SIGNIFICANCE: 5%
TAILS: ONE
1,567
Visitors per branch
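The two time-boxed examples above work backwards: fix the traffic you can get in a month, then ask what MDE that traffic supports. A sketch of that reverse calculation, assuming the same normal-approximation formula a typical calculator might use and a simple bisection search (both helper names are hypothetical):

```python
from statistics import NormalDist


def required_visitors(baseline, mde, power=0.80, alpha=0.05):
    """Visitors per branch for a one-tailed two-proportion z-test."""
    nd = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + mde)
    z = nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def smallest_mde(baseline, visitors, power=0.80, alpha=0.05):
    """Smallest relative MDE that `visitors` per branch can support."""
    lo, hi = 1e-6, (1 - baseline) / baseline  # hi keeps p2 at or below 100%
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_visitors(baseline, mid, power, alpha) > visitors:
            lo = mid  # not enough traffic for an effect this small
        else:
            hi = mid
    return hi


# Monthly traffic split evenly across two branches.
print(round(smallest_mde(0.04, 170_000 // 2), 3))  # about 0.06
print(round(smallest_mde(0.04, 3_000 // 2), 3))
```

Under this formula the first case lands near the slide's 6%, while the small startup case comes out somewhat above 45%; the gap reflects differences between calculators, not a different method.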
Sample size calculators tell you
how many subjects, but not which
subjects should be in your experiment.
sampling
- can be really hard
- weekday vs. weekend traffic
- campaign vs. organic traffic
- returning vs. new visitors
example
E-commerce website
homepage assumptions
- Lots of traffic
- Relatively few conversions
Let's estimate:
- 2.5% conversion rate
- 100,000 monthly unique visitors
Run experiment for 1 month (100k visitors)
CONVERSION RATE: 2.5%
POWER: 80%
SIGNIFICANCE: 5%
TAILS: ONE
MDE: 10%
Checkout page assumptions
- Lower traffic
- Relatively high conversion rate
Let's assume:
- 50% conversion rate
- 10,000 monthly unique visitors
Run experiment for 1 month (10k visitors)
CONVERSION RATE: 50%
POWER: 80%
SIGNIFICANCE: 5%
TAILS: ONE
MDE: 5%
Moving from the homepage to the checkout page just cut your MDE in half!
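Why does a page with a tenth of the traffic support a smaller MDE? A closed-form approximation makes the reason visible: the achievable MDE scales with the square root of (1 - p) / (p * n), so a high baseline rate p more than offsets the lower n. The sketch below assumes equal variance in both branches (a simplification) and uses a hypothetical helper named `approx_mde`.

```python
from statistics import NormalDist


def approx_mde(baseline, visitors_per_branch, power=0.80, alpha=0.05):
    """Approximate relative MDE for a one-tailed test.

    Simplifying assumption: both branches have the baseline's variance.
    """
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)
    odds_against = (1 - baseline) / baseline
    return (2 * z ** 2 * odds_against / visitors_per_branch) ** 0.5


# One month of traffic, split evenly across two branches.
homepage = approx_mde(0.025, 100_000 // 2)
checkout = approx_mde(0.50, 10_000 // 2)
print(f"homepage MDE: {homepage:.1%}")
print(f"checkout MDE: {checkout:.1%}")
```

With these inputs the homepage comes out near 10% and the checkout page near 5%, in line with the two slides above.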
Start by focusing a/b tests on the
last step in your conversion funnel.
What should I test?
Depends on MDE and time.
Landing page optimization
(MozCon 2013)
bit.ly/1wkpgye
Going beyond the low-hanging fruit
(Conversion Conference 2014)
bit.ly/1kU4sZ0
experiment
Terse vs. verbose
result
+31% leads
99.9% confidence
goals
- Measure as many goals as possible
  - micro: form field errors, time on page
  - macro: purchase, revenue
- Choose a primary goal
- Don't forget about down-the-funnel goals
  - Repeat purchase
  - Save payment information
example
Success vs. submit
13.72% conversion rate
17.15% conversion rate
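The two conversion rates above can be turned into a relative lift, which is the unit MDE is expressed in. A minimal sketch; the slide does not say which button text produced which rate, so treating the lower rate as the baseline is an assumption, and `relative_lift` is a hypothetical helper.

```python
def relative_lift(baseline_rate, variation_rate):
    """Relative change of the variation over the baseline."""
    return (variation_rate - baseline_rate) / baseline_rate


# Assumption: 13.72% is the baseline and 17.15% is the variation.
print(f"{relative_lift(0.1372, 0.1715):+.1%}")  # +25.0%
```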
Measure as many goals
as possible for every experiment.
standards document
For each page/funnel record:
- Three-month average of monthly unique visitors
- Stopping conditions (sample size)
- Goals
- baseline conversion rate
- MDE
- visits per branch
- baseline conversion rate over time
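The standards document can also be kept as structured data so that run times are computed rather than guessed. The sketch below is one possible shape, not the linked template: the class and field names are hypothetical, and grouping the baseline rate, MDE, and visits per branch under each goal is an assumption about how the list above is organized. The numbers reuse the "Focus on time" example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GoalStandard:
    name: str
    baseline_conversion_rate: float
    mde: float               # relative minimum detectable effect
    visits_per_branch: int   # stopping condition (sample size)


@dataclass
class PageStandard:
    page: str
    monthly_unique_visitors: List[int]  # the last three months
    goals: List[GoalStandard] = field(default_factory=list)

    @property
    def average_monthly_visitors(self):
        return sum(self.monthly_unique_visitors) / len(self.monthly_unique_visitors)

    def months_to_run(self, goal, branches=2):
        """Months of traffic needed to reach the goal's stopping condition."""
        return goal.visits_per_branch * branches / self.average_monthly_visitors


# Illustrative record; the page name and monthly figures are placeholders.
signup = GoalStandard("signup", 0.04, 0.06, 83_230)
standard = PageStandard("landing page", [170_000, 170_000, 170_000], [signup])
print(round(standard.months_to_run(signup), 2))  # 0.98
```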
testing standards template
bit.ly/1oVRf6i
Quality assurance
- No bugs in the variation
- No bugs in the control
- Tracking works correctly
Eliminating bias
Double-blind experiments
Experiment brief
- Hypothesis
- Audience description
- Goals tracked
- Stopping conditions
- Screenshots
- QA summary
experiment brief template
bit.ly/1nvNJjx
statistical tie
Not enough data to
conclude that there is a difference
The overwhelming majority of
experiment results are a statistical tie.
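In code, calling a result a statistical tie amounts to failing to clear the significance threshold. The sketch below uses a pooled two-proportion z-test with two tails, so that it can also flag a loser; that choice, the function name `classify`, and the visitor counts in the usage lines are all assumptions for illustration.

```python
from statistics import NormalDist


def classify(control_conv, control_n, variation_conv, variation_n, alpha=0.05):
    """Label a result as 'winner', 'loser', or 'statistical tie'."""
    p_control = control_conv / control_n
    p_variation = variation_conv / variation_n
    pooled = (control_conv + variation_conv) / (control_n + variation_n)
    se = (pooled * (1 - pooled) * (1 / control_n + 1 / variation_n)) ** 0.5
    if se == 0:
        return "statistical tie"  # no variation in the data at all
    z = (p_variation - p_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        return "statistical tie"  # not enough data to conclude a difference
    return "winner" if z > 0 else "loser"


# Hypothetical counts: 4.0% vs 4.2%, then 4.0% vs 5.0%, at 10,000 visitors each.
print(classify(400, 10_000, 420, 10_000))  # statistical tie
print(classify(400, 10_000, 500, 10_000))  # winner
```

A tie does not prove the two versions perform the same; it only says the data collected cannot tell them apart.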
result
Second test
Statistical tie
Third test
Statistical tie
Always, always record detailed
experiment results in an archive.
EXPERIMENT ARCHIVE
- experiment date
- audience/url
- screenshots
- hypothesis
- results
- link to experiment
- link to result csv
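An archive with these fields can be as simple as one CSV row per experiment. A minimal sketch, assuming a flat CSV layout; the class name, column names, and every value in the usage example are placeholders rather than the linked template's format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ArchiveEntry:
    experiment_date: str
    audience_url: str
    screenshots: str
    hypothesis: str
    results: str
    experiment_link: str
    result_csv_link: str


def append_entry(stream, entry, write_header=False):
    """Write one archive row (and optionally the header) to a text stream."""
    columns = [f.name for f in fields(ArchiveEntry)]
    writer = csv.DictWriter(stream, fieldnames=columns)
    if write_header:
        writer.writeheader()
    writer.writerow(asdict(entry))


# Placeholder entry written to an in-memory buffer instead of a real file.
archive = io.StringIO()
append_entry(archive, ArchiveEntry(
    experiment_date="2014-01-15",
    audience_url="https://example.com/checkout",
    screenshots="https://example.com/screenshots/1",
    hypothesis="Shorter copy increases completed checkouts",
    results="statistical tie",
    experiment_link="https://example.com/experiments/1",
    result_csv_link="https://example.com/results/1.csv",
), write_header=True)
print(archive.getvalue())
```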
Experiment archive template
bit.ly/1q9tRWI