The Beginner’s Guide to Conversion Rate Optimization
Chapter 10: Measuring Conversion Rate Experts and Calling Winners
By this point you may have started implementing many of the tools we discussed in the last chapter and you’re working to optimize your landing pages and lower your bounce and exit rates. You might even be the king or queen of CRO jargon among your friends—using phrases like “conversion flow” and “user experience” around the water cooler.
If your hypothesis was correct and your test was a clear winner…
Before you dive into the data, keep the following in mind:
But, as Evan Miller points out in his article “How Not To Run An A/B Test”:
You’ve got your most recent analytics reports and you’ve designed some simple A/B tests. But how do you know which tests are improvements and which ones aren’t moving the needle? And more importantly, what will you do once you’ve called the winners and optimized your first set of tests?
But what if your hypothesis wasn’t correct, and your test is a loser? Now what?
We warned you back in Chapter 4 of the dangers of calling a test too early. Bear with us here; we’re going to discuss this in a bit more detail.
In order for you to have confidence that you’ve in fact found a winner, your results need to be statistically significant. This means the margin of error, or likelihood that your results are merely chance, is low. In general, the larger the sample size, the smaller the margin of error.
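To make the link between sample size and margin of error concrete, here is a rough sketch using the standard normal approximation for a proportion. The function name margin_of_error and the 10% conversion rate are illustrative assumptions, not something your testing tool exposes.

```python
import math

def margin_of_error(conversion_rate, sample_size, z=1.96):
    """Approximate 95% margin of error for an observed conversion rate.

    Uses the normal approximation z * sqrt(p * (1 - p) / n); z=1.96
    corresponds to 95% confidence. Hypothetical helper for illustration.
    """
    p = conversion_rate
    return z * math.sqrt(p * (1 - p) / sample_size)

# Assumed example: a page converting at 10%, measured on growing samples.
for n in (100, 1_000, 10_000):
    moe = margin_of_error(0.10, n)
    print(f"{n:>6} visitors: 10% +/- {moe * 100:.2f} percentage points")
```

The margin shrinks with the square root of the sample size, so you need roughly four times the visitors to cut it in half.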
If the significance level is 5%, a difference as large as the one you observed would be produced by chance alone no more than 5% of the time if the two versions actually performed the same. This is commonly described as 95% confidence. Your testing software might use the terminology “95% chance of beating original” or “95% probability of statistical significance.”
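If you are curious what a significance check looks like under the hood, one common textbook approach is the two-proportion z-test. The sketch below is an assumption about how such a check could be done, not the method any particular testing tool uses; the function name and the visitor counts are made up for illustration.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    conv_a / conv_b are conversion counts, n_a / n_b are visitor counts.
    Hypothetical helper using the pooled two-proportion z-test.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (p_b - p_a) / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

# Assumed example: original converts 100 of 1,000, variation 130 of 1,000.
p = two_proportion_p_value(100, 1_000, 130, 1_000)
print(f"p-value: {p:.4f}")
print("significant at 5%" if p < 0.05 else "not significant at 5%")
```

A p-value below 0.05 corresponds to the 5% significance level described above, and it is only trustworthy if the sample size was fixed before the test started.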
So how do you know which optimization efforts are successful? Which version of your page or funnel is better?
[your testing software’s] significance calculation makes a critical assumption that you have probably violated without even realizing it: that the sample size was fixed in advance. If instead of deciding ahead of time, “this experiment will collect exactly 1,000 observations,” you say, “we’ll run it until we see a significant difference,” all the reported significance levels become meaningless.
Yikes. To avoid what Miller refers to as repeated significance testing error, it’s critical to set a sample size in advance and stick to it. Luckily, most testing software, such as Optimizely, Visual Website Optimizer, and Google Content Experiments, includes sample size guidance. For example, Optimizely recommends that at least 100 people see both variants of an A/B test before results begin to become significant. If your software lacks this function, consider using Miller’s free Sample Size Calculator.
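For a sense of how a sample size is worked out ahead of time, here is a minimal sketch based on a standard textbook formula for comparing two proportions. It is not necessarily the calculation behind Miller’s calculator or any testing tool, and the function name, the 10% baseline, and the 2-point lift are assumptions chosen for illustration. It needs Python 3.8 or later for statistics.NormalDist.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, min_detectable_lift,
                            alpha=0.05, power=0.80):
    """Visitors needed per variant to detect an absolute lift in conversion rate.

    baseline_rate: current conversion rate, e.g. 0.10 for 10%
    min_detectable_lift: smallest absolute improvement worth detecting
    alpha: significance level (two-sided); power: chance of detecting the lift
    Hypothetical helper using a textbook normal-approximation formula.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Assumed example: 10% baseline, and we care about a lift to 12%.
print(sample_size_per_variant(0.10, 0.02))
# Halving the lift we want to detect raises the requirement sharply.
print(sample_size_per_variant(0.10, 0.01))
```

The smaller the improvement you want to be able to detect, the more visitors each variant needs, which is why deciding on the number before the test starts matters so much.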
Something else to keep in mind is that tests may swing wildly over time. Just because one version of the test starts off strong, that doesn’t mean you can end the test prematurely. Stick to your predetermined sample size, and know that statistical fluctuations will even out over the duration of the test.
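You can see why stopping early is risky with a small simulation. The sketch below runs many imaginary tests in which both versions convert at exactly the same rate, so every declared “winner” is a false alarm. It compares checking once at a fixed sample size against checking every 100 visitors and stopping at the first significant result. All names, rates, and counts are assumptions made up for this illustration.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(peek, n_tests=300, visitors=2_000,
                        check_every=100, rate=0.10, seed=0):
    """Share of A/A tests (no real difference) that get called significant.

    peek=False: test once, at the predetermined sample size.
    peek=True: test every `check_every` visitors and stop at the first "win".
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        significant = False
        for i in range(1, visitors + 1):
            conv_a += rng.random() < rate  # both versions share the same true rate
            conv_b += rng.random() < rate
            at_checkpoint = (i % check_every == 0) if peek else (i == visitors)
            if at_checkpoint and p_value(conv_a, i, conv_b, i) < 0.05:
                significant = True
                break
        false_positives += significant
    return false_positives / n_tests

print("fixed sample size:", false_positive_rate(peek=False))
print("stop at first win:", false_positive_rate(peek=True))
```

With a fixed sample size the false alarm rate should land near the 5% you signed up for, while stopping at the first significant reading pushes it well above that.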
For example, you can flip a coin 5 times, and even though the odds of it landing on heads are 50%, getting 5 heads in a row is entirely possible (it happens about 3% of the time). However, over 100 flips or 1,000 flips, the share of heads settles close to 50%.
For this reason, one sampling of 1,000 users is far more valuable than several samplings of 10–15 users each.
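If you want to watch the coin example play out, this short sketch simulates fair coin flips at different sample sizes. The function name and the seed are arbitrary choices for illustration, and your exact numbers will differ if you change the seed.

```python
import random

def heads_share(flips, rng):
    """Fraction of simulated fair coin flips that land on heads."""
    return sum(rng.random() < 0.5 for _ in range(flips)) / flips

# Chance that 5 fair flips all land on heads: 0.5 multiplied by itself 5 times.
print(f"5 heads in a row: {0.5 ** 5:.1%}")

rng = random.Random(42)  # fixed seed so reruns give the same sequence
for flips in (5, 100, 1_000, 100_000):
    print(f"{flips:>7} flips: {heads_share(flips, rng):.1%} heads")
```

Small runs can land far from 50%, while the large runs cluster tightly around it, which is the same reason a single large sample of users tells you more than a handful of tiny ones.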
The winning version of a site is the one with consistently better metrics—it’s as simple as that. Remember, the metrics you’re looking at here are, of course, conversion rates, but also whatever other metrics you’ve determined affect conversion (engagement metrics in particular).
Big, clear-cut wins are always great. But that’s not always how tests turn out. For example, a change that lowers bounce rates for the test site but doesn’t improve (though doesn’t worsen) conversion rates could still be considered a “winner”: an improvement worth keeping, because, after all, your goal is to improve your site piece by piece, one test at a time.
Just like with all your efforts leading up to this point, data is key. Just because a test version of your site is prettier or cleaner doesn’t mean it’s the obvious winner. This is where the baseline we talked about in Chapter 4 comes in. This baseline is your site’s pre-optimization “average.” It consists of your typical conversion rate, average bounce rates for top landing pages, how much time the average visitor spends on your site and any other metric you think indicates engagement (which, as we’ve discussed repeatedly, typically leads to conversion).
So how do you know when a test is over?
Remember, you can learn just as much—or more—from a failed test as you can from a successful one, as long as you understand that the resulting data is an opportunity.
Chapter 10 Notes