By this point you may have started implementing many of the tools we discussed in the last chapter and you’re working to optimize your landing pages and lower your bounce and exit rates. You might even be the king or queen of CRO jargon among your friends—using phrases like “conversion flow” and “user experience” around the water cooler.

You’ve got your most recent analytics reports and you’ve designed some simple A/B tests. But how do you know which tests are improvements and which ones aren’t moving the needle? And more importantly, what will you do once you’ve called the winners and optimized your first set of tests?

Before you dive into the data, keep the following in mind:
  • It’s crucial you stay goal-focused. A lot of information is going to be available, so don’t get overwhelmed or bogged down and lose sight of what you’re trying to improve.
  • Test everything—just not all at once. You might feel tempted to do a complete overhaul of your site if the first test ends unfavorably, but this will not result in actionable data. Test one specific change at a time so you’ll know what’s working and what isn’t.
  • As we mentioned in Chapter 4, keep records of everything. Write down what you changed, why you changed it and what you expect to find. Take screenshots of all variations. You will need this data later.
So how do you know which optimization efforts are successful? Which version of your page or funnel is better?

Just like with all your efforts leading up to this point, data is key. Just because a test version of your site is prettier or cleaner doesn’t mean it’s the obvious winner. This is where the baseline we talked about in Chapter 4 comes in. This baseline is your site’s pre-optimization “average.” It consists of your typical conversion rate, average bounce rates for top landing pages, how much time the average visitor spends on your site and any other metric you think indicates engagement (which, as we’ve discussed repeatedly, typically leads to conversion).

So how do you know when a test is over?

We warned you back in Chapter 4 about the dangers of calling a test too early. Bear with us here; we're going to discuss this in a bit more detail.

In order for you to have confidence that you’ve in fact found a winner, your results need to be statistically significant. This means the margin of error, or likelihood that your results are merely chance, is low. In general, the larger the sample size, the smaller the margin of error.
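To make the sample-size relationship concrete, here is a minimal sketch (not tied to any particular testing tool) using the standard normal-approximation formula for a 95% margin of error around an observed conversion rate:

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an observed conversion rate p over n visitors."""
    return z * sqrt(p * (1 - p) / n)

# The same 10% observed conversion rate, at three sample sizes:
for n in (100, 1000, 10000):
    moe = margin_of_error(0.10, n)
    print(f"n={n:>5}: 10% conversion rate, +/- {moe * 100:.1f} points")
```

With 100 visitors, a "10% conversion rate" could plausibly be anywhere from roughly 4% to 16%; with 10,000 visitors, the band narrows to well under a point. That is why larger samples let you trust smaller observed differences.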

If your significance level is 5%, your result would be produced by chance no more than 5% of the time; in other words, you can be at least 95% confident it reflects a real difference. Your testing software might use the terminology "95% chance of beating original" or "95% probability of statistical significance."
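Under the hood, most tools compute something like a two-proportion z-test. The sketch below is a simplified, stdlib-only version (real testing platforms use more sophisticated statistics, so treat this as an illustration of the idea, not their implementation):

```python
from math import sqrt, erf

def ab_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: one-sided probability that variant B's
    observed lift over variant A is due to chance alone."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under "no difference"
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_b - p_a) / se
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))                 # one-sided p-value (normal CDF)

# 1,000 visitors per variant: 100 conversions vs. 130 conversions
p = ab_p_value(100, 1000, 130, 1000)
print(f"p-value: {p:.3f}")
```

Here the p-value comes out below 0.05, so a tool would report roughly a "95% chance of beating original" for variant B.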

But, as Evan Miller points out in his article “How Not To Run An A/B Test”:

[your testing software’s] significance calculation makes a critical assumption that you have probably violated without even realizing it: that the sample size was fixed in advance. If instead of deciding ahead of time, “this experiment will collect exactly 1,000 observations,” you say, “we’ll run it until we see a significant difference,” all the reported significance levels become meaningless.[1]

Yikes. In order to avoid what Miller refers to as repeated significance testing error, it's critical to set a sample size in advance and stick to it. Luckily, most testing software, like Optimizely, Visual Website Optimizer, and Google Content Experiments, includes this functionality. For example, Optimizely recommends at least 100 people see both variants of an A/B test before results begin to become significant. If your software lacks this function, consider using Miller's free Sample Size Calculator.
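If you want to see roughly how such a calculator arrives at its numbers, here is a sketch using the standard normal-approximation formula for two proportions (our own simplification with 5% significance and 80% power hardcoded; Miller's calculator and commercial tools may use different refinements):

```python
from math import ceil, sqrt

def sample_size_per_variant(base_rate, min_lift):
    """Approximate visitors needed per variant to detect an absolute lift of
    `min_lift` over `base_rate`, at 5% two-sided significance and 80% power."""
    z_alpha = 1.96  # two-sided 5% significance
    z_beta = 0.84   # 80% power
    p1, p2 = base_rate, base_rate + min_lift
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / min_lift ** 2
    return ceil(n)

# Detecting an improvement from a 5% to a 6% conversion rate:
print(sample_size_per_variant(0.05, 0.01))
```

Note the answer is on the order of eight thousand visitors per variant — a useful reality check: small lifts on low base rates need far more traffic than most people expect, which is exactly why "run it until it looks significant" is so tempting and so wrong.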

Something else to keep in mind is that tests may swing wildly over time. Just because one version of the test starts off strong, that doesn't mean you can prematurely end the test. Stick to your predetermined sample size, and know that statistical fluctuations will even out over the duration of the test.

For example, you can flip a coin 5 times, and even though the odds of heads on any flip are 50%, getting 5 heads in a row is entirely plausible (about a 3% chance). Over 100 or 1,000 flips, however, the average quickly settles toward 50/50.
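You can verify this yourself with a quick simulation (a toy illustration of the law of large numbers, seeded for reproducibility):

```python
import random

random.seed(42)  # fixed seed so the demo is reproducible

def heads_fraction(n_flips):
    """Fraction of heads in n fair coin flips."""
    return sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips

# Small samples swing wildly; large samples converge toward 0.5.
for n in (5, 100, 1000, 100_000):
    print(f"{n:>7} flips: {heads_fraction(n):.3f} heads")
```

A run of 5 flips can easily come out 0.8 or 0.2, while 100,000 flips lands within a fraction of a point of 0.5 — the same reason an A/B test's early lead tells you almost nothing.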

For this reason, one sampling of 1,000 users is far more valuable than several samplings of 10–15 users each.

The winning version of a site is the one with consistently better metrics—it’s as simple as that. Remember, the metrics you’re looking at here are, of course, conversion rates, but also whatever other metrics you’ve determined affect conversion (engagement metrics in particular).

Big, clear-cut wins are always great. But that's not always how tests turn out. For example, a change that lowers bounce rates for the test site but doesn't improve (though doesn't worsen) conversion rates could still be considered a "winner"—an improvement worth keeping, because, after all, your goal is to improve your site piece by piece, one test at a time.

If your hypothesis was correct and your test was a clear winner…
  1. Consider whether you can make further improvements. While there’s no such thing as perfect, there is almost always better.
  2. Think about how you could apply this change elsewhere. For example, if taking the jargon out of your landing page copy increases conversion rates, test to see if simplifying the language throughout your site further boosts your conversion rate.
  3. Once you’ve thoroughly addressed 1 and 2, begin addressing the next item on your list of optimization concerns (the one you were instructed to make in Chapter 4).
But what if your hypothesis wasn’t correct, and your test is a loser? Now what?
  1. This means you were originally doing a better job of something; try to figure out what that is.
  2. Review your analytics and user survey data with this in mind to see if you can gain new insights. Conduct new surveys if need be.
  3. Form a new hypothesis.
  4. Design and conduct a new test.
  5. Repeat.

Remember, you can learn just as much—or more—from a failed test as you can from a successful one, as long as you understand that the resulting data is an opportunity.

As we’ve mentioned nearly every chapter—optimization is not a one-and-done approach. There is no ultimate optimization goal, no perfect version of your website. Once you’ve declared a winner, you should do one of two things: either ask yourself how you can further optimize this particular element of your site, or ask yourself what else about your site can now be optimized. The best optimization efforts are cyclical and continuous.

Chapter 10 Notes
