Thanks for your comment. This is Darwish, the Product Manager working on Stats Engine. You are correct that "classic statistics" is the method we used in the past. It is also what is most commonly used in industry (the main reason we started with this method). This was not an easy project for us to take on, but after talking to customers and looking at our historical experiment data, it was clear how important this problem was to solve, and that's why we spent a lot of resources on fixing this.

For those following along with this comment: it's not that "classic statistics" are incorrect on their own, but rather that the misuse of these statistics can be costly. When used "incorrectly" (not using a sample size calculator, running many goals and variations at a time, etc.), you can meaningfully increase your chance of making a bad business decision, or commit yourself to unnecessarily long sample sizes. Using statistics correctly is an industry-wide problem that many have tried to solve with education (i.e. giving statistics crash courses). We hope that our solution shows how important we think it is that statistics drive day-to-day decisions in organizations, and that there are different ways (change the math, not the customer) to get customers to this point.

Many companies have data science teams and in-house statisticians who are very aware of these problems, but many don't, and that's really where we wanted to help out. You can read more about why we thought this was a serious problem here: http://blog.optimizely.com/2015/01/20/statistics-for-the-int...
What's particularly embarrassing is that you clearly did not have any competent statisticians on board until now. This was not some big surprise that needed "a lot of resources" to fix. It is something that should be obvious to anyone who understands hypothesis testing, and something that statisticians have been describing how to do correctly for over 50 years: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1551774/
This feels like a very wasteful approach to me. I understand the need to protect against the variation of the p-value with each test. However, now data needs to mature in two places. Bayesian testing is a natural cyclical approach where past information, coupled with data, generates new beliefs which become past information. Most conversion rates are small, say less than 30%. Hence, their differences are even smaller (as we move towards the support), yet this method uses the same old prior information (that it's reasonable, in known cases, to look above say 90% as the domain of the test). Given that we're now looking at data through two lenses, I would be shocked if this does not result in a much longer lag period. Am I missing something, or is this a free lunch?
I do agree with you that with sequential testing it is possible to get much slower results. This is actually similar to using a sample size calculator for a classical t-test: if you set your minimum detectable effect (MDE) much smaller than the actual effect size of your A/B test, you will end up waiting for many more visitors than were needed to detect significance. We have looked at many historical A/B tests at Optimizely to determine the range of effect sizes that gives the most efficiency for our customers. In fact, we didn't put out Stats Engine sooner because we wanted to be confident that its speed was comparable to the usual fixed-horizon t-test. This tuning will be part of an ongoing process to customize results at the customer level.
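For readers following along, here is a rough back-of-the-envelope version of the fixed-horizon sample size calculation I'm referring to. This is a generic two-proportion formula for illustration only, not our actual calculator; the point is just that shrinking the relative MDE by half roughly quadruples the number of visitors you have to commit to up front, which is why guessing the MDE wrong is so costly:

    # Illustrative only: per-variation sample size for a two-proportion z-test.
    from scipy.stats import norm

    def sample_size_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.8):
        p1 = baseline_rate
        p2 = baseline_rate * (1 + relative_mde)   # rate under the minimum detectable effect
        z_alpha = norm.ppf(1 - alpha / 2)         # two-sided significance threshold
        z_beta = norm.ppf(power)                  # desired power
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

    print(sample_size_per_variation(0.05, 0.20))  # ~8,000 visitors per variation
    print(sample_size_per_variation(0.05, 0.05))  # ~120,000, roughly 15x more

If the true lift is actually 20% but you planned for a 5% MDE, the fixed-horizon test has you committed to the larger number of visitors before you are "allowed" to call the result.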
Second, I want to point out that Stats Engine is not a Bayesian test. We do not recompute a posterior from past information after every visitor and use it directly to get significance. Instead, such calculations are used as inputs to determine how much information we have compared to a situation of zero effect size. There is still only "one lens", because we use all of this to make and guarantee the usual Frequentist hypothesis testing statements, now factoring in that an experimenter can look at results at any time.
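If it helps to see the flavor of "factoring in that an experimenter can look at any time", here is a toy always-valid sequential test on a stream of observations. This is a generic mixture likelihood ratio construction shown only for intuition; it is not our exact implementation, and the variance and mixture parameters below are made up for the example:

    # Toy illustration of an always-valid (sequential) test; not Stats Engine itself.
    import numpy as np

    def mixture_likelihood_ratio(xs, sigma2, tau2):
        """Evidence against 'zero effect' for observations xs ~ N(theta, sigma2),
        mixing the alternatives theta over N(0, tau2). The mixture is a device
        for building the statistic, not a Bayesian posterior."""
        n = len(xs)
        s = float(np.sum(xs))
        return np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
            tau2 * s ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
        )

    def always_valid_p_values(xs, sigma2=1.0, tau2=1.0):
        """Running p-values you can peek at after every visitor; under zero effect,
        the chance they ever drop below alpha is at most alpha."""
        ps, p = [], 1.0
        for n in range(1, len(xs) + 1):
            p = min(p, 1.0 / mixture_likelihood_ratio(xs[:n], sigma2, tau2))
            ps.append(p)
        return ps

The Bayesian-looking mixture is just a tool for constructing the test statistic; the guarantee it delivers, type I error control no matter when you stop, is a Frequentist one.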