Thanks for your comment. This is Darwish, the Product Manager working on Stats E...

thinkmoore · on Jan 20, 2015

What's so particularly embarrassing is that you clearly did not have any competent statisticians on board until now. This was not some big surprise that needed "a lot of resources to fix." This is something that should be obvious to anyone who understands hypothesis testing, and is something that statisticians have been describing how to do correctly for over 50 years: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1551774/

joecasson · on Jan 20, 2015

By that logic, no one else in the industry (Adobe, Mixpanel, VWO) has competent statisticians. That's silly.

thinkmoore · on Jan 20, 2015

Apparently that might be the case. I mean the problem with multiple hypothesis testing is not exactly rocket science...

dargani123 · on Jan 20, 2015

Hey - I combined a response in this link here: https://news.ycombinator.com/item?id=8918746

RA_Fisher · on Jan 20, 2015

This feels like a very wasteful approach to me. I understand the need to protect against the variation of p-value with each test. However, now data needs to mature in two places. Bayesian testing is a natural cyclical approach where past information coupled with data generates new beliefs which become past information. Most conversion rates are small, say less than 30%. Hence, their differences are smaller (as we move towards support) --- yet this method uses the same old prior information (which is that it's reasonable in known cases to look above say 90% as the domain of the test). Given that we're now looking at data through two lenses, I would be shocked if this does not result in a much longer lag period. Am I missing something, is this a free lunch?

leo_pekelis · on Jan 20, 2015

I do agree with you that with sequential testing it is possible to get much slower results. This is actually similar to using a sample size calculator for a classical t-test. If you set your minimum detectible effect (MDE) much smaller than the actual effect size of your A/B test, you will end up waiting for many more visitors than were needed to detect significance. We have looked at many historical A/B tests at Optimizely to determine the range of effect sizes which give the most efficiency for our customers. In fact, we didn’t put out Stats Engine sooner because we wanted to be confident that the speed was comparable to the usual, fixed horizon, t-test. This tuning will be part of an ongoing process to customize results at the customer level.

Second, I want to point out that Stats Engine is not a Bayesian test. We do not recompute a posterior from past information after every visitor and use this directly to get significance. Instead such calculations are used as inputs to determine how much information we have compared to a situation of zero effect size. There’s only ‘one lense’ still because we use all this to make and guarantee the usual Frequentist hypothesis testing statements, but now factoring in that an experimenter can look at results at any time.

RA_Fisher · on Jan 20, 2015

I'd be pretty curious to see how the posteriors of the p-values shake out animated by time.