Every time they say "classic statistics," just insert "what we did before now" and see how frustrated you get with this announcement. The whole point of using tools like this is that people shouldn't need a statistician, because the tool should make it easy to run solid tests. That of course hasn't been the case, and they're finally admitting it.
Thanks for your comment. This is Darwish, the Product Manager working on Stats Engine. You are correct, "classic statistics" is the method we used in the past. It is also what is most commonly used in industry (the main reason we started with this method). This was not an easy project for us to take on, but after talking to customers and looking at our historical experiment data, it was clear how important this problem was to solve, and that's why we spent a lot of resources on fixing this. Just for those following along on this comment: it's not that "classic statistics" on their own are incorrect, but rather that the misuse of these statistics can be costly. When used "incorrectly" (not using a sample size calculator, running many goals and variations at a time, etc.), you can meaningfully increase your chance of making a bad business decision or commit yourself to unnecessarily long sample sizes. Using statistics correctly is an industry-wide problem that many have tried to solve with education (i.e. giving statistics crash courses). We hope that our solution shows how important we think it is that statistics drive day-to-day decisions in organizations, and that there are different ways (change the math, not the customer) to get customers to this point. Many companies have data science teams and in-house statisticians that are very aware of these problems, but many don't, and that's really where we wanted to help out. You can read more about why we thought this was a serious problem here: http://blog.optimizely.com/2015/01/20/statistics-for-the-int...
What's so particularly embarrassing is that you clearly did not have any competent statisticians on board until now. This was not some big surprise that needed "a lot of resources to fix." This is something that should be obvious to anyone who understands hypothesis testing, and is something that statisticians have been describing how to do correctly for over 50 years: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1551774/
This feels like a very wasteful approach to me. I understand the need to protect against the variation of the p-value with each test. However, now data needs to mature in two places. Bayesian testing is a natural cyclical approach where past information coupled with data generates new beliefs, which become past information. Most conversion rates are small, say less than 30%. Hence, their differences are smaller (as we move towards support) --- yet this method uses the same old prior information (which is that it's reasonable in known cases to look above, say, 90% as the domain of the test). Given that we're now looking at data through two lenses, I would be shocked if this does not result in a much longer lag period. Am I missing something, or is this a free lunch?
I do agree with you that with sequential testing it is possible to get much slower results. This is actually similar to using a sample size calculator for a classical t-test. If you set your minimum detectable effect (MDE) much smaller than the actual effect size of your A/B test, you will end up waiting for many more visitors than were needed to detect significance. We have looked at many historical A/B tests at Optimizely to determine the range of effect sizes which give the most efficiency for our customers. In fact, we didn’t put out Stats Engine sooner because we wanted to be confident that the speed was comparable to the usual fixed-horizon t-test. This tuning will be part of an ongoing process to customize results at the customer level.
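To make the MDE point concrete, here is a minimal sketch (my own helper, using the standard two-proportion normal approximation, not Optimizely's calculator) of how the required sample size blows up when the MDE is set far below the effect you actually have:

```python
# Sketch only: normal-approximation sample size for a two-proportion z-test,
# illustrating how an MDE set far below the true effect inflates the required
# number of visitors. Not Optimizely's calculator.
from scipy.stats import norm

def visitors_per_variation(baseline, mde, alpha=0.05, power=0.8):
    """Visitors per variation to detect an absolute lift of `mde`
    over `baseline` with a two-sided z-test."""
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * var / mde ** 2) + 1

# Suppose the true lift is 2 percentage points on a 10% baseline.
for mde in (0.02, 0.01, 0.005):  # MDE at, below, and far below the true effect
    print(f"MDE={mde:.3f}: {visitors_per_variation(0.10, mde):,} visitors per arm")
```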
Second, I want to point out that Stats Engine is not a Bayesian test. We do not recompute a posterior from past information after every visitor and use this directly to get significance. Instead, such calculations are used as inputs to determine how much information we have compared to a situation of zero effect size. There's still only 'one lens' because we use all of this to make and guarantee the usual Frequentist hypothesis testing statements, but now factoring in that an experimenter can look at results at any time.
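(Not the production implementation, but a toy version of a Robbins-style mixture sequential test, with a normal approximation and assumed values for the variance sigma2 and mixing variance tau2, gives the flavor of an always-valid p-value that can only move toward significance as evidence accumulates:)

```python
# Toy always-valid p-value via a mixture sequential probability ratio test
# (normal approximation, H0: zero difference). A sketch in the spirit of the
# Robbins-style tests of power one; not the production Stats Engine code.
import numpy as np

def always_valid_p(diffs, sigma2, tau2=1.0):
    """diffs: stream of per-visitor (or per-batch) differences between
    variation and baseline; sigma2: their variance; tau2: mixing variance."""
    p, s, n, out = 1.0, 0.0, 0, []
    for d in diffs:
        n += 1
        s += d
        # Mixture likelihood ratio against H0: mean difference = 0
        lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
            tau2 * s ** 2 / (2 * sigma2 * (sigma2 + n * tau2)))
        p = min(p, 1.0 / lam)   # the p-value can only shrink: evidence accumulates
        out.append(p)
    return out

rng = np.random.default_rng(0)
stream = rng.normal(0.1, 1.0, size=5000)          # true lift of 0.1
ps = always_valid_p(stream, sigma2=1.0, tau2=0.5)
print("p-value after 500, 2000, 5000 visitors:", ps[499], ps[1999], ps[4999])
```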
I wrote about the problem with sequential testing in online experiments three years ago on the Custora blog [1]. And Evan Miller wrote about it two years before me on his blog [2]. I'm glad to see Optimizely finally getting on board. Communicating statistical significance to marketers is always challenging, and I'm sure this will lead to better decisions being made.
Addressing a few comments here. I think the industry deserves a lot of credit in its efforts to help those wanting to run A/B tests. Many people were aware these were issues and many actually tried to fix them (us included). There are many blog posts in the community about why continuous monitoring is dangerous, why you should use a sample size calculator, how to properly set a Minimum Detectable Effect, etc. We were part of this group (and definitely not the first), as we published a sample size calculator and spent a lot of time working with our clients on running tests with a safe testing procedure.
However, after doing this and looking more closely to quantify the effect of these efforts, we saw an opportunity for a simpler solution that could help even more people. Sequential testing was this solution, and it has had success in other applications. We wanted to bring sequential testing to A/B testing and take the hard work out of doing it correctly. Specifically, we have built on the groundwork laid in the '50s and '60s by providing the always-valid notion of p-value that customers are looking for.
While traditional sequential tests combat the continuous monitoring problem well, they require an intimate understanding of the method, which can pose cognitive hurdles for those not well-versed in statistics. You have to either know your target effect size or have in mind a maximum allowable number of visitors, and understand how changes in these will affect the run time of your test. What’s more, it is not straightforward to translate results to standard measures of significance such as p-values. This is actually where the biggest research contribution of Stats Engine comes in. We allow you to run a test, detect a range of effect sizes, and provide an always-valid, FDR-adjusted p-value, as opposed to a set of stopping rules that bounds Type I error at, say, 5%. The error rates are valid no matter how the user chooses to interact with the A/B test. Also, FDR control itself has only been around for the last 20-25 years.
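(For readers who haven't seen FDR control before, the Benjamini-Hochberg adjustment is one standard procedure, sketched below; the exact correction inside Stats Engine may differ.)

```python
# Benjamini-Hochberg FDR adjustment: one standard way to correct p-values
# when testing many goals and variations at once. Shown for illustration;
# the exact correction used inside Stats Engine may differ.
import numpy as np

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (q-values) in the original order."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value down
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty(m)
    out[order] = adjusted
    return out

raw = [0.001, 0.008, 0.039, 0.041, 0.60]   # e.g. five goal/variation combinations
print(benjamini_hochberg(raw))
# Combinations significant at FDR 5% are those with adjusted p-value <= 0.05
```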
Our biggest industry contribution is probably much simpler: moving a lot of the market to sequential testing more generally. We are happy to be in the position to help build on this research and bring it to practical applications.
And any company trying to sell a statistical tool/package that would actually create a graph like that is selling snake oil. Your model only gets better, digitally, and never sees a regression? And you're using this for web analytics?
Hello, this is Leo, Optimizely's in-house statistician. The graph you reference is a schematic to show the differences between Optimizely’s previous statistical platform and Stats Engine. It shows a monotone non-decreasing significance because under our sequential testing framework, the significance value represents the total amount of accumulated evidence against the null hypothesis of no difference between a variation and baseline. This wealth of evidence cannot decrease because you can only get more information about your test as you get more visitors. Of course, it is very possible, as the graph shows, that you will not acquire enough contradictory evidence to reach significance in a reasonable number of visitors. If we had instead looked continuously at a classical t-test, the significance would oscillate near the significance threshold. Spurious deviations would cause multiple, contradictory declarations that the test is significant and then not. A savvy A/B tester might wait until the oscillations die down. Sequential testing is a principled, mathematical way to differentiate evidence against the null hypothesis from random oscillations in real time. It should be noted that the chance of a type I error is still controlled at 5%.
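(A quick simulation, mine and not Optimizely's, makes the peeking problem concrete: on an A/A test with no true difference, checking a classical z-test after every batch of visitors crosses the nominal 5% threshold far more than 5% of the time.)

```python
# Simulation: how often does a classical z-test on an A/A experiment (no true
# difference) cross the 5% threshold at *some* peek, if we look every 1,000
# visitors? Sketch only, to illustrate the peeking problem described above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_experiments, peeks, batch, p_base = 2000, 20, 1000, 0.10
false_positives = 0
for _ in range(n_experiments):
    a = b = na = nb = 0
    for _ in range(peeks):
        a += rng.binomial(batch, p_base); na += batch
        b += rng.binomial(batch, p_base); nb += batch
        pa, pb = a / na, b / nb
        pooled = (a + b) / (na + nb)
        se = np.sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
        z = (pb - pa) / se
        if 2 * (1 - norm.cdf(abs(z))) < 0.05:   # "significant" at this peek
            false_positives += 1
            break
print("Nominal error rate: 5%  |  observed with peeking:",
      f"{false_positives / n_experiments:.1%}")
```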
You do make a good point that sometimes an A/B test will see regression over time. We have explicitly separated this out because we feel detecting a change in the underlying effect size is different from testing whether the effect is non-zero, and different statistical methods are better suited to one over the other. We have built a policy into our framework that monitors for such temporal effects and signals that an A/B test is in a ‘reset’ when we discover them. In our historical database, this happened on about 4% of tests.
I concede all this is a lot to get across in one graph, but we do feel that it is a good representation of how significance behaves under Stats Engine. If you would like to read more about the math behind Stats Engine, here is a link to a full technical article: http://pages.optimizely.com/rs/optimizely/images/stats_engin...
> It shows a monotone non-decreasing significance because [the value] represents the total amount of accumulated evidence against the null hypothesis.
> if we had instead looked continuously at a classical t-test, the significance would oscillate near the significance threshold
So there's your answer: the y-axis on the chart has an unlabeled different meaning for the blue line.
While I have you here Leo, can you explain why you would want to chart only the accumulated evidence for X? It's meaningless without knowing how much evidence has been accumulated for not X.
One point of clarification: the y-axis on the chart does have the same meaning for both lines. It is 1 minus the chance of committing a type I error. I think you do point out an important nuance, which is that under sequential testing a type I error changes to “ever detecting a significant result on an insignificant test” instead of detecting one at just a single, predetermined visitor count.
The amount of accumulated evidence for X is exactly a p-value, or a measurement which can tell you if there is enough evidence in the experiment to contradict a hypothesis of “no difference between a baseline and variation.” A high p-value, or low significance, tells you there is a lack of evidence to make this claim.
You bring up a very interesting point, which is that with sequential testing it is actually possible to also look for evidence of ‘not X’, or that there really is no detectable difference. This works by ‘flipping the hypothesis test on its head’ and allows for a mathematical formulation of stopping early for futility. We do not currently offer this in Stats Engine because we believe it’s the less important quantity of the two, but it may be the focus of future research.
You keep using that word "classical." If by "classical" you mean frequentist, then sequential testing is the appropriate frequentist method to have been using all along. If by "classical," you mean "old and established," then sequential testing is still the appropriate method to have been using all along.
I don't think the graph is particularly good either, but I think you're maybe reading it wrong too. The y-axis is significance level, which it makes sense would normally improve as visitors go up. Their line is arguably a moving-average regression.
He's saying that as new data arrives, it has to adjust both ways, not only toward the correct answer.
(If that wasn't the case, you could just figure out which was the only direction it would move and then stop collecting data. You've already got your answer)
Their graphs don't only adjust up. That one does, but that's because the only really significant downward movement is while their regression is still trailing behind.
It's a statistical significance test, not a mean. You expect statistical significance to go up over time. Unless the effect size is zero, it doesn't have an asymptote.
I'm someone who would consider using Optimizely: no formal stats background, but I understand high school stats, work on web apps professionally, and am interested in analytics and testing. I've watched the video, read everything on the page, and I still don't understand what they're trying to tell me here.
Based on my admittedly limited understanding of stats, unless you set the sample size and decide what significance is in advance, your test will probably misinform you. Nothing on this landing page explains to me how this new thing might mean otherwise, and it really doesn't help that the page is otherwise full of hubris, e.g. "goodbye traditional statistics". Somehow it seems unlikely that a web startup just invalidated all of statistics.
In a few sentences: in the past, if you didn’t use a sample size calculator properly (set a sample size up front and only evaluate test results at that time) and had tests with a lot of goals and variations, you could increase the chance of making an incorrect declaration. With Stats Engine, we allow you to monitor your results in real time and test as many hypotheses as you would like, and we give an accurate representation of the likelihood that your test is actually a winning or losing test. We’re definitely not trying to claim to have invalidated traditional statistics. If you used the proper testing procedure in the past, then Stats Engine will simply give you an easier workflow than before (no need to pick an appropriate sample size or minimum detectable effect, or to limit the number of hypotheses being tested). Many companies have data scientists, statisticians, or are otherwise well-informed on the topic; however, many are not. Stats Engine allows you to have an accurate statistical significance measure without requiring you to set a sample size, because it accounts for the errors that are introduced by looking at your results as experiment data comes in.
I am surprised by all the negative commentary here. On the whole, companies like Optimizely, RJMetrics, Custora, and others are doing more to push statistical analysis to the mass market than anyone else. These tools are not designed for statisticians or ML practitioners so it makes sense they do not put language like Bayesian, etc. front and center. IMO, the more people using data to make decisions, the better.
It's not that they don't put in 'language like Bayesian', it's a different method. Yes, it is an improvement on the t-test straw man they mention, but it's less flexible and powerful than Bayesian methods. Once you have a posterior, you can ask different questions that their p-values/confidence intervals don't address. For example, the probability of an x% increase in conversion rate, or the risk associated with choosing an alternative. Not to mention multi-armed bandits, which not only are expected to arrive at an answer faster, but also maximize conversions along the way.
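(To be concrete about the kind of question a posterior answers, here's a rough sketch with made-up counts and a uniform Beta(1, 1) prior; none of this comes from Optimizely's or any other vendor's implementation.)

```python
# Sketch: with Beta posteriors over two conversion rates you can ask directly
# for P(relative lift >= x%) or the expected loss of a choice, which a
# p-value/confidence interval doesn't give. Counts and prior are made up.
import numpy as np

rng = np.random.default_rng(1)
# Observed data: conversions / visitors for baseline (A) and variation (B)
conv_a, n_a = 120, 1000
conv_b, n_b = 150, 1000

post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

lift = (post_b - post_a) / post_a
print("P(B beats A):             ", np.mean(lift > 0))
print("P(relative lift >= 10%):  ", np.mean(lift >= 0.10))
print("Expected loss if we pick B:", np.mean(np.maximum(post_a - post_b, 0)))
```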
While I do agree that a sequential hypothesis test like the one we implemented in Stats Engine is different than a completely Bayesian method, I wouldn’t necessarily call it less powerful. In fact, numerous optimality properties exist showing that a properly implemented sequential test minimizes the expected number of visitors needed to correctly reject a null hypothesis of zero difference. I should note that our particular implementation does use some Bayesian ideas as well.
I agree that a benefit of Bayesian analysis is flexibility. Different posterior results are possible with different priors. But in practice this can be a hindrance as well as a benefit. When answers depend on a choice of prior, misusing or misunderstanding the prior can lead to incorrect conclusions.
There is also a very attractive feature of Frequentist guarantees specifically for A/B testing. They make statements on the long-run average lift, which is a quantity that many businesses care about: what will my average lift be if I implement a variation after my A/B test?
That said, we have looked at, and continue to look at, Bayesian methods, because we don’t feel that we have to be in either a Frequentist or a Bayesian framework, but rather should use the tools that are best suited to answer the sorts of statistical questions our customers encounter.
> They make statements on the long-run average lift, which is a quantity that many businesses care about: what will my average lift be if I implement a variation after my A/B test?
Could you state clearly what this guarantee is? Unless I'm making a stupid mistake, such guarantees are impossible even in principle with frequentist statistics.
You do not want to use a classical bandit for A/B testing. The problem is that most bandit algorithms assume the conversion rate is constant - i.e., Saturday and Tuesday are the same. If Saturday and Tuesday have different conversion rates, this will horribly break a bandit.
This is not a theoretical problem. I have a client who wasted months on this.
I know how to fix this (a Bayesian method, BTW), but I haven't published it. As far as I know, there is very little published research into using Bayesian bandits in assorted real world cases like this.
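(To make the failure mode concrete, here is a toy illustration with entirely made-up numbers, and not the unpublished fix mentioned above: when an adaptive algorithm's allocation shifts with the day of week, the pooled conversion rates it relies on can reverse the true ordering, a textbook Simpson's paradox.)

```python
# Toy numbers (entirely made up) showing why non-constant conversion rates
# break a bandit: adaptive allocation confounds "which arm" with "which day",
# producing a Simpson's-paradox reversal in the pooled estimates.
def rate(conversions, visitors):
    return conversions / visitors

# Variation B converts better than A on weekdays AND on weekends...
weekday = {"A": (900, 10000), "B": (110, 1000)}    # 9.0% vs 11.0%
weekend = {"A": (40, 1000),   "B": (500, 10000)}   # 4.0% vs  5.0%
# ...but because the algorithm happened to send A mostly weekday traffic and B
# mostly weekend traffic, the pooled rates it sees say the opposite:
pooled_a = rate(900 + 40, 10000 + 1000)   # ~8.5%
pooled_b = rate(110 + 500, 1000 + 10000)  # ~5.5%
print(f"Pooled:  A={pooled_a:.1%}  B={pooled_b:.1%}  <- A looks better overall")
print(f"Weekday: A={rate(*weekday['A']):.1%}  B={rate(*weekday['B']):.1%}  <- B is better")
print(f"Weekend: A={rate(*weekend['A']):.1%}  B={rate(*weekend['B']):.1%}  <- B is better")
```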
I very much like that people are starting to care about data-driven decisions... However, I find it quite aggravating that these tools don't use the best available methods. Optimizely is celebrating that they built a strange, proprietary solution to a very well-studied problem.
The situation to me feels a lot like acupuncture, homeopathic medicine, etc. I agree that these doctors and patients have their hearts in the right place... I just wish they'd channel that energy in a more positive direction. It's frustrating.
While our solution is different than the current industry standard in A/B testing platforms, all the techniques we are using have been around in the statistics literature for decades, and are tried and true. The particular sequential test of power one that we use has been around since the 1970s and goes back to the time of Herbert Robbins. And FDR control has been well documented in the past 25 years, most notably by Yoav Benjamini and Yosef Hochberg. We really are standing on the shoulders of giants.
I think our biggest contribution is presenting a principled, powerful mathematical solution in a way that is accessible to practitioners without a formal statistical background. Even if you do have this knowledge, it’s a chance to use these methods without having to reinvent the wheel every time.
There are various methods which could have been used as solutions, and we looked at many different ones to determine a fit to the user model and experience Optimizely is presenting. We are currently doing an AMA on our community portal and I would be happy to discuss potential solutions or any other comments with you there, https://community.optimizely.com/t5/Product-What-s-New/Ask-m...
All Optimizely, VWO, and other such services provide is a WYSIWYG editor and a redirect script. Some pretty (but meaningless) graphs and lots of bullshitting.
More importantly, I'm sure they have people who know that their "A/B tests" most definitely do not work as advertised, so they are misleading their customers on purpose.
Interesting that Optimizely is positioning themselves as reinventing the overarching discipline: "Statistics reinvented for the internet age". My guess is that this is to parry against the onslaught of A/B testing and optimization platforms for web and mobile from all directions. Of course, with their stable of PhD statisticians and data scientists, Optimizely is the answer.
No, if you read their technical paper, it's frequentist sequential testing with false discovery rate control, which is a fairly recent development (I mean, 25 years old is pretty new in statistics).
I think all the OP is trying to point out is that it either agrees with Bayesian methods or it's wrong... so at best it's not materially new, and at worst it's using questionable assumptions.
Fantastic to see Optimizely changing their stats model. The more education that is done on web experimentation the better as there certainly still is a lot of snake oil being sold out there!
Their chosen technique is one way of solving the problem of communicating statistics to non-technical audiences; however, the interpretation of the results may suffer here. I can imagine that this technique will lead to overestimation of the effect size in situations where the threshold is reached early in an experiment, as it will reward extreme values observed while the experiment is underpowered.
You do bring up a good point. Even though a sequential test is able to be called much earlier than a fixed-horizon test (note this only happens when the effect size is large enough to still guarantee Type I error control), it does not change the fact that estimates of the effect size are more variable when there are fewer visitors. The way we are addressing this is to make confidence intervals more prevalent in our platform. The width of a confidence interval represents our uncertainty in the magnitude of the true effect size with the information currently available. They correctly get narrower as the experiment goes on, as more visitors bring more information.
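(To put a rough number on the concern above: a sketch of my own with made-up conversion rates, using a simple peeking rule as a stand-in for any early-stopping procedure, shows how the lift estimated by tests that happen to stop early overshoots the true lift, which is exactly where a wide confidence interval is the honest summary.)

```python
# Sketch (made-up rates): among experiments that happen to stop early, the
# observed lift conditional on stopping overshoots the true lift. A simple
# peeking rule stands in here for any early-stopping procedure.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
p_a, p_b, batch, max_peeks, trials = 0.10, 0.11, 500, 20, 3000
true_lift = p_b - p_a
early_lifts = []
for _ in range(trials):
    ca = cb = n = 0
    for peek in range(1, max_peeks + 1):
        ca += rng.binomial(batch, p_a)
        cb += rng.binomial(batch, p_b)
        n += batch
        diff = cb / n - ca / n
        se = np.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
        if abs(diff) / se > norm.ppf(0.975):       # crossed the threshold...
            if peek <= 3:                          # ...and did so early
                early_lifts.append(diff)
            break
print(f"True lift: {true_lift:.3f}")
print(f"Average estimated lift among early stoppers: {np.mean(early_lifts):.3f}")
```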
A lot of people are denigrating the site based on its content, but I'll, fully expecting to be downvoted, go out on a limb and say what I think we are all thinking: "I just don't like that guy's sweater".
It's something I've heard both ways. Whether it's correct in both cases ... I'm not sure. "I end up feeling like a statistic" is a singular form, for sure, but I often hear/read "statistics" referenced singularly as well.