
A frequentist interpretation of inference assumes parameters have fixed but unknown values. In this paradigm, it is sensible to speak of the statement "this parameter's value is zero" as either true or false.

I do not think it is accurate to portray the author as someone who does not understand asymptotic statistics.


> it is sensible to speak of the statement "this parameter's value is zero" as either true or false.

Nope. The correct way is rather something like "the measurements/polls/statistics x ± ε are consistent with this parameter's true value being zero", where x is your measured value and ε is some measurement error, accuracy, or statistical deviation. x will never really be zero, but zero can lie within the interval [x - ε, x + ε].


As you yourself point out, a consistent estimator of a parameter converges to that parameter's value in the infinite sample limit. That limit is zero or it's not.


Reporting effect size mitigates this problem. If observed effect size is too small, its statistical significance isn't viewed as meaningful.


Sure (and of course). But did you see the effect size histogram in the OP?


Are you referring to the first figure, from Smith et al., 2007? If so, I couldn't evaluate whether gwern's claim makes sense without reading that paper to get an idea of, e.g., sample size and how they control for false positives. I don't think it's self-evident from that figure alone.

One rule of thumb for interpreting (presumably Pearson) correlation coefficients is given in [0] and states that correlations with magnitude 0.3 or less are negligible, in which case most of the bins in that histogram correspond to cases that aren't considered meaningful.

[0]: https://pmc.ncbi.nlm.nih.gov/articles/PMC3576830/table/T1/
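As a toy illustration of how little that can mean in practice, here's a minimal sketch with simulated data (the 0.3 cutoff is the one quoted from the cited table; the labels are mine): a correlation can be strongly "statistically significant" and still sit squarely in the negligible band.

    # Minimal sketch: a weak but highly "significant" correlation that the
    # |r| <= 0.3 rule of thumb would still call negligible.
    # Data are simulated; the 0.3 cutoff comes from the table cited above.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    y = 0.1 * x + rng.normal(size=10_000)   # weak true relationship

    r, p = pearsonr(x, y)
    label = "negligible" if abs(r) <= 0.3 else "non-negligible"
    print(f"r = {r:.3f}, p = {p:.3g}, rule-of-thumb label: {label}")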


I’m not arguing that there’s something fundamentally wrong with mathematics or the scientific method. I’m arguing that the social norms around how we do science in practice have some serious flaws. Gwern points out one of them. One that IMHO is quite interesting.

EDIT: I also get the feeling that you think it’s okay to do an incorrect hypothesis test (c > 0), as long as you also look at the effect size. I don’t think it is. You need to test the c > 0.3 hypothesis to get a mathematically sound hypothesis test. How many papers do that?


My opinion of Gwern's piece is that some of the arguments he makes don't require correlations. For example, A/B tests of differences in means using a zero difference null hypothesis will reject the null, given enough data.
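Here's a minimal simulation of that claim (numbers are made up): a difference of 1% of a standard deviation gets flagged against a zero-difference null once the sample is large enough.

    # Simulated A/B data with a tiny but nonzero difference in means.
    # Against a zero-difference null, the p-value eventually collapses
    # as n grows, even though the effect is practically negligible.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    true_lift = 0.01   # 1% of a standard deviation

    for n in (1_000, 100_000, 10_000_000):
        a = rng.normal(0.0, 1.0, size=n)
        b = rng.normal(true_lift, 1.0, size=n)
        _, p = ttest_ind(b, a, equal_var=False)
        print(f"n per arm = {n:>10,}: p = {p:.3g}")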

In that A/B testing scenario, I think if someone wants to test whether the difference is zero, that's fine, but if the effect size is small, they shouldn't claim that there's any meaningful difference. I believe the pharma literature calls this scenario equivalence testing.

Assuming a positive difference in means is desirable, I think a better idea is to test against a null hypothesis that the change is no larger than some positive margin (e.g., +5% of control), so that rejecting it supports a change of at least that size. I believe the pharma literature calls this scenario superiority testing.
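A rough sketch of that shifted null, with made-up data and an arbitrary margin: a one-sided Welch-style test of H0: lift <= margin against H1: lift > margin, where the margin is +5% of the observed control mean. This is just one way to set it up, not the pharma-standard machinery.

    # One-sided Welch-style test against a +5%-of-control margin
    # (simulated data; the margin and sample sizes are arbitrary choices).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    control = rng.normal(100.0, 20.0, size=5000)
    treatment = rng.normal(104.0, 20.0, size=5000)

    margin = 0.05 * control.mean()            # "+5% of control"
    diff = treatment.mean() - control.mean()
    v_t = treatment.var(ddof=1) / treatment.size
    v_c = control.var(ddof=1) / control.size
    se = np.sqrt(v_t + v_c)

    # H0: diff <= margin  vs  H1: diff > margin
    t = (diff - margin) / se
    df = (v_t + v_c) ** 2 / (v_t ** 2 / (treatment.size - 1)
                             + v_c ** 2 / (control.size - 1))
    p = stats.t.sf(t, df)

    print(f"observed lift = {diff:.2f}, margin = {margin:.2f}, p = {p:.3g}")
    # A zero-difference null would almost certainly reject on this data;
    # the margin-based null asks the stricter question "is the lift at
    # least +5%?", and here it (correctly) fails to reject.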

I believe superiority testing is preferable to equivalence testing, and in professional settings, I have made this case to managers. I have not succeeded in persuading them, and thus do the equivalence testing they request.

I don't think the idea of a zero null hypothesis is necessarily mathematically unsound. In cases like the difference in means, a zero null hypothesis is well-posed. However, I agree with you that there are better practices, like a null hypothesis incorporating a nonzero effect.

I don't entirely agree with the arguments Gwern puts forth in the Implications section because some of them seem at odds with one another. Betting on sparsity would imply neglecting some of the correlations he's arguing are so essential to capture. The bit about algorithmic bias strikes me as a bizarre proposition to include with little supporting evidence, especially when there are empirical examples of algorithmic bias.

What I find lacking about Gwern's piece is that it's a bit like lighting a match to widespread statistical practice and then walking away. Yes, I think null hypothesis statistical testing is widely overused, and statistical significance alone is not a good determinant of what constitutes a "discovery". I agree that modeling is hard, and that "everything is correlated" is, to an extent, true because the correlations are not exactly zero.

But if you're going to take the strong stance that null hypothesis statistical testing is meaningless, I believe you need to provide some kind of concrete alternative. Gwern's piece doesn't explicitly advocate one; it only hints that the alternative might be causal inference. Asking people who may not have much statistics training to leap from the frequentist concepts taught in high school to causal inference is a big ask. If Gwern isn't asking that, then I'd want to know what the suggested alternative is. Notably, Gwern does not mention testing for nonzero positive effects (e.g., in the vein of the "c > 0.3" case above). If there isn't an alternative, I'm not sure what the argument is. Don't use statistics, perhaps? It's tough to say.


Thanks for the extensive answer.

> I don't think the idea of a zero null hypothesis is necessarily mathematically unsound. In cases like the difference in means, a zero null hypothesis is well-posed. However, I agree with you that there are better practices, like a null hypothesis incorporating a nonzero effect.

Of course I don’t think a zero null hypothesis is mathematically unsound. But I think it is unsound to do one and then treat the effect size as a known quantity. It’s not a known quantity; it’s a point estimate with a lot of uncertainty. The real underlying correlation may well be a lot lower than the point estimate.

And of course it’s hard to get people in charge interested in better hypothesis testing. That testing will result in fewer conclusions being drawn / fewer papers being published. It’s just another symptom of the core issue: it’s quite convenient to be able to buy the conclusions you want with money.


I don't disagree with the title, but I'm left wondering what they want us to do about it beyond hinting at causal inference. I'd also be curious what the author thinks of minimum effect sizes (re: Implication 1) and noninferiority testing (re: Implication 2).


Permutation tests don't control the family-wise error rate, so I'm curious why you would say that "it doesn't overcorrect like traditional methods".

I'm also curious why you say those "cover every case": permutation tests tend to be underpowered, and constructing confidence intervals for statistics with them tends to be cumbersome compared to something like the bootstrap.

Don't get me wrong -- I like permutation tests, especially for their versatility, but as one tool out of a bunch of methods.
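For reference, a bare-bones permutation test for a difference in means looks something like the sketch below (simulated data, arbitrary iteration count). Note that it addresses a single comparison; by itself it does nothing about the family-wise error rate across many tests.

    # Two-sided permutation test for a difference in means.
    import numpy as np

    def perm_test_mean_diff(a, b, n_perm=10_000, rng=None):
        rng = rng or np.random.default_rng()
        observed = a.mean() - b.mean()
        pooled = np.concatenate([a, b])
        count = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)                        # relabel under the null
            diff = pooled[:a.size].mean() - pooled[a.size:].mean()
            if abs(diff) >= abs(observed):
                count += 1
        return (count + 1) / (n_perm + 1)              # add-one keeps p > 0

    rng = np.random.default_rng(3)
    a = rng.normal(0.3, 1.0, size=50)                  # simulated groups
    b = rng.normal(0.0, 1.0, size=50)
    print(f"permutation p-value: {perm_test_mean_diff(a, b, rng=rng):.4f}")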


> Most companies don't cost peoples' lives when you get it wrong.

True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw men.

We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.

As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, not because people will die if they get a false positive stating that the treatment is effective when it isn't.

> A lot of companies are, arguably, _too rigorous_ when it comes to testing.

My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.

Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.
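To make the sample-size point concrete, here's a quick estimate for a small effect; the effect size, alpha, and power below are illustrative choices, not recommendations.

    # Per-arm sample size for a two-sample t-test at a small effect size.
    from statsmodels.stats.power import TTestIndPower

    n_per_arm = TTestIndPower().solve_power(effect_size=0.1,   # Cohen's d
                                            alpha=0.05,
                                            power=0.8,
                                            alternative='two-sided')
    print(f"~{n_per_arm:.0f} observations per arm for d = 0.1")
    # Roughly 1,600 per arm; since n scales like 1/d^2, halving the
    # detectable effect quadruples the required sample, which is where
    # a lot of A/B tests quietly fall over.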

> But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.

It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data to satisfy its design criteria -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.

Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.

> I do like their proposal for "peeking" and subsequent testing.

What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
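For intuition on why peeking needs that machinery, here's a crude simulation (not one of the book's methods): testing accumulating data after every batch under a true null inflates the false-positive rate well past the nominal 5%, and a Bonferroni-style per-look threshold is a blunt, conservative stand-in for proper group-sequential boundaries.

    # Simulated "peeking" under a true null: test after every batch and
    # stop at the first p below the per-look threshold.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(4)
    n_sims, n_looks, batch = 2_000, 10, 100

    def false_positive_rate(per_look_alpha):
        hits = 0
        for _ in range(n_sims):
            a = np.empty(0)
            b = np.empty(0)
            for _ in range(n_looks):
                a = np.concatenate([a, rng.normal(size=batch)])
                b = np.concatenate([b, rng.normal(size=batch)])
                if ttest_ind(a, b).pvalue < per_look_alpha:
                    hits += 1
                    break
        return hits / n_sims

    print(f"peek at 0.05 every look:      ~{false_positive_rate(0.05):.2f}")
    print(f"peek at 0.05/{n_looks} every look: ~{false_positive_rate(0.005):.2f}")
    # The first rate lands well above 0.05; the second is conservative.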

> We're shipping software. We can change things if we get them wrong.

That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.

> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.

While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.

> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.

I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."

> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.

If those are your goals, just ship it; I don't think the testing effort is justified in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.


I think you're being overly pedantic here. I'm not a data scientist, just an engineering manager who is frustrated with data scientists ;)

That said, I do appreciate your corrections, but I don't think anything you said fundamentally changes my philosophical approach to these problems.

