"...a certainty of three sigma, so there’s about one chance in 100 the result is a fluke."
This is a very incorrect way of summarizing the result.
When will popsci writers ever get this right? Hypothesis testing cannot tell you the probability that a hypothesis is true. It also cannot tell you if the result is a "fluke." What it can tell you is the probability of seeing the data assuming the null hypothesis is true. Without knowledge of the prior probability of the hypothesis, you have no right to say anything about the probability of the result being a "fluke" or not.
This is commonly misunderstood. Perhaps I can illuminate it with an example.
Suppose I take a coin out of your pocket--a good old American quarter. I then proceed to flip it six times and egads, I get six heads. The chance that a fair coin flips the same way six times in a row is 2/(2^6), or 1/32--close to 3%. Whoo hoo, that's more than a two sigma result.
Does that mean there's only a 3% chance the coin is actually fair? NO! If you assume the coin is fair, there is a 3% chance it flips the same way six times in a row. Flipping the conditional makes a huge difference. Given how vanishingly few quarters in circulation wouldn't flip fairly, we'd say before any experiment that there's an overwhelming chance the coin is truly fair. So there isn't a 3% chance the result "is a fluke," if by "is a fluke" we mean the coin really is fair AND it happened to flip in a surprising way. We KNOW the coin flipped in a surprising way, and since virtually all quarters in circulation are fair, there's a near-100% chance the data really can be called a "fluke"--not 3%.
This is all to say: if the prior probability of your null hypothesis is high, one unlikely result should not impress you. You are merely demonstrating that unlikely things happen--and with so many coins and so many people flipping them, they will, from time to time.
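To make that concrete, here's a quick simulation sketch in Python. The one-in-ten-thousand rate of bad quarters and the assumption that a "bad" quarter is simply two-headed are numbers I made up for illustration:

```python
import random

random.seed(0)

n_people = 1_000_000   # lots of people, each flipping a quarter six times
p_unfair = 1e-4        # made-up prior: 1 in 10,000 quarters is (say) two-headed

six_heads_runs = 0         # people who saw six heads in a row
runs_from_fair_coins = 0   # ...of whom, how many were holding a perfectly fair coin

for _ in range(n_people):
    unfair = random.random() < p_unfair
    if unfair:
        all_heads = True   # a two-headed coin always comes up heads
    else:
        all_heads = all(random.random() < 0.5 for _ in range(6))
    if all_heads:
        six_heads_runs += 1
        if not unfair:
            runs_from_fair_coins += 1

print(f"saw six heads in a row: {six_heads_runs} of {n_people}")
print(f"of those, holding a fair coin: {runs_from_fair_coins / six_heads_runs:.1%}")
```

With these invented numbers, a bit over 1.5% of people see the "two sigma" run of six heads, and roughly 99% of them are holding a perfectly ordinary quarter.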
> What it can tell you is the probability of seeing the data assuming the null hypothesis is true.
Sigh. When will internet explainers ever get this right? :-) Your explanation cannot be literally correct because, if the null hypothesis is true (the coin is fair), the "probability of seeing the data" is exactly the same for all possible sequences of six coin flips. What hypothesis testing actually does is arbitrarily pick a class of data that's "at least as extreme" as what was actually seen, and report its total probability.
To quote Eliezer Yudkowsky rephrasing Steven Goodman's example:
> So lo and behold, I flip the coin six times, and I get the result TTTTTH. Is this result statistically significant, and if so, what is the p-value - that is, the probability of obtaining a result at least this extreme? Well, that depends. Was I planning to flip the coin six times, and count the number of tails? Or was I planning to flip the coin until it came up heads, and count the number of trials? In the first case, the probability of getting "five tails or more" from a fair coin is 11%, while in the second case, the probability of a fair coin requiring "at least five tails before seeing one heads" is 3%.
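(Both numbers are easy to verify by just counting outcomes under a fair coin; here's a minimal Python check, no statistics library required.)

```python
from math import comb

# Observed data: TTTTTH (five tails, then one heads); fair coin assumed throughout.

# Stopping rule 1: flip exactly six times and count tails.
# p-value = P(at least five tails in six flips) = (C(6,5) + C(6,6)) / 2^6
p_fixed_six_flips = sum(comb(6, k) for k in (5, 6)) / 2**6   # 7/64 ~= 0.109

# Stopping rule 2: flip until the first heads and count the tails before it.
# p-value = P(at least five tails before the first heads)
#         = P(first five flips are all tails) = (1/2)^5
p_flip_until_heads = 0.5**5                                  # 1/32 ~= 0.031

print(f"fixed six flips:  p = {p_fixed_six_flips:.3f}")   # ~11%
print(f"flip until heads: p = {p_flip_until_heads:.3f}")  # ~3%
```

Same coin, same data, two different p-values--and the only thing that changed is what the experimenter was planning to do.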
> arbitrarily pick a class of data that's "at least as extreme" as what was actually seen
You are absolutely correct; in my haste to explain one misinterpretation of p-values, I stumbled into another gross oversimplification. The correct way of saying it must always involve some language about the deviation or extremity of the data being at least as great as what was observed.
It goes to show that properly reporting p-value statistics takes a lawyerly craft with language; while writing the post I had to consider what the mathematical interpretation of "is a fluke" should be, and I think the definition I chose is the one most people will read into it. However, the word "fluke" just means "unlikely chance occurrence" on its own, so saying "there is a 1% chance the result is an unlikely chance occurrence" is uninformative if taken literally. It has to imply one of two things:
1. You are referring to the null hypothesis as the unlikely occurrence, i.e. "There is a 1% chance the null hypothesis is true [therefore making this data surprising]."
2. You are temporarily assuming it is true to make such a statement, i.e. "There is a 1% chance of seeing data at least this surprising [if the null hypothesis were true]."
Someone who does know what a p-value is might be generous and assume you meant the latter, which is a correct statement. However, I think most people hear it the first way.
The fact that p-value reporting (when properly done) involves thinking through double negatives and conditions that are easily ignored probably indicates that it's time for other measures of significance to become better accepted.
http://en.wikipedia.org/wiki/P-value#Misunderstandings
http://invasiber.org/EGarcia/papers/MatthewsFinancialTimes.h...
tldr; what the article says --> P(Fluke|Data) != P(Data|Fluke) <-- what the statistic means
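To put numbers on that last line, here's the plain Bayes' theorem arithmetic for the quarter example above, where "Fluke" means "the coin really is fair and the run of six heads was just chance." The 99.99% prior and the two-headed "unfair" coin are, again, invented for illustration:

```python
# P(Data | Fluke): chance a fair coin gives six heads in a row
p_data_given_fair = 0.5**6        # ~0.016 (about 3% if you count "same way six times")

# Invented prior: 99.99% of quarters are fair; an unfair one is assumed two-headed
p_fair = 0.9999
p_data_given_unfair = 1.0

# P(Fluke | Data): chance the coin is fair, given that we actually saw six heads
p_fair_given_data = (p_fair * p_data_given_fair) / (
    p_fair * p_data_given_fair + (1 - p_fair) * p_data_given_unfair
)

print(f"P(Data | Fluke) = {p_data_given_fair:.3f}")    # ~0.016
print(f"P(Fluke | Data) = {p_fair_given_data:.3f}")    # ~0.994 -- not the same number at all
```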