By Aaron Krol
November 11, 2013 | As the problem of irreproducible results in the scientific literature gains attention, a paper published yesterday in the Proceedings of the National Academy of Sciences (PNAS) suggests that many irreproducible findings may result from a too-lenient standard for establishing statistical significance, rather than from more troubling factors like poorly designed tests, bias, or outright fraud. The paper’s methods indicate that researchers should expect as many as 20% of findings published with p-values of .05 to be irreproducible, and that more stringent standards are needed to confidently reject null hypotheses.
The “p=.05” standard has become entrenched in scientific literature of all types, but Valen Johnson, the author of the PNAS paper and a professor in the Department of Statistics at Texas A&M University, was concerned that the statistical tests involved neglect an important piece of information. A p-value indicates the probability, assuming the null hypothesis is true, of seeing results at least as extreme as those observed; because these tests are not Bayesian [1], the p-value says nothing directly about the probability that either the null or an alternative hypothesis is true. Until now it has been difficult to compare classical tests with Bayesian ones, because classical tests are often uniformly most powerful tests [2] and there has been no Bayesian equivalent. Johnson designed a uniformly most powerful Bayesian test (UMPBT) to bring the two methods into alignment. “This research makes it possible to convert a p-value into a probability that a null hypothesis is actually true,” said Johnson in an email to Bio-IT World.
Johnson’s method shows that, under certain a priori assumptions about the null hypothesis, a p-value of .05 is not nearly equivalent to 95% confidence in an alternative hypothesis. Supposing that half of null hypotheses are likely to be true before an experiment is performed – probably a conservative estimate – inputs that give a p-value of .05 in a classical test would only assign a roughly 80% probability to the alternative hypothesis in Johnson's UMPBT. Attempts to reproduce such experiments would be expected to fail about 20% of the time, a figure that is not out of line with some surveys of the scientific literature. Higher a priori probabilities for the null hypothesis, of course, produce even greater divergence between p-values and Bayesian probabilities.
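The arithmetic behind these figures can be sketched with Bayes' rule. This is not Johnson's UMPBT derivation itself, only an illustration of the final conversion step; the Bayes factor of 4 used below is an assumption chosen so the numbers land in the same ballpark as those quoted above.

```python
# Illustrative sketch: converting a Bayes factor and a prior probability
# into a posterior probability for the alternative hypothesis.
# The Bayes factor of 4 below is an assumed value for illustration, not a
# figure taken from Johnson's paper.

def posterior_alternative(bayes_factor, prior_alt=0.5):
    """Posterior P(H1 | data) from a Bayes factor BF = P(data|H1)/P(data|H0)
    and a prior probability for the alternative hypothesis."""
    prior_odds = prior_alt / (1.0 - prior_alt)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# With a 50/50 prior and a Bayes factor of 4, the posterior probability of
# the alternative is 0.8 -- i.e., a roughly 20% chance the null is true.
print(posterior_alternative(4.0))        # 0.8
# A more pessimistic prior (here, 25% for the alternative) drops it further:
print(posterior_alternative(4.0, 0.25))
```

The second call shows the article's closing point: the same evidence yields an even lower posterior probability when the null hypothesis is considered more likely a priori.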
“This means that we need to raise the bar for statistical significance if we want to improve the reproducibility of scientific research,” said Johnson. He does not advocate scrapping p-value tests altogether, but argues that thresholds of .005 to .001 would be more in line with a decisive rejection of the null hypothesis.
[1] A “Bayesian” test, as opposed to a classical or frequentist test, assigns prior probabilities to both the null hypothesis and the alternative hypothesis, and then reevaluates those probabilities in light of the experiment’s outcome. In other words, a Bayesian test demands much stronger evidence to validate a result that, before the experiment was performed, would have been seen as highly unlikely. One advantage of a Bayesian test is that it delivers probabilities for both the null and alternative hypotheses, rather than only commenting on how likely the observed results would be if chance alone were at work.
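The updating the footnote describes can be shown with a minimal numeric sketch using Bayes' rule directly; the prior and likelihood values below are invented for illustration.

```python
# Minimal sketch of Bayesian updating. All numbers are hypothetical.

def bayes_update(prior_alt, likelihood_alt, likelihood_null):
    """Posterior probability of the alternative hypothesis, given the prior
    and the likelihood of the observed data under each hypothesis."""
    numerator = prior_alt * likelihood_alt
    denominator = numerator + (1 - prior_alt) * likelihood_null
    return numerator / denominator

# The same evidence (data four times likelier under the alternative than
# under the null) is far less convincing when the result was considered
# unlikely beforehand:
print(bayes_update(0.5, 0.4, 0.1))   # prior 50% -> posterior 0.8
print(bayes_update(0.05, 0.4, 0.1))  # prior 5%  -> posterior ~0.17
```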
[2] “Uniformly most powerful” indicates that a test is designed to yield as few false negatives as possible. In a uniformly most powerful test, the goal is to pre-specify a tolerated chance of a false positive – generally 5%, corresponding to p=.05 – and minimize the chance of a false negative under that constraint. Given the inevitable trade-offs between sensitivity and specificity, these tests may in fact have much higher rates of false negatives than false positives, but they will never produce more false negatives than necessary to deliver the desired false positive rate.
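That asymmetry can be sketched with the textbook example of a uniformly most powerful classical test: a one-sided z-test for a normal mean with known variance. The effect size and sample sizes below are arbitrary assumptions for illustration.

```python
# Sketch of the fixed-false-positive / minimized-false-negative trade-off,
# using a one-sided z-test. Effect size and sample sizes are hypothetical.
import math
from statistics import NormalDist

def false_negative_rate(effect_size, n, alpha=0.05):
    """Type II error rate (beta) of the one-sided z-test that rejects the
    null when the standardized sample mean exceeds the alpha-level cutoff."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)      # cutoff fixing false positives at alpha
    shift = effect_size * math.sqrt(n)   # mean of the statistic under H1
    return z.cdf(critical - shift)       # chance the statistic misses the cutoff

# With the false positive rate pinned at 5%, the false negative rate can
# still be far larger when the effect or the sample is small:
print(false_negative_rate(0.3, 20))    # beta well above 5%
print(false_negative_rate(0.3, 100))   # more data shrinks beta
```

This shows the footnote's point concretely: the 5% false positive rate is guaranteed by construction, while the false negative rate depends on the effect size and sample size and may be much higher.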