Wednesday, December 8, 2010

Getting beyond the p-value

Update 12/8/10, 9:30 AM: I just got an e-mail from Steve Goodman, MD, MHS, PhD, from Johns Hopkins about this post. Firstly, my apologies for getting his role at the Annals wrong -- he is the Senior Statistical Editor for the journal, not merely a statistical reviewer. I am happy to report that he added more fuel to the p-value fire, and you are likely to see more posts on this (you are overjoyed, right?). So, thanks to Dr. Goodman for his input this morning!

Yesterday I blogged about our preference for avoiding false positive associations at the expense of failing to detect some real ones. The p-value conundrum, where the threshold for statistical significance is arbitrarily set at <0.05, has bothered me for a long time (a quick simulation below makes the trade-off concrete). I finally got curious enough to search out the origins of the p-value. Believe it or not, the information was not easy to find. I have at least 10 biostatistics and epidemiology textbooks on the shelves of my office -- not one of them goes into the history of the p-value threshold. But Professor Google came to my rescue, and here is what I discovered.
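As an aside, the trade-off itself is easy to see in a simulation. Here is a minimal sketch in Python (my own illustration, not drawn from any of the sources below; the sample size, the effect size of 0.3 standard deviations, and the seed are arbitrary choices): it runs many simulated studies through a one-sample t-test, first with the null hypothesis true and then with a small real effect.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2010)    # arbitrary seed, for reproducibility
    n, trials, alpha = 50, 10_000, 0.05

    # Case 1: the null is true (the mean really is 0).
    # Every rejection here is a false positive.
    fp = sum(stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
             for _ in range(trials))
    print(f"False positive rate under the null: {fp / trials:.3f}")   # ~0.05

    # Case 2: a small real effect exists (mean 0.3 SD above 0).
    # Every non-rejection here is a missed real association.
    hits = sum(stats.ttest_1samp(rng.normal(0.3, 1.0, n), 0.0).pvalue < alpha
               for _ in range(trials))
    print(f"Power to detect the 0.3 SD effect: {hits / trials:.3f}")  # ~0.55

The <0.05 cutoff guarantees roughly a 5% false positive rate when nothing is going on, but it says nothing about the other side of the ledger: in this toy setup, nearly half of the real effects go undetected.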

Using a carefully crafted search phrase, I found a discussion forum on WAME, the World Association of Medical Editors, which I felt represented a credible source. Here I discovered a treasure trove of information and references on exactly what I was looking for. Specifically, one poster referred to Steven Goodman's work, which I promptly looked up. And by the way, Steven Goodman, as it turns out, is the Senior Statistical Editor for the Annals of Internal Medicine and a member of WAME. So, I went to this gem in the journal Epidemiology from May 2001, unpretentiously titled "Of P-values and Bayes: A Modest Proposal". I have to say that some of the discussion was so in the weeds that even I had to go back and reread it several times to understand what the good Dr. Goodman was talking about. But here are some of the more salient and accessible points.

The author begins by stating his mixed feelings about the p-value:
I am delighted to be invited to comment on the use of P-values, but at the same time, it depresses me. Why? So much brainpower, ink, and passion have been expended on this subject for so long, yet plus ça change, plus c'est la même chose -- the more things change, the more they stay the same. The references on this topic encompass innumerable disciplines, going back almost to the moment that P-values were introduced (by R.A. Fisher in the 1920s). The introduction of hypothesis testing in 1933 precipitated more intense engagement, caused by the subsuming of Fisher's significance test into the hypothesis test machinery.1-9 The discussion has continued ever since. I have been foolish enough to think I could whistle into this hurricane and be heard.10-12 But we (and I) still use P-values. And when a journal like Epidemiology takes a principled stand against them,13 epidemiologists who may recognize the limitations of P-values still feel as if they are being forced to walk on one leg.14
So, here we learn that the p-value has been around for nearly 90 years and was brought into being by the father of frequentist statistics, R.A. Fisher. And its users are ambivalent about it, to say the least. So why, Goodman asks, do we continue to debate the value (or lack thereof) of the p-value? And here is the reason: publications.
Let me begin with an observation. When epidemiologists informally communicate their results (in talks, meeting presentations, or policy discussions), the balance between biology, methodology, data, and context is often appropriate. There is an emphasis on presenting a coherent epidemiologic or pathophysiologic story, with comparatively little talk of statistical rejection or other related tomfoolery. But this same sensibility is often not reflected in published papers. Here, the structure of presentation is more rigid, and statistical summaries seem to have more power. Within these confines, the narrative flow becomes secondary to the distillation of complex data, and inferences seem to flow from the data almost automatically. It is this automaticity of inference that is most distressing, and for which the elimination of P-values has been attempted as a curative.
This is clearly a condemnation of the way we publish: it demands a reduction to the lowest common denominator, in this case the p-value. Much like our modern medical paradigm, the p-value does not get at the real issues:
I and others have discussed the connections between statistics and scientific philosophy elsewhere, 11,12,15-22 so I will cut to the chase here. The root cause of our problem is a philosophy of scientific inference that is supported by the statistical methodology in dominant use. This philosophy might best be described as a form of naïve inductivism,23 a belief that all scientists seeing the same data should come to the same conclusions. By implication, anyone who draws a different conclusion must be doing so for nonscientific reasons. It takes as given the statistical models we impose on data, and treats the estimated parameters of such models as direct mirrors of reality rather than as highly filtered and potentially distorted views. It is a belief that scientific reasoning requires little more than statistical model fitting, or in our case, reporting odds ratios, P-values and the like, to arrive at the truth. [emphasis mine]
Here is a sacred scientific cow getting tipped! You mean science is not absolute? Well, no, it is not, as the readers of this blog are amply aware. Science at best represents a model of our current understanding of the Universe: it builds upon itself, usually in one direction, and even then offers no more than an asymptotic approximation of what is really going on -- merely our current understanding of reality, given the tools we have at our disposal. Goodman continues to drive home the naïveté of our inductivist thinking in the following paragraph:
How is this philosophy manifest in research reports? One merely has to look at their organization. Traditionally, the findings of a paper are stated at the beginning of the discussion section. It is as if the finding is something derived directly from the results section. Reasoning and external facts come afterward, if at all. That is, in essence, naïve inductivism. This view of the scientific enterprise is aided and abetted by the P-value in a variety of ways, some obvious, some subtle. The obvious way is in its role in the reject/accept hypothesis test machinery. The more subtle way is in the fact that the P-value is a probability -- something absolute, with nothing external needed for its interpretation.
In fact, the point is that the p-value is exactly NOT absolute. The p-value needs to be judged relative to some other standard of probability -- for example, the prior probability of the association in question. And yet what do we do? We worship at the altar of the p-value without giving any thought to its meaning. This is certainly convenient for those who want to invoke evidence of absence of certain associations, say between toxic exposures and health effects, when the reality simply indicates absence of evidence.
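To see how much the prior matters, here is a minimal sketch of the kind of calculation Goodman advocates (my own illustration; the priors are assumed for the example): it converts a p-value into the minimum Bayes factor using the -e * p * ln(p) calibration, then applies Bayes' rule under several prior probabilities that the null hypothesis is true.

    import math

    def min_bayes_factor(p):
        # Smallest Bayes factor (evidence for H0 vs. H1) compatible with a
        # given p-value, via the -e * p * ln(p) calibration (valid for p < 1/e).
        # This is the *best case* for the alternative hypothesis.
        if not 0.0 < p < 1.0 / math.e:
            raise ValueError("calibration requires 0 < p < 1/e")
        return -math.e * p * math.log(p)

    def posterior_prob_null(p, prior_null):
        # Posterior probability of the null, given a p-value and a prior.
        prior_odds = prior_null / (1.0 - prior_null)   # prior on the odds scale
        post_odds = prior_odds * min_bayes_factor(p)   # Bayes' rule for odds
        return post_odds / (1.0 + post_odds)

    # The same "significant" p-value judged against three different priors.
    for prior in (0.5, 0.75, 0.9):
        print(f"p = 0.05, P(H0) = {prior:.2f} -> "
              f"P(H0 | data) >= {posterior_prob_null(0.05, prior):.2f}")

Even in this best case for the alternative, p = 0.05 leaves the null hypothesis with at least a 29% chance of being true when the prior is a coin flip, and close to 80% when the association was implausible to begin with. The p-value alone tells us none of that.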

The point is that we need to get beyond the p-value and develop a more sophisticated, nuanced, and critical attitude toward data. Furthermore, regulatory bodies need to find more nuanced ways of communicating scientific data, particularly data evidencing harm, so as not to lose credibility with the public. Most importantly, however, we need to do a better job of training researchers in the subtleties of statistical analysis, so that the p-value does not become the ultimate arbiter of truth.
