Digging into p Values

Abstract

Back in the January to February 2016 issue of this journal, I discussed p values and the increasing need to report effect sizes along with them. Since some time has passed and the p value remains an important aspect of statistical reporting, I thought it wise to revisit the topic.

To illustrate some points, we will refer to the article from this issue entitled “COVID-19: Social Distancing and Physical Activity in United Kingdom Residents with Visual Impairment,” by Strongman, Swain, Chung, Merzbach, and Gordon. The authors of this article made a number of t test comparisons in which they compared the mean of one group to the mean of another group. If you cast your mind back, you will remember that, in the social sciences, we generally use a cutoff for “statistical significance” of .05 for such comparisons. That is, if the p value is < .05, the difference in means between the two groups is deemed “statistically significant.” Statistical significance at this level means that, if there were truly no difference between the groups, a difference at least as large as the one observed would turn up by chance less than 5% of the time. It is the level of risk that experimenters are willing to accept in the social sciences, where data tend to be a little noisier or harder to measure accurately than in something like physics.

Let us unpack this matter a little more. A number of the comparisons in the article I am using as an example today have p values close to .05, either slightly above or slightly below. How meaningful is it to claim that a comparison with a p value of .051 is not statistically significant while one with a p value of .049 is? This question is the reason why I made the case in 2016 that we should also include a measure of effect size when reporting the results of statistical tests so that the magnitude of the difference can also be known.
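To make the .049 versus .051 problem concrete, here is a minimal sketch in Python. The group means, standard deviation, and sample sizes are invented for illustration and are not taken from the Strongman et al. article, and the function `compare_groups` is a hypothetical helper. It uses a large-sample normal approximation to the t test so that only the standard library is needed.

```python
from math import sqrt
from statistics import NormalDist


def compare_groups(mean_a, mean_b, sd, n_per_group):
    """Return (p_value, cohens_d) for two equal-sized groups sharing one SD.

    Assumption: a large-sample normal (z) approximation stands in for the
    t test, which is reasonable for about 100 people per group.
    """
    diff = mean_a - mean_b
    standard_error = sd * sqrt(2 / n_per_group)
    z = diff / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p value
    cohens_d = diff / sd  # effect size in standard deviation units
    return p_value, cohens_d


# Hypothetical summary statistics: 100 people per group, SD of 10.
p_first, d_first = compare_groups(52.78, 50.0, sd=10.0, n_per_group=100)
p_second, d_second = compare_groups(52.76, 50.0, sd=10.0, n_per_group=100)

print(f"Study 1: p = {p_first:.3f}, d = {d_first:.3f}")
print(f"Study 2: p = {p_second:.3f}, d = {d_second:.3f}")
```

Under these assumed numbers, the first comparison falls just below .05 and the second just above it, yet the two effect sizes are practically identical. Reporting the effect size alongside the p value makes that similarity visible.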

In 1994, Jacob Cohen, a big name in statistics circles, wrote a piece entitled “The Earth is Round (p < .05),” in which he summarized a long history of people noting that null hypothesis significance testing (which is what you are doing when you rely on the p value) is a dangerous game. Let us take this claim step by step. In null hypothesis significance testing (or NHST, for short), we start with the null hypothesis that the groups we are comparing are not different, or are drawn from the same larger population. If the p value from our statistical comparison is less than our cutoff (which is often .05), we “reject the null hypothesis,” which leads one to want to say that the two groups are different. As Jacob Cohen notes,

What we want to know is “Given these data, what is the probability that the H₀ is true?” But as most of us know, what it tells us is “Given that H₀ is true, what is the probability of these (or more extreme) data?” (p. 997)

In other words, we are never able to prove the null hypothesis, that there is no difference between our groups. It is always possible that more data, better data, or different data would give a result opposite to our finding. What we can say is how likely we would be to end up with the result before us (or one more extreme) if the groups we are comparing were truly drawn from the same larger population. This is why we frame the decision in terms of accepting a certain risk (often 5%) of declaring a difference when chance alone produced it. We must always be on guard, however, that when we fail to reject the null hypothesis, we do not argue that the null hypothesis is therefore true.
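The gap between the two probabilities in Cohen's quotation can be sketched with Bayes' rule. This is an illustration only: the prior probabilities and the power value below are assumptions chosen for the example, not figures from Cohen or from any study, and `prob_null_given_significant` is a hypothetical helper.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """Probability that H0 is true given a significant result (Bayes' rule).

    prior_null: assumed probability that H0 is true before seeing data
    alpha:      probability of a significant result when H0 is true
    power:      probability of a significant result when H0 is false
    """
    false_alarms = alpha * prior_null
    true_hits = power * (1 - prior_null)
    return false_alarms / (false_alarms + true_hits)


# Assumed scenario 1: H0 and its alternative are equally plausible beforehand.
even_odds = prob_null_given_significant(prior_null=0.5, alpha=0.05, power=0.5)

# Assumed scenario 2: H0 is very likely true beforehand (a long-shot effect).
long_shot = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.5)

print(f"P(H0 | significant), even odds: {even_odds:.3f}")
print(f"P(H0 | significant), long shot: {long_shot:.3f}")
```

In both assumed scenarios the cutoff is .05, yet the probability that the null hypothesis is true after a significant result differs from .05 and changes with the prior. That is the sense in which the p value does not answer the question we would most like answered.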

Readers need to keep their logical hats on when reading statistics and not be led solely by significance levels. Increasing a study's sample size will narrow the confidence intervals around the study's estimates, which will also lower p values. The spread of individual scores does not shrink as the sample grows, but the standard error of the mean does and, mathematically, this leads to lower p values whenever any true difference exists, however small. But how meaningful is this result? Everything is connected to everything else in some way and, if you collect a large enough sample, you can find a significant connection between almost any two variables. Thus, in addition to the statistical results, a reader must pay attention to how large a sample is (not so small that no inferences can be drawn but not so large that everything is going to be significant) and the meaningfulness of what is being studied or compared.
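The effect of sample size can be sketched by holding a trivially small effect fixed and letting only the number of participants grow. The effect size of 0.05 standard deviations and the sample sizes are assumptions for illustration, the calculation idealizes each study as observing exactly that effect, and it again uses a normal approximation to the t test.

```python
from math import sqrt
from statistics import NormalDist


def p_value_for_effect(cohens_d, n_per_group):
    """Two-sided p value when the observed standardized difference is cohens_d.

    Assumption: two equal-sized groups and a normal (z) approximation.
    """
    z = cohens_d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


TINY_EFFECT = 0.05  # hypothetical difference of 1/20 of a standard deviation
sample_sizes = [50, 500, 5_000, 50_000]
p_values = [p_value_for_effect(TINY_EFFECT, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:>6}: p = {p:.4f}")
```

The difference between the groups never changes, but the p value falls steadily as the sample grows and eventually drops far below .05. The significance comes from the sample size, not from the difference having become any more important.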

In a 1990 article, Cohen gave an example of a highly significant correlation between height and intelligence in a sample of 14,000 school children that indicated a child's IQ could be raised from 100 to 130 by increasing their height by 14 feet. Statistically correct, but logically meaningless. Readers need to look for connections in data and analysis that, while not as bizarre as the height and IQ example, might still be less clinically meaningful than the analysis suggests.
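The arithmetic behind the height and IQ example shows how small the relationship is. This sketch uses only the two figures quoted above (30 IQ points and 14 feet). The 2-inch comparison is a hypothetical added for illustration.

```python
IQ_GAIN = 130 - 100          # IQ points, as quoted in the example
HEIGHT_GAIN_INCHES = 14 * 12  # 14 feet expressed in inches

# Implied IQ points per additional inch of height.
iq_points_per_inch = IQ_GAIN / HEIGHT_GAIN_INCHES

# Hypothetical comparison: two children who differ in height by 2 inches.
two_inch_difference = 2 * iq_points_per_inch

print(f"IQ points per inch: {iq_points_per_inch:.2f}")
print(f"Implied IQ gap for a 2-inch height difference: {two_inch_difference:.2f}")
```

The implied relationship is under one fifth of an IQ point per inch, far too small to matter in practice, even though the correlation was highly significant in a sample of 14,000.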

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997