Abstract

The phrase that gives this editorial its title (and topic) was popularized by, and is often attributed to, Mark Twain, who himself credited it to the British Prime Minister Benjamin Disraeli (Twain, 1906). There is no direct evidence that Disraeli ever made this statement. There are other luminaries who could have originated the phrase, but the point is well recognized that one of the most effective means to deceive others is by the manipulation of numbers and finding differences where none exist (or vice-versa). I am not implying that it is the intent of the scientists who publish in these pages to mislead readers by their use of statistics, but I submit that the misuse of statistics, whether intentional or otherwise, creates confusion and error. Many readers are no doubt aware of the work of John Ioannidis at Stanford and his proposition that most published research findings are false (Ioannidis, 2005). The misuse of statistics, or at least the erroneous interpretation of statistical results, is one of several factors behind this conclusion. In March of last year, the American Statistical Association (ASA) issued a statement on statistical significance and p values (American Statistical Association, 2016). In this short release, the ASA listed six principles addressing both misconceptions and misuses of p values. The points made are not new, and I would suggest that every contributor to this journal has heard about them as part of their graduate training. Nevertheless, they are worth revisiting in the current high-pressure world of scientific publishing. We all acknowledge that data with statistically significant outcomes are more likely to be accepted for publication than those without (although this journal invites and publishes well-designed studies with “negative” results). Hence, there is a temptation to emphasize statistically significant findings as ends in themselves.
Below, I condense the points made by the ASA and others into three broad categories, from the perspective of what I have seen as editor in manuscripts submitted to this journal.
The shotgun approach to data analysis
The ASA refers to this variously as “p-hacking” or “data dredging,” and I have seen it in numerous manuscripts submitted for review. Essentially, the authors generate a large data set and then apply statistical analyses to the entire set. The Results section then becomes a catalog of “statistically significant” differences within and between groups. With enough comparisons, some will fall below the significance threshold by chance alone. What is lacking in this approach is any assessment of the biological plausibility of these differences. In other words, in the biological world, are these differences meaningful? Are they clinically relevant? Are they real? The ASA put it this way: “The p-value was never intended to be a substitute for scientific reasoning” (American Statistical Association, 2016). Remember that a p value is not the probability that the null hypothesis (no difference) is true. It is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. Since the days of RA Fisher, considered the father of modern biological statistics (biometrics) (Fisher, 1948), the default cut-off for declaring statistical significance has been 1 in 20 (0.05). I have yet to see a report that sets a different threshold for statistical significance. But also remember that a p < 0.05 is not “truth.” All it says is that if all the initial assumptions regarding the data are correct (e.g. normal distribution, independent observations, control for bias) and the null hypothesis of “no difference” is true, a difference at least as large as the one observed would be expected less than 5% of the time. It tells you nothing about whether the “significant” difference means anything. (In scientific writing, I reserve the word “significant” for statistics.) The p value is only one of several tools useful in evaluating data. It takes scientific reasoning to go to the next level and to evaluate the results for biological importance. This is why a strong theoretical basis for doing a particular study is so essential. It is this scientific reasoning that I see missing in many reports.
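To make the cost of the shotgun approach concrete, the following is a minimal sketch in Python, not an analysis taken from any submitted manuscript. It computes the chance of at least one false positive when many independent comparisons are run at the 0.05 level, and then simulates comparisons between two groups drawn from the same population. The function names, the group size of 10, and the use of a z test with a known standard deviation are all assumptions made for illustration; a real analysis would use a test chosen for the actual study design.

```python
import random
from statistics import NormalDist, fmean


def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n_tests independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def z_test_p_value(a, b, sd=1.0):
    """Two-sided p value for a difference in means.
    Assumption for illustration: both groups share a known SD."""
    se = sd * (1.0 / len(a) + 1.0 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_false_positives(n_comparisons, n_per_group=10, alpha=0.05, seed=1):
    """Compare pairs of groups drawn from the SAME population and count
    how many comparisons come out 'statistically significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_comparisons):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_p_value(a, b) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    # With 20 independent comparisons the chance of at least one
    # spurious "significant" result is about 0.64.
    print(round(family_wise_error_rate(0.05, 20), 2))
    # Roughly 5% of these null comparisons are expected to reach p < 0.05.
    print(count_false_positives(2000))
```

Every "significant" result counted by the simulation is a difference that does not exist, which is why a catalog of p values without a prior hypothesis and a plausibility check is so hard to interpret.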
The non-difference difference
Admit it. Almost all scientists have stated (or have been tempted to state) something like “the mean of Group A was greater than that of Group B, but the difference was not statistically significant.” With very few exceptions (which I will mention below), this statement is nonsense. The investigators have decided to use (hopefully appropriate) statistics to evaluate their data for differences. Failing to show a difference but then claiming that there is one anyway (i.e. A > B) represents a failure to understand both the purpose and the meaning of the outcome. What the statistics are indicating when the p-value is greater than 0.05 is that the data do not provide sufficient evidence of a difference between group A and group B. For most such cases, “no difference” should be the biological interpretation as well. As mentioned, there can be exceptions, but these must be clearly stated and discussed. One example could be a small clinical trial (small sample size) of patients (heterogeneous population) resulting in large variances in the data. Statistical analyses may fail to show significant differences between treatment groups, but the differences, while not statistically significant, may be clinically important. These cases are uncommon and require good medical and scientific expertise and rationale to defend. The overall conclusion of a study can hinge on the proper interpretation of no difference between treatment groups versus the wishful thinking that there is a difference despite evidence to the contrary.
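The small-trial exception can be illustrated with a rough power calculation. The sketch below is an illustration under stated assumptions, not a recommended analysis: it assumes a two-sided z test with a known, common standard deviation and equal group sizes, and the function name and the example numbers (a true difference of half a standard deviation, 8 versus 64 patients per group) are hypothetical.

```python
from statistics import NormalDist


def power_two_sample_z(effect, sd, n_per_group, alpha=0.05):
    """Probability that a two-sided z test reaches p < alpha when the
    true difference between group means equals `effect`.
    Assumptions for illustration: known common SD, equal group sizes."""
    nd = NormalDist()
    se = sd * (2.0 / n_per_group) ** 0.5       # SE of the difference in means
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)     # about 1.96 for alpha = 0.05
    shift = effect / se                        # true effect in SE units
    # Probability of landing in either rejection region.
    return (1.0 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Hypothetical trial: true difference of 0.5 SD between treatments.
    for n in (8, 64):
        print(n, round(power_two_sample_z(0.5, 1.0, n), 2))
```

Under these assumptions, a trial with 8 patients per group would detect a real half-standard-deviation difference less than one time in five, whereas 64 per group would detect it about four times in five. A non-significant result from the small trial therefore says little either way, which is why such cases need an explicit clinical rationale rather than a bare claim that A > B.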
The statistical model
This topic could fill an entire course in statistics and can only be highlighted here. Choosing the correct statistical model for the study design, the type of data collected, and the size of the treatment groups is imperative. Always design, execute, and evaluate a study with help from a statistician. Every statistical model has assumptions built into it. Non-normal distributions, repeated measures on the same study population, and unequal group sizes, for example, may complicate or even invalidate the statistical analyses and the interpretation of the results if they violate those assumptions. This journal will reach out to statisticians as reviewers if there are questions about study designs and the statistical models that support them or if the Statistics section of the Material and Methods is vague or incomplete. We all want to conduct, report, and publish good science, that is, science that advances knowledge. Statistics is a tool in this endeavor, not an end in itself.
