Abstract

We applaud the forward-looking nature of the Commentary provided by Kragel et al. (2021) on our article, “What Is the Test-Retest Reliability of Common Task-Functional MRI Measures? New Empirical Evidence and a Meta-Analysis” (Elliott et al., 2020). We fully agree with their emphasis on the importance of avoiding overgeneralization when considering measurement reliability in task-functional MRI (task-fMRI). Because no single reliability estimate can capture the multitude of possible task-fMRI measures, statements such as “every brain activity study you’ve ever read is wrong” (Cohen, 2020) are misleading and unnecessarily undermine our joint efforts to improve task-fMRI. In fact, that is why we addressed this very point in our article (see p. 801). Nevertheless, we take this opportunity to clarify three subtle but meaningful ways that our perspective diverges from that promoted by Kragel et al. in their Commentary.
First, as we embrace the future, we must also account for and build on the past while being realistic about the state of the present. Kragel et al. point out the exciting potential of “multivariate measures optimized using machine learning” that they claim are “commonly used for biomarker discovery” (p. 622). While we agree that multivariate measures are becoming more widespread and should continue to be developed and explored (see p. 802 of our original article), such measures are still far from being universal in task-fMRI biomarker research. Because psychological science is a cumulative enterprise, criticism and honest assessment of the current state of the science are essential to the continued advancement of the field. In this vein, we surveyed the reliability of region-of-interest-based task-fMRI activation, which is one of the most commonly adopted measures reported in the literature over the past 2 decades. Our meta-analysis directly provided evidence for this continued use, as approximately half of the reliability estimates we found had been published in the previous 5 years. These measures are not relics of the past; they are in common use today and still frequently incorporated as primary measures in large-scale, state-of-the-art imaging efforts focused on biomarker development and individual-differences research. For example, the Human Connectome Project, UK Biobank, and the Adolescent Brain Cognitive Development study all have incorporated fMRI tasks designed to activate particular brain areas and circuits (Casey et al., 2018; Miller et al., 2016; Van Essen et al., 2013). These are large-scale, expensive projects creating MRI data sets for future neuroscience research. Thus, the poor reliability reported in our article is critical not only for past but also for present biomarker research using traditional task-fMRI activation. We hope that by reevaluating standard practices in light of the reliability limitations detailed in our article, we can guard against repeating and perpetuating these limitations in such future research.
Second, we would like to highlight an important distinction between the main aim of our article and several of the examples offered by Kragel et al. in their Commentary. The central concern addressed in our article was whether commonly used measures of task-fMRI activation are reliable enough for individual-differences research and brain biomarkers. To answer this question, we assessed the test-retest reliability of task activation in the tradition of Cronbach’s so-called correlational discipline of scientific psychology (Cronbach, 1957). However, Kragel et al. include examples of both between-subjects reliability per the correlational discipline and within-subjects replicability per the experimental discipline. For example, Figure S1c in Kragel et al.’s Supplemental Material demonstrates the replicability of machine-learning weights, across independent samples, that were used to classify faces and shapes across experimental conditions. While the ability of task-fMRI to decode experimental conditions may be of scientific interest, it is fundamentally a within-subjects experimental effect, falling within Cronbach’s “experimental” discipline of scientific psychology. As we pointed out in our original article, “Within-subjects robustness is . . . often inappropriately invoked to suggest between-subjects reliability, despite the fact that reliable within-subjects experimental effects at a group level can arise from unreliable between-subjects measurements” (Elliott et al., 2020, p. 802). For example, contrasting faces with shapes consistently elicits amygdala activation within a group of individuals, despite poor test-retest reliability of the same amygdala activation between individuals (see http://haririlab.com/vid/ReliabilityTutorial.mp4). It is critical to preserve this often-confused distinction in order to ensure that reliability metrics are appropriately applied and interpreted within the research framework (Fröhner et al., 2019; Hedge et al., 2018).
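The distinction between a robust within-subjects effect and reliable between-subjects measurement can be illustrated with a small simulation. What follows is a minimal sketch under assumed values, not an analysis from our article: the function names, sample size, effect size, and variance components are all hypothetical, and the Pearson correlation between sessions is used as a simple stand-in for an intraclass correlation. By construction, true individual differences account for only 0.2² / (0.2² + 0.6²) = 10% of the observed variance, while the average activation is large relative to that variance.

```python
import numpy as np


def simulate_sessions(n_subjects=200, group_effect=1.0,
                      sd_between=0.2, sd_error=0.6, seed=0):
    """Simulate one activation estimate per subject for two scan sessions.

    All parameter values are illustrative assumptions. Each subject has a
    stable true activation (group_effect plus a small individual deviation),
    and each session adds independent measurement error.
    """
    rng = np.random.default_rng(seed)
    true_score = group_effect + rng.normal(0.0, sd_between, n_subjects)
    session1 = true_score + rng.normal(0.0, sd_error, n_subjects)
    session2 = true_score + rng.normal(0.0, sd_error, n_subjects)
    return session1, session2


def one_sample_t(x):
    """t statistic testing whether the group-mean activation differs from zero."""
    return x.mean() / (x.std(ddof=1) / np.sqrt(x.size))


s1, s2 = simulate_sessions()

# Within-subjects (experimental) question: does the task activate the region
# on average? This is answered separately in each session.
t1, t2 = one_sample_t(s1), one_sample_t(s2)

# Between-subjects (correlational) question: do individuals keep their rank
# order across sessions? Pearson r is used here as a simple reliability proxy.
retest_r = np.corrcoef(s1, s2)[0, 1]

print(f"group-effect t, session 1: {t1:.1f}")
print(f"group-effect t, session 2: {t2:.1f}")
print(f"test-retest correlation:   {retest_r:.2f}")
```

Under these assumptions the group-level effect is detected strongly in both sessions, yet the ordering of individuals barely carries over from one session to the next. Shrinking `sd_error` relative to `sd_between` raises the test-retest correlation without changing the group mean, which shows that the two properties are governed by different parts of the variance.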
Third, and related to the second point, we agree with Kragel et al. that different types of biomarkers require different demonstrations of reliability, as highlighted in their Figure 1a. However, we disagree with their description of COVID-19 tests (and diagnostic biomarkers more generally) as an example of a biomarker that does not need high test-retest reliability. In fact, a COVID-19 test desperately requires high test-retest reliability; however, it must be investigated over an appropriate timescale. Critically, COVID-19 tests must validly track changes in the underlying construct of interest (i.e., SARS-CoV-2 viral load). Therefore, when administered to an infected individual, a COVID-19 test should be capable of repeatedly returning positive results over minutes, hours, and even days until the moment that the disease status changes. More generally, although diagnostic biomarkers are naturally expected to change over time, they must be reliable within states (e.g., infected) so that deviations can be unambiguously attributed to a change in state (e.g., convalescence). Similarly, the time interval of reliability studies in task-fMRI should be calibrated to the putative timescale of stability within the underlying constructs of interest (e.g., hippocampal activity related to cognitive decline a year later). In hindsight, we should have more clearly stated in our original article that when we use the term “biomarker” we are referring to a specific class of between-subjects traitlike biomarkers useful for long-term prognostication.
In conclusion, we share the optimism of Kragel et al. about the potential for task-fMRI. In our own research, we have worked, and will continue to work enthusiastically, toward advancing reliable functional brain biomarkers. As other areas of fMRI (e.g., Elliott et al., 2019; Noble et al., 2017; Zuo et al., 2019) and biomedical research (e.g., Sugden et al., 2020) grapple with measurement challenges, we hope that further discussion of reliability in task-fMRI will similarly bear fruit in the form of reliable measures that help build a stronger, more cumulative science. Indeed, it is high time that neuroscience and psychometric theory come together in research, teaching, and training.
