Abstract
Medical imaging serves many roles in patient care and the drug approval process, including assessing treatment response and guiding treatment decisions. These roles often involve a quantitative imaging biomarker, an objectively measured characteristic of the underlying anatomic structure or biochemical process derived from medical images. Before a quantitative imaging biomarker is accepted for use in such roles, the imaging procedure to acquire it must undergo evaluation of its technical performance, which entails assessment of performance metrics such as repeatability and reproducibility of the quantitative imaging biomarker. Ideally, this evaluation will involve quantitative summaries of results from multiple studies to overcome limitations due to the typically small sample sizes of technical performance studies and/or to include a broader range of clinical settings and patient populations. This paper is a review of meta-analysis procedures for such an evaluation, including identification of suitable studies, statistical methodology to evaluate and summarize the performance metrics, and complete and transparent reporting of the results. This review addresses challenges typical of meta-analyses of technical performance, particularly small study sizes, which often cause violations of assumptions underlying standard meta-analysis techniques. Alternative approaches to address these difficulties are also presented; simulation studies indicate that they outperform standard techniques when some studies are small. The meta-analysis procedures presented are also applied to actual [18F]-fluorodeoxyglucose positron emission tomography (FDG-PET) test–retest repeatability data for illustrative purposes.
1 Introduction
Medical imaging is useful for physical measurement of anatomic structures and diseased tissues as well as molecular and functional characterization of these entities and associated processes. In recent years, imaging has increasingly served in various roles in patient care and the drug approval process, such as for staging,1,2 for patient-level treatment decision-making, 3 and as clinical trial endpoints. 4
These roles will often involve a quantitative imaging biomarker (QIB), a quantifiable feature extracted from a medical image that is relevant to the underlying anatomical or biochemical aspects of interest. 5 The ultimate test of the readiness of a QIB for use in the clinic is not only its biological or clinical validity, namely its association with a biological or clinical endpoint of interest, but also its clinical utility, in other words, that the QIB informs patient care in a way that benefits patients. 6 But first, the imaging procedure to acquire the QIB must be shown to have acceptable technical performance; specifically, the QIB values it produces must be shown to be accurate and reliable measurements of the underlying quantity of interest.
Evaluation of an imaging procedure’s technical performance involves assessment of a variety of properties, including bias and precision and the related terms repeatability and reproducibility. For detailed discussion of metrology terms and statistical methods for evaluating and comparing performance metrics between imaging systems, readers are referred to several related reviews in this journal issue.5,7,8 A number of studies have been published describing the technical performance of imaging procedures in various patient populations, including the test–retest repeatability of [18F]-fluorodeoxyglucose (FDG) uptake in various primary cancer types such as nonsmall cell lung cancer and gastrointestinal malignancies,9–13 and agreement between [18F]-fluorothymidine (FLT) uptake and Ki-67 immunohistochemistry in lung cancer patients, brain cancer patients, and patients with various other primary cancers. 14 Given that studies assessing technical performance often contain as few as 10–20 patients9,13,15 and the importance of understanding technical performance across a variety of imaging technical configurations and clinical settings, conclusions about technical performance of an imaging procedure should ideally be based on multiple studies.
This paper describes meta-analysis methods to combine information across studies to provide summary estimates of technical performance metrics for an imaging procedure. The importance of complete and transparent reporting of meta-analysis results is also stressed. To date, such reviews of the technical performance of imaging procedures have been largely qualitative in nature. 16 Narrative or prose reviews of the literature are nonquantitative assessments that are often difficult to interpret and may be subject to bias due to subjective judgments about which studies to include in the review and how to synthesize the available information into a succinct summary, which, in the case of an imaging procedure’s technical performance, is a single estimate of a performance metric. 17 Systematic reviews such as those described by Cochrane 18 improve upon the quality of prose reviews because they focus on a particular research question, use a criterion-based comprehensive search and selection strategy, and include rigorous critical review of the literature. A meta-analysis takes a systematic review one step further by producing a quantitative summary value of some effect or metric of interest. It is the strongest methodology for evaluating the results of multiple studies. 17
Traditionally, a meta-analysis is used to synthesize evidence from a number of studies about the effect of a risk factor, predictor variable, or intervention on an outcome or response variable, where the effect may be expressed in terms of a quantity such as an odds ratio, a standardized mean difference, or a hazard ratio. Discussed here are adaptations of meta-analysis methods that are appropriate for use in producing summary estimates of technical performance metrics. Challenges in this setting include limited availability of primary studies and their typically small sample size, which often invalidates approximate normality of many performance metrics, an assumption underlying standard methods, and between-study heterogeneity relating to the technical aspects of the imaging procedures or the clinical settings. For purposes of illustration, meta-analysis concepts and methods are discussed in the context of an example of a meta-analysis of FDG positron emission tomography (FDG-PET) test–retest data presented in de Langen et al. 10
The rest of the paper is organized as follows. Section 2 gives an overview of the systematic review process. Section 3 describes statistical methodology for meta-analyses to produce summary estimates of an imaging procedure’s technical performance given technical performance metric estimates at the study level, including modified methods to accommodate nonnormality of these metric estimates. Section 4 describes statistical methodology for meta-regression, namely meta-analysis for when study descriptors that may explain between-study variability in technical performance are available. In both Sections 3 and 4, techniques are presented primarily in the context of repeatability for purposes of simplicity. In Section 5, results of meta-analyses of simulated data and of FDG-PET test–retest data from de Langen et al. 10 using the techniques described in Sections 3 and 4 are presented. Section 6 describes meta-analysis techniques for when patient-level data, as opposed to just study-level data as is the case in Sections 3 and 4, are available. As in Sections 3 and 4, the concepts in Section 6 are also presented primarily in the context of repeatability. Section 7 describes the extension of the concepts presented in Sections 2 to 4 and Section 6 to other aspects of technical performance, including reproducibility and agreement. Section 8 presents some guidelines for reporting results of the meta-analysis. Finally, Section 9 summarizes the contributions of this paper and identifies areas of statistical methodology for meta-analysis that would benefit from future research to enhance their applicability to imaging technical performance studies and other types of scientific investigations.
2 Systematic reviews of technical performance
Meta-analysis of an imaging procedure’s technical performance requires a rigorous approach to ensure interpretability and usefulness of the results. This requires careful formulation of the research question to be addressed, prospective specification of study search criteria and inclusion/exclusion criteria, and use of appropriate statistical methods to address between-study heterogeneity and compute summary estimates of performance metrics. Figure 1 displays a flowchart of the overall process. The following sections elaborate on considerations at each step.
Flowchart of the general meta-analysis process. *The term “studies” includes publications and unpublished data sets.
2.1 Formulation of the research question
Careful specification of the question to be addressed provides the necessary foundation for all subsequent steps of the meta-analysis process and maximizes interpretability of the results. The question should designate a clinical context, class of imaging procedures, and specific performance metrics. Clinical contexts may include screening in asymptomatic individuals or monitoring treatment response or progression in individuals with malignant tumors during or after treatment. As an example, if FDG-PET is to be used to assess treatment response,19,20 then determining a threshold above which changes in FDG uptake indicate true signal change rather than noise would be necessary. 9 Characteristics of the specific disease status, disease severity, and disease site may influence performance of the imaging procedure. For example, volumes of benign lung nodules might be assessed more reproducibly than malignant nodule volumes. Imaging procedures to be studied need to be specified by imaging modality and usually additional details including device manufacturer, class or generation of device, and image acquisition settings.
A specific metric should be selected on the basis of how well it captures the performance characteristic of interest and with some consideration of how likely it is that it will be available directly from retrievable studies or can be calculated from information available from those studies. The clinical context should also be considered in the selection of the metric. For instance, the repeatability coefficient (RC) not only is appropriate for meta-analyses of test–retest repeatability, but also may be particularly suitable if the clinical context is to determine a threshold below which changes can be attributed to noise. RC is a threshold below which absolute differences between two measurements of a particular QIB obtained under identical imaging protocols will fall with 95% probability.21,22 Thus, changes in FDG uptake greater than the RC may indicate treatment response.
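As a rough illustration of how the RC relates to test–retest data, the following Python sketch estimates it from paired scans. It assumes the common convention RC = 1.96 × √2 × (within-subject standard deviation), two replicates per patient, and no systematic shift between test and retest; the function name and the SUV values are hypothetical and are not taken from any study cited here.

```python
import math

def repeatability_coefficient(test, retest):
    """Estimate the RC from paired test-retest measurements.

    Assumes two replicates per patient and no systematic test-retest shift,
    so the within-subject variance is sum(d_i^2) / (2n).
    """
    if len(test) != len(retest) or not test:
        raise ValueError("test and retest must be non-empty and of equal length")
    n = len(test)
    # Within-subject variance from the paired differences
    within_var = sum((a - b) ** 2 for a, b in zip(test, retest)) / (2 * n)
    # 95% of absolute test-retest differences are expected to fall below this value
    return 1.96 * math.sqrt(2 * within_var)

# Hypothetical mean SUV values for five patients (illustration only)
test_suv = [4.1, 6.3, 5.0, 7.8, 3.9]
retest_suv = [4.4, 5.9, 5.2, 7.1, 4.0]
rc = repeatability_coefficient(test_suv, retest_suv)
print(f"Estimated RC: {rc:.3f}")
```

Under these assumptions, an observed change in SUV larger than the printed value would be unlikely to arise from measurement noise alone.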
In specifying the research question, one must be realistic about what types of studies are feasible to conduct. Carrying out a performance assessment in the exact intended clinical use setting or assessing performance under true repeatability or reproducibility conditions will not always be possible. It may be ethically inappropriate to conduct repeatability studies of an imaging procedure that delivers an additional or repeated radiation dose to the subject or relies on use of an injected imaging agent for repeat imaging within a short time span. Besides radiation dose, use of imaging agents also requires consideration of washout periods before repeat scans can be conducted. Biological conditions might have changed during this time frame and affect measures of repeatability. True assessment of reproducibility and of the influence of site operators, different equipment models from one manufacturer, or scanners from different manufacturers requires subjects to travel to different sites to undergo repeat scans, which is rarely feasible. Assessment of bias may not be possible in studies involving human subjects due to difficulties in establishing ground truth. Extrapolations from studies using carefully crafted phantoms or from animal studies may be the only options. The degree of heterogeneity observed among these imperfect attempts to replicate a clinical setting may in itself be informative regarding the degree of confidence one can place in these extrapolations. Best efforts should be made to focus meta-analyses on studies conducted in settings that are believed to yield performance metrics that would most closely approximate values of those metrics in the intended clinical use setting.
There are trade-offs between a narrowly focused question versus a broader question to be addressed in a meta-analysis. For the former, few studies might be available to include in the meta-analysis, whereas for the latter, more studies may be available but extreme heterogeneity could make the meta-analysis results difficult to interpret. If the meta-analysis is being conducted to support an investigational device exemption, or imaging device clearance or approval, early consultation with the Food and Drug Administration (FDA) regarding acceptable metrics and clinical settings for performance assessments is strongly advised.
2.2 Study selection process
After carefully specifying a research question and clinical context, one must clearly define search criteria for identification of studies to potentially include in the meta-analysis. For example, for a meta-analysis of test–retest repeatability of FDG uptake, the search criteria may include test–retest studies where patients underwent repeat scans with FDG-PET with or without CT, without any interventions between scans.
Once study selection criteria are specified, an intensive search should be conducted to identify studies meeting those criteria. The actual mechanics of the search can be carried out by a variety of means. Most published papers will be identifiable through searches of established online scientific literature databases. For example, with the search criteria de Langen et al. defined for their meta-analysis, they performed systematic literature searches on Medline and Embase using search terms “PET,” “FDG,” “repeatability,” and “test–retest,” which yielded eight studies. 10 The search should not be limited to the published literature, as the phenomenon of publication bias, namely the tendency to preferentially publish studies that show statistically significant or extreme and usually favorable results, is well known in biomedical research.
Some unpublished information may be retrievable through a variety of means. Information sources might include meeting abstracts and proceedings, study registries such as ClinicalTrials.gov, 23 unpublished technical reports which might appear on websites maintained by academic departments, publicly disclosed regulatory summaries such as FDA 510(K) summaries of clearance or summary of safety and effectiveness data in approval of devices and summary review of approval of drugs, device package inserts, device labels, or materials produced by professional societies. Internet search engines can be useful tools to acquire some of this information directly or to find references to its existence.
Personal contact with professional societies or study investigators may help in identifying additional information. If an imaging device plays an integral role for outcome determination or treatment assignment in a large multicenter clinical trial, clinical site qualification and quality monitoring procedures may have been in place to ensure sites are performing imaging studies according to high standards. Data sets collected for particular studies might contain replicates that could be used to calculate repeatability or reproducibility metrics. Data from such evaluations are typically presented in internal study reports and are not publicly available, but they might be available from study investigators upon request. 24 Any retrieved data sets will be loosely referred to here as “studies,” even though the data might not have been collected as part of a formal study that aimed to evaluate the technical performance of an imaging procedure as is the case with these data collected for ancillary qualification and quality monitoring purposes. Increasingly, high volume data such as genomic data generated by published and publicly funded studies are being deposited into publicly accessible databases. Examples of imaging data repositories include the Reference Image Database to Evaluate Response, the imaging database of the Rembrandt Project, 25 The Cancer Imaging Archive, 26 and the image and clinical data repository of the National Institute of Biomedical Imaging and Bioengineering (NIBIB). 27 The more thoroughly the search is conducted, the greater the chance one can identify high quality studies of the performance metrics of interest with relatively small potential for important bias.
Unpublished studies present particular challenges with regard to whether to include them in a meta-analysis. While there is a strong desire to gather all available information relevant to the meta-analysis question, there is a greater risk that the quality of unpublished studies could be poor because they have not been vetted by peer review. Not only may data from these unpublished studies not be permanently accessible, but access might also be highly selective, since not all evaluations provide information relevant to the meta-analysis. These factors may result in a potential bias toward certain findings in studies for which access is granted. These points emphasize the need for complete and transparent reporting of health research studies to maximize the value and interpretability of research results. 28
The search criteria allow one to retrieve a collection of studies that can be further vetted using more specific inclusion and exclusion criteria to determine if they are appropriate for the meta-analysis. Some inclusion and exclusion criteria might not be verifiable until the study publications or data sets are first retrieved using broader search criteria, at which point they can then be examined in more detail. Additional criteria might include study sample size, language in which material is presented, setting or sponsor of the research study (e.g. academic center, industry-sponsored, government-sponsored, community-based), quality of the study design, statistical analysis, and study conduct, and period during which the study was conducted. Such criteria may be imposed, for example, to control for biases due to differences in expertise in conducting imaging studies, differences in practice patterns, potential biases due to commercial or proprietary interests, and the potential for publication bias (e.g. small studies with favorable outcomes are more likely to be made public than small studies with unfavorable outcomes). Any given set of study selection rules may potentially introduce some degree of bias in the meta-analysis summary results, but clear prespecification of the search criteria at least offers transparency. As an example, in their meta-analysis of the test–retest repeatability of FDG uptake, de Langen et al. used four inclusion/exclusion criteria: (a) repeatability of [18F]-FDG uptake in malignant tumors; (b) standardized uptake values (SUVs) used; (c) uniform acquisition and reconstruction protocols; (d) same scanner used for test and retest scan for each patient. This further removed three of the eight studies identified through the original search. 10
Incorporation of study quality evaluations in the selection criteria is also important. If a particular study has obvious flaws in its design or in the statistical analysis methods used to produce the performance metric(s) of interest, then one should exclude the study from the meta-analysis. An exception might be when the statistical analysis is faulty but is correctable using available data; inclusion of the corrected study results in the meta-analysis may be possible. Examples of design flaws include a lack of blinding of readers to evaluations of other readers in a reader reproducibility study and confounding of important experimental factors with ancillary factors such as order of image acquisition or reading or assignment of readers to images. Statistical analysis flaws might include use of methods for which statistical assumptions required by the method like independence of observations or constant variance are violated. Additionally, data from different studies may overlap, so care should be taken to screen for, and remove, these redundancies as part of assembling the final set of studies. There may be some studies for which quality cannot be judged. This might occur, for example, if study reporting is poor and important aspects of the study design and analysis cannot be determined. These indeterminate situations might best be addressed at the analysis stage, as discussed briefly in Section 9.
For meta-analyses of repeatability and reproducibility metrics, it is particularly important to carefully examine the sources of variability encompassed by the metric computed for each retrieved study. Repeatability metrics from multiple reads from each of several acquired images will reflect a smaller amount of variation than the variation expected when the full image acquisition and interpretation process is repeated. Many different factors, such as clinical site, imaging device, imaging acquisition process, image processing software, or radiologist or imaging technician, can vary in reproducibility assessments. Selection criteria should explicitly state the sources of variation that are intended to be captured for the repeatability and reproducibility metrics of interest. Compliance testing for all these factors with regards to the Quantitative Imaging Biomarkers Alliance (QIBA) profile claim is included in the respective profile compliance sections. Specific tests for factors such as software quality using standardized phantom data and/or digital reference objects are developed and described in the QIBA profile. 29
If a meta-analysis entails assessment of multiple aspects of performance such as bias and repeatability, one must decide whether to include only studies providing information relevant to both aspects or to consider different subsets of studies for each aspect. Similar considerations apply when combining different performance metrics across studies, such as combining a bias estimate from one study with a variance estimate from a different study to obtain a mean square error estimate. Because specific imaging devices may be optimized for different performance aspects, such joint or combined analyses should be interpreted cautiously. 30
2.3 Organizing and summarizing the retrieved study data
Estimates and standard errors of RC of FDG-PET mean SUV associated with each study from de Langen et al., 10 along with study descriptors.
A popular graphical display is a forest plot in which point estimates and confidence intervals for the quantity of interest, in our case the performance metric, from multiple sources are vertically stacked. As an example, Figure 2 is a forest plot of the RC of the FDG-PET mean SUV associated with each study from de Langen et al. 10 Such figures and tables might also include annotations with extracted study descriptors. The goal is to provide a concise summary display of the information from the included studies that is pertinent to the research question.
Forest plot of the repeatability coefficient (RC) of FDG-PET mean SUV associated with each study in the meta-analysis of de Langen et al. 10 Points indicate RC estimates whereas the lines flanking the points indicate 95% confidence intervals.
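The tabular and graphical summaries described in this section can be sketched in a few lines of Python. The example below assumes that each study contributes an RC estimate and a standard error and, purely for illustration, forms normal-approximation 95% intervals; the study labels and numbers are hypothetical and are not the values reported by de Langen et al.

```python
def forest_rows(studies, z=1.96):
    """Return (label, estimate, lower, upper) for each (label, estimate, se)."""
    return [(label, est, est - z * se, est + z * se) for label, est, se in studies]

def render_forest(rows, width=40):
    """Render a plain-text forest plot: '*' marks the estimate, '-' the interval."""
    lo = min(r[2] for r in rows)
    hi = max(r[3] for r in rows)
    span = (hi - lo) or 1.0
    lines = []
    for label, est, lower, upper in rows:
        cells = [" "] * (width + 1)
        start = round((lower - lo) / span * width)
        stop = round((upper - lo) / span * width)
        for i in range(start, stop + 1):
            cells[i] = "-"
        cells[round((est - lo) / span * width)] = "*"
        lines.append(
            f"{label:<10}|{''.join(cells)}| {est:.2f} [{lower:.2f}, {upper:.2f}]"
        )
    return lines

# Hypothetical study-level RC estimates and standard errors (illustration only)
studies = [("Study A", 1.10, 0.25), ("Study B", 0.85, 0.15), ("Study C", 1.40, 0.40)]
for line in render_forest(forest_rows(studies)):
    print(line)
```

Normal-approximation intervals are used here only to keep the sketch short; as discussed in Section 3, they may be unreliable when studies are small.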
3 Statistical methodology for meta-analyses
Suppose that, through the systematic review procedures described in Section 2, K suitable studies are identified. Also suppose that in the hth study, the investigators obtained ph measurements using the imaging procedure for each of nh patients, with
This section and the following one describe methodology for when patient-level measurements
Standard fixed-effects and random-effects meta-analysis techniques rely on the approximate normality of the study-specific technical performance metric estimates. Many common technical performance metrics, including the intra-class correlation (ICC), mean squared deviation (MSD), and the RC will indeed become approximately normally distributed when the sample sizes of each of these studies, denoted
The performances of standard meta-analysis techniques will suffer when some of the studies are small because of the resulting nonnormality of the technical performance metric estimates. Kontopantelis and Reeves 33 present simulation studies indicating that when study-specific test statistics in a meta-analysis are nonnormal and each of the studies is small, the coverage probabilities of confidence intervals from standard meta-analysis techniques are less than the nominal level; simulation studies, presented in Section 5.1, confirm these findings. One possible modification would be to use the exact distribution of the metric estimates, if it is analytically tractable, in place of the normal approximation, similar to what van Houwelingen et al. suggest. 34 For a few of these metrics, the exact distribution is analytically tractable; for example, as mentioned before, the squared RC has a gamma distribution. Such modified techniques are described in subsequent sections.
For the remainder of this section, it is assumed that study descriptors that can explain variability in the study-specific technical performance metrics are unavailable. Methodology for meta-analysis in the presence of study descriptors, or meta-regression, is described in Section 4. Figure 3 depicts the statistical methodology approach for meta-analysis of a technical performance metric in the absence of study descriptors.
Meta-flowchart for statistical meta-analysis methodology in the absence of study descriptors. Boxes with dashed borders indicate areas where future development of statistical methodology is necessary.
3.1 Tests for homogeneity
A test for homogeneity is represented by a test of the null hypothesis
However, if the normality assumption for
B simulations
A rejection of the null hypothesis of homogeneity indicates that fixed-effects meta-analysis techniques should not be used. However, failure to reject the hypothesis merely indicates that there is insufficient evidence to refute the assumption that the fixed-effects model is correct. Due to the heterogeneity inherent in QIB studies, it is recommended to always use random-effects models such as those described in Section 3.3 for QIB applications, even though fixed-effects models are computationally simpler and are more efficient than random-effects approaches when the fixed-effects assumption is truly satisfied.
A limitation of a test for the existence of heterogeneity is that it does not quantify the impact of heterogeneity on a meta-analysis. Higgins and Thompson 37 and Higgins et al. 38 give two measures of heterogeneity, H and $I^2$, and suggest that the two measures should be presented in published meta-analyses in preference to the test for heterogeneity. H is the square root of the heterogeneity statistic Q divided by its degrees of freedom. That is, $H = \sqrt{Q/(K-1)}$.
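The heterogeneity measures can be computed directly from study-level estimates and standard errors. The sketch below assumes that Q is the usual Cochran statistic built from inverse-variance weights and that $I^2 = (H^2 - 1)/H^2$, truncated at zero, as commonly defined; the function name is hypothetical and the input values are illustrative.

```python
import math

def heterogeneity(estimates, std_errors):
    """Return (fixed-effects pooled estimate, Q, H, I^2) for K >= 2 studies.

    Assumes Cochran's Q with inverse-variance weights w_h = 1 / se_h^2.
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching standard errors")
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    # Weighted sum of squared deviations from the pooled estimate
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, estimates))
    df = k - 1
    h = math.sqrt(q / df)
    # I^2 = (Q - df) / Q, truncated at zero when Q <= df
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, h, i2

# Hypothetical RC estimates and standard errors (illustration only)
pooled, q, h, i2 = heterogeneity([1.10, 0.85, 1.40], [0.25, 0.15, 0.40])
print(f"pooled={pooled:.3f}, Q={q:.3f}, H={h:.3f}, I2={i2:.3f}")
```

The same normality caveat applies: with small studies, Q may not follow its nominal chi-square reference distribution.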
3.2 Inference for fixed-effects models
For standard meta-analysis under the fixed-effects model, where
The standard error of
A Bayesian approach can also be applied to estimate θ. A prior distribution for θ could be a normal distribution, specifically
In practice,
However, if the metric estimates
3.3 Inference for random-effects models
Under the random-effects model, it is assumed that the study-specific actual technical performance metrics
The estimator for θ under these conditions is
The standard error of
To obtain an estimate for η, one option is the method of moments estimator from DerSimonian and Laird. 40
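A minimal Python sketch of the method of moments approach follows. It assumes the standard DerSimonian and Laird form, in which the between-study variance is estimated from Cochran's Q and truncated at zero, and then added to each study's variance when forming the weights; the variable name `between_var` is a stand-in for the heterogeneity parameter, and the notation may differ from that used in this paper.

```python
import math

def dersimonian_laird(estimates, std_errors):
    """Return (pooled estimate, its standard error, between-study variance)."""
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching standard errors")
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    fixed = sum(wi * t for wi, t in zip(w, estimates)) / sw
    q = sum(wi * (t - fixed) ** 2 for wi, t in zip(w, estimates))
    # Scaling constant from the expectation of Q under the random-effects model
    c = sw - sum(wi ** 2 for wi in w) / sw
    between_var = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add the between-study variance to each study variance
    w_star = [1.0 / (se ** 2 + between_var) for se in std_errors]
    pooled = sum(wi * t for wi, t in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    return pooled, se_pooled, between_var

# Hypothetical RC estimates and standard errors (illustration only)
pooled, se_pooled, between_var = dersimonian_laird([1.10, 0.85, 1.40], [0.25, 0.15, 0.40])
print(f"pooled={pooled:.3f}, se={se_pooled:.3f}, between-study variance={between_var:.4f}")
```

When the estimated between-study variance is zero, the result coincides with the fixed-effects inverse-variance estimate.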
Bayesian techniques can be used under the random-effects model. Prior distributions of θ and
If study sizes preclude the normality approximation to the likelihoods, then exact likelihoods can be used in place of their normal approximations in these Bayesian techniques similarly as for fixed-effects meta-analysis if the form of the distributions of
Alternatively, one can relax any assumptions of the distribution of
For inferences on θ under this setup, van Houwelingen et al. propose the Expectation-Maximization (EM) algorithm described in Laird. 45
The algorithm begins with initial guesses for
Adapting this procedure to accommodate nonnormally distributed study-specific technical performance metric estimates
To obtain confidence intervals for θ based on this method, the nonparametric bootstrap can be used. The study-specific technical performance estimates
Analogous semiparametric Bayesian techniques can be used for inferences on G. Ohlssen et al. describe Dirichlet processes 46 that are applicable both for when the Th are normally distributed and when they are nonnormally distributed but have a known, familiar parametric form. 47
4 Meta-regression: Meta-analysis in the presence of study descriptors
In some cases, study descriptors may explain a significant portion of the variation among the study-specific actual performance metrics
Meta-flowchart for statistical meta-regression methodology in the presence of study descriptors. Boxes with dashed borders indicate areas where future development of statistical methodology is necessary.
4.1 Fixed-effects meta-regression
Fixed-effects meta-regression extends fixed-effects meta-analysis by replacing the mean,
Inferences here involve weighted least squares estimation of the coefficients
The standard errors of the estimators,
However, this approach assumes that the study-specific technical performance estimates Th are approximately normally distributed, which is reasonable if each study contains a sufficiently large number of patients; recall that in Section 3 it was suggested that if the technical performance metric is RC, the normal approximation is satisfactory if all studies contain 80 or more patients. Knapp and Hartung introduced a novel variance estimator of the effect estimates and an associated t-test procedure in random-effects meta-regression (see Section 4.2). 51 The test showed improvement compared to the standard normal-based test and can be applied to fixed-effects meta-regression by setting the variance of random effects to zero.
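For a single study descriptor, the weighted least squares estimation used in fixed-effects meta-regression has a closed form, sketched below in Python. The sketch assumes inverse-variance weights with the study-level variances treated as known, which is the normal-approximation setting rather than the Knapp and Hartung adjustment; the function name, the covariate, and the data are hypothetical.

```python
import math

def fixed_effects_meta_regression(x, estimates, std_errors):
    """Weighted least squares fit of estimates on one covariate x.

    Returns (intercept, slope, se_intercept, se_slope), with weights
    w_h = 1 / se_h^2 and study variances treated as known.
    """
    if not (len(x) == len(estimates) == len(std_errors)) or len(x) < 2:
        raise ValueError("need at least two studies with matching inputs")
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    t_bar = sum(wi * ti for wi, ti in zip(w, estimates)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        raise ValueError("covariate must vary across studies")
    sxt = sum(wi * (xi - x_bar) * (ti - t_bar) for wi, xi, ti in zip(w, x, estimates))
    slope = sxt / sxx
    intercept = t_bar - slope * x_bar
    se_slope = math.sqrt(1.0 / sxx)
    se_intercept = math.sqrt(1.0 / sw + x_bar ** 2 / sxx)
    return intercept, slope, se_intercept, se_slope

# Hypothetical study descriptor and RC estimates (illustration only)
descriptor = [0.0, 1.0, 2.0, 3.0]
rc_estimates = [0.80, 0.95, 1.05, 1.30]
rc_std_errors = [0.20, 0.15, 0.25, 0.30]
print(fixed_effects_meta_regression(descriptor, rc_estimates, rc_std_errors))
```

A Wald-type test of the slope would compare slope / se_slope with a standard normal reference, subject to the small-study caveats noted above.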
If the exact distribution of Th is analytically tractable, then the relationship between Th and xh may be represented by a generalized linear model. For example, if Th is a RC, then given the gamma distribution of
4.2 Random-effects meta-regression
Fixed-effects meta-regression models utilizing the available study-level covariates as described in Section 4.1 are sometimes inadequate for explaining the observed between-study heterogeneity. Random-effects meta-regression can address this excess heterogeneity analogously to the way that random-effects meta-analysis (Section 3.3) can be used as an alternative to fixed-effects meta-analysis (Section 3.2).
Standard random-effects meta-regression assumes that the true effects are normally distributed with mean equal to the linear predictor
An iterative weighted least squares method can be applied to estimate the model parameters. 53 Under the proposed model, the variance of Th is
Conditional on these current estimates of
Estimate the weights
Use the estimated weights to update the estimates for
In STATA, 54 this algorithm can be accessed with the command metareg. 55 In R, 56 the metafor package 57 has a function called rma that can fit random-effects meta-regression models. Note that for the case of one covariate, for step 3, the estimates of
Unbiased and nonnegative estimators of the standard errors of the weighted least-square estimators
Because standard random-effects meta-regression inference techniques rely on approximate normality of the study-specific performance metric estimates
Some alternative random-effects meta-regression approaches have been suggested for the situation where normality assumptions are deemed inappropriate. Knapp and Hartung proposed an improved test by deriving the nonnegative invariant quadratic unbiased estimator of the variance of the overall treatment effect estimator. 51 They showed this approach to yield more appropriate false-positive rates than approaches based on asymptotic normality. Higgins and Thompson confirmed these findings with more extensive simulations. 58
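The following Python sketch illustrates the general idea of random-effects meta-regression with a single covariate, together with a variance adjustment in the spirit of Knapp and Hartung. It is a minimal illustration under stated assumptions, not the algorithm of the cited references: the between-study variance is updated with a simple moment-based fixed-point iteration chosen here for brevity, the adjustment factor is not truncated, and the function names and input values are hypothetical.

```python
import math

def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x with weights w."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1, sxx

def random_effects_meta_regression(x, t, v, max_iter=500, tol=1e-12):
    """x: study-level covariate, t: study estimates T_h, v: within-study variances."""
    k, p = len(t), 2  # number of studies; intercept and slope
    tau2 = 0.0        # between-study variance, started at the fixed-effects value
    for _ in range(max_iter):
        # Weights are inverse total variances given the current tau2.
        w = [1.0 / (vi + tau2) for vi in v]
        b0, b1, _ = weighted_line_fit(x, t, w)
        resid2 = [(ti - b0 - b1 * xi) ** 2 for ti, xi in zip(t, x)]
        # Assumed moment-based update of tau2, truncated at zero.
        num = sum(wi ** 2 * (k / (k - p) * r2 - vi)
                  for wi, r2, vi in zip(w, resid2, v))
        new_tau2 = max(0.0, num / sum(wi ** 2 for wi in w))
        converged = abs(new_tau2 - tau2) < tol
        tau2 = new_tau2
        if converged:
            break
    # Final fit at the converged tau2.
    w = [1.0 / (vi + tau2) for vi in v]
    b0, b1, sxx = weighted_line_fit(x, t, w)
    # Adjustment factor: weighted residual mean square on k - p degrees of freedom.
    q = sum(wi * (ti - b0 - b1 * xi) ** 2
            for wi, ti, xi in zip(w, t, x)) / (k - p)
    return {
        "intercept": b0,
        "slope": b1,
        "tau2": tau2,
        "se_slope_normal": math.sqrt(1.0 / sxx),   # model-based standard error
        "se_slope_adjusted": math.sqrt(q / sxx),   # for use with a t distribution on k - p df
    }

# Hypothetical inputs, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
t = [0.9, 1.3, 1.2, 1.8, 1.7, 2.4]
v = [0.04, 0.03, 0.05, 0.02, 0.06, 0.03]
fit = random_effects_meta_regression(x, t, v)
```

The adjusted standard error would be paired with quantiles of a t distribution with K minus 2 degrees of freedom rather than normal quantiles; setting the between-study variance to zero throughout gives the fixed-effects version.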
Alternatively, if the exact distribution of Th is of a known and tractable form, one may be able to apply inference techniques for generalized linear mixed models making use of the fact that the model of Th for random-effects meta-analysis has the form of a linear mixed model.53,59 For example, if Th is the RC for the hth study, then given the gamma distribution of
5 Application of statistical methodology to simulations and actual examples
Statistical meta-analysis techniques described in Section 3 were applied to simulated data to examine their performance for inference on the RC under a variety of settings in which factors such as the number of studies in the meta-analysis, the sizes of these studies, and distributional assumptions were varied. Coverage probabilities of 95% confidence intervals for RC produced using standard fixed-effects and random-effects meta-analysis were frequently less than 0.95 when some of the primary studies had a small number of patients. In comparison, 95% confidence intervals produced through techniques such as fixed-effects meta-analysis using the exact likelihood in place of the normal approximation or the EM algorithm approach for random-effects meta-analysis had improved coverage, often near 0.95, even in situations where some of the primary studies were small. All of the random-effects meta-analysis techniques described in Section 3.2 produced 95% confidence intervals with coverage probabilities less than 0.95 when the number of studies was very small. Further details of the simulation studies are presented in Section 5.1.
Statistical meta-analysis techniques described in Sections 3 and 4 were also applied to the FDG-PET uptake test–retest data from de Langen et al. 10 for purposes of illustration. Those results are presented in Section 5.2.
5.1 Simulation studies
Data were simulated for each of K studies, where the data in the hth study consisted of ph repeat QIB measurements, all of which were assumed to have been acquired through identical image acquisition protocols, for each of nh subjects. The performance metric of interest was θ = RC.
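As a minimal sketch of this data-generating step, the Python code below simulates repeat measurements for several studies and computes a study-specific RC estimate for each. It assumes normally distributed measurement errors with a common within-subject standard deviation and the conventional definition RC = 1.96 × sqrt(2 × within-subject variance). The function names, the distribution of subject-level true values, and all parameter values are illustrative choices and are not those of the simulation studies reported here.

```python
import math
import random

def simulate_study(n_subjects, n_repeats, sigma_w, rng):
    """Return one list of repeat QIB measurements per subject."""
    data = []
    for _ in range(n_subjects):
        true_value = rng.gauss(5.0, 1.0)  # hypothetical subject-level true QIB value
        data.append([true_value + rng.gauss(0.0, sigma_w) for _ in range(n_repeats)])
    return data

def repeatability_coefficient(data):
    """RC = 1.96 * sqrt(2 * within-subject variance), pooled over subjects."""
    ss_within = 0.0
    df = 0
    for repeats in data:
        mean = sum(repeats) / len(repeats)
        ss_within += sum((y - mean) ** 2 for y in repeats)
        df += len(repeats) - 1  # each subject contributes (repeats - 1) degrees of freedom
    return 1.96 * math.sqrt(2.0 * ss_within / df)

rng = random.Random(2024)
sigma_w = 0.4                                 # assumed within-subject standard deviation
true_rc = 1.96 * math.sqrt(2.0) * sigma_w     # RC implied by sigma_w
study_sizes = [10, 20, 45, 80, 150]           # hypothetical numbers of subjects per study
rc_estimates = [
    repeatability_coefficient(simulate_study(n, 2, sigma_w, rng))
    for n in study_sizes
]
```

Under these assumptions the smaller studies would be expected to give noisier RC estimates than the larger ones, which is the feature the simulation studies are designed to probe.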
Fixed-effects meta-analysis techniques were examined first under the ideal scenario of a large number of studies, all of which had a large number of subjects. For each of
Simulation studies for
Fixed-effects meta-analysis simulation studies results; coverage probabilities of 95% confidence intervals from each technique, computed over 1000 simulations.
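A coverage probability of this kind can be estimated along the following lines. The sketch assumes normal measurement errors, so that the squared RC estimate divided by the squared true RC follows a chi-square distribution divided by its degrees of freedom (for a study with n subjects and p repeats each, n(p − 1) degrees of freedom). It uses an inverse-variance fixed-effects estimator with delta-method variances as a stand-in for the standard normal-approximation technique. The function names and settings are hypothetical and do not reproduce the configurations behind the reported results.

```python
import math
import random

Z975 = 1.959964  # 97.5th percentile of the standard normal distribution

def draw_rc_estimate(true_rc, df, rng):
    """Draw an RC estimate: RC_hat^2 / RC^2 is distributed as chi-square(df) / df."""
    return true_rc * math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)

def fixed_effects_ci(estimates, dfs):
    """Inverse-variance pooled RC with delta-method variances RC_hat^2 / (2 * df)."""
    weights = [2.0 * df / (t * t) for t, df in zip(estimates, dfs)]
    pooled = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    half_width = Z975 / math.sqrt(sum(weights))
    return pooled - half_width, pooled + half_width

def coverage(true_rc, dfs, n_sim, seed):
    """Fraction of simulated 95% intervals that contain the true RC."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        estimates = [draw_rc_estimate(true_rc, df, rng) for df in dfs]
        lower, upper = fixed_effects_ci(estimates, dfs)
        hits += lower <= true_rc <= upper
    return hits / n_sim

# Five large studies versus twenty very small ones (hypothetical settings).
coverage_large = coverage(1.0, [200] * 5, 1000, seed=1)
coverage_small = coverage(1.0, [5] * 20, 1000, seed=1)
```

In this sketch the weights depend on the estimates themselves, so studies that happen to produce small RC estimates receive large weights. This is one mechanism by which a normal-approximation interval can undercover when studies are small.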
Similar simulation studies were performed to examine the random-effects meta-analysis techniques. The process to simulate the data here was identical to before, except that the within-patient QIB measurement variance used to generate the repeat QIB measurements
Random-effects meta-analysis simulation studies results with normally distributed study effects
Random-effects meta-analysis simulation studies results with nonnormally distributed study effects
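The effect of the number of studies on random-effects coverage can be explored with a sketch such as the one below. It assumes normally distributed study-specific RCs around the overall value and uses the DerSimonian and Laird moment estimator with a normal-quantile interval as a familiar stand-in; this is not necessarily one of the techniques compared in these tables, and the function names and parameter values are hypothetical.

```python
import math
import random

Z975 = 1.959964  # 97.5th percentile of the standard normal distribution

def dersimonian_laird_ci(estimates, variances):
    """95% interval for the mean effect using the DerSimonian-Laird tau^2 estimate."""
    k = len(estimates)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * t for wi, t in zip(w, estimates)) / sw
    q = sum(wi * (t - fixed) ** 2 for wi, t in zip(w, estimates))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)  # moment estimate, truncated at zero
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * t for wi, t in zip(w_re, estimates)) / sum(w_re)
    half_width = Z975 / math.sqrt(sum(w_re))
    return pooled - half_width, pooled + half_width

def coverage(n_studies, df, mean_rc, sd_rc, n_sim, seed):
    """Fraction of simulated intervals that contain the mean study-specific RC."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        estimates, variances = [], []
        for _study in range(n_studies):
            study_rc = rng.gauss(mean_rc, sd_rc)
            while study_rc <= 0.0:  # redraw the rare nonpositive value
                study_rc = rng.gauss(mean_rc, sd_rc)
            t = study_rc * math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
            estimates.append(t)
            variances.append(t * t / (2.0 * df))  # delta-method variance
        lower, upper = dersimonian_laird_ci(estimates, variances)
        hits += lower <= mean_rc <= upper
    return hits / n_sim

# Few versus many studies, each with 200 within-subject degrees of freedom.
coverage_few = coverage(3, 200, 1.0, 0.2, 1000, seed=7)
coverage_many = coverage(40, 200, 1.0, 0.2, 1000, seed=7)
```

With only a handful of studies the between-study variance is estimated from very few degrees of freedom, so an interval based on normal quantiles would be expected to undercover in this sketch even though every study is large.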
Regardless of the distribution of
For both normally distributed (Table 3) and nonnormally distributed
The EM algorithm approach using the normal approximation to the likelihood produced 95% confidence intervals whose coverage probabilities approached 0.95 when the meta-analysis contained at least 15 studies, all studies contained at least 45 patients, and
5.2 FDG-PET SUV test–retest repeatability example
de Langen et al. conducted a systematic literature search on Medline and Embase using the search terms “PET,” “FDG,” “repeatability,” and “test–retest” and screened the identified studies against four criteria: (a) repeatability of 18F-FDG PET uptake in malignant tumors; (b) SUVs used; (c) uniform acquisition and reconstruction protocols; (d) same scanner used for test and retest scan for each patient. Their search retrieved
The authors of this manuscript reviewed available data and results from these studies and produced study-specific estimates for the RC of SUVmean, maximized over all lesions per patient for reasons of simplicity; this sidestepped the issue of clustered data as three of the studies involved patients with multiple lesions. Fixed-effects and random-effects meta-analysis techniques from Sections 3.2 and 3.3 were performed, as well as univariate fixed-effects meta-regression techniques from Section 4.1 using median SUVmean, median tumor volume, and proportion of patients with thoracic lesions versus abdominal as study-level covariates. Random-effects meta-regression was not performed due to limitations from the small number of studies.
Summary statistics and study descriptors from these studies are given in Table 1. RC estimates ranged from 0.516 to 2.033. Aside from Velasquez et al., 11 which contained 45 patients, none of whom had thoracic lesions, the studies enrolled between 10 and 21 patients, between 81 and 100% of whom had thoracic lesions. 10 Aside from Minn et al., 15 which stood out for its large tumors (median tumor volume of 40 cm3) and high uptakes (median SUVmean of 8.8), median tumor volumes and median SUVmean ranged from 4.9 to 6.4 cm3 and from 4.5 to 6.8, respectively. 10
Estimates of the median or common RC θ, with 95% confidence intervals, for the FDG-PET test–retest data from de Langen et al., 10 using various meta-analysis techniques.
Estimates of the slope and intercept parameters for fixed-effects meta-regression with 95% confidence intervals for the FDG-PET test–retest data from de Langen et al., 10 where meta-regressions were univariate upon median SUVmean, median tumor volume in cm3, and proportion of thoracic versus abdominal patients.
This analysis was presented to illustrate the application of the techniques from this manuscript to actual data rather than to provide new results about the repeatability of FDG uptake and how it varies as a function of study or patient characteristics. For a more comprehensive meta-analysis and discussion, the reader is referred to de Langen et al. 10
6 Individual patient-level meta-analysis of technical performance
An alternative to the study-level meta-analysis techniques described in Sections 3 and 4 is individual patient-level meta-analyses, where patients, rather than studies, are the unit of analysis. One approach is to use the patient-level data to compute the study-specific technical performance metrics
Various simulation studies and meta-analyses of actual data indicate that using individual patient-level approaches often does not result in appreciable gains in efficiency.62,63 However, patient-level data allow direct computation of summary statistics that study investigators may not have considered in their analyses. This bypasses the need to extract the technical performance metric of interest from existing summary statistics or to exclude studies entirely if this metric was not calculable from the reported summary statistics.
Using patient-level data provides advantages in meta-regression when the technical performance is a function of characteristics that may vary at the patient level rather than the study level such as contrast of tumor with surrounding tissue, complexity of lesion shape, baseline size of tumor itself, baseline mean uptake, and physiological factors such as breath hold and patient motion.10,64–67 In this case, performing study-level meta-regression with summary statistics of these patient-level characteristics such as median baseline tumor size or median baseline mean SUV as the covariates will result in substantially reduced power in testing the null hypothesis that
7 Extension to other metrics and aspects of technical performance
The general process to formulate the research question, identify appropriate studies for the meta-analysis, and to organize the relevant data presented in Section 2 for a variety of technical performance metrics is similar to that described for repeatability. The exposition of the methodology and examples in Sections 3 through 6 has been in the context of RC, and these aspects will differ for other technical performance metrics. RC was selected for purposes of simplicity because not only does the study-specific RC estimate become approximately normally distributed as the size of the study gets large, but the exact distribution of the squared RC is analytically tractable. In principle, the methods presented in Sections 3 through 6 could be modified to conduct meta-analyses of other repeatability metrics as well as reproducibility, bias, linearity, and agreement metrics, even though the meta-analysis itself may be noticeably more computationally and analytically difficult.
Standard meta-analysis techniques in the literature rely largely on the approximate normality of the study-specific estimated technical performance metrics Th. The simulation studies shown in Section 5.1 demonstrated how this assumption can adversely affect the performance of these methods for many technical performance metrics for which the exact distribution of Th is nonnormal. Even though many of them, including ICC for repeatability, reproducibility coefficient for reproducibility, and MSD and 95% total deviation index (TDI) for agreement, do indeed converge to normality as the study size increases,70–73 studies assessing technical performance often are small.
An alternative approach when the normal approximation is not satisfactory is to use the exact likelihood in place of the normal approximation in standard meta-analysis techniques as described in Sections 3 and 4. For RC, for which study-specific estimates have a gamma distribution, this modification led to an improvement in coverage probabilities of the 95% confidence intervals when the sample sizes were small or when the meta-analysis contained small studies in addition to larger ones. Unfortunately, estimates for most other technical performance metrics will not have such an analytically tractable distribution, making this option often infeasible. However, estimates for some metrics may converge rapidly to normality; for example, Lin showed that a normal approximation to the distribution of the 95% TDI was valid for sample sizes as small as 10. 73 If this is the case, standard meta-analysis techniques should be appropriate even when some studies are small.
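For the RC, the exact-likelihood idea can be written out briefly. The following is a sketch under the assumptions of normally distributed measurement errors, independent studies, and a common RC θ, with ν_h denoting the within-subject degrees of freedom of the hth study (for example n_h(p_h − 1)); the symbol ν_h is introduced here for illustration.

```latex
\begin{align*}
\frac{\nu_h T_h^2}{\theta^2} &\sim \chi^2_{\nu_h}
  \quad\Longleftrightarrow\quad
  T_h^2 \sim \mathrm{Gamma}\!\left(\text{shape}=\tfrac{\nu_h}{2},\
  \text{scale}=\tfrac{2\theta^2}{\nu_h}\right), \\
\ell(\theta^2) &= \sum_{h=1}^{K}\left[-\frac{\nu_h}{2}\log\theta^2
  - \frac{\nu_h T_h^2}{2\theta^2}\right] + \text{const}, \\
\hat{\theta}^2 &= \frac{\sum_{h=1}^{K}\nu_h T_h^2}{\sum_{h=1}^{K}\nu_h},
  \qquad
  \frac{\sum_{h=1}^{K}\nu_h T_h^2}{\theta^2} \sim \chi^2_{\sum_h \nu_h}, \\
\text{95\% CI for } \theta &:\quad
  \left(\sqrt{\frac{\sum_{h}\nu_h T_h^2}{\chi^2_{0.975,\,\sum_h \nu_h}}},\
  \sqrt{\frac{\sum_{h}\nu_h T_h^2}{\chi^2_{0.025,\,\sum_h \nu_h}}}\right).
\end{align*}
```

Under these assumptions the pooled estimate weights each squared RC estimate by its degrees of freedom rather than by an estimated variance, and the interval requires no normal approximation, so small studies do not distort it in the way they can distort inverse-variance weighting.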
If the exact likelihood is intractable and the convergence to normality is slow, then fully nonparametric meta-analysis techniques may be the only option. Nonparametric meta-analysis techniques have received very little attention in the literature thus far.
8 Reporting the results of a meta-analysis
Meta-analyses should be reported in a complete and transparent fashion in order to ensure proper interpretation and dissemination of the results. High-quality reporting allows readers to evaluate the context in which the conclusions of the meta-analysis apply and to assess potential biases. Reporting guidelines have been proposed for other types of health research meta-analyses, including Quality of Reporting of Meta-Analyses (QUOROM) for randomized trials 74 and its update, Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA),75,76 which applies to a broader range of studies, but particularly to studies involving some type of intervention, and Meta-Analysis of Observational Studies in Epidemiology (MOOSE) 77 for observational epidemiologic studies. Key reporting elements are assembled into checklists, which can serve several functions. They can aid journal editors and referees who review submitted papers reporting on meta-analyses. Investigators can consult the checklist when they are planning a meta-analysis to be reminded of all of the study design and analysis issues that should be considered because they can have an impact on the quality and interpretability of findings. In addition, meta-analysis reporting guidelines can provide a framework for organization of information that is useful to regulatory authorities, funding agencies, and third party payers who need to evaluate a body of evidence for performance of a particular QIB.
Development of reporting guidelines for a particular class of health research studies has traditionally been a multiyear process involving a team of experts. Evidence for the need for improved reporting of systematic reviews and meta-analyses in radiology was provided by a recent study that demonstrated an association of study quality with completeness of reporting of such studies in major radiology journals. 78 The analysis also demonstrated that there remains substantial room for improvement in study reporting in radiology. 79 Reporting guidelines specific to imaging procedure technical performance studies have not been proposed to date.
Checklist of items to report for meta-analyses of performance assessments of quantitative imaging biomarkers.
9 Discussion
Meta-analysis methods for summarizing results of studies of an imaging procedure’s technical performance were presented in this paper. Such technical performance assessments are important early steps toward establishing clinical utility of a QIB. Conclusions drawn from multiple technical performance studies will generally be more convincing than those drawn from a single study, as a collection of multiple studies overcomes limitations of small sample sizes of individual studies evaluating the technical performance of an imaging procedure and provides the opportunity to examine the robustness of the imaging procedure’s technical performance across varied clinical settings and patient populations.
One challenge in the meta-analysis of the technical performance of an imaging procedure is that completed studies that specifically evaluate technical performance of an imaging procedure are limited in number, although one may still be able to extract estimates of some performance metrics from data and results of a study in which assessing technical performance was not the primary objective, provided the study design and image analysis procedure allow it. Another challenge is extreme heterogeneity that is possible due to studies being performed under widely differing conditions. Another is that normality assumptions underlying many standard meta-analysis techniques are often violated due to the typically small sample sizes, together with the mathematical form of many performance metric estimates.
Modifications to the standard meta-analysis approaches for nonnormally distributed performance metrics were described in the context of the RC. Application of statistical techniques for meta-analysis presented in this paper to simulation studies indicated that these modified techniques outperformed standard techniques when the study sizes were small. However, even with such modifications, the performance of random-effects meta-analysis techniques suffered when the number of studies was small; this was not surprising since a large number of studies would be necessary for inferences on between-study variation in technical performance. Theoretical results and additional simulation studies to further examine the characteristics of these modifications are an area of future research.
It is important to recognize that, in any meta-analysis, the quality of the primary studies will have an impact on the reliability of the meta-analysis results. Inclusion of fundamentally flawed data into a meta-analysis will only diminish the reliability of the overall result. More often there will be studies of questionable quality and there will be uncertainty about whether to include them in the meta-analysis. Sensible approaches in this situation might include evaluating the impact of including the questionable studies in the meta-analysis through sensitivity analyses or statistically down-weighting their influence in the analysis. A full discussion of these approaches is beyond the scope of this paper.
Many of the concepts, approaches, and challenges discussed extend to other technical performance characteristics besides repeatability, though in practice, meta-analyses of these characteristics may be substantially more difficult. Study selection for meta-analyses of reproducibility and agreement is more complicated as studies in the literature assessing the reproducibility of an imaging procedure or its agreement with standard methods of measuring the quantity of interest are more heterogeneous than those assessing repeatability. For reproducibility studies, sources of variation between repeat scans for each patient such as imaging device, image acquisition protocol, and time point at which each scan takes place may differ. The reference standard or alternative method against which the agreement of the imaging procedure is assessed also often varies among studies, potentially making accumulation of a reasonably homogeneous set for meta-analyses of agreement difficult. Furthermore, exact distributions of most reproducibility and agreement metrics, and many repeatability metrics for that matter, are not analytically tractable, which makes approaches such as fixed-effects meta-analysis using the exact likelihood or the EM algorithm approaches in random-effects meta-analysis infeasible. Modifications of meta-analysis techniques for this scenario are an area of future research.
The methodology described focused on the meta-analysis of a single QIB, but it is worth noting that the same clinical image is often used for multiple tasks, such as detection of a tumor, localization of a tumor, and measurement and characterization of a tumor, each of which involves a different QIB. While each such QIB could be analyzed on its own using the methods described in Sections 3, 4, and 6, a joint analysis would require a multivariate approach to take correlations between QIBs into account. Although such an approach will be methodologically more complex, it may potentially yield more accurate estimators of the technical performance of each individual QIB. 80
The challenges identified throughout this discussion of meta-analysis methods for imaging procedure technical performance suggest several recommendations. First, investigators should be encouraged to conduct and publish imaging procedure performance studies so that more information will be available, from which reliable conclusions could be drawn. These studies must also be reported in complete and transparent fashion so that they are appropriately interpreted. In addition, greater coordination among investigators and perhaps recommendations from relevant professional societies regarding the types of studies that should be performed would help to promote greater comparability among studies and facilitate combination of results across studies. Finally, these discussions have identified fertile ground for interesting statistical problems, and statistical researchers are encouraged to pursue further work in this area.
Footnotes
Acknowledgements
The authors acknowledge and appreciate the Radiological Society of North America and NIH/NIBIB contract # HHSN268201000050C for supporting two workshops and numerous conference calls for the authors' Working Group. The authors would also like to thank Huiman Barnhart and Daniel Sullivan from Duke University and Gene Pennello, Norberto Pantoja-Galicia, Robert Ochs, Shing Chun Benny Lam, and Mary Pastel from the FDA for their expert advice and comments on this manuscript.
This effort was motivated by the activities of QIBA, 81 whose mission is to improve the value and practicality of QIBs by reducing variability across devices, patients, and time.
Conflicts of interest
The following authors declare conflicts of interest: Paul Kinahan (research contract, GE Healthcare); Anthony P. Reeves (co-inventor on patents and pending patents owned by Cornell Research Foundation, which are nonexclusively licensed to GE and related to technology involving computer-aided diagnostic methods, including measurement of pulmonary nodules in CT images; research support in the form of grants and contracts from NCI, NSF, American Legacy Foundation, Flight Attendants' Medical Research Institute; stockholder of Visiongate Inc., a company which is developing optical imaging technology for the analysis of individual cells; owner of D4Vision Inc., a company that licenses software for image analysis); Alexander R. Guimaraes (expert witness, Siemens speakers bureau); Gudrun Zahlmann (employee of F. Hoffmann-La Roche, Ltd).
