Abstract
Introduction:
This meta-review aims to evaluate reviews on adherence of diagnostic test accuracy (DTA) studies to the STAndards for Reporting of Diagnostic Accuracy Studies (STARD) 2015 checklist.
Methods:
We searched MEDLINE and EMBASE and included reviews that used STARD 2015 to evaluate completeness of reporting of primary DTA studies in humans in any field of research and that presented their results quantitatively. We extracted data independently, in duplicate. Random-effects meta-analysis models using restricted maximum likelihood estimation were used to determine the overall mean and 95% confidence interval (CI) of adherence to STARD 2015, compare fields of research, and evaluate adherence over time. Subgroup analyses comparing adherence across fields of research (diagnostic imaging, lab/biomarker, and other) were performed, with pairwise differences between subgroups tested using meta-regression models with field of research as a categorical moderator.
Results:
A total of 14 reviews evaluating 1115 primary DTA studies were included. The number of primary studies evaluated in each review ranged from 6 to 158. Included reviews were from Canada (n = 6), Germany (n = 3), Australia (n = 1), USA (n = 2), Spain (n = 1), and South Korea (n = 1). The fields of research of included reviews were diagnostic imaging (n = 7), lab/biomarker (n = 4), and other (n = 3). The overall mean STARD 2015 adherence was 53.2% (95% CI: 45.9-60.5). Mean adherence was higher among diagnostic imaging studies (61.7% [95% CI: 56.8-66.6]) compared to lab/biomarker studies (48.5% [95% CI: 33.5-63.4]; P = .02) and other fields of research (39.4% [95% CI: 23.9-54.9]; P < .001). Time trends analysis found that STARD adherence did not change over time (P = .28).
Conclusion:
Adherence of primary DTA studies to STARD 2015 was evaluated to be incomplete.
Introduction
Diagnostic test accuracy (DTA) studies compare an index test to a reference standard to evaluate its ability to detect a target condition. The original Standards for Reporting of Diagnostic Accuracy Studies (STARD) statement was published in 2003 1,2 in response to growing evidence of incomplete reporting and poor methodology in DTA studies. 3,4 It contained a checklist of 25 items considered critical to ensure complete reporting of DTA studies.
A 2013 review by Korevaar et al evaluated reviews that investigated adherence of primary DTA studies to the original STARD checklist. 5 They included 16 reviews that analyzed 1496 studies and found suboptimal adherence (median 12.8/25 [51%] checklist items reported; range: 9.1/25 [36%] to 14.3/25 [57%]), with a small improvement (1.41 items [95% CI: 0.65-2.18]) after the original STARD checklist was published. The overall conclusion was that reporting quality was consistently moderate and that adherence to STARD should be further promoted.
STARD was updated in 2015 to include 30 items, to further address issues with incomplete and non-transparent reporting, and to facilitate implementation of the checklist. 6,7 Since then, the updated guideline has been cited over 3000 times, and numerous knowledge translation strategies have been implemented in an effort to improve reporting adherence. 6 A 2024 publication by Heus et al found that the number of journals explicitly mentioning the STARD guidelines in the instructions for authors increased from 82 (24% of n = 341) in 2017 to 111 (33% of n = 341) in 2022. 8 However, while several STARD adherence studies have since been published, a meta-review to evaluate the adherence of DTA studies to STARD 2015 has not been conducted. An updated assessment of the completeness of reporting of DTA studies is needed to evaluate whether the implemented knowledge translation strategies are effective and whether reporting in DTA studies has improved since 2013.
This meta-review evaluates all reviews that investigated adherence of primary DTA studies to STARD 2015. The primary objective is to evaluate adherence of primary DTA studies to the STARD 2015 checklist. The secondary objectives are to evaluate whether the completeness of reporting has improved after the introduction of STARD 2015 and to assess differences in adherence across fields of research.
Methods
The study protocol was developed a priori and has been posted on Open Science Framework: osf.io/v4cu9/files/pzkem. 9
Search and Selection
We searched MEDLINE and EMBASE for all articles published until and inclusive of April 29, 2024 without any restrictions for language or study type. We included reviews that primarily aimed to examine the completeness of reporting of primary DTA studies in humans in any field of research, by evaluating their adherence to the STARD 2015 checklist. The literature search strategy can be found in Appendix A.
We excluded: (1) systematic reviews that primarily aimed to evaluate the diagnostic accuracy of a single test or set of tests for a specific target condition and used a STARD checklist to assess completeness of reporting in included articles; (2) reviews that evaluated adherence to the original STARD checklist, STARD for Abstracts, or STARD for Registration rather than adherence to STARD 2015; (3) reviews that did not present their results quantitatively as a mean of overall STARD adherence or item-specific STARD adherence.
Search results were screened independently and in duplicate by study authors (SK, MKA, and HD), based on titles and abstracts, to identify potentially eligible reviews. If at least one author identified an abstract as potentially eligible, the full text of the review was assessed by both authors. Disagreements were resolved through discussion, whenever possible. If agreement could not be reached, the case was discussed with a third author (DK/MM). One author (SK) screened reference lists of included reviews for additional relevant papers.
Data Collection
An extraction form was created before the literature search was performed and was piloted on 3 known eligible reviews. Study authors (SK, HD, HO, DA, and NI) independently extracted relevant data from the included reviews in duplicate. Disagreements were resolved through discussion. If necessary, a third author (DK/MM) made the final decision.
For each included review, we extracted the first author, country of corresponding author, year of publication, journal, field of research, inclusion and exclusion criteria, number of primary DTA studies included, publication date range of included primary DTA studies, number of STARD 2015 items evaluated, and overall and item-specific STARD 2015 adherence. In addition, we retrieved adherence comparisons between primary DTA studies published after versus before STARD 2015 was published, if available.
Data Analysis
All analyses were performed using R (version 4.3.1; R Foundation for Statistical Computing) with the ‘meta’ and ‘metafor’ packages.10-12
Overall Adherence to STARD 2015 Checklist
For each included review, we extracted (or calculated, if per-item STARD 2015 adherence was provided) the overall STARD 2015 adherence, defined as the mean number (and standard deviation) of items reported by primary DTA studies included in that review. Reviews that did not report a standard deviation were excluded from the meta-analysis.
We scaled all counts of STARD 2015 adherence to reflect a percentage in order to address varying total number of STARD 2015 items evaluated across different reviews. The percentage was calculated using the total number of evaluated items as the denominator (eg, if a review determined a specific item was not applicable and thus not evaluated, or if a review counted sub-items separately, this was reflected in the denominator). The standard error for each review was first computed as the standard deviation divided by the square root of the number of primary studies, and then similarly scaled to a percentage.
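The scaling described above can be sketched in a few lines of Python. This is an illustration only, not the authors' R code; the function name and the example numbers are hypothetical.

```python
import math

def scale_adherence(mean_items, sd_items, n_studies, n_items_evaluated):
    """Convert a review's mean item count to a percentage with its standard error.

    mean_items / sd_items: mean and SD of the number of items reported
    n_studies: number of primary DTA studies in the review
    n_items_evaluated: denominator actually used by the review (eg, 29, 30, or 34)
    """
    se_items = sd_items / math.sqrt(n_studies)   # standard error of the mean, in items
    factor = 100.0 / n_items_evaluated           # items -> percentage points
    return mean_items * factor, se_items * factor

# Hypothetical review: 36 primary studies, 30 items evaluated,
# mean of 15 items reported with an SD of 3 items.
pct, se_pct = scale_adherence(15.0, 3.0, 36, 30)
print(pct, se_pct)
```

Because the mean and the standard error are multiplied by the same factor, reviews that evaluated different numbers of items end up on a common 0 to 100 scale.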
To obtain a summary estimate and the corresponding 95% CI of the mean adherence to STARD 2015, we conducted a random-effects meta-analysis using restricted maximum likelihood (REML) estimation on the scaled mean adherence percentages. 13 We explored statistical heterogeneity using the I² statistic and τ². 14,15
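As a rough illustration of what a random-effects pooling with REML involves, the sketch below estimates τ² with the standard REML fixed-point iteration and then computes the inverse-variance weighted mean. It is a minimal Python sketch under stated assumptions, not the R 'meta'/'metafor' code used for the analysis: the function name and input values are hypothetical, the CI is a normal-approximation (Wald) interval, and I² uses the Higgins-Thompson "typical" within-review variance.

```python
import math

def reml_random_effects(y, v, max_iter=1000, tol=1e-10):
    """Pool estimates y (sampling variances v, all > 0) under a random-effects model."""
    k = len(y)
    tau2 = 0.0
    for _ in range(max_iter):
        w = [1.0 / (vi + tau2) for vi in v]
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        # REML fixed-point update for the between-review variance, truncated at 0
        num = sum(wi ** 2 * ((yi - mu) ** 2 - vi) for wi, yi, vi in zip(w, y, v))
        new_tau2 = max(0.0, num / sum(wi ** 2 for wi in w) + 1.0 / sum(w))
        converged = abs(new_tau2 - tau2) < tol
        tau2 = new_tau2
        if converged:
            break
    # Final weights, pooled mean, and its standard error
    w = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    se = math.sqrt(1.0 / sum(w))
    # I^2 as tau^2 relative to tau^2 plus a "typical" sampling variance
    w0 = [1.0 / vi for vi in v]
    typical_v = (k - 1) * sum(w0) / (sum(w0) ** 2 - sum(wi ** 2 for wi in w0))
    i2 = 100.0 * tau2 / (tau2 + typical_v)
    return {"mean": mu, "ci": (mu - 1.96 * se, mu + 1.96 * se), "tau2": tau2, "i2": i2}

# Hypothetical scaled adherence percentages and squared standard errors
result = reml_random_effects([62.0, 48.0, 55.0, 39.0], [2.5, 4.0, 1.8, 6.3])
print(result)
```

The inputs would be the per-review percentages and squared standard errors obtained from the scaling step; results from dedicated meta-analysis software may differ slightly depending on the CI method and convergence settings.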
Adherence to STARD 2015 Before and After Launch
To obtain a summary estimate and the corresponding 95% CI of the difference in adherence before and after the launch of STARD 2015, we used a random-effects meta-analysis approach. 13 Only reviews specifically reporting pre-STARD 2015 and post-STARD 2015 results were eligible for this analysis.
Adherence to STARD 2015 Across Fields of Research
A subgroup analysis was performed to evaluate overall adherence to STARD 2015 of reviews within each specialty (eg, radiology, laboratory medicine). A random-effects model using REML estimation was fitted independently for each subgroup to obtain summary estimates and 95% CIs. To formally test for differences between subgroups, meta-regression models were fitted with field of research as a categorical moderator, and pairwise comparisons were conducted between all 3 subgroup combinations (diagnostic imaging vs lab/biomarker, diagnostic imaging vs other, and lab/biomarker vs other). The P-values reported for subgroup differences were derived from the test of moderators (QM statistic) from each respective pairwise meta-regression model.
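One way to picture a pairwise moderator test is sketched below in Python. It is a simplified stand-in for the meta-regression described above, not the authors' R code: it fits one mean per subgroup with a shared between-review variance τ², estimates τ² by numerically maximizing the restricted log-likelihood with a golden-section search, and computes a one-degree-of-freedom QM statistic for the difference between the two subgroup means. All function names and data values are hypothetical.

```python
import math

def restricted_loglik(tau2, groups):
    """Restricted log-likelihood (up to a constant): one mean per group, shared tau2.

    groups is a list of lists of (estimate, sampling_variance) pairs.
    """
    ll = 0.0
    for g in groups:
        w = [1.0 / (v + tau2) for _, v in g]
        sw = sum(w)
        mu = sum(wi * y for wi, (y, _) in zip(w, g)) / sw
        ll -= 0.5 * sum(math.log(v + tau2) for _, v in g)
        ll -= 0.5 * math.log(sw)
        ll -= 0.5 * sum(wi * (y - mu) ** 2 for wi, (y, _) in zip(w, g))
    return ll

def golden_section_max(f, lo, hi, iters=200):
    """Locate the maximizer of a unimodal function f on [lo, hi]."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0

def pairwise_moderator_test(group_a, group_b):
    """Return (tau2, QM, p) for the difference between two subgroup means."""
    groups = [group_a, group_b]
    ys = [y for g in groups for y, _ in g]
    upper = 10.0 * (max(ys) - min(ys)) ** 2 + 1e-8   # generous search bound (assumption)
    tau2 = golden_section_max(lambda t: restricted_loglik(t, groups), 0.0, upper)
    means, variances = [], []
    for g in groups:
        w = [1.0 / (v + tau2) for _, v in g]
        means.append(sum(wi * y for wi, (y, _) in zip(w, g)) / sum(w))
        variances.append(1.0 / sum(w))
    qm = (means[0] - means[1]) ** 2 / (variances[0] + variances[1])
    p = math.erfc(math.sqrt(qm / 2.0))   # upper tail of chi-square with 1 df
    return tau2, qm, p

# Hypothetical (percentage, squared SE) pairs for two fields of research
imaging = [(62.0, 2.5), (66.0, 3.0), (58.0, 1.8)]
other = [(41.0, 6.3), (35.0, 5.0), (45.0, 4.2)]
print(pairwise_moderator_test(imaging, other))
```

Sharing a single τ² across both subgroups is what distinguishes this from comparing two separately pooled subgroup estimates, which is why P-values from a moderator test need not match what the subgroup CIs alone would suggest.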
Adherence to STARD 2015 Over Time
A meta-regression was also performed to explore STARD 2015 adherence (%) over time using (A) the mid-point and (B) the last year of the publications included in each review. For the mid-point year analysis, the mid-point was calculated as the mean of the reported start and end years of primary DTA studies included in each review; reviews for which either year was unavailable were excluded from this analysis. The last year analysis served as a sensitivity analysis using the end year of included publications in place of the mid-point year. For both analyses, predicted adherence values with 95% CIs were plotted across the observed range of years to visualize the time trend.
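The construction of the two time moderators can be illustrated with a short Python sketch. The record layout and values below are hypothetical and are not the extracted study data; only the mid-point rule and the exclusion of reviews with a missing year follow the description above.

```python
from typing import Optional

def midpoint_year(start_year: Optional[int], end_year: Optional[int]) -> Optional[float]:
    """Mean of the start and end publication years; None if either is unavailable."""
    if start_year is None or end_year is None:
        return None
    return (start_year + end_year) / 2

# Hypothetical extracted records: publication date range and scaled adherence (%)
reviews = [
    {"id": "review_a", "start": 2016, "end": 2020, "adherence_pct": 55.0},
    {"id": "review_b", "start": None, "end": 2019, "adherence_pct": 48.0},
    {"id": "review_c", "start": 2015, "end": 2022, "adherence_pct": 61.0},
]

# (A) Mid-point analysis: reviews missing either year are excluded
midpoint_data = [
    (midpoint_year(r["start"], r["end"]), r["adherence_pct"])
    for r in reviews
    if midpoint_year(r["start"], r["end"]) is not None
]

# (B) Sensitivity analysis: last year of included publications as the moderator
last_year_data = [(r["end"], r["adherence_pct"]) for r in reviews if r["end"] is not None]

print(midpoint_data)
print(last_year_data)
```

Each list of (year, adherence) pairs would then serve as the moderator and outcome in the corresponding meta-regression.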
Results
Review Characteristics
A total of 476 reviews were initially identified through the search. After screening for eligibility, the final analysis included 14 reviews evaluating adherence to STARD 2015 of 1115 primary DTA studies. The selection process and further details can be found in Figure 1.

Figure 1. Review selection flow chart.
The number of primary DTA studies evaluated in each review ranged from 6 to 158. Included reviews were from Canada (n = 6), Germany (n = 3), Australia (n = 1), USA (n = 2), Spain (n = 1), and South Korea (n = 1). The fields of research of included reviews were diagnostic imaging (n = 7), lab/biomarker (n = 4), and other (psychiatry n = 1, orthopedics n = 1, medical informatics n = 1). Table 1 provides the descriptive characteristics of included reviews.
Table 1. Characteristics of Included Reviews.
Adherence to STARD 2015 Checklist
The summary estimate of mean adherence to the STARD 2015 checklist across the 14 included reviews was 53.2% (95% CI: 45.9-60.5, range: 29.4-71.2), which scales to approximately 16.0 items (95% CI: 13.8-18.1) out of 30 STARD 2015 items. Table 1 provides a detailed breakdown of the number of STARD 2015 items evaluated in each review. There was notable heterogeneity across reviews (I² = 99.4%; τ² = 189.57), indicating that the large majority of the variance in adherence estimates reflected true between-review differences rather than sampling error. This heterogeneity likely stems from differences in clinical domains, publication years, and variation in how review authors scored adherence. Figure 2 summarizes adherence across reviews.

Figure 2. Forest plot showing meta-analysis of mean adherence to STARD 2015 checklist and 95% confidence intervals (CIs) across all included reviews and by field of research (diagnostic imaging, lab/biomarker, and all other fields).
Subgroup analysis by field of research found that mean adherence to STARD 2015 was higher among diagnostic imaging studies (61.7% [95% CI: 56.8-66.6]) when compared to lab/biomarker studies (48.5% [95% CI: 33.5-63.4]; P = .02), and when compared to studies in other fields of research (39.4% [95% CI: 23.9-54.9]; P < .001). There was no difference in mean adherence when comparing lab/biomarker studies to those in other fields of research (P = .42). P-values were obtained from the QM statistic of each pairwise meta-regression model.
Figure 2 depicts the adherence to STARD 2015 in each included review, the mean adherence across all included reviews, and the mean adherence grouped by field of research.
Time trends analysis found that STARD adherence did not change over time when considering the mid-point year of publications included in reviews (P = .28) and when considering the last year of publications included in reviews (P = .28). Figure 3 presents STARD 2015 mean adherence across the 14 included reviews over time.

Figure 3. STARD 2015 adherence (%) across 14 included reviews over time using (A) the mid-point and (B) the last year of the publications included in each review. The solid line shows the fitted random-effects meta-regression; the shaded region indicates 95% confidence intervals. Each bubble represents one review, where bubble size and labels indicate the number of primary studies (n) included in each review.
The change in overall mean adherence before versus after the publication of STARD 2015 could not be investigated because the included reviews did not provide the data required to evaluate this.
Discussion
Our study found an overall mean adherence of 53.2% to the STARD 2015 checklist among 14 reviews evaluating 1115 primary DTA studies, indicating incomplete reporting in primary DTA studies. This is similar to the findings of the 2013 study published by Korevaar et al which identified a median adherence of 51% to the original STARD 2003 checklist. 5
A 2016 systematic review by Hong et al evaluating STARD 2015 checklist adherence for diagnostic accuracy imaging studies found an overall adherence of 55%. 16 A 2025 systematic review by Kashif Al-Ghita et al assessing STARD 2015 checklist adherence among primary DTA studies in imaging journals found that overall adherence was 61%. 17 The findings in Kashif Al-Ghita et al are comparable to our study, as our subgroup analysis by field of research found that mean adherence to STARD 2015 was 62% among diagnostic imaging reviews. We also found that the mean adherence to STARD 2015 was higher among diagnostic imaging reviews compared to lab/biomarker reviews and reviews in other fields of research. There may be fewer journals in fields other than imaging that have adopted STARD 2015, and there may be less awareness of the STARD 2015 checklist among authors submitting to such journals.
Various knowledge translation strategies have been employed to improve completeness of reporting and reproducibility of research studies. Examples of such interventions include the creation of reporting guidelines and training on their use, enhancing adherence to reporting guidelines at the peer review level, and guidelines implemented by journals. 18,19 In a 2025 scoping review investigating open science interventions aimed at improving reproducibility of research, Dudda et al 18 found the most commonly studied intervention was implementation of policy guidelines by publishers and journals such as data sharing, trial registration, and reporting guidelines (49 of 105 included studies) as well as the use of reporting guidelines in general (15 of 105 included studies). Other studied interventions were those on open methodology, data and materials, rewards and incentives, open science tools, and training and education. 18 Since the 2013 study by Korevaar et al evaluating adherence to STARD 2003, numerous journals such as Radiology 20 and Journal of Magnetic Resonance Imaging 21 have adopted the STARD 2015 checklist and advise submitting authors of primary DTA studies to use the checklist. Despite these steps taken by journals, the lack of improvement in completeness of reporting persists. A 2016 study by Agha et al found that after implementing mandatory compliance with reporting guidelines in the International Journal of Surgery, compliance with the STROBE guideline increased by 12%. 22 While requiring authors to submit reporting checklists may improve transparency, without a required minimum number of reported checklist items this will not necessarily improve the completeness of reporting. This indicates that different strategies to improve adherence are necessary.
There is limited research at this time investigating the efficacy of the aforementioned interventions, and further investigation is needed. Future considerations for knowledge translation strategies include journals further enforcing adherence to reporting guidelines by having reviewers, editors, or artificial intelligence software 23 formally assess adherence to appropriate reporting guidelines as part of the peer review process.
Our study had several strengths. A broad search was used to maximize the breadth of diagnostic accuracy reviews included. Duplicate screening and data extraction, and a pilot data extraction, were employed to ensure the inclusion of appropriate reviews and accurate data extraction, respectively.
This study also had several limitations. Since we included reviews of studies, we did not directly evaluate the adherence of each primary DTA study to the STARD 2015 checklist, but relied on the adherence evaluations provided by authors of the included reviews. Although the STARD 2015 Explanations and Elaborations paper helps limit subjectivity when evaluating adherence to the checklist, 6 it is still possible that the authors of our included reviews varied in their methods of evaluating study adherence. Furthermore, each review had a different interpretation of the total number of STARD 2015 items; some reviews considered there to be a total of 34 items including sub-items, some reviews considered there to be a total of 30 items, and some reviews excluded certain STARD 2015 items if the authors determined they were not applicable to a given study. For example, the included review by Stahl et al published in 2023 considered STARD 2015 as a 30-item checklist and evaluated a total of 29 items after excluding item 11 (rationale for choosing the reference standard), providing the explanation that it was not possible to determine whether authors forgot to include this information in the manuscript or whether it was omitted because there were no alternative reference standards. On the other hand, the included review by Phua et al published in 2023 considered STARD 2015 as a 34-item checklist, as they assessed sub-items such as 12a and 12b as distinct items. To address the varying total number of STARD 2015 items evaluated across different reviews, we scaled all counts of STARD 2015 adherence to reflect a percentage. However, these limitations likely contribute to the notable heterogeneity in adherence across reviews, which limits our findings. In addition, the majority of included reviews were in the field of diagnostic imaging, limiting generalizability of our findings to DTA reviews in other fields.
An additional limitation was the limited granularity of the available data. Not all reviews reported item-specific adherence, so we were unable to provide insight into the rate of adherence on a per-item basis. Finally, because the included reviews had overlapping date ranges and inclusion criteria, some primary DTA studies may have been included in more than one review; this potential double-counting of primary studies may have biased our findings.
Conclusion
Adherence of primary DTA studies to STARD 2015 was evaluated to be incomplete. Improved completeness of reporting improves the overall quality and reproducibility of diagnostic accuracy research. In turn, this drives clinical decisions regarding optimal diagnostic modality and leads to accurate disease detection. Further research is needed to identify more effective knowledge translation strategies. New knowledge translation strategies, such as journals implementing adherence checklists during the peer-review process, can be considered to improve completeness of reporting of DTA studies.
Footnotes
Appendix A
Literature search was performed on April 30, 2024.
Embase Classic+Embase <1947 to 2024 April 29>
Ovid MEDLINE(R) ALL <1946 to April 29, 2024>
Medline
Embase
Acknowledgements
The authors thank Risa Shorr, Librarian at The Ottawa Hospital, for assistance in designing the literature search strategy, and Josée Skuce, Library technician at The Ottawa Hospital, for providing access to literature search results.
Abbreviations
DTA: Diagnostic test accuracy
STARD: STAndards for Reporting of Diagnostic Accuracy Studies
Ethical Considerations
As this project used only secondary published data, Research Ethics Board (REB) ethical approval was not required.
Author Contributions
Sakib Kazi: Methodology, Investigation, Data Curation, Writing—Original Draft, Writing—Review & Editing. Mohammed Kashif Al-Ghita: Investigation, Writing—Review & Editing. Haben Dawit: Investigation, Writing—Review & Editing. Hoda Osman: Investigation, Writing—Review & Editing. Danyaal Ansari: Investigation, Writing—Review & Editing. Nabil Islam: Investigation, Writing—Review & Editing. Eric Lam: Methodology, Formal analysis, Writing—Review & Editing. Daniël A. Korevaar: Methodology, Validation, Writing—Review & Editing. Patrick M. Bossuyt: Methodology, Validation, Writing—Review & Editing. Jérémie F. Cohen: Methodology, Validation, Writing—Review & Editing. Matthew D. F. McInnes: Conceptualization, Methodology, Validation, Writing—Review & Editing, Supervision, Project administration, Funding acquisition.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Matthew D. F. McInnes is supported by the University of Ottawa Department of Radiology Research Stipend Program.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The project protocol has been posted on Open Science Framework doi.org/10.17605/OSF.IO/V4CU9. All data used in the analysis is based on previously published scientific articles. The compiled data is available upon request.
