The STARD-AI Reporting Guideline for Diagnostic Accuracy Studies Using Artificial Intelligence

Abstract

The number of artificial intelligence (AI) diagnostic accuracy studies in radiology is rapidly increasing. However, the quality and completeness of reporting in these studies have not kept pace. High diagnostic performance metrics are often presented without sufficient detail regarding data sources, validation strategies, or how models perform across different clinical settings and patient populations. As a result, reported diagnostic performance may appear higher than it would be in routine clinical practice, and incomplete reporting limits the reader’s ability to assess reliability and clinical relevance.

This is not a new problem. Reporting guidelines such as the Standards for Reporting of Diagnostic Accuracy Studies (STARD) 2015 were developed to improve transparency and reproducibility in diagnostic accuracy research.¹ However, STARD was designed as a general framework for diagnostic accuracy studies and does not fully capture several AI-specific considerations, including dataset curation, model development, and validation strategies. Adherence to STARD has historically been inconsistent, with studies demonstrating that a substantial proportion of recommended items are not reported.² Although modest improvements have been observed over time, reporting gaps remain common, limiting the ability to critically appraise study design and methodological rigor.

These limitations are particularly important in AI diagnostic accuracy studies, because machine learning-based studies introduce additional methodological complexity. In particular, inadequate description of study populations or absence of external validation can lead to overestimation of model performance and limit generalizability to new clinical settings. Prior work has emphasized that transparent reporting of these elements is essential for accurate interpretation of model performance.³ Without clear reporting, even well-designed studies may be difficult to interpret in practice.

The recently developed STARD-AI guideline represents an important step toward addressing these issues.⁴ As an extension of STARD, it provides tailored recommendations for reporting AI-based diagnostic accuracy studies, with emphasis on transparent dataset description, rigorous validation, and explicit consideration of bias, generalizability, and equity. By addressing these AI-specific reporting gaps, STARD-AI supports more reliable interpretation of model performance and, in turn, safer clinical implementation.

Improving reporting is not solely the responsibility of authors. Journals play a critical role in shaping expectations for transparency and reproducibility. Prior work has demonstrated that while many imaging journals endorse open science practices, adherence at the study level remains limited, highlighting a persistent gap between policy and practice.⁵ Incorporation of reporting guidelines into submission and peer review processes has improved reporting quality in other domains,⁶ and similar expectations are needed to ensure that published AI research is rigorously evaluated and appropriately integrated into clinical workflows.

Ultimately, the value of AI in radiology depends not only on model performance, but also on the quality of the evidence supporting it. Without consistent adherence to reporting standards such as STARD-AI, the clinical promise of AI will often be undermined by evidence that is difficult to interpret, compare, and apply in practice.

Footnotes

ORCID iDs

Eric William John Hutfluss

Kelly Harper

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Dr. McInnes is supported by a CIHR operating grant.

References

1. Cohen, Korevaar, Altman, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6(11):e012799. doi:10.1136/bmjopen-2016-012799

2. Kashif Al-Ghita, Dawit, Kazi, et al. Evaluation of imaging research adherence to the STARD 2015 reporting guideline: update 9 years after implementation and baseline assessment. Can Assoc Radiol J. 2025;76(4):631-645. doi:10.1177/08465371251324090

3. Gong, Soyer, McInnes MDF, Patlas MN. Elements of a good radiology artificial intelligence paper. Can Assoc Radiol J. 2023;74(2):231-233. doi:10.1177/08465371221101195

4. Sounderajah, Guni, Liu, et al. The STARD-AI reporting guideline for diagnostic accuracy studies using artificial intelligence. Nat Med. 2025;31(10):3283-3289. doi:10.1038/s41591-025-03953-8

5. Kashif Al-Ghita, Cobey, Moher, et al. Cross-sectional evaluation of open science practices at imaging journals: a meta-research study. Can Assoc Radiol J. 2024;75(2):330-343. doi:10.1177/08465371231211290

6. Salameh, Moher, McGrath, et al. Assessing adherence to the PRISMA-DTA guideline in diagnostic test accuracy systematic reviews: a five-year follow-up analysis. J Appl Lab Med. 2025;10(2):416-431. doi:10.1093/jalm/jfae117