Abstract

The number of artificial intelligence (AI) diagnostic accuracy studies in radiology is rapidly increasing. However, the quality and completeness of reporting in these studies have not kept pace. High diagnostic performance metrics are often presented without sufficient detail regarding data sources, validation strategies, or how models perform across different clinical settings and patient populations. As a result, reported diagnostic performance may appear higher than it would be in routine clinical practice. This limits the reader’s ability to assess reliability and clinical relevance.
This is not a new problem. Reporting guidelines such as the Standards for Reporting of Diagnostic Accuracy Studies (STARD) 2015 were developed to improve transparency and reproducibility in diagnostic accuracy research. 1 However, STARD was designed as a general framework for diagnostic accuracy studies and does not fully capture several AI-specific considerations, including dataset curation, model development, and validation strategies. Adherence to STARD has historically been inconsistent, with studies demonstrating that a substantial proportion of recommended items are not reported. 2 Although modest improvements have been observed over time, reporting gaps remain common, limiting the ability to critically appraise study design and methodological rigor.
These limitations are particularly important in AI diagnostic accuracy studies. Machine learning-based studies introduce additional methodological complexity. In particular, inadequate description of study populations or absence of external validation can lead to overestimation of model performance and leave generalizability to new clinical settings untested. Prior work has emphasized that transparent reporting of these elements is essential for accurate interpretation of model performance. 3 Without clear reporting, even well-designed studies may be difficult to interpret in practice.
The recently developed STARD-AI guideline represents an important step toward addressing these issues. 4 As an extension of STARD, it provides tailored recommendations for reporting AI-based diagnostic accuracy studies, with emphasis on transparent dataset description, rigorous validation, and explicit consideration of bias, generalizability, and equity. By addressing these AI-specific reporting gaps, STARD-AI supports more reliable interpretation of model performance, enabling safer clinical implementation.
Improving reporting is not solely the responsibility of authors. Journals play a critical role in shaping expectations for transparency and reproducibility. Prior work has demonstrated that while many imaging journals endorse open science practices, adherence at the study level remains limited, highlighting a persistent gap between policy and practice. 5 Incorporation of reporting guidelines into submission and peer review processes has improved reporting quality in other domains, 6 and similar expectations are needed to ensure that published AI research is rigorously evaluated and appropriately integrated into clinical workflows.
Ultimately, the value of AI in radiology depends not only on model performance but also on the quality of the evidence supporting it. Without consistent adherence to reporting standards such as STARD-AI, the clinical promise of AI will often be undermined by evidence that is difficult to interpret, compare, and apply in practice.
