Abstract
Background
Valvular heart disease (VHD) is a growing global health problem. Artificial intelligence (AI) models show promise for improving its diagnosis and management, but their black-box nature limits transparency, making doctors hesitant to trust them. Explainable AI (XAI) aims to address this, but its application in VHD has not been systematically mapped.
Methods
We conducted a systematic review, searching for studies that applied XAI techniques to any type of VHD. From 374 records, 52 studies were included. Data were extracted on VHD types, AI/XAI methods, and the evaluation of explanations.
Results
Most research has focused on aortic stenosis and mitral regurgitation, using either structured patient data or imaging like echocardiograms. Shapley Additive Explanations was the dominant XAI method (66% of the studies), primarily for feature importance ranking. Although model performance was often strong, rigorous evaluation of explanations was rare. Only a few studies involved clinicians in assessing usefulness or used quantitative metrics to test reliability.
Conclusion
XAI is an active area of research in VHD, mainly for feature attribution. However, the field is still developing. To make XAI truly useful, future work must move beyond generating explanations to validating them with clinicians and ensuring they are stable and trustworthy across different patient populations.
Keywords
Introduction
Valvular heart disease (VHD) refers to a heterogeneous group of disorders affecting the cardiac valves, resulting in impaired unidirectional blood flow through the heart. The spectrum of VHD includes stenotic and regurgitant lesions of the cardiac valves, which may coexist as multivalvular involvement. 1 VHD may be congenital or acquired. Congenital, structurally malformed valves can predispose individuals to early dysfunction, whereas acquired causes include age-related degenerative changes, infections, inflammatory conditions, and traumatic injury. Degenerative calcification is a leading cause in high-income countries, while infections such as rheumatic fever and infective endocarditis continue to contribute substantially to the global burden. 2
According to the Global Burden of Disease (GBD) study, non-rheumatic VHD accounted for 29.5 million cases, 191,000 deaths, and 3.43 million disability-adjusted life years (DALYs) worldwide in 2023. While age-standardized mortality has remained stable, prevalence continues to rise. Rheumatic heart disease (RHD) remains a major contributor in low- and middle-income countries. 3 In India, the burden of VHD is driven by RHD, which affects vulnerable populations despite the overall epidemiological transition. It resulted in 3.7 million DALYs and over 100,000 deaths in 2017, with absolute numbers increasing despite declining age-standardized rates. 4
Despite advances in imaging and clinical management, the evaluation and management of VHD require integration of multimodal data, longitudinal follow-up, and expert interpretation. Variability in imaging, delayed diagnosis, and challenges in risk stratification and timing of intervention continue to impact outcomes. These challenges, along with the growing burden, have driven interest in data-driven approaches.
Artificial intelligence (AI) is increasingly integrated into cardiovascular care, supporting diagnosis, risk stratification, and patient monitoring. In VHD, AI has shown promise in imaging modalities such as echocardiography, where deep learning (DL) models enable automated valve segmentation, functional assessment, and severity classification. Applications also extend to cardiac computed tomography (CT) for evaluating valvular anatomy and calcification. Machine learning (ML) models are being used for predicting outcomes and optimizing intervention timing. 5, 6
Despite the growth of AI in cardiovascular care, many ML models function as “black boxes,” providing predictions without transparent reasoning. This lack of interpretability is a barrier to clinician trust, regulatory acceptance, and accountability, raising concerns about bias and safe clinical deployment. 7 Explainable artificial intelligence (XAI) is a response to these limitations. Rather than replacing complex models, XAI aims to make their reasoning more transparent and clinically interpretable. It enables clinicians to understand why a model arrived at a particular prediction and how individual features influenced that output. XAI includes models that are inherently interpretable as well as post hoc techniques that provide explanations for more complex algorithms. 8 Table 1 tabulates the major categories of XAI methods and their potential applications.
Overview of Explainable Artificial Intelligence Methodologies in Valvular Heart Disease.
Valvular heart disease often presents with nonspecific symptoms such as breathlessness and fatigue, which are frequently ignored until the condition becomes severe, leading to delayed diagnosis and referral. XAI can provide patient-specific insights by supporting severity classification, identifying key features influencing predictions, and clarifying risk following interventions. This may improve early detection, risk stratification, and clinical decision-making.
As part of our broader work on XAI in cardiovascular care, this review focuses on its application in VHD. The objectives of this systematic review are:
To systematically identify and synthesize studies applying XAI in VHD.
To characterize the valvular conditions, data modalities, and clinical tasks in which XAI has been employed.
To examine the range of XAI techniques and their role in enhancing model interpretability and transparency.
To identify limitations, methodological gaps, and challenges in the clinical translation of XAI in VHD.
Methods
Study Design and Protocol
We conducted a systematic review to identify and synthesize evidence on the application of XAI in VHD. We followed Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines (Supplemental Material S1). 9 The study protocol was prospectively registered with PROSPERO (International Prospective Register of Systematic Reviews).
Literature Search Strategy
A literature search was performed in MEDLINE (via PubMed) and Scopus, with secondary searches in Google Scholar and Dimensions. Reference lists of included articles and relevant review papers were manually screened to identify further eligible studies. The search covered the period from database inception to February 2026.
The search strategy combined keywords and medical subject heading (MeSH) terms from three primary themes: VHD, AI, and XAI. Within each theme, keywords were combined using the Boolean operator “OR,” while the three themes were integrated using “AND.” Truncation and wildcard operators were applied according to the syntax requirements of each database. The full search strategies are provided in Supplemental Material S2. No publication year restrictions were applied. Records were imported into Rayyan for screening.
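The Boolean structure described above can be sketched in a few lines of Python. This is only an illustration of the OR-within-theme, AND-across-themes logic: the example terms and the `build_query` helper are hypothetical and are not the actual search strings, which are given in Supplemental Material S2.

```python
# Hypothetical example terms for each theme; the real strategies are in Supplemental Material S2.
themes = {
    "vhd": ["valvular heart disease", "aortic stenosis", "mitral regurgitation"],
    "ai": ["artificial intelligence", "machine learning", "deep learning"],
    "xai": ["explainable", "interpretab*", "SHAP"],
}


def build_query(themes):
    """Join terms with OR inside each theme, then join the themes with AND."""
    blocks = []
    for terms in themes.values():
        # Quote multi-word phrases so they are searched as exact phrases.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        blocks.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(blocks)


query = build_query(themes)
print(query)
```

Database-specific syntax (field tags, MeSH qualifiers, truncation symbols) would still need to be adapted by hand for each source.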
Selection Criteria
Study selection was structured using the SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research Type) framework 10 and is tabulated in Table 2.
SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research Type) Framework.
We included original, peer-reviewed studies in human populations or using VHD-related datasets (electronic health records [EHR], cardiac imaging, electrocardiograms, wearable data, or laboratory parameters) that evaluated AI/ML/DL models with an explainability component. Clinical applications included diagnosis, severity assessment, risk prediction/stratification, prognosis, procedural planning, and patient monitoring. For clinical specificity, this review focused on VHD alone. Only articles published in English were included.
We excluded non-human studies, non-original research (editorials, commentaries, and narrative reviews), conference abstracts without full methodology and outcomes, and purely technical AI papers without clinical application.
Screening
We identified and removed duplicates. JS and HK independently screened articles (titles and abstracts), followed by final full-text screening. Differences were resolved through discussion with BA and AM.
Data Extraction
Data were extracted using a predefined form. We collected study characteristics, such as sample size, external validation, handling of missing data, class imbalance, geographic region, clinical application, valvular pathology, and dataset source, along with data modalities and model details (type, algorithms, and development/validation approaches).
For XAI, we documented the methods used, their rationale, and the timing of explanations. Evaluation metrics (quantitative and qualitative assessment), comparisons between methods, and model performance (discrimination, calibration, and decision curve analysis [DCA]) were also collected. Extracted data were cross-checked for accuracy.
Quality Assessment
JS and HK independently assessed methodological quality and risk of bias (ROB) for all included studies. Disagreements were resolved through discussion with BA or AM. Due to the heterogeneity of study designs, quality assessment was tailored to the methodology of each study. We used PROBAST-AI (Prediction Model Risk of Bias Assessment Tool-Artificial Intelligence) for prediction models, 11 QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies-2) for diagnostic accuracy studies, 12 and QUIPS (Quality in Prognosis Studies) for prognostic outcome studies. 13
Due to the variability in aims, analytical approaches, datasets, and explainability methods, a uniform quantitative grading framework, such as GRADE (Grading of Recommendations Assessment, Development, and Evaluation), was not applied. Instead, the certainty of evidence was assessed qualitatively based on the methodology, consistency, and clinical relevance.
Results
A total of 374 records were identified from databases, preprint servers, and gray literature. A total of 165 duplicates were removed, and the remaining 209 records were screened based on the title and abstract. Of these, 114 were excluded. The full texts of 93 reports were then assessed for eligibility, and following the application of the exclusion criteria, 52 studies were included in the final analysis, of which three were preprints. Figure 1 shows the PRISMA 2020 flow diagram used for the selection of the articles.
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Flow Diagram of Searching and Screening. 9
The characteristics of the included studies are presented in Supplemental Material S3. 14-65
Quality Assessment
Risk of bias was assessed using the appropriate tool for each study’s design: PROBAST-AI (n = 42), QUADAS-2 (n = 9), and QUIPS (n = 1). Domain-level judgments are visually represented as traffic light plots using robvis 66 in Supplemental Material S4. Among the 42 studies evaluated with PROBAST-AI, the overall ROB was high in 12, unclear in 14, and low in 16. Common limitations included a lack of external validation, inadequate sample size relative to predictors or events, poor reporting of class imbalance handling, and potential overfitting due to extensive feature selection or hyperparameter tuning without robust validation. Of the nine studies assessed with QUADAS-2, most had unclear ROB (n = 6), while three were rated low. The single prognostic study 39 was assessed using the QUIPS tool and judged to be at low risk across all domains.
Clinical Scope
Across 52 studies, XAI in VHD showed substantial heterogeneity in clinical objectives, methods, and evaluation. The clinical spectrum included diagnostic classification (n = 20), postoperative risk prediction (n = 24), prognostic outcome modeling (n = 8), treatment response prediction (n = 1), screening applications (n = 2), and unsupervised phenogroup discovery (n = 1). Aortic stenosis (AS) and mitral regurgitation (MR) were the most frequently investigated valvular pathologies, as isolated conditions or within multi-class frameworks.
Data Modalities and Model Architectures
Data modality showed a predominance of structured tabular data from EHR and clinical registries (n = 24, 45.3%), followed by signal-based modalities such as phonocardiograms (PCGs) (n = 12, 22.6%), echocardiographic imaging (n = 8, 15.1%),18, 25, 31, 39, 50, 56, 64, 65 and multimodal approaches (n = 7, 13.2%).14, 26, 30, 45, 51, 54, 57 Multimodal approaches combined data sources such as electrocardiography (ECG) with PCG, echocardiography with clinical variables, waveform data with structured features, or multi-view echocardiography. The rise in imaging-based studies reflects the increasing use of DL for direct valve assessment using echocardiography and CT.
Model types were mainly ML algorithms (n = 28, 52.8%), followed by DL (n = 21, 39.6%), and hybrid approaches (n = 5, 9.4%).18, 20, 45, 52, 54 Among DL models, convolutional neural networks (CNNs) were common for image and signal analysis,17, 25, 40, 45, 50, 64, 65 while transformer-based architectures appeared in more recent studies.15, 52, 57 Ensemble methods (XGBoost and Random Forest) were widely used in ML models due to their ability to handle tabular clinical data and provide feature importance metrics.
Explainable Artificial Intelligence Method Implementation and Motivation
Shapley Additive Explanations (SHAP) was the most frequently used XAI technique (n = 35, 66.0%). Its applications included both tree-based models (TreeExplainer) and deep neural networks (DeepExplainer or KernelExplainer). Other methods were gradient-based or perturbation-based approaches such as gradient-weighted class activation mapping (Grad-CAM) and saliency maps (n = 12, 22.6%),14-17, 31, 39, 40, 45, 50, 54, 64, 65 local interpretable model-agnostic explanations (LIME) (n = 4, 7.5%),21, 43, 44, 47 attention mechanisms (n = 4, 7.5%),41, 45, 52, 57 prototype-based explanations (n = 1, 1.9%), 25 and unsupervised clustering with SHAP (n = 1, 1.9%). 63 SHAP was mainly applied to tree-based models, while Grad-CAM was used with CNNs for imaging and signal data. Recent studies have begun combining multiple XAI methods within a single framework.14, 15, 45, 54, 63
Explainable artificial intelligence was primarily employed to enhance clinical trust, support biomarker discovery, and improve model transparency. The explanations were mainly post hoc (n = 49, 92.5%), applied after model training to interpret black-box predictions. Intrinsic or ante hoc interpretability was observed in only four studies—prototype-based reasoning in ProtoASNet, 25 attention mechanisms,41, 45 and wavelet convolution layers for transparent feature extraction. 52
The scope of model explanations varied, with 36 studies providing both global feature importance and local individual predictions, 8 focusing only on global explanations, and 9 on local explanations only. Global interpretations were mainly used for feature ranking and biomarker discovery, whereas local explanations were used for individualized clinical decision support. However, local explanations were predominantly feature attribution-based (SHAP values), without providing decision rationales or counterfactual explanations, limiting their depth for clinical application.
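The relationship between local and global explanations can be illustrated with a small sketch. One common convention, assumed here, is to summarize local attributions into a global ranking by taking the mean absolute attribution per feature across patients. The attribution values below are invented for illustration and do not come from any included study.

```python
# Hypothetical local attributions (one dict per patient); values are illustrative only.
local_attributions = [
    {"ava": 0.30, "mean_gradient": 0.20, "lvef": -0.05},
    {"ava": -0.25, "mean_gradient": -0.10, "lvef": 0.02},
    {"ava": 0.35, "mean_gradient": 0.15, "lvef": -0.01},
]


def global_importance(rows):
    """Mean absolute attribution per feature across all patients."""
    features = rows[0].keys()
    return {f: sum(abs(r[f]) for r in rows) / len(rows) for f in features}


importance = global_importance(local_attributions)
# Sort features from most to least influential on average.
ranking = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

The sign of each local value is lost in the global summary, which is one reason a global ranking alone cannot support an individualized decision.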
Evaluation Rigor and Validation Practices
Quantitative XAI evaluation (faithfulness, stability, or completeness) was performed in only two studies. Rohr et al. quantified explanatory power using overlap ratios and intersection-over-union between predictions and expert segmentations. 45 Althaph and Challa employed a similar overlap analysis between Grad-CAM heatmaps and expert annotations. 15 No study evaluated explanation stability, which is a key gap for clinical deployment.
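As a rough illustration of this kind of overlap analysis, the sketch below thresholds a heatmap and compares it with an expert mask. The 0.5 threshold, the toy 4 × 4 arrays, and the definition of the overlap ratio (the fraction of the highlighted area that falls inside the expert region) are assumptions made for this example; the cited studies may have defined these quantities differently.

```python
def binarize(heatmap, threshold=0.5):
    """Turn a heatmap with values in [0, 1] into a binary mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]


def overlap_metrics(pred_mask, expert_mask):
    """Return (IoU, overlap ratio) for two binary masks of equal shape."""
    inter = union = pred_area = 0
    for p_row, e_row in zip(pred_mask, expert_mask):
        for p, e in zip(p_row, e_row):
            inter += p & e      # pixel highlighted by both
            union += p | e      # pixel highlighted by either
            pred_area += p      # pixel highlighted by the model
    iou = inter / union if union else 0.0
    overlap_ratio = inter / pred_area if pred_area else 0.0
    return iou, overlap_ratio


# Toy saliency heatmap and expert annotation (illustrative values only).
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.6],
    [0.0, 0.1, 0.2, 0.1],
]
expert = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

iou, overlap_ratio = overlap_metrics(binarize(heatmap), expert)
print(f"IoU = {iou:.2f}, overlap ratio = {overlap_ratio:.2f}")
```

Both numbers depend on the chosen threshold, so any comparison between methods or studies would need the threshold to be reported.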
Human evaluation using clinician assessment of explanation usefulness or clinical plausibility was reported in three studies. Alqudah and Alfraihat collaborated with cardiologists to evaluate model predictions and explanations. 14 Vafaeezadeh et al. assessed Grad-CAM visualizations against cardiologist findings, identifying ResNeXt50 as the best explainable model based on attention to relevant regions. 50 Huang et al. used expert review of attention mechanisms, concluding that “with the integration of attention mechanisms, the network demonstrated an increased capacity to concentrate on key areas relevant to different types of MR.” 64 No studies reported formal user studies with standardized protocols or inter-rater reliability assessment. Forty-one studies reported alignment between XAI-derived features and known pathophysiology. However, this was post hoc corroboration rather than prospective validation of explanation utility.
Multi-XAI approaches were used in several studies: Alqudah and Alfraihat compared SHAP, Grad-CAM, Integrated Gradients, and cross-modal attention 14 ; Althaph and Challa integrated Grad-CAM, attention maps, and SHAP 15 ; Xu et al. combined SHAP, occlusion-based importance, and nomogram visualization 54 ; Rohr et al. utilized saliency maps, feature importance, and attention mechanisms 45 ; Bernard et al. combined unsupervised clustering with SHAP for phenogroup characterization. 63
Decision curve analysis was reported in eight studies (15.1%).18, 33, 37, 51, 58, 60-62 Bibi et al. demonstrated the superior net benefit of their ensemble model compared to EuroSCORE I across clinically relevant thresholds. 18 Wang et al. reported that their support vector machine (SVM) model consistently yielded higher net benefits in both the development and validation sets compared to alternative models. 54 Itelman et al. and Russo et al. both incorporated DCA to demonstrate the clinical utility of their transcatheter aortic valve replacement (TAVR) outcome prediction models.61, 62
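For readers unfamiliar with DCA, net benefit at a threshold probability p_t is conventionally computed as TP/n − (FP/n) × p_t/(1 − p_t), and compared with a "treat all" strategy. The sketch below applies that standard formula to a small invented dataset; the labels and predicted probabilities are illustrative and do not reproduce any result from the included studies.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predictions at or above the threshold probability."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of the model."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Invented outcomes (1 = event) and predicted risks for ten patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.3, 0.15]

nb_model = net_benefit(y_true, y_prob, 0.2)
nb_all = net_benefit_treat_all(y_true, 0.2)
print(f"model: {nb_model:.3f}, treat all: {nb_all:.3f}, treat none: 0.000")
```

A full decision curve repeats this calculation over a range of clinically plausible thresholds and plots the model against the treat-all and treat-none strategies.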
Discussion
Summary of Results
This review included 52 studies on XAI in VHD, covering applications such as diagnosis, postoperative risk prediction, prognostic modeling, and screening. Most studies focused on AS and MR using structured clinical data, echocardiography, PCGs, and multimodal approaches. SHAP was the most common technique used, while attention mechanisms, Grad-CAM, and LIME were less frequently used. The primary goals of XAI were to enhance clinical trust, identify key features, and support individualized decision-making. However, rigorous evaluation was limited, with few studies performing quantitative assessments or clinician reviews. Most relied on post hoc feature attribution without prospective validation.
These findings align with prior reviews of XAI in cardiovascular imaging and healthcare. Haupt et al. reported greater use of saliency-based methods in cardiovascular imaging, with lower use of feature-attribution methods such as SHAP. 67 Hoghooghi Esfahani et al. identified SHAP as the most commonly used method, followed by LIME and Grad-CAM, with a modality-dependent pattern. 68 The limited evaluation of XAI outputs observed here is a challenge across the field. Prior reviews also highlight that most studies rely on qualitative or visually intuitive explanations without standardized or quantitative evaluation frameworks.67, 68
Interpretation of Explainable Artificial Intelligence Outputs for Clinicians
To facilitate clinician understanding, we present illustrative examples of XAI methods using a synthetic dataset and a representative echocardiographic image in Figure 2. These examples demonstrate how model predictions in VHD can be interpreted clinically.

Figure 2A (SHAP) provides a global view of feature importance. Key variables such as aortic valve area (AVA), mean gradient, and peak velocity show a strong influence on predictions. Lower AVA and higher gradients/velocities shift predictions toward higher risk, while parameters like preserved left ventricular ejection fraction (LVEF) contribute toward lower risk. Figure 2B (LIME) explains an individual prediction. Features such as elevated peak velocity and reduced AVA support a severe disease classification, whereas lower gradients or systolic blood pressure (SBP) may oppose it. Figure 2C shows the original echocardiographic image, while Figure 2D (Grad-CAM) highlights regions influencing the model’s decision. The heatmap localizes around the valvular region and the flow jet.
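The idea behind Shapley-based attribution can be made concrete with a small sketch. The code below computes exact Shapley values for a single patient by averaging marginal contributions over all feature orderings, relative to one reference patient. It is not the SHAP library and not the model behind Figure 2: the logistic coefficients, the reference values, and the patient values are all hypothetical and chosen only so that the directions match the description above.

```python
from itertools import permutations
from math import exp

# Hypothetical coefficients for a toy logistic risk score; not fitted to any data.
COEFS = {
    "ava_cm2": -2.0,
    "mean_gradient_mmHg": 0.05,
    "peak_velocity_m_s": 0.8,
    "lvef_pct": -0.03,
}
INTERCEPT = -1.0


def risk(x):
    """Toy predicted probability of severe disease."""
    z = INTERCEPT + sum(COEFS[k] * x[k] for k in COEFS)
    return 1 / (1 + exp(-z))


def shapley_values(model, x, baseline):
    """Exact Shapley values of x relative to a single baseline input."""
    names = list(x)
    phi = {k: 0.0 for k in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        prev = model(current)
        for name in order:
            # Switch one feature from its baseline value to the patient's value.
            current[name] = x[name]
            new = model(current)
            phi[name] += new - prev
            prev = new
    return {k: v / len(orderings) for k, v in phi.items()}


# Hypothetical reference patient and patient of interest.
baseline = {"ava_cm2": 1.8, "mean_gradient_mmHg": 15.0, "peak_velocity_m_s": 2.5, "lvef_pct": 60.0}
patient = {"ava_cm2": 0.8, "mean_gradient_mmHg": 45.0, "peak_velocity_m_s": 4.2, "lvef_pct": 55.0}

phi = shapley_values(risk, patient, baseline)
for name, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.3f}")
print(f"sum of attributions: {sum(phi.values()):.3f}")
print(f"risk difference:     {risk(patient) - risk(baseline):.3f}")
```

The attributions sum to the difference between the patient's predicted risk and the reference risk, which is the additivity property that makes Shapley values attractive for clinical reporting. Exact enumeration is only feasible for a handful of features; practical tools rely on approximations or model-specific algorithms.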
Beyond supporting clinicians, XAI may also improve patient communication. Transparent outputs can help clinicians explain risk estimates and management decisions more clearly, helping in shared decision-making. However, this application remains unexplored and warrants further study.
Limitations of Included Studies
Over-reliance on SHAP was evident: of the 35 studies that used it, 15 used SHAP as the only explanation method, without justification or discussion of its limitations regarding feature independence assumptions and computational stability. In several cases, XAI integration was superficial, with SHAP summary plots presented without meaningful clinical interpretation, limiting their translational value. Small sample sizes (<500 patients) were common, raising concerns about the stability of both predictive models and explanations. External validation was performed in only 18 studies, limiting generalizability claims. Among those with external validation, performance degradation was frequently observed but rarely analyzed. Missing data handling was inadequately reported in 15 studies (28.3%), despite its importance for both model performance and explanation validity.
Limitations of This Review
Our review has several limitations. Despite including preprints, some relevant studies may remain unpublished, under review, or not yet indexed. We excluded conference abstracts lacking full methodological details, which may have omitted recent findings. Additionally, the included studies were highly heterogeneous in design, data types, models, and XAI methods, limiting direct comparisons and preventing a quantitative synthesis. Finally, inconsistent reporting of key methodological elements may have affected the reliability of our conclusions.
Future Directions
First, studies on pediatric patients and RHD are limited,31, 45, 50, 54 despite their contribution to the global burden of VHD. Future research should prioritize these populations to ensure the applicability of XAI models.
Second, the literature is dominated by feature-attribution methods, while clinically meaningful approaches, such as counterfactual explanations, remain unexplored. These frameworks can demonstrate how minimal changes in patient parameters may alter predictions and have strong potential for improving decision-making and personalized care.
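A counterfactual explanation can be sketched very simply for a single feature. The example below moves one input of a toy risk model in small steps until the predicted class changes and reports the smallest such change found. The model coefficients, patient values, step size, and bounds are all hypothetical, and practical counterfactual methods usually search over several features with plausibility constraints.

```python
from math import exp

# Hypothetical coefficients for a toy logistic risk score; not fitted to any data.
COEFS = {
    "ava_cm2": -2.0,
    "mean_gradient_mmHg": 0.05,
    "peak_velocity_m_s": 0.8,
    "lvef_pct": -0.03,
}
INTERCEPT = -1.0


def risk(x):
    """Toy predicted probability of severe disease."""
    z = INTERCEPT + sum(COEFS[k] * x[k] for k in COEFS)
    return 1 / (1 + exp(-z))


def single_feature_counterfactual(x, feature, step, bounds, threshold=0.5, max_steps=200):
    """Change one feature in fixed steps until the predicted class flips.

    Returns the modified input, or None if no flip occurs within the bounds.
    """
    original_class = risk(x) >= threshold
    candidate = dict(x)
    for _ in range(max_steps):
        candidate[feature] += step
        if not (bounds[0] <= candidate[feature] <= bounds[1]):
            return None
        if (risk(candidate) >= threshold) != original_class:
            return candidate
    return None


# Hypothetical patient classified as severe by the toy model.
patient = {"ava_cm2": 0.8, "mean_gradient_mmHg": 45.0, "peak_velocity_m_s": 4.2, "lvef_pct": 55.0}

cf = single_feature_counterfactual(patient, "ava_cm2", step=0.05, bounds=(0.3, 4.0))
if cf is not None:
    print(f"risk {risk(patient):.2f} -> {risk(cf):.2f} "
          f"if AVA were {cf['ava_cm2']:.2f} cm2 instead of {patient['ava_cm2']:.2f} cm2")
```

Such a statement describes the model's behavior, not a causal effect of changing the parameter in the patient, and it should be interpreted with that distinction in mind.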
Third, there is a need for standardized reporting guidelines for XAI in healthcare. Current studies show considerable variability in application and reporting, with limited use of quantitative evaluation metrics. The development of consensus-based reporting standards would improve transparency.
Fourth, most studies rely on retrospective datasets, highlighting the need for prospective validation. Integrating XAI into clinical workflows and evaluating its impact on clinician decision-making, patient outcomes, and healthcare efficiency are essential for meaningful translation into practice.
Conclusion
This review presents an overview of XAI applications in VHD, synthesizing evidence from 52 studies. The field is methodologically active but clinically immature, with a gradual shift from black-box predictions toward more transparent models that offer insight into decision-making. However, most proposed systems remain retrospective, limiting their clinical applicability. Stakeholders can use these findings as a guide for prioritizing clinically meaningful approaches. Through this review, we highlight the need for prospective validation, explanation stability testing, standardized reporting, clinician-centered design, and broader investigation of understudied populations. Without prospective testing and clinician input, XAI will remain an academic exercise with no patient impact.
Footnotes
Acknowledgement
I extend my gratitude to Dr. Umashri Sundararaju and Dr. Shanmathi Subramanian, whose friendship, encouragement, and support became my anchor during difficult times and a source of joy in this journey.
Authors’ Contributions
Jayapradha Sathish: Conceptualization, methodology, investigation, data curation, formal analysis, writing—original draft, visualization.
Hamrish Kumar Rajakumar: Conceptualization, methodology, investigation, data curation, formal analysis, writing—original draft, visualization.
Bhavnadhan Adiththan Abhilash: Conceptualization, methodology, validation, investigation, writing—review and editing, supervision.
Arun Murugan: Validation, writing—review and editing, supervision, project administration.
Data Availability
All data generated or analyzed during this study are included in this published article and its supplemental information files.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest regarding the research, authorship, and/or publication of this article.
Ethical Approval
This type of study does not require ethical approval. The protocol of this systematic review was registered in the International Prospective Register of Systematic Reviews (PROSPERO) of the National Institute of Health Research, available at
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Patient Consent
Not applicable.
Supplemental Material
References
