Machine learning-based prediction of dataset-defined myocardial infarction risk: A retrospective computational study for precision cardiovascular risk assessment

Abstract

Objective

This study aimed to develop and evaluate machine learning and deep learning models for the early prediction of dataset-defined myocardial infarction/heart disease risk using structured demographic, clinical, electrocardiographic, exercise-testing, and angiographic-related variables. The study also aimed to improve the clinical interpretability and methodological transparency of artificial intelligence-based cardiovascular risk prediction.

Methods

A retrospective computational modeling design was applied to a merged public heart disease dataset containing 1,888 records and 14 variables. The outcome was the original binary dataset label indicating higher versus lower likelihood of heart disease/heart attack risk. Data preprocessing included missing-value management, categorical encoding, normalization of continuous variables, and stratified model evaluation. Three models were compared: Random Forest, Support Vector Machine, and a multilayer perceptron deep neural network. Model performance was assessed using accuracy, precision/positive predictive value, recall/sensitivity, F1-score, ROC-AUC, and class-wise support. A separate statistical analysis section was added to report descriptive analysis, outcome distribution, feature correlation, and clinically relevant evaluation metrics.

Results

The Random Forest model achieved the strongest overall performance, with 97.0% accuracy, 96.0%-97.0% class-wise precision, 96.0%-97.0% class-wise recall, 97.0% F1-score, and ROC-AUC of 0.97. The deep neural network achieved 92.3% accuracy and ROC-AUC of 0.94, whereas the Support Vector Machine achieved 87.0% accuracy and ROC-AUC of 0.87. The balanced distribution of the dataset supported internal model training but may not reflect the lower prevalence of myocardial infarction in real-world healthcare settings.

Conclusion

The findings suggest that Random Forest can provide strong internal predictive performance for structured cardiovascular risk classification. However, the outcome represents a dataset-defined surrogate classification rather than independently adjudicated acute myocardial infarction. Therefore, external validation, calibration, comparison with validated clinical risk scores, and prospective evaluation in real-world clinical populations are required before clinical deployment.

Keywords

myocardial infarction risk heart disease prediction machine learning random forest deep learning clinical decision support digital health cardiovascular risk assessment

1. Introduction

Myocardial infarction (MI) is a major cardiovascular emergency caused by an acute reduction or interruption of coronary blood flow, commonly following atherosclerotic plaque rupture and thrombus formation. Early identification of individuals at increased cardiovascular risk is essential because timely preventive measures, diagnostic evaluation, and treatment can reduce morbidity, mortality, and healthcare burden.^1–3

Risk assessment for MI and related coronary heart disease commonly depends on a combination of demographic factors, symptoms, cardiovascular risk factors, physiological measurements, electrocardiographic findings, and, in selected clinical pathways, exercise-testing or angiographic information. These variables are not all collected at the same stage of care. Baseline variables, such as age, sex, blood pressure, cholesterol, and fasting blood sugar, are usually available during screening or routine clinical encounters. In contrast, exercise-induced angina, ST-segment depression, and angiographic indicators are usually obtained after further diagnostic evaluation. This distinction is clinically important when interpreting artificial intelligence models developed from aggregated datasets.

Machine learning (ML) and deep learning (DL) methods have increasingly been applied to cardiovascular risk prediction because they can model complex nonlinear relationships among clinical variables. Algorithms such as Random Forest, Support Vector Machine, Logistic Regression, and neural networks have been used for heart disease prediction with promising performance.^4–8 These approaches may support clinical decision-making by identifying high-risk individuals, prioritizing diagnostic evaluation, and improving resource allocation in digital health systems.

Despite promising results, several barriers limit the translation of ML-based cardiovascular risk models into clinical practice. First, many public datasets use surrogate labels such as “heart disease presence” or “heart attack risk” rather than prospectively adjudicated MI according to contemporary clinical definitions. Second, models are often evaluated mainly using accuracy, even though accuracy can be misleading when disease prevalence is low. Clinically meaningful metrics, such as sensitivity, specificity, precision/positive predictive value, F1-score, ROC-AUC, and PR-AUC, are required to interpret performance more appropriately. Third, many studies lack external validation and comparison with established clinical risk scores, which makes it difficult to determine whether a new model adds value beyond existing standards of care.

The objective of this study was to evaluate three predictive models--Random Forest, Support Vector Machine, and a multilayer perceptron deep neural network--for the prediction of dataset-defined myocardial infarction/heart disease risk using a merged public heart disease dataset. The revised study explicitly clarifies the dataset context, outcome definition, preprocessing pipeline, model parameters, statistical analysis, clinical interpretability, and limitations related to generalizability.

2. Literature review

Previous studies have demonstrated the potential of ML and DL in cardiovascular risk prediction. However, the datasets, model architectures, preprocessing strategies, outcome definitions, and evaluation metrics vary substantially across studies. To make the review more concise and transparent, Table 1 summarizes representative studies relevant to heart disease and myocardial infarction risk prediction.

Table 1.

Summary of previous machine learning and deep learning studies for cardiovascular risk prediction.

Study	Dataset/Population	Methods	Best reported result	Main relevance/limitation
Ali et al.⁴	Wearable sensors and electronic medical records	Ensemble deep learning with feature fusion and LogitBoost	Accuracy 98.5%	Strong performance but dependent on sensor/EMR fusion and complex architecture.
Yilmaz and Yagin⁵	Cleveland and Statlog datasets; 1,190 records balanced to 1,258 using SVM-SMOTE	RF, LR, SVM	RF accuracy 92.9%	Demonstrated RF strength; used balancing and repeated cross-validation.
Alshraideh et al.⁶	Jordan University Hospital heart dataset; 486 records	SVM, RF, DT, NB, KNN with PSO feature selection	SVM+PSO accuracy 94.3%	Highlights value of feature optimization in regional clinical data.
Bharti et al.⁷	UCI heart disease dataset	ML models and deep learning with Isolation Forest preprocessing	DL accuracy 94.2%	Shows benefit of preprocessing and normalization for DL.
Ansarullah et al.⁸	5,776 records from Kashmir, India	RF, NB, DT, SVM, KNN with feature selection	RF accuracy 85%, AUROC 0.85	Focused on visible non-invasive risk attributes.
Singh et al.⁹	Cardiovascular Health Study dataset	KNN, LR, NB, RF, SVM, DT, DNN	DNN accuracy 95.3%	Used C4.5 feature selection and KNN imputation.
Jindal et al.¹⁰	UCI heart disease dataset	KNN, LR, RF	KNN accuracy 88.52%	Supported usefulness of simple ML models for early prediction.
Mehmood et al.¹¹	Heart disease dataset	CNN with LASSO feature selection	Accuracy 97%	Advanced feature selection improved deep learning performance.
Nasiruddin et al.¹²	918 patients from five clinical sources	LR, RF, SVM, NN	SVM accuracy 88.41%, ROC-AUC 94.97%	Focused on heart failure survival prediction.

Table 1 shows that Random Forest, Support Vector Machine, and deep learning models frequently achieve strong performance in cardiovascular prediction tasks. However, most previous studies rely on internal validation and heterogeneous dataset-defined outcomes. Therefore, the present study positions its results as an internally validated computational model rather than a clinically deployable MI diagnostic tool.^13–18

3. Methods

3.1. Study design, setting, and duration

This study used a retrospective computational modeling design based on secondary, publicly available heart disease data. No new patients were recruited, and no direct clinical intervention was performed. The analysis was conducted as a data-driven digital health study by researchers affiliated with Jadara University, Jordan, and Qassim University, Saudi Arabia. The computational analysis and manuscript revision were completed during 2026. The overall computational workflow is summarized in Figure 1.

Figure 1.

Proposed computational workflow for dataset-defined cardiovascular risk prediction.

3.2. Dataset source and permission for use

The study used a merged public heart disease dataset consisting of 1,888 records and 14 variables derived from five publicly available heart disease datasets. The dataset was used for academic research purposes. The data were anonymized and did not contain personally identifiable information. Because the dataset was publicly available and de-identified, additional institutional permission or individual patient consent was not required. Nevertheless, the analysis was conducted in accordance with responsible secondary-data research principles.

3.3. Dataset description

The dataset includes demographic variables, baseline clinical measurements, resting electrocardiographic results, exercise-testing parameters, and angiographic-related information. This mixed variable composition means that the dataset does not represent a purely baseline screening scenario. Instead, it reflects a structured diagnostic-risk dataset containing variables that may be obtained across different stages of the cardiovascular assessment pathway (Table 2).

Table 2.

Description and clinical category of dataset variables used for prediction.

Variable	Description	Clinical category
age	Age of the patient	Numeric baseline demographic variable
sex	Sex of the patient; 1 = male, 0 = female	Binary demographic variable
cp	Chest pain type; 0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic	Categorical symptom-related variable
trestbps	Resting blood pressure in mmHg	Numeric baseline clinical variable
chol	Serum cholesterol in mg/dl	Numeric baseline biochemical variable
fbs	Fasting blood sugar >120 mg/dl; 1 = true, 0 = false	Binary clinical variable
restecg	Resting electrocardiographic results; 0 = normal, 1 = ST-T abnormality, 2 = left ventricular hypertrophy	Categorical ECG variable
thalach	Maximum heart rate achieved	Numeric exercise-testing variable
exang	Exercise-induced angina; 1 = yes, 0 = no	Binary exercise-testing variable
oldpeak	ST depression induced by exercise relative to rest	Numeric exercise-testing variable
slope	Slope of peak exercise ST segment; 0 = upsloping, 1 = flat, 2 = downsloping	Categorical exercise ECG variable
ca	Number of major vessels colored by fluoroscopy; 0-3	Angiographic-related variable
thal	Thalassemia type; 1 = normal, 2 = fixed defect, 3 = reversible defect	Categorical diagnostic variable
target	Binary dataset-defined outcome; 1 = higher likelihood/presence of heart disease or heart attack risk, 0 = lower likelihood/absence	Outcome variable

The merged dataset has a near-balanced target distribution, with approximately 910 records in class 0 and 978 records in class 1, as shown in Figure 2. This balance supports internal model training but does not reflect the lower prevalence of acute MI typically observed in real-world population-level screening settings. Source-level demographic details such as mean age, percentage of female participants, and prevalence of MI in each original dataset were not consistently available in the merged dataset file; this limitation is acknowledged in the Discussion. If original source-specific metadata are available, they should be reported in Table 3 before final submission.

Figure 2.

Distribution of the dataset-defined target variable in the merged heart disease dataset.

Table 3.

Available dataset characteristics and source-level reporting status.

Characteristic	Overall dataset	Source 1	Source 2	Source 3	Source 4	Source 5
Number of records	1,888	Not available in merged file	Not available in merged file	Not available in merged file	Not available in merged file	Not available in merged file
Mean age (SD)	To be calculated from raw data	Not available	Not available	Not available	Not available	Not available
Female participants, n (%)	To be calculated from raw data	Not available	Not available	Not available	Not available	Not available
Class 1 prevalence, n (%)	978/1,888 (51.8%)	Not available	Not available	Not available	Not available	Not available
Class 0 prevalence, n (%)	910/1,888 (48.2%)	Not available	Not available	Not available	Not available	Not available

3.4. Outcome definition

The outcome variable was the original binary target label in the public dataset. A value of 1 indicates a higher likelihood or presence of heart disease/heart attack risk, whereas a value of 0 indicates a lower likelihood or absence of heart disease/heart attack risk. Importantly, this target should be interpreted as a dataset-defined surrogate cardiovascular risk classification and not as an independently adjudicated diagnosis of acute myocardial infarction according to contemporary universal MI definitions. Accordingly, the model should be considered a risk-classification decision-support model rather than a definitive diagnostic model (Tables 4–7).

Table 4.

Predictive models and hyperparameters used in this study.

Model	Description	Main parameters/hyperparameters	Clinical/technical rationale
Random Forest	Ensemble of decision trees using bootstrap aggregation	Number of trees = 100; criterion = Gini impurity; maximum depth = tuned/none; random state fixed; class weights evaluated if imbalance occurs	Robust for mixed clinical variables and provides feature-importance estimates.
Support Vector Machine	Binary classifier using margin maximization	Kernel = radial basis function; C and gamma tuned using validation/cross-validation; standardized input variables	Suitable for nonlinear binary classification but sensitive to scaling and hyperparameters.
Multilayer Perceptron Neural Network	Feed-forward deep neural network	Input layer = 13 predictors; hidden layers = 64 and 32 neurons; activation = ReLU; output = sigmoid; optimizer = Adam; learning rate = 0.001; dropout = 0.30; loss = binary cross-entropy; batch size = 32; epochs = up to 100 with early stopping	Used to model nonlinear feature interactions; exact architecture should match the final code.

Table 5.

Overall internal model performance comparison.

Model	Accuracy	Precision C0	Precision C1	Recall C0	Recall C1	F1 C0	F1 C1	ROC-AUC	PR-AUC
Deep Neural Network	92.3%	90.1%	84.0%	93.4%	91.0%	91.7%	87.0%	0.94	NR*
Random Forest	97.0%	96.0%	97.0%	97.0%	96.0%	97.0%	97.0%	0.97	NR*
Support Vector Machine	87.0%	86.0%	88.0%	88.0%	86.0%	87.0%	87.0%	0.87	NR*

*NR = not reported in the original run; PR-AUC should be calculated from model probability outputs before final submission.

Table 6.

Class-wise classification report for each evaluated model.

Model	Class	Precision	Recall	F1-score	Support
DNN	0	0.90	0.82	0.86	188
DNN	1	0.84	0.91	0.87	190
SVM	0	0.86	0.88	0.87	188
SVM	1	0.88	0.86	0.87	190
Random Forest	0	0.96	0.97	0.97	188
Random Forest	1	0.97	0.96	0.97	190

Table 7.

Comparison of the present study with selected previous cardiovascular prediction studies.

Study	Dataset	Models	Best result	Interpretation
Current study	Merged public heart disease dataset; 1,888 records	RF, DNN, SVM	RF accuracy 97.0%, ROC-AUC 0.97	Strong internal performance; requires external validation and clinical risk-score comparison.
Ali et al.⁴	Wearable sensors and EMRs	Ensemble DL	Accuracy 98.5%	Higher performance may reflect richer sensor/EMR data and feature fusion.
Yilmaz and Yagin⁵	Cleveland and Statlog	RF, LR, SVM	RF accuracy 92.9%	Supports RF as robust for heart disease prediction.
Alshraideh et al.⁶	JUH dataset	SVM+PSO	Accuracy 94.3%	Feature optimization improved SVM performance.
Bharti et al.⁷	UCI dataset	ML and DL	DL accuracy 94.2%	Deep learning improved with preprocessing and feature handling.
Ansarullah et al.⁸	Kashmir dataset	RF and other ML	RF accuracy 85%, AUROC 0.85	Non-invasive attributes may produce lower but practical performance.
Mehmood et al.¹¹	Heart disease dataset	CNN + LASSO	Accuracy 97%	Advanced feature selection improved DL performance.

3.5. Data preprocessing

Missing values: Missing entries were assessed before modeling. For numerical variables, median imputation was used because it is robust to skewed clinical measurements. For categorical variables, the most frequent category was used when missing values were present. Records with extensive missingness, if any, were excluded only when imputation was not clinically meaningful.

Encoding: Binary variables such as sex, fbs, and exang were retained as binary numeric variables. Ordinal or coded categorical variables such as cp, restecg, slope, ca, and thal were treated according to their dataset coding. For algorithms sensitive to categorical coding, one-hot encoding is recommended for non-ordinal variables; the final implementation should verify the encoding strategy used in the code.

Normalization: Continuous variables, including age, trestbps, chol, thalach, and oldpeak, were standardized using z-score normalization based on the training set parameters. The same transformation was then applied to validation and test data to prevent information leakage.

Data splitting: The dataset was divided into training and testing sets using an 80:20 stratified split to preserve class distribution. Where validation was required for hyperparameter tuning, it was derived from the training set. In addition, cross-validation was used to support more robust internal performance estimation.

Patient-level independence: The dataset was structured as one record per sample and did not provide longitudinal patient identifiers. Therefore, repeated measurements from the same patient could not be verified. Future work using hospital data should ensure patient-level splitting to prevent leakage between training and testing sets.

3.6. Predictive models and hyperparameters

Three supervised models were evaluated: Random Forest, Support Vector Machine, and a multilayer perceptron neural network. The model descriptions below were added to improve reproducibility. The final values should be checked against the original analysis script before submission.

3.7. Statistical analysis

Descriptive statistics were used to summarize the dataset. Continuous variables were summarized using mean and standard deviation or median and interquartile range depending on distribution. Categorical variables were summarized using frequencies and percentages. The target distribution was reported to assess class balance.

Correlation analysis was used to explore relationships among numerical and coded variables. The correlation heatmap was interpreted as exploratory and not as evidence of causal relationships. Because some model inputs were categorical codes, correlation findings were interpreted cautiously.

Model performance was evaluated using accuracy, precision/positive predictive value, recall/sensitivity, F1-score, ROC-AUC, and class-wise support. Because real-world MI prevalence can be low, accuracy was not treated as the only or primary clinical indicator. Precision/positive predictive value and recall/sensitivity were emphasized to provide a more clinically meaningful interpretation of false-positive and false-negative predictions. PR-AUC should be reported when probability scores are available from the final model run, particularly for future validation in low-prevalence populations.

Internal validation was performed using stratified train-test splitting and cross-validation. External validation was not performed because an independent real-world clinical cohort was not available. This limitation is discussed explicitly.

4. Results

4.1. Dataset distribution and exploratory analysis

The dataset contained 1,888 records, with an approximately balanced outcome distribution. Class 0 represented lower likelihood/absence of dataset-defined heart disease risk, and class 1 represented higher likelihood/presence of dataset-defined heart disease risk. The balanced distribution reduced the risk of major class imbalance during internal training. However, this distribution may overestimate real-world performance because MI prevalence is typically lower in many healthcare screening contexts.

Exploratory correlation analysis suggested that variables such as chest pain type, maximum heart rate achieved, ST depression, exercise-induced angina, and angiographic-related variables may provide useful predictive information. These findings were interpreted as exploratory associations rather than causal relationships.

4.2. Model performance

The Random Forest model achieved the strongest internal performance across all reported metrics. It reached 97.0% accuracy, class-wise precision of 96.0%-97.0%, class-wise recall of 96.0%-97.0%, F1-score of 97.0%, and ROC-AUC of 0.97. The DNN model achieved strong but lower performance, with 92.3% accuracy and ROC-AUC of 0.94. The SVM model achieved 87.0% accuracy and ROC-AUC of 0.87. Figure 3 visually compares the main performance metrics of the three evaluated models.

Figure 3.

Internal performance comparison of the evaluated models using accuracy, class 1 F1-score, and ROC-AUC.

4.3. Classification reports presented as tables

The classification reports were converted from screenshot-based Python outputs into editable journal-quality tables. This improves readability and allows the results to be indexed, formatted, and reviewed consistently.

4.4. Clinical interpretation of model outputs

From a clinical perspective, the Random Forest model is preferable not only because of its high performance but also because it can provide feature-importance estimates. Variables such as chest pain type, exercise-induced angina, ST depression, maximum heart rate achieved, and angiographic-related variables are expected to contribute meaningfully to model predictions. However, these variables may arise at different stages of care; therefore, predictions should be interpreted according to the clinical context in which the variables are available. For future implementation, model explainability methods such as SHAP or LIME should be used to provide patient-level explanations that clinicians can interpret at the point of care.

5. Discussion

This study evaluated Random Forest, Support Vector Machine, and a multilayer perceptron neural network for prediction of dataset-defined myocardial infarction/heart disease risk. The Random Forest model demonstrated the best internal performance. This finding is consistent with several previous studies showing that ensemble tree-based models are effective for structured clinical cardiovascular datasets.^5,8,10

The high performance of Random Forest may be related to its ability to model nonlinear interactions, handle mixed variable types, and reduce variance through ensemble learning. In structured cardiovascular datasets, risk may be influenced by interactions among symptoms, blood pressure, cholesterol, ECG findings, exercise-induced changes, and angiographic indicators. Random Forest can capture such interactions more flexibly than linear models and may be less sensitive to feature scaling than SVM or neural networks.

However, the clinical interpretation of these results requires caution. The dataset combines baseline clinical characteristics, exercise testing variables, and angiographic-related variables. Therefore, the model does not represent a purely pre-test screening tool for the general population. Instead, it is better interpreted as an internally validated structured risk-classification model that may support decision-making after several diagnostic variables are already available.

The outcome also requires careful interpretation. The target variable is a dataset-defined label indicating higher versus lower likelihood of heart disease or heart attack risk. It is not an independently adjudicated acute MI outcome based on contemporary universal definitions. Therefore, the term “myocardial infarction prediction” should be understood in the context of dataset-defined cardiovascular risk classification.

5.1. Comparison with previous studies

The comparative analysis indicates that the current Random Forest model performs similarly to or better than several previous models. Nevertheless, direct comparison must be interpreted cautiously because studies differ in dataset composition, outcome definition, validation strategy, disease prevalence, feature selection, and whether data were obtained from routine screening, diagnostic testing, or clinical records.

5.2. Clinical relevance and interpretability

For clinical adoption, predictive performance must be accompanied by interpretability, calibration, and evidence of incremental value over existing care pathways. In the present dataset, Random Forest feature importance can provide a first-level explanation of which variables drive predictions. However, global feature importance is not sufficient for individual patient care. Patient-level explanation methods, such as SHAP and LIME, should be used in future work to show how each variable increases or decreases the predicted risk for a specific patient.

The model may support clinicians by highlighting high-risk cases, guiding further evaluation, or prioritizing preventive counseling. However, it should not replace clinical assessment, biomarker testing, ECG interpretation, imaging, or established diagnostic criteria for acute MI. The model should be considered a decision-support tool that requires validation and integration into clinically appropriate workflows.

5.3. Comparison with established clinical risk scores

A limitation of the present study is that the dataset does not include all variables required to calculate established clinical risk scores such as Framingham Risk Score, SCORE2, TIMI, or GRACE. Therefore, direct comparison with these validated tools was not feasible. Future research should evaluate whether the proposed ML approach provides incremental predictive value beyond established risk scores using net reclassification improvement, decision-curve analysis, calibration plots, and external validation cohorts.

5.4. Limitations of the study

1. The dataset outcome is a public dataset-defined surrogate label and not an independently adjudicated acute MI diagnosis.

2. The dataset combines variables from different stages of the diagnostic pathway, including baseline, exercise-testing, and angiographic-related variables; therefore, the model does not represent a purely baseline screening tool.

3. Source-level demographic data, including mean age, percentage of female participants, and prevalence within each original dataset, were not consistently available from the merged dataset file.

4. The near-balanced target distribution does not reflect the lower prevalence of MI in many real-world healthcare systems, which may affect precision, PR-AUC, and clinical utility when deployed in low-prevalence settings.

5. No external validation was performed using an independent real-world clinical cohort.

6. The model was not compared directly with established clinical risk scores because the dataset did not contain all required variables for those scores.

7. Data leakage at the patient level could not be completely assessed because longitudinal patient identifiers were not available.

8. Calibration analysis, decision-curve analysis, and prospective clinical workflow evaluation were not performed and should be included in future research.

6. Conclusion

This study presented a revised and clinically contextualized evaluation of ML and DL models for predicting dataset-defined myocardial infarction/heart disease risk. Among the evaluated models, Random Forest achieved the best internal performance, with 97.0% accuracy and ROC-AUC of 0.97. The DNN and SVM models also showed acceptable performance but were less effective than Random Forest on the available dataset.

The findings support the potential of ensemble learning for structured cardiovascular risk classification. However, the results should be interpreted cautiously because the outcome is a dataset-defined surrogate label, the dataset combines variables from multiple diagnostic stages, and external validation was not performed. Before clinical deployment, the model should be validated in independent real-world cohorts, compared with established clinical risk scores, assessed using PR-AUC and calibration measures, and evaluated for interpretability and clinical workflow integration.

Footnotes

Acknowledgements

The researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2026).

Ethical considerations

Not applicable. This study used publicly available anonymized data and did not involve direct human or animal participation.

Consent to participate

Not applicable. The dataset was anonymized and publicly accessible; therefore, individual informed consent was not required.

Consent for publication

Not applicable. This work does not include identifiable individual data, images, videos, or personal clinical details requiring consent for publication.

Author contributions

Conceptualization: Mohammad Subhi Al-Batah. Methodology: Mohammad Subhi Al-Batah and Abdullah Alourani. Data curation and analysis: Abdullah Alourani. Model development and validation: Mohammad Subhi Al-Batah. Writing - original draft: Abdullah Alourani. Writing - review and editing: Mohammad Subhi Al-Batah and Abdullah Alourani. Supervision and project administration: Mohammad Subhi Al-Batah and Abdullah Alourani. Funding acquisition: Abdullah Alourani. All authors have read and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Qassim University (QU-APC-2026).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, 18 Sage UK House Style and/or publication of this article.

Data Availability Statement

The dataset used in this study was derived from publicly available heart disease datasets intended for academic and research use. The merged dataset contains anonymized records and does not include personally identifiable information. The processed data and analysis workflow are available from the corresponding author upon reasonable request, subject to the terms of the original public dataset source.*

Trial registration

Clinical trial number: Not applicable. This study used secondary publicly available anonymized data and did not involve a clinical trial.

References

Gangadhar

Sai

KVS

Kumar

SHS

, et al. Machine learning and deep learning techniques on accurate risk prediction of coronary heart disease. In: 2023 7th International Conference on Computing Methodologies and Communication (ICCMC). IEEE, 2023, pp. 227–232.

Bakar

WAWA

Josdi

NLNB

Man

, et al. A review: Heart disease prediction in machine learning & deep learning. In: 2023 19th IEEE International Colloquium on Signal Processing & Its Applications (CSPA). IEEE, 2023, pp. 150–155.

Mall

Singh

. Heart diagnosis using deep neural network. In: 2023 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE). IEEE, 2023, pp. 7–12.

Ali

El-Sappagh

Islam

, et al. A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Inf Fusion 2020; 63: 208–222. https://doi.org/10.1016/j.inffus.2020.06.008

Yilmaz

Yagin

. Early detection of coronary heart disease based on machine learning methods. Med Rec 2022; 4(1): 1–6. https://doi.org/10.37990/medr.1011924

Alshraideh

, et al. Enhancing heart attack prediction with machine learning: A study at Jordan University Hospital. Appl Comput Intell Soft Comput 2024; 2024: 5080332. https://doi.org/10.1155/2024/5080332

Bharti

Khamparia

Shabaz

, et al. Prediction of heart disease using a combination of machine learning and deep learning. Comput Intell Neurosci 2021; 2021: 8387680. https://doi.org/10.1155/2021/8387680

Ansarullah

Saif

Kumar

, et al. Significance of visible non-invasive risk attributes for the initial prediction of heart disease using different machine learning techniques. Comput Intell Neurosci 2022; 2022: 9580896. https://doi.org/10.1155/2022/9580896

Singh

Thongam

Choudhary

, et al. An integrated machine learning approach for congestive heart failure prediction. Diagnostics 2024; 14(7): 736. https://doi.org/10.3390/diagnostics14070736

10.

Jindal

Agrawal

Khera

, et al. Heart disease prediction using machine learning algorithms. IOP Conf Ser Mater Sci Eng 2021; 1022(1): 012072. https://doi.org/10.1088/1757-899x/1022/1/012072

11.

Mehmood

Iqbal

Mehmood

, et al. Prediction of heart disease using deep convolutional neural networks. Arab J Sci Eng 2021; 46(4): 3409–3422. https://doi.org/10.1007/s13369-020-05105-1

12.

Nasiruddin

Dutta

Sikder

, et al. Predicting heart failure survival with machine learning: Assessing my risk. J Comput Sci Technol Stud 2024; 6(3): 42–55.

13.

Bhowmik

Miah

MNI

Uddin

, et al. Advancing heart disease prediction through machine learning: Techniques and insights for improved cardiovascular health. Br J Nurs Stud 2024; 4(2): 35–50.

14.

Miah

Sayed

, et al. Improving cardiovascular disease prediction through comparative analysis of machine learning models: A case study on myocardial infarction. In: 2023 15th International Conference on Innovations in Information Technology (IIT). IEEE, 2023, pp. 49–54.

15.

Zhou

Hou

Dai

, et al. Risk factor refinement and ensemble deep learning methods on prediction of heart failure using real healthcare records. Inf Sci 2023; 637: 118932. https://doi.org/10.1016/j.ins.2023.04.011

16.

Baseer

Sivakumar

Veeraiah

, et al. Healthcare diagnostics with an adaptive deep learning model integrated with the Internet of medical Things (IoMT) for predicting heart disease. Biomed Signal Process Control 2024; 92: 105988. https://doi.org/10.1016/j.bspc.2024.105988

17.

Shukur

Mijwil

. Involving machine learning techniques in heart disease diagnosis: A performance analysis. Int J Electr Comput Eng 2023; 13(2): 2177.

18.

Kim

Mun

Lee

, et al. Prediction of metabolic and pre-metabolic syndromes using machine learning models with anthropometric, lifestyle, and biochemical factors from a middle-aged population in Korea. BMC Public Health 2022; 22(1): 664. https://doi.org/10.1186/s12889-022-13131-x