Progressive feature reduction with varied missing data and feature selection for arthritis disease prediction

Abstract

In data-driven research, the curse of dimensionality brings increased computational complexity, sensitivity to noise, and a higher risk of overfitting, so dimensionality reduction is vital for handling high-dimensional datasets effectively. The pilot study disease (PSD) dataset contains 53 features recorded for patients with Rheumatoid Arthritis (RA) and Osteoarthritis (OA). Our work aims to reduce the dimensionality of the PSD dataset, identify a suitable feature selection technique for the reduced-dimensional dataset, determine an appropriate Machine Learning (ML) model, and reveal the significant features that predict RA and OA. The proposed approach, Progressive Feature Reduction with Varied Missing Data (PFRVMD), reduces the number of features by using PCA loading scores on the PSD dataset imputed with random values. Subsequently, four feature selection methods (backward feature selection, the Boruta algorithm, the extra tree classifier, and forward feature selection) were applied to the reduced-dimensional feature set. The significant features (biomarkers) are taken from the best-performing feature selection technique. ML models, namely the K-Nearest Neighbour Classifier (KNNC), Linear Discriminant Analysis (LDA), Logistic Regression (LR), Naïve Bayes Classifier (NBC), Random Forest Classifier (RFC), and Support Vector Classifier (SVC), are used to determine the best feature selection method. The results indicate that the Extra Tree Classifier (ETC) is the most promising feature selection method for the PSD dataset, because the features it selected yielded the highest accuracy with SVC.

Keywords

Autoimmune disease; rheumatoid arthritis; osteoarthritis; feature reduction; feature selection; machine learning algorithms

1 Introduction

Autoimmune diseases, a major immune system disorder, can result from chronic, systemic destruction of innate immunity to auto-antigens [1]. Recent global increases in autoimmune diseases are primarily attributed to host genome, infections, environment, drugs, and antigenic events, as identified by research [2].

The pattern of destruction of the bones and joints defines a chronic autoimmune joint disease known as Rheumatoid Arthritis (RA) [3]. RA, a complex disease influenced by genetics and environment, increases mortality risk. Its prevalence and severity vary based on population and definition [4].

Autoimmune disorders, including RA, Crohn’s disease, Type 1 diabetes, multiple sclerosis, lupus, and psoriasis, affect 4% of the global population [5, 6]. The National Institute of Health (NIH) and Health and Human Services (HHS) recognize autoimmunity as a significant health issue for women, ranking as the ninth leading cause of death for those aged 15–64 and the fourth most common cause of disability in the US [7, 8]. Women are three times more likely to develop autoimmune diseases than men. In India, autoimmunity is a major cause of mortality and chronic disease for both genders across nearly all age groups.

Vitamin D deficiency is increasingly linked to the etiology and pathogenesis of RA, which is associated with early mortality due to chronic inflammation’s adverse effects on cardiovascular function [9, 10]. Environmental factors likely trigger RA in genetically predisposed individuals, leading to immune dysfunction and autoimmunity [11, 12].

The first objective is to reduce the dimensionality of the PSD dataset by identifying relevant variables and creating a concise dataset. The second is to identify the best features for diagnosing RA and OA patients using a suitable feature selection method. The third is to apply different feature selection algorithms to select optimal features. The final objective is to assess the best features from each feature selection technique using ML algorithms and to determine the optimal ML algorithm for accurate RA and OA prediction and diagnosis.

The goal is to develop a predictive model that accurately diagnoses RA and OA patients using PSD data. Machine learning algorithms are employed for dimensionality reduction, feature selection, and predictive model building, in order to identify the best features, feature selection algorithm, and ML algorithm for RA and OA prediction and diagnosis. In machine learning, features are variables describing data; in medical terms, the independent features in a disease dataset are biomarkers. Identifying the best features therefore means finding the significant biomarkers in the disease dataset.

The remainder of our study is structured as follows: A literature review on imputation methods, dimensionality reduction with PCA, an Extra Tree classifier used as a classifier, and feature selection are emphasized in Section 2. The methodology of the proposed work, with its detailed explanation, is described in Section 3. This section also contains the PSD dataset description and brief content on the machine learning classifier and feature selection applied to it. Section 4 explains the experimental results and discussion obtained before and after using different feature selection methods. Section 5 presents a conclusion highlighting significant biomarkers discovered in the PSD dataset.

2 Related work

The first literature review explores imputation techniques for handling missing data in clinical research [13]. Selecting an appropriate imputation method is crucial to maintain accuracy in machine learning-based decision-making. The authors applied imputation before executing ML algorithms. In [14], the author examines missing value imputation (MVI) concerning its methods and evaluation schemes. The author examined selected studies from the decade, which show various issues in the literature, and suggests K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), BPCA (Bayesian Principal Component Analysis), and Decision Tree (DT) are the top five indirect MVI assessment ML models.

In [15], researchers used restriction site-associated DNA sequencing (RADseq) data from Eptesicus fuscus to simulate various demographic models with different percentages of missing data (1%, 10%, 20%). Non-random missing data can distort PCA results in non-model systems due to varying sample quality. High-dimensional data affects the learning, complexity, and accuracy of ML algorithms [16]. Reducing dimensionality is an active research topic aimed at providing reliable, adaptable, and accurate computational tools. PCA is a popular feature extraction technique, with optimized PCA showing better accuracy, time, and space complexity.

Preprocessing generates complete data copies for statistical analysis [17]. Extra Tree Classifier (ETC), DT, and RF classified emails as ham or spam [18]. Metaheuristic feature selection such as particle swarm optimization (PSO), Binary PSO (BPSO), and genetic algorithm identified relevant features. BPSO with ETC achieved the highest accuracy.

In [19], ETC and SVM were combined to predict breast cancer. ETC discovered relevant attributes and ML algorithms such as Logistic Regression (LR), SVM, Multi-Layer Perceptron (MLP), DT, KNN, RF, Naive Bayes (NB), Extreme Gradient Boosting (XGBoost) and Adaptive Boosting (AdaBoost) were implemented using 10-fold stratified cross-validation. SVM with ETC feature selection achieved the highest accuracy. In [20], Parkinson’s disease detection used Boruta, RFE, and RF for feature selection and gradient boosting, XGBoost, bagging, and extra tree classification for prediction. Bagging with RFE outperformed other methods, accurately diagnosing Parkinson’s in 82.35% of cases.

The Kaggle PCOS dataset was evaluated using ensemble classifiers RF, Extra Tree, Adaptive Boosting (AdaBoost) and feature selection techniques such as Chi-square, Pearson, RF, Lasso Regression, sequential forward and backward selection [21]. Feature selection improved ML model accuracy, with Ensemble RF achieving the highest accuracy and sensitivity.

The Cleveland Heart dataset (UCI database), with 115 cases and 72 variables, was used to predict heart disease using NB, RF, Extra Trees, and LR classifiers. LASSO and Ridge Regression feature selection significantly improved accuracy, with Lasso outperforming Ridge (33.3% versus 30.73%) [22].

In [23], white blood cells from three datasets were classified into four subtypes using transfer learning with ResNet50, DenseNet121, MobileNetv2, Inceptionv3, and Xception models. Extra trees classifier selected important features at an intermediate stage. Multi-class SVM classified ResNet50 features with 90.76% accuracy based on recall, precision, F-measure, and accuracy metrics.

In [24], Extra Tree SVM-RBF (ET-SVMRBF) was proposed for diagnosing coronary artery disease (CAD). Synthetic Minority Oversampling addressed class imbalance. SVM-Linear, K-NN (K-Nearest Neighbor), XGBoost, and SVM-Radial Basis Function were the main methods. Extra Tree selected relevant features, and GridSearch optimized hyperparameters. ET-SVMRBF achieved 95.16% accuracy.

In [25], ensemble methods such as Bagging, AdaBoost, and Gradient Boosting were applied to Radius Neighbors Classifier (RNC), Bernoulli Naïve Bayesian (BNB), Gaussian Naïve Bayesian (NB), Extra Tree Classifier (ETC), Passive aggressive classifier (PAC), and Linear Discriminant Analysis (LDA) classifiers, and feature importance was used for attribute selection. Experiments on UCI Skin Disease Center data (34 features, 366 cases) showed Gradient Boosting with RNC and feature importance achieved the highest precision of 99.68%.

3 Methodology

The proposed research framework (in Fig. 1) involves Feature Engineering techniques (imputation, oversampling, feature scaling) on the PSD dataset, a retrospective dataset with missing values. Random value imputer [26], used in real-time datasets [27], handles missing data. Oversampling addresses target variable imbalance, and feature scaling normalizes the dataset. PFRVMD (Progressive Feature Reduction with Varied Missing Data) reduces feature dimensionality by splitting the imputed PSD dataset into four scenarios based on missing value percentages (0–70%, 0–60%, 0–50%, all variables). PCA [28] extracts essential features for each scenario using PCA loading score [29], resulting in the best half of the features per scenario.

Fig. 1

The overall framework of the proposed research

The set Union (∪) operation combines each scenario’s best half features, ultimately achieving the reduced-dimensional feature set. The reduced-dimensional feature set contains relevant PSD dataset features. This progressive approach achieves dimensionality reduction by removing irrelevant features, followed by feature selection. Various feature selection techniques, such as backward feature selection (BFS) [21], the Boruta algorithm (BoA) [30], ETC [31], and forward feature selection (FFS) [20], were applied to the reduced-dimensional feature set.

Feature selection techniques provide lists of significant features. The best technique is determined by executing these features through supervised ML models such as KNNC [32], LDA [33], LR [34], NBC [35], RFC [31], and SVC [36]. Models are assessed using accuracy, F1 score, and ROC AUC score. Results show ETC optimally selects important PSD dataset features when applied to SVC.

3.1 Dataset description

The PSD dataset comprises 16 RA patients (experimental group) and 9 OA patients (control group). With conditional approval from the ethical committee, 25 patient records were collected from Apollo Reach Hospital, Karaikudi, for the pilot study. This research aims to identify significant biomarkers and to predict and diagnose RA and OA in patients. Tables 1 and 2 show the independent variables with their clinical category and index in the PSD dataset, respectively.

Table 1
Independent features on the PSD dataset with their clinical category

S. No.	Clinical Data Category	Independent Features
1	Demographic Data	Gender, Age
2	History	History of Diabetes Mellitus (HODM), History of Hypertension (HOH), History of Coronary Artery Disease (HOCAD), History of Asthma (HOA)
3	General Examination	Temperature (Tmp), Pulse, High Blood Pressure (SysBP), Low Blood Pressure (DiaBP)
4	Biochemistry	Albumin (Albu), Globulin (Glob), Alkaline Phosphatase (AlPh), Alanine Aminotransferase (ALT), Aspartate Aminotransferase (AST), Bilirubin Conjugated (BilC), Bilirubin Unconjugated (BilUC), Total Bilirubin (Tbil), Chloride (Cl), Cholesterol (Chol), C-Reactive Protein (CRP), Creatinine (Crt), Glucose (Glu), High Density Lipoproteins Cholesterol (HDLC), Low Density Lipoproteins Cholesterol (LDLC), Potassium (K), Protein (Pr), Sodium (Na), Triglycerides (Trigly), Total Cholesterol (Rchol), Urea, Uric Acid (Uric_Acid)
5	Coagulation	Activated Partial Thromboplastin Time (APTT), Bleeding Time (BT), Clotting Time (CT), Prothrombin Time (PT)
6	Haematology	Haemoglobin (Hb), Neutrophils (Neu), Lymphocytes (Lym), Eosinophils (Eosin), Monocytes (Mono), Packed Cell Volume (PCV), White Blood Cells (WBC), Platelet Count (PC), Red Blood Cells (RBC), Red Cell Distribution Width (RDW), Mean Corpuscular Volume (MCV), Mean Corpuscular Hemoglobin (MCH), Mean Corpuscular Hemoglobin Concentration (MCHC), Erythrocyte Sedimentation Rate (ESR), Rheumatoid Factor (RAF)
7	Urine Macroscopic Examination	Specific Gravity (SG), Potential of Hydrogen (pH)

Table 2

PSD dataset features and its corresponding index

Feature Index	Independent Variables	Feature Index	Independent Variables
0	Gender	27	Na
1	Age	28	Trigly
2	Tmp	29	Rchol
3	HODM	30	Urea
4	HOH	31	Uric_Acid
5	HOCAD	32	APTT
6	HOA	33	BT
7	SysBP	34	CT
8	DiaBP	35	PT
9	Pulse	36	Hb
10	Albu	37	Neu
11	Glob	38	Lym
12	AlPh	39	Eosin
13	ALT	40	Mono
14	AST	41	PCV
15	BilC	42	WBC
16	BilUC	43	PC
17	Tbil	44	RBC
18	Cl	45	RDW
19	Chol	46	MCV
20	CRP	47	MCH
21	Crt	48	MCHC
22	Glu	49	ESR
23	HDLC	50	RAF
24	LDLC	51	SG
25	K	52	pH
26	Pr	53	Diagnosis (Dependent or target Variable)

Demographic data comprises patient details such as gender and age. History records previous diseases. General examination covers the patient’s temperature, pulse, and blood pressure. Biochemistry covers the chemicals measured in blood, plasma, or urine samples and compared with the values of healthy individuals; increases or decreases can indicate disease [37]. Rheumatoid factor (RAF), erythrocyte sedimentation rate (ESR), and C-reactive protein (CRP) [38] are prominent RA biomarkers. Coagulation evaluates thrombin deficiency affecting blood clotting. Haematology is essential for diagnosing autoimmune diseases affecting the blood. Under Urine Macroscopic Examination, urine appearance, specific gravity, and potential of hydrogen are selected for diagnosing diseases.

3.2 Imputation technique: Random Value Imputer

Table 3 displays the proportion of missing values in each feature of the pilot study disease dataset. Apart from Gender and the target variable, every feature has between 4% and 76% of its values missing. This indicates that the raw PSD dataset is too incomplete to be used reliably in the study. Accordingly, an imputation technique was applied to convert the incomplete PSD dataset into a complete one. Using a Random Value Imputer, the PSD dataset is imputed based on the degree of proximity [26]. After imputation, the PSD dataset was complete and acceptable for the research.
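As an illustration only, random value imputation can be sketched as filling each missing entry with a value drawn at random from the observed entries of the same feature. The function name `random_value_impute`, the seed, and the toy array are assumptions made for this sketch; the exact imputer implementation used in the study is not specified here.

```python
import numpy as np

def random_value_impute(X, seed=0):
    """Fill NaNs column by column with values sampled from that column's observed entries."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # np.array copies, so the caller's data is untouched
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        observed = X[~missing, j]
        if missing.any() and observed.size > 0:
            # Sample with replacement from the values that are actually present
            X[missing, j] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return X

# Toy data (hypothetical values, not PSD records)
X_toy = np.array([[1.0, np.nan],
                  [2.0, 5.0],
                  [np.nan, 7.0],
                  [4.0, np.nan]])
X_imputed = random_value_impute(X_toy)
```

Because every filled value is copied from an observed one, the imputed column never leaves the observed range, but a low true value can still be replaced by a high one.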

Table 3
Percentage of missing values in the PSD dataset features

S.No.	Independent Features of the PSD Dataset	% Of Missing Values
1	Gender, Diagnosis (Dependent Feature)	0
2	Age, SysBP, DiaBP, Pulse	4
3	Tmp	8
4	HOH, Hb	12
5	Neu, Lym, Eosin, Mono, WBC, HOA, HOCAD, HODM, PCV	16
6	Crt, RBC	20
7	PC, Urea	24
8	K, Na, Glu, Cl	28
9	AlPh, AST, ALT	32
10	Glob, Albu, Tbil, SG	40
11	CRP, Uric_Acid	44
12	Pr, Chol, ESR	48
13	CT, Rchol, Trigly, pH, BT	52
14	LDLC, HDLC	56
15	PT, APTT	60
16	RAF	64
17	RDW, MCV, MCH, MCHC	72
18	BilC, BilUC	76

3.3 Oversampling

The target variable of the PSD dataset, ‘Diagnosis’, is imbalanced: RA (1) and non-RA (0) (Osteoarthritis) account for 64% and 36% of the records, respectively. Oversampling is performed to reduce overfitting and underfitting problems. In oversampling, the minority class is resampled until it has the same count as the majority class [39]. Before oversampling, ‘Diagnosis’ had two distinct values, ‘osteoarthritis’ and ‘rheumatoid arthritis’, with counts of 9 and 16. Oversampling increased the minority class count from 9 to 16. As a result, 32 samples were used in the proposed study.
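The class counts above can be reproduced with a minimal sketch of random oversampling with replacement. This assumes plain random duplication of minority records; the specific oversampling routine used in the study is not stated, and `random_oversample` and the placeholder feature matrix are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the majority count."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # Draw the extra rows with replacement (zero extra rows for the majority class)
        extra = rng.choice(idx, size=target - n, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# 16 RA (1) and 9 OA (0) labels, as in the PSD dataset; X is a placeholder
y = np.array([1] * 16 + [0] * 9)
X = np.arange(25).reshape(-1, 1)
X_bal, y_bal = random_oversample(X, y)
```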

3.4 Feature scaling

Feature scaling, a preprocessing method in ML and data analysis, standardizes the range of the independent variables or features within the dataset. The PSD dataset contains both discrete and continuous features. Normalization is necessary before performing PCA so that features measured on different scales contribute comparably. Gender, Age, HODM, HOH, HOCAD, HOA, SysBP, DiaBP, Pulse, AlPh, ALT, AST, Cl, Chol, Glu, HDLC, LDLC, Na, Trigly, BT, CT, Neu, Lym, Eosin, Mono, PCV, PC, RDW, MCV, MCH, MCHC, and ESR are the discrete features in the PSD dataset.

Z-score normalization, $Z = (X - \mu)/\sigma$, is applied to the PSD dataset using StandardScaler from sklearn.preprocessing in Python, where X denotes the input values of a feature, μ is their mean, and σ is their standard deviation.
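A short sketch of this step, using made-up values for two features, shows that StandardScaler matches the formula when σ is the population standard deviation. The array contents are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two features (for example Tmp and SysBP); not PSD records
X = np.array([[36.5, 120.0],
              [37.0, 130.0],
              [38.5, 110.0],
              [37.2, 140.0]])

# Z-score normalization with scikit-learn
X_scaled = StandardScaler().fit_transform(X)

# The same computation written out: Z = (X - mean) / std, per feature,
# with the population standard deviation (ddof=0) that StandardScaler uses
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, each feature has zero mean and unit standard deviation.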

3.5 Progressive feature reduction with varied missing data

3.5.1 Dimensionality reduction

Insufficient training samples and limited computational resources can lead to model overfitting, because the model incorrectly learns irrelevant and redundant features; this problem is known as the curse of dimensionality. High-dimensional datasets make learning less efficient and more time-consuming. Feature reduction techniques, like dimensionality reduction, can address this challenge [27].

PCA, an unsupervised learning algorithm, transforms record samples into orthogonal principal components [41]. While PCA doesn’t directly select important features, it identifies and emphasizes them by transforming the original feature space. PCA loading scores, the coefficients used to construct principal components, signify the correlation between variables and principal components. Those are crucial for feature selection within PCA, helping identify important features, remove redundancy, and ensure feature independence [28, 41].

Random value imputation (RVI) substitutes missing data with available values from the same feature. However, RVI may not accurately represent true underlying patterns and can alter statistical properties like means, variances, and correlations. The PFRVMD approach addresses these limitations. The PSD dataset has Missing Completely at Random (MCAR) missingness [26]. Although RVI does not modify data variance or distribution when fewer than 5% of values are missing [42], the PSD dataset has features with 4–76% missing values. Despite its implementation, RVI has the following limitations:

The original value of the missing data may be the minimum value. When using random value imputation, the missing value can be filled using the maximum value of the respective feature, causing significant distortion and potentially impacting predictions for the target variable.

Conversely, if the original missing value is the maximum value, random imputation may assign the minimum value of the feature, leading to potential misinterpretations that can affect predictive modelling.

Random value imputation introduces the probability of estimation bias, impacting the calculation of confidence intervals. It is a critical concern for researchers relying on accurate estimates.

Using random value imputation tends to skew the probability towards available values, potentially affecting the integrity of the dataset. Instead of using the whole RVI imputed data for dimensionality reduction, it is divided into four scenarios to respond to these challenges.

The PFRVMD approach addresses RVI limitations by dividing the imputed data into four scenarios based on the percentage of missing values in independent variables: 0–70%, 0–60%, 0–50%, and all variables. PCA is applied to each scenario to identify essential features using loading scores. Figure 2 shows the input and output features for each scenario before and after PCA and loadings.

Fig. 2

Proposed Model on Progressive Feature Reduction with Varied Missing Data (PFRVMD).

For each scenario, PCA is executed to identify the essential features using the PCA loading scores, and the output is the best half of the significant features of that scenario. The loading scores of the first PCA component in each scenario are used to identify the best half of the important variables, as shown in Table 4. Finally, the Union (∪) operation combines each scenario’s best half features, yielding the reduced-dimensional feature set. The reduced-dimensional feature set contains the relevant features in the PSD dataset. This progressive approach accomplishes dimensionality reduction on the PSD dataset by removing irrelevant features.
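The procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: features are ranked by the absolute value of their first-component loading, the best half is rounded up, a scenario keeps features whose missing percentage is strictly below its cutoff, and the names `best_half_by_first_loading`, `pfrvmd`, and `cutoffs` are hypothetical. The demo data are random numbers.

```python
import math
import numpy as np
from sklearn.decomposition import PCA

def best_half_by_first_loading(X, names):
    """Keep the half of the features with the largest |loading| on the first principal component."""
    pca = PCA(n_components=1).fit(X)
    loadings = pca.components_[0]          # coefficients of the first component
    order = np.argsort(-np.abs(loadings))  # largest absolute loading first
    k = math.ceil(len(names) / 2)
    return [names[i] for i in order[:k]]

def pfrvmd(X, names, missing_pct, cutoffs=(None, 70, 60, 50)):
    """Union of the best-half features over scenarios defined by missing-value cutoffs."""
    selected = set()
    for cutoff in cutoffs:
        # None means "all variables"; otherwise keep features below the cutoff (assumed strict)
        cols = [j for j, m in enumerate(missing_pct) if cutoff is None or m < cutoff]
        sub_names = [names[j] for j in cols]
        selected |= set(best_half_by_first_loading(X[:, cols], sub_names))
    return [n for n in names if n in selected]  # preserve the original feature order

# Synthetic stand-in for the imputed, scaled data (hypothetical feature names and percentages)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 6))
names_demo = ["f0", "f1", "f2", "f3", "f4", "f5"]
missing_demo = [0, 4, 48, 56, 64, 76]
reduced = pfrvmd(X_demo, names_demo, missing_demo)
```

Taking the union rather than the intersection means a feature survives if it ranks in the best half of at least one scenario.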

Scenario 1 output 27 Predictor Variables: ['Gender', 'Tmp', 'HOH', 'SysBP', 'DiaBP', 'Albu', 'Glob', 'ALT', 'BilC', 'BilUC', 'Chol', 'HDLC', 'LDLC', 'K', 'Pr', 'Na', 'Uric_Acid', 'CT', 'PT', 'Hb', 'Lym', 'Mono', 'PCV', 'RBC', 'MCV', 'MCH', 'pH']

Scenario 2 output 24 Predictor Variables: ['Gender', 'Tmp', 'HOH', 'SysBP', 'DiaBP', 'Albu', 'Glob', 'Chol', 'Glu', 'HDLC', 'LDLC', 'Pr', 'Trigly', 'APTT', 'CT', 'PT', 'Hb', 'Neu', 'Lym', 'Mono', 'PCV', 'PC', 'RBC', 'RAF']

Scenario 3 output 22 Predictor Variables: ['Gender', 'Tmp', 'HOH', 'SysBP', 'DiaBP', 'Pulse', 'Albu', 'Glob', 'ALT', 'Chol', 'HDLC', 'LDLC', 'Pr', 'Rchol', 'Uric_Acid', 'CT', 'Hb', 'Neu', 'Lym', 'Mono', 'PCV', 'RBC']

Scenario 4 output 19 Predictor Variables: ['Gender', 'Tmp', 'SysBP', 'DiaBP', 'Pulse', 'Albu', 'Glob', 'ALT', 'Chol', 'Pr', 'Na', 'Urea', 'Hb', 'Eosin', 'Mono', 'PCV', 'WBC', 'RBC', 'SG']

The output of the Progressive Feature Reduction with Varied Missing Data comprises a total of 39 predictor variables: ['Gender', 'SysBP', 'DiaBP', 'Pulse', 'Tmp', 'HOH', 'Hb', 'Neu', 'Lym', 'Eosin', 'Mono', 'WBC', 'PCV', 'RBC', 'PC', 'Urea', 'K', 'Na', 'Glu', 'ALT', 'Glob', 'Albu', 'SG', 'Uric_Acid', 'Pr', 'Chol', 'CT', 'Rchol', 'Trigly', 'pH', 'HDLC', 'LDLC', 'PT', 'APTT', 'RAF', 'MCV', 'MCH', 'BilC', 'BilUC']. After PFRVMD, 14 of the 53 predictor variables listed in Table 2 were found to be unimportant and removed from the PSD dataset, thereby reducing the dimensionality. The remaining 39 predictor variables are used for further feature selection and analysis. Table 4 presents the loading scores for the predictor variables in the four scenarios.

Table 4

Loading Scores [28] for Predictor Variables in Four Scenarios

Loading score of First Scenario output with 27 predictor variable index		Loading score of Second Scenario output with 24 predictor variable index		Loading score of Third Scenario output with 22 predictor variable index		Loading score of Fourth Scenario output with 19 predictor variable index
36 -0.310731	0 -0.158628	34 -0.254845	8 -0.179367	32 -0.306129	21 -0.182463	34 -0.403799	11 -0.169877
41 -0.302726	52 -0.154065	10 -0.253332	44 -0.164567	37 -0.305768	8 -0.175118	31 -0.383925	2 0.162328
44 -0.292616	31 0.151572	2 0.252963	7 -0.158563	40 -0.305706	0 -0.174518	26 -0.382144	32 0.159476
10 -0.262501	27 -0.141573	32 0.250142	20 -0.158185	10 -0.288434	34 -0.167450	10 -0.256559	8 -0.147005
11 -0.201846	23 -0.141133	39 -0.249494	35 0.155530	2 0.246611	36 0.158686	30 0.239728	24 0.141554
35 -0.193496	7 -0.140983	42 -0.239374	41 -0.154597	17 -0.230487	13 -0.140507	0 -0.237179	29 0.131199
16 0.193213	4 0.131410	17 -0.218129	30 0.153369	11 -0.217997	9 0.131472	36 -0.208726	9 0.126972
26 -0.186173	13 -0.116503	22 -0.218098	21 -0.148404	24 -0.217984	27 -0.121654	22 -0.190210	7 -0.122918
46 -0.185554	47 -0.114874	36 -0.205124	4 0.147574	31 0.212052	33 0.120040	17 -0.180829	23 -0.109347
19 -0.184921	40 0.113444	33 -0.205114	26 -0.143924	22 -0.203989	4 0.110057	13 -0.178637
2 0.175072	25 0.112301	24 -0.198671	0 -0.143344	7 -0.187416	29 0.094036
34 0.174041	38 -0.108481	11 -0.186180	38 0.140371
15 0.172395	8 -0.107331
24 -0.171117

Table 5

Performance Evaluation using ML Algorithms for different feature selection techniques

	K Nearest Neighbor Classifier	Linear Discriminant Analysis	Logistic Regression	Naïve Bayes Classifier	Random Forest Classifier	Support Vector Classifier
Before Feature Selection
Accuracy	80	80	90	80	80	80
ROC AUC Score	74	80	80	78	96	99
F1 Score	80	75	89	80	89	75
Backward Feature Selection
Accuracy	86	96	96	77	91	86
ROC AUC Score	99	99	99	99	99	95
F1 Score	92	96	96	82	96	88
Boruta Algorithm
Accuracy	50	90	80	60	80	50
ROC AUC Score	67	83	75	75	92	67
F1 Score	80	80	80	80	80	80
Extra Tree Classifier
Accuracy	90	90	90	50	90	99.9
ROC AUC Score	99	99	96	82	99	99.9
F1 Score	89	89	89	62	89	99.9
Forward Feature Selection
Accuracy	95	96	95	77	95	96
ROC AUC Score	99	99	99	95	99	96
F1 Score	93	92	96	84	90	96

3.6 Feature Selection Experiment Conducted on the output of the PFRVMD approach

3.6.1 Feature selection

Feature selection constitutes a vital stage in the data science life cycle, focusing on acquiring a subset of pertinent features for model training [43]. Given an initial set of features $F = \{X_{1}, X_{2}, X_{3}, \dots, X_{n}\}$, feature selection seeks a subset of relevant features $\bar{F} \subset F$, where $\bar{F} = \{\bar{X}_{1}, \bar{X}_{2}, \bar{X}_{3}, \dots, \bar{X}_{m}\}$ with $m < n$. This $\bar{F}$ improves or maintains classification accuracy or simplifies classifier complexity.

With the main goal of improving classification accuracy, feature selection involves choosing the relevant attributes and discarding those irrelevant or redundant [44]. There are three categories of feature selection: filter methods, wrapper methods, and embedded methods. The wrapper and embedded methods are used to select features for the PSD dataset. Three feature selections are performed using the wrapper method: Backward feature selection, Boruta Algorithm, and Forward feature selection. Additionally, a single feature selection from the embedded method, called ETC, is implemented in the proposed work.

3.6.2 Wrapper method

Backward feature selection starts with the complete feature set and iteratively removes features that provide the smallest improvement or impact on accuracy, while maintaining ML model metrics. Boruta Algorithm (BoA), an RF-based feature selection method, iteratively removes statistically less relevant features by comparing real feature Z-scores to randomized shadow feature Z-scores, resulting in a stable selection of important and unimportant attributes [45]. Forward feature selection adds features one by one to an empty set until model metrics are no longer affected, stopping when no significant improvement is observed. Correlation analysis can also identify relevant features.
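Forward and backward selection can be sketched with scikit-learn's `SequentialFeatureSelector`. This is an illustration under assumptions: the data are synthetic, logistic regression is an arbitrary choice of wrapped estimator, and a fixed target of three features stands in for the "stop when the metric no longer improves" rule described above. The Boruta algorithm is not shown because it relies on a separate third-party package.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for the reduced-dimensional feature set
X, y = make_classification(n_samples=60, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

estimator = LogisticRegression(max_iter=1000)

# Forward: start from an empty set and add the feature that helps the CV score most
forward = SequentialFeatureSelector(estimator, n_features_to_select=3,
                                    direction="forward", cv=3).fit(X, y)

# Backward: start from all features and drop the one whose removal hurts least
backward = SequentialFeatureSelector(estimator, n_features_to_select=3,
                                     direction="backward", cv=3).fit(X, y)

forward_idx = forward.get_support(indices=True)
backward_idx = backward.get_support(indices=True)
```

The two directions need not agree on the selected subset, which is one reason the study compares several selection techniques.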

3.6.3 Embedded method

Extra Tree Classifier (Extremely Randomized Trees) is an ensemble method that improves accuracy by combining multiple Decision Trees (DTs) constructed on random subsets of data and features, making them uncorrelated. ETC assigns feature importance scores to indicate relevance to the target variable. Unlike Random Forest (RF), ETC uses all data records and selects features randomly at each node for splitting. ETC performs well with noisy features, such as those from the random value imputer used in the PSD dataset. ETCs use random split points, increasing algorithm variance, which grows as the ensemble size increases [24].
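A minimal sketch of embedded selection with scikit-learn's `ExtraTreesClassifier` ranks features by their importance scores. The synthetic data, the number of trees, and the cut-off of five features are assumptions for illustration; the study's settings for this step are not given here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic two-class data standing in for the reduced-dimensional feature set
X, y = make_classification(n_samples=64, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

etc = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importance of each feature, averaged over the trees
importances = etc.feature_importances_

# Feature indices sorted from most to least important; keep the top five
ranking = np.argsort(importances)[::-1]
top_k = ranking[:5]
```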

3.7 Machine learning classification algorithms

Machine learning algorithms are divided into parametric and non-parametric methods. Parametric methods make large assumptions about mapping input features to output attributes, are easier to train, and use less data, but may have reduced robustness [46]. This includes simple neural networks, LR, LDA, Perceptron, and NB. Our proposed work uses LR, LDA, and NB parametric methods.

Non-parametric methods make few assumptions about the objective function, require large amounts of data, and can produce more effective but more complex models that are slower to train [45]. Our proposed work uses KNN, RF, and SVM non-parametric methods.

3.7.1 K-Nearest neighbours classifier

KNN is one of the earliest supervised learning models for classification and regression [32]. The number of nearest neighbours considered for making a prediction was five. The weight function was ‘uniform’, implying that all neighbours were given equal weight. The distance metric employed for calculating the distance between samples was the Minkowski distance, with its power parameter (p) left at the default value of two.

3.7.2 Linear discriminant analysis

LDA is a supervised technique that finds a linear combination of features that characterizes or separates two or more classes of objects or events [33]. The parameter highlighted for LDA() in Python is ‘solver’, which specifies the algorithm for computing the Fisher Linear Discriminant. The default solver is Singular Value Decomposition (SVD). It functions by decomposing the covariance matrix of the data into a matrix of singular values and a matrix of singular vectors. The linear discriminants are then calculated as a linear combination of the singular vectors.

3.7.3 Logistic regression

As a supervised learning model, LR relies upon a labelled data set. LR is the most appropriate model for solving problems of binary classification [34]. The parameters used for LR are as follows: penalty: L2 regularization, inverse of regularization strength (C): 1.0, solver: Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, maximum number of iterations: 100, and class weight assigned as no class weight.

3.7.4 Naïve bayes classifier

The Naïve Bayes classifier is an effective and simple ML algorithm for building fast models with quick predictions. It is a probabilistic classifier that predicts based on the probability of an object [35]. The Gaussian Naïve Bayes (NB) classifier was used with its default parameters; in particular, the class prior probabilities and the weights of individual samples were set to None.

3.7.5 Random forest classifier

RF belongs to the bagging category of ensemble techniques [31]. The following parameters were used for RFC: the number of estimators was 100, the function used to measure the quality of a split was the Gini impurity, and the number of jobs to run in parallel was set to –1, which means that all available processors were used.

3.7.6 Support vector classifier

SVM is a supervised, kernel-based ML model designed for classification and regression tasks, and it is primarily used for binary classification. The hyperplane and the support vectors define an SVM model [36]. Setting ‘probability’ to True enables the Platt scaling heuristic, which estimates class probabilities from the SVM decision function. The kernel function used to compute the similarity between data points was the radial basis function (RBF) kernel.
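The six classifier settings described in Section 3.7 can be sketched as follows, assuming the scikit-learn library (the text does not name the library, so this is an assumption). The synthetic data from `make_classification` is a stand-in for the PSD dataset, and parameters not mentioned in the text are left at the library defaults.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

classifiers = {
    # Five neighbours, equal votes, Minkowski distance with p = 2.
    "KNNC": KNeighborsClassifier(n_neighbors=5, weights="uniform",
                                 metric="minkowski", p=2),
    # SVD solver (the library default).
    "LDA": LinearDiscriminantAnalysis(solver="svd"),
    # L2 penalty is the library default, so it is not passed explicitly.
    "LR": LogisticRegression(C=1.0, solver="lbfgs", max_iter=100,
                             class_weight=None),
    # No user-supplied class priors.
    "NBC": GaussianNB(priors=None),
    # 100 trees, Gini impurity, all available processors.
    "RFC": RandomForestClassifier(n_estimators=100, criterion="gini",
                                  n_jobs=-1),
    # RBF kernel with Platt-scaled probability estimates.
    "SVC": SVC(kernel="rbf", probability=True),
}

# Synthetic stand-in data (assumption): 120 samples, 10 features, 2 classes.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Fit each model and record its training accuracy for illustration only.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    scores[name] = clf.score(X, y)

# Class probabilities from the SVC for the first three samples.
svc_proba = classifiers["SVC"].predict_proba(X[:3])
```

Keeping the models in one dictionary makes it straightforward to evaluate every feature subset with the same set of classifiers.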

3.8 Performance evaluation

3.8.1 Accuracy

The evaluation metric for classification is the accuracy score. In classification, the accuracy score is the ratio of correct predictions to the total number of input data points [47].

$\text{Accuracy score} = \frac{\text{Number of correct predictions}}{\text{Total number of data points}} \times 100\%$ (1)

The accuracy score is unreliable when the dataset has an uneven distribution of classes. In Equation (1), the number of correct predictions is the sum of True Positives and True Negatives.
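A short worked example may help here. The sketch below computes the accuracy score of Equation (1) from confusion-matrix counts; the function name `accuracy_percent` and all counts are hypothetical. The second call illustrates why accuracy can mislead on an imbalanced dataset: a model that labels every sample as negative still scores highly.

```python
def accuracy_percent(tp, tn, fp, fn):
    """Accuracy in percent: correct predictions over all data points."""
    correct = tp + tn
    total = tp + tn + fp + fn
    return 100.0 * correct / total


# Balanced, hypothetical case: 85 of 100 predictions are correct.
acc = accuracy_percent(tp=45, tn=40, fp=5, fn=10)  # 85.0

# Imbalanced, hypothetical case: 95 negatives and 5 positives, and the
# model predicts "negative" for everyone, so it finds no positive case.
majority_only = accuracy_percent(tp=0, tn=95, fp=0, fn=5)  # 95.0
```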

3.8.2 F1 score

Precision is the proportion of predicted positive cases that are actually positive, as given in Equation (2), where the total number of predicted positives is the sum of True Positives and False Positives.

$\text{Precision} = \frac{\text{True Positive}}{\text{Total Number of Predicted Positive}}$ (2)

Recall is the proportion of actual positive cases that are correctly predicted, as given in Equation (3), where the total number of actual positives is the sum of True Positives and False Negatives. Recall captures the error caused by false negatives, which is crucial when evaluating diseases like RA.

$\text{Recall} = \frac{\text{True Positive}}{\text{Total Number of Actual Positive}}$ (3)

The F1 score is the harmonic mean of precision and recall, which makes it a useful metric for imbalanced datasets, as shown in Equation (4) [48].

$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (4)
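The relationship between Equations (2) to (4) can be illustrated with a small sketch. The function name `precision_recall_f1` and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)  # Equation (2)
    recall = tp / (tp + fn)     # Equation (3)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (4)
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives,
# 30 false negatives. Precision is 0.75, recall is 0.5, and the F1 score
# is about 0.6, which lies closer to the lower of the two values.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=30)
```

Because the F1 score is pulled towards the smaller of precision and recall, a model cannot score well by maximizing one at the expense of the other.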

3.8.3 ROC AUC score

ROC curves plot the hit rate (TPR) against the fallout (FPR) to evaluate model performance. The AUC score, derived from the ROC curve, measures model quality and ranges from 0 to 1. An AUC score of 1 indicates perfect separation of the classes, 0.5 indicates performance no better than random guessing, and 0 indicates a completely inverted ranking of the classes [48].
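To make the AUC scale concrete, the sketch below uses the standard rank-based interpretation of the AUC: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
perfect = roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])   # 1.0
inverted = roc_auc(y_true, [0.9, 0.8, 0.2, 0.1])  # 0.0
constant = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # 0.5
```

The three calls correspond to a perfect ranking, a fully reversed ranking, and a model whose scores carry no information.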

4 Experimental results and discussion

Informative features from the PFRVMD approach were further analyzed using wrapper and embedded feature selection techniques (backward feature selection, Boruta algorithm, forward feature selection, ETC) to obtain significant features for easier RA and OA diagnosis. The best features from each technique were evaluated using ML models (KNNC, LDA [33], LR, NBC, RFC, SVC) and compared using accuracy, F1, and ROC AUC scores to determine the best-performing model and most significant predictive features.

4.1 Best features obtained from each feature selection technique

In backward feature selection, when ‘accuracy’ was the scoring parameter of the SequentialFeatureSelector class, the highest accuracy was obtained with the significant features ‘Gender’, ‘Age’, ‘Tmp’, ‘HOH’, ‘Glob’, ‘Alph’, ‘BilUC’, ‘Cl’, ‘HDLC’, ‘LDLC’, ‘Pr’, ‘APTT’, ‘BT’, ‘PT’, and ‘Lym’. Similarly, when ‘F1 Score’ was the scoring parameter, the highest F1 score was obtained with ‘Gender’, ‘Age’, ‘Tmp’, ‘HODM’, ‘HOH’, ‘HOA’, ‘Glob’, ‘Alph’, ‘BilUC’, ‘Cl’, ‘HDLC’, ‘LDLC’, ‘Pr’, ‘APTT’, ‘BT’, ‘CT’, ‘PT’, and ‘Lym’. With ‘ROC AUC Score’ as the scoring parameter, ‘Gender’, ‘HOCAD’, and ‘AST’ were identified as the important features with the highest ROC AUC score.
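The procedure above can be sketched with scikit-learn's `SequentialFeatureSelector`. This is an assumption: the text names the class but not the library, the wrapped estimator, or the number of features requested. The logistic regression estimator, the synthetic data, and `n_features_to_select=3` are therefore illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 150 samples, 8 features.
X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),  # illustrative estimator choice
    n_features_to_select=3,             # illustrative target size
    direction="backward",               # "forward" gives forward selection
    scoring="accuracy",                 # or "f1" / "roc_auc"
    cv=5,
)
selector.fit(X, y)

# Column indices of the features that survive backward elimination.
selected = selector.get_support(indices=True)
```

Changing only the `scoring` argument reruns the same search under a different criterion, which is why the three scoring parameters can yield different feature subsets.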

For the 39 reduced-dimensional feature set, the Boruta algorithm produced the following feature ranking (listed by feature index): [10 20 34 28 27 5 15 20 10 33 31 12 7 22 1 23 7 18 13 9 23 26 4 32 25 15 1 1 28 30 1 13 2 3 1 1 18 17 5]. The best features derived from the Boruta algorithm, which are the six features ranked 1, are ‘MCH’, ‘MCV’, ‘PCV’, ‘Lym’, ‘Neu’, and ‘LDLC’.
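As a small illustration, the ranking vector quoted above can be filtered for rank-1 entries. This assumes the common Boruta convention (as in the BorutaPy package) that rank 1 marks the confirmed features. The column order of the 39 features is not given in the text, so only positions are reported here, not feature names.

```python
# Ranking vector copied from the text (one rank per feature, 39 features).
ranking = [10, 20, 34, 28, 27, 5, 15, 20, 10, 33, 31, 12, 7, 22, 1, 23, 7,
           18, 13, 9, 23, 26, 4, 32, 25, 15, 1, 1, 28, 30, 1, 13, 2, 3, 1,
           1, 18, 17, 5]

# Zero-based column positions whose rank equals 1 (assumed "confirmed").
confirmed = [i for i, rank in enumerate(ranking) if rank == 1]

n_features = len(ranking)     # 39
n_confirmed = len(confirmed)  # 6
```

The six rank-1 positions agree in number with the six best features reported for the Boruta algorithm.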

The best features obtained from ETC are ‘Gender’, ‘MCH’, ‘APTT’, ‘Lym’, ‘Glu’, ‘PC’, ‘pH’, ‘LDLC’, ‘RBC’, and ‘Pr’, as shown in Fig. 3.

Fig. 3

The best features obtained from Extra Tree Classifier.
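A ranking of this kind can be sketched with the impurity-based importances of scikit-learn's `ExtraTreesClassifier`. The library choice, the synthetic data, the placeholder feature names, and the settings `n_estimators=100` and `random_state=0` are all assumptions made for illustration; the text does not state the ETC parameters used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption) with 39 features, as in the
# reduced-dimensional feature set.
X, y = make_classification(n_samples=200, n_features=39, n_informative=10,
                           random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

etc = ExtraTreesClassifier(n_estimators=100, random_state=0)
etc.fit(X, y)

# Importances are normalized to sum to 1; take the ten largest.
importances = etc.feature_importances_
top10 = np.argsort(importances)[::-1][:10]
top10_names = [feature_names[i] for i in top10]
```

Sorting the importances in descending order and keeping the first ten mirrors the selection of ten features reported for ETC.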

Forward feature selection identified significant features based on different scoring parameters. For ‘accuracy’, ‘Gender’, ‘Trigly’, and ‘Neu’ yielded the highest accuracy. For ‘F1 Score’, ‘Gender’, ‘Tbil’, and ‘Neu’ achieved the highest F1 score. Using ‘ROC AUC Score’, ‘Gender’, ‘BilUC’, ‘Urea’, and ‘APTT’ were identified as important features. When multiple ML algorithms produced the same highest scores, the minimum significant features were considered optimum.

4.2 Evaluation of feature selection techniques in improving machine learning algorithms accuracy

The ML algorithms were trained and tested on the reduced-dimensional feature set of 39 attributes. On the testing dataset, LR presented the highest accuracy, at 90%.

With the best features obtained using the BFS, LDA and LR were the optimal classifier models, with the highest accuracy of 96%. With the best features obtained from the BoA FS, LDA was the best classifier model, with an accuracy of 90%. With the best features derived from the ETC FS, SVC was the dominant model, with the highest accuracy of 99.9%, as depicted in Fig. 4. Using the best features of FFS, LDA and SVC achieved the highest accuracy of 96%.

Fig. 4

Comparison of Feature Selection Techniques in enhancing ML Algorithms accuracy.

Across the feature selection techniques and ML algorithms evaluated on the accuracy metric, SVC combined with the ETC FS emerged as the best classifier.

4.3 Assessing the effectiveness of feature selection methods in machine learning algorithms using the F1 score

Using all 39 attributes, LR and RFC yielded the highest F1 score of 89%.

With the best features selected by BFS, LDA and LR were the optimal models, achieving an F1 score of 96%. Similarly, with the best features selected by the BoA FS, LDA, LR, and RFC were the most suitable classifiers, obtaining an F1 score of 96%. With the best features from the ETC FS, SVC was the dominant model, producing the highest F1 score of 99.9%. Additionally, with the best features selected by FFS, LR and SVC obtained the highest F1 score of 96%, as represented in Fig. 5.

Fig. 5

Comparison of Feature Selection Methods based on F1 Score.

When evaluating different feature selection techniques using the F1 score metric, SVC emerged as the superior classifier in conjunction with ETC FS.

4.4 Comparing feature selection methods with ML algorithms using ROC AUC Score

Among the models evaluated, SVC achieved the highest ROC AUC score of 99% using 39 attributes before applying feature selection techniques.

Except for SVC, all classifier models exhibited the highest ROC AUC score of 99% with the optimal features obtained using the BFS. Using the optimal features obtained from the BoA FS, RFC emerged as the best classifier model, achieving a ROC AUC score of 95%. The prominent features selected by the ETC FS yielded SVC as the superior model, with the highest ROC AUC score of 99.9%, as highlighted in Fig. 6. With the optimal features from the FFS, the KNNC, LDA, LR, and RFC models all achieved the highest ROC AUC score of 99%.

Fig. 6

Comparison of Feature Selection Methods based on ROC AUC Score.

In evaluating different feature selection techniques against machine learning algorithms based on ROC AUC score metrics, SVC emerged as the best classifier when utilizing the ETC FS.

5 Conclusion and future scope

The proposed research work suggests that ETC feature selection performs best along with the Support Vector Classifier algorithm for the clinical dataset (obtained from Apollo Reach Hospital, Karaikudi) after performing progressive feature reduction with varied missing data using PCA. The accuracy, F1, and ROC AUC scores reached the maximum of 99.9% for the SVC classifier. This proposed work has identified ten relevant biomarkers to predict and classify RA and OA patients for any given data point from the PSD dataset. Moreover, the significant biomarkers have been grouped by clinical category and are depicted in Table 6. Furthermore, the dimensionality reduction achieved using the proposed PFRVMD approach reduced the pilot dataset from 53 independent features to 39 independent features. Without compromising the quality of the result, the PFRVMD has reduced the computational complexity by minimizing the number of features. The findings suggest that PFRVMD can decrease feature dimensions while retaining essential data.

Table 6

Significant features/biomarkers identified from the PSD dataset

S. #.	Clinical Data Category	Significant Features / Biomarkers Identified from the Proposed Study
1	Demographic Data	Gender
2	Biochemistry	Glu, LDLC, Pr
3	Coagulation	APTT
4	Haematology	Lym, PC, RBC, MCH
5	Urine Macroscopic Examination	pH

The future scope of this research lies in deploying the SVC model for predicting rheumatoid arthritis. This would have significant implications in assisting healthcare professionals in identifying individuals at risk of developing arthritis, thereby enabling early diagnosis and timely intervention.

Data Availability Statement

PSD dataset: Data is unavailable on request from the authors. The data required to reproduce the above findings cannot be shared at this time due to legal/ethical reasons.

Footnotes

Acknowledgments

This article has been published under AURF Seed Money Grant –2018 grant sanctioned vide letter No. ALU: AURF Start-up Grant: 2018, Dt. 23.03.2018.

This article has been published under RUSA Phase 2.0 (II Installment) grant sanctioned vide letter No. F. 24-51/2014-U, Policy (TN Multi-Gen), Dept. of Edn. Govt. of India, Dt. 09.10.2018.

The authors thank the Department of Science and Technology, New Delhi, for the financial support in general and infrastructure facilities sponsored under PURSE 2nd Phase Programme (Order No. SR/PURSE phase 2/38 (G) dated: 21.02.2017).

Conflict of Interest

The authors declare that they have no conflict of interest.

References

1. Youssefi, Tafaghodi and Farsiani, Helicobacter pylori infection and autoimmune diseases; Is there an association with systemic lupus erythematosus, rheumatoid arthritis, autoimmune atrophy gastritis and autoimmune pancreatitis? A systematic review and meta-analysis study, J. Microbiol. Immunol. Infect. 54 (2021), 359–369. doi:10.1016/j.jmii.2020.08.011.

2. Sfriso, et al. Infections and autoimmunity: The multifaceted relationship, J. Leukoc. Biol. 87 (2010), 385–395. doi:10.1189/jlb.0709517.

3. Gabriel, Youinou J.P. and Saraux, The environment, geo-epidemiology, and autoimmune disease: Rheumatoid arthritis, Autoimmun. Rev. 9 (2010), A288–A292. [Online]. Available: http://dx.doi.org/10.1016/j.autrev.2009.11.019.

4. Sharif, Watad, Bragazzi N.L., Lichtbroun, Amital and Shoenfeld, Physical activity and autoimmune diseases: Get moving and manage the disease, Autoimmun. Rev. 17 (2018), 53–72. doi:10.1016/j.autrev.2017.11.010.

5. Pincus, Callahan L.F., Sale W.G., Brooks A.L., Payne L.E. and Vaughn W.K., Severe functional declines, work disability, and increased mortality in seventy-five rheumatoid arthritis patients studied over nine years, Arthritis Rheum. 27 (1984), 864–872. doi:10.1002/art.1780270805.

6. Hahn Y.S. and Kim J.G., Pathogenesis and clinical manifestations of juvenile rheumatoid arthritis, Korean J. Pediatr. 53 (2010), 921–930. doi:10.3345/kjp.2010.53.11.921.

7. Autoimmune Disease NSCF, Available at: https://nationalstemcellfoundation.org/glossary/autoimmune-disease (2023), Accessed 12 February 2023.

8. Shekhawat D.K., Autoimmune Disease in India, now an Epidemic!, Available at: https://www.fammacademy.org/Autoimmune-Disease-in-IndiaAutoimmune-Disease-in-India (2020), Accessed 12 February 2023.

9. Harrison S.R., Li, Jeffery L.E., Raza and Hewison, Vitamin D, Autoimmune Disease and Rheumatoid Arthritis, Calcif. Tissue Int. 106 (2020), 58–75. doi:10.1007/s00223-019-00577-2.

10. Mcfarlane I.M., et al. Assessment of interstitial lung disease among black rheumatoid arthritis patients, Clinical Rheumatology 38 (2019), 3413–3424. [Online]. Available: http://dx.doi.org/10.1007/s10067-019-04760-6.

11. Simon T.A., Kawabata, Ray, Baheti, Suissa and Esdaile J.M., Prevalence of Co-existing Autoimmune Disease in Rheumatoid Arthritis: A Cross-Sectional Study, Adv. Ther. 34 (2017), 2481–2490. doi:10.1007/s12325-017-0627-3.

12. Lindler, Breanna Long, Katelyn, Taylor, Nancy, Lei, Use of Herbal Medications for Treatment of Osteoarthritis and Rheumatoid Arthritis, Medicines 7 (2020), 67. [Online]. Available: http://dx.doi.org/10.3390/medicines7110067.

13. Austin P.C., White I.R., Lee D.S. and Van Buuren, Missing Data in Clinical Research: A Tutorial on Multiple Imputation, Can. J. Cardiol. (2021), pp. 1–10. doi:10.1016/j.cjca.2020.11.010.

14. Hasan, Alam, Roy, Dutta, Jawad and Das, Missing value imputation affects the performance of machine learning: A review and analysis of the literature (2010–2021), Informatics Med. Unlocked 27 (2021), 100799. doi:10.1016/j.imu.2021.100799.

15. and Latch E.K., Nonrandom missing data can bias Principal Component Analysis inference of population genetic structure, (2021), pp. 1–10. doi:10.1111/1755-0998.13498.

16. Zebari R.R., Abdulazeez A.M. and Zeebaree, A Comprehensive Review of Dimensionality Reduction Techniques for Feature Selection and Feature Extraction, J. Appl. Sci. Technol. Trends 1 (2020), 56–70. doi:10.38094/jastt1224.

17. Van Wingerde, Van Ginkel, SPSS Syntax for Combining Results of Principal Component Analysis of Multiply Imputed Data Sets using Generalized Procrustes Analysis, Appl. Psychol. Meas. 45 (2021), 231–232. doi:10.1177/0146621621990757.

18. Sharaff and Harshil, Extra-Tree Classifier with Metaheuristics, Advances in Computer Communication and Computational Sciences (2019), 189–197. doi:10.1007/978-981-13-6861-5_17.

19. Alfian, et al. Predicting Breast Cancer from Risk Factors Using SVM and Extra-Trees-Based Feature Selection Method, Computers 11 (2022), 136. [Online]. Available: http://dx.doi.org/10.3390/computers11090136.

20. Lee D.G., et al. Data-Driven Prediction of Fatigue in Parkinson’s Disease Patients, Front. Artif. Intell. 4 (2021), 678678–678688. doi:10.3389/frai.2021.678678.

21. Danaei and Huseyin, Diagnosis of polycystic ovary syndrome through different machine learning and feature selection techniques, Health Technol. (Berl). 12 (2022), 137–150. doi:10.1007/s12553-021-00613-y.

22. Lamb W.F., et al. Predictive Systems: Role of Feature Selection in Prediction of Heart Disease, In: Journal of Physics: Conference Series (2019), 0–6. doi:10.1088/1742-6596/1372/1/012074.

23. Baby, Devaraj S.J., Hemanth, A.R.A.J.M.M, Leukocyte classification based on feature selection using extra trees classifier: A transfer learning approach, Turkish Journal of Electrical Engineering and Computer Sciences 29 (2021). doi:10.3906/elk-2104-183.

24. Pooja, Rajneesh and Anurag, Coronary artery disease diagnosis using extra tree-support vector machine: ET-SVMRBF, Int. J. Comput. Appl. Technol. 66 (2021), 219–218. [Online]. Available: http://dx.doi.org/10.1504/ijcat.2021.10043464.

25. Kumar, Pal and Kumar, Comparison of skin disease prediction by feature selection using ensemble data mining techniques, Informatics Med. Unlocked 16 (2019), 100202. doi:10.1016/j.imu.2019.100202.

26. Uma and Santhoshkumar, Analysis of Suitable Machine Learning Imputation Techniques for Arthritis Profile Data, IETE J. Res. (2022), 1–22. doi:10.1080/03772063.2022.2120914.

27. https://www.kaggle.com/datasets/santhoshkumarsundar/arthritis-profile-dataset-apddataset. https://doi.org/10.34740/KAGGLE/DSV/7717987.

28. Kabir M.F., Chen and Ludwig S.A., A performance analysis of dimensionality reduction algorithms in machine learning models for cancer prediction, Healthc. Anal. 3 (2023), 100125. doi:10.1016/j.health.2022.100125.

29. Song, Guo and Mei, Feature selection using principal component analysis, Proc. - 2010 Int. Conf. Syst. Sci. Eng. Des. Manuf. Informatiz. ICSEM 2010 1 (2010), 27–30. doi:10.1109/ICSEM.2010.14.

30. Asa, et al. Advanced machine learning techniques for cardiovascular disease early detection and diagnosis, Multimed. Tools Appl. 24 (2023), 1–29. doi:10.1186/s12859-023-05300-5.

31. Pagliaro, Forecasting Significant Stock Market Price Changes Using Machine Learning: Extra Trees Classifier Leads, Electronics 12 (2023), 1–23. doi:10.3390/electronics12214551.

32. Islam, et al. Predicting the risk of diabetes retinopathy using explainable machine learning algorithms, Diabetes Metab. Syndr. Clin. Res. Rev. (2023), 102919. doi:10.1016/j.dsx.2023.102919.

33. Singh, Pal and Dahiya A.K., Classification of Power Quality Disturbances using Linear Discriminant Analysis, Appl. Soft Comput. 138 (2023), 110181. doi:10.1016/j.asoc.2023.110181.

34. Uma and Santhoshkumar, Benchmark Datasets and Real-time Autoimmune Disease Dataset Analysis Using Machine Learning Algorithms with Implementation, Analysis and Results, J. Intell. Fuzzy Syst. (2023), 1–15. doi:10.3233/JIFS-224115.

35. Omuya E.O., Okeyo and Kimwele, Sentiment analysis on social media tweets using dimensionality reduction and natural language processing, Eng. Reports 5 (2023), 1–14. doi:10.1002/eng2.12579.

36. Majid, Gulzar, Ayoub and Khan, Using Ensemble Learning and Advanced Data Mining Techniques to Improve the Diagnosis of Chronic Kidney Disease, Int. J. Adv. Comput. Sci. Appl. 14 (2023), 470–480. doi:10.14569/IJACSA.2023.0141050.

37. Dalle-Donne, Rossi, Colombo, Giustarini and Milzani, Biomarkers of oxidative damage in human disease, Clin. Chem. 52 (2006), 601–623. doi:10.1373/clinchem.2005.061408.

38. Ramasamy and Santhoshkumar, A Work Review on Clinical Laboratory Data Utilizing Machine Learning Use-Case Methodology, J. Intell. Med. Healthc. 2 (2024), 1–14. doi:10.32604/jimh.2023.046995.

39. Ghatasheh, Altaharwa and Aldebei, Modified Genetic Algorithm for Feature Selection and Hyperparameter Optimization: Case of XGBoost in Spam Prediction, IEEE Access 10 (2022), 84365–84383. doi:10.1109/ACCESS.2022.3196905.

40. Tax and Duin, Feature scaling in support vector data description, Proc. ASCI (2002), 95–102.

41. Malan, Smuts C.M., Baumgartner and Ricci, Missing data imputation via the expectation-maximization algorithm can improve principal component analysis aimed at deriving biomarker profiles and dietary patterns, Nutr. Res. 75 (2020), 67–76. doi:10.1016/j.nutres.2020.01.001.

42. Somasundaram R.S. and Nedunchezhian, Evaluation of Three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values, Int. J. Comput. Appl. 21 (2011), 14–19. doi:10.5120/2619-3544.

43. Al-Tawil, Mahafzah B.A., Al Tawil and Aljarah, Bio-Inspired Machine Learning Approach to Type 2 Diabetes Detection, Symmetry (Basel). 15 (2023), 1–16. doi:10.3390/sym15030764.

44. Hakak, Alazab, Khan and Reddy, An ensemble machine learning approach through effective feature extraction to classify fake news, Futur. Gener. Comput. Syst. 117 (2021), 47–58. doi:10.1016/j.future.2020.11.022.

45. Subbiah, Anbananthen K.S.M., Thangaraj, Kannan and Chelliah, Intrusion detection technique in wireless sensor network using grid search random forest with Boruta feature selection algorithm, J. Commun. Networks 24 (2022), 264–273. doi:10.23919/jcn.2022.000002.

46. Maxwell A.E., et al. Implementation of machine-learning classification in remote sensing: An applied review, Int. J. Remote Sens. 39 (2018), 2784–2817. doi:10.1080/01431161.2018.1433343.

47. Baghdadi N.A., Farghaly Abdelaliem S.M., Malki, Gad, Ewis and Atlam, Advanced machine learning techniques for cardiovascular disease early detection and diagnosis, J. Big Data 10 (2023), 1–29. doi:10.1186/s40537-023-00817-1.

48. Chaurasia, Pandey and Pal, Chronic kidney disease: A prediction and comparison of ensemble and basic classifiers performance, Hum.-Intell. Syst. Integr. 4 (2022), 1–10.