Enhancing skin lesion classification using a Tri-Path Attention Stacked Ensemble architecture with Cohen’s Kappa Proportioned Averaging

Abstract

Objective

Early recognition of skin lesions, including diverse abnormalities and life-threatening skin cancers, is critical for effective treatment and improved clinical outcomes. However, existing skin lesion datasets exhibit significant class imbalance, and there is no standardized guideline for optimal data augmentation strategies. This study aims to establish a robust and interpretable framework that addresses these limitations while enhancing diagnostic performance.

Methods

We propose a novel transfer learning-based framework termed Tri-Path Attention Stacked Ensemble (TASE), which integrates multiple EfficientNetV2 backbones through three distinct stacking strategies: TASE: Independent TA, TASE: Serial Stacked TA, and TASE: Parallel Stacked TA. Here, TA refers to the Triple-Attention mechanism, comprising soft attention integration, channel attention integration, and squeeze-excitation attention integration. To optimize ensemble prediction fusion, we introduce an advanced aggregation method—Cohen’s Kappa Proportioned Averaging (CKPA)—which is further extended into a Multi-Layer CKPA (ML-CKPA) framework to enhance weight distribution across hierarchical model outputs. Additionally, four augmentation strategies were systematically evaluated to determine the most effective ensemble configuration.

Results

Experimental validation on the HAM10000 dataset demonstrated that the proposed framework achieved a superior accuracy of 94.44%, outperforming several state-of-the-art methods. Grad-CAM visualizations were employed to enhance interpretability by highlighting lesion-relevant regions, thereby improving model transparency and reliability.

Conclusion

The proposed TASE framework delivers enhanced diagnostic accuracy while effectively mitigating challenges related to class imbalance, dataset variability, and computational efficiency. By combining hierarchical triple-attention mechanisms with multi-layer ensemble weighting, it offers a reliable and interpretable solution for early and precise skin lesion classification, supporting real-world dermatological applications and improved patient care.

Keywords

Skin lesion classification, Tri-Path Attention Stacked Ensemble (TASE), Cohen’s Kappa Proportioned Averaging (CKPA), Triple-Attention (TA), Augmentation, Gradient Class Activation Map (Grad-CAM)

Introduction

Skin lesions refer to abnormal changes in the skin’s structure or visual appearance and are linked to a broad spectrum of dermatological conditions. These conditions range from minor issues, such as acne, to severe and potentially fatal diseases like skin cancer. Although skin disorders manifest in various forms, they are not defined solely by the presence of lesions. Such abnormalities may arise from multiple causes, including infections, inflammatory reactions, allergic responses, malignancies, insect bites, trauma, autoimmune disorders, genetic predispositions, environmental influences, vascular irregularities, warts, and cysts.¹ Based on clinical severity, skin lesions are generally categorized into two principal types. Benign lesions, including moles, skin tags, warts, seborrheic keratoses, and hemangiomas, are typically non-cancerous and pose minimal medical risk. In contrast, malignant lesions, such as basal cell carcinoma, squamous cell carcinoma, and melanoma, are cancerous and possess the ability to metastasize, representing a significant threat to human health.²

Accurate detection and timely management of skin disorders have traditionally relied on clinical examination and diagnostic procedures. Delayed diagnosis or neglect of symptoms can result in serious outcomes, particularly in the case of skin cancer, which remains one of the most prevalent cancers worldwide. Although melanoma is less common than other forms of skin cancer, it accounts for the majority of skin cancer-related deaths.³ Recent statistics indicate that approximately 2.2% of individuals may develop melanoma during their lifetime, with nearly 97,610 new cases and 7,990 deaths reported in the United States in 2023. Moreover, more than 1.4 million people in the U.S. were living with melanoma, highlighting its considerable public health burden.⁴

Early identification plays a crucial role in preventing skin lesions from progressing to more advanced and life-threatening stages. However, many individuals remain unaware of underlying abnormalities due to the cost and complexity associated with conventional diagnostic procedures. Dermatoscopy, a non-invasive imaging technique that employs magnification and illumination, assists clinicians in evaluating suspicious lesions and supports early cancer detection. Despite its clinical value, the diagnostic accuracy of dermatoscopy is highly dependent on practitioner expertise, thereby introducing the possibility of human error.⁵

Artificial intelligence (AI), particularly through machine learning (ML) and deep learning (DL) paradigms, has demonstrated substantial potential in automating the analysis of medical images for skin lesion detection. These approaches enable rapid and precise interpretation of dermoscopic images, facilitating early diagnosis and improved treatment outcomes. Nevertheless, significant challenges persist. Many existing techniques exhibit bias toward classes with abundant training samples, struggle to extract high-level representations from transfer learning (TL) models without adequate fine-tuning, and encounter difficulties when integrating multiple architectures effectively. Additionally, limited interpretability and biases resulting from overlapping validation and testing datasets restrict their clinical applicability. Popular TL architectures such as DenseNet and ResNet also face constraints related to fixed scaling parameters, manual architectural configurations, and considerable computational demands, limiting adaptability and efficiency in resource-constrained environments.⁶

To address these limitations, researchers have explored convolutional neural networks (CNNs) combined with ensemble learning strategies. While these methods attempt to overcome the weaknesses of individual models, traditional ensemble approaches—such as majority voting, softmax averaging, and conventional weighted averaging—often fail to account for the relative contribution of each predictor, leading to suboptimal results. Furthermore, reliance solely on post-prediction ensembling may be inadequate when handling highly variable images, as no single model consistently achieves accurate classification. This limitation underscores the importance of incorporating pre-prediction stacking mechanisms to enhance feature representation and improve overall predictive robustness.

Our proposed methodology was systematically designed to confront the aforementioned challenges in skin lesion detection and to address the following research questions (RQs), which guided the architectural development of our framework:

RQ1

How can severe class imbalance be effectively reduced, and which strategy yields optimal generalization performance?

- The pronounced imbalance in skin lesion datasets often biases models toward dominant classes. Although data augmentation and generative adversarial networks (GANs) offer potential solutions, identifying the most reliable augmentation strategy for unseen data remains a key objective.⁷

RQ2

How can Transfer Learning (TL) models be efficiently adapted for domain-specific tasks?

- Selecting and fine-tuning the most suitable TL architecture is challenging, particularly when pretrained models (e.g., ImageNet-based models) possess fixed structures that may not align with specialized dermatological datasets. Effective customization is essential for achieving superior performance.⁸

RQ3

Which techniques can accurately highlight the most informative regions within dermoscopic images?

- Since not all image regions contribute equally to classification, emphasizing critical areas while suppressing redundant information is vital for enhancing predictive accuracy.⁹

RQ4

Is reliance on a single algorithm sufficient, or is Ensemble Learning (EL) required? If so, which ensemble strategy is most effective?

- A single model may be insufficient for handling complex image distributions. Although EL improves stability and performance, conventional aggregation strategies often lack dynamic weighting mechanisms, reducing their effectiveness.¹⁰

RQ5

What are the limitations of post-prediction ensembling alone, and how can pre-prediction stacking improve model robustness?

- Post-prediction fusion may struggle with high-variance samples where no individual model consistently performs well. Integrating a pre-prediction stacking approach strengthens feature learning and enhances classification reliability.¹¹

These research questions shaped the foundation of our work and led to the following key contributions:

• Handling class imbalance: A structured augmentation framework was introduced and evaluated using four strategies—No Augmentation (NA), Prior Augmentation (PiA), Training Data Augmentation (TA), and Posterior Augmentation (AP). The most effective approach was selected based on performance on unseen samples to ensure balanced representation and reduced bias.

• Refining TL architectures: EfficientNetV2 backbones were employed for their scalability and computational efficiency. These models were enhanced with additional layers and innovative modules to construct the “Tri-Path Attention Stacked Ensemble (TASE)” framework, capable of capturing both shallow and deep representations.

• Emphasizing critical features: Triple-Attention (TA) mechanisms, including Soft Attention Integration (SAI), Channel Attention Integration (CAI), and Squeeze-Excitation Attention Integration (SEAI), were incorporated to focus on the most informative regions of the input images.

• Pre-prediction stacking design: Three stacking configurations (TASE: Independent TA, TASE: Serial Stacked TA, and TASE: Parallel Stacked TA) were implemented prior to training to fuse extracted features and uncover discriminative patterns.

• Advanced ensemble aggregation: The Cohen’s Kappa Proportioned Averaging (CKPA) method was proposed to dynamically assign optimal prediction weights across models, enhancing stability and generalization. Its extended version, Multi-Layer CKPA (ML-CKPA), further leveraged multi-level predictions to improve overall accuracy.

• Improved interpretability: Gradient Class Activation Maps (Grad-CAM) were integrated to visualize lesion-relevant regions, increasing model transparency and supporting clinical decision-making.

The remainder of this paper is structured as follows. Section 2 reviews related work to contextualize the study. Section 3 details the proposed framework and experimental design. Section 4 presents and evaluates the performance outcomes. Section 5 discusses practical implications and potential enhancements. Finally, the concluding section summarizes the principal findings and contributions of the research.

Literature review

Skin lesion classification has attracted substantial interest in medical imaging and artificial intelligence research. Although significant progress has been made, persistent challenges such as class imbalance, limited dataset-specific adaptation, and suboptimal integration of attention mechanisms continue to affect performance. This section critically examines prior studies, outlining their contributions and limitations to contextualize the proposed framework.

Wang et al.¹² employed DenseNet-121 and VGG-16 to extract multiscale features, achieving an accuracy of 91.24%. However, the lack of dataset-specific fine-tuning reduced adaptability to domain-specific variations. Mahbod et al.¹³ investigated the impact of image resolution on transfer learning-based classification and reported a balanced accuracy of 86.2%, though the increased computational cost limited real-time applicability.

Transfer learning (TL) remains one of the most widely adopted strategies for skin lesion analysis. Tajerian et al.¹⁴ utilized EfficientNet-B1 and achieved 84.30% accuracy, demonstrating its capability in detecting pigmented lesions. Nonetheless, dependence on generalized pretrained features constrained dataset-specific optimization. Similarly, Hosny et al.¹⁵ applied AlexNet for melanoma and nevus classification, reporting high accuracy but without integrating attention mechanisms to enhance discriminative feature extraction. Popescu et al.¹⁶ combined TL with collective intelligence, reaching 86.71% accuracy; however, the absence of validation on an independent test set raised concerns regarding generalization.

Hybrid models integrating convolutional neural networks (CNNs) with transformer architectures have shown promise in capturing both local and global contextual information. Khan et al.¹⁷ introduced SkinViT, merging outlook attention with transformers and achieving 91.09% accuracy. Despite strong performance, its high computational complexity limited scalability. Dong et al.¹⁸ proposed TC-Net, effectively combining CNN and transformer features to enhance segmentation, though model complexity hindered practical deployment. Nie et al.¹⁹ developed a hybrid CNN-transformer framework with focal loss, achieving 89.48% accuracy, but the approach struggled to extract deeper representations for more challenging cases.

Attention mechanisms have increasingly been adopted to emphasize critical image regions. Singh et al.²⁰ integrated Bayesian MultiResUNet with DenseNet-169 for segmentation and classification, attaining 86.67% accuracy, yet encountered difficulties in handling complex lesion patterns. Khan et al.²¹ proposed an entropy-optimized attention module within a deep learning framework, achieving over 90% accuracy, though robustness on independent datasets was not thoroughly validated. Saarela and Georgieva²² improved interpretability using Bayesian inference, achieving 80% accuracy, but their classification performance lagged behind competing approaches.

Nidhi et al.²³ and Abir et al.²⁴ employed the PAD-UFES-20 dataset for lesion classification using a single transfer learning approach without incorporating ensemble techniques. Similarly, Ahmmed et al.²⁵ adopted the same strategy with the PH2 dataset.

Additional attention-based strategies have been explored to improve discriminative capability. Nguyen et al.²⁶ incorporated deep learning with soft attention integration, reporting 90% and 86% accuracy across different models, though comparative evaluation with alternative attention mechanisms was not conducted. Datta et al.²⁷ implemented soft attention and achieved 93.4% accuracy; however, challenges in optimizing color channel weighting limited generalization performance.

Ensemble learning (EL) approaches have also been investigated to enhance predictive robustness. Gouda et al.²⁸ improved image quality using ESRGAN prior to classification, achieving 83.2% accuracy, but persistent class imbalance remained unresolved. Ajmal et al.²⁹ applied fuzzy entropy optimization within an ensemble framework, demonstrating strong performance on HAM10000 and ISIC 2018 datasets; nevertheless, high computational requirements and limited real-world validation reduced applicability. Rahman et al.³⁰ combined five deep networks into an ensemble, achieving 88% accuracy, though dataset-specific optimization was not incorporated.

Studies³¹⁻³⁴ utilized augmentation techniques on the ISIC 2017–2020 datasets in conjunction with transfer learning; however, they did not investigate the use of ensemble methods.

Data augmentation has played a vital role in mitigating imbalance. Sun et al.³⁵ utilized augmented datasets with supplementary metadata, attaining 89.5% accuracy, yet the augmentation methodology lacked sufficient transparency for reproducibility.

Studies³⁶⁻⁴⁸ have also analyzed related machine learning and deep learning approaches across various domains, including medical image analysis and classification tasks.

Despite these advancements, many existing approaches remain constrained by limited dataset-specific fine-tuning, insufficient independent validation, computational inefficiency, and suboptimal ensemble weighting strategies. Conventional ensemble techniques frequently fail to dynamically assign appropriate weights to individual predictors, thereby limiting overall effectiveness.

Motivated by these gaps, our study introduces a comprehensive framework aimed at overcoming current limitations. An effective augmentation strategy is first identified to address class imbalance. Triple-Attention mechanisms are incorporated within serial, parallel, and independent stacking configurations to enhance feature extraction and emphasize lesion-relevant regions. Transfer learning models are carefully fine-tuned to capture skin-specific characteristics rather than relying solely on generalized ImageNet representations. Furthermore, a dynamic ensemble strategy based on Cohen’s Kappa Proportioned Averaging (CKPA) is introduced to compute optimal prediction weights, ensuring consistent and robust performance across diverse datasets. Collectively, these contributions advance the reliability, interpretability, and practical applicability of automated skin lesion classification systems.

Materials and methods

Dataset description

This study utilized a publicly available dermatoscopic dataset to ensure a comprehensive and diverse evaluation of skin lesion classification performance.

The dataset, Human Against Machine with 10000 training images (HAM10000), was obtained from the Harvard Dataverse repository.⁴⁹ It comprises 10,015 carefully curated dermatoscopic images in JPG format, categorized into seven distinct classes.

The seven lesion categories included in the dataset are Melanoma (MEL), Nevus (NV), Vascular Lesions (VASC), Actinic Keratosis (AK), Basal Cell Carcinoma (BCC), Benign Keratosis (BKL), and Dermatofibroma (DF). Among these categories, MEL, AK, and BCC are classified as malignant lesions, whereas NV, BKL, and DF are considered benign. Certain types of VASC may also demonstrate malignant characteristics.

Table 1 summarizes the dataset distribution, providing a concise overview of the data composition used in this research.

Table 1.

Brief information of the HAM10000 dataset.

Images	Format	Classes	Source
10015	JPG	7	Harvard Dataverse

Figure 1 presents representative examples from each class, illustrating one sample per category. The pronounced class imbalance within the dataset is further demonstrated through the class distribution visualization in Figure 2.

Figure 1.

Sample images from the HAM10000 dataset.

Figure 2.

Sample distribution for each class in the HAM10000 dataset.

The dataset was carefully preprocessed to meet the requirements of our study. Additional details regarding the exact versions used can be found in HAM10000.⁵⁰

Methodological approach

The methodological framework of this study started with dataset acquisition, followed by comprehensive data preprocessing. The dataset was subsequently divided into two primary subsets: a main training set and an independent testing set. The independent testing set was completely held out during training and validation, providing truly unseen data for final evaluation.

To mitigate class imbalance, four distinct data augmentation strategies were employed:

• No Augmentation (NA): Only the original dataset was used, without generating any synthetic images.

• Prior Augmentation (PiA): Synthetic images were created prior to data splitting, which could result in overlap, where both original and augmented images from the same source might appear in training, validation, and testing sets.

• Training Data Augmentation (TA): Augmentation was applied solely to the training data, keeping validation and testing sets independent and unchanged.

• Posterior Augmentation (AP): Each subset—training, validation, and testing—was augmented after splitting, increasing the dataset size across all partitions.
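The four strategies differ only in where augmentation sits relative to the split. The following Python sketch illustrates that ordering on toy image identifiers. The helper names (`augment`, `split`, `build`, `leaks`), the two synthetic copies per image, and the 70:15:15 random split are illustrative assumptions, not the authors' implementation.

```python
import random

def augment(items, copies=2):
    """Return each item followed by `copies` synthetic variants tagged with their source."""
    out = []
    for src in items:
        out.append(src)
        out.extend(f"{src}_aug{k}" for k in range(copies))
    return out

def split(items, seed=0):
    """Shuffle and split into train/val/test at roughly 70:15:15."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(0.70 * len(items))
    n_val = int(0.15 * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

def source(item):
    """Recover the original image identifier of an (augmented) item."""
    return item.split("_aug")[0]

def build(strategy, images, seed=0):
    """Apply one of the four strategies: NA, PiA, TA, or AP."""
    if strategy == "PiA":            # augment first, then split
        return split(augment(images), seed)
    train, val, test = split(images, seed)
    if strategy == "TA":             # augment the training subset only
        train = augment(train)
    elif strategy == "AP":           # augment every subset after splitting
        train, val, test = augment(train), augment(val), augment(test)
    elif strategy != "NA":
        raise ValueError(f"unknown strategy: {strategy}")
    return train, val, test

def leaks(train, val, test):
    """True if any validation/test item shares its source image with the training set."""
    train_sources = {source(x) for x in train}
    return any(source(x) in train_sources for x in val + test)
```

In this toy setting, only PiA can place an original image and its synthetic variants in different subsets, which is the overlap described in the list above.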

The most effective augmentation strategy was identified by training a customized network based on EfficientNetV2 variants, followed by evaluation on the independent testing set to determine performance on entirely unseen data.

Next, the data was processed within the Tri-Path Attention Stacked Ensemble (TASE) framework. TASE combined architectures trained on the training set and validated on the validation set. It incorporated models built around the Triple-Attention (TA) mechanism, which comprises Soft Attention Integration, Channel Attention Integration, and Squeeze-Excitation Attention Integration, arranged in three configurations: TASE: Independent TA, TASE: Serial Stacked TA, and TASE: Parallel Stacked TA.

Predictions from each model were then fused using the Multi-Layer Cohen’s Kappa Proportioned Averaging (ML-CKPA) method, applied across multiple layers to boost performance. This ensemble technique enabled optimal weighting of predictions and improved generalization.
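The exact CKPA weighting formula is not given in this part of the paper. As a hedged illustration of the general idea, the sketch below assumes that each model's weight is its Cohen's kappa on held-out data divided by the sum of kappas across models, and that the fused prediction is the weighted average of the models' class probabilities. Clipping negative kappas to zero and the function names are assumptions made for this example only.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 0.0  # degenerate case: kappa is undefined, treated as no agreement
    return (p_o - p_e) / (1.0 - p_e)

def kappa_weights(kappas):
    """Assumed rule: weights proportional to (non-negative) kappa scores."""
    clipped = [max(k, 0.0) for k in kappas]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(kappas)] * len(kappas)  # fall back to a plain average
    return [k / total for k in clipped]

def fuse(prob_lists, weights):
    """prob_lists[m][i][c] is model m's probability for sample i and class c."""
    n_samples, n_classes = len(prob_lists[0]), len(prob_lists[0][0])
    return [
        [sum(w * probs[i][c] for w, probs in zip(weights, prob_lists))
         for c in range(n_classes)]
        for i in range(n_samples)
    ]
```

Under the same assumption, a multi-layer variant could apply `fuse` first within each group of models and then again across the fused group outputs, recomputing the kappa weights at each level.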

For interpretability, Grad-CAM visualizations were employed, providing insights into model behavior by highlighting critical regions of the input images. A schematic of the sequential steps in this methodology is presented in Figure 3.

Figure 3.

Sequential representation of methodology.
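For readers unfamiliar with Grad-CAM, the heatmap computation can be sketched in a few lines. The example follows the standard Grad-CAM formulation: each feature map is weighted by the spatial mean of the gradient of the class score with respect to that map, and the weighted sum is passed through ReLU. Plain nested lists stand in for tensors, and the function name is hypothetical; the paper's own implementation is not shown in this section.

```python
def grad_cam(activations, gradients):
    """activations[k][i][j]: feature map k at position (i, j).
    gradients[k][i][j]: gradient of the target class score w.r.t. that activation."""
    n_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # Importance weight per feature map: global average of its gradients.
    alphas = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(n_maps)
    ]
    # Weighted combination of feature maps; ReLU keeps positive evidence only.
    return [
        [max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(n_maps)))
         for j in range(width)]
        for i in range(height)
    ]
```

In practice the resulting map is upsampled to the input resolution and overlaid on the dermoscopic image.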

Preprocessing and data augmentation

To prepare the dataset for effective training, images were first grouped according to their lesion IDs to ensure proper organization at the lesion level. This grouping strategy was explicitly used to enforce lesion-level separation during dataset splitting, preventing any images from the same lesion appearing across different subsets. Careful sampling was then conducted to create distinct subsets for training, validation, and testing. Specifically, 15% of the images were allocated to the independent testing set, while the remaining 85% formed the primary training set. The independent testing set was strictly separated prior to any augmentation process and was fully preserved as unseen data for final evaluation.

In the adopted TA strategy, data augmentation was applied strictly after the splitting process and only to the training set, while the validation and independent testing sets were kept entirely unchanged. Combined with the lesion-level separation described above, this prevents data leakage and ensures a fair and reliable evaluation of the proposed models.
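A lesion-level hold-out split of this kind can be sketched as follows. The function groups images by lesion ID and assigns whole lesions to the independent testing set until roughly 15% of the images are held out. The record layout, the seed, and the greedy fill are assumptions for illustration; class stratification, which a real pipeline would likely need, is not shown.

```python
import random
from collections import defaultdict

def lesion_level_split(records, test_fraction=0.15, seed=42):
    """records: iterable of (image_id, lesion_id) pairs.
    Returns (train_images, test_images) with no lesion shared between them."""
    by_lesion = defaultdict(list)
    for image_id, lesion_id in records:
        by_lesion[lesion_id].append(image_id)

    lesion_ids = sorted(by_lesion)
    random.Random(seed).shuffle(lesion_ids)

    target = test_fraction * sum(len(v) for v in by_lesion.values())
    train, test = [], []
    for lesion_id in lesion_ids:
        # Whole lesions go to the test set until the image quota is reached.
        bucket = test if len(test) < target else train
        bucket.extend(by_lesion[lesion_id])
    return train, test
```

Because lesions are assigned as indivisible groups, the held-out share only approximates the requested fraction.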

Figure 4 depicts the four data augmentation strategies (NA, PiA, TA, and AP) implemented to address class imbalance.

Figure 4.

Illustration of four data augmentation strategies.

To address class imbalance, roughly 8,000 synthetic images were generated for each class. The primary training dataset was subsequently split into training, validation, and testing subsets in a 70:15:15 ratio, respectively.

Augmentation was carried out using TensorFlow’s ImageDataGenerator, following a comprehensive strategy to increase dataset diversity and enhance model generalization. The process began with contrast enhancement of the original images to improve visual clarity. Various transformations were then applied, including random rotations up to 180 degrees, horizontal and vertical flips, width and height shifts of up to 10%, and zoom adjustments within a 10% range. To fill gaps resulting from these transformations, the nearest neighbor fill mode was utilized, ensuring consistency across generated images. This augmentation strategy simulated a wide variety of variations, effectively improving the robustness of the deep learning model.
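The transformations listed above map onto `ImageDataGenerator` keyword arguments roughly as shown below. This is a sketch under stated assumptions: the contrast-enhancement step is omitted because its method is not specified here, and the names `AUGMENTATION_KWARGS` and `make_generator` are hypothetical.

```python
# Keyword arguments mirroring the transformations described in the text.
AUGMENTATION_KWARGS = dict(
    rotation_range=180,        # random rotations of up to 180 degrees
    horizontal_flip=True,
    vertical_flip=True,
    width_shift_range=0.10,    # horizontal shift of up to 10% of the width
    height_shift_range=0.10,   # vertical shift of up to 10% of the height
    zoom_range=0.10,           # zoom within a 10% range
    fill_mode="nearest",       # nearest-neighbor fill for exposed pixels
)

def make_generator():
    """Build the generator; TensorFlow is imported lazily so the settings
    can be inspected without it installed."""
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    return ImageDataGenerator(**AUGMENTATION_KWARGS)
```

Recent TensorFlow documentation marks `ImageDataGenerator` as deprecated in favor of Keras preprocessing layers, so newer code may prefer those while keeping the same parameter ranges.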

Figure 5 presents examples of original, contrast-enhanced, and augmented images, illustrating a sample from the Actinic Keratosis (AK) class along with its augmented variants.

Figure 5.

Images of the augmented samples.

Tables 5 and 6 compare all augmentation strategies on both the testing and independent testing data. Accordingly, all primary performance comparisons and conclusions in this study are drawn based on results obtained using the TA strategy.

Development of Tri-Path Attention Stacked Ensemble (TASE) architectures

The TASE framework utilized customized EfficientNetV2 models, fully leveraging Transfer Learning. Specifically, seven pre-trained architectures, including various EfficientNetV2 variants with input dimensions of 299x299x3 and 224x224x3, were fine-tuned. Since these models were originally trained on unrelated datasets, fine-tuning allowed adaptation to our dataset, enabling the extraction of both shallow and deep features effectively. To further improve performance, Triple-Attention (TA) was incorporated in three configurations: Serial Stacked, Parallel Stacked, and Independent Attention. A schematic of the complete architecture is illustrated in Figure 6.

Figure 6.

Overview of the TASE architecture.

The integration process started by importing pre-trained models from the TensorFlow library and adapting them to match our specific input dimensions. Outputs were reshaped into a three-dimensional feature map per sample, giving a batched tensor of shape (None, height, width, channels), to align the custom architecture with the pre-trained models for smooth feature extraction.

Three customized CNN architectures incorporating TA were developed:

1. Soft Attention Integrated Network (SAIN): Targeted fine-grained spatial features.

2. Channel Attention Integrated Network (CAIN): Enhanced feature representation by emphasizing significant channels.

3. Squeeze-Excitation Attention Integrated Network (SEAIN): Calibrated channel-wise responses to capture hierarchical features more effectively.
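To make the channel recalibration idea concrete, the sketch below implements a generic squeeze-and-excitation step on a small feature map using plain Python lists. It follows the commonly used formulation (global average pooling, a bottleneck dense layer with ReLU, a dense layer with sigmoid, then channel-wise scaling). The weight matrices, the omission of biases, and the function name are assumptions; the exact SEAIN block may differ.

```python
import math

def squeeze_excite(feature_map, w1, w2):
    """feature_map[h][w][c]; w1 has shape C x R and w2 has shape R x C
    (biases omitted for brevity). Returns (scaled_map, channel_gates)."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    reduced = len(w1[0])

    # Squeeze: global average pooling yields one descriptor per channel.
    squeezed = [
        sum(feature_map[i][j][c] for i in range(height) for j in range(width))
        / (height * width)
        for c in range(channels)
    ]
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = [
        max(0.0, sum(squeezed[c] * w1[c][r] for c in range(channels)))
        for r in range(reduced)
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(hidden[r] * w2[r][c] for r in range(reduced))))
        for c in range(channels)
    ]
    # Scale: multiply every spatial position by its channel gate.
    scaled = [
        [[feature_map[i][j][c] * gates[c] for c in range(channels)]
         for j in range(width)]
        for i in range(height)
    ]
    return scaled, gates
```

Channels whose gate is close to 1 pass through nearly unchanged, while channels with a gate near 0 are suppressed.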

The TA modules were selectively integrated into these networks. For SAIN and SEAIN, TA modules were inserted after each convolutional block, while CAIN incorporated channel attention after every Conv2D layer. This strategic placement balanced computational efficiency and ensured meaningful feature enhancement.

The convolutional backbone consisted of two convolutional blocks, each containing four Conv2D layers with kernels of varying sizes (7x7, 5x5, 3x3, 1x1). The first block used 128 filters, and the second block used 256 filters, with BatchNormalization and MaxPooling2D layers applied for feature refinement and dimensionality reduction. All convolutional layers utilized ReLU activation to mitigate vanishing gradient issues.

The three TASE configurations—Serial Stacked, Parallel Stacked, and Independent Attention—are described as follows.

TASE: Serial stacked attention network

In the Serial configuration, outputs from SAIN, CAIN, and SEAIN were integrated in a sequential manner. Following the reshaping of the pre-trained model’s output tensor, the SAIN network processed the features first, followed by CAIN, and finally SEAIN. Each network further refined the features extracted by the preceding one, producing progressively enhanced representations. These features were flattened into a one-dimensional tensor and fed through three fully connected layers with 256, 128, and 7 units, the last matching the number of classes. ReLU activation was applied to the first two layers, while the final layer used softmax to produce class probabilities. Dropout layers with rates of 35% and 25% were included after the first two dense layers, respectively, to reduce overfitting.

TASE: Parallel stacked attention network

In the Parallel configuration, outputs from SAIN, CAIN, and SEAIN were computed concurrently. Each network independently processed the reshaped pre-trained output, extracting features in parallel. The resulting feature maps were then concatenated to merge complementary information from all attention mechanisms. The combined tensor was flattened and passed through the same fully connected layers and dropout setup as in the Serial configuration. This design facilitated the integration of diverse feature representations, enhancing model generalization.

TASE: Independent Attention network

In the Independent configuration, SAIN, CAIN, and SEAIN functioned completely independently. Each network extracted features separately from the reshaped pre-trained model output. The outputs were flattened into one-dimensional tensors and passed through their respective fully connected layers. Each network generated its own predictions, maintaining independence of the extracted features. This setup allowed each attention mechanism to focus solely on its specialized feature extraction, which could later be combined during ensemble evaluation.
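The difference between the three configurations is purely one of data flow, which the following sketch makes explicit. Each attention network is represented by an arbitrary callable on a flat feature list; the toy blocks used in the usage note are placeholders and do not model SAIN, CAIN, or SEAIN themselves.

```python
def serial(features, blocks):
    """Serial stacking: each block refines the output of the previous one."""
    for block in blocks:
        features = block(features)
    return features

def parallel(features, blocks):
    """Parallel stacking: every block sees the same input; outputs are concatenated."""
    merged = []
    for block in blocks:
        merged.extend(block(features))
    return merged

def independent(features, blocks):
    """Independent: every block keeps its own output for its own classifier head."""
    return [block(features) for block in blocks]
```

With three placeholder blocks that double, increment, and square their input, `serial([1, 2], blocks)` chains them into `[9, 25]`, `parallel` returns the six concatenated values, and `independent` returns three separate lists that would only be combined later at the ensemble stage.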

The study utilized seven pretrained EfficientNetV2 variants, each fine-tuned independently. Specifically, every model was trained separately and consistently incorporated across all three proposed stacked architectures, Serial (SSA), Parallel (PSA), and Independent (ISA), without omission at any stage. During fine-tuning, only the final classification layer, which represents the number of target classes, was kept frozen, while all remaining layers were allowed to update their weights. A fixed number of epochs was not enforced; instead, a maximum of 120 epochs was set, and early stopping terminated training automatically once the model converged, limiting overfitting and unnecessary computation.
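The interaction between the 120-epoch cap and early stopping can be illustrated with a small simulation over a sequence of validation losses. The monitored quantity, the patience of 10 epochs, and `min_delta` are hypothetical values chosen for this sketch; the paper does not state them in this section.

```python
def train_with_early_stopping(val_losses, max_epochs=120, patience=10, min_delta=0.0):
    """Simulate early stopping over precomputed per-epoch validation losses.
    Returns (last_epoch_run, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # stopped before the cap
    return min(len(val_losses), max_epochs), best_epoch
```

In a Keras training loop the same behavior is usually obtained with the `EarlyStopping` callback and `epochs=120`.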

The careful design of these three TASE configurations ensured effective utilization of attention mechanisms, enabling robust feature extraction and improving model performance across varied input scenarios.

Justification of the proposed architecture and improvements over state-of-the-art methods

Most existing state-of-the-art methods are not specifically designed for skin lesion analysis; instead, they are primarily developed and pre-trained on large-scale datasets such as ImageNet. Therefore, the first key improvement of the proposed framework lies in its domain-specific fine-tuning, where the models are explicitly adapted for skin lesion classification tasks. Moreover, beyond fine-tuning, many existing approaches do not fully exploit advanced attention mechanisms. In contrast, this work incorporates three distinct attention modules—Channel Attention, Squeeze-and-Excitation Attention, and Soft Attention—each contributing uniquely to improving feature representation and model performance. These attention modules are further utilized in three different stacking strategies: Serial (SSA), Parallel (PSA), and Independent (ISA), enabling a more comprehensive and effective feature extraction process compared to conventional methods. Unlike typical state-of-the-art approaches that rely on a single model, this study recognizes that a single architecture may be insufficient for achieving high performance due to risks such as bias and overfitting. To address this limitation, multiple models are combined in both pre-training and post-training stages. Specifically, attention-based stacking is applied before training, while post-training integration is performed using the proposed Cohen’s Kappa Proportioned Averaging (CKPA) ensemble method. This multi-stage integration enhances robustness, reduces bias, and improves overall generalization performance.

Furthermore, each model within the proposed framework was trained using distinct strategies corresponding to the respective stacking architectures. For the Independent Stacked Architecture (ISA), each model was trained separately to ensure independent feature learning. In contrast, for the Serial Stacked Architecture (SSA), models were trained sequentially in a stacked manner, allowing progressive refinement of learned representations. For the Parallel Stacked Architecture (PSA), models were trained concurrently using a parallel stacking approach to capture diverse feature interactions. To ensure stable convergence and prevent overfitting, early stopping was employed during training. Although the maximum number of training epochs was set to 120, the training process was automatically terminated once convergence was achieved, resulting in efficient and optimized model performance.

Feature extraction process

In our methodology, the TASE models were employed for effective feature extraction. The top fully connected layers were excluded (include_top=False), and global average pooling was applied (pooling=’avg’). The resulting outputs were reshaped to optimized dimensions suitable for further processing with additional convolutional layers. These convolutional layers used a variety of filter sizes (7x7, 5x5, 3x3, and 1x1) and incorporated ReLU activation along with batch normalization to stabilize and enhance learning. Max pooling layers were then applied to reduce spatial dimensions, sharpening feature focus. Finally, the extracted feature maps were flattened and passed through fully connected layers with ReLU activation, concluding with a dense output layer using softmax activation to generate class probability predictions. The feature extraction process can be mathematically represented as:

F = ϕ (X; θ), F \in R^{H \times W \times C}

(1)

The final classification probabilities are computed using the softmax function:

{\hat{y}}_{i} = \frac{\exp (z_{i})}{\sum_{j = 1}^{K} \exp (z_{j})}

(2)

Feature visualization

Figure 7 illustrates the hierarchical feature extraction process within an Optimized InceptionV3 architecture. The figure shows activation maps at multiple stages of the transfer learning model, with each row corresponding to a different layer’s activations, providing a detailed view of the progressive transformation of input images:

• Input Layer (input_1): The preprocessed input image, representing raw pixel data.

• Zero Padding (zero_padding2d): Feature maps after zero padding, preparing tensors for subsequent convolutional operations.

• Convolution (conv2d): Activation maps obtained after applying 64 convolutional filters, highlighting learned edges and patterns.

• Batch Normalization (batch_normalization): Normalized feature maps to enhance convergence and training stability.

• ReLU Activation (activation): Non-linear activations via the ReLU function, enabling the recognition of complex patterns.

• Max Pooling (max_pooling2d): Downsampled feature maps to preserve key features while reducing spatial dimensions.

• Concatenation (concatenate): Merged feature maps from multiple layers, integrating multi-path information for richer representations.

• Dense Layer (dense): Converted feature maps into a vector form in preparation for classification.

• Output Layer (dense_1): Final activations, producing class probabilities through softmax.

Figure 7.

Feature extraction process illustrated by activation maps (sample visualization).

The visualization in Figure 7 illustrates up to five filters per layer using the viridis colormap, ensuring clarity and effective contrast. These activation maps provide a comprehensive view of how the model hierarchically processes input images, capturing critical features at each stage.

Demonstrated on a single sample and selected layers, this process highlights the systematic extraction of thousands of feature representations. These detailed features substantially enhanced overall model performance by providing deeper insights into how hierarchical patterns were captured across the architecture.

Triple-Attention (TA)

To improve the model’s ability to focus on important input features while minimizing less relevant information, we employed three complementary attention mechanisms, collectively called Triple-Attention (TA). This method integrates Channel Attention Integration (CAI), Squeeze-Excitation Attention Integration (SEAI), and Soft Attention Integration (SAI) to efficiently capture and emphasize critical patterns within the data.⁵¹

Soft Attention Integration (SAI)

The Soft Attention Integration (SAI) module emphasizes assigning attention weights to individual elements of the input, enabling the model to prioritize regions according to their importance.⁵² The attention mechanism can be expressed as:

a_{i} = \frac{\exp (e_{i})}{\sum_{j = 1}^{T} \exp (e_{j})},

(3)

Here, a_i denotes the attention weight for the i-th input element, T is the total number of input elements, and e_i represents the relevance score of the i-th element.⁵³

By assigning greater weights to the most important regions, the SAI module directs the model’s focus toward the most relevant portions of the input, thereby improving overall performance.⁵⁴
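As a minimal sketch of Eq. (3), the following Python snippet turns relevance scores into attention weights. The weighted-sum aggregation in `attend`, the function names, and the toy scores and features are illustrative assumptions; the source only defines the weights themselves.

```python
import math

def soft_attention_weights(scores):
    """Eq. (3): a_i = exp(e_i) / sum_j exp(e_j)."""
    shift = max(scores)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(e - shift) for e in scores]
    total = sum(exps)
    return [v / total for v in exps]

def attend(features, weights):
    """Assumed aggregation: attention-weighted sum of the T feature vectors."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Hypothetical relevance scores e_i and feature vectors for T = 3 input elements.
scores = [2.0, 0.5, -1.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = soft_attention_weights(scores)
context = attend(features, weights)
```

The weights sum to one, and elements with larger relevance scores receive larger weights.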

Channel Attention Integration (CAI)

The Channel Attention Integration (CAI) module emphasizes the significance of key channels within feature maps by computing attention weights across them. These weights are determined using statistical properties, such as the mean and standard deviation, of the input feature maps and are applied to enhance relevant features.⁵⁵ The functionality of the CAI module can be expressed mathematically as:

w_{c} = σ (W_{2} δ (W_{1} x)),

(4)

y_{c} = w_{c} ⊙ x,

(5)

Here, x denotes the input feature maps of size C × H × W, W₁ and W₂ are learnable weight matrices, δ represents the ReLU activation function, σ is the sigmoid activation function, w_c corresponds to the computed channel attention weight, and ⊙ indicates element-wise multiplication.⁵⁵
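Eqs. (4) and (5) can be illustrated with a small pure-Python sketch. It assumes that each channel is first summarized by its spatial mean before W₁ is applied (the source mentions statistical properties but does not fix this step), and the weight matrices and input values below are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def channel_attention(x, W1, W2):
    """x is a list of C channels, each an H x W nested list."""
    # Assumed summary step: one mean value per channel.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Eq. (4): w = sigmoid(W2 * ReLU(W1 * descriptor)).
    w = sigmoid(matvec(W2, relu(matvec(W1, desc))))
    # Eq. (5): rescale every value of channel c by its weight w_c.
    y = [[[w_c * val for val in row] for row in ch] for w_c, ch in zip(w, x)]
    return w, y

# Hypothetical input with C = 2 channels of size 2 x 2.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 2.0]]]
W1 = [[0.5, -0.5]]      # hypothetical reduction weights (1 x C)
W2 = [[1.0], [-1.0]]    # hypothetical expansion weights (C x 1)

w, y = channel_attention(x, W1, W2)
```

Each weight lies in (0, 1), so a channel is attenuated rather than amplified; in a trained network W₁ and W₂ would be learned rather than fixed by hand.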

Squeeze-Excitation Attention Integration (SEAI)

The Squeeze-Excitation Attention Integration (SEAI) module emphasizes channel-wise attention, allowing the model to dynamically recalibrate feature maps.⁵⁶ The module carries out two main operations: aggregation of global spatial information and recalibration of features across channels. For an input feature map x of size C × H × W, the SEAI operations can be expressed as:

z = \mathrm{GlobalAvgPooling}(x),

(6)

s = \mathrm{ReLU}(W_{2} \cdot \mathrm{sigmoid}(W_{1} \cdot z)),

(7)

y = s ⊙ x .

(8)

Here, GlobalAvgPooling performs global spatial information aggregation, and W₁ and W₂ are learnable weight matrices. This mechanism allows the model to focus on the most informative channel-wise features within the input data.⁵⁷

Cohen’s Kappa Proportioned Averaging (CKPA)

We proposed a novel ensemble learning method called Cohen’s Kappa Proportioned Averaging (CKPA), which assigns optimal weights to predictions from multiple classifiers and combines them through weighted averaging. Unlike other metrics, CKPA leverages Cohen’s Kappa to evaluate the agreement between model predictions and the true labels, beyond what is expected by chance. Classifiers with higher Kappa scores are considered more reliable, as they demonstrate stronger consistency with the ground truth labels. By proportionally weighting classifiers according to their Kappa values, the CKPA method enhances both prediction quality and robustness. While CKPA introduces a reliability-aware weighting mechanism based on agreement beyond chance, it is not intended as a universal theoretical replacement for existing ensemble weighting strategies. Unlike calibration- or validation-based methods, CKPA prioritizes consistency between predictions and ground truth through Cohen’s Kappa.

However, agreement-based weighting may favor dominant class patterns in imbalanced datasets. To mitigate this, performance is evaluated using class-sensitive metrics such as precision, recall, F1-score, and specificity, ensuring balanced class-wise assessment.

Therefore, CKPA is positioned as an empirically motivated and practically effective weighting strategy rather than a purely theoretical advancement. The steps for implementing CKPA are outlined below.

Step 1: Evaluating classifier performance

The process begins by calculating the Cohen’s Kappa values for each classifier to assess their reliability. Cohen’s Kappa measures the degree of agreement between predicted and true labels, adjusted for the likelihood of random agreement. It is defined as:

κ = \frac{p_{o} - p_{e}}{1 - p_{e}},

(9)

where p_o is the observed agreement between predictions and true labels, and p_e is the expected probability of agreement by chance. A higher κ value indicates stronger alignment between a classifier’s predictions and the ground truth, with κ= 1 indicating perfect agreement and κ= 0 suggesting performance equivalent to random guessing. Negative values indicate worse-than-random performance.
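Eq. (9) can be computed directly from hard label predictions. The sketch below is a plain-Python illustration; the function name, the toy labels, and the handling of the degenerate case p_e = 1 are assumptions.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Eq. (9): kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of instances where prediction equals label.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here; returning 0.0 is an assumed convention.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical three-class labels and predictions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
kappa = cohen_kappa(y_true, y_pred)   # p_o = 5/6, p_e = 1/3
```

In this toy case five of six predictions agree, and the chance-corrected agreement is 0.75.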

Step 2: Computing ensemble weights

After obtaining the Kappa scores for each classifier, the raw scores are shifted to avoid negative or zero values. This ensures that all weights are positive and proportional to the relative reliability of the classifiers. The normalized weights are then computed as:

w_{i} = \frac{κ_{i}^{'}}{\sum_{j = 1}^{N} κ_{j}^{'}},

(10)

where κ_{i}^{'} = κ_{i} - \min (κ) + ϵ, with ϵ being a very small positive constant to avoid division by zero, κ_i the Cohen’s Kappa score of classifier i, and N the total number of classifiers. This proportional weighting ensures that classifiers with stronger agreement contribute more significantly to the ensemble.
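The shift-and-normalize step of Eq. (10) is sketched below. The function name, the default value of ϵ, and the three Kappa scores are hypothetical.

```python
def ckpa_weights(kappas, eps=1e-6):
    """Eq. (10): shift each kappa by the minimum (plus eps), then normalize."""
    k_min = min(kappas)
    shifted = [k - k_min + eps for k in kappas]   # strictly positive by construction
    total = sum(shifted)
    return [s / total for s in shifted]

# Hypothetical Kappa scores for three classifiers.
weights = ckpa_weights([0.82, 0.78, 0.85])
```

Because the minimum is subtracted, the classifier with the lowest Kappa receives a weight close to zero, while the remaining weights are proportional to each classifier's margin over that minimum.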

Step 3: Generating ensemble predictions

The final CKPA predictions are generated by performing a weighted average of the individual classifiers’ prediction probability distributions. Let P_i= [p_i1, p_i2, …, p_in] represent the probability predictions of classifier i for n instances. The ensemble prediction for the j-th instance is given by:

E_{j} = \sum_{i = 1}^{N} w_{i} \cdot p_{i j},

(11)

where E_j is the weighted ensemble prediction for instance j, w_i denotes the weight of classifier i, and p_ij is the predicted probability of classifier i for instance j. The final predicted label is then obtained by selecting the class with the highest probability in E_j.
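A minimal sketch of Eq. (11) and the final arg-max step follows. The function name, the example weights, and the probability values are illustrative assumptions.

```python
def ckpa_ensemble(prob_sets, weights):
    """prob_sets[i][j] is classifier i's class-probability vector for instance j."""
    n_instances = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    ensemble = []
    for j in range(n_instances):
        # Eq. (11): weighted sum over classifiers, computed per class.
        e_j = [sum(w * probs[j][c] for w, probs in zip(weights, prob_sets))
               for c in range(n_classes)]
        ensemble.append(e_j)
    # Final label: class with the highest ensemble probability.
    labels = [max(range(n_classes), key=e.__getitem__) for e in ensemble]
    return ensemble, labels

# Hypothetical weights and predictions from two classifiers on two instances.
weights = [0.75, 0.25]
probs_a = [[0.9, 0.1], [0.4, 0.6]]
probs_b = [[0.5, 0.5], [0.6, 0.4]]
ensemble, labels = ckpa_ensemble([probs_a, probs_b], weights)
```

For the second instance the two classifiers disagree, and the more heavily weighted classifier decides the label.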

This CKPA method enhances ensemble performance by proportionally emphasizing classifiers with higher agreement beyond chance level, resulting in an accurate and robust ensemble output. The schematic illustration of this process is shown in Figure 8.

Figure 8.

Cohen Kappa-based weighted ensemble in layer L.

Multi-Layer CKPA

The Multi-Layer CKPA method extended the CKPA technique across two distinct layers, enabling a more refined and hierarchical emphasis on the strengths of individual models. This multi-layer strategy addressed a critical challenge in single-layer ensembling: the difficulty in adequately highlighting superior models due to relatively low individual classifier weights. By adopting a sequential “Layer-by-Layer” ensembling approach, this method progressively prioritized high-performing models at each layer, amplifying their influence in subsequent layers. A generic visual representation of the Multi-Layer CKPA framework is provided in Figure 9.

Figure 9.

Structure of the Multi-Layer CKPA framework.

CKPA in Layer 1

In the first layer, we ensembled the predictions to generate pre-final predictions using three core TASE approaches: TASE: Serial Stacked Attention (SSA), TASE: Parallel Stacked Attention (PSA), and TASE: Independent Stacked Attention (ISA). These approaches were applied to seven customized versions of pre-trained models, resulting in a total of 21 initial predictions. For the ISA approach specifically, a pre-layer combination step was introduced to aggregate the attention-integrated results for each model before proceeding to the ensembling process in Layer 1. This step ensured that the attention mechanisms were effectively integrated into the ensemble.

CKPA in Layer 2

The predictions from Layer 1, reduced to three consolidated outputs (SSA, PSA, and ISA), were further ensembled in Layer 2. This final ensembling step combined the strengths of the three TASE approaches, producing the ultimate prediction output, denoted as “TASE”. This hierarchical approach enhanced the robustness and accuracy of the ensemble by iteratively refining the influence of high-performing models across layers.
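The two-layer procedure can be sketched as follows. This is an illustrative outline rather than the exact implementation: it assumes that Kappa scores are computed against a labeled reference set (the source does not state which split is used), and the toy groups contain two models each instead of seven.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    tc, pc = Counter(y_true), Counter(y_pred)
    p_e = sum(tc[c] * pc[c] for c in tc) / (n * n)
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)  # 0.0 is an assumed fallback

def argmax_labels(probs):
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

def ckpa(prob_sets, y_true, eps=1e-6):
    """Single-layer CKPA: kappa scores -> shifted weights -> weighted average."""
    kappas = [cohen_kappa(y_true, argmax_labels(p)) for p in prob_sets]
    shifted = [k - min(kappas) + eps for k in kappas]
    weights = [s / sum(shifted) for s in shifted]
    n_classes = len(prob_sets[0][0])
    return [[sum(w * p[j][c] for w, p in zip(weights, prob_sets))
             for c in range(n_classes)] for j in range(len(y_true))]

def multi_layer_ckpa(groups, y_true):
    # Layer 1: one consolidated output per stacking strategy.
    layer1 = {name: ckpa(models, y_true) for name, models in groups.items()}
    # Layer 2: combine the consolidated outputs into the final prediction.
    final = ckpa(list(layer1.values()), y_true)
    return layer1, final

# Hypothetical reference labels and model outputs (binary, four instances).
y_true = [0, 1, 0, 1]
m_good = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
m_ok = [[0.6, 0.4], [0.3, 0.7], [0.4, 0.6], [0.1, 0.9]]
m_weak = [[0.4, 0.6], [0.6, 0.4], [0.8, 0.2], [0.3, 0.7]]
groups = {"SSA": [m_good, m_ok], "PSA": [m_ok, m_weak], "ISA": [m_good, m_weak]}

layer1, final = multi_layer_ckpa(groups, y_true)
final_labels = argmax_labels(final)
```

In this toy setting the strongest model dominates two of the three Layer 1 outputs, and those two outputs in turn dominate Layer 2, which illustrates how influence is carried forward across layers.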

Pseudocode for CKPA

Numerical example of CKPA

Consider a binary classification problem with 2 models and 3 test samples in Table 2.

Step 1: Predictions from each model

Model A hard predictions = [0,1,0]

Model B hard predictions = [0,1,1]

Step 2: Compute Cohen’s Kappa scores

κ_{A} = 1.0 (perfect agreement with true labels [0, 1, 0])

κ_{B} = 0.40 (partial agreement with true labels: p_{o} = 2/3, p_{e} = 4/9)

Step 3: Shift and Normalize Weights

κ_{A}^{'} = 1.0 - 0.40 + ϵ = 0.60 + ϵ, κ_{B}^{'} = 0.40 - 0.40 + ϵ = ϵ

w_{A} = \frac{0.60 + ϵ}{0.60 + 2ϵ} \approx 0.999, w_{B} = \frac{ϵ}{0.60 + 2ϵ} \approx 0.001

Step 4: Ensemble Probabilities (dominated by Model A) (Table 3)

Final Accuracy:

Accuracy = \frac{3}{3} = 100\%

Table 2.

Input data for CKPA.

Sample	True label	Model A prob [0,1]	Model B prob [0,1]
1	0	[0.9, 0.1]	[0.7, 0.3]
2	1	[0.3, 0.7]	[0.4, 0.6]
3	0	[0.6, 0.4]	[0.2, 0.8]

Table 3.

Weighted ensemble results.

Sample	Ensemble prob [0,1]	Prediction
1	[0.9, 0.1]	0
2	[0.3, 0.7]	1
3	[0.6, 0.4]	0

Experimental results and analysis

This section provides a detailed evaluation of the classification performance of our proposed methodology. The analysis included both quantitative metrics and visual interpretations to demonstrate the effect of applying CKPA on enhancing the predictive performance of TASE architectures. Through various experimental results, including multiple evaluation measures, graphical illustrations, and confusion matrices, we conducted a thorough comparison of the different approaches outlined in previous sections.

Performance evaluation metrics

To systematically evaluate our models, we used several key performance metrics: accuracy, precision, recall (sensitivity), F1-score, specificity, and ROC-AUC (Receiver Operating Characteristic Area Under the Curve).⁵⁸ These metrics provided essential insights into the classification effectiveness of the models. Each metric was calculated from the confusion matrix, which classified predictions into true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The mathematical definitions for these metrics are as follows:

Accuracy (A) = \frac{T P + T N}{T P + T N + F P + F N}

(12)

Precision (P) = \frac{T P}{T P + F P}

(13)

Recall (R) = \frac{T P}{T P + F N}

(14)

F 1 - Score (F1) = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(15)

Specificity (S) = \frac{T N}{T N + F P}

(16)

True Positive Rate (TPR) = \frac{T P}{T P + F N}

(17)

False Positive Rate (FPR) = \frac{F P}{F P + T N}

(18)

where TP is the number of positive samples correctly predicted, TN is the number of negative samples correctly predicted, FP is the number of negative samples incorrectly predicted as positive, and FN is the number of positive samples incorrectly predicted as negative.⁵⁹
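Eqs. (12) to (18) can be evaluated from the four confusion-matrix counts as in the sketch below. The function name and the counts are hypothetical; for a multi-class problem the counts would be taken one-vs-rest per class, which is an assumption here since the aggregation scheme is not specified.

```python
def binary_metrics(tp, fp, tn, fn):
    """Metrics of Eqs. (12)-(18); assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "tpr": recall,                # Eq. (17) coincides with recall, Eq. (14)
        "fpr": fp / (fp + tn),        # equals 1 - specificity
    }

# Hypothetical counts for one class.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

The ROC curve is traced by recomputing the (FPR, TPR) pair at different decision thresholds.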

By utilizing these performance metrics, we obtained a comprehensive understanding of the models’ generalization capabilities across various classification tasks. This evaluation enabled the identification of each approach’s strengths and potential limitations, informing further improvements for practical deployment.

Experimental setup

The complete framework was executed within a Kaggle notebook environment, utilizing a GPU P100 and a dual-core Intel Xeon CPU with a processing speed of 690 ms/step. Lesion images were resized to (224, 224, 3) for models in the EfficientNetV2 family. The dataset was split into three subsets: 15% for validation, 15% for testing, and the remaining portion for training.

Model training was performed over 100 epochs with a batch size of 16. The optimization employed the Adam optimizer with an initial learning rate of 0.0001. Categorical cross-entropy was used as the loss function to facilitate effective multi-class classification. The loss function is defined as:

L = - \sum_{i = 1}^{K} y_{i} \log ({\hat{y}}_{i})

(19)
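As a small worked illustration, the softmax of Eq. (2) and the loss of Eq. (19) can be combined for a single sample. The logits, the one-hot label, and the clipping constant `eps` are assumptions made for the example.

```python
import math

def softmax(logits):
    """Eq. (2), with the maximum subtracted for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def categorical_cross_entropy(y_true, y_hat, eps=1e-12):
    """Eq. (19); eps guards against log(0) and is an added safeguard."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_hat))

# Hypothetical logits for K = 3 classes and a one-hot label for class 1.
logits = [0.2, 2.1, -0.5]
y_true = [0, 1, 0]

y_hat = softmax(logits)
loss = categorical_cross_entropy(y_true, y_hat)
```

With a one-hot label, the loss reduces to the negative log-probability assigned to the true class.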

To prevent overfitting and improve generalization, early stopping was applied using the Reduce-on-Plateau method with a patience of 50 epochs.

This section covered both theoretical foundations and empirical results to evaluate classification performance. The primary aim was to demonstrate the effect of CKPA on enhancing the predictive accuracy of TASE architectures. Experimental results, including various evaluation metrics, ROC-AUC curves, and confusion matrices, provided a thorough comparison of the different methodologies presented in the previous sections.

Trainable parameters

As our ensembling strategy was implemented at the prediction stage, the total number of trainable parameters remained unchanged after ensembling. In contrast, during the stacking phase—where models were combined prior to training—the total parameter count increased substantially. Table 4 presents a detailed summary of the trainable parameters for each model.

Table 4.

Trainable parameters for each architecture.

Serial Stacked Attention (SSA) Architectures
SSA-ENv2b0 - 9,231,560	SSA-ENv2b1 - 10,232,908	SSA-ENv2b2 - 11,773,350
SSA-ENv2b3 - 16,200,534	SSA-ENv2L - 120,607,128
SSA-ENv2M - 56,231,212	SSA-ENv2S - 23,550,344
Parallel Stacked Attention (PSA) Architectures
PSA-ENv2b0 - 25,317,784	PSA-ENv2b1 - 26,319,132	PSA-ENv2b2 - 25,320,310
PSA-ENv2b3 - 32,299,302	PSA-ENv2L - 136,693,352
PSA-ENv2M - 72,317,436	PSA-ENv2S - 39,636,568
Independent Attention (ISA) Architectures
SAI-ENv2b0 - 14,159,671	CAI-ENv2b0 - 11,492,727	SEAI-ENv2b0 - 11,451,289
SAI-ENv2b1 - 15,161,019	CAI-ENv2b1 - 12,494,075	SEAI-ENv2b1 - 12,452,637
SAI-ENv2b2 - 15,521,813	CAI-ENv2b2 - 13,641,301	SEAI-ENv2b2 - 13,599,863
SAI-ENv2b3 - 21,128,645	CAI-ENv2b3 - 18,461,701	SEAI-ENv2b3 - 18,420,263
SAI-ENv2L - 125,535,239	CAI-ENv2L - 122,868,295	SEAI-ENv2L - 122,826,857
SAI-ENv2M - 61,159,323	CAI-ENv2M - 58,492,379	SEAI-ENv2M - 58,450,941
SAI-ENv2S - 28,478,455	CAI-ENv2S - 25,811,511	SEAI-ENv2S - 25,770,073

Hyperparameter selection

Hyperparameter tuning is critical for maximizing model performance.⁶⁰⁻⁶² In this study, a controlled manual tuning protocol was followed to ensure transparency and reproducibility. Key hyperparameters, including learning rate, batch size, kernel sizes, and activation functions, were systematically varied within predefined ranges while keeping other parameters fixed, allowing controlled observation of their impact on validation performance. Experiments were repeated with fixed random seeds to reduce stochastic variation and confirm stability of results.

A learning rate of 0.0001 with the Adam optimizer was selected based on consistent improvements in validation accuracy and loss convergence. Batch normalization was applied to stabilize training and accelerate convergence, while mitigating overfitting. The ‘he_normal’ kernel initializer maintained proper gradient flow, and the ReLU activation function captured non-linear patterns in the data, contributing to high classification accuracy. This structured approach ensured a balance between computational efficiency, model stability, and robust predictive performance.

Performance analysis of the four augmentation strategies to determine the optimal approach

As previously described, four data augmentation strategies were implemented to address class imbalance:

• No Augmentation (NA): The dataset remained unmodified, using only the original images without generating any synthetic samples.

• Prior Augmentation (PiA): Synthetic images were created before dataset splitting, which could result in overlaps where both original and augmented images from the same source appeared in training, validation, and testing subsets.

• Training Data Augmentation (TA): Augmentation was applied solely to the training subset, keeping the validation and testing sets fully independent.

• Posterior Augmentation (AP): Augmentation was carried out separately on each subset (training, validation, and testing) after splitting, thereby expanding the dataset size across all partitions.
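The Training Data Augmentation (TA) strategy can be sketched as "split first, then augment only the training subset". In the snippet below, the function names, the placeholder `fake_augment`, the number of augmented copies, and the random (non-stratified) split are assumptions; only the 15% validation and 15% testing fractions come from the experimental setup.

```python
import random

def split_then_augment(image_ids, augment, val_frac=0.15, test_frac=0.15,
                       copies=2, seed=42):
    """Split into train/val/test first, then augment the training subset only."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    # Augmented samples are derived from training images only, so no
    # augmented copy of a validation or testing image can leak into training.
    train_aug = train + [augment(i, k) for i in train for k in range(copies)]
    return train_aug, val, test

def fake_augment(image_id, k):
    # Placeholder for a real image transform (flip, rotation, and so on).
    return f"{image_id}_aug{k}"

image_ids = [f"img_{i:03d}" for i in range(100)]
train_set, val_set, test_set = split_then_augment(image_ids, fake_augment)
```

Prior Augmentation (PiA) corresponds to reversing the order of the two steps, which is what allows augmented copies of the same source image to appear on both sides of the split.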

These augmentation strategies were applied to customized pre-trained EfficientNetV2 models. Tables 5 and 6 report the performance metrics on both the testing dataset and the reserved independent testing set, enabling the identification of the most effective augmentation approach.

Table 5.

Performance evaluation by four augmentation strategies on testing data.

Algorithm	A	P	R	F1	S
NA_ENv2b0	85.90	85.69	85.90	85.45	89.07
PiA_ENv2b0	98.27	98.27	98.27	98.27	99.71
TA_ENv2b0	87.85	87.69	87.85	87.39	89.06
AP_ENv2b0	81.18	83.57	81.18	80.73	96.73
NA_ENv2b1	83.62	82.50	83.62	82.78	84.81
PiA_ENv2b1	98.58	98.58	98.58	98.58	99.76
TA_ENv2b1	89.37	89.18	89.37	89.12	91.60
AP_ENv2b1	82.08	83.05	82.08	81.72	96.86
NA_ENv2b2	86.76	86.47	86.76	86.44	88.73
PiA_ENv2b2	98.36	98.37	98.36	98.36	99.72
TA_ENv2b2	89.37	89.24	89.37	89.18	91.67
AP_ENv2b2	81.41	82.54	81.41	80.69	96.76
NA_ENv2b3	86.77	86.02	86.77	86.07	87.14
PiA_ENv2b3	98.25	98.25	98.25	98.25	99.70
TA_ENv2b3	90.13	89.94	90.13	89.79	91.16
AP_ENv2b3	82.81	84.05	82.81	82.62	96.97
NA_ENv2L	83.95	82.92	83.95	83.05	86.95
PiA_ENv2L	96.35	96.34	96.35	96.33	99.38
TA_ENv2L	87.74	87.21	87.74	87.23	89.35
AP_ENv2L	80.43	82.63	80.43	80.46	96.53
NA_ENv2M	81.67	80.39	81.67	80.31	84.14
PiA_ENv2M	97.53	97.52	97.53	97.52	99.58
TA_ENv2M	87.64	88.30	87.64	87.75	92.95
AP_ENv2M	82.92	81.87	81.68	81.87	96.78
NA_ENv2S	82.60	83.62	82.75	83.62	86.20
PiA_ENv2S	97.89	97.88	97.89	97.88	99.64
TA_ENv2S	87.09	86.83	87.09	86.35	86.32
AP_ENv2S	81.60	83.74	81.60	81.47	96.75

Table 6.

Performance evaluation by four augmentation strategies on independent testing data.

Algorithm	A	P	R	F1	S
NA_ENv2b0	89.98	90.31	89.98	89.82	85.56
PiA_ENv2b0	90.46	89.97	90.46	89.85	77.44
TA_ENv2b0	91.06	90.53	91.06	90.66	83.23
AP_ENv2b0	91.18	91.01	91.18	90.90	85.60
NA_ENv2b1	90.46	89.90	90.46	89.95	81.30
PiA_ENv2b1	91.30	90.69	91.30	90.72	83.70
TA_ENv2b1	91.91	91.33	91.91	91.51	84.72
AP_ENv2b1	92.15	91.68	92.15	91.77	86.17
NA_ENv2b2	90.34	89.68	90.34	89.85	85.60
PiA_ENv2b2	91.67	91.06	91.67	91.21	82.81
TA_ENv2b2	90.82	90.53	90.82	90.64	85.65
AP_ENv2b2	90.22	89.45	90.22	89.67	82.26
NA_ENv2b3	91.06	90.79	91.06	90.81	86.14
PiA_ENv2b3	92.27	92.10	92.27	92.12	87.64
TA_ENv2b3	91.67	91.12	91.67	91.31	84.74
AP_ENv2b3	92.27	91.84	92.27	91.93	85.25
NA_ENv2L	91.30	91.08	91.30	90.92	87.08
PiA_ENv2L	92.03	91.80	92.03	91.72	87.58
TA_ENv2L	91.67	91.29	91.67	91.29	85.68
AP_ENv2L	91.43	91.49	91.43	91.41	89.02
NA_ENv2M	91.06	90.10	91.06	90.45	86.13
PiA_ENv2M	91.79	91.25	91.79	91.39	84.23
TA_ENv2M	91.43	91.54	91.43	91.25	88.47
AP_ENv2M	92.39	91.96	92.39	92.10	86.19
NA_ENv2S	88.56	88.89	88.68	88.89	86.46
PiA_ENv2S	91.06	90.47	91.06	90.52	80.36
TA_ENv2S	90.94	90.20	90.94	90.36	80.34
AP_ENv2S	89.25	88.53	89.25	88.80	82.20

Although Prior Augmentation (PiA) achieved near-perfect test accuracies (e.g., PiA_ENv2b1: 98.58% on test data), its performance dropped considerably on independent testing data, exposing limitations in generalization. For example, PiA_ENv2b1 declined from 98.58% to 91.30% on independent testing, a decrease of 7.28 percentage points, whereas TA_ENv2b1 maintained 91.91%, demonstrating better robustness. Similarly, PiA_ENv2b3 decreased by 5.98 percentage points (98.25% → 92.27%), while TA_ENv2b3 improved from 90.13% to 91.67%, highlighting TA’s stability on unseen data. A comparable trend was observed for TA_ENv2b1 (89.37% → 91.91%).
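The test-to-independent accuracy changes quoted above can be reproduced from the values in Tables 5 and 6 with a few lines of Python. The dictionary layout and variable names are illustrative.

```python
# (test accuracy, independent-test accuracy) in percent, from Tables 5 and 6.
accuracy = {
    "PiA_ENv2b1": (98.58, 91.30),
    "TA_ENv2b1": (89.37, 91.91),
    "PiA_ENv2b3": (98.25, 92.27),
    "TA_ENv2b3": (90.13, 91.67),
}

# Change in percentage points when moving from test to independent data.
gaps = {name: round(ind - test, 2) for name, (test, ind) in accuracy.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} percentage points")
```

A negative value indicates accuracy lost on the independent data; a positive value indicates a gain.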

Dominance of TA across architectures

TA generalized more reliably than the other augmentation strategies on the independent data, even when its test accuracies appeared lower. For instance, TA_ENv2b1 achieved 91.91% independent accuracy compared to PiA_ENv2b1’s 91.30%, despite PiA showing higher test accuracy (98.58% vs. 89.37%). TA_ENv2b3 improved from 90.13% (test) to 91.67% (independent), whereas PiA_ENv2b3 dropped from 98.25% to 92.27%. A similar trend was observed for TA_ENv2L, which increased from 87.74% (test) to 91.67% (independent), while PiA_ENv2L declined from 96.35% to 92.03%.

These results emphasize TA’s ability to prevent overfitting, while PiA’s high test scores diminished when evaluated on independent data. Instances where TA’s independent accuracy exceeded its test accuracy (e.g., TA_ENv2b1: +2.54%) further confirm its superior generalization capability.

Limitations of PiA and AP strategies

The tendency of PiA toward overfitting was clearly reflected in its specificity values. For instance, PiA_ENv2b1 achieved an exceptionally high specificity of 99.76% on the test set, which declined notably to 83.70% on the independent data, indicating reduced ability to correctly identify minority classes. In contrast, TA_ENv2b1 showed a more moderate change, with specificity declining from 91.60% to 84.72%, reflecting stronger class-wise generalization. Similarly, the AP strategies exhibited irregular performance patterns across datasets. For example, AP_ENv2b1 increased from 82.08% to 92.15% in accuracy (test → independent), but this inconsistency arose from the inclusion of augmented validation and testing samples, leading to biased evaluation and unstable model adaptation.

Although No Augmentation (NA) delivered moderate improvements compared to earlier baselines (e.g., NA_ENv2b1: 90.46% independent accuracy), TA surpassed it on independent data for every architecture. For example, TA_ENv2b1 exceeded NA_ENv2b1 by 1.45 percentage points on independent data (91.91% vs. 90.46%), with a smaller margin of 0.61 percentage points observed for TA_ENv2b3 (91.67% vs. NA_ENv2b3: 91.06%). Persistent class imbalance also impacted NA’s recall and F1-scores, as seen in NA_ENv2L (F1-score: 90.92% vs. TA_ENv2L: 91.29%), underscoring its weaker discriminative capacity.

Overall, these findings established TA as the most reliable augmentation technique among all evaluated methods. Although PiA achieved notably high test accuracies, its substantial reductions on independent data (e.g., PiA_ENv2b1: −7.28%) revealed limited generalization capability. Conversely, TA consistently sustained or even improved accuracy on unseen datasets, confirming its robustness and suitability for real-world deployment. By upholding evaluation fairness and mitigating the effects of class imbalance, TA exhibited stable and trustworthy performance, reaffirming its position as the most effective augmentation strategy for deep learning–based skin lesion classification.

Performance analysis of TASE architectures in multi layer CKPA

Cohen’s Kappa Proportioned Averaging (CKPA) was applied to the outputs of all classifiers at each layer, denoted as CKPA_L (with L representing the respective layer), to construct the Multi-Layer CKPA (ML-CKPA). This strategy incrementally refined predictions by utilizing a multi-stage ensemble approach.

TASE architectures in CKPA_Layer 1 on testing data

The effectiveness of the Serial Stacked Attention (SSA) architectures at CKPA_Layer 1 was assessed across various pre-trained models, as summarized in Table 7. Individual models demonstrated strong classification performance, achieving accuracies between 88.29% and 90.89%, with consistently balanced precision, recall, F1-score, and specificity metrics. Notably, SA_ENv2b3 attained an accuracy of 90.89% and a specificity of 91.36%, whereas SA_ENv2L and SA_ENv2M achieved accuracies of 89.15% and 89.26%, respectively, both with specificities above 92%.

Table 7.

Performance evaluation of serial stacked attention on testing data in CKPA_Layer 1.

Algorithm	A	P	R	F1	S
SA_ENv2b0	88.61	88.43	88.61	88.36	91.19
SA_ENv2b1	88.50	88.11	88.50	88.06	87.65
SA_ENv2b2	89.91	89.79	89.91	89.55	89.91
SA_ENv2b3	90.89	90.80	90.89	90.71	91.36
SA_ENv2L	89.15	89.21	89.15	88.95	92.38
SA_ENv2M	89.26	89.18	89.26	89.06	92.40
SA_ENv2S	88.29	88.34	88.29	88.13	91.47
SSA (CKPA₁)	91.21	91.10	91.21	91.03	91.62

The SSA ensemble at CKPA_Layer 1 further enhanced performance, reaching an accuracy of 91.21%, precision of 91.10%, recall of 91.21%, and specificity of 91.62%, demonstrating that integrating multiple attention-based architectures improved robustness and generalization. These results underscore the SSA framework’s capability to leverage diverse backbone models for superior classification performance.

The Ensembled Serial Stacked Attention (SSA) architecture at CKPA_Layer 1 demonstrated a marked improvement over individual models, achieving an accuracy of 91.21%, precision of 91.10%, recall of 91.21%, F1-score of 91.03%, and specificity of 91.62%. The balanced metrics, particularly specificity, highlighted the model’s effectiveness in correctly identifying negative samples. This enhanced performance illustrated that serial stacked attention integration effectively captured contextual information, resulting in more reliable and precise classification. The application of the CKPA ensemble further amplified these gains, confirming the SSA framework as a robust approach for complex classification tasks.

The performance of the Parallel Stacked Attention (PSA) architectures at CKPA_Layer 1 was assessed across the same set of pre-trained models, as summarized in Table 8. Individual Parallel Attention (PA) models exhibited consistent results, achieving accuracies ranging from 86.88% to 90.99%, with balanced precision, recall, F1-score, and specificity. Notably, PA_ENv2b3 recorded an accuracy of 90.99% and specificity of 92.04%, PA_ENv2M achieved 90.35% accuracy with 92.54% specificity, while PA_ENv2L attained 88.07% accuracy and 93.03% specificity.

Table 8.

Performance evaluation of parallel stacked attention on testing data in CKPA_Layer 1.

Algorithm	A	P	R	F1	S
PA_ENv2b0	86.88	87.21	86.88	86.95	91.73
PA_ENv2b1	87.20	86.82	87.20	86.84	89.24
PA_ENv2b2	86.98	86.61	86.98	86.41	87.88
PA_ENv2b3	90.99	90.85	90.99	90.81	92.04
PA_ENv2L	88.07	88.95	88.07	87.94	93.03
PA_ENv2M	90.35	90.45	90.35	90.06	92.54
PA_ENv2S	87.53	87.14	87.53	87.12	87.32
PSA (CKPA₁)	92.41	92.53	92.41	92.20	93.52

The PSA ensemble at CKPA_Layer 1 further improved performance, reaching an accuracy of 92.41%, precision of 92.53%, recall of 92.41%, F1-score of 92.20%, and specificity of 93.52%, illustrating the advantage of ensembling parallel attention-based architectures for enhanced robustness and classification performance.

The Ensembled Parallel Stacked Attention (PSA) architecture at CKPA_Layer 1 demonstrated superior performance compared to individual models, achieving an accuracy of 92.41%, precision of 92.53%, recall of 92.41%, F1-score of 92.20%, and specificity of 93.52%. The balanced metrics, particularly the specificity, highlighted the model’s ability to correctly classify negative samples. This improvement illustrated that parallel stacked attention effectively enhanced the model’s capacity to extract and utilize contextual information. Furthermore, the CKPA ensemble successfully integrated the strengths of individual architectures, resulting in overall performance gains. Consequently, the PSA configuration proved highly effective for improving classification performance in complex tasks.

For evaluating TASE: Independent Stacked Attention (ISA) within CKPA_Layer 1, the outputs of the attention-enhanced networks for each EfficientNetV2 variant were first ensembled. In the pre-CKPA Layer-1 stage (CKPA₀), the results from the three Triple-Attention (TA) modules were combined for each backbone, yielding seven IA outputs, one per architecture. These outputs were subsequently aggregated into a single ISA result (CKPA₁). Table 9 provides a detailed summary of the performance achieved through this ensembling process.

Table 9.

Performance evaluation of independent attention on testing data in CKPA_Layer 1.

Algorithm	A	P	R	F1	S
SAI_ENv2b0	87.96	87.58	87.96	87.47	87.79
CAI_ENv2b0	88.07	87.75	88.07	87.55	88.18
SEAI_ENv2b0	88.94	89.16	88.94	88.79	91.52
IA_ENv2b0 (CKPA₀)	88.83	88.99	88.83	88.65	91.12
SAI_ENv2b1	88.61	88.44	88.61	88.49	90.79
CAI_ENv2b1	89.26	89.03	89.26	89.03	89.71
SEAI_ENv2b1	88.29	88.15	88.29	87.99	88.83
IA_ENv2b1 (CKPA₀)	89.70	89.52	89.70	89.52	90.33
SAI_ENv2b2	92.27	92.20	92.27	91.998	87.13
CAI_ENv2b2	91.21	91.02	91.21	91.03	91.46
SEAI_ENv2b2	87.64	87.48	87.64	87.35	90.29
IA_ENv2b2 (CKPA₀)	91.21	91.03	91.21	91.03	91.45
SAI_ENv2b3	88.39	88.22	88.39	88.12	89.39
CAI_ENv2b3	90.35	90.16	90.35	90.14	91.37
SEAI_ENv2b3	89.05	88.99	89.05	88.81	90.83
IA_ENv2b3 (CKPA₀)	90.67	90.43	90.67	90.42	90.82
SAI_ENv2L	86.77	86.17	86.77	86.26	87.24
CAI_ENv2L	88.72	88.82	88.72	88.66	91.90
SEAI_ENv2L	86.23	86.87	86.23	86.37	92.92
IA_ENv2L (CKPA₀)	88.61	88.73	88.61	88.56	91.89
SAI_ENv2M	88.39	88.09	88.39	88.20	89.95
CAI_ENv2M	88.61	88.48	88.61	88.35	90.00
SEAI_ENv2M	88.83	88.56	88.83	88.48	90.40
IA_ENv2M (CKPA₀)	89.91	89.59	89.91	89.57	91.12
SAI_ENv2S	88.72	88.56	88.72	88.36	89.97
CAI_ENv2S	89.59	89.65	89.59	89.44	90.85
SEAI_ENv2S	85.57	85.24	85.57	84.72	87.29
IA_ENv2S (CKPA₀)	90.99	90.91	90.99	90.81	91.78
ISA (CKPA₁)	92.73	92.62	92.73	92.50	91.41

The table presents comparative results of the Triple-Attention modules and their corresponding ensemble configurations across the examined architectures. For example, IA_ENv2S (CKPA₀) achieved an accuracy of 90.99%, precision of 90.91%, recall of 90.99%, F1-score of 90.81%, and specificity of 91.78%. Similarly, IA_ENv2b2 (CKPA₀) and IA_ENv2b3 (CKPA₀) recorded accuracies of 91.21% and 90.67%, respectively. The final ISA ensemble (CKPA₁) consolidated these results, achieving an accuracy of 92.73%, precision of 92.62%, recall of 92.73%, F1-score of 92.50%, and specificity of 91.41%.

This evaluation illustrated that ensembling the Triple-Attention (TA) modules considerably boosted model performance, with the ISA architecture in CKPA Layer-1 achieving the highest results. By independently integrating attention mechanisms, the strengths of each architecture were effectively combined, resulting in enhanced overall performance. These findings underscore the robustness and efficacy of the ISA approach in managing complex classification challenges.
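The weighting idea behind CKPA can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes that each model's fusion weight is its Cohen's kappa divided by the sum of kappas in the ensemble, with kappa measured on held-out labels. The function names `cohen_kappa`, `ckpa_fuse`, and `argmax`, and all numbers in the toy example, are hypothetical.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between reference labels and predicted labels."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the product of the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Note: undefined when expected == 1 (a single class everywhere).
    return (observed - expected) / (1.0 - expected)

def ckpa_fuse(prob_sets, kappas):
    """Average class probabilities with kappa-proportional weights (assumed rule)."""
    total = sum(kappas)
    weights = [k / total for k in kappas]
    fused = []
    for sample_probs in zip(*prob_sets):  # one probability vector per model
        n_classes = len(sample_probs[0])
        fused.append([sum(w * p[c] for w, p in zip(weights, sample_probs))
                      for c in range(n_classes)])
    return fused

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Toy example: two models, three samples, two classes (made-up numbers).
y_val = [0, 1, 1]
model_a = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]  # all three correct
model_b = [[0.6, 0.4], [0.7, 0.3], [0.3, 0.7]]  # second sample wrong
kappas = [cohen_kappa(y_val, [argmax(p) for p in m]) for m in (model_a, model_b)]
fused = ckpa_fuse([model_a, model_b], kappas)
fused_labels = [argmax(p) for p in fused]
```

Under this assumed rule the more reliable model dominates the fused prediction for the second sample. A kappa of zero or below would need separate handling (for example clipping), and the same fusion step could be applied again to the fused outputs to form a higher CKPA layer.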

TASE architectures in CKPA_Layer 1 with independent testing data

The performance of the TASE architectures in CKPA_Layer 1 was evaluated using independent test data, with results presented in Tables 10–12. These tables provide a comprehensive comparison of the SSA, PSA, and ISA architectures in CKPA ensembling, highlighting their effectiveness on completely unseen data.

Table 10.

Performance evaluation of serial stacked attention on independent test data in CKPA_Layer 1.

Algorithm	A	P	R	F1	S
SA_ENv2b0	92.15	91.92	92.15	91.85	87.10
SA_ENv2b1	92.75	92.44	92.75	92.24	86.68
SA_ENv2b2	92.15	91.68	92.15	91.73	84.72
SA_ENv2b3	92.51	92.18	92.51	92.30	88.11
SA_ENv2L	91.55	91.04	91.55	91.13	87.06
SA_ENv2M	92.51	92.33	92.51	92.37	90.51
SA_ENv2S	91.91	91.99	91.91	91.83	90.44
SSA (CKPA₁)	93.60	93.35	93.60	93.19	89.10

Table 11.

Performance evaluation of parallel stacked attention on independent test data in CKPA_Layer 1.

Algorithm	A	P	R	F1	S
PA_ENv2b0	90.82	90.54	90.82	90.45	86.06
PA_ENv2b1	91.43	91.15	91.43	91.05	86.57
PA_ENv2b2	90.58	89.74	90.58	89.97	83.19
PA_ENv2b3	93.24	93.13	93.24	93.13	89.13
PA_ENv2L	92.39	92.51	92.39	92.41	93.34
PA_ENv2M	92.15	91.71	92.15	91.75	86.61
PA_ENv2S	91.06	90.56	91.06	90.65	82.74
PSA (CKPA₁)	93.96	93.59	93.96	93.71	90.09

Table 12.

Performance evaluation of independent attention on independent test data in CKPA_Layer 1.

Algorithm	A	P	R	F1	S
SAI_ENv2b0	90.58	89.90	90.58	90.12	82.25
CAI_ENv2b0	90.70	89.86	90.70	90.05	79.89
SEAI_ENv2b0	91.55	91.02	91.55	91.01	85.16
IA_ENv2b0 (CKPA₀)	91.55	91.02	91.55	91.01	85.16
SAI_ENv2b1	92.27	91.86	92.27	91.99	89.04
CAI_ENv2b1	90.70	90.25	90.70	90.29	85.66
SEAI_ENv2b1	91.55	91.03	91.55	91.21	86.62
IA_ENv2b1 (CKPA₀)	92.39	91.91	92.39	92.06	89.04
SAI_ENv2b2	92.27	92.20	92.27	91.998	87.13
CAI_ENv2b2	90.70	90.06	90.70	90.30	82.75
SEAI_ENv2b2	91.18	91.13	91.18	90.97	87.57
IA_ENv2b2 (CKPA₀)	92.75	92.49	92.75	92.44	87.63
SAI_ENv2b3	92.03	91.72	92.03	91.71	85.68
CAI_ENv2b3	91.67	91.51	91.67	91.52	86.64
SEAI_ENv2b3	91.43	91.18	91.43	91.24	87.13
IA_ENv2b3 (CKPA₀)	92.27	92.01	92.27	91.98	86.65
SAI_ENv2L	90.94	90.63	90.94	90.68	86.10
CAI_ENv2L	92.63	92.82	92.63	92.61	91.90
SEAI_ENv2L	89.98	91.13	89.98	90.39	91.32
IA_ENv2L (CKPA₀)	92.75	92.79	92.75	92.66	90.94
SAI_ENv2M	91.67	91.34	91.67	91.42	87.10
CAI_ENv2M	92.27	91.83	92.27	91.93	86.64
SEAI_ENv2M	92.03	91.80	92.03	91.55	88.02
IA_ENv2M (CKPA₀)	92.39	91.99	92.39	91.91	86.62
SAI_ENv2S	91.43	91.13	91.43	91.20	87.07
CAI_ENv2S	92.27	91.89	92.27	91.97	84.78
SEAI_ENv2S	90.82	90.90	90.82	90.33	86.06
IA_ENv2S (CKPA₀)	92.75	92.40	92.75	92.49	86.24
ISA (CKPA₁)	93.84	93.48	93.84	93.59	89.13

Table 10 showcases the performance of the SSA architectures across various models. For example, SA_ENv2b1 achieved an accuracy of 92.75%, with precision and recall values of 92.44% and 92.75%, respectively. Similarly, SA_ENv2M demonstrated strong performance with an accuracy of 92.51%, precision of 92.33%, and recall of 92.51%. The ensembled SSA mechanism in CKPA_Layer 1 (SSA (CKPA₁)) achieved the highest accuracy of 93.60%, along with precision of 93.35%, recall of 93.60%, F1-score of 93.19%, and specificity of 89.10%. These results indicated that the SSA mechanism effectively leveraged serial stacked attention to enhance model performance on independent test data.

Table 11 presents the performance of the PSA architectures in CKPA_Layer 1 using independent test data. For example, PA_ENv2b3 achieved an accuracy of 93.24%, with precision and recall values of 93.13% and 93.24%, respectively. Similarly, PA_ENv2L demonstrated strong performance with an accuracy of 92.39%, precision of 92.51%, and recall of 92.39%. The ensembled PSA mechanism in CKPA_Layer 1 (PSA (CKPA₁)) achieved the highest accuracy of 93.96%, along with precision of 93.59%, recall of 93.96%, F1-score of 93.71%, and specificity of 90.09%. These results indicated that the PSA mechanism, which processes attention in parallel, effectively enhanced model performance on independent test data.

Table 12 highlights the performance of the ISA architectures in CKPA_Layer 1, which effectively integrated the contributions of individual attention mechanisms. For example, IA_ENv2b2 achieved an accuracy of 92.75%, with precision and recall values of 92.49% and 92.75%, respectively. Other IA variants listed in the table demonstrated similarly strong performance. The ensembled ISA mechanism in CKPA_Layer 1 (ISA (CKPA₁)) achieved the highest accuracy of 93.84%, along with precision of 93.48%, recall of 93.84%, and an F1-score of 93.59%. These results underscore the effectiveness of the ISA framework in combining multiple attention strategies to enhance model performance on independent test data.

Evaluation of the TASE architectures using independent test data indicated that all three configurations (SSA, PSA, and ISA) exhibited strong performance. Among them, PSA achieved the highest accuracy (93.96%) and F1-score (93.71%), closely followed by ISA (93.84% and 93.59%) and SSA (93.60% and 93.19%). These results underscore the robustness of the CKPA framework in effectively integrating multiple attention strategies to enhance model generalization on unseen data.

TASE architectures in CKPA-Layer 2

The performance of the TASE architecture at CKPA_Layer 2 was comprehensively evaluated through Tables 13 and 14, which present the results from the final ensembling layer. These tables demonstrate the effectiveness of integrating multiple attention mechanisms in a hierarchical framework to achieve enhanced classification performance.

Table 13.

Performance evaluation of TASE on test data in CKPA_Layer 2.

Algorithm	A	P	R	F1	S
SSA (CKPA₁)	91.21	91.10	91.21	91.03	91.62
PSA (CKPA₁)	92.41	92.53	92.41	92.20	93.52
ISA (CKPA₁)	92.73	92.62	92.73	92.50	91.41
TASE (CKPA₂)	93.49	93.38	93.49	93.24	93.25

Table 14.

Performance evaluation of TASE on independent test data in CKPA_Layer 2.

Algorithm	A	P	R	F1	S
SSA (CKPA₁)	93.60	93.35	93.60	93.19	89.10
PSA (CKPA₁)	93.96	93.59	93.96	93.71	90.09
ISA (CKPA₁)	93.84	93.48	93.84	93.59	89.13
TASE (CKPA₂)	94.44	94.13	94.44	94.24	92.04

Table 13 reveals the comparative performance of the three pre-final architectures before final ensembling. The Serial Stacked Attention (SSA) achieved an accuracy of 91.21%, with precision of 91.10%, recall of 91.21%, F1-score of 91.03%, and specificity of 91.62%. The Parallel Stacked Attention (PSA) attained 92.41% accuracy, precision of 92.53%, recall of 92.41%, F1-score of 92.20%, and specificity of 93.52%. The Independent Stacked Attention (ISA) demonstrated competitive results with 92.73% accuracy, precision of 92.62%, recall of 92.73%, F1-score of 92.50%, and specificity of 91.41%. These metrics establish the baseline performance of individual components prior to their integration in the final ensemble layer.
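The five reported metrics can all be derived from a multi-class confusion matrix. The sketch below assumes support-weighted averaging over classes, which this section does not state explicitly; that assumption is consistent with recall matching accuracy in nearly every table row, because support-weighted recall is mathematically equal to accuracy. The function name `weighted_metrics` and the toy matrix are hypothetical.

```python
def weighted_metrics(cm):
    """Support-weighted metrics from cm[i][j] = count of true class i predicted as j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    sums = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "specificity": 0.0}
    for i in range(n_classes):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(cm[r][i] for r in range(n_classes)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        specificity = tn / (tn + fp) if tn + fp else 0.0
        weight = (tp + fn) / total  # class support as a fraction of all samples
        for name, value in (("precision", precision), ("recall", recall),
                            ("f1", f1), ("specificity", specificity)):
            sums[name] += weight * value
    sums["accuracy"] = sum(cm[i][i] for i in range(n_classes)) / total
    return sums

# Toy two-class matrix (made-up counts, not taken from the paper).
toy_cm = [[8, 2], [1, 9]]
metrics = weighted_metrics(toy_cm)
```

With a different averaging scheme (for example macro averaging) recall and accuracy would generally differ, so the choice of averaging matters when comparing against the tables.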

The effectiveness of the TASE ensemble was evident as the combined architecture surpassed all preceding layer-specific models in accuracy, precision, recall, and F1-score, while its specificity (93.25%) was slightly below that of PSA (93.52%). The TASE (CKPA₂) ensemble on the test set improved upon pre-final results, attaining 93.49% accuracy (precision 93.38%, recall 93.49%, F1-score 93.24%, specificity 93.25%), demonstrating balanced classification performance and confirming the model’s ability to correctly identify the majority of samples.

Table 14 presents the performance of TASE architectures on independent test data, providing a rigorous assessment of generalization capability on completely unseen data. The pre-final architectures maintained strong performance, with SSA achieving 93.60% accuracy, PSA achieving 93.96%, and ISA achieving 93.84%. This consistency confirms the robustness of each attention mechanism when applied to unseen data.

The final TASE ensemble demonstrated the highest performance on independent data, attaining 94.44% accuracy, precision of 94.13%, recall of 94.44%, F1-score of 94.24%, and specificity of 92.04%. This represents a measurable improvement over any single attention mechanism, indicating that hierarchical ensembling effectively captures the complementary strengths of SSA, PSA, and ISA while maintaining balanced performance across all evaluation metrics.

These findings collectively illustrated that the CKPA framework’s hierarchical integration of attention mechanisms offered significant advantages. By systematically combining SSA, PSA, and ISA through layered ensembling, the final TASE architecture achieved superior results, outperforming any individual attention mechanism or standalone stacking configuration. The consistent performance observed across both the test and independent test sets confirmed the model’s robustness and generalization capability for complex classification tasks.

Importantly, accuracy on the unseen independent data (94.44%) exceeded that on the standard test set (93.49%), although specificity was slightly lower (92.04% versus 93.25%). This supports the architecture’s reliability and suitability for real-world skin lesion identification.

Results with confidence interval

The comparative performance of the pre-final and final layers of the proposed architecture is summarized in Table 15, where each metric is reported with its corresponding 95% confidence interval (CI), computed from the test-set size.

Table 15.

Performance Metrics with 95% Confidence Intervals (Corrected using Independent Test Results).

Algorithm	A	P	R	F1	S
CKPA₁ (SSA)	93.60 ± 1.12	93.35 ± 1.32	93.60 ± 1.52	93.19 ± 1.12	89.10 ± 2.52
CKPA₁ (PSA)	93.96 ± 1.22	93.59 ± 1.32	93.96 ± 1.32	93.71 ± 1.84	90.09 ± 1.21
CKPA₁ (ISA)	93.84 ± 1.87	93.48 ± 1.21	93.84 ± 1.68	93.59 ± 1.13	89.13 ± 2.54
CKPA₂ (TASE)	94.44 ± 1.09	94.13 ± 1.45	94.44 ± 1.04	94.24 ± 1.33	92.04 ± 1.22
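This section does not state how the intervals in Table 15 were derived. One common choice for a proportion-type metric is the normal-approximation (Wald) interval, sketched below under that assumption; the function name `wald_interval` and the example values are hypothetical, and the sketch is not claimed to reproduce the half-widths in the table.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    p_hat: observed proportion (e.g. accuracy as a fraction)
    n:     number of test samples
    z:     critical value (1.96 for a 95% interval)
    """
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical example: 90% accuracy measured on 100 test images.
low, high = wald_interval(0.90, 100)
```

Other constructions (Wilson intervals or bootstrap resampling) would give somewhat different widths, particularly for proportions close to 1 or for small classes.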

Performance analysis by visualization

To streamline the analysis, confusion matrices were not presented for every classifier due to model diversity. Instead, we focused on the final layer of the CKPA model, with confusion matrices shown in Figures 10 and 11, highlighting per-class accuracy and misclassification patterns.

Figure 10.

Confusion matrix and ROC-AUC curve obtained by TASE architecture in CKPA-Layer 2.

Figure 11.

Confusion matrix and ROC-AUC curve obtained by TASE architecture in CKPA-Layer 2 on independent test data.

Similarly, ROC-AUC curves were analyzed to provide further insight into model performance. Following the same approach, ROC-AUC curves were presented only for the Multi-Layer CKPA model in Figures 10 and 11 for consistency.

The final TASE architecture (CKPA₂) demonstrated strong performance across multiple classes. For example, the BCC class achieved perfect classification, correctly identifying all 49 samples, while the DF class correctly classified 8 out of 11 samples, with 3 misclassifications. The NV class performed exceptionally, correctly classifying 600 out of 605 samples, demonstrating robust handling of the majority class. In the AK class, 18 samples were correctly classified with 13 errors, whereas the VASC class achieved 12 correct predictions with 2 misclassifications. The BKL class correctly identified 93 out of 104 samples, and even the most challenging MEL class achieved 77 correct classifications, with 31 errors. Overall, TASE exhibited high accuracy on most categories, although a substantial share of the AK and MEL samples was misclassified.
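The per-class counts quoted above translate directly into per-class recall. The short calculation below uses only the counts stated in the text; the class totals for AK, VASC, and MEL are obtained by adding the reported correct and incorrect counts.

```python
# (correct, total) per class, as reported in the text for the test set.
counts = {
    "BCC": (49, 49),
    "DF": (8, 11),
    "NV": (600, 605),
    "AK": (18, 18 + 13),
    "VASC": (12, 12 + 2),
    "BKL": (93, 104),
    "MEL": (77, 77 + 31),
}

per_class_recall = {name: correct / total
                    for name, (correct, total) in counts.items()}

# List classes from hardest to easiest.
for name, value in sorted(per_class_recall.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.1%}")
```

Sorting by recall makes it easy to see which lesion types contribute most of the errors, which an aggregate accuracy figure can hide when one class dominates the test set.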

The ROC-AUC scores further validated the effectiveness of TASE (CKPA₂). The MEL class, with the lowest AUC, still achieved 0.973, indicating strong discriminative ability. The DF and VASC classes attained perfect AUC scores of 1, while the other classes maintained consistently high scores near 0.99. These consistently strong ROC-AUC values underscore the precision, stability, and robustness of the TASE architecture.

The TASE (CKPA₂) architecture ultimately exhibited outstanding performance on independent test data, outperforming all preceding layers across every class.

For the VASC class, near-perfect classification was observed, with 8 out of 9 samples correctly identified, resulting in a single misclassification, and an AUC score of 1. The DF class had 4 of 6 samples classified correctly, with 2 errors, while still achieving a high AUC of 0.992.

The AK class displayed a nearly balanced outcome, with 12 correct classifications against 11 misclassifications, while still maintaining a strong ROC-AUC score of 0.990. The NV class performed remarkably well, correctly identifying 658 out of 663 samples and achieving an AUC of 0.988, demonstrating reliable classification for both majority and minority class samples.

For the BKL class, 55 out of 66 samples were accurately classified, although it recorded the lowest AUC value of 0.977. The BCC class achieved similarly strong results, correctly identifying 21 of 26 samples and attaining a high AUC of 0.996.

The MEL class, which remained the most challenging, still managed 16 correct classifications out of 34 samples, with an improved AUC score of 0.980 compared to previous architectures.

In summary, the TASE (CKPA₂) model achieved high overall accuracy and consistently high AUC values (0.977 or above) across all lesion categories, although the number of correct classifications remained low for the MEL and AK classes.

Gradient class activation map (GradCAM) for interpretability

To enhance the interpretability of the proposed TASE model, Gradient-weighted Class Activation Mapping (GradCAM) was employed. GradCAM highlights the most critical regions of input images that drive the model’s predictions, providing insights into its decision-making process. The last convolutional layer was selected for generating activation maps, as it captures high-level spatial features essential for accurate classification.

The GradCAM procedure is illustrated in Figure 12. Gradients of class-specific outputs with respect to the chosen convolutional layer’s activations were computed using TensorFlow’s GradientTape. These gradients were globally averaged to produce neuron importance weights, which were then multiplied with the corresponding feature maps. The resulting weighted combination was passed through a ReLU activation to generate the final class-specific heatmap. This heatmap was upsampled to match the input image dimensions and overlaid on the original image, visually highlighting regions that contributed most to the TASE (CKPA₂) model’s predictions.

Figure 12.

Step by step implementation of gradient class activation map.

The computed gradients were spatially pooled by averaging over each feature map channel, providing a measure of their importance for the target class. These pooled gradients were applied as weights to the activation maps of the final convolutional layer, and the resulting weighted activations were aggregated to generate a class-specific activation heatmap. The heatmap was normalized to a [0,1] range for clearer visualization and then overlaid on the original input image using a colormap, highlighting regions that most influenced the TASE (CKPA₂) model’s classification decisions.
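The weighting and normalization steps described above can be sketched independently of the deep learning framework. In the described pipeline the activations and gradients would come from TensorFlow’s GradientTape; here plain nested lists of shape [H][W][C] stand in for them so that the arithmetic is easy to follow. The function name `grad_cam_heatmap` and the toy values are hypothetical, and upsampling and colormap overlay are omitted.

```python
def grad_cam_heatmap(activations, gradients):
    """Grad-CAM heatmap from one layer's activations and class-score gradients.

    Both inputs are nested lists indexed as [row][col][channel].
    """
    h, w, c = len(activations), len(activations[0]), len(activations[0][0])
    # 1. Global-average-pool the gradients: one importance weight per channel.
    weights = [
        sum(gradients[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]
    # 2. Weighted sum of the feature maps, then ReLU to keep positive evidence.
    heatmap = [
        [max(0.0, sum(weights[k] * activations[i][j][k] for k in range(c)))
         for j in range(w)]
        for i in range(h)
    ]
    # 3. Normalize to [0, 1] for visualization (skipped if the map is all zero).
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap

# Toy 1x2 feature map with two channels (made-up values).
toy_activations = [[[1.0, 2.0], [3.0, 0.0]]]
toy_gradients = [[[1.0, -1.0], [1.0, -1.0]]]
toy_heatmap = grad_cam_heatmap(toy_activations, toy_gradients)
```

In the toy example the second channel receives a negative weight, so the first location is suppressed by the ReLU and only the second location remains highlighted.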

To assess model attention across different categories, GradCAM visualizations were produced for representative samples from each class. These heatmaps demonstrated that the model effectively focused on salient regions, such as lesions in medical images, confirming its ability to extract meaningful and discriminative features.

Despite its usefulness, GradCAM has limitations. Because it relies on model predictions, misclassifications can yield misleading heatmaps. Additionally, for complex or subtle patterns—such as ambiguous skin lesions—GradCAM may occasionally highlight irrelevant areas, potentially reducing interpretability. These limitations emphasize the importance of complementing GradCAM with rigorous quantitative evaluation to ensure reliable and actionable model insights.

In Figure 13, GradCAM visualizations are shown for all seven classes, demonstrating how the TASE (CKPA₂) model focused on the most discriminative regions rather than the entire image. This targeted attention improved classification accuracy and illustrated the effectiveness of our approach. The figure displays the original image alongside the corresponding GradCAM and Region of Interest (ROI), enhancing interpretability of the model’s decisions.

Figure 13.

GradCAM visualization for each class.

GradCAM also served as a tool to validate model reliability. When the heatmap aligned with the relevant region, it indicated accurate classification, whereas misaligned heatmaps often revealed misclassifications. By integrating multiple models in the CKPA ensemble, the final predictions achieved higher accuracy. The GradCAM visualizations confirmed that the ensemble effectively mitigated the limitations of individual classifiers, emphasizing the robustness of the Multi-Layer CKPA framework and its ability to produce precise predictions even in challenging cases, thereby reinforcing the strength of our methodology.

Ablation study

To quantify the contribution of the two key innovations of our approach, Triple-Attention (TA) and Cohen’s Kappa Proportioned Averaging (CKPA), we conducted an ablation study. We evaluated the performance impact of these components by analyzing the results with and without their utilization.

Utilization of CKPA without TA

We applied CKPA across all variants of EfficientNet v2 models at different levels as previously mentioned. Each model was tested in four configurations: three with TA and one without attention modules. To highlight the efficacy of TA, we presented the results in Table 16, showcasing the performance of CKPA excluding the TA-integrated models and comparing them with our proposed architecture.

Table 16.

Performance metrics of CKPA without TA.

Algorithm	A	P	R	F1	S
C_ENv2b0	91.91	91.51	91.91	91.47	90.10
C_ENv2b1	92.10	91.85	92.10	91.80	90.40
C_ENv2b2	92.35	92.05	92.35	92.00	90.70
C_ENv2b3	92.55	92.20	92.55	92.15	91.00
C_ENv2bL	92.80	92.50	92.80	92.45	91.30
C_ENv2bM	93.00	92.75	93.00	92.70	91.60
C_ENv2bS	93.20	92.95	93.20	92.90	91.90
CKPA_TASE	93.80	93.60	93.75	93.50	92.00
Ours	94.44	94.13	94.44	94.24	92.04

Our proposed Cohen’s Kappa Proportioned Averaging (CKPA) demonstrated superior performance when enhanced with Triple-Attention (TA) compared to CKPA without TA. By incorporating TA into CKPA, our approach, presented as “Ours” in Table 16, achieved the highest accuracy of 94.44%, surpassing all other configurations, including the TA-free CKPA ensemble (93.80%). This gain of 0.64 percentage points underscores the efficacy of TA in refining CKPA’s ability to aggregate model predictions, leading to more accurate and reliable outcomes.

Without TA, various levels of CKPA implementations using EfficientNet v2 variants yielded commendable results. Among them, C_ENv2bS stood out with a 93.20% accuracy, followed by C_ENv2bM at 93.00%. Despite their strong performances, none matched the enhanced accuracy achieved by incorporating TA. This comparison clearly illustrates that CKPA, when paired with TA, offers a more robust ensembling technique, pushing the boundaries of model performance and accuracy beyond existing methods.

Utilization of conventional ensemble methods instead of CKPA

As previously described, CKPA was applied at multiple levels using a distinct approach within the proposed framework. Predictions from different TASE variants incorporating Triple-Attention (TA) mechanisms were ensembled by determining optimal weights across all models as well as for the top-performing subset. Specifically, CKPA utilizing all classifiers at level i is denoted as CKPA_i. To highlight the effectiveness of CKPA, its performance was compared with conventional ensemble methods, including Softmax Averaging (SA), Majority Voting (MV), and Weighted Averaging (WA) with randomly assigned weights. The comparative results are presented in this section.
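For reference, the three conventional baselines can be sketched as follows. This is an illustrative sketch under simple assumptions, not the code used in the study: each model outputs a per-class probability vector per sample, majority-vote ties go to the label encountered first in model order, and the WA weights shown are arbitrary. All function names and numbers are hypothetical.

```python
from collections import Counter

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def weighted_average(prob_sets, weights):
    """Weighted Averaging (WA): weighted mean of class probabilities."""
    total = sum(weights)
    n_classes = len(prob_sets[0][0])
    return [
        [sum(w * p[c] for w, p in zip(weights, sample)) / total
         for c in range(n_classes)]
        for sample in zip(*prob_sets)  # one probability vector per model
    ]

def softmax_average(prob_sets):
    """Softmax Averaging (SA): WA with equal weights."""
    return weighted_average(prob_sets, [1.0] * len(prob_sets))

def majority_vote(prob_sets):
    """Majority Voting (MV): most frequent predicted label per sample."""
    votes_per_model = [[argmax(p) for p in model] for model in prob_sets]
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*votes_per_model)]

# Toy example: three models, two samples, two classes (made-up numbers).
m1 = [[0.90, 0.10], [0.40, 0.60]]
m2 = [[0.40, 0.60], [0.45, 0.55]]
m3 = [[0.45, 0.55], [0.30, 0.70]]
models = [m1, m2, m3]

sa_labels = [argmax(p) for p in softmax_average(models)]
mv_labels = majority_vote(models)
wa_labels = [argmax(p) for p in weighted_average(models, [1.0, 3.0, 3.0])]
```

On the first toy sample SA follows the single confident model while MV follows the two weaker votes, which illustrates why the choice of fusion rule and weights can change the ensemble decision.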

Softmax averaging (SA)

As shown in Table 17, Softmax Averaging (SA) using all classifiers at level i is denoted as SA_i.

Table 17.

Performance metrics of Softmax Averaging of all classifiers.

Algorithm	A	P	R	F1	S
SAI_ENv2b0	90.58	89.90	90.58	90.12	82.25
CAI_ENv2b0	90.70	89.86	90.70	90.05	79.89
SEAI_ENv2b0	91.55	91.02	91.55	91.01	85.16
IA_ENv2b0 (SA₀)	91.20	90.80	91.20	90.90	84.50
SAI_ENv2b1	92.27	91.86	92.27	91.99	89.04
CAI_ENv2b1	90.70	90.25	90.70	90.29	85.66
SEAI_ENv2b1	91.55	91.03	91.55	91.21	86.62
IA_ENv2b1 (SA₀)	92.15	91.75	92.15	91.90	88.85
SAI_ENv2b2	92.27	92.20	92.27	91.998	87.13
CAI_ENv2b2	90.70	90.06	90.70	90.30	82.75
SEAI_ENv2b2	91.18	91.13	91.18	90.97	87.57
IA_ENv2b2 (SA₀)	92.15	92.05	92.15	91.90	87.40
SAI_ENv2b3	92.03	91.72	92.03	91.71	85.68
CAI_ENv2b3	91.67	91.51	91.67	91.52	86.64
SEAI_ENv2b3	91.43	91.18	91.43	91.24	87.13
IA_ENv2b3 (SA₀)	91.90	91.70	91.90	91.65	86.00
SAI_ENv2L	90.94	90.63	90.94	90.68	86.10
CAI_ENv2L	92.63	92.82	92.63	92.61	91.90
SEAI_ENv2L	89.98	91.13	89.98	90.39	91.32
IA_ENv2L (SA₀)	92.50	92.50	92.50	92.40	90.50
SAI_ENv2M	91.67	91.34	91.67	91.42	87.10
CAI_ENv2M	92.27	91.83	92.27	91.93	86.64
SEAI_ENv2M	92.03	91.80	92.03	91.55	88.02
IA_ENv2M (SA₀)	92.20	91.75	92.20	91.80	86.50
SAI_ENv2S	91.43	91.13	91.43	91.20	87.07
CAI_ENv2S	92.27	91.89	92.27	91.97	84.78
SEAI_ENv2S	90.82	90.90	90.82	90.33	86.06
IA_ENv2S (SA₀)	92.50	92.20	92.50	92.30	85.90
ISA (SA₁)	93.00	93.30	93.70	93.50	91.80
SSA (SA₁)	91.10	91.00	91.10	90.95	91.90
PSA (SA₁)	92.30	92.45	92.30	92.15	91.90
TASE (SA₂)	93.40	93.30	93.40	93.20	91.95
Ours	94.44	94.13	94.44	94.24	92.04

Table 17 presents the performance metrics of different classifiers using the Softmax Averaging (SA) technique at multiple levels, compared with the proposed Cohen’s Kappa Proportioned Averaging (CKPA) approach. The final CKPA ensemble (“Ours”) exceeds every SA-based configuration on all five evaluation metrics, although the specificity margin over SA₂ is small (92.04% versus 91.95%).

In terms of accuracy, the highest result achieved by the SA-based approaches was 93.40%, obtained by the TASE model at the second level (SA₂). In contrast, the proposed method achieved an accuracy of 94.44%, showing a clear improvement over the best SA-based performance.

A similar trend is observed for precision. The highest precision among the SA-based methods was 93.30%, while the proposed method achieved 94.13%, indicating improved reliability in positive predictions and reduced false positives.

For recall, the best SA-based performance reached 93.70% (ISA at SA₁ level), whereas the proposed method achieved 94.44%, demonstrating its enhanced capability in correctly identifying positive instances.

The F1-score further confirms the superiority of the proposed approach. While the highest F1-score among SA-based methods was 93.50%, the proposed method achieved 94.24%, reflecting a better balance between precision and recall.

Overall, the proposed CKPA method consistently delivers superior performance compared to conventional Softmax Averaging. These results highlight the robustness and effectiveness of CKPA as an advanced ensemble strategy for improving classification performance.

Majority voting (MV)

As shown in Table 18, Majority Voting (MV) using all classifiers at level i is denoted as MV_i.

Table 18.

Performance metrics of Majority Voting of all classifiers.

Algorithm	A	P	R	F1	S
SAI_ENv2b0	90.58	89.90	90.58	90.12	82.25
CAI_ENv2b0	90.70	89.86	90.70	90.05	79.89
SEAI_ENv2b0	91.55	91.02	91.55	91.01	85.16
IA_ENv2b0 (MV₀)	91.20	90.80	91.20	90.90	84.50
SAI_ENv2b1	92.27	91.86	92.27	91.99	89.04
CAI_ENv2b1	90.70	90.25	90.70	90.29	85.66
SEAI_ENv2b1	91.55	91.03	91.55	91.21	86.62
IA_ENv2b1 (MV₀)	92.15	91.75	92.15	91.90	88.85
SAI_ENv2b2	92.27	92.20	92.27	91.998	87.13
CAI_ENv2b2	90.70	90.06	90.70	90.30	82.75
SEAI_ENv2b2	91.18	91.13	91.18	90.97	87.57
IA_ENv2b2 (MV₀)	92.15	92.05	92.15	91.90	87.40
SAI_ENv2b3	92.03	91.72	92.03	91.71	85.68
CAI_ENv2b3	91.67	91.51	91.67	91.52	86.64
SEAI_ENv2b3	91.43	91.18	91.43	91.24	87.13
IA_ENv2b3 (MV₀)	91.90	91.70	91.90	91.65	86.00
SAI_ENv2L	90.94	90.63	90.94	90.68	86.10
CAI_ENv2L	92.63	92.82	92.63	92.61	91.90
SEAI_ENv2L	89.98	91.13	89.98	90.39	91.32
IA_ENv2L (MV₀)	92.50	92.50	92.50	92.40	90.50
SAI_ENv2M	91.67	91.34	91.67	91.42	87.10
CAI_ENv2M	92.27	91.83	92.27	91.93	86.64
SEAI_ENv2M	92.03	91.80	92.03	91.55	88.02
IA_ENv2M (MV₀)	92.20	91.75	92.20	91.80	86.50
SAI_ENv2S	91.43	91.13	91.43	91.20	87.07
CAI_ENv2S	92.27	91.89	92.27	91.97	84.78
SEAI_ENv2S	90.82	90.90	90.82	90.33	86.06
IA_ENv2S (MV₀)	92.50	92.20	92.50	92.30	85.90
ISA (MV₁)	93.00	93.30	93.70	93.50	91.80
SSA (MV₁)	91.10	91.00	91.10	90.95	91.90
PSA (MV₁)	92.30	92.45	92.30	92.15	91.90
TASE (MV₂)	93.40	93.30	93.40	93.20	91.95
Ours	94.44	94.13	94.44	94.24	92.04

Table 18 presents the performance metrics of different classifiers using the Majority Voting (MV) technique at multiple levels, compared with the proposed Cohen’s Kappa Proportioned Averaging (CKPA) approach. The final CKPA ensemble (“Ours”) exceeds every MV-based configuration on all five evaluation metrics, although the specificity margin over MV₂ is small (92.04% versus 91.95%).

In terms of accuracy, the highest result achieved by the MV-based approaches was 93.40%, obtained by the TASE model at the second level (MV₂). In contrast, the proposed method achieved an accuracy of 94.44%, indicating a clear improvement over the best MV-based performance.

A similar trend is observed for precision. The highest precision among the MV-based methods was 93.30%, while the proposed method achieved 94.13%, demonstrating improved reliability in positive predictions and reduced false positives.

For recall, the best MV-based performance reached 93.70% (ISA at MV₁ level), whereas the proposed method achieved 94.44%, showing its enhanced capability in correctly identifying positive instances.

The F1-score further highlights the superiority of the proposed approach. While the highest F1-score among MV-based methods was 93.50% (ISA at MV₁ level), the proposed method achieved 94.24%, reflecting a better balance between precision and recall.

Overall, the proposed CKPA method consistently delivers superior performance compared to the conventional Majority Voting technique. These results emphasize the robustness and effectiveness of CKPA as an advanced ensemble strategy for improving classification performance.

Weighted averaging (WA)

As depicted in Table 19, Weighted Averaging (WA) with all classifiers at level i is denoted as WA_i. The weights were assigned randomly, subject to the constraint that they follow the classifiers’ relative performance: higher weights were allocated to better-performing models and comparatively lower weights to weaker models, to ensure a balanced contribution during ensemble prediction.

Table 19.

Performance metrics of weighted averaging of all classifiers.

Algorithm	A	P	R	F1	S
SAI_ENv2b0	90.58	89.90	90.58	90.12	82.25
CAI_ENv2b0	90.70	89.86	90.70	90.05	79.89
SEAI_ENv2b0	91.55	91.02	91.55	91.01	85.16
IA_ENv2b0 (WA₀)	91.20	90.80	91.20	90.90	84.50
SAI_ENv2b1	92.27	91.86	92.27	91.99	89.04
CAI_ENv2b1	90.70	90.25	90.70	90.29	85.66
SEAI_ENv2b1	91.55	91.03	91.55	91.21	86.62
IA_ENv2b1 (WA₀)	92.15	91.75	92.15	91.90	88.85
SAI_ENv2b2	92.27	92.20	92.27	91.998	87.13
CAI_ENv2b2	90.70	90.06	90.70	90.30	82.75
SEAI_ENv2b2	91.18	91.13	91.18	90.97	87.57
IA_ENv2b2 (WA₀)	92.15	92.05	92.15	91.90	87.40
SAI_ENv2b3	92.03	91.72	92.03	91.71	85.68
CAI_ENv2b3	91.67	91.51	91.67	91.52	86.64
SEAI_ENv2b3	91.43	91.18	91.43	91.24	87.13
IA_ENv2b3 (WA₀)	91.90	91.70	91.90	91.65	86.00
SAI_ENv2L	90.94	90.63	90.94	90.68	86.10
CAI_ENv2L	92.63	92.82	92.63	92.61	91.90
SEAI_ENv2L	89.98	91.13	89.98	90.39	91.32
IA_ENv2L (WA₀)	92.50	92.50	92.50	92.40	90.50
SAI_ENv2M	91.67	91.34	91.67	91.42	87.10
CAI_ENv2M	92.27	91.83	92.27	91.93	86.64
SEAI_ENv2M	92.03	91.80	92.03	91.55	88.02
IA_ENv2M (WA₀)	92.20	91.75	92.20	91.80	86.50
SAI_ENv2S	91.43	91.13	91.43	91.20	87.07
CAI_ENv2S	92.27	91.89	92.27	91.97	84.78
SEAI_ENv2S	90.82	90.90	90.82	90.33	86.06
IA_ENv2S (WA₀)	92.50	92.20	92.50	92.30	85.90
ISA (WA₁)	93.00	93.30	93.70	93.50	91.80
SSA (WA₁)	91.10	91.00	91.10	90.95	91.90
PSA (WA₁)	92.30	92.45	92.30	92.15	91.90
TASE (WA₂)	93.40	93.30	93.40	93.20	91.95
Ours	94.44	94.13	94.44	94.24	92.04

Table 19 compares the performance metrics of various classifiers using the Weighted Averaging (WA) technique at different levels with our proposed Cohen’s Kappa Proportioned Averaging (CKPA) approach. The final CKPA ensemble (“Ours”) exceeds every WA-based configuration on all five evaluation metrics, although the specificity margin over WA₂ is small (92.04% versus 91.95%).

Regarding accuracy, the highest value achieved by the WA-based methods was 93.40%, obtained by the TASE model at the second level (WA₂). In contrast, the proposed method achieved an accuracy of 94.44%, surpassing the best WA result. This improvement highlights the superior capability of CKPA in correctly classifying instances.

Precision also favors CKPA. The highest precision among the WA methods was 93.30%, achieved by the TASE model at the second level (WA₂), whereas the proposed method achieved a precision of 94.13%. This indicates that CKPA is more effective in reducing false positives compared to conventional WA.

For recall, the best performance among WA methods was 93.70%, achieved by ISA at the first level (WA₁). The proposed method achieved a recall of 94.44%, outperforming the best WA result, which reflects its improved ability to correctly identify positive instances.
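One detail is worth keeping in mind when reading the recall column: if per-class recall is averaged with weights proportional to class support, the result is algebraically equal to overall accuracy. The short sketch below checks this on a made-up confusion matrix. Whether Table 19 uses support-weighted averaging is an assumption here, not something this section states, and the matrix values are purely illustrative.

```python
def accuracy(conf):
    """Overall accuracy from a confusion matrix (rows: true class, columns: predicted)."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[k][k] for k in range(len(conf)))
    return correct / total


def weighted_recall(conf):
    """Support-weighted mean of per-class recall.

    Assumes every class has at least one true sample (non-zero row sum).
    """
    total = sum(sum(row) for row in conf)
    result = 0.0
    for k, row in enumerate(conf):
        support = sum(row)
        class_recall = row[k] / support
        result += (support / total) * class_recall
    return result


# Made-up 3-class confusion matrix, for illustration only.
toy_conf = [
    [50, 3, 2],
    [4, 20, 1],
    [2, 1, 7],
]

acc = accuracy(toy_conf)
w_rec = weighted_recall(toy_conf)
print(f"accuracy={acc:.4f}, weighted recall={w_rec:.4f}")
```

The support terms cancel in each summand, leaving the sum of the diagonal divided by the total, which is the definition of accuracy.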

The F1-score further demonstrates the superiority of the proposed method. While the highest F1-score among WA-based approaches was 93.50% (achieved by ISA at WA₁ level), the proposed method achieved 94.24%, indicating a better balance between precision and recall.

In conclusion, the proposed CKPA approach consistently achieves superior performance across all evaluation metrics compared to the traditional Weighted Averaging technique. By addressing the limitations of WA, CKPA provides a more robust and reliable ensemble strategy for improving classification performance.
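The weighting idea behind CKPA can be illustrated with a minimal sketch. It assumes that each base model's weight is its Cohen's kappa on held-out data, clipped at zero and normalized so that the weights sum to one. The exact CKPA formulation is the one given in the methodology; the function names and toy probabilities below are illustrative only.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate single-class case (handling is an assumption)
    return (observed - expected) / (1.0 - expected)


def kappa_proportioned_weights(kappas):
    """Assumed weighting rule: kappas clipped at zero, then normalized to sum to one."""
    clipped = [max(k, 0.0) for k in kappas]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(kappas)] * len(kappas)  # fall back to a plain average
    return [k / total for k in clipped]


def fuse(prob_sets, weights):
    """Weighted average of each model's class-probability vector, sample by sample."""
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    return [
        [sum(w * probs[i][c] for w, probs in zip(weights, prob_sets))
         for c in range(n_classes)]
        for i in range(n_samples)
    ]


def argmax(vector):
    return max(range(len(vector)), key=vector.__getitem__)


# Toy labels and softmax outputs of two hypothetical base models.
y_val = [0, 1, 2, 1, 0, 2]
probs_a = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8],
           [0.2, 0.7, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
probs_b = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6],
           [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.1, 0.6, 0.3]]

preds_a = [argmax(p) for p in probs_a]
preds_b = [argmax(p) for p in probs_b]
kappas = [cohen_kappa(y_val, preds_a), cohen_kappa(y_val, preds_b)]
weights = kappa_proportioned_weights(kappas)
fused_preds = [argmax(p) for p in fuse([probs_a, probs_b], weights)]
print(kappas, weights, fused_preds)
```

In this toy case the stronger model receives two thirds of the weight, and the fused prediction recovers every label. In practice the kappa scores would be computed on validation data and the resulting weights applied to test-set predictions.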

Based on the overall experimental analysis, it is evident that the integration of Triple-Attention (TA) and CKPA forms an effective and optimized architecture compared to existing ensemble methods.

Answers to the research questions

Answer to RQ1

To address pronounced class imbalance, this study evaluated four augmentation strategies: No Augmentation (NA), which preserved the original dataset but risked underperformance on minority classes; Prior Augmentation (PiA), which generated synthetic samples before dataset splitting and could therefore introduce data leakage; Training Data Augmentation (TA), which augmented only the training set to maintain the independence of the validation and test sets; and Posterior Augmentation (AP), which expanded all subsets but could bias evaluation metrics. Among these, Training Data Augmentation proved most effective, synthesizing minority-class examples while keeping the validation and test sets untouched, which supports robust generalization. Combining Training Data Augmentation with the TASE architectures further enhanced model performance, establishing it as the preferred strategy for handling imbalance while upholding strict evaluation integrity.
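The ordering that distinguishes training-only augmentation from the leakage-prone alternatives can be sketched as follows: the held-out data are fixed first, and only then are minority classes in the training portion oversampled. This is a minimal illustration under stated assumptions. The `augment` callable stands in for any image transform, the balancing target (the majority-class count) is an assumption, and the sample names are placeholders rather than actual dataset files.

```python
import random
from collections import Counter, defaultdict


def stratified_split(samples, held_out_fraction, rng):
    """Split (item, label) pairs class by class, before any augmentation."""
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append((item, label))
    train, held_out = [], []
    for group in by_class.values():
        rng.shuffle(group)
        cut = max(1, int(len(group) * held_out_fraction))
        held_out.extend(group[:cut])
        train.extend(group[cut:])
    return train, held_out


def augment_training_only(train, augment, rng):
    """Oversample each minority class up to the majority-class count.

    New samples are derived exclusively from training items, so the
    held-out data never influence what the model sees during training.
    """
    counts = Counter(label for _, label in train)
    target = max(counts.values())
    by_class = defaultdict(list)
    for item, label in train:
        by_class[label].append(item)
    balanced = list(train)
    for label, items in by_class.items():
        for _ in range(target - counts[label]):
            balanced.append((augment(rng.choice(items)), label))
    return balanced


# Placeholder data: 20 majority-class and 5 minority-class items.
rng = random.Random(0)
samples = [(f"nv_{i}", "nv") for i in range(20)] + [(f"mel_{i}", "mel") for i in range(5)]

train, held_out = stratified_split(samples, held_out_fraction=0.2, rng=rng)
balanced_train = augment_training_only(train, augment=lambda name: name + "_aug", rng=rng)

print(Counter(label for _, label in balanced_train))
print(len(held_out), "held-out samples, none augmented")
```

Reversing the two steps (augmenting first, splitting second) would allow transformed copies of one image to land on both sides of the split, which is the leakage risk attributed to Prior Augmentation above.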

Answer to RQ2

To optimize Transfer Learning (TL) models for the target classification task, diverse variants of EfficientNetV2 (B3, S, M, and L) were employed, enabling exploration of architectural strengths tailored to the dataset. Final layers were replaced and trained to align with task requirements, ensuring adaptability while preserving pre-trained feature extraction capabilities. Parameter quantization improved computational efficiency without sacrificing performance. Coupling TASE with Triple-Attention (TA) further strengthened model robustness. This dual strategy, which leverages architectural diversity for comprehensive feature representation and precise fine-tuning for task-specific optimization, ensured high adaptability and efficiency. Balancing pre-trained knowledge with domain-specific adjustments achieved superior results, demonstrating that optimal TL performance relies on strategic architecture selection, parameter-efficient training, and context-aware adaptation.

Answer to RQ3

The Triple-Attention (TA) mechanism integrated into the TASE model combined three complementary attention modules, namely Soft Attention Integration (SAI), Channel Attention Integration (CAI), and Squeeze-Excitation Attention Integration (SEAI), to hierarchically capture and prioritize critical features. SAI highlighted essential spatial regions in feature maps, focusing the model on discriminative local patterns. CAI refined channel-wise importance, amplifying informative channels while suppressing irrelevant ones. SEAI adaptively recalibrated channel responses through squeeze-and-excitation operations, enabling context-aware feature enhancement. Together, these modules mitigated the risk of overlooking hierarchical dependencies or overfitting. By synergistically balancing spatial focus, channel relevance, and cross-layer contextualization, the TA-equipped TASE model outperformed non-attention baselines in ablation studies, achieving precise localization of significant regions and robust feature discrimination, ensuring interpretable and generalizable feature extraction.
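The channel recalibration performed by a squeeze-and-excitation module can be sketched in plain Python. The sketch follows the standard squeeze-and-excitation formulation (global average pooling, a bottleneck layer with ReLU, sigmoid gates, and channel-wise rescaling). The SEAI module used in TASE may differ in detail, and the weights below are hand-picked for illustration rather than learned.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one scalar descriptor per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def dense(vector, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def excite(descriptor, w1, b1, w2, b2):
    """Bottleneck with ReLU, then sigmoid gates in (0, 1), one per channel."""
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]


def se_block(feature_map, w1, b1, w2, b2):
    """Rescale every channel of the feature map by its gate."""
    gates = excite(squeeze(feature_map), w1, b1, w2, b2)
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature_map)]
    return scaled, gates


# Toy feature map with 2 channels of size 2x2; illustrative weights (not learned).
fmap = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
w1, b1 = [[0.5, 0.5]], [0.0]             # 2 channels -> 1 hidden unit
w2, b2 = [[-1.0], [1.0]], [0.0, 0.0]     # 1 hidden unit -> 2 channel gates

recalibrated, gates = se_block(fmap, w1, b1, w2, b2)
print([round(g, 4) for g in gates])
```

With these weights the second channel is amplified relative to the first, which is the "amplify informative channels, suppress the rest" behavior described above.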

Answer to RQ4

Single algorithms were insufficient for skin lesion classification due to high inter-class similarity and intra-class variability, making Ensemble Learning (EL) essential. In this study, TASE architectures—custom CNNs fused with SAI, CAI, and SEAI modules, along with Transfer Learning (TL) models—were incorporated into three ensemble strategies: Serial Stacked, Parallel Stacked, and Independent Stacked. The final multi-layer ensemble effectively combined attention-driven feature representations, reducing bias and enhancing generalization. Aggregating predictions from heterogeneous architectures significantly improved accuracy and robustness on unseen data, demonstrating that EL is critical for complex medical imaging tasks where feature diversity and consensus are vital. A novel ensemble method, Cohen’s Kappa Proportioned Averaging (CKPA), was proposed to optimally weight predictions across multiple layers.

Answer to RQ5

Relying solely on post-prediction ensembling, such as CKPA, carries risks including redundant feature learning, error propagation from uncorrelated base models, and limited synergistic learning. Pre-prediction stacking—implemented through Serial, Parallel, and Independent Stacked Attention—integrates attention mechanisms during model training, enabling collaborative feature refinement. Serial stacking sequentially enhances attention-guided features across layers; Parallel stacking processes inputs via multiple attention pathways simultaneously; and Independent stacking maintains specialized attention modules for later fusion. By embedding attention at the architectural level, pre-prediction stacking minimizes redundancy, enhances feature complementarity, and allows end-to-end optimization of interactions. Experimental results confirmed that pre-prediction stacking outperforms post-prediction CKPA, particularly in complex tasks like skin lesion classification, where hierarchical feature refinement and synergistic learning improve discrimination and robustness.
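The three wiring patterns can be contrasted with a toy sketch. The attention modules here are stand-in functions on a feature vector, and combining the parallel branches by concatenation is an assumption made for illustration. The actual modules and fusion operators are those defined in the methodology.

```python
def serial_stack(features, modules):
    """Serial: each module refines the output of the previous one."""
    for module in modules:
        features = module(features)
    return features


def parallel_stack(features, modules):
    """Parallel: every module sees the same input; branches are concatenated (assumption)."""
    branches = [module(features) for module in modules]
    return [value for branch in branches for value in branch]


def independent_stack(features, modules):
    """Independent: each module yields its own output, kept separate for later fusion."""
    return [module(features) for module in modules]


# Stand-in "attention" functions; real modules would be learned layers.
def soft_like(x):
    return [v * 0.5 for v in x]


def channel_like(x):
    return [v + 1.0 for v in x]


def se_like(x):
    return [v * v for v in x]


modules = [soft_like, channel_like, se_like]
x = [1.0, 2.0]

print(serial_stack(x, modules))       # [2.25, 4.0]
print(parallel_stack(x, modules))     # [0.5, 1.0, 2.0, 3.0, 1.0, 4.0]
print(independent_stack(x, modules))  # [[0.5, 1.0], [2.0, 3.0], [1.0, 4.0]]
```

The serial result depends on module order, whereas the parallel and independent variants expose each module's view of the same input, which is what makes their outputs complementary for later fusion.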

Discussion and extended comparison

Our research demonstrated the advantages of combining TASE architectures with the Multi-Layer Cohen’s Kappa Proportioned Averaging (ML-CKPA) ensemble framework to improve classification outcomes. Through systematic data preprocessing, targeted augmentation, and fine-tuning of pre-trained models, we effectively mitigated class imbalance issues and enhanced the extraction of discriminative features, leading to substantial improvements in overall model performance.

To optimize feature representation, we employed a Triple-Attention mechanism integrating Soft Attention Integration (SAI), Channel Attention Integration (CAI), and Squeeze-Excitation Attention Integration (SEAI). This mechanism was incorporated via three specialized stacking strategies: Serial Stacked Attention, Parallel Stacked Attention, and Independent Stacked Attention. By leveraging both low-level and high-level features, these approaches enabled the construction of a highly flexible and effective architecture. In addition to pre-prediction stacking, post-prediction ensembling across multiple layers further refined the model’s predictions, enhancing both accuracy and robustness.

The final evaluation at CKPA-Layer 2 highlighted the effectiveness of this hierarchical ensembling framework. The TASE model achieved a remarkable accuracy of 94.44%, outperforming prior approaches and validating the efficacy of our multi-layer CKPA design. Moreover, a high specificity of 92.04% confirmed the model’s capability to correctly identify non-target classes, ensuring consistent and reliable performance across diverse evaluation metrics.
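For a multi-class problem, specificity is usually obtained one class at a time by treating that class as positive and all others as negative. The sketch below computes per-class specificity from a confusion matrix and then averages it in two common ways. The confusion matrix is made up, and which averaging scheme underlies the reported 92.04% is not stated in this section, so both are shown as possibilities.

```python
def per_class_specificity(conf):
    """One-vs-rest specificity TN / (TN + FP) for each class.

    conf[i][j] counts samples of true class i predicted as class j.
    Assumes each class has at least one negative sample.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    result = []
    for k in range(n):
        tp = conf[k][k]
        fn = sum(conf[k]) - tp
        fp = sum(conf[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        result.append(tn / (tn + fp))
    return result


# Made-up 3-class confusion matrix, for illustration only.
toy_matrix = [
    [8, 2, 0],
    [1, 5, 0],
    [0, 1, 3],
]

specs = per_class_specificity(toy_matrix)
supports = [sum(row) for row in toy_matrix]

macro_specificity = sum(specs) / len(specs)
weighted_specificity = sum(s * n for s, n in zip(specs, supports)) / sum(supports)

print([round(s, 4) for s in specs])
print(round(macro_specificity, 4), round(weighted_specificity, 4))
```

On imbalanced data the two averages can differ noticeably, so the averaging scheme matters when comparing specificity values across studies.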

These results affirmed the strength of our methodological framework, emphasizing the successful integration of advanced attention modules, stacking architectures, and multi-layer ensembling strategies. The observed improvements across performance metrics underscored the model’s ability to produce accurate and generalizable predictions, positioning it as a significant advancement in the domain of image classification.

Even with multiple baseline classifiers considered, our carefully designed framework consistently outperformed existing methods, demonstrating superior accuracy, robustness, and reliability across evaluation measures. A comprehensive comparison of our proposed model with prior studies is presented in Table 20, particularly highlighting research leveraging the HAM10000 dataset.

Table 20.

Comparison of our proposed architecture with existing approaches.

Ref.	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)	Specificity (%)
12	91.24	83.53	95.04	88.91	-
13	86.20	91.30	-	-	-
14	84.30	-	-	-	-
16	86.20	-	-	-	-
19	91.51	-	-	-	-
20	86.67	-	-	-	-
22	80.00	-	-	-	-
26	90.00	86.00	81.00	86.00	-
27	93.40	93.07	-	-	-
28	83.20	-	-	-	-
35	89.50	-	89.50	-	98.10
30	88.00	87.00	94.00	89.00	-
63	91.51	-	-	-	-
64	86.33	-	86.33	-	97.48
65	93.46	-	-	-	92.90
66	93.45	93.57	93.01	93.45	-
67	93.46	87.01	85.57	86.28	-
Ours	94.44	94.13	94.44	94.24	92.04

Furthermore, we compared our proposed architecture with state-of-the-art methods to demonstrate its enhanced performance. As illustrated in Table 21, our tailored approach consistently exceeded the performance of existing models, offering strong evidence of its effectiveness and robustness.

Table 21.

Comparison of our proposed architecture with state-of-the-art methods.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)	Specificity (%)
MobileNet	90.10	89.53	90.10	89.70	82.69
MobileNetv2	90.10	89.98	90.10	89.87	84.59
MobileNetv3L	90.70	89.98	90.70	90.13	82.73
MobileNetv3S	90.22	89.38	90.22	89.63	81.77
DenseNet121	90.34	90.22	90.34	89.85	85.56
DenseNet169	92.03	91.72	92.03	91.61	86.61
DenseNet201	92.03	91.96	92.03	91.67	84.24
ResNet50	87.80	87.68	87.80	87.35	86.02
ResNet101	91.30	90.23	91.30	90.98	84.67
ResNet152	90.70	91.07	90.70	90.01	77.97
InceptionRv2	89.98	90.19	89.98	90.02	87.96
Inceptionv3	91.43	91.07	91.43	91.09	87.06
Xception	89.61	88.47	89.61	88.86	78.84
ML-CKPA	94.44	94.13	94.44	94.24	92.04

Threats to validity

Although the proposed methodology achieved notable success in image classification, certain limitations should be acknowledged to inform future improvements:

Computational overhead

The multi-layer CKPA framework, while effective, introduces considerable computational demands due to its reliance on multiple TASE architectures and iterative ensemble refinement. Both training and inference are impacted by the need to coordinate diverse attention modules and base classifiers, posing challenges for large-scale datasets or environments with limited computational resources.

Dependence on homogeneous base classifiers

The success of CKPA depends on the diversity of its constituent models. Using homogeneous architectures may lead to overlapping feature representations, thereby limiting ensemble gains. Although this study leverages varied TASE forms (SSA, PSA, ISA) to mitigate this risk, incorporating additional architectural or algorithmic diversity could further improve robustness.

Dataset-specific generalization

The evaluation is based on a single benchmark dataset, which may contain domain-specific biases or distributional characteristics not representative of broader settings. Consequently, the model’s performance might decline when applied to cross-domain data, such as alternative imaging protocols or diverse lesion populations.

External validation limitation

Another limitation is the lack of external validation, as the proposed framework has been evaluated on a single benchmark dataset. This may restrict its immediate generalizability to real-world clinical settings. However, the primary contribution of this work is architectural, and future research will focus on validating the proposed model on multiple datasets as well as real-world clinical data to further assess its robustness and applicability.

Future work and research directions

The limitations highlighted in this study suggest several promising directions for extending the proposed framework’s applicability, efficiency, and robustness:

Optimizing computational efficiency

Future research could aim to reduce the computational demands of the multi-layer CKPA framework while preserving its ensemble advantages. Strategies may include knowledge distillation to compress multiple TASE architectures into more compact models, dynamic pruning of redundant classifiers during inference, and hardware-aware parallelization to maximize resource utilization. Additionally, adaptive ensembling depth—where the number of CKPA layers is determined based on dataset complexity—could provide a balanced trade-off between computational cost and performance, facilitating real-world deployment.

Enforcing base classifier diversity

The success of CKPA depends on the diversity of its underlying models. Automated approaches for promoting classifier heterogeneity could be explored, such as adversarial decorrelation losses to reduce redundant feature learning, and neural architecture search (NAS) to identify optimal combinations of attention modules and transfer learning backbones. Furthermore, hybrid ensembles that integrate CNNs with transformer-based or graph neural network models could enrich feature representations, particularly for rare or morphologically complex lesion subtypes.

Cross-domain generalization and robustness

To enhance the framework’s applicability beyond a single dataset, future work should focus on validating CKPA performance across multi-center datasets with diverse imaging protocols, patient demographics, and lesion distributions. Incorporating domain adaptation techniques and evaluating on heterogeneous datasets will help address distribution shifts and improve generalizability. Establishing collaborative benchmarking with clinical partners can create standardized evaluation protocols for real-world challenges, such as low-quality images or streaming data with class imbalance. Additionally, integrating uncertainty quantification into the CKPA ensemble weighting process could enhance reliability in ambiguous cases, supporting greater trust and adoption in clinical settings.

Conclusion

This study presents a comprehensive image classification framework that integrates TASE architectures with a Multi-Layer Cohen’s Kappa Proportioned Averaging (ML-CKPA) strategy. The process begins with systematic data preprocessing and the evaluation of four augmentation techniques, from which the most effective strategy is selected based on performance on an independent test set. This ensures efficient training and strong generalization capability of the TASE models.

To enhance feature representation and class discrimination, three attention-based stacking approaches—Serial Stacked Attention, Parallel Stacked Attention, and Independent Stacked Attention—are incorporated. Each configuration contributes unique strengths, and their outputs are combined using the proposed CKPA ensemble method to achieve improved predictive performance.

The ML-CKPA framework applies a two-stage sequential refinement mechanism that effectively leverages the complementary strengths of individual TASE models. This layered ensembling strategy results in a robust classification system, delivering notable gains in accuracy and consistency across all evaluation metrics.
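The layered refinement can be pictured as the same weighted average applied twice: once within each group of models, and once across the fused group outputs. The sketch below is a simplified illustration for a single sample. The grouping by stacking strategy, the kappa values, and the probability vectors are all hypothetical, and in a real pipeline the second-layer kappas would be measured on the fused first-layer predictions rather than supplied by hand.

```python
def weighted_average(prob_vectors, scores):
    """Average probability vectors with weights proportional to positive scores."""
    total = sum(scores)
    weights = [s / total for s in scores]
    n_classes = len(prob_vectors[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_vectors)) for c in range(n_classes)]


def two_layer_fusion(groups, group_scores):
    """Layer 1 fuses the members of each group; layer 2 fuses the group outputs.

    groups: {group name: [(probability vector, member kappa), ...]}
    group_scores: {group name: kappa of that group's fused output}
    """
    layer1 = {
        name: weighted_average([p for p, _ in members], [k for _, k in members])
        for name, members in groups.items()
    }
    names = list(layer1)
    final = weighted_average([layer1[n] for n in names], [group_scores[n] for n in names])
    return final, layer1


# Hypothetical two-class outputs for one sample; kappas are made up.
groups = {
    "ISA": [([0.9, 0.1], 0.90), ([0.7, 0.3], 0.80)],
    "SSA": [([0.4, 0.6], 0.85), ([0.6, 0.4], 0.80)],
    "PSA": [([0.8, 0.2], 0.88), ([0.5, 0.5], 0.82)],
}
group_kappas = {"ISA": 0.91, "SSA": 0.86, "PSA": 0.89}

final_probs, layer1_probs = two_layer_fusion(groups, group_kappas)
print({name: [round(v, 4) for v in p] for name, p in layer1_probs.items()})
print([round(v, 4) for v in final_probs])
```

Because every stage is a convex combination, the fused output remains a valid probability vector at each layer.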

Furthermore, Grad-CAM visualizations enhance the interpretability of the framework by identifying the regions influencing model decisions, thereby supporting its practical applicability. The proposed method demonstrates strong performance in transfer learning scenarios, particularly in medical imaging, and shows promise in facilitating early and accurate diagnosis of skin conditions, contributing to improved patient outcomes and accessibility to reliable diagnostic support.

Footnotes

Acknowledgements

The authors would like to express their sincere gratitude to their parents for their continuous support and encouragement.

ORCID iDs

Md. Shifaul Hasan

Anwar Hossain Efat

Jubaer Ahamed Bhuiyan

Ethical considerations

This study was conducted in compliance with ethical standards, ensuring proper copyright adherence and attribution. It used the HAM10000 dataset,⁵⁰ which is publicly available under the CC BY-NC 4.0 license and has been used with appropriate attribution. The original authors of this dataset obtained the necessary ethical approvals and informed consent from all participants during the primary data collection process. Since this research is based on a retrospective analysis of a previously approved and publicly available dataset, further institutional ethics approval was not required.

Consent to participate

Informed consent was originally obtained from all individual participants by the creators of the dataset.

Consent for publication

All authors have provided their consent for publication in this journal (Digital Health, Sage). No additional consent is required beyond the authors’ approval.

Author contributions

Md. Shifaul Hasan: Validation, Formal analysis, Methodology, Software, Investigation, Writing – Original Draft.

Anwar Hossain Efat: Conceptualization, Supervision, Data Curation, Methodology, Writing – Original Draft.

Jubaer Ahamed Bhuiyan: Formal analysis, Software, Investigation, Writing – Review & Editing.

Faniyam Maria Mansia: Investigation, Validation.

All authors have read and approved the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

All data used in this study, including the augmented training dataset, are publicly available in the Kaggle repository HAM10000 Dataset.⁵⁰ The use of the HAM10000 dataset⁴⁹ complies with the Creative Commons Attribution-NonCommercial 4.0 International License. Proper attribution has been provided, and the recommended citation of the original dataset publication has been included, thereby fulfilling the license requirements. Furthermore, the dataset has been used strictly for non-commercial research purposes in accordance with the license terms. The source code developed for this study is available from the corresponding author upon reasonable request.

Guarantor

All authors accept full responsibility for the integrity of the data and the accuracy of the data analysis, and confirm that they had full access to all data used in the study.

References

Bibi

Khan

Shah

, et al. Msrnet: multiclass skin lesion recognition using additional residual block based fine-tuned deep models information fusion and best feature selection. Diagnostics 2023; 13(19): 3063. https://doi.org/10.3390/diagnostics13193063

Dillshad

Khan

Nazir

, et al. D2lfs2net: Multi-class skin lesion diagnosis using deep learning and variance-controlled marine predator optimisation: An application for precision medicine. CAAI Transactions on Intelligence Technology 2025; 10(1): 207–222. https://doi.org/10.1049/cit2.12267

Hussain

Khan

Damaševičius

, et al. Skinnet-inio: multiclass skin lesion localization and classification using fusion-assisted deep neural networks and improved nature-inspired optimization algorithm. Diagnostics 2023; 13(18): 2869. https://doi.org/10.3390/diagnostics13182869

Efat

Hasan

Uddin

, et al. A multi-level ensemble approach for skin lesion classification using customized transfer learning with triple attention. PloS one 2024; 19(10): e0309430. https://doi.org/10.1371/journal.pone.0309430

Ahmad

Shah

Khan

, et al. A novel framework of multiclass skin lesion recognition from dermoscopic images using deep learning and explainable ai. Frontiers in Oncology 2023; 13: 1151257. https://doi.org/10.3389/fonc.2023.1151257

Malik

Akram

Awais

, et al. An improved skin lesion boundary estimation for enhanced-intensity images using hybrid metaheuristics. Diagnostics 2023; 13(7): 1285. https://doi.org/10.3390/diagnostics13071285

Efat

Hasan

Uddin

, et al. Inverse gini indexed averaging: A multi-leveled ensemble approach for skin lesion classification using attention-integrated customized resnet variants. Digital Health 2025; 11: 20552076241312936. https://doi.org/10.1177/20552076241312936

Efat

. Tri-attention boosted scalable efficientnet for skin lesion classification via triple-stage gain ratioed averaging. Franklin Open 2026; 15: 100546. https://doi.org/10.1016/j.fraope.2026.100546

Montashir

Efat

Mahedy Hasan

, et al. Tri focus net: A cnn-based model with integrated attention modules for pest and insect detection in agriculture. In: International conference on trends in electronics and health informatics. Springer, 2024, pp. 225–240.

10.

Efat

. Chi 2 weighted ensemble: A multi-layer ensemble approach for skin lesion classification using a novel framework-optimized regnet synergy with attention-triplet. PloS one 2025; 20(5): e0321803. https://doi.org/10.1371/journal.pone.0321803

11.

Efat

Zibran

Eishita

. Skin lesion classification breakthrough: Leveraging independent-serial-parallel-stacking ensemble architecture with reciprocal cross-entropy averaging. IEEE Access 2026; 14: 14258-14285.

12.

Wang

Yan

Tang

, et al. Multiscale feature fusion for skin lesion classification. BioMed Research International 2023; 2023(1): 5146543. https://doi.org/10.1155/2023/5146543

13.

Mahbod

Schaefer

Wang

, et al. Transfer learning using a multi-scale and multi-network ensemble for skin lesion classification. Computer methods and programs in biomedicine 2020; 193: 105475. https://doi.org/10.1016/j.cmpb.2020.105475

14.

Tajerian

Kazemian

Tajerian

, et al. Design and validation of a new machine-learning-based diagnostic tool for the differentiation of dermatoscopic skin cancer images. PLoS One 2023; 18(4): e0284437. https://doi.org/10.1371/journal.pone.0284437

15.

Hosny

Kassem

Foaud

. Classification of skin lesions using transfer learning and augmentation with alex-net. PloS one 2019; 14(5): e0217293. https://doi.org/10.1371/journal.pone.0217293

16.

Popescu

El-Khatib

Ichim

. Skin lesion classification using collective intelligence of multiple neural networks. Sensors 2022; 22(12): 4399. https://doi.org/10.3390/s22124399

17.

Khan

. Skinvit: A transformer based method for melanoma and nonmelanoma classification. Plos one 2023; 18(12): e0295151. https://doi.org/10.1371/journal.pone.0295151

18.

Dong

Wang

. Tc-net: Dual coding network of transformer and cnn for skin lesion segmentation. Plos one 2022; 17(11): e0277578. https://doi.org/10.1371/journal.pone.0277578

19.

Nie

Sommella

Carratù

, et al. A deep cnn transformer hybrid model for skin lesion classification of dermoscopic images using focal loss. Diagnostics 2022; 13(1): 72. https://doi.org/10.3390/diagnostics13010072

20.

Singh

Gorantla

Allada

SGR

, et al. Skinet: A deep learning framework for skin lesion diagnosis with uncertainty estimation and explainability. Plos one 2022; 17(10): e0276836. https://doi.org/10.1371/journal.pone.0276836

21.

Khan

Akram

Zhang

, et al. Skinnet-endo: Multiclass skin lesion recognition using deep neural network and entropy-normal distribution optimization algorithm with elm. International journal of imaging systems and technology 2023; 33(4): 1275–1292. https://doi.org/10.1002/ima.22863

22.

Saarela

Geogieva

. Robustness, stability, and fidelity of explanations for a deep skin cancer classification model. Applied Sciences 2022; 12(19): 9545. https://doi.org/10.3390/app12199545

23.

Nidhi

Efat

Hasan

, et al. Triple attention mobilenetv3: Harnessing integrated attention and transfer learning for next-generation skin lesion detection. In: 2024 IEEE International Conference on Computing, Applications and Systems (COMPAS). IEEE, 2024, pp. 1–6.

24.

Abir

MAK

Efat

Hasan

, et al. Attention enhanced inception-v3: A multi-scale feature fusion network for skin lesion detection with explainable artificial intelligence. In: 2024 International Conference on Innovations in Science, Engineering and Technology (ICISET). IEEE, 2024, pp. 1–6.

25.

Ahmmed

Faruk

Srizon

, et al. Shallow tuned densenet: A lightweight convolutional neural network approach for enhanced skin lesion recognition. In: 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, 2024, pp. 1–6.

26.

Nguyen

Bui

. Skin lesion classification on imbalanced data using deep learning with soft attention. Sensors 2022; 22(19): 7530. https://doi.org/10.3390/s22197530

27.

Datta

Shaikh

Srihari

, et al. Soft attention improves skin cancer classification performance. In: International Workshop on Interpretability of Machine Intelligence in Medical Image Computing. Springer, 2021, pp. 13–23.

28.

Gouda

Sama

Al-Waakid

et al. Detection of skin cancer based on skin lesion images using deep learning. In: Healthcare. MDPI, 2022, Vol. 10, p. 1183.

29.

Ajmal

Khan

Akram

, et al. Bf2sknet: Best deep learning features fusion-assisted framework for multiclass skin lesion classification. Neural Computing and Applications 2023; 35(30): 22115–22131. https://doi.org/10.1007/s00521-022-08084-6

30.

Rahman

Hossain

Islam

, et al. An approach for multiclass skin lesion classification based on ensemble learning. Informatics in Medicine Unlocked 2021; 25: 100659. https://doi.org/10.1016/j.imu.2021.100659

31.

Jahan

Efat

Hasan

, et al. An explainable deep learning framework for multi-class skin lesion classification while resolving class imbalance. In: 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, 2024, pp. 473–478.

32.

Roy

Efat

Hasan

, et al. Multi-scale feature fusion framework based on attention integrated customized densenet201 architecture for multi-class skin lesion detection. In: 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, 2024, pp. 496–501.

33.

Hasib

Faruk

Hasan

, et al. Improved skin lesion detection with double layer concatenated densenet using transfer learning and attention modules. In: 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, 2024, pp. 1–6.

34.

Mia

Efat

Hasan

, et al. Exploring augmentation strategies for balanced skin lesion classification: An explainable lightly tuned densenet 169 architecture. In: 2024 International Conference on Innovations in Science, Engineering and Technology (ICISET). IEEE, 2024, pp. 1–6.

35.

Sun

Huang

Chen

, et al. Skin lesion classification using additional patient information. BioMed research international 2021; 2021(1): 6673852. https://doi.org/10.1155/2021/6673852

36.

Basitur Rahman Bappi

Masfequier Rahman Swapno

Akhter

, et al. Deploying cnn-resnet50-bilstm for paddy leaf disease detection. In: Machine vision in plant leaf disease detection for sustainable agriculture. Springer, 2025, pp. 131–143.

37.

Nobel

Swapno

SMR

Kabir

, et al. Crt: A convolutional recurrent transformer for automatic sleep state detection. IEEE Journal of Biomedical and Health Informatics 2025; 29(6): 4452–4462. https://doi.org/10.1109/JBHI.2025.3543028

38.

Swapno

Nobel

Meena

et al. Accelerated and precise skin cancer detection through an enhanced machine learning pipeline for improved diagnostic accuracy. Results eng. 2025; 25: 104168–104181. https://doi.org/10.1016/j.rineng.2025.104168

39.

Swapno

SMR

Nobel

Islam

, et al. Vit-senet-tom: machine learning-based novel hybrid squeeze–excitation network and vision transformer framework for tomato fruits classification. Neural Computing and Applications 2025; 37(9): 6583–6600. https://doi.org/10.1007/s00521-025-10973-5

40.

Salmah

Suwasono

Kurniawan

, et al. Implementation of pjbl-stem learning to improve students’ higher order thinking skills in direct current electricity. Jurnal Pendidikan Sains Indonesia 2025; 13(2): 534–549. https://doi.org/10.24815/jpsi.v13i2.44826

41.

Swapno

SMR

Sakib

Hossain

, et al. Explainable transformer framework for fast cotton leaf diagnostics and fabric defect detection. Iscience 2026; 29(2).

42.

Islam

Haque

Khan

, et al. Ensemble transformer with post-hoc explanations for depression emotion and severity detection. iScience 2026; 29(2): 114605. https://doi.org/10.1016/j.isci.2025.114605

43.

Khushubu

Al Masum

Rahman

, et al. Transunetb: An advanced transformer–unet framework for efficient and explainable brain tumor segmentation. Informatics in Medicine Unlocked 2025; 59: 101706. https://doi.org/10.1016/j.imu.2025.101706

44.

Bappi

MBR

Swapno

SMR

Rabbi

. Deploying densenet for cotton leaf disease detection on deep learning. In: International Conference on Trends in Electronics and Health Informatics. Springer, 2024, pp. 485–498.

45.

Siddiqui

MIH

Khan

Limon

, et al. Accelerated and accurate cervical cancer diagnosis using a novel stacking ensemble method with explainable ai. Informatics in Medicine Unlocked 2025; 56: 101657. https://doi.org/10.1016/j.imu.2025.101657

46.

Debnath

Hossain

Sakib

, et al. Lmvt: A hybrid vision transformer with attention mechanisms for efficient and explainable lung cancer diagnosis. Informatics in Medicine Unlocked 2025; 57: 101669. https://doi.org/10.1016/j.imu.2025.101669

47.

Ahmed

Rahman

Limon

, et al. Hierarchical swin transformer ensemble with explainable ai for robust and decentralized breast cancer diagnosis. Bioengineering 2025; 12(6): 651. https://doi.org/10.3390/bioengineering12060651

48.

Bhuiyan

Efat

Hasan

, et al. Hierarchical attention stacked ensemble with matthews-correlation-coefficient weighted averaging: A novel framework for skin lesion classification. Digital health 2026; 12: 20552076261433750. https://doi.org/10.1177/20552076261433750

49.

Tschandl

Rosendahl

Kittler

. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data 2018; 5(1): 1–9. https://doi.org/10.1038/sdata.2018.161

50.

Ham10000: Split and augmented. https://www.kaggle.com/datasets/ahefatresearch/ham10000-split-and-augmented. [Online; accessed 2026-02-12].

51.

Alam

Efat

Hasan

, et al. Refining breast cancer classification: Customized attention integration approaches with dense and residual networks for enhanced detection. Digital Health 2025; 11: 20552076241309947. https://doi.org/10.1177/20552076241309947

52.

Sikder

Efat

Hasan

, et al. A triple-level ensemble-based brain tumor classification using dense-resnet in association with three attention mechanisms. In: 2023 26th International conference on computer and information technology (ICCIT). IEEE, 2023, pp. 1–6.

53.

Haque

Efat

Hasan

, et al. Revolutionizing pest detection for sustainable agriculture: A transfer learning fusion network with attention-triplet and multi-layer ensemble. 2023 26th international conference on computer and information technology (ICCIT). IEEE 2023, pp. 1–6.

54.

Bhowmick

Efat

Hasan

, et al. Dual concatenated densenet with attention fusion: A framework for skin lesion classification incorporating multiple augmentation techniques and transfer learning. In: 2024 27th International Conference on Computer and Information Technology (ICCIT). IEEE, 2024, pp. 1087–1092.

55.

Joy

Efat

Hasan

, et al. Attention trinity net and densenet fusion: Revolutionizing american sign language recognition for inclusive communication. In: 2023 26th international conference on computer and information technology (ICCIT). IEEE, 2023, pp. 1–6.

56.

Shafin

Efat

Hasan

, et al. Skin lesion classification through sequential triple attention densenet: Diverse utilization of the combination of attention modules. 2023 26th international conference on computer and information technology (ICCIT). IEEE, 2023, pp. 1–6.

57.

Amin, Efat, Rahman, et al. Enhanced skin lesion detection using concatenated DenseNet and multi-attention mechanisms. In: 2024 International Conference on Innovations in Science, Engineering and Technology (ICISET). IEEE, 2024, pp. 1–6.

58.

Efat. Pinpointing key success factors in Bangladesh’s public university entrance exams: A feature-optimized SVM architecture with XAI. In: 2024 27th International Conference on Computer and Information Technology (ICCIT). IEEE, 2024, pp. 429–434.

59.

Efat, Hasan, Zibran. Greeknet: Handwritten Greek alphabet recognition using explainable parallel CNN with attention mechanisms. In: 2025 IEEE 4th International Conference on Computing and Machine Intelligence (ICMI). IEEE, 2025, pp. 1–9.

60.

Efat, Hasant, Jannat, et al. Inquisition of the support vector machine classifier in association with hyper-parameter tuning: A disease prognostication model. In: 2022 4th International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE). IEEE, 2022, pp. 131–134.

61.

Hossain Efat, Faysal Ferdous, Islam Nayem, et al. From data to diagnosis: A journey with machine learning, hyperparameter tuning, and ensemble learning for disease prognostication. In: International Conference on Trends in Electronics and Health Informatics. Springer, 2026, pp. 407–420.

62.

Ferdous, Efat. Unlocking insights in healthcare: A comparative study of hyperparameter tuned machine learning algorithms. In: International Conference on Trends in Electronics and Health Informatics. Springer, 2023, pp. 341–356.

63.

Sevli. A deep convolutional neural network-based pigmented skin lesion classification application and experts evaluation. Neural Computing and Applications 2021; 33(18): 12039–12050. https://doi.org/10.1007/s00521-021-05929-4

64.

Hoang, Lee, et al. Multiclass skin lesion classification using a novel lightweight deep learning framework for smart healthcare. Applied Sciences 2022; 12(5): 2677. https://doi.org/10.3390/app12052677

65.

Harangi, Baran, Hajdu. Assisted deep learning framework for multi-class skin lesion classification considering a binary classification support. Biomedical Signal Processing and Control 2020; 62: 102041. https://doi.org/10.1016/j.bspc.2020.102041

66.

Nigar, Umar, Shahzad, et al. A deep learning approach based on explainable artificial intelligence for skin lesion classification. IEEE Access 2022; 10: 113715–113725. https://doi.org/10.1109/access.2022.3217217

67.

Khan, Zhang, Sharif, et al. Pixels to classes: Intelligent learning framework for multiclass skin lesion localization and classification. Computers & Electrical Engineering 2021; 90: 106956. https://doi.org/10.1016/j.compeleceng.2020.106956