Abstract
Real-world skin lesion classification faces three major challenges: severe class imbalance, high intra-class variability, and the need to reject out-of-distribution (OOD) samples. Conventional monolithic models often struggle to address these issues simultaneously. To mitigate this limitation, we propose a multi-stage decoupled hybrid framework that combines Supervised Contrastive Learning (SupCon) with structural reconstruction. First, representation learning is decoupled using SupCon. Compared with standard cross-entropy training, SupCon alleviates feature degradation under long-tailed distributions by encouraging a more balanced feature space. Second, to address open-set recognition, we integrate contrastive semantic features with structural anomaly scores derived from an independent Convolutional Autoencoder (CAE). These complementary signals—semantic confidence and reconstruction error—are fused through a linear boundary formulation to support both known-class classification and unknown-sample rejection. On the ISIC 2019 dataset, the proposed framework achieves a Balanced Accuracy of 78.5% on known classes and improves the unknown-class F1-score to 51.3%. These results indicate that semantic–structural fusion enhances robustness under long-tailed and open-set conditions.
Keywords
Introduction
Skin cancer remains one of the most prevalent malignancies worldwide. 1 Although melanoma is not the most common subtype, its high invasiveness and metastatic potential account for a substantial proportion of skin cancer–related mortality. Clinical evidence indicates that the five-year survival rate of melanoma exceeds 90% when detected at an early stage, highlighting the importance of accurate early diagnosis.
Despite substantial progress in automated skin lesion analysis, a gap remains between reported algorithmic performance and real-world clinical applicability. In practice, diagnostic systems must address three interrelated challenges.2,3 First, the data distribution is highly imbalanced and typically follows a long-tailed pattern.4,5 Common lesions such as melanocytic nevi (NV) vastly outnumber rare categories like dermatofibroma (DF), which often leads to performance degradation on minority classes. 6 Second, dermoscopic images exhibit substantial intra-class variability. 7 Lesion appearance is influenced by illumination conditions, hair occlusion, acquisition angles, and imaging artifacts, increasing the difficulty of robust feature extraction.8,9 Third, the clinical environment is inherently open-set. 10 Models must not only classify known categories accurately but also identify and reject samples belonging to previously unseen classes, a problem commonly referred to as Out-of-Distribution (OOD) detection.11,12
Conventional deep learning approaches typically attempt to address these challenges within a unified cross-entropy training paradigm, which may lead to suboptimal trade-offs under severe imbalance and distribution shifts. 13 To alleviate these limitations, prior studies have explored several directions. Architectural advances employ more expressive backbones such as EfficientNet 14 and Vision Transformers (ViT).9,15 Data-centric strategies utilize augmentation techniques 16 or generative models to synthesize minority samples. 17 Algorithmic refinements include modified loss functions, such as Focal Loss, and contrastive learning strategies. 18 Recent works have further expanded these efforts. For example, Gayatri (2024) 16 and Venugopal (2023) 19 improved classification through refined loss formulations and backbone adaptations. Oztel (2024) 20 and Zhang (2025) 9 investigated Vision Transformers for enhanced global feature modeling. Ensemble learning has also demonstrated improved robustness.6,21 In the context of OOD detection, Milara (2025) 11 analyzed classifier behavior under distribution shifts. Nevertheless, many existing methods address imbalance and OOD detection in an entangled manner, which may limit performance when both challenges coexist.
To address this issue, we propose a multi-stage decoupled hybrid framework. First, for representation learning, we adopt Supervised Contrastive Learning (SupCon) to construct a more balanced feature space under long-tailed distributions. Second, for open-set recognition, we combine contrastive semantic features with structural anomaly scores derived from an independent Convolutional Autoencoder (CAE). This design integrates semantic discrimination with structural reconstruction to support both classification and anomaly rejection. Finally, the proposed DH-OOD detector fuses these complementary signals through a linear boundary formulation. Extensive experiments demonstrate improved robustness in long-tailed and open-set settings, with an unknown-class F1-score of 51.3%.
The remainder of this paper is organized as follows. Section 2 describes the proposed methodology and experimental protocols. Section 3 presents the experimental results and analysis. Section 4 concludes the study.
Materials and methods
The proposed framework operates as a multi-stage decoupled system designed to address feature quality, class imbalance, and OOD detection as distinct, manageable sub-problems (Figure 1).

Overview of the proposed decoupled hybrid framework. The architecture comprises three core stages: (1) Representation learning via SupCon; (2) Classifier fine-tuning for semantic discrimination and Autoencoder training for structural anomaly detection; (3) The DH-OOD hybrid detector fuses deep classifier signals with CAE reconstruction errors.
Stage I: Decoupled representation learning via SupCon
End-to-end training on imbalanced datasets often results in a feature space dominated by majority classes. To mitigate this, we decouple representation learning from classifier training. In Stage I, we employ SupCon18,22 to pre-train the backbone network
Stage II: Imbalance-aware classifier fine-tuning
With discriminative features
Here,
Stage III: Hybrid OOD detection via DH-OOD
We propose the DH-OOD hybrid system, which fuses two distinct signals via a dual-stream mechanism. In the feature space, we posit that typical in-distribution samples occupy a manifold characterized by high semantic confidence and low reconstruction error. To quantify this, we define a hybrid fusion score
Based on the distribution of
Experiment
This study primarily utilizes the ISIC 2019 skin lesion dataset. 16 The training phase employs 25,331 images, with the detailed class distribution presented in Table 1. The data exhibit extreme long-tail characteristics: the most common class, Melanocytic Nevus (NV), accounts for over 50% of the dataset, whereas rare classes such as Dermatofibroma (DF) and Vascular Lesions (VASC) account for less than 1%. Such disparity is a primary driver of performance degradation in traditional classifiers. 13 For evaluation, we constructed a test set containing 8,238 images (Table 1). To evaluate robustness in a realistic open environment, the test set design adheres to two criteria. First, it maintains an imbalanced distribution similar to the training set for the 8 known classes to evaluate standard diagnostic capability under data sparsity. Second, it introduces a 9th “unknown” class (UNK) unseen during training, comprising 2,047 images. It is crucial to note that this UNK class is not a single homogeneous pathology but a diverse collection of undefined atypical lesions and outliers. This heterogeneous composition ensures that our evaluation inherently tests the model's robustness against complex, multi-source distribution shifts rather than a single artifact type. Figure 2 visualizes representative samples, highlighting the high variability in color, texture, and morphology.

Visualization of ISIC 2019 dataset samples. This figure illustrates the morphological diversity of skin lesions, covering known categories and challenging visual features.
Comparison of data distribution between ISIC 2019 training and test sets.
Given the composite challenge of long-tailed class distribution and open-set recognition, we adopted a rigorous evaluation protocol following the benchmarks established by previous works.25,26 To comprehensively assess model performance, we report five complementary metrics. We use standard Accuracy (ACC) as a baseline measure of overall correctness, while Balanced Accuracy (B-Acc), defined as the arithmetic mean of recall across known classes, is used to account for extreme class imbalance. B-Acc prevents the evaluation from being biased toward majority classes, such as Nevus. For Out-of-Distribution (OOD) detection, we employ the Area Under the Receiver Operating Characteristic curve (AUROC) to evaluate the global separability between known and unknown samples across all decision thresholds. To quantify clinical safety, we report the False Positive Rate at 95% True Positive Rate (FPR95), which measures the fraction of OOD samples incorrectly classified as in-distribution when 95% of known samples are correctly accepted. Finally, we calculate the F1-Score for the unknown class (UNK F1), the harmonic mean of the precision and recall of the rejection mechanism, which penalizes a model that gains recall by rejecting inputs indiscriminately.
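As a concrete reading of the metric definitions above, the following minimal sketch computes B-Acc, FPR95, and UNK F1 in plain Python. The function names and the convention that a higher score means "more in-distribution" are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall over the classes found in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when at least 95% of ID samples are accepted.

    ASSUMPTION: a higher score means "more in-distribution".
    """
    ranked = sorted(id_scores, reverse=True)
    k = (95 * len(ranked) + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    threshold = ranked[k - 1]           # lowest score among the accepted ID samples
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)


def f1_for_class(y_true, y_pred, label="UNK"):
    """F1 (harmonic mean of precision and recall) for a single label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with four NV and two DF samples where one DF is misread as NV, plain accuracy is 5/6 while `balanced_accuracy` returns 0.75, which is the majority-class bias that B-Acc is meant to expose.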
We employ EfficientNet-B4 14 as the backbone architecture. The training process is structured into three distinct phases. Stage I (SupCon) pre-trains the backbone for 50 epochs with a batch size of 16 using the Adam optimizer and a learning rate of
Results and discussion
Before evaluating the OOD detection performance, we first validated the training stability of the decoupled stages. Figure 3 illustrates the learning curves for the feature extraction and classifier fine-tuning phases. As shown in the left panel, the Supervised Contrastive (SupCon) loss in Stage I decreases steadily, indicating that the backbone effectively learns compact class representations. Subsequently, the right panel confirms that the classifier head in Stage II converges rapidly to a high Balanced Accuracy on the frozen backbone, validating the efficacy of the decoupled training protocol.

Training process validation shows stable convergence of decoupled stages. Left: SupCon loss (Stage I) decreases steadily. Right: Classifier Balanced Accuracy (Stage II) on the frozen backbone converges rapidly to a high level.
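For reference, the Stage I curve tracks a supervised contrastive objective. The standard form of that loss, as introduced in the original SupCon work, is reproduced below as a reading aid. The exact variant, temperature, and batch construction used in this study are not restated in this section, so the symbols here are assumptions. The temperature is written T to avoid a clash with the coverage threshold τ used later.

```latex
\mathcal{L}_{\mathrm{SupCon}}
  = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
    \log \frac{\exp\left(z_i \cdot z_p / T\right)}
              {\sum_{a \in A(i)} \exp\left(z_i \cdot z_a / T\right)}
```

Here I indexes the augmented samples in a batch, A(i) = I \ {i}, P(i) is the set of samples in A(i) that share the label of sample i, and z denotes L2-normalized embeddings. The loss falls as same-class embeddings move closer together relative to all other samples, which is why a steadily decreasing curve is read as evidence of compact class representations.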
Hyperparameter sensitivity analysis
To ensure the proposed framework generalizes well to unseen data, we determined the optimal hyperparameters using a held-out validation set, strictly isolating the test set to prevent data leakage. First, we performed a grid search to determine the balanced fusion coefficients

Grid search of fusion weights α and β. The annotated map displays UNK F1-scores on the validation set. The high-score region indicates that the mechanism is robust when both signals are weighted effectively, rather than collapsing to a single stream.

Trade-off analysis between known-class balanced accuracy and unknown-class F1-score. The solid curve, corresponding to the left axis, represents known-class balanced accuracy, whereas the dashed curve, corresponding to the right axis, represents unknown-class F1-score. The x-axis represents the target coverage of known validation samples determined by τ. The vertical dotted line marks the chosen operational point, τ = 95%, balancing diagnostic recall with rejection precision.
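The role of the fusion weights and the coverage-based operating point can be sketched as follows. This is a minimal illustration under stated assumptions: the linear form `alpha * confidence - beta * recon_error` is a hypothetical stand-in for the fusion score (the exact equation is not reproduced in this text), and the function names are invented for the example. Only the idea of choosing the threshold so that a target share of known validation samples is accepted comes from the description above.

```python
def hybrid_score(confidence, recon_error, alpha, beta):
    """ASSUMED linear fusion: reward semantic confidence, penalize reconstruction error."""
    return alpha * confidence - beta * recon_error


def threshold_for_coverage(known_val_scores, coverage_pct=95):
    """Highest threshold that still accepts at least coverage_pct% of known validation samples."""
    ranked = sorted(known_val_scores, reverse=True)
    k = (coverage_pct * len(ranked) + 99) // 100  # ceil(coverage_pct / 100 * n)
    return ranked[k - 1]


def classify_or_reject(predicted_class, confidence, recon_error, alpha, beta, threshold):
    """Return the classifier's label if the fused score clears the threshold, else 'UNK'."""
    score = hybrid_score(confidence, recon_error, alpha, beta)
    return predicted_class if score >= threshold else "UNK"
```

Raising the target coverage lowers the threshold, so more known samples are kept at the cost of admitting more unknown ones; this is the trade-off the two curves in the figure describe.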
Stepwise ablation and sensitivity analysis
To validate the efficacy of each decoupled component, we conducted a stepwise ablation study. The results, detailed in Table 2, quantitatively demonstrate the contribution of each stage. First, regarding the backbone strategy, we compared the “Frozen” approach against standard fine-tuning. As shown in Rows B and H, unfreezing the backbone during classifier training (Row B) caused a marked degradation in Balanced Accuracy compared to the proposed frozen strategy (Row H) (

Qualitative prediction examples. The figure shows correct classification of lesions from the 8 known classes (e.g., MEL, NV, BCC) and successful rejection of a sample from the 9th, unknown class.
Comprehensive ablation study analyzing backbone strategies, data augmentation, and OOD scoring mechanisms.
Robustness analysis using Leave-One-Out protocol across all 8 classes. Each row represents a scenario where that specific class was held out from training and treated as OOD during testing.
Robustness verification via LOO protocol
To verify that DH-OOD generalizes beyond specific class splits, we evaluated the model using the Leave-One-Out (LOO) protocol. Table 3 reports the detection performance when each known class is treated as an unknown anomaly. The DH-OOD framework demonstrates consistent robustness with an average AUC of 92.3%. Notably, even when common classes like NV or MEL are held out, the structural branch successfully identifies them as outliers, suggesting that the autoencoder captures general characteristics of the in-distribution manifold rather than overfitting specific outlier patterns. Complementing these quantitative results, the confusion matrix in Figure 7 confirms high recall for the UNK class. Additionally, Grad-CAM visualizations (Figure 8) indicate that the backbone focuses on semantically meaningful lesion features rather than confounding background artifacts, supporting clinical interpretability.
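The Leave-One-Out protocol used in this subsection can be sketched as a simple split generator. This is an illustrative sketch only: the function name, the `(image, label)` tuple layout, and the choice to relabel held-out test samples as "UNK" are assumptions, not details confirmed by the text.

```python
def leave_one_out_splits(train_samples, test_samples, classes):
    """Yield one (held_out, train, test) scenario per class.

    Each sample is an (image_id, label) pair. In every scenario the held-out
    class is removed from training and relabeled "UNK" at test time.
    """
    for held_out in classes:
        train = [(x, y) for x, y in train_samples if y != held_out]
        test = [(x, "UNK" if y == held_out else y) for x, y in test_samples]
        yield held_out, train, test
```

Each yielded scenario corresponds to one row of the LOO table: the model is retrained on `train` and its rejection behavior is scored on `test`.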

Confusion matrix of the final model on the test set. Diagonal elements represent Recall. The model maintains high classification accuracy across the 8 known classes while achieving significant recall for the UNK class, confirming the efficacy of the DH-OOD module in rejecting OOD samples.

Grad-CAM based visual explanations. Heatmaps indicate that SupCon pre-training guides the model to focus on morphological lesion features. This validates that the decoupled feature extractor possesses enhanced semantic localization and clinical interpretability.
Benchmarking against state-of-the-art
We compared the final DH-OOD framework against SOTA methods under two distinct settings. Table 4 presents the classification results of the 8 known classes. Our model achieves a Balanced Accuracy of 78.5% ± 0.4% and an AUC of 0.928, outperforming recent methods including MetaBlock 25 and Weighted Ensembles, 30 indicating that incorporating the rejection mechanism does not substantially compromise classification performance. Table 5 provides a comprehensive comparison against long-tailed recognition methods and OOD detectors. Two critical insights emerge. First, established long-tailed methods like Smooth Balance Softmax (BSM) 31 improve Balanced Accuracy but exhibit severe overfitting to the closed-set distribution, yielding negligible UNK F1-scores. This suggests that aggressively expanding decision boundaries for minority classes may increase overlap with unknown regions. Second, modern OOD detectors, such as ASH, 26 struggle with the high intra-class variance in skin lesions. In contrast, DH-OOD reduces the FPR95 to
Performance comparison on ISIC 2019. Results for our model are reported over 10 independent runs.
Unified performance comparison against SOTA methods on ISIC 2019. Results are reported as mean ± std over 10 independent runs. (↑: higher is better, ↓: lower is better).
Conclusion
This paper proposes a multi-stage, decoupled hybrid framework to systematically address the complex challenges in real-world skin lesion analysis. By decoupling representation learning from classifier training, SupCon constructs a more balanced feature space, mitigating degradation in minority classes. Ablation studies support the contribution of each decoupled component. In particular, the frozen backbone strategy and CutMix augmentation help maintain feature stability and preserve local structural information. Furthermore, the DH-OOD detector combines semantic confidence with structural anomaly scores through a linear boundary formulation, achieving a Balanced Accuracy of 78.5% and an unknown-class F1-Score of 51.3%. Comparative experiments show consistent improvements over several baseline and competitive methods. The LOO protocol further suggests that the structural branch captures general patterns of typicality rather than overfitting specific outliers. While the DH-OOD module currently relies on autoencoder reconstruction error, future work will explore Normalising Flows to model data manifolds more precisely and investigate the transferability of this decoupled framework to architectures such as Vision Transformers.
Footnotes
Author contributions
Benyuan He: Investigation, Methodology, Software, Writing – original draft. Lei Yao: Supervision – review editing. Ning Xue: Supervision – review editing. Chunxiu Liu: Writing – review editing. Tiezhu Liu: Conceptualization, Supervision, Writing – review editing. Zhimei Qi: Supervision, Writing – review editing.
All authors have read and agreed to the published version of the manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
