DH-OOD: A decoupled hybrid framework for robust skin lesion classification via semantic-structural fusion

Abstract

Real-world skin lesion classification faces three major challenges: severe class imbalance, high intra-class variability, and the need to reject out-of-distribution (OOD) samples. Conventional monolithic models often struggle to address these issues simultaneously. To mitigate this limitation, we propose a multi-stage decoupled hybrid framework that combines Supervised Contrastive Learning (SupCon) with structural reconstruction. First, representation learning is decoupled using SupCon. Compared with standard cross-entropy training, SupCon alleviates feature degradation under long-tailed distributions by encouraging a more balanced feature space. Second, to address open-set recognition, we integrate contrastive semantic features with structural anomaly scores derived from an independent Convolutional Autoencoder (CAE). These complementary signals—semantic confidence and reconstruction error—are fused through a linear boundary formulation to support both known-class classification and unknown-sample rejection. On the ISIC 2019 dataset, the proposed framework achieves a Balanced Accuracy of 78.5% on known classes and improves the unknown-class F1-score to 51.3%. These results indicate that semantic–structural fusion enhances robustness under long-tailed and open-set conditions.

Keywords

skin lesion classification, out-of-distribution detection (OOD), class imbalance, contrastive learning, anomaly detection

Introduction

Skin cancer remains one of the most prevalent malignancies worldwide.¹ Although melanoma is not the most common subtype, its high invasiveness and metastatic potential account for a substantial proportion of skin cancer–related mortality. Clinical evidence indicates that the five-year survival rate of melanoma exceeds 90% when detected at an early stage, highlighting the importance of accurate early diagnosis.

Despite substantial progress in automated skin lesion analysis, a gap remains between reported algorithmic performance and real-world clinical applicability. In practice, diagnostic systems must address three interrelated challenges.^2,3 First, the data distribution is highly imbalanced and typically follows a long-tailed pattern.^4,5 Common lesions such as melanocytic nevi (NV) vastly outnumber rare categories like dermatofibroma (DF), which often leads to performance degradation on minority classes.⁶ Second, dermoscopic images exhibit substantial intra-class variability.⁷ Lesion appearance is influenced by illumination conditions, hair occlusion, acquisition angles, and imaging artifacts, increasing the difficulty of robust feature extraction.^8,9 Third, the clinical environment is inherently open-set.¹⁰ Models must not only classify known categories accurately but also identify and reject samples belonging to previously unseen classes, a problem commonly referred to as Out-of-Distribution (OOD) detection.^11,12

Conventional deep learning approaches typically attempt to address these challenges within a unified cross-entropy training paradigm, which may lead to suboptimal trade-offs under severe imbalance and distribution shifts.¹³ To alleviate these limitations, prior studies have explored several directions. Architectural advances employ more expressive backbones such as EfficientNet¹⁴ and Vision Transformers (ViT).^9,15 Data-centric strategies utilize augmentation techniques¹⁶ or generative models to synthesize minority samples.¹⁷ Algorithmic refinements include modified loss functions, such as Focal Loss, and contrastive learning strategies.¹⁸ Recent works have further expanded these efforts. For example, Gayatri (2024)¹⁶ and Venugopal (2023)¹⁹ improved classification through refined loss formulations and backbone adaptations. Oztel (2024)²⁰ and Zhang (2025)⁹ investigated Vision Transformers for enhanced global feature modeling. Ensemble learning has also demonstrated improved robustness.^6,21 In the context of OOD detection, Milara (2025)¹¹ analyzed classifier behavior under distribution shifts. Nevertheless, many existing methods address imbalance and OOD detection in an entangled manner, which may limit performance when both challenges coexist.

To address this issue, we propose a multi-stage decoupled hybrid framework. First, for representation learning, we adopt Supervised Contrastive Learning (SupCon) to construct a more balanced feature space under long-tailed distributions. Second, for open-set recognition, we combine contrastive semantic features with structural anomaly scores derived from an independent Convolutional Autoencoder (CAE). This design integrates semantic discrimination with structural reconstruction to support both classification and anomaly rejection. Finally, the proposed DH-OOD detector fuses these complementary signals through a linear boundary formulation. Extensive experiments demonstrate improved robustness in long-tailed and open-set settings, with an unknown-class F1-score of 51.3%.

The remainder of this paper is organized as follows. Section 2 describes the proposed methodology and experimental protocols. Section 3 presents the experimental results and analysis. Section 4 concludes the study.

Materials and methods

The proposed framework operates as a multi-stage decoupled system designed to address feature quality, class imbalance, and OOD detection as distinct, manageable sub-problems (Figure 1).

Figure 1.

Overview of the proposed decoupled hybrid framework. The architecture comprises three core stages: (1) Representation learning via SupCon; (2) Classifier fine-tuning for semantic discrimination and Autoencoder training for structural anomaly detection; (3) The DH-OOD hybrid detector fuses deep classifier signals with CAE reconstruction errors.

Stage I: Decoupled representation learning via SupCon

End-to-end training on imbalanced datasets often results in a feature space dominated by majority classes. To mitigate this, we decouple representation learning from classifier training. In Stage I, we employ SupCon^18,22 to pre-train the backbone network $f_{θ}$ . We construct an ISICContrastiveDataset, generating two strongly augmented views $(v_{1}, v_{2})$ for each image x. These views are processed by a shared backbone $f_{θ}$ and a Multi-Layer Perceptron (MLP) projection head $g_{ϕ}$ , yielding projected features $z_{1}$ and $z_{2}$ . The SupConLoss objective functions to attract representations of the same class while repelling those of different classes.¹³ A key aspect of this design is the decoupling mechanism: the projection head $g_{ϕ}$ acts as an information bottleneck, encouraging the projected space z to focus on class-discriminative information. After training, $g_{ϕ}$ is discarded, and the backbone $f_{θ}$ retains transferable visual representations, such as texture and shape, which support subsequent stages.
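The Stage I objective can be illustrated with a minimal NumPy sketch of the supervised contrastive loss in the form given by Khosla et al.²² This is an illustration under assumptions, not the authors' implementation: the function name `supcon_loss`, the temperature value of 0.1, and the convention of stacking both views into one batch are choices made here for clarity.

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss for a batch of projected features.

    z:      (N, D) array of projections (e.g. both augmented views stacked).
    labels: (N,) integer class labels, one per row of z.
    The temperature default is an assumption, not a value from the paper.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # L2-normalise projections
    sim = z @ z.T / temperature                          # scaled cosine similarities
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)              # exclude self-comparisons
    # Numerically stable log-softmax over all other samples in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    # Positives: same label, different sample.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    n_pos = pos_mask.sum(axis=1)
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    valid = n_pos > 0                                    # skip anchors without positives
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())

labels = np.array([0, 0, 1, 1, 2, 2])
z_clustered = np.eye(3)[labels]                          # same-class features coincide
loss_aligned = supcon_loss(z_clustered, labels)
loss_misaligned = supcon_loss(z_clustered, np.array([0, 1, 2, 0, 1, 2]))
```

When features of the same class coincide, the loss is close to zero; when the labels disagree with the feature clusters, it is large. This is the attract/repel behaviour described above.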

Stage II: Imbalance-aware classifier fine-tuning

With discriminative features $f_{θ}$ established, we address the classifier bias induced by class imbalance. In Stage II, the backbone $f_{θ}$ is frozen, and only a linear classification head is trained. We integrate two strategies: first, weighted sampling based on inverse class frequency to oversample rare classes effectively; second, the introduction of CutMix²³ as a data augmentation mechanism. CutMix serves as an efficient Vicinal Risk Minimization strategy, generating a mixed sample $\tilde{x}$ by combining two training images $x_{i}$ and $x_{j}$ :

$$\tilde{x} = M \odot x_{i} + (1 - M) \odot x_{j}$$

Here, $x_{i}, x_{j} \in \mathbb{R}^{H \times W \times C}$ denote the two input image tensors, where $H$, $W$, and $C$ are the height, width, and channel dimensions. The term $M \in \{0, 1\}^{H \times W}$ is a binary spatial mask generated by sampling bounding-box coordinates $(r_{x}, r_{y}, r_{w}, r_{h})$ from a uniform distribution; following the original CutMix formulation,²³ $M$ is set to 0 inside the box and 1 elsewhere. The operator $\odot$ denotes element-wise multiplication, and $1$ is a tensor of ones, so that $1 - M$ is the complementary mask. Unlike standard noise augmentation, this operation spatially disrupts the global geometry of the lesion (e.g., the circular border of a nevus). By replacing a local patch of $x_{i}$ with content from $x_{j}$, the model is encouraged to rely on local discriminative semantics, such as pigment network texture or color irregularities, rather than global shape contours. This regularization reduces the risk of overfitting to dominant geometric features in majority classes. Concurrently, to support OOD detection in the subsequent stage, we train a Convolutional Autoencoder (CAE) on the training data of the known classes. Unlike the classifier, the CAE is trained without class labels or weighted sampling, relying solely on Mean Squared Error (MSE) for image reconstruction. The objective is to model the typical structural manifold of normal skin lesions. This gives the system a structural baseline for known lesions and supplies the reconstruction-error signal used for anomaly rejection in Stage III.
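The mixing operation can be sketched in a few lines of NumPy. The sketch follows the original CutMix convention,²³ in which the mask is 0 inside the sampled box so that a patch of $x_{i}$ is replaced by content from $x_{j}$. The Beta-distributed mixing ratio and the box-size rule are taken from that paper; they are assumptions with respect to the present work, which does not state them. The function name `cutmix` is hypothetical.

```python
import numpy as np

def cutmix(x_i, x_j, rng, beta_alpha=1.0):
    """Return (mixed image, mask M, fraction of pixels kept from x_i).

    x_i, x_j: float arrays of shape (H, W, C).
    beta_alpha: Beta(alpha, alpha) parameter for the mixing ratio (assumed value).
    """
    h, w = x_i.shape[:2]
    lam = rng.beta(beta_alpha, beta_alpha)               # target share of x_i
    cut_h = int(h * np.sqrt(1.0 - lam))                  # box size from the CutMix paper
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)      # uniformly sampled box centre
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mask = np.ones((h, w, 1), dtype=np.float32)
    mask[y0:y1, x0:x1] = 0.0                             # 0 inside the box, 1 elsewhere
    mixed = mask * x_i + (1.0 - mask) * x_j              # x~ = M * x_i + (1 - M) * x_j
    lam_adjusted = float(mask.mean())                    # actual share after clipping
    return mixed, mask[..., 0], lam_adjusted

rng = np.random.default_rng(0)
img_a = np.zeros((32, 32, 3), dtype=np.float32)
img_b = np.ones((32, 32, 3), dtype=np.float32)
mixed, mask, lam = cutmix(img_a, img_b, rng)
```

In the original CutMix recipe the labels are mixed with the same ratio (`lam_adjusted` here); whether the present framework does so is not stated in the text.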

Stage III: Hybrid OOD detection via DH-OOD

We propose the DH-OOD hybrid system, which fuses two distinct signals via a dual-stream mechanism. In the feature space, we posit that typical in-distribution samples occupy a manifold characterized by high semantic confidence and low reconstruction error. To quantify this, we define a hybrid fusion score $S (x)$ that integrates semantic certainty with structural fidelity:

$$S(x) = α \cdot s_{\text{semantic}}(x) - β \cdot s_{\text{structure}}(x)$$

where $s_{\text{semantic}}(x) = \max_{k} P(y = k \mid x)$ represents the Maximum Softmax Probability (MSP)²⁴ derived from the frozen classifier $f_{θ}$, serving as a proxy for semantic confidence; and $s_{\text{structure}}(x) = \| x - \mathrm{CAE}(x) \|_{2}^{2}$ denotes the pixel-wise reconstruction error from the independent Autoencoder, quantifying the deviation of the input from the learned manifold of normal skin features. The scalars $α, β \in \mathbb{R}^{+}$ act as calibration coefficients to balance these signals, ensuring that neither the high-variance reconstruction error nor the potentially overconfident softmax score dominates the decision boundary.
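As a concrete reading of the fusion score, the following NumPy sketch computes $S(x)$ from classifier logits and autoencoder reconstructions. It assumes the logits and reconstructions are already available as arrays; the helper names `softmax` and `hybrid_score` are hypothetical, and no input normalisation beyond what the formula states is applied.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def hybrid_score(logits, x, x_rec, alpha, beta):
    """S(x) = alpha * MSP(x) - beta * ||x - CAE(x)||_2^2, one score per sample.

    logits: (N, K) classifier outputs; x, x_rec: (N, H, W, C) inputs and
    their reconstructions. Higher scores indicate more typical samples.
    """
    s_semantic = softmax(logits).max(axis=1)                          # MSP
    s_structure = ((x - x_rec) ** 2).reshape(len(x), -1).sum(axis=1)  # squared L2 error
    return alpha * s_semantic - beta * s_structure

logits = np.array([[10.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
x = np.zeros((2, 4, 4, 3))
x_rec = x.copy()
x_rec[1] += 0.5                     # second sample is reconstructed poorly
scores = hybrid_score(logits, x, x_rec, alpha=0.4, beta=0.8)
```

The first sample (confident prediction, perfect reconstruction) receives a positive score, while the second (flat softmax, large reconstruction error) receives a strongly negative one. Using the mean instead of the sum of squared errors would only rescale $β$.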

Based on the distribution of $S (x)$ on a held-out validation set, we establish a decision rule using a calibrated threshold $τ$ :

$$\hat{y} = \begin{cases} \arg\max_{k} P(y = k \mid x), & \text{if } S(x) \geq τ \\ \text{OOD}, & \text{if } S(x) < τ \end{cases}$$

where $\hat{y}$ denotes the final decision of the system. If the hybrid score $S(x)$ meets the decision threshold $τ$, the system assigns the class label with the highest semantic probability; otherwise, the sample is rejected as an Out-of-Distribution (OOD) anomaly. Geometrically, this formulation defines a linear decision boundary in the joint semantic–structural space. The threshold $τ$ is calibrated to retain 95% of known validation samples (TPR = 95%). This boundary covers the high-density region of in-distribution samples while excluding high-error anomalies, without requiring explicit OOD supervision.
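The calibration and decision rule can be sketched as follows. This is a minimal illustration assuming the hybrid scores of the known validation samples are available as a one-dimensional array; the names `calibrate_threshold`, `decide`, and the sentinel value `OOD_LABEL = -1` are hypothetical.

```python
import numpy as np

OOD_LABEL = -1  # sentinel for rejected samples (an arbitrary choice for this sketch)

def calibrate_threshold(val_scores, tpr=0.95):
    """Pick tau so that a fraction `tpr` of known validation samples has S(x) >= tau."""
    return float(np.quantile(val_scores, 1.0 - tpr))

def decide(probs, scores, tau):
    """Return argmax class where S(x) >= tau, and OOD_LABEL otherwise."""
    preds = probs.argmax(axis=1)
    return np.where(scores >= tau, preds, OOD_LABEL)

rng = np.random.default_rng(0)
val_scores = rng.normal(loc=0.5, scale=0.1, size=1000)   # synthetic validation scores
tau = calibrate_threshold(val_scores)
retained = float((val_scores >= tau).mean())

probs = np.array([[0.9, 0.1], [0.2, 0.8]])
decisions = decide(probs, scores=np.array([1.0, -1.0]), tau=0.0)
```

Because $τ$ is the 5th percentile of the known-sample scores, no unknown samples are needed to set it, which matches the statement that calibration requires no explicit OOD supervision.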

Experiment

This study primarily utilizes the ISIC 2019 skin lesion dataset.¹⁶ The training phase employs 25,331 images, with the detailed class distribution presented in Table 1. The data exhibit extreme long-tail characteristics: the most common class, Melanocytic Nevus (NV), accounts for over 50% of the dataset, whereas rare classes such as Dermatofibroma (DF) and Vascular Lesions (VASC) each account for about 1%. Such disparity is a primary driver of performance degradation in traditional classifiers.¹³ For evaluation, we constructed a test set containing 8238 images (Table 1). To evaluate robustness in a realistic open environment, the test set design adheres to two criteria. First, it maintains an imbalanced distribution similar to the training set for the 8 known classes to evaluate standard diagnostic capability under data sparsity. Second, it introduces a 9th “unknown” class (UNK) unseen during training, comprising 2047 images. It is crucial to note that this UNK class is not a single homogeneous pathology but a diverse collection of undefined atypical lesions and outliers. This heterogeneous composition ensures that our evaluation tests the model's robustness against complex, multi-source distribution shifts rather than a single artifact type. Figure 2 visualizes representative samples, highlighting the high variability in color, texture, and morphology.

Figure 2.

Visualization of ISIC 2019 dataset samples. This figure illustrates the morphological diversity of skin lesions, covering known categories and challenging visual features.

Table 1.

Comparison of data distribution between ISIC 2019 training and test sets.

Class	Train / test	Ratio
MEL (Melanoma)	4522 / 1327	0.178 / 0.161
NV (Melanocytic Nevus)	12,875 / 2495	0.508 / 0.303
BCC (Basal Cell Carcinoma)	3323 / 975	0.130 / 0.118
AK (Actinic Keratosis)	867 / 374	0.030 / 0.045
BKL (Benign Keratosis)	2624 / 660	0.100 / 0.080
DF (Dermatofibroma)	239 / 91	0.009 / 0.011
VASC (Vascular Lesion)	253 / 104	0.010 / 0.013
SCC (Squamous Cell Carcinoma)	628 / 165	0.024 / 0.020
UNK (Unknown / OOD)	0 / 2047	0.000 / 0.248
Total	25,331 / 8238	1.000 / 1.000
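To make the imbalance in Table 1 concrete, the inverse-class-frequency sampling weights used for classifier fine-tuning can be sketched from the training counts. The snippet below is an illustration only: the dictionary simply copies the known-class training counts from Table 1, and the function name `inverse_frequency_weights` is hypothetical.

```python
import numpy as np

# Training counts of the 8 known classes, copied from Table 1.
TRAIN_COUNTS = {
    "MEL": 4522, "NV": 12875, "BCC": 3323, "AK": 867,
    "BKL": 2624, "DF": 239, "VASC": 253, "SCC": 628,
}

def inverse_frequency_weights(labels, counts):
    """Per-sample sampling weights proportional to 1 / (count of the sample's class)."""
    return np.array([1.0 / counts[c] for c in labels])

classes = list(TRAIN_COUNTS)
counts = np.array([TRAIN_COUNTS[c] for c in classes], dtype=float)

# Expected share of each class in a resampled epoch: count * (1 / count), normalised.
class_share = counts * (1.0 / counts)
class_share /= class_share.sum()

example_weights = inverse_frequency_weights(["NV", "DF"], TRAIN_COUNTS)
```

Under these weights every class is drawn with equal expected frequency (1/8), so a DF image is sampled roughly 54 times as often as an NV image. In PyTorch, such per-sample weights are typically passed to `torch.utils.data.WeightedRandomSampler`; whether the authors used that class is not stated.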

Given the composite challenge of long-tailed class distribution and open-set recognition, we adopted an evaluation protocol following the benchmarks established by previous works.^25,26 To assess model performance, we report five complementary metrics. We use standard Accuracy (ACC) as a baseline measure of overall correctness, while Balanced Accuracy (B-Acc), defined as the arithmetic mean of recall across known classes, is used to address extreme class imbalance. B-Acc prevents the evaluation from being biased toward majority classes, such as Nevus. For Out-of-Distribution (OOD) detection, we employ the Area Under the Receiver Operating Characteristic curve (AUROC) to evaluate the global separability between known and unknown samples across all decision thresholds. To quantify clinical safety, we report the False Positive Rate at 95% True Positive Rate (FPR95), which measures the fraction of OOD samples incorrectly accepted as in-distribution when 95% of known samples are correctly accepted. Finally, we calculate the F1-Score for the unknown class (UNK F1), the harmonic mean of the precision and recall of the rejection mechanism, ensuring the model does not achieve trivial performance by rejecting all inputs.
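The four less common metrics can be written down compactly. The following NumPy sketch uses the standard textbook definitions and assumes the convention used throughout this paper that higher scores indicate in-distribution samples; all function names are hypothetical and this is not the authors' evaluation code.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall over the classes present in y_true."""
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when `tpr` of ID samples are accepted."""
    tau = np.quantile(id_scores, 1.0 - tpr)
    return float((ood_scores >= tau).mean())

def auroc(id_scores, ood_scores):
    """P(ID score > OOD score) over all pairs, counting ties as one half.

    Uses an (N_id x N_ood) comparison matrix, which is fine for a few
    thousand samples per group.
    """
    diff = id_scores[:, None] - ood_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def f1_for_class(y_true, y_pred, cls):
    """F1 = 2TP / (2TP + FP + FN) for a single class (e.g. the UNK class)."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0])
bacc = balanced_accuracy(y_true, y_pred)   # recall 1.0 and 0.5
```

In the example, plain accuracy would be 5/6 while Balanced Accuracy is 0.75, which illustrates why B-Acc is less forgiving of errors on the minority class.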

We employ EfficientNet-B4¹⁴ as the backbone architecture. The training process is structured into three distinct phases. Stage I (SupCon) pre-trains the backbone for 50 epochs with a batch size of 16 using the Adam optimizer and a learning rate of $2 \times 10^{-4}$. Stage II (Hybrid Training) concurrently optimizes two components: the linear classifier head is fine-tuned for 30 epochs with a batch size of 32 (learning rate $1 \times 10^{-4}$), while the CAE is trained on known-class data for 100 epochs using MSE loss to learn the structural manifold. Stage III involves no gradient updates: the fusion coefficients ($α, β$) and the decision threshold ($τ$) are determined empirically through the grid search and sensitivity analysis reported in the Results and discussion section. Once these hyperparameters are fixed, the DH-OOD detector performs inference on the test data, classifying known lesions and rejecting anomalies. To ensure statistical reliability and reproducibility, all quantitative metrics reported in the subsequent sections are the mean $\pm$ standard deviation over 10 independent runs. These runs used a fixed set of random seeds $\{100 \times i \mid i = 1, \dots, 10\}$ to initialize model weights and data shuffling. To verify generalization beyond the fixed dataset split, we also adopted a Leave-One-Out (LOO) protocol.²⁷ In this setup, we rotated through all 8 known classes, treating one as the ground-truth OOD set while training on the remaining 7. This protocol evaluates the detector's robustness to unseen morphological shifts.
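The seed set and the class rotation of the LOO protocol are simple enough to state in code. The sketch below only enumerates the configurations; the training and evaluation steps themselves are omitted, and the names `SEEDS`, `KNOWN_CLASSES`, and `leave_one_out_splits` are hypothetical.

```python
# Seeds 100, 200, ..., 1000, as described in the text.
SEEDS = [100 * i for i in range(1, 11)]

# The 8 known ISIC 2019 classes from Table 1.
KNOWN_CLASSES = ["MEL", "NV", "BCC", "AK", "BKL", "DF", "VASC", "SCC"]

def leave_one_out_splits(classes):
    """Yield (held_out_class, training_classes) for every class in turn."""
    for held_out in classes:
        yield held_out, [c for c in classes if c != held_out]

splits = list(leave_one_out_splits(KNOWN_CLASSES))
```

Each of the 8 resulting splits trains on 7 classes and treats the held-out class as the OOD set at test time.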

Results and discussion

Before evaluating the OOD detection performance, we first validated the training stability of the decoupled stages. Figure 3 illustrates the learning curves for the feature extraction and classifier fine-tuning phases. As shown in the left panel, the Supervised Contrastive (SupCon) loss in Stage I decreases steadily, indicating that the backbone effectively learns compact class representations. Subsequently, the right panel confirms that the classifier head in Stage II converges rapidly to a high Balanced Accuracy on the frozen backbone, validating the efficacy of the decoupled training protocol.

Figure 3.

Training process validation shows stable convergence of decoupled stages. Left: SupCon loss (Stage I) decreases steadily. Right: Classifier Balanced Accuracy (Stage II) on the frozen backbone converges rapidly to a high level.

Hyperparameter sensitivity analysis

To ensure the proposed framework generalizes well to unseen data, we determined the hyperparameters on a held-out validation set, strictly isolating the test set to prevent data leakage. First, we performed a grid search over the fusion coefficients $α$ (semantic weight) and $β$ (structural weight). As illustrated in Figure 4, the UNK F1-Score surface has a stable high-performance region along the diagonal, with peak performance at $α = 0.4$ and $β = 0.8$. This indicates that a balanced integration of semantic confidence and structural reconstruction error is essential; relying excessively on either signal alone leads to suboptimal detection boundaries. Second, we analyzed the impact of the decision threshold $τ$ on the trade-off between identifying known classes and rejecting outliers. Figure 5 plots the ID Balanced Accuracy and UNK F1-Score against the target coverage of known validation samples, which determines $τ$. The OOD detection performance remains relatively stable within a target coverage range of approximately 94%–96%. Consequently, we selected a target coverage of 95% as a representative operating point within this robust interval. Pushing the target coverage beyond this stable zone toward 99% marginally improves ID Balanced Accuracy ($+1.0\%$) but degrades OOD detection (UNK F1 decreases by $\approx 16\%$), as the decision boundary becomes too loose to filter out anomalies. Thus, setting $τ$ at 95% coverage provides a practical trade-off between diagnostic recall and anomaly rejection.
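The grid search over the fusion weights can be sketched as below. The sketch assumes that the validation set contains some samples flagged as unknown so that UNK F1 can be computed; the text does not specify how such samples were obtained, so the synthetic data, the grid values, and the function names here are all illustrative assumptions.

```python
import numpy as np

def unk_f1_at_coverage(scores, is_unknown, coverage=0.95):
    """UNK F1 when tau is set to retain `coverage` of the known samples."""
    tau = np.quantile(scores[~is_unknown], 1.0 - coverage)
    rejected = scores < tau
    tp = np.sum(rejected & is_unknown)
    fp = np.sum(rejected & ~is_unknown)
    fn = np.sum(~rejected & is_unknown)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def grid_search(s_sem, s_str, is_unknown, grid):
    """Return (alpha, beta, best UNK F1) over all pairs in grid x grid."""
    best = (None, None, -1.0)
    for alpha in grid:
        for beta in grid:
            f1 = unk_f1_at_coverage(alpha * s_sem - beta * s_str, is_unknown)
            if f1 > best[2]:
                best = (alpha, beta, f1)
    return best

# Synthetic stand-in for validation signals: 200 known and 100 unknown samples.
rng = np.random.default_rng(0)
is_unknown = np.r_[np.zeros(200, dtype=bool), np.ones(100, dtype=bool)]
s_sem = np.r_[rng.uniform(0.7, 1.0, 200), rng.uniform(0.3, 0.9, 100)]
s_str = np.r_[rng.uniform(0.0, 0.2, 200), rng.uniform(0.1, 0.6, 100)]
grid = [0.2, 0.4, 0.6, 0.8, 1.0]
best_alpha, best_beta, best_f1 = grid_search(s_sem, s_str, is_unknown, grid)
```

In this sketch $τ$ is re-calibrated for every pair, so multiplying both weights by the same positive constant leaves every decision unchanged; only the ratio $α / β$ matters. This is consistent with a band of similar scores running along a diagonal of the grid.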

Figure 4.

Grid search of fusion weights α and β. The annotated map displays UNK F1-scores on the validation set. The high-score region indicates that the mechanism is robust when both signals are weighted effectively, rather than collapsing to a single stream.

Figure 5.

Trade-off analysis between known-class balanced accuracy and unknown-class F1-score. The solid curve, corresponding to the left axis, represents known-class balanced accuracy, whereas the dashed curve, corresponding to the right axis, represents unknown-class F1-score. The x-axis represents the target coverage of known validation samples, which determines τ. The vertical dotted line marks the chosen operating point (95% target coverage), balancing diagnostic recall with rejection precision.

Stepwise ablation and sensitivity analysis

To validate the efficacy of each decoupled component, we conducted a stepwise ablation study. The results, detailed in Table 2, quantify the contribution of each stage. First, regarding the backbone strategy, we compared the “Frozen” approach against standard fine-tuning. Unfreezing the backbone during classifier training (Row B) markedly reduced Balanced Accuracy relative to the proposed frozen strategy (Row H) ($78.5\% \to 66.5\%$) and weakened OOD detection (UNK F1 dropped from $51.3\%$ to $27.5\%$). These results indicate that fine-tuning on long-tailed data shifts the feature space toward majority classes, whereas the freezing strategy helps preserve the distribution learned during SupCon pre-training. Second, we benchmarked CutMix against other robust augmentations, including Mixup²⁸ and PixMix.²⁹ All three strategies improved on the frozen configuration without augmentation (Row C, $74.6\%$ B-Acc), and CutMix (Row H) yielded the best performance ($78.5\%$ B-Acc), marginally outperforming PixMix (Row E, $77.8\%$) and Mixup (Row D, $77.2\%$). Finally, we evaluated the contribution of the DH-OOD module against established semantic OOD baselines. While Energy-based scores (Row G) improved upon the MSP baseline (Row F), they remained limited by the closed-world assumption. The proposed DH-OOD (Row H) fuses semantic confidence with structural reconstruction error, achieving an UNK F1-Score of $51.3\% \pm 0.6\%$. Qualitative prediction examples are shown in Figure 6.
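For reference, the two semantic baselines in Rows F and G can be computed from the same logits. The sketch below uses the usual definitions: MSP is the largest softmax probability, and the energy score is taken as the negative free energy $T \cdot \log \sum_{k} \exp(z_{k} / T)$, so that higher values indicate in-distribution samples. The temperature default of 1 and the function names are assumptions; the paper does not report its energy-score settings.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability per row."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return p.max(axis=1)

def energy_score(logits, temperature=1.0):
    """Negative free energy T * logsumexp(z / T), computed stably."""
    z = logits / temperature
    m = z.max(axis=1)
    return temperature * (m + np.log(np.exp(z - m[:, None]).sum(axis=1)))

logits = np.array([[10.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
msp = msp_score(logits)
energy = energy_score(logits)
shifted_energy = energy_score(logits + 5.0)
```

Adding a constant to all logits leaves MSP unchanged but shifts the energy score by that constant. Both scores depend only on the classifier logits, which is why neither sees the structural signal that the CAE branch contributes.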

Figure 6.

Qualitative prediction examples. The figure demonstrates accurate classification of lesions from the 8 known classes (e.g., MEL, NV, BCC) and successful rejection of a lesion image from the 9th, unknown class.

Table 2.

Comprehensive ablation study analyzing backbone strategies, data augmentation, and OOD scoring mechanisms.

ID	Stage I (feature)	Stage II (classifier)	Stage III (OOD scorer)	8-Class B-Acc (%)	UNK F1 (%)
A	-	Baseline (CE, End-to-End)	-	65.2 ± 0.9	1.1 ± 0.2
Comparison 1: Backbone Strategy
B	SupCon	CutMix + Fine-tuned	Hybrid Detector	66.5 ± 0.8	27.5 ± 1.8
Comparison 2: Data Augmentation
C	SupCon	Frozen	Hybrid Detector	74.6 ± 0.7	35.8 ± 1.4
D	SupCon	Mixup + Frozen	Hybrid Detector	77.2 ± 0.7	46.8 ± 0.4
E	SupCon	PixMix + Frozen	Hybrid Detector	77.8 ± 0.6	49.8 ± 1.5
Comparison 3: OOD Scoring Method
F	SupCon	CutMix + Frozen	MSP	74.0 ± 0.4	12.4 ± 2.2
G	SupCon	CutMix + Frozen	Energy Score	75.2 ± 0.4	28.7 ± 1.9
Ours
H	SupCon	CutMix + Frozen	Hybrid Detector	78.5 ± 0.4	51.3 ± 0.6


Robustness verification via LOO protocol

To verify that DH-OOD generalizes beyond specific class splits, we evaluated the model using the LOO protocol. Table 3 reports the detection performance when each known class is treated as an unknown anomaly. The DH-OOD framework demonstrates consistent robustness with an average AUC of 92.3%. Notably, even when common classes like NV or MEL are held out, the structural branch successfully identifies them as outliers, suggesting that the autoencoder captures general characteristics of the in-distribution manifold rather than overfitting specific outlier patterns. Complementing these quantitative results, the confusion matrix in Figure 7 confirms high recall for the UNK class. Additionally, Grad-CAM visualizations (Figure 8) indicate that the backbone focuses on semantically meaningful lesion features rather than confounding background artifacts, validating clinical interpretability.

Figure 7.

Confusion matrix of the final model on the test set. Diagonal elements represent Recall. The model maintains high classification accuracy across the 8 known classes while achieving significant recall for the UNK class, confirming the efficacy of the DH-OOD module in rejecting OOD samples.

Figure 8.

Grad-CAM based visual explanations. Heatmaps indicate that SupCon pre-training guides the model to focus on morphological lesion features. This validates that the decoupled feature extractor possesses enhanced semantic localization and clinical interpretability.

Table 3.

Robustness analysis using the Leave-One-Out protocol across all 8 classes. Each row represents a scenario where that specific class was held out from training and treated as OOD during testing.

Held-out class	F1-Score (%)	AUC (%)
MEL	47.5	91.2
NV	46.8	90.5
BCC	49.2	92.1
AK	48.6	92.5
BKL	50.3	93.4
DF	50.1	93.1
VASC	53.4	92.9
SCC	48.9	92.8
Average	49.4	92.3

Benchmarking against state-of-the-art

We compared the final DH-OOD framework against SOTA methods under two distinct settings. Table 4 presents the classification results on the 8 known classes. Our model achieves a Balanced Accuracy of 78.5% ± 0.4% and an AUC of 0.928. Its Balanced Accuracy exceeds that of recent methods including MetaBlock²⁵ (76.2%) and Weighted Ensembles³⁰ (75.0%), while its AUC is higher than that of MetaBlock (0.866) and comparable to that of the Weighted Ensemble (0.930). This indicates that incorporating the rejection mechanism does not substantially compromise classification performance. Table 5 provides a comparison against long-tailed recognition methods and OOD detectors. Two insights emerge. First, established long-tailed methods such as Smooth Balance Softmax (BSM)³¹ improve Balanced Accuracy but overfit to the closed-set distribution, yielding negligible UNK F1-scores (at most 2.1%). This suggests that aggressively expanding decision boundaries for minority classes may increase overlap with unknown regions. Second, modern OOD detectors, such as ASH,²⁶ struggle with the high intra-class variance in skin lesions (UNK F1 of 32.8% and FPR95 of 54.1% for ASH). In contrast, DH-OOD reduces the FPR95 to $42.4\% \pm 1.6\%$ and achieves an UNK F1-score of $51.3\% \pm 0.6\%$. These results indicate that combining classifier confidence with structural reconstruction error improves open-set detection performance.

Table 4.

Performance comparison on ISIC 2019. Results for our model are means over 10 independent runs.

Method	Model/Strategy	ACC (%)	B-Acc (%)	AUC
Baseline	End-to-End Cross-Entropy	70.1	65.2	0.85
Pacheco et al.²⁵	EfficientNetB2	73.3	54.2	0.928
Gun et al.³²	EfficientNetB1	76.0	61.0	-
Pacheco et al.²⁵	MetaBlock	80.7	76.2	0.866
Rahman et al.³⁰	Weighted Ensemble	81.1	75.0	0.930
Ours	DH-OOD	82.5	78.5	0.928

Table 5.

Unified performance comparison against SOTA methods on ISIC 2019. Results are reported as mean ± std over 10 independent runs. (↑: higher is better, ↓: lower is better).

Method	B-Acc (%)↑	UNK F1 (%)↑	FPR95 (%)↓	AUC (%)↑
CE (Baseline)	65.2 ± 0.5	1.1 ± 0.2	85.2 ± 1.2	72.1 ± 1.4
cRT³³	75.8 ± 0.4	2.0 ± 0.3	91.2 ± 0.9	75.8 ± 1.1
ALA Loss³⁴	77.2 ± 0.6	1.5 ± 0.4	88.5 ± 1.5	77.2 ± 1.3
Smooth BSM³¹	79.2 ± 0.3	2.1 ± 0.6	89.1 ± 1.1	77.5 ± 0.9
ReAct³⁵	74.2 ± 0.5	25.6 ± 2.1	61.8 ± 2.3	85.5 ± 1.8
MLS³⁶	71.1 ± 0.4	28.2 ± 2.4	58.5 ± 1.9	90.2 ± 1.5
ASH²⁶	76.4 ± 0.6	32.8 ± 2.9	54.1 ± 2.2	91.5 ± 1.2
DH-OOD (Ours)	78.5 ± 0.4	51.3 ± 0.6	42.4 ± 1.6	92.8 ± 0.8

Conclusion

This paper proposes a multi-stage, decoupled hybrid framework to systematically address the challenges of real-world skin lesion analysis. By decoupling representation learning from classifier training, SupCon constructs a more balanced feature space, mitigating degradation in minority classes. Ablation studies support the contribution of each decoupled component. In particular, the frozen backbone strategy and CutMix augmentation help maintain feature stability and preserve local structural information. Furthermore, the DH-OOD detector combines semantic confidence with structural anomaly scores through a linear boundary formulation, achieving a Balanced Accuracy of 78.5% and an unknown-class F1-Score of 51.3%. Comparative experiments show consistent improvements over several baseline and competitive methods. The LOO protocol further suggests that the structural branch captures general patterns of typicality rather than overfitting specific outliers. While the DH-OOD module currently relies on autoencoder reconstruction error, future work will explore normalizing flows to model data manifolds more precisely and investigate the transferability of this decoupled framework to architectures such as Vision Transformers.

Footnotes

Author contributions

Benyuan He: Investigation, Methodology, Software, Writing—original draft. Lei Yao: Supervision— review editing. Ning Xue: Supervision—review editing. Chunxiu Liu: Writing—review editing. Tiezhu Liu: Conceptualization, Supervision, Writing—review editing. Zhimei Qi: Supervision, Writing—review editing.

All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The data that support the findings of this study are openly available in ISIC 2019—Skin Cancer Detection at .

References

1. Siegel, Giaquinto, Gemignani, et al. Cancer statistics, 2024. CA Cancer J Clin 2024; 74: 12–49.

2. Cassidy, Kendrick, Brodzicki, et al. Analysis of the ISIC image datasets: usage, benchmarks and recommendations. Med Image Anal 2022; 75: 102305.

3. Hameed, Zameer, Raja MAZ. A comprehensive systematic review: advancements in skin cancer classification and segmentation using the ISIC dataset. Comput Model Eng Sci 2024; 140: 2131–2164.

4. Huang, Zhang, Ran, et al. An ingeniously designed skin lesion classification model across clinical and dermatoscopic datasets. Diagnostics (Basel) 2025; 15: 2011.

5. Zhu, Wang, Shi, et al. A deep learning fusion network trained with clinical and high-frequency ultrasound images in the multi-classification of skin diseases in comparison with dermatologists: a prospective and multicenter study. eClinicalMed 2024; 67: 102391.

6. Khan, Alam, Ahmed. Enhanced skin cancer diagnosis via deep convolutional neural networks with ensemble learning. SN Comput Sci 2025; 6: 124.

7. Ranjan Kumar, et al. Deep learning-based automated classification of skin lesions using CNN and computer vision. SN Comput Sci 2025; 6: 846.

8. Alrabai, et al. Exploring pre-trained models for skin cancer classification. Appl Syst Innov 2025; 8: 35.

9. Zhang, Liu, Ouyang, et al. Dermvit: diagnosis-guided vision transformer for robust and efficient skin lesion classification. Bioengineering (Basel) 2025; 12: 421.

10. Hong, et al. Out-of-distribution detection in medical image analysis: a survey. arXiv [Preprint]. 2024. Available from: arXiv:2404.18279.

11. Milara, Gomez-Martinez, Chushig-Muzo, et al. Out-of-distribution performance analysis of skin lesion classifiers for dermoscopic images. Research Square [Preprint] 2025. doi:10.21203/rs.3.rs-7544969/v1

12. et al. Deep neural forest for out-of-distribution detection of skin lesion images. IEEE J Biomed Health Inform 2023; 27: 157–165.

13. Alzahrani. SkinLiTE: lightweight supervised contrastive learning model for enhanced skin lesion detection and disease typification in dermoscopic images. Curr Med Imaging 2024; 20: e15734056313837.

14. Tan. Efficientnet: rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning, 2019, pp.6105–6114.

15. Rani, Santhiya. Optimized vision transformer with adaptive attention for high-precision skin cancer classification. In: 2025 International Conference on Advanced Computing Technologies (ICoACT), 2025, pp.1–6.

16. Gayatri, Aarthy. Reduction of overfitting on the highly imbalanced ISIC-2019 skin dataset using deep learning frameworks. J Xray Sci Technol 2024; 32: 53–68.

17. Farooq, et al. Derm-T2IM: harnessing synthetic skin lesion data. arXiv [Preprint]. 2024. Available from: arXiv:2401.05159.

18. Duan, Chen. Prototypes contrastive learning empowered intelligent diagnosis for skin lesion. IEEE Internet Things J 2024; 11: 35329–35340.

19. Venugopal, et al. A deep neural network using modified EfficientNet. Decis Anal J 2023; 8: 100278.

20. Oztel. Vision transformer and CNN-based skin lesion analysis: classification of monkeypox. Multimed Tools Appl 2024; 83: 1–15.

21. Mandal, et al. Active learning with particle swarm optimization for enhanced skin cancer classification utilizing deep CNN models. J Imaging Inform Med 2025; 38: 2472–2489.

22. Khosla, Teterwak, Wang, et al. Supervised contrastive learning. Adv Neural Inf Process Syst 2020; 33: 18661–18673.

23. Yun, Han, et al. Cutmix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp.6023–6032.

24. Hendrycks, Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv [Preprint]. 2016. Available from: arXiv:1610.02136.

25. Pacheco AGC, Krohling. An attention-based mechanism to combine images and metadata. IEEE J Biomed Health Inform 2021; 25: 3554–3563.

26. Djurisic, Bozanic, Ashok, et al. Extremely simple activation shaping for out-of-distribution detection. arXiv [Preprint]. 2022. Available from: arXiv:2209.09858.

27. Wong. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit 2015; 48: 2839–2846.

28. Thulasidasan, Chennupati, Bilmes, et al. On Mixup training: improved calibration and predictive uncertainty for deep neural networks. Adv Neural Inf Process Syst 2021; 34: 13888–13899.

29. Hendrycks, Zhao, Basart, et al. Pixmix: dreamlike pictures comprehensively improve safety measures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp.16783–16792.

30. Rahman, Hossain, Islam, et al. An approach for multiclass skin lesion classification based on ensemble learning. Inform Med Unlocked 2021; 25: 100659.

31. Hong. Smooth balance softmax for long-tailed image classification. Int Conf Adv Inf Commun Technol 2024; 1205: 323–331.

32. Gun, Bilgin. Classification of skin lesions using deep learning. In: 2024 Innovations in Intelligent Systems and Applications Conference (ASYU), 2024, pp.1–6.

33. Kang, Xie, Rohrbach, et al. Decoupling representation and classifier for long-tailed recognition. arXiv [Preprint]. 2019. Available from: arXiv:1910.09217.

34. Zhao, Liu, Shen, et al. Adaptive logit adjustment loss for long-tailed visual recognition. Proc AAAI Conf Artif Intell 2022; 36: 3472–3480.

35. Sun, Ming, Zhu, et al. React: out-of-distribution detection with rectified activations. Adv Neural Inf Process Syst 2021; 34: 144–157.

36. Hendrycks, Basart, Mazeika, et al. Scaling out-of-distribution detection for real-world settings. In: Proceedings of the 39th International Conference on Machine Learning. Proc Mach Learn Res 2022; 162: 8759–8773.