Cell Type Prediction for Single-Cell RNA Sequencing Utilizing Unsupervised Domain Adaptation and Semi-Supervised Learning

Abstract

Single-cell RNA sequencing (scRNA-seq) techniques for measuring gene expression in individual cells have developed rapidly. Recently, the identification of cell types in scRNA-seq analysis has been accomplished using deep learning. Most methods utilize a dataset containing cell-type labels to train the model and then apply this model to other datasets. However, the integration of multiple datasets leads to unexpected batch effects caused by differences in laboratories, experimenters, and sequencing techniques. As the batch effect obscures the biological signal of interest, an effective batch correction method is essential. In this article, we present scUDAS, a cell-type prediction model for scRNA-seq that utilizes unsupervised domain adaptation and semi-supervised learning (SSL) to reduce the differences in distributions between datasets. First, we pretrain the proposed model on the source dataset, which contains cell-type information. Subsequently, scUDAS is trained on the target dataset by leveraging adversarial training to align the distribution of the target dataset with that of the source dataset. Finally, scUDAS is retrained through SSL, leveraging both the source and target datasets with consistency regularization, to improve its performance. scUDAS outperformed other deep learning-based batch correction models in most settings by appropriately removing the batch effect. scUDAS is publicly available at https://github.com/cbi-bioinfo/scUDAS.

Keywords

cell-type classification; semi-supervised learning; single-cell RNA sequencing; unsupervised domain adaptation

1. INTRODUCTION

Recently, single-cell RNA sequencing (scRNA-seq) techniques (Gierahn et al., 2017; Picelli et al., 2013; Zheng et al., 2017; Hashimshony et al., 2016) have developed rapidly, enabling researchers to study transcriptome heterogeneity in various tissues and diseases at the individual cell level. A critical step in scRNA-seq analysis is the identification of cell types (Regev et al., 2017; Trapnell, 2015), which typically involves clustering the cells and then annotating each cluster using marker genes (Soneson and Robinson, 2018). However, this approach requires the identification of cluster-specific marker genes and their comparison with known marker genes, which is a demanding and time-consuming task (Abdelaal et al., 2019).

With the growing availability of scRNA-seq datasets containing cell-type annotations, machine learning techniques have been developed to automate and improve the accuracy of cell type identification in new datasets. Traditionally, unsupervised learning methods—such as principal component analysis (PCA) and clustering algorithms—have been employed to identify marker genes for each cell type using reference datasets and to annotate subgroups of cells. For example, RaceID (Grün et al., 2015) distinguished different cell types in complex mixtures using k-means clustering, while shared nearest neighbor (SNN)-Clip (Xu and Su, 2015) clustered single-cell transcriptomes through SNN graph construction. To reveal hierarchical biological structures, DendroSplit (Zhang et al., 2018) proposed an interpretable clustering framework based on a separation score with feature selection. Other approaches like SIMLR (Wang et al., 2018) and MPSSC (Park and Zhao, 2018) utilized multiple kernel learning and spectral clustering to infer cell-to-cell similarities in heterogeneous samples. RAFSIL applied a random forest model in an unsupervised manner to learn cell-type similarities for exploratory analysis (Pouyan and Kostka, 2018), and SinNLRR (Zheng et al., 2019) identified cell types by extracting non-negative, low-rank representations of gene expression matrices from candidate subspaces.

Despite the advantage of not requiring predefined cell-type labels, unsupervised methods often perform poorly when marker genes are inadequately selected due to limited prior knowledge (Zhao et al., 2020). To address this limitation, several supervised machine learning approaches have been introduced for more accurate and robust cell type annotation. Scmap fit a linear model designed to capture the relationship between gene expression values and dropout rates in the reference dataset and selected the top N residuals from this model as informative features (Kiselev et al., 2018). Based on these features, new cells were projected into the reference space to identify the most similar cell subtype. CaSTLe selected features with the highest mean expression and mutual information between genes and cell subtypes (Lieberman et al., 2018), then built a classification model using XGBoost, enhanced with transfer learning, to classify cells in the target dataset. ScPred applied singular value decomposition to decompose the gene expression matrix and extract key features, which were then used to train a support vector machine for classification (Alquicira-Hernandez et al., 2019). CHETAH constructed a classification tree using reference profiles averaged over each cell subtype (De Kanter et al., 2019); cells were classified by traversing the tree based on similarity to each node. Both Garnett (Pliner et al., 2019) and CellAssign (Zhang et al., 2019) performed cell subtype assignment using user-defined marker gene sets and raw scRNA-seq counts. Garnett trained a classifier using elastic-net regularization, while CellAssign employed a probabilistic model to assign each cell to a subtype.

More recently, deep learning-based methods leveraging representation learning have been widely adopted due to their ability to capture informative features in latent space. ACTINN (Ma and Pellegrini, 2020) used a deep neural network for direct cell type assignment, while scDAE (Choi et al., 2021) employed a denoising autoencoder to extract representative features for classification. sigGCN (Wang et al., 2021b) introduced a graph convolutional network framework that integrates gene expression data with gene interaction networks to improve feature extraction and cell type prediction.

Although existing approaches have improved cell-type annotation accuracy, their performance often relies on the assumption that training and test datasets share similar data distributions. In practice, however, scRNA-seq datasets are frequently generated across different laboratories, sequencing platforms, and experimental protocols, which introduce batch effects and lead to distribution shifts between datasets (Li et al., 2026). As a result, supervised models trained on labeled reference datasets may generalize poorly when applied to new datasets generated under different experimental conditions (Luecken et al., 2022). Conversely, while unsupervised approaches do not require labeled data, they remain sensitive to technical variation and noise, particularly when clustering structures are influenced by batch effects rather than true biological differences (Hrovatin et al., 2025). Despite advances in sequencing technologies and data processing pipelines, batch effects remain a major challenge in large-scale single-cell data integration (Andreatta et al., 2024; Nesari et al., 2026). With the rapid expansion of public single-cell atlases and multi-study datasets, researchers increasingly integrate data generated from diverse experimental sources. Differences in library preparation protocols, sequencing platforms, and sample handling procedures can introduce systematic technical variation that shifts gene expression distributions across datasets (Tzec-Interián et al., 2025). These batch-driven discrepancies may obscure true biological signals and lead to inaccurate clustering, cell-type annotation, and downstream analyses if not properly corrected (Tirosh, 2026). Therefore, effective strategies for mitigating batch effects are essential for reliable integration and analysis of multi-dataset scRNA-seq data.

Batch effect removal methods based on pairwise analyses have been proposed. MNNCorrect (Haghverdi et al., 2018) assumes that the batch effect is orthogonal to the biological manifold and calculates correction vectors in a high-dimensional log-expression space by finding mutual nearest neighbor (MNN) pairs of cells. BBKNN (Polański et al., 2020) is a graph-based algorithm that identifies MNNs in a batch-corrected neighborhood graph. Harmony (Korsunsky et al., 2019) projects cells into a PCA space and clusters similar cells from different batches by calculating a correction factor based on the centroids of each cluster. Seurat 3.0 (Stuart et al., 2019) uses a canonical correlation analysis (Hardoon et al., 2004) approach to find anchor cell pairs between the reference and target datasets.

To address these challenges, domain adaptation techniques have been introduced to reduce distribution discrepancies between datasets. In particular, unsupervised domain adaptation (UDA), which is closely related to transfer learning (Weiss et al., 2016), enables models trained on a labeled source dataset to be adapted to an unlabeled target dataset with a different distribution. By learning domain-invariant representations, UDA aims to mitigate the impact of batch effects while preserving biologically meaningful information across datasets. For UDA, DANN (Ganin et al., 2016) and ADDA (Tzeng et al., 2017) utilize an adversarial objective with a discriminator and a feature extractor. The discriminator distinguishes whether a feature comes from the source or the target domain. In contrast, the feature extractor deceives the discriminator by generating similar source and target feature distributions, thereby reducing the discrepancy between the two domains through domain-invariant features.

UDA was used to alleviate the differences caused by batch effects. iMap (Wang et al., 2021a) is a deep learning-based method that uses an autoencoder and generative adversarial networks. The MNN pairs are used to integrate data across batches. scANVI (Xu et al., 2021) leverages variational inference to integrate multiple datasets with a single generative model correcting batch effect. Specifically, scANVI extends the scVI framework by incorporating semi-supervised learning (SSL) to jointly model cell-type labels and batch variation in a unified probabilistic latent space. This approach enables the transfer of cell-type annotations from labeled datasets to unlabeled datasets while accounting for technical variability across experiments. scNym (Kimmel and Kelley, 2021) presented a semi-supervised adversarial neural network-based framework to effectively learn representations of cell identity that transfer annotations across datasets obtained from different experiments. The model employs adversarial domain adaptation and mixup-based SSL to encourage domain-invariant feature representations between source and target datasets. As a result, scNym improves the generalization of cell-type classifiers when applied to new datasets affected by batch effects. scSemiCluster (Chen et al., 2021) integrates reference and target data for training via structural similarity regularization. In this framework, domain adaptation is achieved by encouraging consistency between the structural relationships of cells across datasets, thereby aligning the latent representations of the source and target domains. This strategy enables the model to leverage labeled reference datasets while preserving the intrinsic structure of unlabeled target datasets. scAdapt (Zhou et al., 2021) is a virtual adversarial domain adaptation network that trains a feature generator utilizing both labeled source and unlabeled target datasets and aligns the centroids of each dataset. 
However, a shared feature generator for two separate domains renders the optimization poorly conditioned (Tzeng et al., 2017). More recently, additional approaches have been proposed to further improve cross-dataset cell-type annotation under batch effects. For example, CellPredX (Liu et al., 2026) introduces a deep learning framework that integrates domain adaptation and deep metric learning to align embeddings across datasets while incorporating an attention mechanism to enhance discriminative representation learning. HiCat (Bi et al., 2025) proposes a semi-supervised annotation framework that performs batch correction using Harmony (Korsunsky et al., 2019) and constructs embeddings by combining multiple low-dimensional representations derived from PCA, UMAP, and clustering results, which are subsequently used by a CatBoost classifier for cell-type prediction. MNN-based methods (Haghverdi et al., 2018; Polański et al., 2020; Korsunsky et al., 2019; Stuart et al., 2019; Wang et al., 2021a) assume that batch effect variation is smaller than biological differences in order to find the mutual nearest neighboring pairs of cells between different domains (Haghverdi et al., 2018). Therefore, when the domains are highly dissimilar, MNN-based methods tend to misidentify the nearest neighbors for cells from the same cell type across batches (Yang et al., 2021; Wang et al., 2022).

Motivated by these challenges, we propose scUDAS, a framework for cell-type prediction in scRNA-seq data based on UDA and SSL. Our aim is to address the distribution shift caused by batch effects when transferring knowledge from labeled reference datasets to unlabeled datasets generated from different experimental conditions. Although various deep learning-based cell-type prediction methods have been proposed, two challenges remain: the poorly conditioned optimization of a shared feature extractor and the assumption that batch-effect variation is smaller than biological differences. Considering these challenges, we constructed separate feature extractors for the source and target domains to retain domain-specific information and facilitate optimization. First, we pretrained our model with a source dataset containing cell type information. Subsequently, UDA was applied to the target dataset via adversarial training (AT) to eliminate batch effects by aligning the target dataset with the domain and class distributions of the source dataset. Finally, scUDAS was retrained through SSL (Berthelot et al., 2019b; Sohn et al., 2020; Zhang et al., 2021), using the source dataset with ground truth cell types and the target dataset with consistency regularization (Bachman et al., 2014; Laine and Aila, 2016; Sajjadi et al., 2016). Experimentally, scUDAS matched the distributions of the source and target datasets and improved the cell-type prediction performance compared with existing deep-learning-based batch correction models.

2. METHODS

2.1. Data collection

We used multiple publicly available scRNA-seq datasets to evaluate the performance and generalizability of scUDAS. First, human peripheral blood mononuclear cell (PBMC) and human pancreas datasets were obtained from the SeuratData package (Stoeckius et al., 2017). For the PBMC data, seven batches generated using different sequencing platforms were included (Table 1): Smart-seq2, CEL-Seq2, 10x Chromium v2, 10x Chromium v3, Seq-Well, Drop-seq, and inDrop. The human pancreas dataset consists of five studies generated using distinct platforms: Baron (inDrop) (Baron et al., 2016), Muraro (CEL-Seq2) (Muraro et al., 2016), Xin (SMARTer) (Xin et al., 2016), Segerstolpe (Smart-seq2) (Segerstolpe et al., 2016), and Lawlor (Fluidigm C1) (Lawlor et al., 2017). These datasets provide diverse batch effects arising from differences in experimental protocols and sequencing technologies.

Table 1.
Datasets Used for scUDAS Evaluation

Dataset	Source	# of source cells	Target	# of target cells	# of cell types
PBMC	Drop-seq	6510	Smart-seq2	526	7
		6510	CEL-Seq2	526	7
		6584	10x Chromium v2	3362	9
		6556	10x Chromium v3	3222	8
		5902	Seq-Well	3727	7
		6584	inDrop	6584	9
Pancreas	Baron	7757	Muraro	1941	7
		5452	Xin	1407	3
		7742	Segerstolpe	1696	6
		7487	Lawlor	580	5
Bone marrow	10x Chromium v3	10949	Smart-seq2	3017	16
Lymph node	10x Chromium v3	9413	Smart-seq2	1657	13
Smaller	10x Chromium v3	1508	Smart-seq2	3016	15
Larger	10x Chromium v3	6032	Smart-seq2	3016	15
Balance	10x Chromium v3	2678	Smart-seq2	2678	15
Imbalance	10x Chromium v3	2678	Smart-seq2	2136	15
Cross-tissue	Blood	4155	Bone marrow	5624	8

To further evaluate model performance under more complex and realistic biological settings, we additionally incorporated bone marrow and lymph node datasets from the Tabula Sapiens atlas (Consortium et al., 2022). Compared with the PBMC and pancreas datasets, these tissues contain larger numbers of cells and more diverse cell-type compositions, thereby providing more challenging evaluation scenarios. For model training, datasets with larger sample sizes—Drop-seq (PBMC), Baron (pancreas), and 10x Chromium v3 (bone marrow and lymph node)—were used as source datasets, while the remaining datasets were treated as target datasets.

Using the bone marrow dataset, we further constructed four experimental scenarios to explicitly model practical complexities commonly encountered in real-world applications. These scenarios include imbalanced sample sizes between source and target datasets (larger, where the source dataset is twice the size of the target, and smaller, where it is half the size), as well as settings with partial label overlap and skewed cell-type distributions. Specifically, we considered a balanced setting, in which each cell type has equal representation in both source and target datasets, and an imbalanced setting, in which cell-type proportions follow a decreasing sequence with a ratio of 0.8. These configurations simulate realistic conditions in which certain cell types may be overrepresented or underrepresented across datasets. In addition, we conducted cross-tissue experiments to assess the robustness and generalizability of scUDAS under substantial biological variation. In this setting, a peripheral blood dataset from the Tabula Sapiens atlas was used as the source, and the bone marrow dataset was used as the target, representing a challenging domain shift across related but distinct tissues. A detailed summary of the datasets is provided in Table 1. To examine the batch effects present in the datasets, UMAP visualizations are provided in Supplementary Data S1; they show that, prior to batch correction, cells primarily cluster according to sequencing platform rather than biological cell type, indicating substantial batch-driven differences among the datasets.

2.2. Overall framework

Our model framework comprises four phases: (1) preprocessing, (2) pretraining, (3) AT, and (4) SSL. The overall framework of scUDAS is shown in Figure 1. Our model consists of feature extractors $F_{s}$ and $F_{t}$ for the source and target, respectively, a classifier $C$ followed by a softmax layer, and a domain discriminator $D$. We utilized a source dataset $D^{s} = \{(x_{i}^{s}, y_{i}^{s})\}_{i = 1}^{n_{s}}$ and target dataset $D^{t} = \{x_{i}^{t}\}_{i = 1}^{n_{t}}$ with different distributions to train our model. The source dataset $x^{s}$ contains $n_{s}$ cells with cell type information $y^{s}$ and the target dataset $x^{t}$ contains $n_{t}$ cells without cell type information. Suppose that the label space is $y^{s} \in \{1, 2, \dots, K\}$, where $K$ is the number of classes in both the source and target datasets.

FIG. 1. Illustration of the proposed framework via adversarial training and semi-supervised learning utilizing single-cell RNA sequencing.

(1) Preprocessing

scRNA-seq datasets were normalized using the Seurat R package (Stuart et al., 2019). Normalization with the log-transformation formula is

$$x_{i, j} = \log \left(1 + \frac{c_{i, j} \times 10{,}000}{m_{i}}\right),$$

where $x_{i, j}$ is the normalized count of the $j$th gene in the $i$th cell, $c_{i, j}$ is the molecular count of the gene, and $m_{i}$ is the sum of all the molecule counts of the $i$th cell. Common genes and cell types in both source and target datasets were extracted. We selected 2000 genes that had high variance to identify genes that were more variable across the cells and reduced the dimensions. The ground-truth cell types for the source datasets were used to train the model.
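The preprocessing step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the Seurat implementation: the function names `log_normalize` and `top_variance_genes` are hypothetical, the toy count matrix is made up, and ranking genes by plain variance is a simplification (Seurat's own variable-feature selection may use a different statistic).

```python
import numpy as np

def log_normalize(counts, scale=10_000):
    """Apply x_ij = log(1 + c_ij * scale / m_i) to a cells-by-genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # m_i: total counts per cell
    totals[totals == 0] = 1.0                   # guard against empty cells
    return np.log1p(counts * scale / totals)

def top_variance_genes(x, n_genes=2000):
    """Return sorted column indices of the n_genes highest-variance genes.

    Assumption: plain per-gene variance is used as the ranking criterion.
    """
    variances = x.var(axis=0)
    n_genes = min(n_genes, x.shape[1])
    return np.sort(np.argsort(variances)[::-1][:n_genes])

# Toy example: 3 cells, 3 genes (values are illustrative only).
counts = np.array([[1, 0, 9], [5, 5, 0], [0, 2, 2]])
x = log_normalize(counts)
genes = top_variance_genes(x, n_genes=2)
x_selected = x[:, genes]
```

In the toy matrix the first cell has ten counts in total, so its first gene is mapped to log(1 + 1 × 10,000 / 10) = log(1001).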

(2) Pretraining

To train the classifier for cell-type prediction, we pretrained a feature extractor for source $F_{s}$ and classifier $C$ using only the source dataset. Because the source dataset contains cell-type information, the classifier loss can be calculated by utilizing ground truth labels $y^{s}$ and the classifier output. The feature extractor for the source $F_{s}$ and classifier $C$ are trained based on the focal loss to deal with class-label imbalance (Lin et al., 2017) as follows:

$$L_{FL} = - \frac{1}{n_{s}} \sum_{j = 1}^{n_{s}} \sum_{i = 1}^{K} y_{i, j}^{s} (1 - \hat{y}_{i, j}^{s})^{\gamma} \log (\hat{y}_{i, j}^{s}),$$

where $\hat{y}^{s}$ is the predicted cell-type probability distribution for the source dataset, $y^{s}$ is the ground truth label, and $\gamma$ is a focusing parameter used to adjust the rate at which samples that are easy to classify with a high predictive probability are down-weighted.
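As a rough numerical illustration of the focal loss, the following NumPy sketch evaluates it on one-hot labels and predicted probabilities. The function name `focal_loss`, the clipping constant `eps`, and the default `gamma=2.0` are assumptions made for this example; the text above does not state the value of the focusing parameter.

```python
import numpy as np

def focal_loss(y_true, y_prob, gamma=2.0, eps=1e-12):
    """Mean focal loss over cells.

    y_true: one-hot labels, shape (n_cells, K).
    y_prob: predicted class probabilities, shape (n_cells, K).
    gamma:  focusing parameter (2.0 is an assumed default, not from the paper).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)  # avoid log(0)
    per_cell = -(y_true * (1.0 - y_prob) ** gamma * np.log(y_prob)).sum(axis=1)
    return per_cell.mean()

y_true = np.array([[1, 0], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])

loss_focal = focal_loss(y_true, y_prob, gamma=2.0)
loss_gamma0 = focal_loss(y_true, y_prob, gamma=0.0)  # reduces to cross-entropy
```

With `gamma=0` the modulating factor equals one and the expression reduces to the ordinary cross-entropy; with `gamma>0` confidently classified cells contribute less, which is the down-weighting described above.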

(3) Adversarial training

After pretraining, the feature extractor for the target $F_{t}$ and domain discriminator $D$ are trained to match the source and target distributions. We initialized the feature extractor for target $F_{t}$ using the parameters of the feature extractor for source $F_{s}$. To learn domain-invariant feature representation, AT was introduced for domain alignment. The two components of our model competed with each other. The feature extractor for the target $F_{t}$ is trained to transfer the distribution of the target to the source such that discriminator $D$ cannot distinguish between the source and target distributions. Meanwhile, the discriminator $D$ is trained to distinguish between the source and the target distributions. $L_{DA}$ loss can be expressed as follows:

$$\min_{F_{t}} \max_{D} L_{DA} = \frac{1}{n_{s}} \sum_{x^{s} \in D^{s}} \log (D (F_{s} (x^{s}))) + \frac{1}{n_{t}} \sum_{x^{t} \in D^{t}} \log (1 - D (F_{t} (x^{t}))).$$

The feature extractor for the target $F_{t}$ attempts to minimize the $L_{DA}$ loss to fool the domain discriminator $D$, whereas the domain discriminator $D$ attempts to maximize $L_{DA}$ loss to distinguish the distribution of the source and target datasets.
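To make the direction of the min-max game concrete, the sketch below evaluates the domain-alignment objective from discriminator outputs alone. It assumes the discriminator returns the probability that a feature comes from the source domain; the function name `domain_adversarial_objective` and the example probabilities are hypothetical, and no network is trained here.

```python
import numpy as np

def domain_adversarial_objective(d_source, d_target, eps=1e-12):
    """L_DA = mean log D(F_s(x^s)) + mean log(1 - D(F_t(x^t))).

    d_source, d_target: discriminator outputs in (0, 1), read as the
    probability that a feature comes from the source domain (assumption).
    """
    d_source = np.clip(np.asarray(d_source, dtype=float), eps, 1.0 - eps)
    d_target = np.clip(np.asarray(d_target, dtype=float), eps, 1.0 - eps)
    return np.log(d_source).mean() + np.log(1.0 - d_target).mean()

# A discriminator that separates the domains well: objective close to 0.
confident = domain_adversarial_objective([0.99, 0.98], [0.01, 0.02])

# A discriminator that cannot tell the domains apart: objective = 2 * log(0.5).
confused = domain_adversarial_objective([0.5, 0.5], [0.5, 0.5])
```

The discriminator pushes the objective toward its maximum of zero, while the target feature extractor lowers it by making target features look like source features, that is, by driving the discriminator output on target cells toward one.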

Although we matched the domain distributions for source and target datasets, the class distributions for each cell type were not considered. To address this problem, we define the constraint loss for the class alignment, which considers the distribution of each class. Because the target dataset does not contain ground-truth labels, pseudo-labels are first generated using the classifier trained during the source pretraining stage. Specifically, for each target cell $x^{t}$, the pseudo-label is assigned as the class with the highest predicted probability from the classifier $C$. The centroid of class $k$ in the target domain $C_{k}^{t}$ is then computed using the pseudo-labeled target samples as:

$$C_{k}^{t} = \frac{1}{n_{k, t}} \sum_{i = 1}^{n_{k, t}} F_{t} (x_{k, i}^{t}),$$

where $n_{k, t}$ denotes the number of target cells assigned to class $k$, and $F_{t} (x_{k, i}^{t})$ represents the latent feature embedding of the $i$-th target cell in class $k$. In the proposed method, we do not apply an explicit probability threshold when generating pseudo-labels for centroid estimation. Instead, all target samples contribute to the centroid calculation. Because each centroid is computed by averaging the embeddings of multiple cells within the predicted class, the influence of individual noisy pseudo-labels can be reduced.
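A small NumPy sketch of pseudo-labeling and centroid estimation is given below. The helper names `pseudo_labels` and `class_centroids`, the toy probabilities, and the two-dimensional embeddings are assumptions for illustration; marking classes with no assigned cells as NaN is a choice made here, not something specified in the text.

```python
import numpy as np

def pseudo_labels(class_probs):
    """Assign each target cell the class with the highest predicted probability."""
    return np.argmax(np.asarray(class_probs), axis=1)

def class_centroids(embeddings, labels, n_classes):
    """Mean embedding per class; rows for classes with no cells are NaN."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    centroids = np.full((n_classes, embeddings.shape[1]), np.nan)
    for k in range(n_classes):
        members = embeddings[labels == k]
        if len(members) > 0:
            centroids[k] = members.mean(axis=0)
    return centroids

# Toy example: 4 target cells, 3 classes, 2-dimensional embeddings.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.5, 0.3]])
embeddings = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]])

labels = pseudo_labels(probs)              # no probability threshold is applied
centroids = class_centroids(embeddings, labels, n_classes=3)
```

Every cell contributes to exactly one centroid because no confidence threshold is applied, which matches the description above; the third class receives no cells in this toy example and therefore has no centroid.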

The class alignment loss is defined as follows:

$$L_{CA} = \frac{1}{K} \sum_{k = 1}^{K} \| C_{k}^{s} - C_{k}^{t} \|_{2} + \frac{1}{K} \sum_{k = 1}^{K} \frac{1}{n_{k, t}} \sum_{i = 1}^{n_{k, t}} \| C_{k}^{t} - F_{t} (x_{k, i}^{t}) \|_{2},$$

where $C_{k}^{s}$ and $C_{k}^{t}$ are the centroids of cell type $k$ in the source $s$ and target $t$, respectively, and $n_{k, t}$ is the number of cells of cell type $k$ in the target dataset with $K$ cell types. The first term is used for class alignment to match the cell-type distributions of the source and target. The second term measures the distance between the centroid of each cell type and the cells of that type in the target dataset, which reduces the intra-class distance and improves cell-type compactness. In practical scenarios, it is possible for certain cell types present in the source dataset to be absent from the target dataset. In such cases, no target samples will be assigned to the corresponding pseudo-label class $k$, and therefore the class alignment loss $L_{CA}$ is computed only for classes that have at least one assigned target sample during pseudo-label generation. The combined loss function for AT is defined as follows:

$$L_{AT} = L_{DA} + \lambda L_{CA},$$

where $\lambda$ is a hyperparameter that controls the balance between the domain and class distribution alignment loss.
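The class alignment loss and its combination with the domain loss can be sketched as follows. The function name `class_alignment_loss` and the toy values are hypothetical. Classes without any pseudo-labeled target cell are skipped, as described above; dividing by the total number of classes K rather than by the number of non-empty classes is an assumption, since the text does not specify the normalization in that case.

```python
import numpy as np

def class_alignment_loss(src_centroids, tgt_embeddings, tgt_labels):
    """Centroid-matching term plus intra-class compactness term, averaged over K."""
    src_centroids = np.asarray(src_centroids, dtype=float)
    tgt_embeddings = np.asarray(tgt_embeddings, dtype=float)
    tgt_labels = np.asarray(tgt_labels)
    n_classes = src_centroids.shape[0]
    between, within = 0.0, 0.0
    for k in range(n_classes):
        members = tgt_embeddings[tgt_labels == k]
        if len(members) == 0:
            continue  # class absent from the target pseudo-labels: skipped
        tgt_centroid = members.mean(axis=0)
        between += np.linalg.norm(src_centroids[k] - tgt_centroid)
        within += np.linalg.norm(members - tgt_centroid, axis=1).mean()
    return (between + within) / n_classes  # assumption: normalize by K

src_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
tgt_embeddings = np.array([[1.0, 0.0], [3.0, 0.0], [10.0, 12.0]])
tgt_labels = np.array([0, 0, 1])

l_ca = class_alignment_loss(src_centroids, tgt_embeddings, tgt_labels)

lam = 0.1      # regularization weight for the class alignment term
l_da = -1.0    # placeholder value for the domain loss, for illustration only
l_at = l_da + lam * l_ca
```

In the toy example, each class contributes a centroid distance of 2, and the first class adds a mean intra-class distance of 1, giving (4 + 1) / 2 = 2.5.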

(4) Semi-supervised learning

After matching the distributions of the source and target datasets, we retrained the classifier using the source and target datasets to generalize and improve the performance of scUDAS. Training on the source dataset used ground truth cell types, and training on the target dataset used consistency regularization. Because pseudo-labels may contain noise, we incorporate an SSL stage with consistency regularization to improve the robustness of the model. Consistency regularization assumes that the classifier output is invariant to input perturbations; it encourages the classifier to produce stable predictions under different perturbations of the same input sample, which helps reduce the influence of incorrect pseudo-labels. Therefore, we applied Gaussian noise as a data augmentation strategy to generate perturbed versions of the target samples. Gaussian noise is added to the normalized input features, where the injected noise acts as a small perturbation on the normalized feature space rather than altering the underlying raw count structure of the data. This perturbation encourages the model to produce consistent predictions for slightly different representations of the same cell, thereby improving model robustness and stabilizing pseudo-label predictions. Similar stochastic perturbation strategies have been explored in deep learning frameworks for single-cell transcriptomic analysis to improve representation learning and generalization performance (Han et al., 2022; Xu et al., 2026). Specifically, Gaussian noise is added to the target input with different noise intensities to generate weak and strong augmentations. The noise rates were set to 0.2 for weak augmentation and 0.8 for strong augmentation.
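The weak and strong augmentations can be illustrated with the following NumPy sketch. It assumes that the noise rate is used as the standard deviation of zero-mean Gaussian noise; the text gives the rates 0.2 and 0.8 but does not spell out this interpretation. The function name `gaussian_augment` and the random stand-in data are hypothetical.

```python
import numpy as np

def gaussian_augment(x, noise_rate, rng):
    """Add zero-mean Gaussian noise to normalized features.

    Assumption: noise_rate is interpreted as the noise standard deviation.
    """
    return x + rng.normal(loc=0.0, scale=noise_rate, size=x.shape)

rng = np.random.default_rng(0)
x_target = rng.random((500, 20))   # stand-in for normalized target expression

x_weak = gaussian_augment(x_target, 0.2, rng)    # weak augmentation
x_strong = gaussian_augment(x_target, 0.8, rng)  # strong augmentation
```

Both views are derived from the same normalized cells, so a classifier that is robust to the perturbation should assign them similar class probabilities.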

The loss for the source dataset $L_{CE}$ uses cross-entropy loss. The loss for the target dataset $L_{CL}$ uses the consistency loss, which is computed as the $l_{2}$ norm of the difference between the model-predicted probability distributions for the weakly and strongly augmented target data. The semi-supervised loss $L_{SSL}$ is a combination of target and source losses, as follows:

$$L_{CE} = - \frac{1}{n_{s}} \sum_{j = 1}^{n_{s}} \sum_{i = 1}^{K} y_{i, j}^{s} \log (\hat{y}_{i, j}^{s}),$$

$$L_{CL} = \frac{1}{n_{t}} \sum_{j = 1}^{n_{t}} \| \hat{y}_{j}^{\sigma_{1}} - \hat{y}_{j}^{\sigma_{2}} \|_{2},$$

$$L_{SSL} = L_{CE} + \alpha (t) L_{CL},$$

where $\sigma$ is the noise rate with $\sigma_{1} = 0.2$ and $\sigma_{2} = 0.8$, and $\hat{y}^{\sigma_{1}}$ and $\hat{y}^{\sigma_{2}}$ are the model-predicted probability distributions for the target dataset augmented by each noise rate. Here, $\alpha (t)$ is a time-dependent weighting function that controls the balance between the supervised loss on the source dataset and the consistency loss on the target dataset (Lee et al., 2013; Grandvalet and Bengio, 2004):

$$\alpha (t) = \begin{cases} 0, & t < T_{1} \\ \frac{t - T_{1}}{T_{2} - T_{1}} \alpha_{f}, & T_{1} \leq t < T_{2} \\ \alpha_{f}, & T_{2} \leq t \end{cases}$$

where $t$ is the current epoch with $\alpha_{f} = 1$, $T_{1} = 0$, and $T_{2} = 1000$. The value of $\alpha (t)$ gradually increases during training, allowing the model to rely more on reliable source supervision in the early stages while progressively incorporating information from the target dataset as training proceeds. This scheduling strategy helps reduce the risk of incorrect pseudo-labels creating a feedback loop during the early phases of training.
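The ramp-up schedule and the semi-supervised objective can be written compactly as below. This is a sketch that works on precomputed loss values and probability arrays; the function names `alpha`, `consistency_loss`, and `ssl_loss`, as well as the example numbers, are assumptions for illustration.

```python
import numpy as np

def alpha(t, alpha_f=1.0, t1=0, t2=1000):
    """Piecewise-linear ramp-up weight for the consistency loss at epoch t."""
    if t < t1:
        return 0.0
    if t < t2:
        return alpha_f * (t - t1) / (t2 - t1)
    return alpha_f

def consistency_loss(p_weak, p_strong):
    """Mean l2 distance between predictions for weak and strong augmentations."""
    diff = np.asarray(p_weak, dtype=float) - np.asarray(p_strong, dtype=float)
    return np.linalg.norm(diff, axis=1).mean()

def ssl_loss(l_ce, l_cl, t):
    """L_SSL = L_CE + alpha(t) * L_CL."""
    return l_ce + alpha(t) * l_cl

# Toy predictions for two target cells under weak and strong augmentation.
p_weak = np.array([[1.0, 0.0], [0.5, 0.5]])
p_strong = np.array([[0.0, 1.0], [0.5, 0.5]])

l_cl = consistency_loss(p_weak, p_strong)
total_mid = ssl_loss(l_ce=0.3, l_cl=0.4, t=500)   # alpha(500) = 0.5
```

With the stated settings the consistency term has zero weight at epoch 0, half weight at epoch 500, and full weight from epoch 1000 onward.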

2.3. Hyperparameter setting

In scUDAS, the source and target feature extractors and the domain discriminator each consist of two fully connected layers. The feature extractors use 1024 and 512 hidden units, while the discriminator uses 128 and 64 hidden units. The classifier comprises three fully connected layers with 256, 128, and 64 hidden units, followed by a softmax output layer. Hyperparameters, including the number of layers, the number of hidden units for each module, and the regularization parameter $λ$, were optimized using the PBMC dataset, with Drop-seq as the source and inDrop as the target. Each hyperparameter configuration was evaluated through five repeated experiments, and the configuration achieving the highest average accuracy and F1-score was selected. A summary of the hyperparameter optimization results is provided in Supplementary Data S2. Model optimization was performed using the Adam optimizer (Kingma and Ba, 2014). The learning rate and maximum number of training epochs were set to 1e-4 and 1000 for pretraining, and 1e-5 and 3000 for the SSL stage. During AT, the learning rate for the target feature extractor was set to 1e-6, while the domain discriminator used a learning rate of 1e-5 for the first 1000 epochs and 1e-6 thereafter. The regularization parameter $λ$ was fixed at 0.1. Additionally, dropout and batch normalization were applied to prevent overfitting, and an early stopping strategy was adopted in which training was terminated if the training loss did not improve for 100 consecutive epochs.
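The early stopping rule can be sketched with a small helper class. The class name `EarlyStopping` is hypothetical, and treating "did not improve" as "was not strictly lower than the best loss seen so far" is an assumption. The example uses a patience of 3 instead of 100 only to keep the trace short.

```python
class EarlyStopping:
    """Stop when the training loss has not improved for `patience` epochs."""

    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best:       # assumption: improvement means strictly lower
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience

# Illustrative loss trace with a short patience.
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.8, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In this trace the best loss of 0.8 is reached at epoch 1, the next three epochs do not improve on it, and training stops at epoch 4 before the final value is seen.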

3. RESULTS

3.1. Performance evaluation of scUDAS with competing methods

We evaluated the performance of scUDAS in predicting cell types across multiple target datasets using publicly available scRNA-seq data generated from diverse sequencing platforms, as described in the “Data collection” section. To assess its effectiveness in mitigating batch effects, we compared scUDAS with eight representative cell-type prediction methods: CellPredX (Liu et al., 2026), HiCat (Bi et al., 2025), scAdapt (Zhou et al., 2021), scSemiCluster (Chen et al., 2021), scNym (Kimmel and Kelley, 2021), scANVI (Xu et al., 2021), Seurat (Butler et al., 2018), and scmap (Kiselev et al., 2018). All competing methods were evaluated using their default or recommended hyperparameter settings, while scUDAS was applied with consistent hyperparameter values across all datasets. Model performance was assessed using accuracy and weighted F1-score.
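For reference, the two evaluation metrics can be computed as in the NumPy sketch below. The function names and toy labels are assumptions for illustration; the weighted F1-score here averages per-class F1 values weighted by the number of true cells in each class, which is the usual definition of the weighted average.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted type matches the true type."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true support."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for k in np.unique(y_true):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        total += f1 * np.sum(y_true == k)
    return float(total / len(y_true))

# Toy example: three cells of type 0 and one cell of type 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

acc = accuracy(y_true, y_pred)
f1_weighted = weighted_f1(y_true, y_pred)
```

In the toy example, type 0 has an F1 of 0.8 with three cells and type 1 has an F1 of 2/3 with one cell, so the weighted F1 is (0.8 × 3 + 2/3) / 4.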

We first evaluated scUDAS and the competing methods on four tissue datasets: PBMC, pancreas, bone marrow, and lymph node. As shown in Table 2 and Supplementary Data S3, S4, scUDAS outperformed the other batch correction and cell-type prediction models on most of these datasets, achieving the highest accuracy and F1-score in most cases. In terms of average F1-score, the exceptions in Table 2 were the 10x Chromium v2 PBMC target dataset, where scUDAS achieved a competitive 93.24% and ranked second to scAdapt (95.38%), and the bone marrow target dataset, where HiCat (75.34%) exceeded scUDAS (71.07%).

Table 2.
Average F1-Scores (%) for Cell-Type Classification Performance of scUDAS and Competing Methods over Five Repeated Experiments

Dataset	Target	scUDAS	CellPredX	HiCat	scANVI	scAdapt	scNym	scSemiCluster	Seurat	scmap
PBMC	Smart-seq2	88.68	87.65	80.23	82.75	85.61	86.78	79.58	79.02	52.68
	CEL-Seq2	87.62	83.12	76.98	82.93	85.95	82.10	78.34	76.56	69.18
	10x Chromium v2	93.24	92.41	92.13	92.57	95.38	91.63	84.88	91.68	68.15
	10x Chromium v3	92.79	89.16	80.34	78.72	92.23	83.53	76.70	81.53	74.20
	Seq-Well	84.74	83.19	80.03	74.63	73.20	84.34	70.11	76.15	54.14
	inDrop	81.55	74.97	73.95	71.51	76.03	81.49	61.61	74.83	44.29
Pancreas	Muraro	95.94	94.99	85.73	93.48	95.80	82.67	89.38	82.33	93.00
	Xin	99.86	90.07	76.84	99.79	99.65	66.63	79.76	99.57	99.86
	Segerstolpe	98.97	94.79	95.47	96.33	98.94	90.98	93.03	84.60	97.10
	Lawlor	98.96	95.46	96.90	97.89	95.68	97.96	85.93	97.04	98.96
Bone marrow	Smart-seq2	71.07	64.41	75.34	70.81	69.34	8.76	62.41	62.09	59.77
Lymph node	Smart-seq2	39.61	35.88	32.83	35.31	38.83	25.80	34.93	32.36	37.81
Balance	Smart-seq2	72.98	66.65	72.00	67.78	74.97	3.93	69.11	69.93	66.32
Imbalance	Smart-seq2	77.09	65.18	75.25	67.31	76.69	3.32	71.69	68.13	65.86
Larger	Smart-seq2	77.50	71.20	76.95	71.21	76.86	10.21	71.98	73.94	68.11
Smaller	Smart-seq2	77.73	69.72	76.56	68.02	76.89	10.56	70.08	71.46	68.36
Cross-tissue	Bone marrow	95.71	75.29	91.79	78.24	93.69	10.69	95.35	93.16	91.60

Bold values indicate the best performance among all methods.

Notably, several competing methods, including CellPredX, HiCat, and scAdapt, exhibited reduced discriminative ability for closely related cell types when the cell-type distributions of the source and target datasets differed substantially, as in the inDrop and Seq-Well target datasets. For example, the Drop-seq source dataset contained 357 CD14+ monocytes, whereas the corresponding target datasets contained substantially more (2,038 in inDrop and 1,255 in Seq-Well). As a result, the limited number of CD14+ monocytes in the source dataset constrained the ability of several models to correctly distinguish these cells in the target datasets. To further investigate these effects, we generated Sankey plots illustrating the cell-type assignment results of scUDAS and competing methods. These visualizations show how each method assigns target cells across closely related cell types. In the inDrop dataset, scUDAS accurately distinguished CD14+ and CD16+ monocytes, achieving average cell-type–specific accuracies of 88.04% and 93.32%, respectively, whereas most competing methods misclassified approximately half of the CD14+ monocytes as CD16+ monocytes (Fig. 2). A similar pattern was observed in the Seq-Well target dataset, where competing methods incorrectly reassigned a substantial fraction of CD14+ monocytes to dendritic cells and CD4+ T cells, while scUDAS largely preserved the correct cell-type identities (Supplementary Data S5).

FIG. 2.

Sankey plots comparing scUDAS and other methods for the inDrop target dataset.
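The quantities behind a Sankey plot of this kind, and the cell-type–specific accuracies quoted above, can be derived from the true and predicted labels alone. The sketch below is a minimal illustration; the function names are hypothetical, the toy labels are not data from the study, and cell-type–specific accuracy is assumed to mean the fraction of cells of a given true type that receive the correct label.

```python
from collections import Counter


def assignment_flows(y_true, y_pred):
    """Count cells flowing from each true type to each predicted type.

    Each (true, predicted) pair corresponds to one ribbon of a Sankey plot.
    """
    return Counter(zip(y_true, y_pred))


def per_type_accuracy(y_true, y_pred):
    """Fraction of cells of each true type that keep their label."""
    flows = assignment_flows(y_true, y_pred)
    totals = Counter(y_true)
    return {t: flows[(t, t)] / n for t, n in totals.items()}


# Toy example (illustrative only): half of the CD14+ monocytes are
# reassigned to CD16+ monocytes, the failure mode described in the text.
y_true = ["CD14+ monocyte"] * 4 + ["CD16+ monocyte"] * 2
y_pred = ["CD14+ monocyte"] * 2 + ["CD16+ monocyte"] * 4
print(assignment_flows(y_true, y_pred))
print(per_type_accuracy(y_true, y_pred))
```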

We further evaluated scUDAS under realistic and challenging experimental scenarios using the bone marrow dataset, explicitly modeling practical complexities commonly encountered in real-world applications. These scenarios included imbalanced sample sizes between source and target datasets, partial label overlap, and skewed cell-type distributions. In addition, to assess robustness under substantial biological variation, we conducted a cross-tissue validation experiment in which the source and target datasets were derived from different but biologically related tissues. This setting represents a challenging domain shift, as PBMCs originate from hematopoietic stem cells in the bone marrow but exhibit distinct transcriptional profiles due to tissue-specific differentiation and physiological conditions (see Data collection section for details).

As summarized in Table 2 and Supplementary Data S3 and S4, scUDAS achieved the best overall prediction performance across these scenarios, demonstrating strong generalizability across domains with pronounced biological and distributional differences. In the scenarios with imbalanced sample sizes (Imbalance, Larger, and Smaller), scUDAS achieved the highest average F1-score in all three settings (77.09%, 77.50%, and 77.73%, respectively), whereas competing methods exhibited greater performance variability. In the cross-tissue experiment, scUDAS again achieved the highest average F1-score (95.71%). In contrast, scNym showed consistently lower performance across these challenging scenarios (average F1-scores between 3.32% and 10.69% in Table 2), failing to correctly discriminate several major cell types (including natural killer cells, monocytes, naïve B cells, memory B cells, and erythrocytes) and instead misassigning a large fraction of target cells to the erythrocyte class. These results highlight the inherent difficulty of cross-tissue domain adaptation and further underscore the robustness and discriminative capability of scUDAS under such conditions.

3.2. Visualization of representation space

To intuitively assess the effectiveness of UDA, we used UMAP (McInnes et al., 2018) plots to visualize the representation spaces of the batch correction methods. Figure 3 and Supplementary Data S6 show the representation spaces of raw, scUDAS, CellPredX, HiCat, scAdapt, scSemiCluster, scNym, scANVI, and Seurat. Raw denotes the original data, in which the batch effect between the Drop-seq and Smart-seq2 datasets was not removed; here, each dataset clustered separately and the cell types were mixed. scUDAS and scAdapt reduced the differences in distribution between datasets, and the cell types were clearly separated. CellPredX also showed reasonable cell-type separation but still exhibited noticeable batch-specific clustering patterns. HiCat showed stronger batch mixing across datasets, indicating effective batch correction; however, its representation space did not clearly separate several cell types. Although scNym and scSemiCluster showed some degree of cell-type clustering, they were less effective in mitigating batch effects, as batch-specific distinctions remained prominent. For example, CD14+ and CD16+ monocytes were too close to each other on Smart-seq2 and far from the same cell types on Drop-seq. In contrast, scUDAS trains the features of the target to match the features of the source and generates a classifier with clear boundaries. scANVI and Seurat also reduced batch-driven distinctions; however, scANVI exhibited multiple sub-clusters within the same cell type, indicating fragmented representations, while Seurat produced more dispersed and scattered embeddings, resulting in less compact cell-type clusters (Supplementary Data S6). Together, these observations demonstrate the ability of scUDAS to effectively remove the batch effect and align the distributions of the same cell types between different sequencing techniques.

FIG. 3.

UMAP visualizations colored by batch (top row) and cell type (bottom row) for the Drop-seq and Smart-seq2 datasets, comparing scUDAS with other batch correction models.

3.3. Stability of scUDAS according to sample size

We assessed the ability of scUDAS to accurately predict cell types when the available source data were limited. We evaluated the model using training sets comprising 10%, 15%, 20%, 25%, and 30% of the source dataset. For comparison, we also included a baseline model consisting of a classifier trained solely on the source dataset using Eq. (2), without AT or SSL. The results of training on Drop-seq as the source dataset and applying the model to Smart-seq2 and Seq-Well as the target datasets are shown in Figure 4. CellPredX, scUDAS, and scAdapt showed stable performance across sample sizes, whereas the performance of HiCat, scSemiCluster, and scANVI varied with the sample size. The accuracy of scNym and the baseline model depended on the number of source samples, increasing as more source samples became available. In Smart-seq2, scUDAS was not significantly affected by the sample size, maintaining an accuracy of at least 88.4% at every sample size, including 10%.

FIG. 4.

Accuracy variation with increasing source sample size in the target datasets of Smart-seq2 and Seq-Well.

In Seq-Well, scUDAS performed well, achieving an accuracy of over 83.85% across all sample sizes tested. The accuracy of the baseline model increased from 77.43% to 81.3% as the sample size increased. Moreover, it outperformed other batch correction models when the sample size was >25%. Compared with the baseline model, applying AT and SSL further improved the accuracy of scUDAS. These results indicate that AT and SSL substantially enhance model robustness under limited labeled data.
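The source-subsampling protocol in this section can be sketched as follows. This is a minimal illustration that assumes simple random sampling without replacement and a fixed seed; the text does not state whether sampling was stratified by cell type, and `subsample_source` is a hypothetical helper.

```python
import random


def subsample_source(indices, ratio, seed=0):
    """Draw a random subset of source-cell indices of the given ratio.

    Assumption: uniform sampling without replacement (not stratified).
    """
    indices = list(indices)
    rng = random.Random(seed)
    k = max(1, round(len(indices) * ratio))
    return sorted(rng.sample(indices, k))


# Ratios evaluated in the text, applied to a hypothetical 1,000-cell source.
for ratio in (0.10, 0.15, 0.20, 0.25, 0.30):
    subset = subsample_source(range(1000), ratio)
    print(ratio, len(subset))
```

With uniform sampling, rare cell types may be absent from the smallest subsets; a stratified draw would be the natural alternative if every cell type must remain represented.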

3.4. Ablation study

To assess the contributions of AT and SSL, we conducted a series of ablation experiments in which individual components of scUDAS were removed. The results are summarized in Table 3. The baseline model corresponds to a pretrained classifier trained solely on the source dataset using Eq. (2) and directly applied to the target dataset without any domain adaptation. In Table 3, “w/o SSL” denotes the removal of the SSL component in Eq. (9) while retaining AT; “w/o AT” indicates the exclusion of AT in Eq. (6) while keeping SSL; and “w/o CA” refers to the removal of the constraint loss in Eq. (5), which is used for pseudo-label–based class alignment during AT.

Table 3.
Ablation Study Results Showing the Average F1-Score with 95% Confidence Interval of scUDAS and Its Variants

Dataset	Target	Baseline	w/o SSL	w/o AT	w/o CA	scUDAS
PBMC	Smart-seq2	77.67 ± 1.23	88.00 ± 0.27	88.56 ± 0.56	88.11 ± 0.34	88.68 ± 0.23
	CEL-Seq2	76.91 ± 0.77	86.87 ± 0.39	85.48 ± 0.98	86.87 ± 0.73	87.62 ± 0.45
	10x Chromium v2	92.47 ± 0.23	93.22 ± 0.36	93.21 ± 0.48	92.85 ± 0.17	93.24 ± 0.65
	10x Chromium v3	83.39 ± 1.25	90.91 ± 0.46	89.98 ± 0.35	89.64 ± 0.15	92.79 ± 0.27
	Seq-Well	82.84 ± 0.53	82.72 ± 0.57	84.26 ± 0.28	82.67 ± 0.36	84.74 ± 0.10
	inDrop	73.34 ± 1.65	77.21 ± 0.82	80.52 ± 0.17	76.88 ± 0.90	81.55 ± 0.26
Pancreas	Muraro	85.39 ± 0.63	95.37 ± 0.18	87.67 ± 0.18	92.32 ± 0.45	95.94 ± 0.07
	Xin	99.27 ± 0.14	99.85 ± 0.03	99.90 ± 0.03	99.86 ± 0.04	99.86 ± 0.03
	Segerstolpe	92.28 ± 0.44	98.25 ± 0.54	94.22 ± 0.73	95.92 ± 0.45	98.97 ± 0.52
	Lawlor	97.82 ± 0.17	97.93 ± 0.17	98.06 ± 0.19	97.96 ± 0.11	98.96 ± 0.04
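The "mean ± 95% confidence interval" entries in Table 3 summarize five repeated runs. One common way to obtain such an interval is shown below as a minimal sketch: it assumes a two-sided Student-t interval with 4 degrees of freedom, which is an assumption rather than the procedure documented for this study, and the scores are toy values.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776


def mean_ci95_five_runs(scores):
    """Return (mean, half-width) of a 95% t-interval for exactly five runs."""
    if len(scores) != 5:
        raise ValueError("this sketch hard-codes the critical value for n = 5")
    half_width = T_CRIT_95_DF4 * stdev(scores) / sqrt(5)
    return mean(scores), half_width


# Toy F1-scores (illustrative only, not values from the paper).
m, h = mean_ci95_five_runs([80.0, 81.0, 82.0, 83.0, 84.0])
print(f"{m:.2f} ± {h:.2f}")
```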

scUDAS, which combines AT and SSL, obtained the highest average accuracy and F1-score on most datasets, and the contributions of AT and SSL differed depending on the dataset. From the results (Table 3, Supplementary Data S7), we found that SSL was particularly effective in inDrop, which has a large number of samples: the average F1-score was 80.52% with SSL alone (w/o AT), compared with 77.21% with AT alone (w/o SSL). This is a reasonable result because deep learning models tend to improve when more data are used during training. Therefore, SSL, which trains the classifier using both the source and target datasets, contributes substantially to performance. A comparison of these two components confirmed that both AT and SSL play crucial roles in mitigating batch effects and enhancing model performance. As expected, the baseline model, which was trained only on the source dataset, yielded the lowest performance on most target datasets. Even without the constraint loss, the model significantly outperformed the baseline across most target datasets. For example, the average F1-score increased from 77.67% to 88.11% on Smart-seq2 and from 76.91% to 86.87% on CEL-Seq2. Incorporating the constraint loss further increased the F1-score slightly, to 88.68% and 87.62% on Smart-seq2 and CEL-Seq2, respectively. These results indicate that while class alignment during AT contributes to performance gains, its effect is incremental rather than dominant.

To further examine the choice of latent centroid-based alignment (LCA) over alternative alignment strategies, we conducted additional experiments comparing LCA with a class-conditional maximum mean discrepancy (MMD)–based alignment variant. Specifically, we implemented a variant of scUDAS in which the LCA term in Eq. (5) was replaced by a class-wise MMD objective. Rather than aligning class centroids, this variant aligns the source and target feature distributions for each class by minimizing the discrepancy between their kernel mean embeddings using an RBF kernel, thereby enforcing distributional similarity through all pairwise relationships. We evaluated the variant on the PBMC datasets, repeating each experiment five times and reporting average accuracy and F1-scores. As summarized in Table 4, the LCA-based scUDAS achieved higher cell-type identification performance than the MMD-based variant on most target datasets. These results indicate that explicit centroid alignment provides stronger preservation of class structure under batch shifts than implicit distributional alignment via MMD, which does not explicitly enforce intra-class compactness or inter-class separation. Moreover, we observed that without centroid compactness, the MMD-based alignment exhibited reduced discriminative ability, particularly for closely related cell types. For example, in the inDrop (target) experiment, the average cell-type–specific accuracies for CD14+ and CD16+ monocytes were 88.04% and 93.32%, respectively, using LCA-based scUDAS, whereas the MMD-based variant achieved 81.81% and 90.26%. In summary, while MMD-based alignment provides a mechanism for matching class-conditional distributions, our results demonstrate that explicit centroid alignment via LCA more effectively preserves discriminative class structure and yields more reliable cell-type identification under batch variability.
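The difference between the two alignment objectives can be illustrated with a small sketch. It is not the implementation used in this study: the squared Euclidean distance between class centroids is an assumed stand-in for the LCA term of Eq. (5), the MMD uses the standard biased estimator with an RBF kernel of assumed bandwidth `gamma`, and target classes would in practice be defined by pseudo-labels.

```python
from math import exp


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rbf(a, b, gamma=1.0):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return exp(-gamma * sq_dist(a, b))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples (all pairs used)."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy


def centroid(points):
    """Coordinate-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def centroid_gap(xs, ys):
    """Squared distance between the centroids of two samples."""
    return sq_dist(centroid(xs), centroid(ys))


def classwise(loss, src, tgt):
    """Average a two-sample loss over the classes present in both domains."""
    shared = sorted(set(src) & set(tgt))
    return sum(loss(src[c], tgt[c]) for c in shared) / len(shared)


# Toy features: same centroid, different spread (illustrative only).
source = {"CD14+": [[-1.0, 0.0], [1.0, 0.0]]}
target = {"CD14+": [[-2.0, 0.0], [2.0, 0.0]]}
print(classwise(centroid_gap, source, target))  # centroids coincide
print(classwise(mmd2, source, target))          # distributions still differ
```

In the toy example the two classes share a centroid but not a distribution, so the centroid term vanishes while the MMD term does not; this is the sense in which MMD matches all pairwise relationships rather than class centers.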

Table 4.

Average Cell-Type Classification Performance (with 95% Confidence Intervals) on the PBMC Dataset for scUDAS and Its MMD-Based Variant

	scUDAS		scUDAS (MMD)
Target	Accuracy	F1-score	Accuracy	F1-score
Smart-seq2	88.63 ± 0.25	88.68 ± 0.23	86.12 ± 0.13	86.50 ± 0.15
CEL-Seq2	87.41 ± 0.48	87.62 ± 0.45	85.58 ± 0.23	85.60 ± 0.21
10x Chromium v2	93.92 ± 0.27	93.24 ± 0.65	93.32 ± 0.21	93.06 ± 0.24
10x Chromium v3	92.77 ± 0.26	92.79 ± 0.27	92.79 ± 0.21	92.52 ± 0.22
Seq-Well	84.95 ± 0.16	84.74 ± 0.10	82.71 ± 0.12	82.48 ± 0.15
inDrop	83.49 ± 0.33	81.55 ± 0.26	76.34 ± 0.23	76.68 ± 0.21

3.5. Computational efficiency and scalability

To evaluate the computational efficiency and scalability of scUDAS on larger datasets, we measured both runtime and average memory consumption using a large-scale bone marrow dataset. Each experiment was repeated five times, and average values were reported. scUDAS was compared with representative deep learning–based methods under the same experimental conditions. For all competing methods, we followed the default or recommended configurations provided in their original implementations. We note, however, that most competing methods do not explicitly specify or expose key hyperparameters, such as learning rate and batch size, which are known to directly influence both runtime and memory usage. Therefore, although we strictly adhered to the authors’ recommended settings to ensure fairness and reproducibility, differences in computational efficiency across methods should be interpreted with this limitation in mind. As summarized in Table 5, scUDAS required an average runtime of 1,171.48 s and an average memory consumption of 2,887.04 MB on the large-scale dataset. Its runtime was longer than that of CellPredX, scNym, and scANVI but shorter than that of scSemiCluster and scAdapt, and its memory consumption was the second highest after scAdapt. Overall, these results indicate that scUDAS does not exhibit a substantial difference in efficiency or scalability when applied to large datasets compared with other deep learning-based models evaluated under similar experimental settings.
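A measurement of this kind can be sketched with the Python standard library. The snippet below is only an illustration: the profiling tool used in this study is not stated, `profile` is a hypothetical helper, and `tracemalloc` tracks allocations made through the Python allocator (peak per run), which will differ from process-level or GPU memory figures.

```python
import time
import tracemalloc
from statistics import mean


def profile(fn, repeats=5):
    """Run `fn` several times; return (mean runtime in s, mean peak memory in MiB)."""
    runtimes, peaks = [], []
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        fn()
        runtimes.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        tracemalloc.stop()
        peaks.append(peak / 2**20)
    return mean(runtimes), mean(peaks)


# Toy workload standing in for one training run (illustrative only).
runtime_s, memory_mib = profile(lambda: [0] * 100_000, repeats=5)
print(f"{runtime_s:.6f} s, {memory_mib:.2f} MiB")
```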

Table 5.
Average Runtime and Memory Consumption with 95% Confidence Intervals for Bone Marrow Cell-Type Identification Using scUDAS and Other Deep Learning–Based Methods, Averaged over Five Runs

Metric	scUDAS	scANVI	scSemiCluster	scAdapt	scNym	CellPredX
Runtime (s)	1171.48 ± 23.72	771.76 ± 43.47	2012.11 ± 106.69	27333.81 ± 31.26	427.68 ± 46.98	144.71 ± 13.42
Memory (MB)	2887.04 ± 8.80	1945.73 ± 38.17	2595.76 ± 116.14	3432.55 ± 5.78	2622.95 ± 2.74	2693.62 ± 164.31

Bold values indicate the best performance among all methods.

4. DISCUSSION AND CONCLUSIONS

In this study, we proposed scUDAS, a cell-type prediction model based on AT and SSL. scUDAS matched the distributions of datasets generated by different scRNA-seq techniques and improved classification accuracy.

We first evaluated the cell-type prediction performance of scUDAS. Through a comparison with the baseline model, we demonstrated that removing batch effects improved the cell-type prediction performance across all datasets. scUDAS outperformed the other single-cell batch correction models, achieving the highest accuracy in most cases. These results demonstrate that scUDAS is an effective model for alleviating the batch effect and accurately predicting cell types from scRNA-seq data. The ablation experiments showed that combining AT and SSL was effective in removing the batch effect and improving classifier performance.

We visualized the representation space using UMAP plots and confirmed that AT and SSL aligned the distribution of the target dataset with that of the source dataset and properly transferred the cell types from the source dataset to the target dataset.

In addition, we investigated the stability of scUDAS based on sample size. In the experiment with reduced sample sizes, scUDAS was not significantly affected by the sample size compared to the other models, indicating that the model is stable and can be trained sufficiently even with a small sample size.

Overall, scUDAS can predict cell types in scRNA-seq data without ground-truth labels for the target dataset by transferring cell-type information from the source dataset to the target dataset. In future work, we will consider extending our model to combine datasets from multiple target domains, reducing the effort required to predict cell types in each domain.

AUTHORS’ CONTRIBUTIONS

C.P.: Conceptualization, methodology, software, investigation, writing—original draft, visualization; J.M.C.: Conceptualization, data curation, software, investigation, writing—original draft, writing-review and editing; H.C.: Writing—review and editing, supervision, project administration, funding acquisition.

ACKNOWLEDGMENT

The authors acknowledge the use of ChatGPT to improve the grammar and the writing style of the article.

AUTHOR DISCLOSURE STATEMENT

No competing financial interests exist.

FUNDING INFORMATION

This research was supported by the “Korea National Institute of Health” (KNIH) research project (project No. 2024-ER-0801-01), by the Bio&Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. RS-2025–18732993), and by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2026–25476743).

Supplemental Material

References

Abdelaal, Michielsen, Cats, et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol, 2019; 20(1):194.

Alquicira-Hernandez, Sathe, Ji, et al. scPred: Accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol, 2019; 20(1):264.

Andreatta, Hérault, Gueguen, et al. Semi-supervised integration of single-cell transcriptomics data. Nat Commun, 2024; 15(1):872.

Bachman, Alsharif, Precup. Learning with pseudo-ensembles. In Ghahramani, Welling, Cortes, Lawrence, and Weinberger, eds, Advances in Neural Information Processing Systems 27. MIT Press, 2014.

Baron, Veres, Wolock, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst, 2016; 3(4):346–360.e4.

Berthelot, Carlini, Goodfellow, et al. MixMatch: A holistic approach to semi-supervised learning. Adv Neural Inf Process Syst, 2019b; 32.

Bi, Bai, Zhang. HiCat: A semi-supervised approach for cell type annotation. Brief Bioinform, 2025; 26(4):bbaf428.

Butler, Hoffman, Smibert, et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol, 2018; 36(5):411–420.

Chen, He, Zhai, et al. Single-cell RNA-seq data semi-supervised clustering and annotation via structural regularized domain adaptation. Bioinformatics, 2021; 37(6):775–784.

Choi, Rhee J-K, Chae. Cell subtype classification via representation learning based on a denoising autoencoder for single-cell RNA sequencing. IEEE Access, 2021; 9:14540–14548.

The Tabula Sapiens Consortium, Jones, Karkanias, et al. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science, 2022; 376(6594):eabl4896.

De Kanter, Lijnzaad, Candelli, et al. CHETAH: A selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res, 2019; 47(16):e95.

Ganin, Ustinova, Ajakan, et al. Domain-adversarial training of neural networks. J Mach Learn Res, 2016; 17(1):2096–2030.

Gierahn, Wadsworth, Hughes, et al. Seq-Well: Portable, low-cost RNA sequencing of single cells at high throughput. Nat Methods, 2017; 14(4):395–398.

Grandvalet, Bengio. Semi-supervised learning by entropy minimization. Adv Neural Inf Process Syst, 2004; 17.

Grün, Lyubimova, Kester, et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature, 2015; 525(7568):251–255.

Haghverdi, Lun, Morgan, et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol, 2018; 36(5):421–427.

Han, Cheng, Chen, et al. Self-supervised contrastive learning for integrative single cell RNA-seq data analysis. Brief Bioinform, 2022; 23(5):bbac377.

Hardoon, Szedmak, Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Comput, 2004; 16(12):2639–2664.

Hashimshony, Senderovich, Avital, et al. CEL-Seq2: Sensitive highly-multiplexed single-cell RNA-seq. Genome Biol, 2016; 17(1):77.

Hrovatin, Moinfar, Zappia, et al. Integrating single-cell RNA-seq datasets with substantial batch effects. BMC Genomics, 2025; 26(1):974.

Kimmel, Kelley. Semisupervised adversarial neural networks for single-cell classification. Genome Res, 2021; 31(10):1781–1793.

Kingma, Ba. Adam: A method for stochastic optimization. arXiv Preprint, 2014.

Kiselev, Yiu, Hemberg. scmap: Projection of single-cell RNA-seq data across data sets. Nat Methods, 2018; 15(5):359–362.

Korsunsky, Millard, Fan, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods, 2019; 16(12):1289–1296.

Laine, Aila. Temporal ensembling for semi-supervised learning. arXiv Preprint arXiv:1610.02242, 2016.

Lawlor, George, Bolisetty, et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes. Genome Res, 2017; 27(2):208–222.

Lee D-H, et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 896, 2013.

, Lücken, Marioni, et al. Toward informed batch correction for single-cell transcriptome integration. Nat Comput Sci, 2026; 6(2):123–133.

Lieberman, Rokach, Shay. CaSTLe–classification of single cells by transfer learning: Harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments. PLoS One, 2018; 13(10):e0205499.

Lin T-Y, Goyal, Girshick, et al. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

Liu, Xia, Yan, et al. CellPredX, a computational framework for cross-data type, cross-sample, and cross-protocol cell type annotation through domain adaptation and deep metric learning. PLoS Comput Biol, 2026; 22(1):e1013824.

Luecken, Büttner, Chaichoompu, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods, 2022; 19(1):41–50.

Ma, Pellegrini. ACTINN: Automated identification of cell types in single cell RNA sequencing. Bioinformatics, 2020; 36(2):533–538.

McInnes, Healy, Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv Preprint, 2018.

Muraro, Dharmadhikari, Grün, et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst, 2016; 3(4):385–394.e3.

Nesari, MotieGhader, Ghorbian. Advances and challenges in single-cell RNA sequencing data analysis: A comprehensive review. Brief Bioinform, 2026; 27(1):bbaf723.

Park, Zhao. Spectral clustering based on learning similarity matrix. Bioinformatics, 2018; 34(12):2069–2076.

Picelli, Björklund ÅK, Faridani, et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat Methods, 2013; 10(11):1096–1098.

Pliner, Shendure, Trapnell. Supervised classification enables rapid annotation of cell atlases. Nat Methods, 2019; 16(10):983–986.

Polański, Young, Miao, et al. BBKNN: Fast batch alignment of single cell transcriptomes. Bioinformatics, 2020; 36(3):964–965.

Pouyan, Kostka. Random forest based similarity learning for single cell RNA sequencing data. Bioinformatics, 2018; 34(13):i79–i88.

Regev, Teichmann, Lander, et al.; Human Cell Atlas Meeting Participants. Science forum: The Human Cell Atlas. Elife, 2017; 6:e27041.

Sajjadi, Javanmardi, Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Adv Neural Inf Process Syst, 2016; 29.

Schnepf. From prey via endosymbiont to plastids: comparative studies in dinoflagellates. In Lewin R. A, editor, Origins of Plastids, pages 53–76. Chapman and Hall, New York, 2nd edition, 1993.

Segerstolpe, Palasantza, Eliasson, et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab, 2016; 24(4):593–607.

Sohn, Berthelot, Carlini, et al. FixMatch: Simplifying semi-supervised learning with consistency and confidence. Adv Neural Inf Process Syst, 2020; 33:596–608.

Soneson, Robinson. Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods, 2018; 15(4):255–261.

Stoeckius, Hafemeister, Stephenson, et al. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods, 2017; 14(9):865–868.

Stuart, Butler, Hoffman, et al. Comprehensive integration of single-cell data. Cell, 2019; 177(7):1888–1902.e21.

Tirosh. Pitfalls in analysis and interpretation of single-cell RNA-seq data in cancer. Neurooncol Adv, 2026; 8(Suppl 1):i57–i60.

Trapnell. Defining cell types and states with single-cell genomics. Genome Res, 2015; 25(10):1491–1498.

Tzec-Interián, González-Padilla, Góngora-Castillo. Bioinformatics perspectives on transcriptomics: A comprehensive review of bulk and single-cell RNA sequencing analyses. Quant Biol, 2025; 13(2):e78.

Tzeng, Hoffman, Saenko, et al. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.

Wang, Ramazzotti, De Sano, et al. SIMLR: A tool for large-scale genomic analyses by multi-kernel learning. Proteomics, 2018; 18(2):1700232.

Wang, Hou, Zhang, et al. iMAP: Integration of multiple single-cell datasets by adversarial paired transfer networks. Genome Biol, 2021a; 22(1):63.

Wang, Bai, Nabavi. Single-cell classification using graph convolutional networks. BMC Bioinformatics, 2021b; 22(1):364.

Wang, Wang, Zhang, et al. HDMC: A novel deep learning-based framework for removing batch effects in single-cell RNA-seq data. Bioinformatics, 2022; 38(5):1295–1303.

Weiss, Khoshgoftaar, Wang. A survey of transfer learning. J Big Data, 2016; 3(1):1–40.

Xin, Kim, Okamoto, et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab, 2016; 24(4):608–615.

Xu, Su. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics, 2015; 31(12):1974–1980.

Xu, Lopez, Mehlman, et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol Syst Biol, 2021; 17(1):e9620.

, Yan, Zhang, et al. A clustering method for single-cell RNA sequencing data based on denoising and masking learning. Front Bioinform, 2026; 6:1758257.

Yang, Li, Qian, et al. SMNN: Batch effect correction for single-cell RNA-seq data via supervised mutual nearest neighbor detection. Brief Bioinform, 2021; 22(3):bbaa097.

Zhang, O’Flanagan, Chavez, et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods, 2019; 16(10):1007–1015.

Zhang, Wang, Hou, et al. FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling. Adv Neural Inf Process Syst, 2021; 34:18408–18419.

Zhang, Fan, et al. An interpretable framework for clustering single-cell RNA-seq datasets. BMC Bioinformatics, 2018; 19(1):93.

Zhao, Wu, Fang, et al. Evaluation of single-cell classifiers for single-cell RNA sequencing data sets. Brief Bioinform, 2020; 21(5):1581–1595.

Zheng, Terry, Belgrader, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun, 2017; 8(1):14049.

Zheng, Li, Liang, et al. SinNLRR: A robust subspace clustering method for cell type detection by non-negative and low-rank representation. Bioinformatics, 2019; 35(19):3642–3650.

Zhou, Chai, Zeng, et al. scAdapt: Virtual adversarial domain adaptation network for single cell RNA-seq data classification across platforms and species. Brief Bioinform, 2021; 22(6):bbab281.
