Abstract
Background
Deep learning advances medical imaging segmentation, but the insufficient diversity of datasets limits its performance. The AMOS22 dataset addresses this by providing large-scale, varied clinical data to enhance algorithm robustness.
Purpose
This study develops and validates CADSTransN-Net (Convolutional Attention and Deep Supervision TransN-Net) to optimize abdominal organ segmentation for the AMOS22 challenge.
Methods
CADSTransN-Net integrates three core innovations: a novel N-shaped feature flow path (departing from symmetric architectures for efficient encoder-decoder fusion), a convolutional attention mechanism (prioritizing anatomically relevant regions), and layer-wise deep supervision (ensuring meticulous gradient propagation and faster convergence).
Results
Evaluated on the full AMOS22 dataset, CADSTransN-Net achieved an average Dice Similarity Coefficient (DSC) of 0.907, Normalized Surface Dice (NSD) of 0.850, 95th Percentile Hausdorff Distance (HD(95%)) of 3.98 mm, Average Surface Distance (ASD) of 0.75 mm, Absolute Volumetric Difference (AVD) of 39,755.88 mm³, and Relative Volumetric Difference (RVD) of 1.53%. These metrics confirm its high accuracy in region overlap, boundary consistency, and volume estimation for multi-modal abdominal multi-organ segmentation.
Conclusions
CADSTransN-Net effectively meets AMOS22's challenges, delivering robust performance across region, boundary, and volume metrics. It provides a reliable solution for multi-modal abdominal multi-organ segmentation, with significant clinical potential for tasks such as surgical navigation.
Introduction
Segmenting abdominal organs from computed tomography (CT) and magnetic resonance imaging (MRI) scans is a crucial task in medical image analysis, as its accuracy significantly impacts diagnostic precision, treatment planning, and surgical guidance. 1 Recent advances in deep learning have significantly improved the accuracy of automatic multi-organ segmentation. However, the lack of diversity in existing datasets restricts the performance of deep learning models. 2 These datasets often lack sufficient variation in disease states, imaging modalities, and anatomical differences. Such variations are crucial for developing models that can accurately reflect real-world clinical scenarios. Moreover, the high costs associated with collecting and annotating 3D medical data further limit the scope of these datasets, often resulting in a focus on a limited number of organs and insufficient sample sizes. Such constraints hinder the development of models that can robustly handle the complexities of real-world medical imaging tasks. In response to these limitations, the AMOS dataset provides comprehensive data specifically for the abdominal organ segmentation task. 3 It includes 500 CT and 100 MRI scans collected from a diverse range of patients across multiple centers, vendors, modalities, stages, and disease states. Each scan is carefully annotated with voxel-level accuracy for 15 different abdominal organs, offering unprecedented detail and variability. The AMOS dataset offers a robust platform for developing more advanced segmentation algorithms, thereby facilitating comprehensive evaluations of deep-learning models for abdominal image segmentation across various clinical conditions.
Recent studies have demonstrated the potential of deep learning to overcome the challenges associated with medical image segmentation.4,5 However, traditional convolutional neural networks (CNNs) frequently face challenges in managing the complex spatial relationships and variability inherent in medical images. Some studies have begun integrating CNNs with innovative mechanisms, such as Transformers and attention models.6,7 These approaches utilize transformers to capture long-range dependencies within image data and develop a contextual understanding of organ structure, offering new methods to tackle issues like complex spatial relationships and variability. The urgency of developing more accurate and robust segmentation models is driven not only by the clinical demand for precise diagnostics and treatment but also by the need for efficient and reliable analysis tools to manage the increasing volume of medical imaging data. As medical datasets become larger and more complex, the ability of models to scale and maintain high accuracy is critical. Moreover, incorporating advanced models into clinical practice can significantly enhance the efficiency of medical procedures and improve patient outcomes.
This study develops a new deep-learning model that enhances abdominal image segmentation performance by integrating TransN-Net with a convolutional attention (CA) mechanism and a deep supervision (DS) module. The Transformer layer in CADSTransN-Net employs a multi-head self-attention mechanism, which, unlike the local filters of traditional CNNs, captures long-distance dependencies in image data. This mechanism allows the model to integrate distant contextual information, improving segmentation accuracy and detail. In medical image processing, the Transformer layer improves the model's ability to process complex tissue structures and non-homogeneous image features, which is often difficult to achieve with traditional CNN structures.
The CA layer is designed to focus the model's learning on more anatomically relevant regions by dynamically adjusting the response of the feature map. This fine spatial attention capability is not present in traditional attention mechanisms. In experiments, the CA layer demonstrated enhanced sensitivity to detailed features, which is particularly important for identifying and segmenting small organs or structures with complex morphology in medical images, thereby significantly improving segmentation accuracy.
The DS strategy introduces loss functions at multiple depth levels of the CADSTransN-Net model, enabling it to more effectively utilize information from intermediate layers during training, accelerate convergence, and improve adaptability to various segmentation tasks. Compared with traditional single-output supervised models, deep supervision allows the model to correct errors early in training, which is particularly important for training deep networks. It helps avoid the gradient vanishing problem and improves the model's training stability and performance.
The CA mechanism enables the model to focus on key features, thereby enhancing the detection of details crucial for accurate segmentation. At the same time, the DS module can effectively utilize the information of the intermediate layers during the training process, promoting enhanced convergence and generalization. The primary contributions of this research are as follows:
The CADSTransN-Net model integrates the strengths of both Transformer and N-Net architectures. It introduces a cross-layer branch connection and path fusion mechanism, and designs an N-shaped feature flow path. In this structure, the encoded features from specific intermediate layers are directly transmitted to higher-resolution decoding layers for fusion, enhancing feature extraction capabilities. This integrated approach more effectively captures the spatial hierarchy and contextual information required for precise organ segmentation.
The CADSTransN-Net model uses a CA mechanism to dynamically refine feature maps and focus on relevant anatomical regions, enhancing the model's ability to depict complex abdominal structures.
The CADSTransN-Net model employs a DS module to ensure detailed gradient propagation and rapid convergence. This multi-level supervision fosters a more comprehensive optimization of the network during training. It enhances the utilization of information from intermediate layers and enables self-correction at various levels, thereby boosting overall learning efficiency and improving model performance.
The CADSTransN-Net model achieves excellent abdominal organ segmentation capabilities, meets the technical requirements of medical image segmentation, and enhances the diagnosis process and treatment planning.
Related research
Due to the development of deep learning technology, abdominal organ segmentation has achieved remarkable progress, with performance significantly better than traditional image processing methods. CNNs have become the mainstay of image segmentation tasks due to their ability to learn hierarchical features directly from data. 8 U-Net was developed for biomedical image segmentation and has quickly become a benchmark model for medical images, as its structure is well-suited for capturing the spatial hierarchy of various segmentation tasks. 9 Despite the excellent performance of CNNs in medical image segmentation, challenges such as the high variability of patient anatomy, the presence of noise and artifacts in imaging data, and the need for substantial computing resources remain.
Researchers often consider automatic segmentation of abdominal organs a resolved issue, given that state-of-the-art methods achieve results comparable to inter-rater variability on many benchmark datasets. 10 However, most of these datasets predominantly feature cases from a single center, stage, vendor, or disease, raising questions about the generalizability of these high performances to other datasets. Ma et al. 11 conducted a comprehensive study on the segmentation of the liver, kidney, spleen, and pancreas using the large and diverse AbdomenCT-1K dataset. Their findings revealed that current state-of-the-art methods face challenges in generalization across different medical centers, stages, and diseases that the models have not previously encountered.
Gibson et al. 12 studied the automatic segmentation of abdominal anatomical structures on CT images. They proposed a deep learning-based registration-free segmentation algorithm for eight organs related to endoscopic pancreatic and biliary surgical navigation, including the pancreas, gastrointestinal tract (esophagus, stomach, and duodenum), and peripheral organs (liver, spleen, left kidney, and gallbladder). When they cross-validated the proposed method on a multicenter dataset of 90 subjects, it produced significantly higher DSC on all organs and lower mean absolute distances for most organs, demonstrating that deep learning-based segmentation represents a registration-free approach for multi-organ abdominal CT segmentation that may support image-guided navigation in gastrointestinal endoscopy procedures.
Shen et al. 13 developed a U-Net-based segmentation model tailored for hepatobiliary and pancreatic surgery-related organs, including the pancreas, duodenum, gallbladder, liver, and stomach. The model addresses the complexity of CT backgrounds and the variability in size and shape among different organs. To enhance feature extraction, they introduced a spatial attention block designed to learn a spatial attention map through explicit external supervision, effectively highlighting regions of interest. Additionally, they incorporated a deformable convolution block, which adapts the receptive fields for each organ by using trainable offsets to accommodate shape and size variations. The model proposed by Shen et al. significantly enhances overall segmentation performance, achieving an average DSC of 80.46%, thereby establishing it as a competitive method for multi-organ segmentation.
Irshad et al. 14 observed that the complex representation and indistinct boundaries of abdominal organs impede the accuracy of deep learning-based segmentation methods. They proposed enhancing abdominal image segmentation by incorporating organ boundary prediction as a supplementary task. Their approach involved training a 3D encoder-decoder network through multi-task learning, which enabled the simultaneous segmentation of abdominal organs and their boundaries. The study explored two network topologies within a unified multi-task framework, differing in the extent of shared weights between the two tasks. This multi-task training approach proved advantageous, compelling the network to concentrate on the obscure boundaries of organs and, consequently, enhancing the segmentation accuracy of abdominal CT data.
In abdominal MRI segmentation, Amjad et al. 10 identified that deep learning efforts have predominantly focused on developing automatic segmentation models for a single MRI sequence. They introduced a multi-sequence deep learning-based automatic segmentation model (mS-DLAS) for multi-sequence abdominal MRI. The mS-DLAS model, utilizing the 3DResUnet network, was trained and tested on four T1- and T2-weighted MRIs obtained during routine RT simulation from 71 patients with abdominal tumors. The method also used strategies such as data preprocessing, Z-normalization, and data augmentation to improve model performance. Furthermore, two sequence-specific models, T1-weighted (T1-M) and T2-weighted (T2-M), were also developed to assess the performance of sequence-specific DLAS. The final DLAS model demonstrated its efficacy by generating accurate contours of 12 upper abdominal organs for each test case within 21 s.
Labeling multiple organs on a single MR sequence is a time-consuming and labor-intensive task, and manual labeling across multiple MR sequences introduces even greater complexity. One approach to alleviating the burden of manual annotation involves training a model on one sequence and generalizing it to others; however, the presence of domain gaps often results in poor generalization performance. To address the significant challenges in MR image segmentation, Xue et al. 15 introduced a unified framework named OMUDA for one-to-many unsupervised domain adaptive segmentation. This framework utilizes the disentanglement of content and style to effectively transform images from the source domain into those of multiple target domains. OMUDA also employs generator reconstruction and style constraints to enhance cross-modal structural consistency and minimize domain aliasing. The effectiveness of OMUDA is demonstrated by its average Dice Similarity Coefficients (DSCs) across multiple sequences and organs, with values of 85.51% on the internal test set, 82.66% on the AMOS22 dataset, and 91.38% on the CHAOS dataset. These quantitative results confirm the segmentation performance and training efficiency of OMUDA, proving its practical applicability in the early stages of product development.
Although recent years have seen significant progress in the automatic segmentation of abdominal multi-organ structures from CT/MRI scans, the comprehensive evaluation of model capabilities continues to be hindered by the lack of large-scale benchmarks from varied clinical scenarios. The high costs associated with collecting and labeling 3D medical data limit most deep learning models to datasets with a small number of organs of interest or samples. This limitation restricts the capabilities of modern deep learning models and complicates the provision of a comprehensive and fair assessment of different methods. To address these challenges, Ji et al. 16 introduced AMOS, a large-scale and diverse clinical dataset specifically for abdominal organ segmentation. AMOS comprises 500 CT and 100 MRI scans, collected from multi-center, multi-vendor, multi-modal, multi-stage, and multi-disease settings, each annotated at the voxel level for 15 abdominal organs. This dataset provides a challenging environment and a testbed for studying robust segmentation algorithms across various targets and scenarios. Furthermore, Ji et al. benchmarked several state-of-the-art medical segmentation models to evaluate the performance of existing methods on the AMOS dataset.
AMOS 2022 3 encompasses two distinct tasks: CT-only abdominal organ segmentation and combined CT/MRI abdominal organ segmentation. The CT-only task is a fundamental routine that evaluates the performance of various segmentation methods across a large and diverse collection of 500 CT scans. The CT/MRI abdominal organ segmentation task incorporates an additional 100 MRI scans, all annotated consistently with the CT scans, to expand the scope of imaging modalities. In this cross-modality context, developing a unified algorithm that can accurately segment abdominal organs from both CT and MRI images is crucial.
CT scans provide excellent resolution and speed, making them ideal for visualizing bone structures and detecting tumors or abnormalities that exhibit high contrast with surrounding tissues. However, they often fall short of providing a clear contrast for soft tissues. On the other hand, while MRI excels in soft tissue contrast and delivers diverse physiological information, it is generally more costly, requires longer scanning times, and is more susceptible to motion artifacts. Relying solely on CT or MRI can lead to incomplete or biased medical interpretations and diagnoses, potentially missing critical information accessible only through alternative modalities. Given the unique characteristics of these imaging modalities, this study aims to develop a versatile image segmentation framework that is highly effective for both CT and MRI scans. This framework seeks to enhance the comprehensiveness and accuracy of medical analyses by leveraging the unique advantages of each technology without requiring extensive modality-specific adjustments. The ultimate goal is to advance the performance of abdominal imaging technology segmentation, thereby contributing to the field by developing a model that performs effectively across different imaging modalities. These improvements will enhance accuracy and utility in clinical settings, ultimately leading to improved patient outcomes.
Methods
This paper proposes the CADSTransN-Net model to tackle the CT/MRI abdominal organ segmentation task in the AMOS 2022 challenge. CADSTransN-Net combines the Transformer and N-Net models, leveraging their respective advantages in processing large receptive fields and precise positioning to enhance feature extraction for medical image segmentation. Additionally, the CA mechanism and DS technology are incorporated into the CADSTransN-Net structure to improve segmentation accuracy and efficiency. Figure 1 shows the CADSTransN-Net model structure.

The CADSTransN-Net model structure.
The following provides a detailed description of the various components and functionalities within the CADSTransN-Net model architecture:
By combining these components, the CADSTransN-Net model effectively leverages the advantages of the Transformer architecture in capturing long-range dependencies and the N-Net architecture in precise localization, as well as CA mechanisms and DS to improve overall segmentation performance. This comprehensive approach enables the model to achieve high-precision segmentation of abdominal organs from CT and MRI scans.
TransN-Net
TransN-Net is a 3D medical image segmentation model built upon an improved version of the classic U-Net architecture. Its core innovation lies in redesigning the traditional symmetrical U-shaped structure into an "N-shaped feature flow path" and integrating a Transformer module at the bottom of the network to capture global contextual information. This architecture fully leverages the strength of CNNs in extracting local features and the capability of Transformers in modeling long-range dependencies. As a result, it enhances the model's ability to perceive complex anatomical structures while maintaining high-resolution detail reconstruction, thereby improving the accuracy and robustness of medical image segmentation.
As illustrated in Figure 2, the network architecture of TransN-Net comprises three main components: an encoder, a Transformer module, and a decoder. The entire network exhibits a topological layout resembling an "N-shaped" path. Compared to the U-shaped structure of the classic U-Net, this design is more flexible, breaks the constraint of structural symmetry, and constructs an asymmetric yet efficient feature interaction mechanism that flows from shallow to deep and then back to shallow layers.

Structure of the TransN-Net model.
Feature flow and the formation mechanism of the N-type structure
TransN-Net takes a three-dimensional medical image
Unlike U-Net, which performs symmetric up-sampling directly from the bottom layer, TransN-Net introduces cross-layer branch connections and path fusion mechanisms. The proposed architecture allows features from specific intermediate encoder layers to be transmitted directly to decoder layers with higher resolution, breaking away from the constraint of symmetric skip connections.
The feature flow trajectory in the encoding path forms an N-shaped topology. Features are not only propagated downward for semantic abstraction, but also flow across layers toward different decoding levels. The proposed framework establishes an asymmetric information pathway following the pattern of shallow → deep → shallow. The decoder layers are no longer limited to receiving features from their symmetric encoder counterparts; instead, they aggregate information from multiple encoding layers.
Through the construction of additional connection branches, features from intermediate layers (e.g., the second or third encoder layers) are directly fused into multiple asymmetric positions in the decoder. This multi-path feature fusion enhances semantic information flow and improves the model's perception of multi-scale structures. The N-type structure also avoids the bottleneck of information being funneled exclusively through the network's bottom layer, as seen in the U-Net. By enabling feature jumps directly at intermediate layers, it reduces the risk of gradient vanishing and promotes training stability.
Transformer block
The Transformer model, first proposed by Vaswani et al. 17 in 2017, utilizes a self-attention mechanism that enables the model to process data while considering the internal relationships within the sequence. This structure is exceptionally suited for handling sequential data, such as text or time series. However, its versatility extends to various other data types, including images. The key advantage of multi-head self-attention in Transformer layers over traditional convolutional layers is its ability to capture long-range dependencies in data. Traditional convolutional layers process data using filters that operate on local receptive fields. This localized operation limits their ability to incorporate information from distant parts of the input, unless very deep or extensive networks are used. In contrast, the multi-head self-attention mechanism directly calculates the interaction between any two elements in the input, regardless of their distance in the sequence or spatial arrangement.
The multi-head self-attention mechanism enables a model to attend to different positions within the input sequence, capturing various features from distinct subspaces at different locations. Each ‘head’ in the multi-head self-attention can potentially focus on different aspects of the input data, providing a richer representation of the input by aggregating diverse perspectives across multiple positions and feature subspaces. Referring to Figure 3, let us explore how the Transformer layer functions within this architecture:

The transformer block.
By employing multi-head self-attention, Transformer models not only learn isolated local patterns but also integrate global information, which traditional convolutional layers cannot efficiently achieve. This Transformer layer structure enables the model to effectively learn complex patterns and dependencies in medical images, which are crucial for accurate segmentation and analysis.
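The pairwise interaction described above can be sketched with a single attention head. This is a minimal illustration, not the configuration used in CADSTransN-Net. The function names are hypothetical, and the query, key, and value projections are assumed to be identity maps for brevity. A multi-head layer would repeat the same computation in several learned subspaces and concatenate the results.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (assumption).

    tokens: list of n token vectors, each a list of d floats.
    Returns (outputs, weights); weights[i][j] is how strongly token i attends to token j.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        # Every token is scored against every other token, whatever their distance.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights

# Tokens 0 and 2 are identical but not adjacent in the sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
```

In this toy input, token 0 attends to the distant token 2 exactly as strongly as to itself, and more strongly than to its immediate neighbour. A convolution with a small kernel cannot produce this behaviour in a single layer.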
Multi-path connections enhance information propagation
To further strengthen cross-level feature fusion, TransN-Net introduces a multi-path branch connection mechanism that enables features from multiple encoder layers to be directly and concurrently transmitted to any decoder layer, rather than being restricted to symmetric skip connections. This design significantly expands the space for feature aggregation. Each decoder node not only integrates its own up-sampled features but also simultaneously fuses feature representations from several encoder layers. Theoretically, this structure improves the redundancy of information representation and enhances model robustness.
This formulation can be interpreted as an ordered aggregation operation over a set of features. By applying the cross-layer transformation function
This design is particularly advantageous for medical image segmentation tasks, where organs often exhibit blurred boundaries and intricate structural variations. The multi-path fusion strategy helps the network maintain high sensitivity to subtle anatomical details while preserving overall structural integrity.
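As a rough sketch of the multi-path idea, the snippet below lets one decoder node collect features from several encoder levels at once. It uses 1D features for readability. Nearest-neighbour resampling and channel-wise stacking are assumptions made for illustration; the fusion operator actually used in TransN-Net is not reproduced here, and the function names are hypothetical.

```python
def resample(feature, length):
    """Nearest-neighbour resampling of a 1D feature (list of floats) to `length`."""
    n = len(feature)
    return [feature[min(n - 1, i * n // length)] for i in range(length)]

def fuse(decoder_feature, encoder_features):
    """Stack the up-sampled decoder feature with resampled encoder features.

    Returns a list of channels, each at the decoder node's resolution.
    """
    length = len(decoder_feature)
    channels = [list(decoder_feature)]
    for feat in encoder_features:
        channels.append(resample(feat, length))
    return channels

enc1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # shallow encoder level, full resolution
enc2 = [10.0, 20.0, 30.0, 40.0]                  # intermediate encoder level
enc3 = [100.0, 200.0]                            # deep encoder level
dec = [0.5, 0.5, 0.5, 0.5]                       # up-sampled decoder feature

# A symmetric skip connection would pass only enc2 to this node;
# here all three encoder levels contribute.
fused = fuse(dec, [enc1, enc2, enc3])
```

A learned layer would normally follow the stacking step to mix the channels. It is omitted here to keep the sketch short.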
CA module
The CA module is designed to dynamically refine the feature maps at each stage of CADSTransN-Net. It does so by focusing on the regions of interest in the abdominal area and learning spatial dependencies to adjust the feature responses in each channel. Importantly, this refinement does not significantly increase computational complexity. Together, these adjustments improve the model's segmentation accuracy. Figure 4 shows the CA module used in this paper.

CA module structure.
The CA module dynamically adjusts the significance of the network's feature maps by assigning variable attention weights based on the differing information content at each position. The following is the detailed calculation process:
The input feature map X passes through a convolution layer to reduce the channel dimension of the feature map. This convolution layer utilizes a convolution kernel of a specific size to modify the number of channels in the feature map without altering its spatial dimensions. The calculation method is Equation (6):
Among them,
This step ensures that the feature map after the first convolution layer is nonlinear before being passed to the next layer so that the model can capture more complex features.
The number of channels of the feature map is expanded from
The Sigmoid function converts the convolution result A into weight values within the range [0, 1]. These values adjust the importance of each element in the original input feature map. Equation (9) illustrates the calculation method:
This step ensures that the generated attention weights are positive and constrained between 0 and 1, suitable for use as weights.
The final step applies the attention map A’ to the original feature map X to adjust the strength of each feature position through element-wise multiplication, also known as the Hadamard product. The calculation method is Equation (10):
By adjusting the weights of the original feature map, the model focuses on areas considered more important while suppressing the influence of less important ones. This adjustment allows CADSTransN-Net to process input features dynamically, thereby enhancing the model's performance and flexibility.
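The sequence of steps above (channel reduction, ReLU, channel expansion, Sigmoid, Hadamard product) can be sketched as follows. This is an illustrative sketch under stated assumptions: both convolutions are taken to be 1×1 (pointwise) and bias-free, the feature map is stored as channels × flattened positions, and the weights are arbitrary numbers rather than trained values. The names X, A, and A' follow the text; all function names are hypothetical.

```python
import math

def pointwise_conv(x, weight):
    """1x1 convolution (assumed kernel size): mixes channels at each position.

    x: list of C_in channels, each a list of P position values.
    weight: C_out x C_in matrix given as a list of rows.
    """
    positions = len(x[0])
    return [
        [sum(w * x[c][p] for c, w in enumerate(row)) for p in range(positions)]
        for row in weight
    ]

def relu(x):
    return [[max(0.0, v) for v in ch] for ch in x]

def sigmoid(x):
    return [[1.0 / (1.0 + math.exp(-v)) for v in ch] for ch in x]

def conv_attention(x, w_reduce, w_expand):
    """Reduce channels -> ReLU -> expand channels -> Sigmoid -> Hadamard product with x."""
    hidden = relu(pointwise_conv(x, w_reduce))            # reduced, non-linear features
    attention = sigmoid(pointwise_conv(hidden, w_expand)) # A' with values in (0, 1)
    output = [
        [a * v for a, v in zip(a_ch, x_ch)]
        for a_ch, x_ch in zip(attention, x)
    ]
    return output, attention

# X has C = 4 channels and P = 3 positions; weights are illustrative only.
x = [[1.0, -2.0, 3.0], [0.5, 0.5, -1.0], [2.0, 0.0, 1.0], [-1.0, 4.0, 0.0]]
w_reduce = [[0.5, -0.25, 0.1, 0.0], [0.0, 0.3, -0.2, 0.4]]    # 4 -> 2 channels
w_expand = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]  # 2 -> 4 channels
y, attention = conv_attention(x, w_reduce, w_expand)
```

Because every attention weight lies strictly between 0 and 1, the output has the same shape as X and each element is scaled down by a position-dependent factor. No element is amplified.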
DS module
The DS module is a technique used in deep learning models, especially in tasks such as medical image segmentation. Its primary purpose is to enhance the learning ability and performance of the model by introducing additional supervised signals at different levels of the model. 18 The CADSTransN-Net model integrates DS operations across three distinct layers of the decoding stage, allowing the model to calculate loss not only in the last layer but also in multiple intermediate layers. Consequently, each intermediate layer possesses its own loss function, enabling direct optimization. Figure 5 shows the DS module. The architecture comprises three main parts: feature extraction, a multi-level supervision unit, and final prediction.

The DS module. (a) feature extraction, (b) multi-level supervision unit, (c) final prediction.
In the feature extraction stage (Figure 5(a)), the DS module employs average pooling to reduce the spatial dimensions of the feature map by replacing each local pooling window with its average value. This reduction helps decrease computational demands in subsequent layers while retaining essential feature information. Subsequently, the convolutional layer processes the input data using a filter and activates it through the ReLU function, enhancing the network's nonlinear capabilities and improving the training process.
In the multi-level supervision unit (Figure 5(b)), each unit comprises Dropout, ReLU activation, and inner product layers. The Dropout is a regularization technique that helps prevent network overfitting by randomly deactivating some neurons. This random deactivation allows the network to develop more robust feature representations. Following each Dropout, a ReLU activation layer preserves the network's nonlinear characteristics. The inner product layer's output in each DS unit directly compares with the ground truth labels to generate a loss, thereby facilitating optimization at specific network depths. This method ensures that the model fully leverages the features of each level, allowing each layer to contribute to the overall task.
The final prediction unit features an inner product layer (Figure 5(c)). This layer produces the terminal prediction result, aiming at the overall target of the network. The system compares this output with the final ground truth labels of the layer to calculate key performance indicators, such as accuracy and loss. These indicators directly determine the effectiveness of the network training.
Considering that the model utilizes the DS module, the output of each layer employs the same cross-entropy loss function, and the calculation of the total loss must follow Equations (11) and (12).
The DS module enables the CADSTransN-Net model to receive feedback not only at the final output stage but also at the intermediate stages. This multi-level supervision allows the model to utilize information from intermediate layers more effectively during training, thereby optimizing learning efficiency and enhancing model performance. The multi-level supervision module facilitates self-correction at various levels, proving particularly beneficial for solving complex pattern recognition problems. Additionally, the DS module mitigates the issue of vanishing gradients, thus making the training of deep networks more feasible.
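One common way to combine the supervised outputs is a weighted sum of the per-layer losses, sketched below. This is a hedged illustration and not necessarily the form of Equations (11) and (12). The function name, the per-layer weights, and the loss values are hypothetical.

```python
def deep_supervision_loss(final_loss, intermediate_losses, weights):
    """Total loss = final-layer loss + weighted sum of intermediate-layer losses."""
    if len(intermediate_losses) != len(weights):
        raise ValueError("one weight per supervised intermediate layer is required")
    return final_loss + sum(w * l for w, l in zip(weights, intermediate_losses))

# Hypothetical cross-entropy values for the three supervised decoder layers.
intermediate = [0.9, 0.6, 0.4]
weights = [0.25, 0.5, 0.75]  # illustrative only; not taken from the paper
total = deep_supervision_loss(0.3, intermediate, weights)
```

Since each intermediate loss enters the total directly, its gradient reaches the corresponding layer without first passing through every deeper layer. This is the mechanism behind the reduced risk of vanishing gradients.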
Experimental results
Dataset
The AMOS22 dataset is the core part of the Multimodal Abdominal Multi-Organ Segmentation Challenge, which aims to advance and evaluate medical image segmentation technology. Table 1 shows the composition of the AMOS22 data. The dataset comprises 500 CT scans and 100 MRI scans, with the training set, validation set, and test set containing 240, 120, and 240 scans, respectively. Each scan provides voxel-level annotations of 15 different abdominal organs: spleen, right kidney, left kidney, gallbladder, esophagus, liver, stomach, aorta, vena cava, pancreas, right adrenal gland, left adrenal gland, duodenum, bladder, and prostate/uterus.
AMOS22 data composition.
Figure 6 illustrates an example of dataset visualization using the ITK-SNAP medical image visualization software. Figure 6(a) displays the axial plane image of amos_0001, Figure 6(b) shows the sagittal plane image, Figure 6(c) presents the coronal plane image, and Figure 6(d) features the 3D view. For additional details about AMOS22, see the official AMOS22 challenge page (https://amos22.grand-challenge.org).

Example of dataset visualization. (a) axial plane, (b) sagittal plane, (c) coronal plane, (d) 3D view.
We use the loss function to quantify the error between the model's predictions and the actual values, guiding the optimization of the model's parameters. Specifically, we employ cross-entropy loss (CE loss) as the training loss, applying it to the outputs of multiple stages for the DS module. 19 This approach enables the model to learn effective feature representations at each layer, thus enhancing the training process and overall model performance. Equation (13) illustrates the calculation of CE loss, which involves the ground truth label values and the predicted probabilities for the segmentation targets. CE loss is not only simple to implement but also offers high computational efficiency.
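As an illustration of applying CE loss to several supervised outputs, the following is a minimal sketch only. It assumes per-voxel softmax probabilities, stage outputs that have already been resampled to the label resolution, and uniform stage weights; the function names and the weighting are hypothetical and are not taken from the CADSTransN-Net implementation.

```python
import math

def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy over voxels.

    probs  : list of per-voxel class-probability lists (each sums to 1)
    labels : list of integer ground-truth class indices, one per voxel
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Only the probability assigned to the true class contributes.
        total += -math.log(max(p[y], eps))
    return total / len(labels)

def deep_supervised_ce(stage_probs, labels, stage_weights=None):
    """Weighted sum of CE losses over several supervised outputs.

    stage_weights is an assumption (uniform by default); the text does
    not state how the individual stages are weighted.
    """
    if stage_weights is None:
        stage_weights = [1.0] * len(stage_probs)
    return sum(w * cross_entropy(p, labels)
               for w, p in zip(stage_weights, stage_probs))

# Two voxels, two classes: a confident final output and an
# uninformative intermediate output.
labels = [0, 1]
final_stage = [[0.9, 0.1], [0.2, 0.8]]
coarse_stage = [[0.5, 0.5], [0.5, 0.5]]
total_loss = deep_supervised_ce([final_stage, coarse_stage], labels)
```

In this toy case the intermediate output contributes the larger share of the loss, which is how supervision at intermediate stages produces a training signal for the earlier layers.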
Experimental results
The AMOS22 Challenge evaluates model performance using two metrics: DSC and NSD. In addition, this study adopts four clinically relevant metrics, HD(95%), ASD, AVD, and RVD, to evaluate segmentation performance in terms of boundary accuracy and volume consistency, both of which are critical for the clinical applicability of abdominal multi-organ segmentation.
The DSC is a statistical measure that quantifies the similarity between two sample sets and is widely used in medical image segmentation to assess the accuracy of segmentation results. Equation (14) shows the calculation method for DSC, where X and Y represent the two sample sets, and |X| and |Y| denote the number of elements in each set.
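The set-based definition of DSC described above can be sketched as follows, treating each segmentation as a set of voxel coordinates. The function name and the convention for two empty masks are assumptions made for illustration.

```python
def dice(pred_voxels, gt_voxels):
    """Dice Similarity Coefficient between two sets of voxel coordinates."""
    x, y = set(pred_voxels), set(gt_voxels)
    if not x and not y:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

# Three predicted voxels, three ground-truth voxels, two in common.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
gt = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}
score = dice(pred, gt)  # 2 * 2 / (3 + 3)
```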
NSD measures the degree of boundary agreement within a given error range and is used to evaluate segmentation quality at the boundary. It reports the portion of the surface that lies within this error range relative to the total surface (the sum of the predicted surface area and the ground truth surface area). A higher value indicates better performance. Equation (15) presents the calculation method for NSD.
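One common way to write this ratio is given below as a hedged sketch. The symbols are assumptions introduced here for illustration: S_P and S_G denote the predicted and ground truth surfaces, B_G^(τ) and B_P^(τ) denote the regions within distance τ of the respective surfaces, and τ is the error range, whose value is not stated in this section.

```latex
\mathrm{NSD} =
\frac{\left| S_P \cap B_G^{(\tau)} \right| + \left| S_G \cap B_P^{(\tau)} \right|}
     {\left| S_P \right| + \left| S_G \right|}
```

The numerator counts the surface portions of each segmentation that fall inside the other's tolerance region, and the denominator is the total surface described in the text.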
HD(95%) quantifies the maximum boundary deviation between the predicted segmentation and the ground truth while excluding the top 5% of extreme outlier distances. This avoids overestimation caused by individual abnormal boundary points, making the metric more robust and clinically meaningful than the full Hausdorff Distance, especially for small organs (e.g., gallbladder, adrenal glands), where local segmentation errors may otherwise distort the evaluation. Equation (16) presents the calculation method for HD(95%).
ASD complements HD(95%) by averaging the minimum Euclidean distances between the boundary points of the ground truth and those of the predicted segmentation. It reflects the overall consistency of the segmentation boundary rather than its extreme values, making it suitable for evaluating large organs (e.g., liver, spleen) with extensive boundary surfaces. Equation (17) presents the calculation method for ASD.
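The difference between the two boundary metrics can be illustrated with a small sketch on surface point lists (coordinates in mm). Several variants of these metrics exist; the version below assumes the symmetric form that takes the larger of the two directed 95th percentiles, a linear-interpolation percentile, and surfaces that have already been extracted as point lists. All function names are hypothetical.

```python
import math

def _directed_distances(src, dst):
    """For every point in src, the distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def _percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def hd95(pred_surface, gt_surface):
    """Larger of the two directed 95th-percentile surface distances."""
    d_pg = _directed_distances(pred_surface, gt_surface)
    d_gp = _directed_distances(gt_surface, pred_surface)
    return max(_percentile(d_pg, 95), _percentile(d_gp, 95))

def asd(pred_surface, gt_surface):
    """Mean of all nearest-neighbour distances in both directions."""
    d = (_directed_distances(pred_surface, gt_surface)
         + _directed_distances(gt_surface, pred_surface))
    return sum(d) / len(d)

# 100 ground-truth points on a line; the prediction matches except for
# one outlier displaced by 50 mm.
gt_pts = [(float(i), 0.0, 0.0) for i in range(100)]
pred_pts = [(0.0, 50.0, 0.0)] + gt_pts[1:]
full_hd = max(_directed_distances(pred_pts, gt_pts))
robust_hd = hd95(pred_pts, gt_pts)
mean_dist = asd(pred_pts, gt_pts)
```

In this toy example the single outlier sets the full Hausdorff Distance to 50 mm, while the 95th-percentile variant ignores it and the average distance stays small.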
AVD quantifies the actual numerical difference between the volume of the predicted segmentation region and the ground truth organ volume. It directly reflects the accuracy of the model in reconstructing the spatial size of abdominal organs, which is essential for clinical tasks such as evaluating organ atrophy or enlargement. Equation (18) presents the calculation method for AVD.
RVD normalizes AVD by the ground truth organ volume, converting the absolute volume difference into a percentage of the ground truth organ volume. This metric eliminates the impact of organ size differences: a given AVD may be acceptable for the liver (a large organ) but excessive for the gallbladder (a small organ). RVD is therefore suitable for comparing volume accuracy across different abdominal organs. Equation (19) presents the calculation method for RVD.
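The relationship between the two volume metrics can be sketched as follows, assuming volumes are obtained by multiplying the voxel count by the voxel spacing. The voxel counts and the spacing below are hypothetical values chosen only to show why the same AVD can be negligible for a large organ and substantial for a small one.

```python
def volume_mm3(voxel_count, spacing_mm):
    """Organ volume in cubic millimetres from a voxel count and spacing."""
    sx, sy, sz = spacing_mm
    return voxel_count * sx * sy * sz

def avd(pred_count, gt_count, spacing_mm):
    """Absolute volumetric difference in cubic millimetres."""
    return abs(volume_mm3(pred_count, spacing_mm)
               - volume_mm3(gt_count, spacing_mm))

def rvd_percent(pred_count, gt_count, spacing_mm):
    """AVD expressed as a percentage of the ground-truth volume."""
    return 100.0 * avd(pred_count, gt_count, spacing_mm) / volume_mm3(gt_count, spacing_mm)

spacing = (1.0, 1.0, 2.0)  # hypothetical voxel spacing in mm

# Hypothetical large organ: same absolute error, small relative error.
liver_avd = avd(745_000, 750_000, spacing)
liver_rvd = rvd_percent(745_000, 750_000, spacing)

# Hypothetical small organ: same absolute error, large relative error.
gall_avd = avd(10_000, 15_000, spacing)
gall_rvd = rvd_percent(10_000, 15_000, spacing)
```

Both cases have an AVD of 10,000 cubic millimetres, but the relative error is under 1% for the large organ and about one third for the small one.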
We evaluated all models on the same training, validation, and test sets, using a data split identical to that of the AMOS22 competition. We trained all models on a computer equipped with an NVIDIA GeForce RTX 4090 D GPU (24 GB). For training, we set the initial learning rate to 0.001 (applied uniformly across all models) and adopted a ReduceLROnPlateau schedule: if the validation Dice score does not improve for 30 consecutive epochs, we reduce the learning rate by a factor of 0.2 (with a threshold of 1e-3 to prevent trivial adjustments), and training stops when the learning rate drops below 1e-6 or the total number of epochs (set to 10,000) is reached. We used the Adam optimizer for all models with default momentum parameters (β1 = 0.9, β2 = 0.999, and ε = 1e-8) and a weight decay of 3e-5 for L2 regularization. We applied gradient clipping with a maximum norm of 12 to prevent gradient explosion. We set the batch size to 2 and used complementary regularization strategies: data augmentation (3D rotation within [−30°, 30°], scaling in [0.7, 1.4], and symmetric mirroring with a probability of 0.5, with elastic deformation disabled to avoid anatomical noise), a dropout probability of 0 (to retain fine-grained anatomical features), and L2 regularization to suppress overfitting.
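The plateau rule described above can be sketched without any framework dependency. The following is a simplified stand-in written for illustration, not the library scheduler itself: it assumes the threshold is applied relative to the best score so far and that the learning rate is reduced as soon as 30 consecutive non-improving epochs have been observed. Library implementations may count patience slightly differently, and the function name is hypothetical.

```python
def plateau_schedule(val_dice_per_epoch, lr=1e-3, factor=0.2,
                     patience=30, threshold=1e-3, min_lr=1e-6):
    """Return the learning rate used at each epoch until training stops."""
    best = float("-inf")
    bad_epochs = 0
    history = []
    for dice in val_dice_per_epoch:
        if lr < min_lr:
            break  # stopping rule: learning rate has dropped below min_lr
        history.append(lr)
        # Assumed relative threshold: tiny gains do not count as improvement.
        if dice > best * (1.0 + threshold):
            best, bad_epochs = dice, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                lr *= factor
                bad_epochs = 0
    return history

# A validation Dice that never improves after the first epoch.
flat = plateau_schedule([0.5] * 200)
# A validation Dice that improves every epoch.
rising = plateau_schedule([0.1 + 0.01 * i for i in range(50)])
```

With a flat validation score the learning rate is cut five times (1e-3, 2e-4, 4e-5, 8e-6, 1.6e-6) before the next cut would take it below 1e-6, so training ends long before the 10,000-epoch cap in this toy case.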
Table 2 focuses on validating the necessity of the N-shaped feature flow path. We replaced the original N-shaped flow (derived from N-Net) with three mainstream feature flow architectures: U-Net (symmetric encoder-decoder with skip connections), ResU-Net (U-Net integrated with residual blocks), and DenseU-Net (U-Net with dense connection blocks). The experiments were conducted under the same experimental settings (using the full AMOS22 dataset, consistent training hyperparameters, and evaluation metrics of DSC and NSD).
Segmentation results of U-Net, ResU-Net, DenseU-Net, and N-Net models.
Note: Spleen - spleen, R.Kid - right kidney, L.Kid - left kidney, Gall - gallbladder, Eso - esophagus, Liver - liver, Stom - stomach, Aorta - aorta, Pos - postcava, Panc - pancreas, RAG - right adrenal gland, LAG - left adrenal gland, Duo - duodenum, Blad - bladder, Pros - prostate/uterus.
The N-Net model outperforms U-Net, ResU-Net, and DenseU-Net in overall performance, achieving the highest average Dice Similarity Coefficient (DSC) of 89.64 and average Normalized Surface Dice (NSD) of 82.76. These values are 2.59 and 1.48 percentage points higher than those of DenseU-Net (the second-best model), respectively, and well above those of U-Net (+3.29 DSC, +2.57 NSD) and ResU-Net (+3.03 DSC, +2.53 NSD), supporting the effectiveness of the N-shaped feature flow path in multi-organ segmentation. N-Net shows clear advantages on several challenging organs, particularly the prostate (Pros), pancreas (Panc), and esophagus (Eso). The largest improvement is in prostate segmentation, where DSC increases by 12.87 percentage points and NSD by 10.33 percentage points compared to ResU-Net (the best-performing comparative model), suggesting that N-Net's feature fusion strategy substantially improves accuracy for small organs and those with blurred boundaries. The gallbladder (Gall) is an exception: DenseU-Net achieves the best performance (DSC 89.61, NSD 84.41), while N-Net ranks third (DSC 83.40), possibly because of the gallbladder's small volume and its susceptibility to interference from surrounding tissues, which suggests that different feature flow paths adapt differently to specific organs. For larger organs with relatively clear boundaries, such as the liver (Liver) and spleen (Spleen), all four models achieve high DSC (>95%). N-Net still maintains a slight advantage, especially in spleen segmentation, where its NSD is 1.87 percentage points higher than that of the second-ranked ResU-Net, indicating better prediction of organ surface contours.
DenseU-Net performs second-best on most organs and ranks first in bladder (Blad) segmentation with a DSC of 93.29, highlighting the advantage of dense connections in predicting regional consistency for specific organs. ResU-Net excels in stomach (Stom) segmentation (DSC 93.71), reflecting the effectiveness of residual connections in capturing the boundaries of complex structures. U-Net, as the baseline model, shows stable but unremarkable overall performance. In contrast, N-Net outperforms the other models on most organs (11/15) and particularly improves the accuracy of hard-to-segment organs, supporting the generality and effectiveness of its feature flow design. Table 3 extends the comparison to the variants built on N-Net.
Segmentation results of N-Net, TransN-Net, TransN-Net + CA, TransN-Net + DS, and CADSTransN-Net models.
N-Net demonstrates strong capabilities in abdominal CT/MRI segmentation, achieving a DSC of 89.64 and an NSD of 82.76. The TransN-Net model combines the strengths of the Transformer and N-Net architectures, providing enhanced feature extraction. This combination captures the contextual information essential for spatial hierarchies and detailed organ segmentation. As a result, it surpasses N-Net, obtaining a DSC of 89.74 and an NSD of 83.09. The CA module dynamically refines the feature map by prioritizing relevant anatomical regions and reducing the impact of less pertinent ones. Integrating this module into TransN-Net raises the DSC to 90.28 and the NSD to 84.19. The DS module ensures that the model receives feedback not only at the final output stage but also at intermediate stages, which makes better use of information from intermediate layers during training. TransN-Net + DS achieves a DSC of 90.31 and an NSD of 84.12. CADSTransN-Net uses both the CA and DS modules and achieves a DSC of 90.76 and an NSD of 85.00, the highest of the five variants, making it a promising solution for abdominal organ segmentation.
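The incremental gains implied by these figures can be checked with simple arithmetic. The sketch below uses only the average DSC and NSD values quoted in the paragraph above; the dictionary layout and the function name are illustrative.

```python
# Average (DSC, NSD) values quoted in the text for each variant.
scores = {
    "N-Net": (89.64, 82.76),
    "TransN-Net": (89.74, 83.09),
    "TransN-Net + CA": (90.28, 84.19),
    "TransN-Net + DS": (90.31, 84.12),
    "CADSTransN-Net": (90.76, 85.00),
}

def gains_over(baseline, table):
    """Percentage-point gains in (DSC, NSD) of each model over a baseline."""
    b_dsc, b_nsd = table[baseline]
    return {
        name: (round(dsc - b_dsc, 2), round(nsd - b_nsd, 2))
        for name, (dsc, nsd) in table.items()
        if name != baseline
    }

gains = gains_over("N-Net", scores)
for name, (d_dsc, d_nsd) in gains.items():
    print(f"{name}: +{d_dsc:.2f} DSC, +{d_nsd:.2f} NSD")
```

Relative to N-Net, the full model gains 1.12 DSC and 2.24 NSD percentage points, and each single-module variant contributes roughly half of that gain on its own.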
Table 4 presents a comprehensive comparison of core performance metrics for eight segmentation models on the AMOS22 dataset, with clear performance gradients as model complexity increases and modules are added. Among the basic CNN models, U-Net (the foundational model) shows the lowest performance across all metrics (DSC = 86.35%, HD(95%) = 5.59 mm, RVD = 2.19%), revealing limitations in boundary capture and volume estimation for multi-organ segmentation. ResU-Net, which incorporates residual connections, outperforms U-Net with a 0.26-percentage-point increase in DSC, a 0.39 mm reduction in HD(95%), and a 2512.24 mm³ decrease in AVD, suggesting that residual connections alleviate gradient vanishing and enhance feature reuse. DenseU-Net, leveraging dense connections, further improves performance to 87.05% DSC, 4.02 mm HD(95%) (the best among the basic models), and 2.09% RVD, supporting its advantage in segmenting the fine boundaries of small organs such as the gallbladder and adrenal glands. Among the advanced models, N-Net achieves a clear performance leap over DenseU-Net, with a 2.59-percentage-point increase in DSC (from 87.05% to 89.64%), a 1.48-percentage-point increase in NSD (from 81.28% to 82.76%), and a 3611.78 mm³ reduction in AVD, likely due to its optimized feature fusion (e.g., cross-scale feature concatenation), which addresses the large scale differences between organs. TransN-Net, which adds Transformer attention to N-Net, gains only 0.10 percentage points in DSC (from 89.64% to 89.74%), while HD(95%) rises to 5.56 mm and RVD to 2.22%, possibly because Transformers struggle to capture the local features of small organs (attention weights tend to focus on large organs, leading to boundary deviations in small ones).
Among the models with added modules, TransN-Net + CA (with convolutional attention) outperforms TransN-Net with a 0.54-percentage-point DSC increase (from 89.74% to 90.28%), a 1.27 mm HD(95%) reduction (from 5.56 mm to 4.29 mm), and an RVD of 1.81%, suggesting that the attention module focuses on organ-specific features (e.g., grayscale differences between the liver and spleen) and thereby corrects boundary deviations. TransN-Net + DS (with deep supervision) reaches 90.31% DSC, 0.78 mm ASD, and 49,101.66 mm³ AVD, consistent with the benefit of supervising intermediate stages for the complete segmentation of large organs such as the liver and stomach. CADSTransN-Net, which combines both modules, achieves the best performance across all metrics (90.76% DSC, 85.00% NSD, 3.98 mm HD(95%), 1.53% RVD), indicating that its combination of Transformer, convolutional attention, and deep supervision meets the multi-organ segmentation requirements of complete large-organ segmentation, accurate small-organ boundary capture, and precise volume estimation.
Comparison of core performance metrics of different segmentation models in AMOS22.
Figure 7 provides a detailed visualization of the segmentation results for amos_0008 using ITK-Snap medical image visualization software. The first row displays the ground truth, while each subsequent row showcases the segmentation outcomes from different models.

Ground truth and prediction results of amos_0008. (a) axial plane, (b) sagittal plane, (c) coronal plane, (d) 3D view.
Figure 7(a), the first image in each row, shows the 79th axial plane of amos_0008, allowing the segmentations of the different models to be compared: N-Net, TransN-Net, TransN-Net + CA, TransN-Net + DS, and the fully integrated CADSTransN-Net. These models are shown from the second row onwards, illustrating incremental improvements in segmentation accuracy.
Figure 7(b) extends the comparison by displaying the 386th sagittal plane. This visualization highlights how each model manages complex anatomical structures when viewed from this specific orientation.
Similarly, Figure 7(c) presents the 401st coronal plane. It further showcases the models’ abilities to handle intricate anatomical details from a different perspective.
Finally, Figure 7(d) presents a 3D view of the segmented organs, offering a holistic and spatial understanding of the model's performance. This 3D representation effectively highlights the depth and precision of organ delineation for each model, with the CADSTransN-Net demonstrating significantly improved accuracy and detail, particularly when directly compared to the conventional N-Net model.
This layered visual analysis not only showcases the specific capabilities of each segmentation approach but also underscores the progressive enhancements achieved by incorporating CA and DS techniques in the TransN-Net architecture, culminating in the advanced performance of the CADSTransN-Net model.
Tables 2–4 demonstrate the excellent performance of CADSTransN-Net in abdominal CT/MRI segmentation, achieving a DSC of 90.76 and an NSD score of 85.00. These metrics highlight the high accuracy and precision of CADSTransN-Net in delineating individual organs in abdominal images. Furthermore, Figure 7 vividly illustrates the ability of CADSTransN-Net to accurately segment each organ class, providing clear visual evidence of its effectiveness. These results confirm the usefulness of CADSTransN-Net in complex medical imaging tasks and demonstrate its potential to significantly enhance the diagnostic process by providing precise anatomical details.
Discussion
Table 5 shows a detailed comparison of CADSTransN-Net and state-of-the-art methods across the 15 organ categories of abdominal CT/MRI segmentation, with each result reported to two decimal places. AMOS22 ranked all participants, with yushi13 winning first place in the CT/MRI segmentation challenge with a DSC of 89.72 and an NSD of 81.50. 3 Details of the segmentation results are available on the AMOS22 official leaderboard (https://amos22.grand-challenge.org/evaluation/amos-ctmri-regular-evaluation/leaderboard). CADSTransN-Net, as proposed in this paper, surpassed yushi13 in most organ segmentation categories, achieving higher DSC and NSD scores. These results demonstrate the strong segmentation capability of CADSTransN-Net.
Comparison with state-of-the-art methods.
Qi et al. 20 developed a Siamese framework for GMIM, featuring an online branch and a target branch, to guide the network in learning correlations between organs and tissues by reconstructing the original image from partial observations. The online branch uses an adaptive hierarchical masking strategy to identify boundaries or areas with minor contextual changes within the image, and it learns high-level semantic representations from the deeper layers of the multi-scale encoder. Meanwhile, the target branch supports contrastive learning by providing representations that help reduce redundancy. GMIM treats the left and right kidneys as a single kidney class and merges LAG and RAG into a single adrenal gland (AG) class. This method achieved a DSC of 57.86 and an NSD of 71.32. Unlike GMIM, which primarily relies on a hierarchical masking strategy for identification and redundancy reduction, CADSTransN-Net employs a CA mechanism and DS, which directly enhance the model's focus on relevant anatomical features and facilitate gradient propagation at multiple network depths. These features contribute to the higher scores of CADSTransN-Net, which achieves a DSC of 90.76 and an NSD of 85.00.
Huang et al. 21 developed the A-Eval benchmark, a cross-dataset evaluation tool for abdominal multi-organ segmentation across five different datasets. This benchmark evaluates the generalizability of various models across diverse data usage scenarios, including independent training on a single dataset, leveraging unlabeled data via pseudo-labeling, combining different modalities, and joint training on all available datasets. Additionally, the study examined the impact of model size on cross-dataset generalization, highlighting the importance of effective data utilization in enhancing model generalizability. This research provides valuable insights into assembling large-scale datasets and refining training strategies. Ultimately, their efforts yielded segmentation results for 9 out of the 15 abdominal organs, achieving a DSC of 87.66 and an NSD score of 92.53. While their approach underscores the importance of model scalability and data diversity, CADSTransN-Net extends these principles by integrating multiple advanced architectural elements that not only handle data diversity efficiently but also ensure high precision in segmentations across various imaging conditions. This results in CADSTransN-Net's robust performance across different organs.
Lee et al. 22 observed that 3D medical vision transformers (ViTs) such as SwinUNETR achieved state-of-the-art performance across several 3D volumetric data benchmarks, including 3D medical image segmentation. These approaches combine hierarchical Transformers, such as the Swin Transformer, with elements from convolutional neural networks, significantly improving the practical feasibility of volumetric segmentation in 3D medical datasets. The success of such hybrid models is mainly due to the extensive receptive field provided by non-local self-attention and the high number of model parameters. Lee et al. developed a lightweight volumetric convolutional neural network, 3D UX-Net, which integrates convolutional layers into a hierarchical Transformer framework to enhance the robustness of volume segmentation. Specifically, they revisited volumetric depthwise convolutions with large kernel sizes (starting from 7 × 7 × 7), inspired by the Swin Transformer, to expand the global receptive field. By replacing the Multi-Layer Perceptron (MLP) in the Swin Transformer block with point-wise depthwise convolutions and reducing the number of normalization and activation layers, 3D UX-Net surpassed the top score previously held by yushi13 in the CT/MRI segmentation challenge, achieving a DSC of 90.00, with no NSD reported. Whereas 3D UX-Net obtains its large receptive field from large-kernel convolutions, CADSTransN-Net combines Transformer layers for global contextual awareness with the N-Net architecture for precise localization. In addition, the CA mechanism in CADSTransN-Net refines feature maps to focus on critical structures, which is reflected in its higher DSC (90.76 versus 90.00).
CADSTransN-Net integrates the robust feature extraction capabilities of the N-Net architecture with the global contextual awareness of Transformer layers, and CA mechanisms and DS strategies further enhance this integration. Unlike traditional models that rely heavily on local contextual information, the Transformer layers in CADSTransN-Net process image data with an emphasis on long-range dependencies—this enables the model to integrate information across the entire image. This capability is particularly beneficial for medical imaging, where understanding the relationship between adjacent anatomical structures is crucial.
The CA mechanism introduces a novel dimension to the model, dynamically focusing on anatomically significant areas and refining feature maps more effectively than traditional attention mechanisms. This targeted approach enables CADSTransN-Net to improve segmentation precision, particularly in regions with complex morphology or closely packed organs. It stands in sharp contrast to models that apply uniform attention across all features—such models can dilute focus on the critical regions required for accurate medical analysis.
Furthermore, the DS modules introduced into the model architecture enhance learning from the network's earliest layers. By implementing loss functions at multiple depths, CADSTransN-Net corrects errors not only at the output layer but throughout the entire network, resulting in a more robust and error-resilient model. This method addresses a key limitation of traditional approaches, which often struggle with gradient vanishing or exploding—especially in deep neural networks.
Experimental evaluations underscore the superior performance of CADSTransN-Net, which achieves high scores in both DSC and NSD. These results reflect the model's effectiveness in handling the variability inherent in medical images, such as differences in organ size, shape, and location across patients. CADSTransN-Net's ability to outperform traditional segmentation methods highlights its potential to enhance diagnostic accuracy and efficiency in clinical settings.
In conclusion, the CADSTransN-Net model not only advances the technological capabilities of segmentation algorithms but also aligns closely with clinical needs by providing a versatile, robust, and efficient tool for medical image analysis. Its development marks a considerable advancement in applying deep learning to healthcare, offering significant improvements in patient care and operational efficiency in medical diagnostics.
Conclusion
This paper proposes CADSTransN-Net, a model that fuses the Transformer and N-Net architectures and is designed explicitly for abdominal organ segmentation from CT and MRI scans. To address key challenges in medical image segmentation, the model integrates a convolutional attention (CA) mechanism and a deep supervision (DS) module, two core design optimizations that together improve the precision of abdominal organ delineation. Empirical results demonstrate that CADSTransN-Net outperforms conventional methods in capturing complex anatomical structures, validating the effectiveness of its architecture fusion and module enhancements. Beyond technical advancements, the model exhibits significant potential for practical clinical integration, as its reliable segmentation performance can support downstream clinical workflows, such as preoperative planning and postoperative evaluation, by providing accurate anatomical references. Looking forward, CADSTransN-Net offers a promising direction for future research in medical imaging technology. For instance, it can serve as a baseline framework to explore multi-modal data fusion, or be extended to segment more complex anatomical regions, further bridging the gap between AI-driven segmentation and clinical practice.
Footnotes
Ethical considerations
Not applicable.
Informed consent statement
Not applicable.
Author contributions
Data curation, GW and YT; Formal analysis, LM; Investigation, TM; Methodology, PS; Validation, ZC.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the Guangxi Natural Science Foundation (2026GXNSFBA00640003), Guangxi Key Laboratory of Automatic Detecting Technology and Instruments (YQ26102), Middle-aged and Young Teachers’ Basic Ability Promotion Project of Guangxi (2025KY0258), Guilin University of Electronic Technology Scientific Research Fund Project (UF24014Y), National Natural Science Foundation of China (62263006), 2021 Director's Fund of the Guangxi Key Laboratory for Automatic Detection Technology and Instruments (YQ21107).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
