A Multi-Task Segmentation and Classification Network Based on Ultrasound Images for Predicting the Grading of Ascites in the Abdominal Cavity

Abstract

Abdominal trauma with bleeding is a leading cause of post-traumatic death, and detecting free fluid in the abdomen or hemoperitoneum can provide critical guidance for clinical management. Rapid and accurate ultrasound diagnosis of abdominal bleeding is important for deciding whether surgical intervention is needed. This study introduces a multi-task network for the segmentation and classification of ascites in ultrasound images. The network uses a U-Net backbone with a ResNeXt encoder as the basic architecture for the segmentation and classification models. The segmentation network includes a Frequency Channel Attention (FCA) module, which broadens the range of captured information and enhances the robustness of channel representation. Furthermore, an Enhanced Channel Attention Multi Feature Fusion (EMFF) module is used to extract the interdependencies between feature channels by combining high-order and low-order feature mappings, thereby improving segmentation accuracy. Lastly, a classification branch that shares the encoder features grades the ascites. Experiments on the collected ascites ultrasound dataset demonstrated that the proposed method achieved a segmentation Dice of 85.28% and a classification accuracy of 86.18%. It outperformed the leading multi-task SOTA method by 0.70% in Dice and 2.03% in accuracy, establishing a new benchmark for simultaneous ascites assessment. This study showed that the proposed network is valuable for the preliminary diagnosis of ascites in ultrasound and can serve as a potential auxiliary tool for clinical ascites examination in emergency situations.

Keywords

abdominal ascites, ultrasound image, deep learning, multi-task learning, attention mechanism

Introduction

Abdominal trauma is a prevalent injury that is frequently complicated by active bleeding due to liver or spleen damage, constituting a main cause of death following trauma.¹ As a key indicator of abdominal diseases and organ injuries, ascites is critical for the early diagnosis and grading of abdominal trauma. Its detection enables rapid triage and timely identification of life-threatening cases, leading to lower mortality rates. Rapid and accurate diagnosis of ascites in the abdominal region is significant for making accurate decisions on the need for surgery.² Ultrasound imaging is important in the initial assessment of patients suspected of abdominal injury.³ The non-invasive, rapid, and repeatable nature of ultrasound makes it a valuable tool in emergency situations.⁴

Ultrasound-based grading of ascites per clinical guidelines comprises three levels (Figure 1).⁵ Grade 1 ascites (<3 cm) is defined by its sonographic detection alone, absent clinical signs like distension or shifting dullness. Grade 2 (3–10 cm) manifests as moderate symmetrical distension, with shifting dullness being variable. Grade 3 (>10 cm) demonstrates overt distension, positive shifting dullness, and possible protuberance or umbilical hernia.⁵ The diagnosis of ascites by ultrasound may be affected by the subjectivity of expert diagnosis and the experience of ultrasound technicians. Moreover, small or occult ascites is often difficult to observe with the naked eye, leading to a high rate of missed diagnoses. These issues have prompted the exploration of computer-aided diagnosis (CAD) to improve the accuracy and efficiency of diagnosing ascites or abdominal bleeding.⁶

Figure 1.

Typical ultrasound images of ascites samples with different degrees of ascites: (a) Grade 1 ascites, (b) Grade 2 ascites, and (c) Grade 3 ascites.
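The depth thresholds of the three-level grading scheme can be written as a small lookup. The following is a minimal sketch, assuming that the quoted values are fluid depths in centimetres and that the boundary values of 3 cm and 10 cm belong to Grade 2; the function name `ascites_grade` is hypothetical and is not part of the proposed network.

```python
def ascites_grade(depth_cm: float) -> int:
    """Map a sonographic fluid depth (cm) to the three-level grade described above."""
    if depth_cm <= 0:
        raise ValueError("depth must be positive; absence of fluid is outside this scheme")
    if depth_cm < 3:
        return 1  # Grade 1: detectable on ultrasound only
    if depth_cm <= 10:
        return 2  # Grade 2: moderate symmetrical distension
    return 3      # Grade 3: overt distension


for depth in (1.5, 3.0, 10.0, 12.0):
    print(depth, "cm -> Grade", ascites_grade(depth))
```

In the proposed method the grade is predicted directly from the image by the classification branch; the rule above only restates the clinical definition used for labeling.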

Deep learning applications offer promising solutions for automating and enhancing the interpretation of medical images.⁷,⁸ Leveraging its capacity to discern complex data patterns, deep learning could be a powerful tool for improving diagnostic accuracy, and thus shows great potential in CAD of abdominal bleeding.⁶,⁹⁻¹¹ The main tasks of deep learning in the diagnosis of abdominal bleeding are segmentation and classification. Segmentation can accurately identify bleeding sources and detect free fluid in targeted abdominal regions. Free fluid typically accumulates in regions such as the hepatorenal fossa, splenorenal fossa, and pelvic cavity. The occurrence of free fluid or ascites strongly suggests the possibility of severe intra-abdominal injury and bleeding, potentially necessitating emergency laparotomy, blood transfusion, and other life-saving measures.¹² Classification provides accurate grading of ascites, thereby reflecting the extent of fluid accumulation in the abdominal cavity and assisting physicians in assessing the severity of the patient’s condition. Grading of ascites provides an important basis for clinicians to formulate personalized treatment plans. For example, a small amount of ascites (Grade 1) may only require observation or conservative treatment, while moderate or large amounts of ascites (Grade 2 or 3) may require more aggressive treatment measures, such as diuretics, abdominal paracentesis, and drainage.¹³ Grading of ascites can also predict the prognosis of patients to some extent. Generally, patients with persistently increasing or uncontrollable ascites have a poorer prognosis and require closer observation and treatment.¹⁴ Winkel et al.¹⁵ constructed a CNN-based machine learning model for predicting abdominal bleeding that differentiates free gas, free fluid, and fat stranding on CT images, with 85% sensitivity and 95% specificity. Ko et al.⁹ developed an AI segmentation algorithm for the quantification of ascites based on abdominal CT, achieving an mIoU of 0.87. Lin et al.⁶ applied the U-Net to automatic segmentation of ascites in portable ultrasound, yielding Dice coefficients of 0.65 to 0.79 and confirming the viability of deep learning for this application. However, their work analyzed only the segmentation of ascites and did not include the automatic classification of abdominal bleeding. In addition, the classification of ascites required further manual diagnosis and was limited to a binary positive or negative decision.

In this work, a U-Net-based model for the multi-task segmentation and classification of ascites using ultrasound images was developed. To the best of our knowledge, this is the first study utilizing deep learning for simultaneously segmenting and classifying ascites using ultrasound images. The contributions of this work are as follows: (1) A shared-encoder multi-task framework that jointly optimizes lesion localization and grading via synergistic segmentation and classification branches. (2) Frequency-domain Channel Attention (FCA) block that amplifies salient encoder features to suppress complex ultrasound backgrounds. (3) Enhanced Multi-scale Feature Fusion (EMFF) module with channel-wise attention, aggregating skip-connected details for accurate segmentation.

Methods

Overview of Multi-Task Ascites Segmentation and Classification Network (MTASCNet)

MTASCNet takes abdominal ultrasound images as input and produces ascites segmentation and the probability of ascites grading. By leveraging a feature sharing mechanism, MTASCNet enhances the model’s analytical and processing capabilities for ascites regions. As shown in Figure 2, the network is composed of a segmentation branch and a classification branch. The segmentation branch employs a UNet architecture, with its encoder consisting of a series of ResNeXt blocks.¹⁶ The Frequency Channel Attention (FCA)¹⁷ is utilized to assign weights to different channel features output by the encoder, thereby amplifying the salience of informative features. An Enhanced Channel Attention Multi Feature Fusion (EMFF) module was designed to capture channel dependencies across feature maps by merging multi-scale information, thereby enhancing spatial representations of lesions throughout decoder stages. The decoder output stage ultimately enables ascites region segmentation in ultrasound images. Additionally, a classification branch attached to the UNet encoder final layer—processed through a Fully Connected layer and Softmax—predicts ascites severity for each input.

Figure 2.

Overview of MTASCNet.

Design of Encoder

ResNeXt extends ResNet by aggregating diverse feature representations through multi-path cardinality. Figure 3 illustrates the structural differences between these architectures. The ResNeXt backbone comprises stacked Bottleneck layers, which are specialized residual units for complex pattern extraction. Each Bottleneck implements three sequential convolutions: a 1 × 1 convolution that reduces the channel dimension of the input, a 3 × 3 grouped convolution that captures detailed patterns, and a 1 × 1 convolution that restores the original channel dimension. Batch normalization and ReLU activation follow each convolutional operation for feature stabilization.

Figure 3.

Architecture comparison of ResNet and ResNeXt building blocks, with each layer denoting input channels, kernel size, and output channels: (a) ResNet structure and (b) ResNeXt structure.

Let $X = [x_{1}, x_{2}, \dots, x_{n}]$ be an input feature vector with $n$ channels, and let $w_{i}$ be the filter weight for the $i$ -th channel. The weighted sum of the input features $x_{i}$ with the corresponding weights $w_{i}$ is:

\sum_{i = 1}^{n} w_{i} x_{i}

(1)

The grouped convolution operation in the ResNeXt architecture can be represented as:

Y = \sum_{i = 1}^{j} p_{i} (X)

(2)

where $Y$ is the output feature map, $j$ is the number of groups, and $p_{i} (X)$ denotes the transformation function applied to the input $X$ by the $i$ -th group. Adding the residual (identity) connection, the output of the Bottleneck layer in the ResNeXt model can be expressed as:

Y = X + \sum_{i = 1}^{j} p_{i} (X)

(3)

These Bottleneck layers are arranged in stages, each progressively downsampling the spatial resolution while expanding the feature depth.
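The effect of the grouped 3 × 3 convolution on model size can be checked with simple arithmetic. The sketch below counts convolution weights for one ResNet bottleneck and one ResNeXt bottleneck; the channel widths (256 input channels, widths 64 and 128, 32 groups) are the standard values of the original ResNet and ResNeXt 32 × 4d designs and are an assumption here, since the exact widths used in MTASCNet are not listed. Bias and batch-normalization parameters are ignored, and the helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Weight count of a 2D convolution without bias."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel


def bottleneck_params(c_io: int, width: int, groups: int) -> int:
    """1x1 reduce -> 3x3 (optionally grouped) -> 1x1 restore."""
    return (conv_params(c_io, width, 1)
            + conv_params(width, width, 3, groups)
            + conv_params(width, c_io, 1))


resnet_block = bottleneck_params(256, 64, groups=1)     # ResNet bottleneck
resnext_block = bottleneck_params(256, 128, groups=32)  # ResNeXt 32 x 4d bottleneck
print(resnet_block, resnext_block)
```

Under these assumptions the two blocks have nearly the same number of weights, while the ResNeXt block distributes them over 32 parallel transformation paths.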

FCA Attention Block

Ultrasound imaging typically exhibits marked speckle noise and poor contrast, often causing informative features such as ascitic regions to be submerged in background clutter. Traditional convolution captures only local feature information, which may lead to misclassification of objects. The Squeeze-and-Excitation (SE)¹⁸ attention mechanism enhances feature representation by assigning different weights to the channels. However, SE attention relies on Global Average Pooling (GAP), which retains information mainly from the lowest frequency component and neglects the others. These other frequency components potentially contain crucial channel information and comprehensive information patterns. A more potent mechanism is therefore required to accentuate the feature channels that carry rich diagnostic information. To holistically model contextual dependencies across the frequency spectrum, the Frequency Channel Attention (FCA) mechanism was adopted. FCA extends GAP to include multiple frequency components of the Discrete Cosine Transform (DCT), thus integrating a broader spectrum of frequency components. This approach broadens the range of captured information and enhances the robustness of channel representation. Thus, FCA attention was employed to model spatial relationships across channel feature maps. The framework of FCA attention is depicted in Figure 4.

Figure 4.

Frequency channel attention module.

Let $X$ denote the feature map output by the last layer of the ResNeXt encoder. $X$ is partitioned along the channel dimension into $n$ segments, $[X^{0}, X^{1}, \dots, X^{n - 1}]$, where each segment $X^{i} \in ℝ^{C' \times L}$ with $C' = C / n$ and $L = H \times W$, which requires $C$ to be a multiple of $n$. Each segment is then associated with a specific 2D DCT frequency component, so that the DCT coefficients serve as compressed channel descriptors for the attention mechanism. This yields the following formulation:

\begin{array}{l} \mathrm{Freq}^{i} = \mathrm{DCT}^{u_{i}} (X^{i}) \\ = \sum_{l = 0}^{L - 1} X_{:, l}^{i} \cos \left( \frac{\pi u_{i}}{L} \left( l + \frac{1}{2} \right) \right), \quad \text{s.t.} \; i \in \{0, 1, \dots, n - 1\} \end{array}

(4)

where $u_{i}$ is the frequency component index corresponding to $X^{i}$, and $\mathrm{Freq}^{i} \in ℝ^{C'}$ is the $C'$-dimensional vector obtained after compression.

Subsequently, the full frequency descriptor $F r e q$ is formed by concatenation:

\mathrm{Freq} = \mathrm{compress} (X) = \mathrm{cat} ([\mathrm{Freq}^{0}, \mathrm{Freq}^{1}, \dots, \mathrm{Freq}^{n - 1}])

(5)

$\mathrm{Freq} \in ℝ^{C}$ is the obtained multi-spectral vector. Thus, the entire FCA framework can be represented as:

\mathrm{atten} = \mathrm{sigmoid} (\mathrm{fc} (\mathrm{Freq}))

(6)

FCA extends the original GAP operation of the SE attention mechanism to a framework that includes multiple frequency components, enriching the compressed channel information and capturing spatial dependencies between arbitrary channel feature maps. Following filtering and weight recalibration, frequency-domain signals are converted back to the spatial domain through inverse transformation. This procedure effectively assigns more precise weights to the feature maps of different channels output by the encoder, thereby amplifying information-rich channels and enhancing the model’s sensitivity to ascitic-region features.
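The squeeze step of equations (4) to (6) can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions rather than the implementation used in MTASCNet: the cosine basis is taken over the flattened spatial positions $l$, the learned fully connected mapping `fc` is replaced by an identity placeholder, and the names `dct_component` and `fca_weights` are hypothetical.

```python
import math


def dct_component(signal, u):
    """One channel of equation (4): sum over l of x_l * cos(pi * u / L * (l + 1/2))."""
    L = len(signal)
    return sum(x * math.cos(math.pi * u / L * (l + 0.5)) for l, x in enumerate(signal))


def fca_weights(feature_map, freqs):
    """feature_map: C channels, each a flattened list of L = H * W values.
    freqs: one frequency index u_i per channel group (n groups, C divisible by n)."""
    C, n = len(feature_map), len(freqs)
    assert C % n == 0, "C must be a multiple of n"
    per_group = C // n
    # Equations (4) and (5): compress every channel with its group's frequency.
    freq = [dct_component(ch, freqs[c // per_group]) for c, ch in enumerate(feature_map)]
    # Equation (6) with the learned fc layer replaced by an identity (assumption).
    return [1.0 / (1.0 + math.exp(-f)) for f in freq]


x = [[0.2, 0.4, 0.6, 0.8], [1.0, 0.0, 1.0, 0.0]]
print(dct_component(x[0], 0), sum(x[0]))  # frequency 0 reduces to the plain sum, i.e. L * GAP
print(fca_weights(x, freqs=[0, 1]))
```

The first printed pair shows why GAP is a special case: with $u_{i} = 0$ the cosine term equals 1 and the descriptor is the channel sum, so non-zero frequencies add information that SE attention discards.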

Enhanced Channel Attention Multi Feature Fusion Module

In U-Net skip connections, concatenating high-level encoder features with up-sampled decoder maps fails to exploit the intrinsic relationships across scales. High-level features capture strong semantic meaning with coarse spatial precision, while low-level features retain fine-grained spatial details but carry weaker semantic information. An intelligent fusion mechanism is therefore required to establish channel-wise dependencies between them, so as to enrich spatial details during decoding. Multi-scale feature fusion is crucial for accurate segmentation, as high-level features encode semantic context while low-level features preserve spatial detail.

Drawing inspiration from the Efficient Channel Attention (ECA) mechanism¹⁹ and multi-scale feature fusion strategies,²⁰ we designed the Enhanced Multi-scale Feature Fusion (EMFF) module to address a critical challenge in ultrasound imaging: speckle noise significantly compromises boundary delineation. The core insight behind EMFF lies in symmetrically applying 1D convolution-based channel attention to both high-level semantic and low-level spatial feature maps before their integration. This adaptive recalibration of channel weights serves a dual purpose—attenuating noise interference while amplifying anatomical boundary signals. The rationale for this hierarchical processing stems from the complementary nature of features at different scales. High-level features encode rich semantic context essential for global understanding, whereas low-level features preserve fine-grained edge information crucial for detail restoration. By leveraging ECA to learn channel-wise importance from these multi-level representations, EMFF selectively enhances task-relevant feature maps and suppresses those contributing minimally to ascites segmentation. This dynamic weighting mechanism ensures that informative channels are emphasized, enabling more precise boundary localization despite the inherent artifacts of ultrasound imaging.

The EMFF is illustrated in Figure 5. The enhanced channel attention mechanism was applied to both high-order and low-order features. The aim is to increase the weight of significant information in each feature channel for the segmentation task and to disregard useless feature information.

Figure 5.

Enhanced channel attention multi feature fusion module.

First, the high-level features $X_{Hinput}^{*} \in ℝ^{C \times H \times W}$ are passed through a $3 \times 3$ convolutional layer followed by a $1 \times 1$ convolutional layer to obtain $X_{Hinput} \in ℝ^{C \times H \times W}$ :

X_{Hinput} = \mathrm{Conv}_{1 \times 1} (\mathrm{Conv}_{3 \times 3} (X_{Hinput}^{*}))

(7)

$X_{Hinput}$ and $X_{Linput}$ have the same number of channels. Subsequently, GAP is employed to compress each feature map $X$, resulting in a feature representation with dimensions $C \times 1 \times 1$ :

\mathrm{GAP} (X)_{c} = \frac{1}{H \times W} \sum_{h = 1}^{H} \sum_{w = 1}^{W} X_{c, h, w}

(8)

Following the GAP, a one-dimensional convolution with a kernel size of $K$ is applied to $\mathrm{GAP} (X)$ to rapidly extract local relationships among $K$ neighboring channels. The output of the one-dimensional convolution is passed through the sigmoid function, yielding weights $ω \in ℝ^{C \times 1 \times 1}$ that reflect local channel interdependencies and relative importance. The sigmoid function and the weights $ω$ are given in equations (9) and (10), where $\mathrm{C1D}$ denotes one-dimensional convolution.

\mathrm{Sigmoid} (x) = \frac{1}{1 + e^{- x}}

(9)

ω = \mathrm{Sigmoid} (\mathrm{C1D}_{K} (\mathrm{GAP} (X)))

(10)

Applying equation (10) to the low-level and high-level features yields two sets of attention weights, which enhance important channel features by assigning them higher values and suppress ineffective channel features by assigning them lower values:

z_{1} = \mathrm{Sigmoid} (\mathrm{C1D}_{K} (\mathrm{GAP} (X_{Linput})))

(11)

z_{2} = \mathrm{Sigmoid} (\mathrm{C1D}_{K} (\mathrm{GAP} (X_{Hinput})))

(12)

z = F_{add} (z_{1}, z_{2})

(13)

$X_{Houtput}$ is obtained by channel-wise multiplication of $X_{Hinput}$ and $z$ :

X_{Houtput} = X_{Hinput} \otimes z

(14)

To enhance feature representation and enrich semantic information, $X_{Linput}$ and $X_{Houtput}$ are concatenated. The final output of EMFF, $X_{H}^{*}$, is then obtained by passing the concatenated features through two convolutional layers that capture semantic information:

X_{H}^{*} = \mathrm{Conv}_{1 \times 1} (\mathrm{Conv}_{3 \times 3} (\mathrm{concat} (X_{Linput}, X_{Houtput})))

(15)
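The channel re-weighting at the core of EMFF (GAP, one-dimensional convolution across channels, sigmoid, addition of the two weight vectors, and channel-wise multiplication) can be sketched in plain Python. Several details are assumptions made for illustration: zero padding at the channel borders, an odd kernel length, a single kernel shared by both branches, element-wise addition for $F_{add}$, and fixed kernel values in place of learned ones. The surrounding convolutional layers are omitted, and all function names are hypothetical.

```python
import math


def gap(feature):
    """Global average pooling; feature holds C channels of flattened H * W values."""
    return [sum(ch) / len(ch) for ch in feature]


def conv1d_same(values, kernel):
    """1D convolution across the channel axis with zero padding (assumed)."""
    k, pad = len(kernel), len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(values))]


def channel_weights(feature, kernel):
    """Sigmoid(C1D_K(GAP(X))): one weight per channel."""
    return [1.0 / (1.0 + math.exp(-v)) for v in conv1d_same(gap(feature), kernel)]


def emff_reweight(x_low, x_high, kernel):
    """Combine low-level and high-level channel weights and rescale the high-level map."""
    z1 = channel_weights(x_low, kernel)
    z2 = channel_weights(x_high, kernel)
    z = [a + b for a, b in zip(z1, z2)]                       # F_add, assumed element-wise
    return [[v * w for v in ch] for ch, w in zip(x_high, z)]  # channel-wise multiplication


x_low = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]
x_high = [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
print(emff_reweight(x_low, x_high, kernel=[0.25, 0.5, 0.25]))
```

Because each weight vector lies in (0, 1), the summed weight $z$ lies in (0, 2), so a channel can be either attenuated or amplified relative to its input.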

Datasets and Preprocessing

This retrospective study was approved by the Ethics Committee of the Second Affiliated Hospital of Naval Medical University (2024SL024). A dataset of 542 images from 315 unique patients (163 male, 152 female) was constructed. Images were acquired using four systems: Mindray DC-70Pro (297 images), Siemens Acuson Sequoia (129 images), GE LOGIQ E9 (55 images), and Philips LU22 (61 images). The dataset includes three clinically stratified categories: 103 images of mild ascites (Ascites-1), 394 images of moderate ascites (Ascites-2), and 45 images of severe ascites (Ascites-3). The dataset focuses on Grades 1 to 3 because the study aims to assist in grading severity for rapid surgical triage in trauma patients in whom free fluid is already suspected, rather than in screening subjects without ascites (Grade 0). All annotations were performed using ITK-SNAP. In cases of disagreement between the two primary radiologists, a consensus was reached via discussion; the senior radiologist corrected approximately 15% of the initial annotations. The dataset was partitioned into training (80%) and testing (20%) sets at the patient level using stratified sampling to maintain the class distribution, with fivefold cross-validation implemented to maximize data utilization. During preprocessing, all DICOM images were converted to PNG format. Original resolutions were preserved during annotation, with resizing to 224 × 224 performed only at the network input stage.
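A patient-level, grade-stratified partition of this kind can be sketched as follows. The exact splitting procedure is not specified in the text, so this is only an illustration under stated assumptions: each patient is assigned a single grade (the first one encountered), patients are dealt round-robin within each grade without shuffling, and the function name `patient_level_folds` and the toy patient identifiers are hypothetical.

```python
from collections import defaultdict


def patient_level_folds(records, n_folds=5):
    """records: iterable of (patient_id, grade) pairs, one per image.
    Returns n_folds disjoint sets of patient ids with similar grade proportions."""
    by_grade = defaultdict(list)
    seen = set()
    for patient_id, grade in records:
        if patient_id not in seen:  # one entry per patient, first grade seen (assumption)
            seen.add(patient_id)
            by_grade[grade].append(patient_id)
    folds = [set() for _ in range(n_folds)]
    for grade in sorted(by_grade):
        for i, patient_id in enumerate(sorted(by_grade[grade])):
            folds[i % n_folds].add(patient_id)
    return folds


# Toy example: 30 patients, two images each, grades cycling through 1, 2, 3.
toy_records = [(f"P{i:02d}", 1 + i % 3) for i in range(30) for _ in range(2)]
toy_folds = patient_level_folds(toy_records)
print([len(fold) for fold in toy_folds])
```

In each cross-validation round one fold serves as the test set (20% of patients) and the remaining four as the training set, so all images of a given patient stay on the same side of the split.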

Loss Function

The Cross-Entropy loss (CELoss)²¹ function is employed to constrain the training process of the classification model, with the calculation formula given by:

L_{cls} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 1}^{K} y_{i k} \log p (x_{i k})

(16)

where $N$ is the number of samples and $K$ is the number of categories. $y_{i k}$ is the true label (0 or 1) of the $k$-th category for the $i$-th sample, i.e. its one-hot encoding, and $p (x_{i k})$ is the predicted probability of the $k$-th category for the $i$-th sample (output from the softmax function).

In segmentation tasks, a Dice coefficient-based segmentation loss²² was utilized to address the class imbalance issue between the foreground and background in images. The segmentation loss is defined as:

L_{dice} (P_{seg}, Y_{seg}) = 1 - \frac{2 P_{seg} Y_{seg} + 1}{P_{seg} + Y_{seg} + 1}

(17)

where $L_{dice}$ represents the Dice segmentation loss, and $P_{seg}$ and $Y_{seg}$ denote the predicted segmentation output of the proposed network and the corresponding ground truth, respectively.

The classification loss $L_{cls}$ and the segmentation loss $L_{seg} = L_{dice}$ are integrated into a multi-task loss via a hyperparameter $λ$ , formulated as:

L_{total} = λ L_{cls} + (1 - λ) L_{seg}

(18)

where $L_{total}$ represents the multi-task loss, and $λ \in [0, 1]$ is the weight assigned to the classification task.
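Equations (16) to (18) can be evaluated on toy values to make the roles of the individual terms concrete. The sketch below is a plain-Python illustration, not the training code: labels are given as class indices rather than one-hot vectors, the products and sums in the Dice loss are taken over all pixels of the flattened masks, and the value of λ is illustrative because it is not reported here.

```python
import math


def cross_entropy(probs, labels):
    """Equation (16): probs[i][k] are softmax outputs, labels[i] is the true class index."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(pred, target, smooth=1.0):
    """Equation (17) on flattened masks, with the smoothing constant 1."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)


def total_loss(l_cls, l_seg, lam):
    """Equation (18): convex combination of the two task losses."""
    assert 0.0 <= lam <= 1.0
    return lam * l_cls + (1.0 - lam) * l_seg


probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # two samples, three grades
labels = [0, 1]
pred_mask = [0.9, 0.8, 0.1, 0.0]
true_mask = [1, 1, 0, 0]
l_cls = cross_entropy(probs, labels)
l_seg = dice_loss(pred_mask, true_mask)
print(l_cls, l_seg, total_loss(l_cls, l_seg, lam=0.5))  # lam = 0.5 is illustrative only
```

The smoothing constant keeps the Dice loss defined when both masks are empty, and a perfect binary prediction gives a loss of exactly zero.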

Experiment Details

To address the limited sample size in severe ascites cases, an adaptive augmentation approach was implemented involving random rotation within ± 15 °, horizontal and vertical flipping with 50% probability. Given the class imbalance, severe ascites cases received three times more augmentations than other categories.

Training employed the Adam optimizer with a warmup period of five epochs and an initial learning rate of 1e-4, which was decayed by a factor of 0.1 at epochs 50 and 75. Momentum parameters were set at β1 = 0.9 and β2 = 0.99, with L2 regularization through a weight decay of 0.0001. Hyperparameters were empirically optimized using the validation folds during the cross-validation process. The batch size was 8, and training ran for a maximum of 100 epochs. An early stopping mechanism monitored the combined multi-task validation loss (equation (18)); training was halted if this validation loss failed to decrease for 15 consecutive epochs, to prevent overfitting. All experiments were conducted on an NVIDIA V100-SXM2-16GB GPU with a Xeon Gold 6248R CPU using the PyTorch 1.13.0 framework. The implementation used Python 3.9 under the Windows 10 Professional operating system.
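The learning-rate schedule and the early stopping rule described above can be sketched independently of any deep learning framework. The shape of the warmup is not stated, so a linear ramp up to the base rate over the first five epochs is assumed here; epochs are counted from zero, and the names `learning_rate` and `EarlyStopping` are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, warmup_epochs=5, milestones=(50, 75), gamma=0.1):
    """Learning rate for a 0-indexed epoch: linear warmup (assumed), then step decay."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * gamma ** sum(epoch >= m for m in milestones)


class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


for epoch in (0, 4, 49, 50, 75):
    print(epoch, learning_rate(epoch))
```

In a training loop, `EarlyStopping.step` would be called once per epoch with the combined validation loss of equation (18), and training would end at the first `True` return or after 100 epochs.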

Evaluation Metrics

Four commonly used metrics were employed to evaluate the effectiveness of ascites segmentation: Dice similarity coefficient (Dice), Jaccard index (Jaccard), 95% asymmetric Hausdorff distance (95HD), and average surface distance (ASD).²¹ The calculation methods for these indices are as follows:

Dice = \frac{2 | A \cap B |}{| A | + | B |}

(19)

Jaccard = \frac{| A \cap B |}{| A \cup B |}

(20)

95HD = \max (\frac{1}{n} \sum_{i = 1}^{n} HD (A_{i}, B), \frac{1}{m} \sum_{j = 1}^{m} HD (B_{j}, A))

(21)

ASD = \frac{1}{| A | + | B |} \left( \sum_{x \in A} \min_{y \in B} d (x, y) + \sum_{y \in B} \min_{x \in A} d (y, x) \right)

(22)

where $A$ and $B$ denote the predicted segmentation and the ground truth, respectively, $x$ and $y$ represent pixels in $A$ and $B$, and $d (x, y)$ is the distance between them. Dice and Jaccard metrics are responsive to ascites volume, whereas 95HD reflects contour accuracy.

For the ascites classification task, recall (REC), precision (PRE), and accuracy (ACC)²¹ were used for quantitative evaluation:

REC = \frac{T P}{T P + F N}

(23)

PRE = \frac{T P}{T P + F P}

(24)

ACC = \frac{T P + T N}{T P + F P + T N + F N}

(25)

Where TP, FP, TN, and FN represent the counts of true positives, false positives, true negatives, and false negatives, respectively.
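The overlap metrics of equations (19) and (20) and the classification metrics of equations (23) to (25) can be computed directly from pixel sets and confusion counts. The sketch below represents masks as sets of pixel coordinates and assumes that neither mask is empty and that no denominator is zero; the toy masks and counts are invented for illustration only and are not results from this study.

```python
def dice(a: set, b: set) -> float:
    """Equation (19): a and b are sets of foreground pixel coordinates."""
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: set, b: set) -> float:
    """Equation (20)."""
    return len(a & b) / len(a | b)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Equations (23) to (25) for one class treated as positive."""
    return {
        "REC": tp / (tp + fn),
        "PRE": tp / (tp + fp),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
    }


pred = {(0, 0), (0, 1), (1, 1)}
truth = {(0, 1), (1, 1), (1, 2)}
d, j = dice(pred, truth), jaccard(pred, truth)
print(d, j, d / (2 - d))  # Jaccard equals Dice / (2 - Dice)
print(classification_metrics(tp=8, fp=2, tn=85, fn=5))
```

The identity Jaccard = Dice / (2 − Dice) explains why the two overlap metrics always rank segmentations of the same image in the same order.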

Results

Selection of Feature Extraction Backbone

The backbone selection process involved a three-phase evaluation: (1) architectural suitability for medical imaging, (2) computational efficiency, and (3) feature extraction capability. Systematic comparisons of eight modern architectures (ResNet50,²³ VGG16,²⁴ Xception,²⁵ EfficientNetB4,²⁶ DenseNet,²⁷ MobileNet,²⁸ MiT,²⁹ and ResNeXt¹⁶) were conducted using fivefold cross-validation. The reported metrics represent the average performance on the strictly patient-isolated test sets across all five folds. To ensure a fair comparison, the general data augmentation and epoch scheduling were kept consistent, while the specific hyperparameters (e.g., initial learning rate and weight decay) of each baseline model were independently optimized via grid search on the validation folds to guarantee optimal convergence for each architecture. Each model was evaluated without additional modules to isolate backbone performance. The comparative segmentation and classification outcomes are detailed in Table 1. Among these, the ResNeXt-based model achieved the best performance on all four segmentation metrics (Dice 84.16%, Jaccard 77.68%, 95HD 10.40, and ASD 1.87) and the best or tied-best results on the four classification metrics (ACC 85.07%, PRE 85.58%, REC 73.31%, and F1 score 84.62%). ResNeXt50 emerged as optimal due to its combination of (1) grouped convolutions (32 × 4d template) that efficiently capture ultrasound speckle patterns and (2) higher parameter efficiency, which reduces the risk of overfitting. These findings suggest that the ResNeXt architecture is particularly well suited to the analysis of the abdominal ultrasound dataset.

Table 1.

Ascites Segmentation and Classification Results of Different Backbones.

Backbone MA	Segmentation (Mean ± Std)				Classification (Mean ± Std)
Backbone MA	Dice	Jaccard	95HD	ASD	ACC	PRE	REC	F1
ResNet²³	83.14% ± 0.0206	75.42% ± 0.0206	10.52 ± 0.7946	2.28 ± 0.3418	85.07% ± 0.0293	85.34% ± 0.0299	73.12% ± 0.0272	84.57% ± 0.0313
Vgg²⁴	83.10% ± 0.0118	76.30% ± 0.0124	10.90 ± 1.276	2.00 ± 0.3626	75.50% ± 0.1072	61.12% ± 0.1952	44.36% ± 0.1351	66.93% ± 0.1615
Xception²⁵	84.06% ± 0.0181	77.58% ± 0.0197	10.59 ± 0.7647	1.93 ± 0.3068	84.15% ± 0.0359	84.34% ± 0.0398	71.98% ± 0.0434	83.90% ± 0.0392
EffNetb4²⁶	83.88% ± 0.0205	76.04% ± 0.0220	10.59 ± 0.8302	2.26 ± 0.3429	83.97% ± 0.0574	84.50% ± 0.0503	68.98% ± 0.0635	83.15% ± 0.0642
DenseNet²⁷	80.06% ± 0.0258	71.54% ± 0.0257	11.69 ± 0.7671	2.30 ± 0.1933	84.92% ± 0.0355	84.87% ± 0.0369	73.22% ± 0.0379	84.42% ± 0.0383
MobileNet²⁸	70.08% ± 0.0251	58.86% ± 0.0211	18.51 ± 1.272	4.31 ± 0.5324	74.93% ± 0.0556	73.27% ± 0.0812	50.05% ± 0.0817	72.38% ± 0.0743
Mit²⁹	74.64% ± 0.0304	65.02% ± 0.0302	13.71 ± 1.267	3.09 ± 0.3958	72.72% ± 0.0826	53.57% ± 0.1198	33.33% ± 0.0038	61.50% ± 0.1094
Ours (ResNext)	84.16% ± 0.0183	77.68% ± 0.0210	10.40 ± 0.915	1.87 ± 0.2402	85.07% ± 0.0363	85.58% ± 0.0321	73.31% ± 0.0591	84.62% ± 0.0407

The best results are marked with bold text.

Ablation Experiment

In this section, the effectiveness of the ResNeXt backbone, FCA, and EMFF modules within the proposed network is verified. The baseline (ResUNet) was constructed by (1) replacing the MTASCNet ResNeXt backbone with the ResNet backbone, (2) removing the FCA from MTASCNet, and (3) removing the EMFF from MTASCNet. Table 2 compares the full model with all ablated variants. Figure 6 illustrates the segmentation results generated by the different components. In terms of segmentation, the baseline achieved Dice, Jaccard, 95HD, and ASD scores of 83.14%, 75.42%, 10.52, and 2.28, respectively. In classification, the baseline achieved ACC, PRE, REC, and F1 scores of 85.07%, 85.34%, 74.12%, and 84.97%, respectively. Compared to the baseline, the baseline + ResNeXt model (ResNextUNet) showed an improvement of 1.02% in Dice. By comparing columns 2, 3, and 4 of Figure 6, it can be observed that the addition of ResNeXt helps the network to better focus on the location of the lesions and reduces the interference from complex backgrounds, which allows smaller lesions to be detected and reduces false positives and false negatives. Secondly, the performance of the FCA module was evaluated. The FCA module was further added, and this model is denoted as ResNextUNet + FCA. Compared to the baseline, the ResNextUNet + FCA model improved Dice and ACC by 1.46% and 0.75%, respectively. Figure 6, particularly columns 2, 3, and 5, indicates that the inclusion of FCA enables the network to better capture the edge integrity of the target area, demonstrating that FCA effectively broadens the range of captured information. Thirdly, the EMFF module was added, referred to as ResNextUNet + EMFF, and its performance was assessed. After adding the EMFF module, Dice and ACC improved over the baseline by 2.06% and 1.02%, respectively. The comparison between columns 2, 3, and 6 of Figure 6 indicates that the addition of the EMFF module allows the network to better preserve the structural integrity of the effusion area. Due to the complex mixture and similar contrast of foreground and background information, basic skip connections struggle to accurately extract useful information. To overcome this limitation, the EMFF module enhances the skip connections by integrating a multi-scale enhanced channel attention module, thereby capturing more accurate and detailed edge and structural information in US images. The ablation analysis confirms that each module of MTASCNet contributes to improving the segmentation and classification performance on US images.

Table 2.

The Ablation Study of Each Component of the MTASCNet.

Model	Segmentation (Mean ± Std)				Classification (Mean ± Std)
Model	Dice	Jaccard	95HD	ASD	ACC	PRE	REC	F1
ResUNet	83.14% ± 0.0206	75.42% ± 0.0206	10.52 ± 0.7946	2.28 ± 0.3418	85.07% ± 0.0293	85.34% ± 0.0299	74.12% ± 0.0272	84.97% ± 0.0313
ResNextUNet	84.16% ± 0.0183	77.68% ± 0.0210	10.40 ± 0.915	1.87 ± 0.2402	85.07% ± 0.0363	85.58% ± 0.0321	73.31% ± 0.0591	84.62% ± 0.0407
ResNextUNet + FCA	84.60% ± 0.0262	78.04% ± 0.0265	10.30 ± 1.093	2.40 ± 0.4692	85.82% ± 0.0533	85.87% ± 0.0551	77.07% ± 0.0851	85.36% ± 0.0592
ResNextUNet + EMFF	85.20% ± 0.0243	78.30% ± 0.0257	10.38 ± 0.8592	1.95 ± 0.2804	86.09% ± 0.0579	86.30% ± 0.0571	80.06% ± 0.1088	85.64% ± 0.0622
ResNextUNet + FCA + EMFF (MTASCNet)	85.28% ± 0.0254	78.48% ± 0.0266	10.06 ± 0.9276	2.05 ± 0.2548	86.18% ± 0.0397	86.49% ± 0.0429	73.76% ± 0.0726	85.69% ± 0.0465

The best results are marked with bold text.

Figure 6.

Visual results of segmentation with ablation experiments. From left to right are images of Input, Ground truth, UNet, ResNextUNet, ResNextUNet + FCA, ResNextUNet + EMFF and MTASCNet.

Grouped Test Results

Ultrasound images of ascites vary significantly across different grades. Comparative experiments were conducted on Ascites-1, Ascites-2, and Ascites-3 images to assess the network’s robustness in segmenting ascites of varying severity. To mitigate the influence of the notable differences among ultrasound images that display peritoneal effusion in different regions, the dataset was grouped by grade and each group was tested separately. Three sets of models were trained using the 103 Ascites-1, 394 Ascites-2, and 45 Ascites-3 images, respectively, and the final average segmentation results under fivefold cross-validation are reported. As shown in Table 3, the average Dice coefficients for ascites grades 1, 2, and 3 were 59.26%, 86.46%, and 84.88%, respectively. The segmentation accuracy for Ascites-1 and Ascites-3 was lower than that for Ascites-2, with the largest gap for Ascites-1.

Table 3.

Comparative Segmentation Results on Ascites-1, Ascites-2, and Ascites-3 Ultrasound Images.

Groups	Dice	Jaccard	95HD	ASD
Ascites-1	59.26% ± 0.0518	49.22% ± 0.0491	30.34 ± 8.162	10.89 ± 5.814
Ascites-2	86.46% ± 0.0341	79.38% ± 0.0405	9.92 ± 2.642	1.69 ± 0.7989
Ascites-3	84.88% ± 0.0222	75.08% ± 0.0294	11.75 ± 4.372	1.71 ± 1.075
All	85.28% ± 0.0254	78.48% ± 0.0266	10.06 ± 0.927	2.05 ± 0.2548

Comparison With State-of-the-Art (SOTA)

Table 4 compares the proposed MTASCNet with 16 advanced methods, including seven of the latest single-task classification methods (Vgg16,²⁴ ResNet18,²³ ResNet50,²³ DenseNet,²⁷ EfficientNetB4,²⁶ ViT,³⁰ and Swin Transformer³¹), six of the latest single-task segmentation methods (Unet,³² UNet ++,³³ FPN,³⁴ MANet,²⁰ MDANet,³⁵ and MFMSNet³⁶), and three state-of-the-art multi-task learning methods (ASCNet,³⁷ LogoNet,³⁸ and Aumente-Maestro et al.³⁷). All models were trained and tested using the same data partitioning, and the results of fivefold cross-validation are reported. To establish a rigorous baseline, we also implemented a structured radiomics-based pipeline. This approach extracted a comprehensive set of 107 hand-crafted features, including 2D shape, first-order statistics, and high-order texture matrices (GLCM, GLRLM, GLSZM, GLDM, and NGTDM), via the PyRadiomics library. To handle the high dimensionality of the feature space, Lasso-based feature selection (using L1-regularized logistic regression) was applied to identify the most discriminative radiomic signatures. Classification was performed using a Random Forest model. As shown in Table 4, the radiomics-based pipeline achieved a mean accuracy of 83.57%, outperforming several deep learning architectures such as VGG (77.49%), ResNet50 (76.76%), and even Transformer-based models such as ViT (73.06%). Among the single-task classification baselines, only EfficientNet-b4 (84.14% ± 0.0240) slightly surpassed the radiomics approach. The proposed model achieved the best overall classification performance, with an accuracy of 86.18%, a 2.61% improvement over the radiomics baseline.

Table 4.

Quantitative Performance Against State-of-the-Art Methods.

Task	Model	Segmentation (Mean ± Std)				Classification (Mean ± Std)
Task	Model	Dice	Jaccard	95HD	ASD	ACC	PRE	REC	F1
Classification	Radiomics	-	-	-	-	83.57% ± 0.0243	80.41% ± 0.0687	60.12% ± 0.0409	63.00% ± 0.0399
	Vgg²⁴	-	-	-	-	77.49% ± 0.0098	75.50% ± 0.0761	51.01% ± 0.0267	54.65% ± 0.0301
	ResNet18²³	-	-	-	-	81.00% ± 0.0109	72.66% ± 0.1035	61.39% ± 0.0663	64.83% ± 0.0782
	ResNet50²³	-	-	-	-	76.76% ± 0.0098	74.50% ± 0.1741	43.92% ± 0.0297	45.91% ± 0.0480
	DenseNet²⁷	-	-	-	-	83.22% ± 0.0140	82.35% ± 0.0503	66.78% ± 0.0339	71.14% ± 0.0219
	EffNetb4²⁶	-	-	-	-	84.14% ± 0.0240	81.23% ± 0.0560	69.70% ± 0.0530	73.25% ± 0.0518
	ViT³⁰	-	-	-	-	73.06% ± 0.0043	34.43% ± 0.1344	34.43% ± 0.0154	30.22% ± 0.0295
	Swin³¹	-	-	-	-	72.69% ± 0.0038	24.23% ± 0.0013	33.33% ± 0.0000	28.06% ± 0.0009
Segmentation	Unet³²	85.10% ± 0.02219	77.90% ± 0.02358	9.27 ± 0.5223	1.72 ± 0.2422	-	-	-	-
	Unet ++³³	81.84% ± 0.01957	73.56% ± 0.01976	12.20 ± 0.8238	2.54 ± 0.1405	-	-	-	-
	FPN³⁴	83.16% ± 0.02157	75.46% ± 0.02328	5.93 ± 0.3461	1.21 ± 0.165	-	-	-	-
	MANet²⁰	85.36% ± 0.02566	78.54% ± 0.03064	8.79 ± 0.7762	1.77 ± 0.3333	-	-	-	-
	MDA-Net³⁵	83.58% ± 0.02758	76.48% ± 0.02936	6.17 ± 0.6236	1.11 ± 0.1504	-	-	-	-
	MFMSNet³⁶	78.46% ± 0.01686	68.92% ± 0.018	13.16 ± 0.6978	3.16 ± 0.3352	-	-	-	-
Multi-task	ACSNet³⁷	84.58% ± 0.019	77.28% ± 0.021	10.76 ± 1.026	2.28 ± 0.577	84.15% ± 0.05357	85.28% ± 0.04831	74.95% ± 0.0678	83.98% ± 0.05089
	LogoNet³⁸	79.10% ± 0.02456	72.26% ± 0.021	3.63 ± 0.1773	0.59 ± 0.06405	83.41% ± 0.03878	84.37% ± 0.03288	69.50% ± 0.03103	83.02% ± 0.04372
	Aumente-Maestro et al.³⁷	56.52% ± 0.02283	44.40% ± 0.02033	17.35 ± 1.487	5.67 ± 0.712	77.88% ± 0.05649	78.37% ± 0.03267	48.71% ± 0.04788	73.65% ± 0.06515
	MTASCNet	85.28% ± 0.02546	78.48% ± 0.02669	10.06 ± 0.9276	2.05 ± 0.2548	86.18% ± 0.03977	86.49% ± 0.04298	73.76% ± 0.07262	85.69% ± 0.04658

Best results are highlighted in bold.
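
The margins quoted for MTASCNet can be checked directly against the mean values in Table 4. The short sketch below does this arithmetic in percentage points; the dictionary layout and the helper name `margin` are illustrative choices, and only the numbers come from the table.

```python
# Mean scores in percent, copied from Table 4.
dice_scores = {"Unet": 85.10, "MANet": 85.36, "ACSNet": 84.58,
               "LogoNet": 79.10, "MTASCNet": 85.28}
accuracy = {"Radiomics": 83.57, "EffNetb4": 84.14, "ACSNet": 84.15,
            "LogoNet": 83.41, "MTASCNet": 86.18}


def margin(scores, model, baseline):
    """Difference in percentage points between two entries, rounded to 2 decimals."""
    return round(scores[model] - scores[baseline], 2)


acc_vs_radiomics = margin(accuracy, "MTASCNet", "Radiomics")
acc_vs_acsnet = margin(accuracy, "MTASCNet", "ACSNet")
dice_vs_acsnet = margin(dice_scores, "MTASCNet", "ACSNet")
dice_vs_manet = margin(dice_scores, "MTASCNet", "MANet")
print(acc_vs_radiomics, acc_vs_acsnet, dice_vs_acsnet, dice_vs_manet)
```

The last value is negative: the single-task MANet has a marginally higher mean Dice than MTASCNet, while MTASCNet leads the multi-task group in both Dice and accuracy.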

Discussion

In this study, a multi-task deep network was developed for the analysis of ascites in ultrasound images. The proposed model achieved satisfactory segmentation and classification performance compared with state-of-the-art networks, reaching a segmentation Dice of 85.28% and a classification accuracy of 86.18%. The results indicate that the proposed network may be useful in detecting and grading ascites in ultrasound imaging.

Ascites segmentation and classification in ultrasound imaging face substantial challenges, including variable lesion size, indistinct and irregular boundaries, and poor image signal-to-noise ratios.³⁹ MTASCNet was designed as a multi-task learning model specifically for the segmentation and classification of ascites in abdominal ultrasound images. The effectiveness of its components and parameters was evaluated through ablation studies, and the feature extraction capability of ResNext was explored. Building on the strong feature extraction ability of ResNext, MTASCNet first extracts basic patterns in the shallow layers and then identifies complex high-level semantic features. This allows it to analyze the ascites region accurately while suppressing background information from normal tissue, using contextual information to improve accuracy. The extracted deep features are then refined by the FCA module. FCA broadens the range of captured information and enhances the robustness of the channel representation by extending the global average pooling (GAP) of conventional channel attention to multiple frequency components of the discrete cosine transform (DCT), thereby capturing dependencies between channel feature maps and improving the model’s ability to recognize and use contextual information. At the same time, EMFF integrates contextual information from multiple resolutions to adapt to the morphological changes of ascites, further improving segmentation accuracy for ultrasound ascites.

Interestingly, although adding modules typically raises the risk of overfitting on small datasets, the integration of FCA and EMFF improved generalization performance in the ablation experiments. A plausible explanation is that these attention mechanisms act as structural regularizers rather than simple increases in capacity. By adaptively suppressing irrelevant background speckle noise and focusing on salient frequency components and morphological boundaries, they constrain the hypothesis space. This discourages the network from memorizing noisy artifacts and thereby mitigates overfitting despite the increased model complexity.

The performance of MTASCNet was further examined separately on Ascites-1, Ascites-2, and Ascites-3 ultrasound images. MTASCNet achieved a Dice of 59.26% on Ascites-1 images, 86.46% on Ascites-2 images, and 84.88% on Ascites-3 images. Visualization of the segmentation results showed that images with large ascites regions were segmented more accurately, whereas images with small ascites regions had lower segmentation accuracy and were more prone to recognition errors. Ascites-1 images contain comparatively small lesions, whereas Ascites-2 images exhibit larger fluid collections, which may explain why segmentation of Ascites-2 was superior to that of Ascites-1. For Ascites-3, the slightly lower Dice than for Ascites-2 may be related to the limited training data for this grade.
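
The relationship between GAP and the DCT that underlies FCA can be illustrated with a small standard-library sketch. It projects a single channel onto 2D DCT basis functions and shows that the lowest-frequency component is GAP up to a constant factor, while higher-frequency components respond to spatial variation that GAP averages away. The function names, the unnormalized basis, the toy feature map, and the choice of frequency indices are all assumptions for illustration; the sketch does not reproduce the FCA module used in MTASCNet.

```python
import math


def dct_basis(u, v, height, width):
    """Unnormalized 2D DCT-II basis function for frequency indices (u, v)."""
    return [[math.cos(math.pi * u * (i + 0.5) / height)
             * math.cos(math.pi * v * (j + 0.5) / width)
             for j in range(width)]
            for i in range(height)]


def frequency_descriptor(feature_map, u, v):
    """Project one channel's feature map onto the (u, v) DCT basis."""
    height, width = len(feature_map), len(feature_map[0])
    basis = dct_basis(u, v, height, width)
    return sum(feature_map[i][j] * basis[i][j]
               for i in range(height) for j in range(width))


# Toy 2x4 feature map for a single channel (hypothetical values).
channel = [[1.0, 2.0, 3.0, 4.0],
           [1.0, 2.0, 3.0, 4.0]]
gap = sum(map(sum, channel)) / (2 * 4)
lowest = frequency_descriptor(channel, 0, 0)      # equals H * W * GAP
horizontal = frequency_descriptor(channel, 0, 1)  # responds to the left-to-right ramp
vertical = frequency_descriptor(channel, 1, 0)    # close to zero: both rows are identical
print(gap, lowest, horizontal, vertical)
```

In a full channel-attention module, descriptors of this kind would be passed through a gating function that reweights the channels; that step is omitted here.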

In the classification task, MTASCNet grades ascites while simultaneously segmenting the ascites regions. Deep learning has previously been applied to the automatic classification of ascites to assess the prognosis of advanced schistosomiasis.⁴⁰ Other studies have proposed deep learning models for the classification of ascites using CT images.⁹,¹⁵ However, CT is a time-consuming examination that involves ionizing radiation and is not usually the first clinical choice for ascites examination. Lin et al.⁶ used Unet to segment ascites regions in ultrasound images, followed by a binary classification that still requires further manual diagnosis. To the best of our knowledge, this study represents the first application of deep learning for the simultaneous segmentation and classification of ascites using ultrasound images.

The comparative analysis showed that the CNN-based single-task classification networks achieved high accuracy, whereas the transformer-based networks (ViT and Swin Transformer) collapsed, with recall close to 33%. The radiomics baseline achieved 83.57% accuracy, outperforming several deep models (e.g., VGG and ViT). This likely stems from the task’s reliance on geometric dimensions (area, diameter), which radiomics calculates directly and precisely. While complex architectures such as Transformers struggle with ultrasound speckle noise, the radiomics pipeline, refined by Lasso selection, remains robust. The proposed model achieved the highest accuracy of 86.18% by combining this geometric sensitivity with the noise suppression of deep learning. It goes beyond manual feature engineering by capturing complex non-linear information while maintaining clinical interpretability.

Since single-task classification networks take global abdominal images as input, the CNN’s ability to convolve local features may better distinguish ascites regions from normal tissues.⁴¹ Moreover, multi-task networks, by sharing features with the segmentation network, can better focus on ascites regions and thus produce accurate classification results.²¹,³⁷ Similarly, in the segmentation task, the proposed network achieved the highest Dice and Jaccard among the multi-task networks and was comparable to the best single-task segmentation network (MANet, 85.36% Dice). This may be attributed to the attention mechanisms in the multi-task network, which help the model detect salient features more accurately, especially small ascites regions in ultrasound images, enhancing sensitivity to small targets. Several compared methods, however, obtained lower 95HD and ASD values (Table 4).
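
The collapse of the transformer-based classifiers can be made concrete with a small worked example. The sketch below evaluates a degenerate three-grade classifier that always predicts the most frequent grade. The class counts are hypothetical, and macro averaging with a precision of 0 for never-predicted classes is an assumed convention that may differ from the averaging used for Table 4.

```python
def macro_metrics(y_true, y_pred, classes):
    """Accuracy plus macro-averaged precision and recall.

    Assumed convention: a class that is never predicted contributes a precision of 0.
    """
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        support = sum(t == c for t in y_true)
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / support if support else 0.0)
    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n


# Hypothetical grade counts; not the class distribution of the study dataset.
y_true = [1] * 15 + [2] * 73 + [3] * 12
y_pred = [2] * len(y_true)  # collapsed classifier: always the majority grade
acc, macro_pre, macro_rec = macro_metrics(y_true, y_pred, classes=[1, 2, 3])
print(f"ACC = {acc:.4f}, PRE = {macro_pre:.4f}, REC = {macro_rec:.4f}")
```

Under these assumptions, accuracy equals the prevalence of the majority grade, macro recall is exactly one third, and macro precision is one third of the accuracy. The Swin Transformer row of Table 4 shows the same pattern (REC 33.33% ± 0.0000, and PRE 24.23% is one third of ACC 72.69%), which is consistent with, though not proof of, a model that predicts a single grade in every fold.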

In most urgent and emergency situations, clinicians are expected to make a diagnosis in real time during ultrasound scanning, which is usually time-consuming and labor-intensive. Missed diagnoses are possible in cases with minimal effusion.⁴² Artificial intelligence algorithms can provide real-time ultrasound diagnostic support with high diagnostic accuracy. We are currently incorporating MTASCNet into a portable ultrasound system for the real-time diagnosis of ascites.

This study has some limitations. First, there may be a classification bias due to the imbalance among ascites samples of grades 1 to 3. In the future, more grade 1 and grade 3 data will be included in training to mitigate the impact of this imbalance. Second, a reliable clinical evaluation requires a larger number of ascites samples to prevent overfitting, which can lead to an overestimation of performance. In future studies, we will collect ultrasound images from different centers for external validation to demonstrate the generalizability of MTASCNet.

Conclusion

This work introduces MTASCNet, a multi-task framework for simultaneous ascites segmentation and grading in abdominal ultrasound imaging. To suppress noise from redundant or ambiguous features, the FCA and EMFF modules propagate salient contextual information from the encoder to the decoder through attention and multi-scale feature enhancement mechanisms. The effectiveness of MTASCNet was evaluated on a collected dataset of abdominal ultrasound images.

Experimental results demonstrate that the proposed MTASCNet outperforms mainstream multi-task learning methods, yielding a segmentation Dice of 85.28% and a classification accuracy of 86.18%, and helps to address the traditional challenges of speckle noise and poor boundary contrast in ultrasound imaging. The results suggest that the proposed framework may serve as a preliminary computer-aided tool for ascites assessment. Future work will explore prospective clinical validation to assess real-world diagnostic utility.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the Second Affiliated Hospital of Naval Medical University. The computations in this research were performed using the CFFF platform of Fudan University.

ORCID iDs

Feng Xie

Chengcheng Liu

Dean Ta

Data Availability Statement

The datasets collected and/or analyzed during this study are available from the corresponding authors upon reasonable request.

References

1. Glen, Constanti, Brohi. Assessment and initial management of major trauma: summary of NICE guidance. BMJ. 2016;353:i3051.

2. Armstrong, Mooney, Paltiel, Barnewolt, Dionigi, Arbuthnot, et al. Contrast enhanced ultrasound for the evaluation of blunt pediatric abdominal trauma. J Pediatr Surg. 2018;53(3):548-52.

3. Desai, Harris. Extended focused assessment with sonography in trauma. BJA Educ. 2018;18(2):57-62.

4. Tang, Luo, Meng, Zhu, et al. Contrast-enhanced ultrasound imaging of active bleeding associated with hepatic and splenic trauma. Radiol Med. 2011;116(7):1076-82.

5. Tonon, Piano, Gambino, Romano, Pilutti, Incicco, et al. Outcomes and mortality of Grade 1 ascites and recurrent ascites in patients with cirrhosis. Clin Gastroenterol Hepatol. 2021;19(2):358-366.e8.

6. Lin, Cao, Lin, Liang, et al. Deep learning for emergency ascites diagnosis using ultrasonography images. J Appl Clin Med Phys. 2022;23(7):e13695.

7. Guo, Tao, Feng. MUCM-FLLs: multimodal ultrasound-based classification model for focal liver lesions. Biomed Signal Process Control. 2025;107:107864.

8. Taheri, Rahbar. Improving breast cancer classification in fine-grain ultrasound images through feature discrimination and a transfer learning approach. Biomed Signal Process Control. 2025;106:107690.

9. Huh, Kim, Chung, Kim, et al. A deep residual U-net algorithm for automatic detection and quantification of ascites on Abdominopelvic computed tomography images acquired in the emergency department: model development and validation. J Med Internet Res. 2022;24(1):e34415.

10. Rodriguez-Takeuchi, Sousa-Plata, Man, Vidarson, Rayner, Mohanta, et al. Characterization and quantification of fluid in the abdomen by ultrasound and magnetic resonance imaging in children with clinical suspicion of appendicitis. Abdom Radiol. 2024;49(4):1031-41.

11. Liu, Zhang, Zeng, Fan, Feng, et al. FNBUI-NET: a multi-task model for fetal nasal bone ultrasound image defect detection and classification. Biomed Signal Process Control. 2025;104:107586.

12. Rudralingam, Footitt, Layton. Ascites matters. Ultrasound. 2017;25(2):69-79.

13. Biggins, Angeli, Garcia-Tsao, Ginès, Ling, Nadim, et al. Diagnosis, evaluation, and management of ascites, spontaneous bacterial peritonitis and hepatorenal syndrome: 2021 practice guidance by the American Association for the study of liver diseases. Hepatology. 2021;74(2):1014-48.

14. Aithal, Palaniyappan, China, Härmälä, Macken, Ryan, et al. Guidelines on the management of ascites in cirrhosis. Gut. 2021;70(1):9-29.

15. Winkel, Heye, Weikert, Boll, Stieltjes. Evaluation of an AI-based detection software for acute findings in abdominal computed tomography scans: toward an automated work list prioritization of routine CT examinations. Investig Radiol. 2019;54(1):55-9.

16. Xie, Girshick, Dollár. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:1492-500.

17. Qin, Zhang. FcaNet: frequency channel attention networks. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. 2020:783-92.

18. Shen, Sun, Albanie. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah. 2018:7132-41.

19. Wang, Zhu. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Conference on Computer Vision and Pattern Recognition (CVPR 2020), June 16-18, The Washington State Convention Center.

20. Fan, Wang. MA-net: a multi-scale attention network for liver and tumor segmentation. IEEE Access. 2020;8:179656-65.

21. Yang, Wang. Multi-task learning for segmentation and classification of breast tumors from ultrasound images. Comput Biol Med. 2024;173:108319.

22. Huang, Cheng, Wang. Medical image segmentation based on dynamic positioning and region-aware attention. Pattern Recognit. 2024;151:151.

23. Zhang, Ren, Sun. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA. 2016:770-8.

24. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. Computer Science. 2014. arXiv:1409.1556.

25. Chollet. Xception: deep learning with depthwise separable convolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA.

26. Tan, Le QV. EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, June 9-15, 2019, Long Beach Convention & Entertainment Center in Long Beach, California, ICML. 2019:6105-14.

27. Huang, Liu, Laurens VDM, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 21-26, 2017, Honolulu, HI, USA. IEEE Computer Society. 2016:4700-8.

28. Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. 2017.

29. Xie, Wang, Anandkumar, Alvarez, Luo. SegFormer: simple and efficient design for semantic segmentation with transformers. In: Advances in Neural Information Processing Systems. 2021:12077-90.

30. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Houlsby. An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929. 2020.

31. Liu, Lin, Cao, Wei, Zhang, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), October 11-17, 2021, Montreal, BC, Canada. 2021:10012-22.

32. Ronneberger, Fischer, Brox. U-net: convolutional networks for biomedical image segmentation. In: 18th International Conference, October 5-9, 2015, Munich, Germany; 2015.

33. Zhou, Siddiquee MMR, Tajbakhsh, Liang. UNet++: a nested U-net architecture for medical image segmentation. In: 4th Deep Learning in Medical Image Analysis (DLMIA) Workshop, September 20, 2018, Granada, Spain; 2018.

34. Lin, Dollar, Girshick, Hariharan, Belongie. Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, Hawaii, Hawaii Convention Center. 2017.

35. Iqbal, Sharif. MDA-net: multiscale dual attention-based network for breast lesion segmentation using ultrasound images. J King Saud Univ - Comput Inf Sci. 2022;34(9):7283-99.

36. Yao. MFMSNet: a multi-frequency and multi-scale interactive CNN-transformer hybrid network for breast ultrasound image segmentation. Comput Biol Med. 2024;177:108616.

37. Aumente-Maestro, Díez, Remeseiro. A multi-task framework for breast cancer segmentation and classification in ultrasound imaging. Comput Methods Programs Biomed. 2025;260:108540.

38. Zhao, Chen, Yang, Luo YJ. A local and global feature disentangled network: toward classification of benign-malignant thyroid nodules from ultrasound image. IEEE Trans Med Imaging. 2022;41(6):1497-509.

39. Song KD. Current status of deep learning applications in abdominal ultrasonography. Ultrasonography. 2021;40(2):177-82.

40. Jiang, Deng, Zhou, Ren, Cai, et al. Machine learning algorithms to predict the 1 year unfavourable prognosis for advanced schistosomiasis. Int J Parasitol. 2021;51(11):959-65.

41. Azad, Kazerouni, Heidari, Aghdam, Molaei, Jia, et al. Advances in medical image analysis with vision transformers: a comprehensive review. Med Image Anal. 2024;91:103000.

42. Moore, Wong, Gines, Bernardi, Ochs, Salerno, et al. The management of ascites in cirrhosis: report on the consensus conference of the International Ascites Club. Hepatology. 2003;38(1):258-66.