Dual-Branch Person Re-Identification Method Based on Feature Consistency

Abstract

Person re-identification (Re-ID) is vital for intelligent surveillance. Although many existing methods incorporate multi-scale modules to enhance feature discriminability, they often overlook inter-group feature consistency under cross-camera scale and view variations, which can lead to embedding drift and unstable retrieval. To address this issue, we propose a dual-branch Re-ID framework based on feature consistency. First, we introduce a vertical feature-map segmentation strategy that decouples high-level global features into complementary upper- and lower-region representations in a single forward pass. These regional features are then processed by independent bottlenecks and classifiers, improving local semantic discriminability while maintaining global contextual cues. Second, we propose a Geometric-Distribution Alignment Loss (GDALoss) to explicitly enhance robustness to scale and horizontal-flip variations by minimizing the geometric and distributional discrepancies between differently transformed samples of the same identity in the embedding space. Extensive experiments on three benchmarks demonstrate consistent improvements over the baseline. On Market-1501, our method increases mAP by 1.4% and Rank-1 by 1.5%. On DukeMTMC-ReID, it improves mAP by 1.1% and Rank-1 by 2.9%. On MSMT17, it raises mAP by 3.1% and Rank-1 by 5.3%, validating the effectiveness and robustness of the proposed approach.

Keywords

Person re-identification; local information; multi-scale; dual-branch; feature consistency

1 Introduction

Person re-identification (Re-ID) aims to match pedestrian images across non-overlapping camera views. Given a query image, the system is required to retrieve all images of the same individual from a gallery collected by different cameras (Geng et al., 2025; Li et al., 2024a; Wang et al., 2024). Re-ID plays an important role in security surveillance, video tracking, and criminal investigation. In practical deployments, however, relying on facial attributes or motion trajectories is often costly or unreliable; therefore, most Re-ID systems primarily depend on visual appearance features. Despite significant progress, resolution and scale discrepancies (Zhang et al., 2024), viewpoint variations (Wu et al., 2024), and background clutter (Gu et al., 2022) remain fundamental challenges. Moreover, imperfect pedestrian detection (e.g., misaligned bounding boxes and partial occlusions) further amplifies feature inconsistency across cameras. Under such complex and variable conditions, learning robust and stable representations is crucial for reliable cross-camera retrieval.

Before deep learning, Re-ID mainly relied on hand-crafted features and metric learning (Khan et al., 2024). Deep networks have since enabled more discriminative representations, yet early CNN-based methods largely emphasized global features and often overlooked fine-grained cues. To compensate for this limitation, recent studies (Liu et al., 2023; Wang et al., 2022; Yu et al., 2025) exploit local representations by partitioning the human body, frequently with pose estimation or parsing guidance. While effective, these approaches typically introduce additional estimators or auxiliary branches, increasing model complexity and reducing deployment efficiency.

In parallel, loss function design for Re-ID has evolved from ID and verification objectives (Ye et al., 2021) to metric learning losses such as triplet loss (Schroff et al., 2015) and center loss (Wen et al., 2016), which encourage intra-class compactness and inter-class separation. However, an important research problem remains insufficiently addressed: how to explicitly enforce feature consistency for semantically corresponding samples under scale and view transformations and cross-camera variations. Existing metric losses exhibit notable limitations in this regard. Triplet loss depends on hard-sample mining and mainly enforces relative ordering, without providing an explicit constraint that stabilizes paired representations across views and scales. Center loss encourages samples to cluster around a class center but does not model cross-view geometric relations. Consequently, maintaining consistent feature geometry and distribution under scale and view changes remains difficult, which can degrade the overall ranking quality in challenging multi-camera settings.

Based on the above observations, we identify the following gaps in the current Re-ID literature: (G1) Many part-guided and pose-guided methods rely on extra estimators or auxiliary branches, which increases computational cost and complicates deployment. (G2) Mainstream metric objectives focus on separation in the embedding space but do not explicitly enforce cross-source (cross-view and cross-scale) feature consistency for corresponding samples. (G3) Although multi-scale augmentation is widely used, most methods lack a unified constraint that aligns both the geometric structure and the distribution of features under scale and flip transformations, leaving scale- and view-induced distribution shift under-constrained.

To bridge these gaps, we propose a dual-branch Re-ID framework based on feature consistency. We adopt a shared backbone to extract high-level feature maps and perform a lightweight feature-map-level vertical decoupling to obtain spatially complementary upper and lower sub-region features, without requiring pose or parsing annotations (addressing G1). Each branch applies global average pooling (GAP) and batch normalization (BN) to produce discriminative representations and is supervised by its own classifier. Furthermore, we design a Geometric-Distribution Alignment Loss (GDALoss) to explicitly reduce scale- and view-induced distribution shift by jointly constraining geometric distances and distribution differences between multi-scale transformed samples (e.g., resizing and horizontal flipping), thereby enforcing cross-source consistency (addressing G2–G3).

In summary, our main contributions are threefold:

(1) We propose a lightweight dual-branch architecture that decouples upper and lower region features at the feature-map level to enrich local details without auxiliary estimators.
(2) We propose GDALoss, which enforces geometric and distribution consistency across multi-scale transformations to improve robustness to scale and view variations.
(3) Extensive experiments on Market-1501, DukeMTMC-ReID, and MSMT17 validate the effectiveness of the proposed method.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed method in detail. Section 4 presents experimental settings and results, followed by conclusions in Section 5.
2. Related Work

2.1 Brief Overview of Person Re-Identification

In recent years, part-local-feature learning has become a mainstream direction for improving person re-identification (ReID), especially under occlusion, pose changes, and imperfect detections. A representative line of work introduces auxiliary cues such as pose estimation or human parsing to localize visible body regions and learn part-aware representations. For example, PPBI (Cui et al., 2025) leverages pose-guided partial attention together with batch-level information to enhance occlusion-robust matching, while DROP (Dou et al., 2024) decouples ReID and human parsing into task-specific features to reduce mutual interference and strengthen part-level modeling. Transformer-based designs have also been explored: the Part-Aware Transformer (Ni et al., 2023) injects part awareness into token-based representations to improve robustness and generalization, and PAFormer (Jung et al., 2024) further models body-part correlations to enhance part-to-part matching. Moreover, MPCC-Net (Zhou et al., 2025) enforces perception-consistency constraints across multi-state inputs to improve feature stability.

These approaches demonstrate that introducing part-level cues and consistency regularization can effectively improve recognition in challenging conditions. However, many of them still depend on additional estimators or complex attention branches, which increases computational overhead and may propagate localization errors. More importantly, existing part-based pipelines often focus on improving local discriminability, but they seldom provide a lightweight mechanism that simultaneously preserves global context and explicitly enforces stable correspondence between complementary regions under scale and viewpoint variations. This gap motivates our Dual-Branch Feature Separation (DBFS), which performs feature-map-level uniform decoupling without extra pose or parsing networks and further couples local discrimination with cross-source consistency constraints.

2.2. Multi-Scale Feature Extraction Methods

Scale and viewpoint variations are another major factor that causes feature drift and degrades ReID generalization. To mitigate these effects, a number of studies exploit multi-scale representations or scale-aware alignment. Lai et al. (Lai, 2025) explore multi-scale feature attention and strategy balancing in cross-modal ReID. For unsupervised ReID, TCMM (Zhu et al., 2025) introduces token constraints to suppress patch noise in ViT features and adopts a multi-scale memory bank for contrastive learning to improve feature consistency. In the visible-infrared setting, CM-DASN (Li et al., 2025) uses dynamic attention selection to learn more effective cross-modality representations. Other works also incorporate multi-scale cues via depth features (Zhou et al., 2024) or multi-scale and multi-granularity representation learning for person search (Han & Ma, 2024).

Although these methods improve robustness by fusing or selecting information across scales, they mainly enhance discriminability and do not explicitly require that same-identity embeddings under deterministic transformations (e.g., resizing and horizontal flipping) preserve a stable geometric structure and consistent neighborhood distribution in the embedding space. As a result, the model may still suffer from distribution shift when scale changes are large or when the training data is limited. To fill this gap, our Geometric–Distribution Alignment Loss (GDALoss) directly aligns cross-scale and cross-view features by jointly enforcing geometry-aware distance regularization and distribution-consistency constraints with minimal overhead.

3. Methodology

In person Re-ID, cross-camera matching is the most common and challenging setting in real deployments. Due to differences in camera hardware, installation geometry, and illumination conditions, the same identity is often observed at noticeably different apparent scales across non-overlapping cameras. Such cross-camera scale inconsistency can induce embedding drift, where semantically identical samples are mapped to inconsistent locations in the feature space, thereby weakening feature stability and degrading retrieval performance. While conventional objectives mainly emphasize intra-class compactness and inter-class separation, they do not explicitly enforce consistency across deterministic scale and view transformations, which makes it difficult to learn discriminative yet scale-robust representations. To tackle this issue, we propose a dual-branch person re-identification network based on feature consistency, termed Feature Consistency Network (FCNet). The overall architecture is shown in Figure 1.

Figure 1.

Network architecture diagram.

As shown in Figure 1, we design a dual-branch feature decoupling structure that uniformly splits the pedestrian feature map along the height direction into upper and lower local-region branches. While preserving the global receptive field of the deep backbone features, this structure extracts discriminative features of the upper body and lower body separately, which strengthens local feature learning without losing overall spatial correlation. In addition, to improve the model's robustness to scale and pose variations, we construct multi-scale and multi-view contrastive samples by downsampling and horizontally flipping the original samples $I_{orig} \in R^{384 \times 128 \times 3}$, guiding the model to learn scale-invariant and pose-invariant features. These transformations are given in Equations (1) and (2). On this basis, we propose GDALoss, which enforces feature-space consistency between samples of the same identity (ID) observed at different scales, thereby alleviating feature-space misalignment in cross-camera scenarios.

\begin{aligned} I_{scale} = Γ_{scale} (I_{orig}; θ) \end{aligned}

(1)

where $I_{orig} \in R^{384 \times 128 \times 3}$ denotes the original input image, $Γ_{scale} (\cdot; θ)$ is the downsampling operator, and $I_{scale} \in R^{336 \times 112 \times 3}$ denotes the resulting resized image. The scaling parameters are $θ = \begin{bmatrix} s_{h} & 0 \\ 0 & s_{w} \end{bmatrix}$, with $s_{h} = \frac{336}{384}$ and $s_{w} = \frac{112}{128}$.

\begin{aligned} I_{flip} = Γ_{flip} (I_{scale}) \end{aligned}

(2)

where $Γ_{flip} (\cdot)$ denotes horizontally flipping the scale-transformed image to simulate left–right parallax variations and, to some extent, to cover viewpoint discrepancies introduced by cross-camera settings. For an image $I$ of width $W$, the flip is defined element-wise as $Γ_{flip} (I) [x, y, :] = I [x, W - y - 1, :]$.
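
To make the sample construction in Equations (1) and (2) concrete, the following is a minimal NumPy sketch. The interpolation method is not specified in the text, so nearest-neighbour sampling is used here purely as an assumption, and the function names `resize_nearest`, `flip_horizontal`, and `make_augmented_view` are hypothetical.

```python
import numpy as np

def resize_nearest(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize an (H, W, C) image by nearest-neighbour sampling.

    Assumption: the paper does not state the interpolation used;
    nearest-neighbour is only a stand-in for the operator in Eq. (1).
    """
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return img[rows[:, None], cols[None, :], :]

def flip_horizontal(img: np.ndarray) -> np.ndarray:
    """Eq. (2): out[x, y, :] = img[x, W - y - 1, :]."""
    return img[:, ::-1, :]

def make_augmented_view(img: np.ndarray, out_h: int = 336, out_w: int = 112) -> np.ndarray:
    """Downsample first, then flip the downsampled image."""
    return flip_horizontal(resize_nearest(img, out_h, out_w))

rng = np.random.default_rng(0)
I_orig = rng.random((384, 128, 3))       # synthetic stand-in for a pedestrian crop
I_flip = make_augmented_view(I_orig)
print(I_flip.shape)                      # (336, 112, 3)
```

Both scale factors equal 336/384 = 112/128 = 0.875, so the aspect ratio of the crop is preserved.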

We present the detailed workflow of FCNet, as shown in Algorithm 1:

Algorithm 1. Training with Multi-Source Cross Fusion and GDALoss.
1: Input: mini-batch images $I$, identity labels $y$
2: Hyper-params: resize size $(336, 112)$, temperature $t$, loss weights $α = 0.2$ and $β$
3: Model: $F (\cdot)$ outputs two-branch logits and features, $(s^{u}, s^{l}, F_{1}, F_{2}) = F (I)$
4: $\tilde{I} \leftarrow Flip (Resize (I, 336 \times 112))$ {augmentation}
5: $(s^{u}, s^{l}, F_{1}, F_{2}) \leftarrow F (I)$ {original}
6: $({\tilde{s}}^{u}, {\tilde{s}}^{l}, {\tilde{F}}_{1}, {\tilde{F}}_{2}) \leftarrow F (\tilde{I})$ {augmented}
7: $\tilde{y} \leftarrow Concat (y, y)$ {labels for original + augmented}
8: $L_{main} \leftarrow \frac{1}{2} [L_{reid} (Concat (s^{u}, {\tilde{s}}^{u}), Concat (F_{1}, {\tilde{F}}_{1}), \tilde{y})$
9: $+ L_{reid} (Concat (s^{l}, {\tilde{s}}^{l}), Concat (F_{2}, {\tilde{F}}_{2}), \tilde{y})]$
10: $Z_{c 1} \leftarrow ConcatChannel (F_{1}, {\tilde{F}}_{2})$ {cross-fusion features}
11: $Z_{c 2} \leftarrow ConcatChannel (F_{2}, {\tilde{F}}_{1})$
12: $F_{c f} \leftarrow ConcatBatch (Z_{c 1}, Z_{c 2})$
13: $L_{GAL}, L_{DAL} \leftarrow GDALoss\_Core (F_{c f}, t)$ {geometry + KL terms}
14: $L_{GDA} \leftarrow L_{GAL} + α \cdot L_{DAL}$
15: $L_{total} \leftarrow L_{main} + β \cdot L_{GDA}$
16: $θ \leftarrow θ - η \cdot \nabla_{θ} L_{total}$ {backprop and update}
17: Output: trained parameters $θ$
18: Subroutine $GDALoss\_Core (F_{c f}, t)$:
19: $Z \leftarrow Normalize (W \cdot F_{c f})$ {projection + $ℓ_{2}$ normalization}
20: $L_{GAL} \leftarrow mean_{pairs (i, j)} (Dist (Z_{i}, Z_{j}) - 1)^{2}$ {target distance = 1}
21: $S \leftarrow (Z_{i} \cdot Z_{j}^{⊤}) / t$ {similarity matrix}
22: $L_{DAL} \leftarrow SymmetricKL (softmax (S), softmax (S^{⊤}))$ {bidirectional KL}
23: return $L_{GAL}, L_{DAL}$

3.1. Dual-Branch Feature Separation

Existing person re-identification methods typically represent the target using global features. However, global aggregation often weakens fine-grained local cues such as clothing textures, backpack boundaries, and shoe shapes. In contrast, features based on local receptive fields can emphasize local details, but they are susceptible to bounding-box misalignment, pose variations, and occlusions, which leads to unstable spatial alignment of local regions across different samples and thus limits the reliability of cross-camera matching. To enhance local discriminability and improve regional alignment stability without introducing additional pose estimation or semantic segmentation networks, we uniformly split high-level feature maps along the height dimension during the feature decoupling stage, thereby obtaining structurally stable and semantically complementary local representations.

Specifically, given input images $I_{orig} \in R^{384 \times 128 \times 3}$ and $I_{scale} \in R^{336 \times 112 \times 3}$, we extract the corresponding high-level feature maps $F \in R^{B \times C \times H \times W}$ and $F_{scale} \in R^{B \times C \times H \times W}$ using the ResNet50 backbone. Here, B denotes the batch size, C denotes the channel dimension, and H and W denote the height and width of the feature maps, respectively. Taking the original-image branch as an example, we uniformly partition the high-level feature map into upper and lower regions along the height dimension, as formulated in Equations (3) and (4).

\begin{aligned} F_{1} = F [:, :, : H / 2, :] \end{aligned}

(3)

\begin{aligned} F_{2} = F [:, :, H / 2 :, :] \end{aligned}

(4)

In this way, $F_{1} \in R^{B \times C \times \frac{H}{2} \times W}$ and $F_{2} \in R^{B \times C \times \frac{H}{2} \times W}$ correspond to the local semantic representations of the pedestrian's upper-body and lower-body regions, respectively: the upper-body region mainly covers the head–shoulder area, clothing textures, and backpacks, while the lower-body region mainly covers trousers, legs, and shoes. Since pedestrian bounding boxes in mainstream ReID benchmarks exhibit relatively good statistical alignment in the vertical direction, the upper and lower body structure is comparatively stable. Meanwhile, the two regions provide complementary identity-discriminative cues. Based on these properties, a 1:1 uniform split can preserve global spatial correlations while providing each partition with a relatively complete and stable local semantic representation, thereby alleviating local misalignment issues and improving the utilization of fine-grained cues.
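
The uniform 1:1 split of Equations (3) and (4) amounts to two slices along the height axis. The sketch below illustrates this on a small synthetic tensor; the function name `split_upper_lower` and the requirement that H be even are assumptions made for the example, and the tensor shape is chosen only for readability.

```python
import numpy as np

def split_upper_lower(feature_map: np.ndarray):
    """Split a (B, C, H, W) feature map into upper and lower halves along H.

    Assumption: H is even, so the two halves have identical shape.
    """
    height = feature_map.shape[2]
    if height % 2 != 0:
        raise ValueError("this sketch assumes an even feature-map height")
    half = height // 2
    upper = feature_map[:, :, :half, :]   # rows 0 .. H/2 - 1 (upper body)
    lower = feature_map[:, :, half:, :]   # rows H/2 .. H - 1 (lower body)
    return upper, lower

# Tiny synthetic feature map: B=2, C=3, H=4, W=2.
F = np.arange(2 * 3 * 4 * 2, dtype=np.float32).reshape(2, 3, 4, 2)
F1, F2 = split_upper_lower(F)
print(F1.shape, F2.shape)   # (2, 3, 2, 2) (2, 3, 2, 2)
```

Because the split is pure slicing, it adds no parameters, and stacking the two halves back along the height axis recovers the original map.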

Subsequently, the two local regions are processed by independent global average pooling (GAP), batch normalization (BN), and classifiers, yielding the GAP features $z_{1} \in R^{B \times C}$ and $z_{2} \in R^{B \times C}$ as well as the corresponding classification score vectors $score_{1} \in R^{B \times N}$ and $score_{2} \in R^{B \times N}$. The multi-scale samples pass through the same pipeline as the original images, resulting in the GAP features $z_{scale1} \in R^{B \times C}$ and $z_{scale2} \in R^{B \times C}$ and the score vectors $score_{scale1} \in R^{B \times N}$ and $score_{scale2} \in R^{B \times N}$. Here, N denotes the number of classes, i.e., each sample is associated with scores over N classes.

Finally, we concatenate the four groups of decoupled local features along the batch dimension, branch by branch, to form the fused local features $feat_{11} = \begin{bmatrix} z_{1} \\ z_{scale1} \end{bmatrix} \in R^{2 B \times C}$ and $feat_{22} = \begin{bmatrix} z_{2} \\ z_{scale2} \end{bmatrix} \in R^{2 B \times C}$. During training, we employ joint supervision with the cross-entropy loss and triplet loss to drive local representation learning: cross-entropy ensures identity discrimination, while the triplet loss further enforces intra-class compactness and inter-class separation. As a result, the robustness and cross-scale alignment capability of local features are enhanced without sacrificing overall retrieval performance.
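
As a rough illustration of the pooling and batch-wise concatenation described above, the following sketch pools one branch of the original and multi-scale feature maps and stacks them along the batch axis. Batch normalization and the classifiers are omitted, the shapes are illustrative, and the helper names are hypothetical.

```python
import numpy as np

def global_average_pool(fmap: np.ndarray) -> np.ndarray:
    """GAP: average a (B, C, h, W) map over its spatial axes, giving (B, C)."""
    return fmap.mean(axis=(2, 3))

def batch_concat(z_orig: np.ndarray, z_scale: np.ndarray) -> np.ndarray:
    """Stack original and multi-scale vectors along the batch axis: (2B, C)."""
    return np.concatenate([z_orig, z_scale], axis=0)

rng = np.random.default_rng(0)
B, C = 4, 8                                   # illustrative sizes only
F1 = rng.random((B, C, 12, 8))                # upper-branch map, original sample
F_scale1 = rng.random((B, C, 12, 8))          # upper-branch map, multi-scale sample
y = np.array([0, 0, 1, 1])                    # identity labels of the mini-batch

z1 = global_average_pool(F1)
z_scale1 = global_average_pool(F_scale1)
feat_11 = batch_concat(z1, z_scale1)          # features used for ID and triplet supervision
y_tilde = np.concatenate([y, y])              # labels repeat for the augmented half
print(feat_11.shape, y_tilde.shape)           # (8, 8) (8,)
```

The lower branch follows the same steps with $F_{2}$ and its multi-scale counterpart to give $feat_{22}$.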

3.2 Multi-Source Cross Fusion Strategy

Global features are used to represent a pedestrian's appearance holistically, whereas local features focus more on fine-grained cues from different body regions. However, under an independent training scheme, both types of features have limitations: global representations tend to over-rely on a few salient regions and fail to preserve details from other parts; local representations are constrained by limited receptive fields and may still lack sufficient contextual information even after multiple convolutional layers, thereby weakening regional discriminability. To address these issues, we propose a multi-source cross-fusion mechanism built upon the dual-branch feature decoupling architecture. As illustrated in Figure 2, it jointly leverages local supervision and cross-source consistency constraints to enhance the stability and robustness of feature representations under cross-scale and cross-view conditions.

Figure 2.

Fusion of local features.

Local feature extraction and representation. Given the high-level feature maps $F \in R^{B \times C \times H \times W}$ and $F_{scale} \in R^{B \times C \times H \times W}$ produced by the backbone network, the dual-branch decoupling module generates the local feature maps $F_{1}, F_{2} \in R^{B \times C \times \frac{H}{2} \times W}$ and $F_{scale1}, F_{scale2} \in R^{B \times C \times \frac{H}{2} \times W}$. For the original sample ($F_{1}, F_{2}$) and its multi-scale augmented sample ($F_{scale1}, F_{scale2}$), we apply global average pooling (GAP) to each branch to obtain the local vector representations $z_{1}, z_{scale1} \in R^{B \times C}$ and $z_{2}, z_{scale2} \in R^{B \times C}$, where B denotes the batch size and C denotes the channel dimension.

Local feature concatenation for identity-discriminative learning. To ensure stable identity discrimination across different scale sources, we concatenate the two-source local vectors of the same branch along the batch dimension to form the local features used for supervised learning: $feat_{11} = Concat (z_{1}, z_{scale1}) \in R^{2 B \times C}$ and $feat_{22} = Concat (z_{2}, z_{scale2}) \in R^{2 B \times C}$. As shown in Figure 2, $feat_{11}$ and $feat_{22}$ are fed into the Triplet Loss and ID Loss to promote intra-class compactness and inter-class separability of the local features, thereby maintaining the discriminability of both branches.

Cross-fusion features for cross-source consistency constraints. Local discriminative supervision alone may be insufficient to guarantee cross-scale consistency. Therefore, we further construct cross-source cross-fusion features by concatenating local vectors from opposite branches and different sources along the channel dimension, yielding: $z_{c1} = Concat (z_{1}, z_{scale2}) \in R^{B \times 2 C}$ and $z_{c2} = Concat (z_{2}, z_{scale1}) \in R^{B \times 2 C}$.

As shown in Figure 2, $z_{c 1}$ and $z_{c 2}$ are fed into GDALoss, which constrains the representation discrepancies of the same identity under different scales and viewpoints from both geometric-structure and distribution-consistency perspectives. Since this constraint backpropagates through the shared backbone and branch-specific feature learning, it encourages the network to learn representations that are more robust to scale and viewpoint variations.
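
The cross-fusion step pairs the upper-branch vector of one source with the lower-branch vector of the other and concatenates them along the channel axis. A minimal sketch, with constant-valued vectors so the layout of the result is easy to read (the function name `cross_fuse` is hypothetical):

```python
import numpy as np

def cross_fuse(z1, z2, z_scale1, z_scale2):
    """Channel-wise cross-source concatenation.

    z_c1 joins the original upper-branch vector with the multi-scale
    lower-branch vector; z_c2 does the opposite. Each output is (B, 2C).
    """
    z_c1 = np.concatenate([z1, z_scale2], axis=1)
    z_c2 = np.concatenate([z2, z_scale1], axis=1)
    return z_c1, z_c2

B, C = 2, 3                       # illustrative sizes only
z1 = np.full((B, C), 1.0)         # original, upper branch
z2 = np.full((B, C), 2.0)         # original, lower branch
z_scale1 = np.full((B, C), 3.0)   # multi-scale, upper branch
z_scale2 = np.full((B, C), 4.0)   # multi-scale, lower branch

z_c1, z_c2 = cross_fuse(z1, z2, z_scale1, z_scale2)
print(z_c1[0])   # [1. 1. 1. 4. 4. 4.]
print(z_c2[0])   # [2. 2. 2. 3. 3. 3.]
```

Each fused vector therefore covers the whole body while mixing the two sources, and the operation itself introduces no learnable parameters.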

Compared with attention-based fusion, we choose channel-wise concatenation mainly for two reasons. First, concatenation is lightweight and introduces almost no additional parameters or hyperparameters, resulting in negligible computational overhead, which aligns with our design goal of controlling model complexity. Second, GDALoss relies on stable geometric relationships in the feature space, whereas attention-based fusion typically produces sample-dependent dynamic reweighting that may distort feature geometry and increase the difficulty of enforcing consistency constraints. In contrast, channel-wise concatenation deterministically preserves complementary cues from the two sources, providing a more stable and effective input for subsequent geometric–distribution alignment. Moreover, cross-source concatenation makes inconsistencies between the two sources more explicitly exposed, which helps improve the model's robustness.

3.3 GDA Loss

Pedestrian images exhibit significant appearance variations across different scales and viewpoints, which can induce distribution shifts in the feature representations of the same identity and ultimately manifest as cross-scale and cross-view feature inconsistency. To alleviate this issue, we propose the Geometric–Distribution Alignment Loss (GDALoss). By jointly enforcing geometric-structure consistency and distribution consistency, GDALoss explicitly reduces the feature discrepancies of the same identity under different scales and viewpoints, thereby encouraging the model to learn identity representations that are more robust and invariant to scale and viewpoint changes.

Input and projection normalization: GDALoss takes the cross-fusion features obtained in Section 3.2 as inputs. For the fused features of the original samples and their multi-scale and multi-view augmented counterparts, we employ a shared-weight linear projection layer $g (\cdot)$ to map them into a low-dimensional discriminative space, followed by $L_{2}$ normalization to constrain the representations onto the unit hypersphere. We denote the normalized feature groups as $g r o u p_{1}$ (original-sample group) and $g r o u p_{2}$ (augmented-sample group). This design suppresses feature magnitude variations introduced by scale changes and meanwhile makes distance metrics such as Euclidean distance and cosine similarity more stable.

Geometric Alignment Loss (GALoss): To maintain a stable geometric structure for the same identity under different scales and viewpoints, we first compute the Euclidean distance between the two feature groups in the projection space, as given in Equation (5).

\begin{aligned} d_{euclidean} = ∥ group_{1} - group_{2} ∥_{2} \end{aligned}

(5)

where $group_{1}$ denotes the feature group of the original samples, $group_{2}$ denotes the feature group of the multi-scale samples, $∥ \cdot ∥_{2}$ is the $L_{2}$ norm, and $d_{euclidean}$ is the Euclidean distance between the two feature groups, which measures the representation drift of the same identity under different scale and view transformations.

Subsequently, GALoss is constructed by minimizing the mean squared error (MSE) between the distance of each feature-group pair and the target distance $σ$, as given in Equation (6).

\begin{aligned} L_{GAL} = \frac{1}{k} \sum_{i < j} (d (group_{i}, group_{j}) - σ)^{2} \end{aligned}

(6)

where $k$ denotes the number of feature groups involved in the constraint and $d (\cdot, \cdot)$ is the Euclidean distance of Equation (5). In this paper, we set $k = 2$ (i.e., the original group and the augmented group); therefore, the sum $\sum_{i < j}$ reduces to the single pair $(group_{1}, group_{2})$. $σ$ is the target distance, which we set to $σ = 1$. After $L_{2}$ normalization, the Euclidean distance between two unit vectors lies in $[0, 2]$. Setting $σ$ to 1 imposes a moderate-strength geometric constraint: it pulls together cross-scale and cross-view features of the same identity, while avoiding both the overly strong constraint and representation collapse that arise when $σ$ is too small and the overly weak constraint that fails to suppress distribution shift when $σ$ is too large. Moreover, since $∥ u - v ∥_{2}^{2} = 2 - 2 \cos θ$ for unit vectors $u$ and $v$ separated by an angle $θ$, the condition $∥ u - v ∥_{2} = 1$ gives $\cos θ = 0.5$ (an angle of $60^{\circ}$), which represents a reasonable similarity anchor that encourages closeness without forcing complete overlap. We further conduct a sensitivity analysis on $σ$ in the experimental section to validate the stability of this setting.
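
The geometric term can be checked numerically on a toy pair. The sketch below follows Equations (5) and (6) under two assumptions that the text leaves open: the distance is computed row by row between matching samples and averaged over the batch, and the inputs are treated as already projected (the linear projection layer is omitted). The function names are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def geometric_alignment_loss(group1, group2, sigma: float = 1.0, k: int = 2) -> float:
    """Eq. (6) for the single pair (group1, group2).

    Assumption: distances are taken per sample and averaged over the batch.
    """
    g1, g2 = l2_normalize(group1), l2_normalize(group2)
    d = np.linalg.norm(g1 - g2, axis=1)            # Eq. (5), values in [0, 2]
    return float(np.mean((d - sigma) ** 2) / k)

u = np.array([[1.0, 0.0]])
v = np.array([[0.5, np.sqrt(3.0) / 2.0]])          # unit vector at 60 degrees from u

loss_60 = geometric_alignment_loss(u, v)           # distance is 1, so the loss vanishes
loss_same = geometric_alignment_loss(u, u)         # distance is 0, so the loss is (0 - 1)^2 / 2
print(round(loss_60, 6), round(loss_same, 6))      # 0.0 0.5
```

The toy values show the anchor behaviour discussed above: with $σ = 1$ the term is minimized at a 60-degree separation, and exactly coincident embeddings still incur a nonzero penalty.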

Distribution Alignment Loss (DALoss): Constraining distances alone may be insufficient to ensure consistent local neighborhood structures. Therefore, we further minimize the Kullback–Leibler (KL) divergence between the distributions of the two feature groups, encouraging them to become consistent in a probabilistic sense. We first compute the similarity matrix between the two feature groups, as given in Equation (7).

\begin{aligned} S = \frac{group_{1} \cdot group_{2}^{\top}}{T} \end{aligned}

(7)

where $S$ denotes the similarity matrix, the superscript $\top$ denotes the transpose, and $T$ is a temperature coefficient that controls how smooth or sharp the distribution is after the softmax operation. We then apply softmax to the similarity matrix to obtain the probability distributions $P_{i}$ and $P_{j}$, and construct DALoss using the symmetric KL divergence, as given in Equation (8).

\begin{aligned} L_{DAL} = \frac{1}{2 k} \sum_{i < j} [D_{KL} (P_{i} ∥ P_{j}) + D_{KL} (P_{j} ∥ P_{i})] \end{aligned}

(8)

where $k$ again denotes the number of feature groups involved in the constraint, with $k = 2$ in this paper (i.e., the original group and the augmented group), and $P_{i}$ and $P_{j}$ denote the probability distributions obtained by applying the softmax operation to the similarity matrix $S$ for the i-th and j-th feature groups. The symmetric KL divergence ensures that the two feature groups remain consistent at the probability-distribution level. The KL divergence itself is given in Equation (9).

\begin{aligned} D_{KL} (P ∥ Q) = \sum_{x} P (x) \log \frac{P (x)}{Q (x)} \end{aligned}

(9)
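
Equations (7) to (9) can be sketched as follows. Several details are assumptions rather than statements from the text: the two distributions are taken as the row-wise softmax of $S$ and of its transpose (as in line 22 of Algorithm 1), the KL divergence is averaged over rows, and the temperature value 0.1 is a placeholder because the value used in the paper is not given here. All function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def softmax_rows(s):
    """Numerically stable softmax over each row."""
    shifted = s - s.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def mean_kl(p, q, eps=1e-12):
    """Eq. (9) applied row by row, then averaged over rows (assumed reduction)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def distribution_alignment_loss(group1, group2, temperature=0.1, k=2):
    """Eq. (8) for the single pair of groups; temperature is a placeholder value."""
    s = group1 @ group2.T / temperature      # Eq. (7)
    p = softmax_rows(s)                      # distribution seen from group 1
    q = softmax_rows(s.T)                    # distribution seen from group 2
    return (mean_kl(p, q) + mean_kl(q, p)) / (2 * k)

rng = np.random.default_rng(0)
g1 = l2_normalize(rng.normal(size=(6, 4)))
g2 = l2_normalize(rng.normal(size=(6, 4)))
loss_diff = distribution_alignment_loss(g1, g2)   # unrelated groups: positive loss
loss_same = distribution_alignment_loss(g1, g1)   # identical groups: S is symmetric, loss ~ 0
```

When the two groups coincide, $S$ equals its transpose, the two distributions match, and the term vanishes; any asymmetry in the cross-group similarities is what the loss penalizes.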

Finally, the calculation of the GDALoss is shown in Equation (10).

\begin{aligned} L_{GDALoss} = L_{GAL} + α \cdot L_{DAL} \end{aligned}

(10)

Here, $α$ is a weighting coefficient that balances the contributions of the geometric-consistency constraint and the distribution-consistency constraint. In this paper, we set $α = 0.2$ to strengthen cross-scale feature distribution alignment while maintaining geometric-structure stability.

3.4 Loss Function

In the training phase, the loss consists of three components: Triplet Loss, Cross-Entropy Loss, and the Geometric–Distribution Alignment Loss (GDALoss). The Triplet Loss and Cross-Entropy Loss are computed from the local features and their score vectors, where the local features are formed by concatenating the original and multi-scale samples along the batch dimension. Their calculations are shown in Equations (11) and (12).

\begin{aligned} L_{cc}^{(k)} & = - \sum_{i = 1}^{2 B} \sum_{j = 1}^{N} y_{i j} \log (softmax (score_{k k} [i, j])) \end{aligned}

(11)

\begin{aligned} L_{tri}^{(k)} & = \frac{1}{2 B} \sum_{i = 1}^{2 B} \max (0, d_{p}^{(i)} - d_{n}^{(i)} + ε) \end{aligned}

(12)

where $k \in \{1, 2\}$ indexes the upper and lower branches, respectively; $B$ is the batch size; $N$ denotes the number of classes, matching the score dimension defined in Section 3.1; $softmax$ converts the scores into a probability distribution; and $y_{i j}$ is the ground-truth label indicating whether sample $i$ belongs to class $j$. $d_{p}^{(i)}$ represents the feature distance between the anchor sample and the positive sample; $d_{n}^{(i)}$ denotes the feature distance between the anchor sample and the negative sample; and $ε$ is the margin hyperparameter, which is set to 0.3, consistent with the baseline model (Luo et al., 2019).
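
For reference, the two supervised terms in Equations (11) and (12) can be sketched in NumPy as below. The equations do not say how the positive and negative samples are selected, so batch-hard mining (hardest positive, hardest negative per anchor) is used here as an assumption; the function names are hypothetical.

```python
import numpy as np

def cross_entropy_sum(scores: np.ndarray, labels: np.ndarray) -> float:
    """Eq. (11): summed negative log-softmax probability of the true class."""
    labels = np.asarray(labels)
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].sum())

def batch_hard_triplet(features: np.ndarray, labels: np.ndarray, margin: float = 0.3) -> float:
    """Eq. (12), with batch-hard mining assumed for d_p and d_n."""
    labels = np.asarray(labels)
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))              # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    d_p = np.where(same, dist, -np.inf).max(axis=1)       # farthest same-identity sample
    d_n = np.where(~same, dist, np.inf).min(axis=1)       # closest different-identity sample
    return float(np.mean(np.maximum(0.0, d_p - d_n + margin)))

# Uniform scores over 3 classes: each sample contributes log(3).
ce = cross_entropy_sum(np.zeros((4, 3)), np.array([0, 1, 2, 0]))

# Three 2-D features; the identity-1 sample sits between the two identity-0 samples.
feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]])
tri = batch_hard_triplet(feats, np.array([0, 0, 1]))
print(round(ce, 4), round(tri, 4))   # 4.3944 0.5333
```

In the toy triplet case, the two identity-0 anchors each contribute 1 - 0.5 + 0.3 = 0.8 and the identity-1 anchor contributes 0, giving a mean of 1.6 / 3.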

Subsequently, we introduce the proposed GDALoss to minimize the distribution difference between parallel samples, thereby obtaining the final total loss, and its calculation formula is shown in Equation (13).

\begin{aligned} L_{total} = L_{cc} + L_{tri} + β \cdot L_{GDALoss} \end{aligned}

(13)

where $L_{total}$ is the total loss of the model and serves as the optimization target during training. $L_{cc}$ is the cross-entropy loss, which measures the error between the model's classification predictions and the ground-truth labels and constrains the model to learn category-discriminative features. $L_{tri}$ is the triplet loss, which constrains the feature distances among anchor, positive, and negative samples to enhance feature discriminability. $L_{GDALoss}$ is the GDALoss, which maintains the geometric structure and distribution consistency of features for samples of the same identity under different scales and viewpoints, thus alleviating feature misalignment across scenarios. $β$ is a hyperparameter that weights the contribution of GDALoss to the total loss; in this paper, $β$ is set to 0.2.
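The weighting in Equation (13) can be sketched as follows. This is only an illustration: the function name is hypothetical, and summing the per-branch losses of Equations (11) and (12) over the two branches is an assumption, since Equation (13) does not state how the branch terms are aggregated.

```python
def total_loss(l_cc_branches, l_tri_branches, l_gda, beta=0.2):
    """Eq. (13): L_total = L_cc + L_tri + beta * L_GDALoss.

    l_cc_branches / l_tri_branches: per-branch loss values (upper, lower).
    Summation over branches is an assumption made for this sketch.
    """
    l_cc = sum(l_cc_branches)
    l_tri = sum(l_tri_branches)
    return l_cc + l_tri + beta * l_gda


# Toy values for the two branches and the alignment term.
loss = total_loss([1.0, 0.2], [0.2, 0.1], l_gda=0.5)
```

With $β = 0.2$, the alignment term enters the objective at one fifth of its raw magnitude, so it regularizes training without dominating the identity and triplet terms.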

4. Experimental Evaluation

4.1. Datasets and Evaluation Metrics

We evaluate the proposed method on three widely adopted person Re-ID benchmarks: Market-1501, DukeMTMC-ReID, and MSMT17, whose statistics are summarized in Table 1.

Table 1.
Statistics of Used Datasets.

Dataset	Total IDs	Training IDs	Gallery IDs	Images	Gallery Images	Cameras
Market-1501	1501	751	750	32,668	19,732	6
DukeMTMC-ReID	1404	702	702	36,411	17,661	8
MSMT17	4101	1041	3060	126,441	82,161	15

Market-1501 (Zheng et al., 2015) is collected by six cameras and contains 32,668 annotated bounding boxes of 1,501 identities. Following the standard protocol, 751 identities with 12,936 images are used for training, while the remaining 750 identities are used for testing, including 3,368 query images and 19,732 gallery images.

DukeMTMC-ReID (Zheng et al., 2017) is derived from the DukeMTMC dataset and includes 36,411 images captured by eight cameras. The dataset is split into 702 identities for training (16,522 images) and 702 disjoint identities for testing, with 2,228 query images and 17,661 gallery images.

MSMT17 (Wei et al., 2018) is a large-scale benchmark collected from a 15-camera network, containing 126,441 images of 4,101 identities under diverse indoor/outdoor scenes. The official split is adopted, where the test set includes 3,060 identities and the gallery contains 82,161 images, making MSMT17 notably challenging due to large scale and significant appearance variations.

Evaluation Metrics: To evaluate FCNet and compare it with other ReID methods, we report two common metrics on the above three benchmarks, following the common settings: the cumulative matching characteristics (CMC) (Bai et al., 2017) at Rank-1 and the mean average precision (mAP) (Gray et al., 2007).
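For readers unfamiliar with these metrics, the sketch below computes Rank-1 and mAP from per-query ranked match lists. It is a simplified illustration with hypothetical function names: each list marks whether the gallery item at that rank shares the query identity, and the benchmark-specific filtering of same-camera and junk images used in standard protocols is omitted.

```python
def average_precision(matches):
    """AP for one query; matches[i] is True if the i-th ranked item is correct."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(ranked_matches):
    """Return (rank1, mAP) over a list of per-query ranked match lists."""
    n = len(ranked_matches)
    rank1 = sum(1 for m in ranked_matches if m and m[0]) / n
    mean_ap = sum(average_precision(m) for m in ranked_matches) / n
    return rank1, mean_ap


# Query 1 is matched at ranks 1 and 3; query 2 is matched only at rank 2.
rank1, mean_ap = evaluate([[True, False, True], [False, True, False]])
```

Rank-1 only looks at the top retrieved item, whereas mAP rewards placing all correct gallery images early, which is why the two metrics can move in different directions.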

4.2. Experimental Setup

In our experiments, we implement the proposed model using the PyTorch framework. All experiments are conducted on a workstation equipped with an NVIDIA GeForce RTX 4060 Ti GPU and a 12th Gen Intel(R) Core(TM) i5-12600KF CPU. We initialize ResNet-50 with ImageNet-pretrained weights. During training, each input image is resized to $384 \times 128$ and randomly horizontally flipped with a probability of 0.5. To construct multi-scale contrastive samples, we further downsample the resized images to $336 \times 112$ and apply the same horizontal flip. The choice of $336 \times 112$ follows two principles: (1) it preserves the commonly used $3 : 1$ aspect ratio of person bounding boxes in ReID (consistent with $384 \times 128$), avoiding the geometric distortion and local-region misalignment caused by non-uniform resizing; and (2) compared with smaller resolutions, it retains richer fine-grained textures and contours while introducing a moderate scale perturbation that simulates the scale variations commonly observed in real-world surveillance scenarios. The batch size is set to 64. We adopt the Adam optimizer with an initial learning rate of 3.5 × 10⁻⁵. During the first 10 epochs, the learning rate is linearly warmed up to 3.5 × 10⁻⁴; it is then decayed to 3.5 × 10⁻⁵ and 3.5 × 10⁻⁶ at the 30th and 50th epochs, respectively. The model is trained for 80 epochs in total.
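The learning-rate schedule described above can be written as a small function. This is a sketch under stated assumptions: epochs are counted from 0, the warm-up interpolates linearly per epoch, and each decay multiplies the rate by 0.1; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=3.5e-4, warmup_start=3.5e-5,
                  warmup_epochs=10, milestones=(30, 50), gamma=0.1):
    """Linear warm-up followed by step decay (epoch is 0-based)."""
    if epoch < warmup_epochs:
        # Interpolate from warmup_start to base_lr over the warm-up epochs.
        frac = epoch / warmup_epochs
        return warmup_start + frac * (base_lr - warmup_start)
    # Count how many decay milestones have been reached.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate for each of the 80 training epochs.
schedule = [learning_rate(e) for e in range(80)]
```

Under these assumptions the rate starts at 3.5 × 10⁻⁵, reaches 3.5 × 10⁻⁴ at epoch 10, and drops by a factor of ten at epochs 30 and 50.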

4.3. Comparison with State-of-the-Art Methods

To verify the effectiveness of our method, we compare it with representative state-of-the-art approaches on three benchmark datasets: Market-1501, DukeMTMC-ReID, and MSMT17. As shown in Table 2, our method achieves competitive performance across all three datasets and consistently improves over the baseline. Specifically, on Market-1501, our method improves mAP and Rank-1 by 1.4% and 1.5%, respectively. On DukeMTMC-ReID, it surpasses the baseline by 1.1% in mAP and 2.9% in Rank-1. On MSMT17, which is considerably more challenging due to large-scale variations and cluttered backgrounds, our method obtains larger gains of 3.1% in mAP and 5.3% in Rank-1, indicating improved robustness under complex real-world conditions.

Table 2.
Comparison with State-of-the-Art Methods.

Methods	Market-1501		DukeMTMC-ReID		MSMT17	
	mAP	Rank-1	mAP	Rank-1	mAP	Rank-1
NPSS (Chen et al., 2023)	82.3	94.0	69.7	83.4	36.7	68.8
CIEM (Gautam et al., 2023)	83.8	95.1	73.5	86.8	-	-
Aap-ReID (Wang et al., 2023)	86.3	94.6	76.2	87.6	-	-
HCACE (Luo et al., 2024)	83.4	93.7	71.5	84.2	41.6	72.4
CC + CAJ (Chen et al., 2024)	86.1	94.4	-	-	44.3	75.1
CALR (Li et al., 2024b)	84.5	93.6	74.2	86.0	50.6	78.1
SADB (Gao et al., 2024)	85.5	93.1	73.5	85.9	-	-
MSMGO (Liu et al., 2024)	87.2	94.7	76.9	86.8	42.5	68.5
UFFM (Che et al., 2025)	84.9	95.8	76.8	84.5	54.9	77.3
Tran + pose2ID (Yuan et al., 2025)	82.6	95.5	76.1	86.7	54.7	78.0
ACFM (Lu & Tian, 2025)	86.7	95.4	76.5	87.7	-	-
Baseline (Luo et al., 2019)	85.0	94.0	75.7	87.4	50.0	75.2
Ours	86.4	95.5	76.8	90.3	53.8	80.5

Although some recent methods achieve stronger results on certain metrics, they often target different problem settings or introduce additional components that increase training and inference complexity. For example, Liu et al. (2024) propose MSAMGO for unsupervised ReID, which aggregates multi-view similarity and optimizes multi-level gaps to improve pseudo-label quality; its performance gains largely depend on clustering reliability and the handling of noisy pseudo labels. In contrast, our work focuses on supervised training and explicitly addresses scale- and view-induced distribution shift via multi-scale contrastive sample construction and consistency regularization. Che et al. (2025) improve ReID through the Uncertainty Feature Fusion Method (UFFM) and Auto-weighted Measure Combination (AMC), emphasizing robust similarity computation by fusing multi-view cues and combining multiple similarity measures; such designs can be effective but may require additional fusion and measure-combination steps, and are often used together with more elaborate similarity pipelines. Yuan et al. (2025) propose Pose2ID, a training-free feature centralization framework that leverages mechanisms such as identity-guided generation and neighborhood-based centralization to stabilize identity representations without conventional ReID training, which introduces extra procedures beyond a standard supervised pipeline. Lu and Tian (2025) introduce ACFM, which adaptively matches and aligns channel feature maps according to image content to mitigate misalignment, but this typically adds module-level complexity compared with simple deterministic fusion.

Different from these designs, our method aims to improve cross-scale and cross-view consistency mainly through training-time regularization: we build cross-source fused features and apply GDALoss to jointly enforce geometric-structure and distribution consistency, while keeping the inference pipeline lightweight (i.e., without pose estimation, generation, or complex fusion and matching at test time). As a result, even when the absolute improvements over some SOTA methods are not always the largest, our approach provides a favorable trade-off between accuracy, robustness, and deployability in practical surveillance scenarios.

4.4. Ablation Studies

To validate the effectiveness of each component, we conducted ablation experiments on three benchmark datasets: Market-1501, DukeMTMC-ReID, and MSMT17. Method 1 is the baseline model, in which both DBFS and GDALoss are disabled. Method 2 adds the proposed DBFS strategy to the baseline, splitting the high-level feature map into upper and lower regions for separate processing and thereby converting the baseline's global features into multiple local features. Method 3 adds multi-scale contrastive sample inputs to the baseline and constrains them with the proposed Geometric-Distribution Alignment Loss (GDALoss). Method 4 is the complete framework, which applies DBFS and GDALoss together. The results are presented in Table 3. Compared with the baseline, both GDALoss and DBFS yield improvements on most datasets and metrics.

Table 3.
Results of Ablation Experiments.

Methods	DBFS	GDALoss	Market-1501		DukeMTMC-ReID		MSMT17	
			mAP	Rank-1	mAP	Rank-1	mAP	Rank-1
1 (baseline)			85.0	94.0	75.7	87.4	50.0	75.2
2	√		85.2	95.0	74.6	88.7	50.4	78.6
3		√	85.7	94.7	76.5	88.2	52.6	76.2
4 (ours)	√	√	86.4	95.5	76.8	90.3	53.8	80.5

As shown in Table 3, introducing DBFS improves results on Market-1501 and MSMT17: mAP rises by 0.2% and 0.4%, and Rank-1 by 1.0% and 3.4%, respectively. On DukeMTMC-ReID, DBFS yields a Rank-1 improvement of 1.3% but causes an mAP decrease of 1.1%. This suggests that DBFS strengthens part-level discriminability and improves the top-1 match for many queries, yet it may disturb the overall ranking quality across the full gallery, to which mAP is more sensitive. In DukeMTMC-ReID, larger viewpoint changes, detection-box variations, and occlusions make a fixed upper and lower partition more prone to semantic misalignment; consequently, the split features can become less consistent across cameras, leading to a less stable similarity ordering beyond the top few retrieved samples. The mAP drop on DukeMTMC-ReID can therefore be regarded as a side effect of fixed local partitioning under severe misalignment, which GDALoss can compensate for.

When multi-scale contrastive inputs and GDALoss are applied (Method 3), the model improves on all datasets (mAP +0.7%, +0.8%, +2.6% and Rank-1 +0.7%, +0.8%, +1.0% on Market-1501, DukeMTMC-ReID, and MSMT17, respectively). More importantly, combining DBFS with GDALoss (Method 4) mitigates the partition-induced fragmentation by explicitly enforcing cross-scale and cross-view geometric and distribution consistency on cross-source features, thereby stabilizing the similarity structure and improving both Rank-1 and mAP. Overall, DBFS primarily enhances local alignment and discriminability, while GDALoss alleviates scale- and view-induced distribution shift; their combination addresses both challenges and yields the best overall performance.

4.5. Module Rationality Analysis

To verify the effectiveness of the proposed cross-fusion strategy and the rationality of the GDALoss design, we designed several groups of comparative experiments on the Market-1501 and DukeMTMC-ReID datasets. The settings are as follows: (a) baseline: the cross-fusion strategy is not adopted; instead, the original image features and multi-scale sample features are directly input into GDALoss for feature alignment; (b) Euclidean distance constraint only, with the target distance set to 0, which forces the multi-scale sample features to align in the geometric space; (c) Euclidean distance constraint only, with the target distance set to 1, consistent with GDALoss; (d) KL divergence constraint only, which enforces consistency of the feature distributions of multi-scale samples; (e) the proposed method, which adopts the cross-fusion strategy and the dual constraint mechanism. The experimental results are shown in Table 4.

Table 4.
Experimental Results of Module Rationality.

Methods	Market-1501		DukeMTMC-ReID	
	mAP	Rank-1	mAP	Rank-1
a	86.0	95.1	76.2	88.0
b	86.1	95.1	76.6	90.0
c	86.0	95.1	76.6	89.9
d	86.1	95.0	76.4	89.5
e (ours)	86.4	95.5	76.8	90.3

As shown in Table 4, compared with Method a, the proposed method improves mAP by 0.4% and 0.6% and Rank-1 by 0.4% and 2.3% on the Market-1501 and DukeMTMC-ReID datasets, respectively. This indicates that the cross-fusion strategy enhances the model's ability to learn discriminative features and facilitates the alignment of multi-scale local features with global features. Furthermore, compared with Methods b, c, and d, which each adopt a single constraint, the proposed method achieves the best performance on all metrics, supporting the effectiveness and necessity of the geometric-distribution dual constraint mechanism in multi-scale sample alignment.
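The single- and dual-constraint variants compared in Table 4 can be illustrated with the sketch below. It is not the paper's GDALoss implementation: it assumes the geometric term is the absolute deviation of the Euclidean distance from a target distance $σ$, the distribution term is the KL divergence between softmax-normalized features, and the two are combined as geometric + $α$ · KL. All function names are hypothetical.

```python
import math


def softmax(v):
    """Numerically stable softmax over a feature vector."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def geometric_term(f_a, f_b, sigma=1.0):
    """Assumed form: |Euclidean distance - target distance sigma|."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_a, f_b)))
    return abs(dist - sigma)


def kl_term(f_a, f_b):
    """Assumed form: KL(softmax(f_a) || softmax(f_b))."""
    p, q = softmax(f_a), softmax(f_b)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def alignment_loss(f_a, f_b, sigma=1.0, alpha=0.2, use_geo=True, use_kl=True):
    """Toggle the terms to mimic settings (b)-(e): geometric only, KL only, or both."""
    loss = 0.0
    if use_geo:
        loss += geometric_term(f_a, f_b, sigma)
    if use_kl:
        loss += alpha * kl_term(f_a, f_b)
    return loss


original = [0.3, -1.2, 0.8]      # feature of the original image (toy values)
transformed = [0.1, -1.0, 1.1]   # feature of its downsampled, flipped copy
dual = alignment_loss(original, transformed)                     # setting (e)
geo_only = alignment_loss(original, transformed, use_kl=False)   # setting (c)
```

Setting `sigma=0.0` with `use_kl=False` corresponds to setting (b), and `use_geo=False` to setting (d), under the assumptions stated above.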

4.6. Hyperparameter Analysis

To investigate the impact of the weight $α$ of the KL divergence term, the weight $β$ of the Geometric-Distribution Alignment Loss (GDALoss), and the target distance $σ$ in GDALoss on overall performance, we studied these three hyperparameters on the Market-1501 dataset. For $α$, we first weighted the geometric loss and the distribution loss equally, trained the model for 10 epochs, and observed the magnitudes of their average values. Based on this, we set the initial weight to approximately 0.2 so that the geometric loss and the distribution loss contribute in a balanced way. The initial setting of $β$ was obtained similarly; since $β$ weights an auxiliary term, we multiplied its ratio by a small coefficient of 0.1 to prevent it from dominating the training process. The parameters were then fine-tuned, and the experimental results are presented in Table 5.

Table 5.
Comparison of Hyperparameter Experimental Results.

Market-1501			Market-1501			Market-1501		
$α$	mAP	Rank-1	$β$	mAP	Rank-1	$σ$	mAP	Rank-1
0.1	86.1	95.0	0.1	86.1	95.3	0	86.1	95.1
0.2	86.4	95.5	0.2	86.4	95.5	0.5	86.2	95.1
0.3	86.2	95.1	0.3	86.3	95.3	1	86.4	95.5
0.4	86.2	95.0	0.4	86.2	95.1	1.5	86.1	95.2
0.5	86.1	94.7	0.5	86.2	94.8	2	86.1	95.0

As shown in Table 5, the weight $α$ of the KL divergence term and the weight $β$ of the Geometric-Distribution Alignment Loss (GDALoss) both affect retrieval performance. When both $α$ and $β$ are set to 0.2, the model achieves the best results on Market-1501, reaching 86.4% mAP and 95.5% Rank-1 accuracy. This indicates that under this setting, the model attains a better balance between representation learning and feature alignment, improving feature discriminability while maintaining robustness.

These observations also support the rationality of our hyperparameter initialization strategy. The initial $α$ value (around 0.2), determined from preliminary observations, lies in the peak-performance region, indicating that calibrating the magnitude of the loss terms to balance their contributions is both effective and efficient. Likewise, the optimal $β = 0.2$ shows that appropriately regularizing cross-scale geometric relations and distribution alignment is important for performance.

Regarding the target distance $σ$, the best performance is achieved when $σ = 1$. This indicates that choosing $σ$ within a suitable range can effectively reduce intra-identity discrepancies across scales and views: an overly small $σ$ imposes overly strong constraints that risk representation collapse, whereas an overly large $σ$ imposes overly weak constraints that insufficiently mitigate distribution shift.

4.7. The Impact of Different Partition Ratios on FCNet

In this section, we report the retrieval performance of FCNet on Market-1501 under different stripe partition strategies in terms of mAP and Rank-1. The experimental results are summarized in Table 6.

Table 6.
Different Partition Ratios Analysis.

		Market-1501	
Methods	Ratio	mAP	Rank-1
Uniform Partition	1 Part (baseline)	85.0	94.0
Uniform Partition	2 Parts (ours)	86.4	95.5
Uniform Partition	4 Parts	86.1	94.9
Uniform Partition	8 Parts	84.9	93.5
Non-uniform Division	4:6 Ratio	86.0	95.3
Non-uniform Division	6:4 Ratio	86.0	94.6

As shown in Table 6, under Uniform Partition, the 1 Part (baseline) setting relies only on global features and achieves 85.0 mAP and 94.0 Rank-1. Introducing moderate local modeling improves performance noticeably: 2 Parts achieves the best results in the table with 86.4 mAP and 95.5 Rank-1, indicating that a coarse division of the human body into upper and lower regions can effectively capture complementary discriminative cues, such as clothing patterns in the upper body and shape information in the lower body, thereby improving matching robustness. When the stripes are further subdivided, the gain does not continue: 4 Parts drops slightly to 86.1 mAP and 94.9 Rank-1, while 8 Parts degrades clearly to 84.9 mAP and 93.5 Rank-1, below the baseline. This suggests that overly fine partitioning reduces the amount of meaningful person information in each local region and makes the representation more sensitive to pose variation, bounding-box misalignment, occlusion or truncation, and background clutter, which weakens the stability of local features and increases false matches.

In addition, the Non-uniform Division results further support the observation that informative cues are not uniformly distributed across the body. With ratios of 4:6 and 6:4, the performance reaches 86.0 mAP and 95.3 Rank-1, and 86.0 mAP and 94.6 Rank-1, respectively, which are slightly below the 2-part setting. Overall, the results demonstrate that a two-part coarse partition provides the best balance between accuracy and robustness; non-uniform division offers a stable and practical alternative, whereas excessive subdivision degrades performance.
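The uniform and non-uniform stripe partitions compared above can be sketched as follows. This is an illustrative example rather than the paper's implementation: the feature map is a plain list of rows, each stripe is reduced by average pooling, and the feature-map height of 24 used in the example is a hypothetical value.

```python
def split_rows(height, ratios):
    """Return (start, end) row ranges that split `height` rows according to `ratios`."""
    total = sum(ratios)
    bounds, acc = [0], 0
    for r in ratios[:-1]:
        acc += r
        bounds.append(round(height * acc / total))
    bounds.append(height)
    return list(zip(bounds[:-1], bounds[1:]))


def pooled_parts(feature_map, ratios):
    """Average-pool each horizontal stripe of a 2-D map (list of rows) into one value."""
    parts = []
    for start, end in split_rows(len(feature_map), ratios):
        values = [v for row in feature_map[start:end] for v in row]
        parts.append(sum(values) / len(values))
    return parts


two_parts = split_rows(24, [1, 1])       # uniform upper/lower split
four_six = split_rows(24, [4, 6])        # non-uniform 4:6 split
eight_parts = split_rows(24, [1] * 8)    # each stripe keeps only 3 rows
```

With eight parts, each stripe covers only three rows of a 24-row map, which illustrates why fine partitions carry less person information per region and are more sensitive to small vertical misalignments.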

4.8. The Impact of Downsampling Ratio on FCNet

To systematically evaluate how different downsampling strategies affect recognition performance, we conduct two groups of comparative experiments on the Market-1501 dataset: proportional scaling and non-proportional scaling. Proportional scaling preserves the original aspect ratio and is used to isolate the effect of scaling magnitude, whereas non-proportional scaling deliberately breaks the aspect ratio to introduce geometric distortion, enabling us to analyze the impact of ratio distortion on feature learning and spatial alignment.

For proportional scaling, we select two representative resolutions, 192 × 64 and 336 × 112. The former represents a more aggressive downsampling that can discard fine-grained textures and local structural cues, while the latter maintains the standard $3 : 1$ person aspect ratio and introduces a moderate scale perturbation that better matches the multi-scale variations commonly observed in real-world surveillance. For non-proportional scaling, we intentionally use two unbalanced resolutions, 336 × 128 and 384 × 112, to induce aspect-ratio distortion and examine its influence on recognition performance.

As shown in Table 7, proportional scaling achieves the best performance at 336 × 112, and performance degrades as the downsampling becomes more aggressive (e.g., 192 × 64), indicating that excessive resolution reduction weakens the representation of fine-grained details. The two non-proportional settings yield comparable results and are slightly inferior to the best proportional-scaling configuration, suggesting that aspect-ratio distortion introduces misalignment and structural inconsistency that can hinder discriminative feature learning. Overall, 336 × 112 provides the best trade-off between preserving fine-grained information and introducing a realistic scale perturbation.

Table 7.
Comparison of Different Scaling Ratios on Market-1501.

Methods	Resolution	mAP	Rank-1
Proportional scaling	192 × 64	81.7	93.1
Proportional scaling	336 × 112	86.4	95.5
Non-proportional scaling	336 × 128	86.3	95.4
Non-proportional scaling	384 × 112	86.1	95.3
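The distinction between the two groups can be checked with a short aspect-ratio computation. The sketch below takes 384 × 128 (the input resolution stated in Section 4.9) as the reference, which is an assumption of this example, and uses hypothetical function names.

```python
from fractions import Fraction


def aspect_ratio(height, width):
    """Exact height-to-width ratio of a resolution."""
    return Fraction(height, width)


def is_proportional(size, reference=(384, 128)):
    """True if `size` (height, width) keeps the aspect ratio of `reference`."""
    return aspect_ratio(*size) == aspect_ratio(*reference)


for size in [(192, 64), (336, 112), (336, 128), (384, 112)]:
    label = "proportional" if is_proportional(size) else "non-proportional"
    print(size, aspect_ratio(*size), label)
```

Both 192 × 64 and 336 × 112 keep the 3 : 1 ratio, whereas 336 × 128 (21 : 8) and 384 × 112 (24 : 7) change it, so the image is scaled differently along height and width.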

4.9. Efficiency Analysis

All efficiency evaluations were conducted on a machine equipped with a 12th Gen Intel(R) Core(TM) i5-12600KF CPU and an NVIDIA GeForce RTX 4060 Ti GPU (16 GB VRAM). Experiments were implemented under the PyTorch 1.10.0/CUDA 11.3 framework, with the input image resolution set to 384 × 128 pixels. We used the Market-1501 dataset for this evaluation, and the detailed results are reported in Table 8.

Table 8.
Complexity Analysis.

Methods	FLOPs (G)	Params (M)	Inference Time (ms/img)	Throughput (img/s)	PeakMem (GB)	mAP	Rank-1
OSNet	2.18	1.28	1.84	541.65	0.49	84.9	94.8
PCB	6.13	25.99	2.12	470.96	0.42	81.6	93.8
Tran + pose2ID	28.83	105.15	7.09	140.92	0.83	82.6	95.5
FCNet	6.13	28.64	2.21	450.80	0.43	86.4	95.5

As shown in Table 8, OSNet is the most deployment-friendly model, with the smallest parameter count (1.28 M) and the fastest inference speed (1.84 ms/img). However, its retrieval accuracy (mAP = 84.9) still lags behind that of FCNet. In comparison, PCB has a clearly higher computational cost (FLOPs = 6.13 G, Params = 25.99 M) yet attains lower accuracy (mAP = 81.6). Transformer + Pose2ID further expands both FLOPs and parameter scale (28.83 G / 105.15 M), resulting in much higher inference latency (7.09 ms/img) and lower throughput (140.92 img/s), indicating a substantially heavier efficiency burden.

Notably, FCNet has the same FLOPs as PCB (6.13 G) and a similar peak memory footprint (0.43 GB vs. 0.42 GB), with only a slight increase in inference time (2.21 ms/img, compared with 2.12 ms/img for PCB). Despite this small overhead, FCNet achieves the highest mAP in the table (86.4) and ties Transformer + Pose2ID for the best Rank-1 (95.5). These results indicate that FCNet's performance gains are not achieved by significantly increasing model size or computation, but rather by improving feature representation and matching quality at very limited additional cost, leading to a more favorable accuracy–efficiency trade-off.

Figure 3.

Visualization results diagram.

4.10. Visualization

To intuitively evaluate the robustness of the proposed method, we visualize retrieval results on three datasets (Market-1501, DukeMTMC-ReID, and MSMT17) in Figure 3. For each dataset, we present representative query images and compare the Top-K retrieved gallery samples produced by the baseline and our FCNet, covering both successful and failure cases. In the figure, green bounding boxes indicate correct matches (same identity), while red boxes denote incorrect matches. As shown, FCNet returns more correct matches at higher ranks than the baseline across datasets, demonstrating a clear performance gap and verifying the effectiveness and robustness of FCNet.

Although FCNet can significantly improve retrieval quality for most queries, several typical failure modes can still be observed in the Failed Cases of Figure 3. These errors mainly stem from two scenarios: identity confusion caused by highly similar appearances and missing local cues due to severe occlusion or truncation. In the former case, the model may mistakenly treat clothing color, backpack shape, and other “similar appearance” cues as decisive evidence of the same identity; in the latter case, when informative human cues are insufficient, the model is forced to rely more on weak signals such as background regions or coarse color patches, which leads to mismatches. Meanwhile, these failure examples also highlight challenging situations that remain difficult for the current method, providing useful guidance for future improvements in fine-grained discriminative cue modeling and occlusion robustness.

5. Conclusion

This paper proposes a dual-branch person re-identification (Re-ID) framework based on feature consistency to mitigate the adverse impact of scale variations. Specifically, we introduce a lightweight feature-map decoupling strategy that splits high-level representations into two complementary regions (upper and lower parts). Each region is optimized with an independent global average pooling (GAP) head and classifier, which strengthens fine-grained representation learning and improves the stability of local feature alignment. In addition, we propose the Geometric–Distribution Alignment Loss (GDALoss) to explicitly regularize cross-scale representations by jointly constraining feature distances and distribution discrepancies between the original images and their deterministically transformed counterparts. Extensive experiments on three public benchmarks demonstrate that the proposed method consistently improves both accuracy and robustness. Owing to its lightweight design without relying on pose estimation or human parsing, the proposed framework is well suited for practical cross-camera Re-ID systems in real-world surveillance and retrieval applications, where heterogeneous cameras and scale variations are common.

5.1. Limitations and Future Work

Although the proposed vertical decoupling is lightweight and effective, it may be less optimal under extremely severe occlusion, large pose changes, or inaccurate detections, where the upper–lower partition may not perfectly correspond to semantic body parts. In future work, we will explore adaptive or content-aware partitioning, incorporate stronger augmentation and domain generalization strategies, and extend the framework to video-based Re-ID and multi-camera tracking settings to further improve robustness in real-world deployments.

Footnotes

Acknowledgements

This work is funded by the National Natural Science Foundation of China (Grant No. 62066036) and supported by the Basic scientific research business fee project for directly affiliated universities in Inner Mongolia Autonomous Region (Grant No. 2023XKJX020), and the Scientific Research Special Project for First-Class Disciplines in Inner Mongolia Autonomous Region (Grant No. YLXKZX-NKD-001).

ORCID iD

Yunfeng Zhai

Ethical Approval

This study did not involve human or animal subjects, and thus, no ethical approval was required.

Author's Contribution

Yunfeng Zhai (first author): Proposed the core idea of the dual-branch person re-identification method based on feature consistency, including the design of the Dual-Branch Feature Separation (DBFS) strategy and the Geometric-Distribution Alignment Loss (GDALoss).

Implemented the entire experimental framework, including code writing (based on PyTorch), model training (ResNet50 backbone optimization), and result validation on three benchmark datasets (Market-1501, DukeMTMC-ReID, MSMT17).

Drafted the initial version of the manuscript, including the abstract, introduction, methodology, and experimental evaluation sections.

Xiaojian Pan: Conducted an in-depth literature review on person re-identification (Re-ID) techniques, especially focusing on local feature extraction and multi-scale feature alignment methods (supporting the “Related Work” chapter).

Processed experimental data, including dataset preprocessing (image resizing, data augmentation such as horizontal flipping and downsampling) and statistical analysis of experimental results (calculation of mAP and Rank-1 metrics).

Designed and drew key figures and tables in the manuscript, such as the network architecture diagram (Figure 1), feature fusion diagram (), and experimental result Tables 1–8.

Qian Wang: Led the ablation study (Section 4.4) to verify the effectiveness of the DBFS strategy and GDALoss, including designing comparative experimental schemes (Methods 1–4) and analyzing the impact of each module on model performance.

Completed the hyperparameter analysis (Section 4.6) for GDALoss (hyperparameters α and β) and the downsampling ratio experiment (Section 4.8), optimizing the model's hyperparameter configuration.

Assisted in code optimization, including improving the efficiency of multi-scale sample generation and optimizing the training process to reduce overfitting.

Zhijie Chen: Verified the rationality of the multi-source cross-fusion strategy (Section 4.5) and the GDALoss dual-constraint mechanism (geometric alignment + distribution alignment), designing comparative experiments (Methods a–e) to validate module effectiveness.

Conducted the visualization analysis of retrieval results (Section 4.10), including selecting representative query images from datasets and comparing retrieval performance between the proposed FCNet and the baseline model.

Assisted in revising the manuscript, including refining the logic of the “Methodology” section and supplementing discussions on experimental results.

Jianjun Li: Conceived and supervised the entire research project, determining the research direction and technical route.

Provided financial support (via National Natural Science Foundation of China and Inner Mongolia Autonomous Region research grants) and academic guidance, including optimizing the design of GDALoss and solving key technical problems in model training.

Reviewed and revised the manuscript critically, ensuring the accuracy of technical content and compliance with academic norms; finalized the manuscript and handled correspondence with journals.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Basic scientific research business fee project for directly affiliated universities in Inner Mongolia Autonomous Region (grant number 2023XKJX020), the Scientific Research Special Project for First-Class Disciplines in Inner Mongolia Autonomous Region (grant number YLXKZX-NKD-001), and the National Natural Science Foundation of China (grant number 62066036).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Code Availability

The code generated and used during the current study is part of a future patent application. To protect intellectual property, it is not publicly available at this time. After the patent application has been submitted, the code can be requested from the corresponding author (xidianjj@163.com).

Data Availability

We used freely available datasets: Market-1501, DukeMTMC-ReID, and MSMT17. The datasets can be accessed from their official or commonly used sources: Market-1501 from https://paperswithcode.com/dataset/market-1501, DukeMTMC-ReID from https://paperswithcode.com/dataset/dukemtmc-reid, and MSMT17 from its official project page at . Alternatively, the data can be obtained from the respective authors upon reasonable request.

	Market-1501			Market-1501			Market-1501
$α$	mAP	rank-1	$β$	mAP	rank-1	$σ$	mAP	rank-1
0.1	86.1	95.0	0.1	86.1	95.3	0	86.1	95.1
0.2	86.4	95.5	0.2	86.4	95.5	0.5	86.2	95.1
0.3	86.2	95.1	0.3	86.3	95.3	1	86.4	95.5
0.4	86.2	95.0	0.4	86.2	95.1	1.5	86.1	95.2
0.5	86.1	94.7	0.5	86.2	94.8	2	86.1	95.0

Table 1. Statistics of Used Datasets.
Dataset	Total IDs	Training IDs	Gallery IDs	Images	Gallery Images	Cameras
Market-1501	1501	751	750	32,668	19,732	6
DukeMTMC-ReID	1404	702	702	36,411	17,661	8
MSMT17	4101	1041	3060	126,441	82,161	15

Table 3. Results of Ablation Experiments (√ = component used, - = component not used).
Methods	DBFS	GDALoss	Market-1501 mAP	Market-1501 rank-1	DukeMTMC-ReID mAP	DukeMTMC-ReID rank-1	MSMT17 mAP	MSMT17 rank-1
1 (baseline)	-	-	85.0	94.0	75.7	87.4	50.0	75.2
2	√	-	85.2	95.0	74.6	88.7	50.4	78.6
3	-	√	85.7	94.7	76.5	88.2	52.6	76.2
4 (ours)	√	√	86.4	95.5	76.8	90.3	53.8	80.5
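
As a quick check on Table 3, the gain of each configuration over the baseline can be computed directly from the reported numbers. The sketch below is illustrative only: the dictionary layout, the configuration labels, and the `gains_over_baseline` helper are assumptions introduced here and are not part of the authors' code; the values are copied from the table.

```python
# Ablation results from Table 3 as (mAP, rank-1) per dataset.
# The keys and layout are illustrative assumptions, not the authors' code.
ABLATION = {
    "baseline": {
        "Market-1501": (85.0, 94.0),
        "DukeMTMC-ReID": (75.7, 87.4),
        "MSMT17": (50.0, 75.2),
    },
    "DBFS": {
        "Market-1501": (85.2, 95.0),
        "DukeMTMC-ReID": (74.6, 88.7),
        "MSMT17": (50.4, 78.6),
    },
    "GDALoss": {
        "Market-1501": (85.7, 94.7),
        "DukeMTMC-ReID": (76.5, 88.2),
        "MSMT17": (52.6, 76.2),
    },
    "DBFS+GDALoss": {
        "Market-1501": (86.4, 95.5),
        "DukeMTMC-ReID": (76.8, 90.3),
        "MSMT17": (53.8, 80.5),
    },
}


def gains_over_baseline(results, baseline="baseline"):
    """Return {config: {dataset: (delta_mAP, delta_rank1)}} in percentage points."""
    base = results[baseline]
    return {
        config: {
            dataset: (
                round(m_ap - base[dataset][0], 1),
                round(rank1 - base[dataset][1], 1),
            )
            for dataset, (m_ap, rank1) in scores.items()
        }
        for config, scores in results.items()
        if config != baseline
    }


GAINS = gains_over_baseline(ABLATION)

for config, per_dataset in GAINS.items():
    for dataset, (d_map, d_rank1) in per_dataset.items():
        print(f"{config:>13} | {dataset:<13} | mAP {d_map:+.1f} | rank-1 {d_rank1:+.1f}")
```

Under these numbers, the full model (DBFS plus GDALoss) gains 1.4 mAP and 1.5 rank-1 points on Market-1501, while DBFS alone trades 1.1 mAP points for 1.3 rank-1 points on DukeMTMC-ReID.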

Table 4. Experimental Results of Module Rationality.
Methods	Market-1501 mAP	Market-1501 rank-1	DukeMTMC-ReID mAP	DukeMTMC-ReID rank-1
a	86.0	95.1	76.2	88.0
b	86.1	95.1	76.6	90.0
c	86.0	95.1	76.6	89.9
d	86.1	95.0	76.4	89.5
e (ours)	86.4	95.5	76.8	90.3

Table 7. Comparison of Different Scaling Ratios on Market-1501.
Methods	Resolution	mAP	Rank-1
Proportional scaling	192 × 64	81.7	93.1
Proportional scaling	336 × 112	86.4	95.5
Non-proportional scaling	336 × 128	86.3	95.4
Non-proportional scaling	384 × 112	86.1	95.3
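
The grouping in Table 7 can be read as a statement about aspect ratio. The following sketch checks which settings share the same height-to-width ratio. It assumes that each resolution is written as height × width (the usual Re-ID convention, though the table does not state it) and that "proportional" means this ratio is preserved; the `aspect_ratio` helper and the dictionary layout are hypothetical names introduced for illustration.

```python
from fractions import Fraction

# Resolutions from Table 7, assumed to be (height, width) pairs.
RESOLUTIONS = {
    "proportional": [(192, 64), (336, 112)],
    "non-proportional": [(336, 128), (384, 112)],
}


def aspect_ratio(height, width):
    """Exact height-to-width ratio as a reduced fraction."""
    return Fraction(height, width)


RATIOS = {
    group: [aspect_ratio(h, w) for h, w in sizes]
    for group, sizes in RESOLUTIONS.items()
}

for group, sizes in RESOLUTIONS.items():
    for (h, w), ratio in zip(sizes, RATIOS[group]):
        print(f"{group:<17} {h} x {w}: ratio {ratio.numerator}:{ratio.denominator}")
```

Under this reading, both proportional settings reduce to 3:1, whereas 336 × 128 and 384 × 112 give 21:8 and 24:7.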