Dual-prompt complementary fusion network for RGBT tracking

Abstract

RGBT target tracking is an important downstream task of object tracking. Compared with visible-light tracking, however, it suffers from much smaller datasets, which makes it difficult to reach comparable performance. Two questions follow: how to combine the complementary characteristics of the visible and thermal modalities effectively, and how to exploit models already trained for visible-light tracking while keeping the computational cost low and the tracking performance high. To address them, a dual-prompt complementary fusion network for RGBT tracking is proposed. Drawing on prompt learning, the network extends the strong performance of visible-light trackers to the RGBT domain. The prompt module feeds both visible and thermal modality information into the backbone network as dual prompts, and the network uses these prompts to generate new, enriched prompt information at each layer. An information enhancement fusion module then enhances the prompt information and feeds it back into the backbone network to improve tracking accuracy and robustness. Experiments on the GTOT, RGBT234 and LasHeR datasets show that the network reaches a tracking accuracy (PR) / success rate (SR) of 93.1%/76.8%, 84.4%/62.4% and 66.8%/53.8%, respectively, improving on current mainstream RGBT trackers and verifying its effectiveness.

Keywords

RGBT target tracking, prompt learning, transformer

1. Introduction

Visual target tracking constitutes a foundational task within the realm of computer vision, with visible and thermal infrared target tracking (RGBT target tracking) representing a crucial extension of this domain. In real-world applications, tracking based solely on visible light imagery often struggles under complex environmental conditions such as fog, rain, and extreme lighting scenarios. Conversely, thermal infrared imagery excels under these circumstances due to its insensitivity to light variations and its capability to penetrate smoke. The advent of advanced visible light and thermal infrared camera technologies in recent years has propelled RGBT target tracking to the forefront of research, finding extensive application across video surveillance, robotics, and autonomous driving, among others. The pivotal challenge in RGBT target tracking lies in effectively harnessing both visible light and thermal infrared modalities to exploit their complementary strengths.

At present, mainstream RGBT object tracking algorithms are based on multi-domain learning or on the Siamese network structure.¹ RGBT trackers based on multi-domain learning produce candidate box regions with a fixed aspect ratio that contain only local features, so they cannot adapt flexibly to changes in the shape of the target. Because they also include an online update module, their real-time performance is poor. RGBT trackers based on the Siamese network need a large amount of data for offline training to obtain a robust model, yet RGBT tracking datasets are still small compared with visible-light tracking datasets. As a result, Siamese-based RGBT trackers often fail to reach the desired performance level. The Transformer² was originally applied to natural language processing, but its powerful sequence modeling capabilities and attention mechanism led researchers to extend it to computer vision. In object tracking, Wang et al.³ proposed TMT, the first application of the Transformer to a target tracking task. The Transformer was subsequently widely adopted for single-target tracking, and a series of effective trackers appeared, such as TransT,⁴ Stark,⁵ and OSTrack.⁶ Using the Transformer for target tracking helps extract the global features of the object.

Prompt learning, propelled by the rapid evolution of large-scale models, has gained prominence in natural language processing. This led Jia et al. to introduce Visual Prompt Tuning (VPT),⁷ bringing the concept of prompt learning into computer vision. This approach enables the application of large models to downstream tasks via prompt information. Given that RGBT target tracking is a derivative task of visible light target tracking, large models pre-trained on the latter can be adeptly applied to RGBT target tracking challenges. ProTrack⁸ first implemented this concept within RGBT tracking, utilizing a simple color transformation on thermal infrared images as a prompt. This method combines the visible light image with prompt information to form a novel three-channel input for network processing. Similarly, ViPT⁹ adopted a comparable strategy by integrating multiple lightweight modal complementary prompters into the foundational model, with thermal modal information serving as the prompt. Prompt learning achieves noteworthy outcomes with minimal parameter augmentation. Unlike comprehensive fine-tuning, prompt learning freezes the backbone network’s parameters, focusing fine-tuning efforts solely on the prompt module. This approach significantly curtails computational resource and time requirements, offering a more streamlined solution. However, these algorithms typically employ a single modality as prompt information, which can lead to inaccuracies in tracking outcomes if the prompt modality is compromised or absent. Additionally, this overlooks the complementary nature of the two modalities, potentially limiting tracking performance in complex scenes due to the underutilization of available complementary information.

In practical applications, thermal infrared images may have low resolution or be missing altogether, while visible light images, despite their higher resolution, can degrade under lighting fluctuations. Relying on a single modality for prompt information can therefore compromise tracking accuracy. To address this, this paper introduces a Dual-Prompt Complementary Fusion Network for RGBT Tracking (DPCFT). This network features a dual-prompt information structure that fuses interaction information from both visible and thermal modalities as dual prompts. These fused prompts are then integrated into the backbone network, with subsequent enhancement and reintegration performed by an information enhancement fusion module. This interactive fusion within the prompt module exploits the complementary attributes of both modalities, thereby refining tracking accuracy with only a minor increase in parameter count.

The principal contributions of this work are outlined as follows:

The introduction of a dual-prompt complementary fusion tracking network that leverages dual prompt messages to mitigate the tracking accuracy issues associated with poor-quality single modalities.

The development of an information enhancement fusion module that applies spatial and channel attention mechanisms to augment and amalgamate prompt information, optimally harnessing the modalities’ beneficial characteristics.

The demonstration of the algorithm’s superior performance across multiple benchmark datasets, affirming its efficacy and potential in enhancing RGBT target tracking capabilities.

2. Related work

2.1. RGBT target tracking

RGBT target tracking is a downstream task of visual target tracking and currently follows three main paradigms. The first is RGBT tracking based on multi-domain learning. This method extracts candidate box regions, fuses the features of the candidate regions from the different modalities, and obtains the tracking result for the current target through binary classification and bounding box regression. MANet¹⁰ designs three different adapters to extract modality-specific, modality-shared, and instance-aware features, respectively, in order to better exploit the complementary properties of the visible and thermal infrared modalities. MFGNet¹¹ uses a dynamic modality-aware generation module that adaptively adjusts the convolution kernels according to the input images during tracking, enhancing the information interaction between the visible and thermal infrared modalities. APFNet¹² designs an attribute-based progressive fusion network with five challenge branches that adaptively aggregate attribute-specific fusion features to improve the fusion capability. These multi-domain learning-based approaches produce candidate box regions with fixed aspect ratios that contain only local features, so they cannot adapt flexibly to changes in the target shape.

The second is the Siamese-based RGBT tracking algorithm, which uses two identical network branches to extract features from the visible and thermal infrared modalities, respectively. The extracted features of both modalities are then fused by a fusion module, and the fused features are fed into the head for classification and regression. SiamFT¹³ extends SiamFC¹⁴ with a feature fusion branch that fuses visible and thermal modal features. DSiamFT¹⁵ uses a channel attention mechanism to fuse the features of the template frame while keeping the search frame unchanged. Siamese-based RGBT trackers achieve high speed because they do not require online updates, but they need a large amount of training data to obtain robust results.

Finally, Transformer-based RGBT tracking algorithms have attracted considerable attention owing to the wide application and good results of the Transformer² in visual target tracking. Because the Transformer structure captures global features, it acquires them more effectively than the two previous paradigms, which use CNN structures. ViPT⁹ adopts prompt learning on top of OSTrack⁶ and inputs the thermal infrared features into the backbone network as prompt information. It freezes the backbone network parameters and fine-tunes only the prompt module, obtaining good tracking results with fewer trained parameters. TBSI¹⁶ extends the ViT¹⁷ backbone network into a dual-stream structure that extracts the features of the visible and thermal infrared modalities separately. To make full use of their complementary nature, it designs a template-bridging interaction fusion module for cross-modal interaction and achieves good results.

2.2. Prompt learning

The “pre-training, prompt” paradigm first appeared in the field of natural language processing and gradually replaced the “pre-training, fine-tuning” paradigm as the mainstream. This paradigm allows the base model to adapt to different types of tasks by adding prompts to the model’s input. Compared to the popular fine-tuning paradigm, the prompt paradigm can update only the parameters of the prompt part while freezing the parameters of the backbone network, which significantly reduces the consumption of computational resources and time. In addition, the prompt paradigm is able to achieve performance comparable to fine-tuning while significantly reducing memory footprint. Due to the prompt’s promising results in the field of natural language processing, researchers have started to investigate its application to the field of computer vision. VPT⁷ applies prompt to the vision Transformer, adapting a large-scale pretrained model to a variety of downstream tasks by fine-tuning a very small number of learnable parameters. DualPrompt,¹⁸ on the other hand, employs a dual-prompt approach, where one set of prompts is used to encode task-invariant instructions and the other set of prompts is used to encode task-specific instructions, enabling the model to adapt to new categories while retaining the memory of the old ones. In the field of RGBT target tracking, algorithms such as ProTrack⁸ and ViPT⁹ have introduced prompt learning to the field with good results.

3. Method

In this section, our proposed RGBT tracking model DPCFT is described in detail, as shown in Figure 1. It mainly consists of a ViT backbone network, a multi-prompt complementary fusion (MPCF) branch, and a localization head. The multi-prompt complementary fusion branch includes a Prompt (P) module and an Information Enhancement Fusion (IEF) module.

Figure 1.

Overall architecture of the proposed DPCFT.

3.1. Overview of our network architecture

First, the input visible and thermal infrared images are split into patches and flattened into patch sequences. Two pieces of prompt information are then obtained through the initial prompt module and fed, along with the visible light patches, into the Vision Transformer (ViT) backbone network. The output of each backbone layer passes through the multi-prompt complementary fusion branch. Within this branch, the output of the backbone layer and the two pieces of prompt information from the previous prompt module are processed by the prompt module again to generate two new pieces of prompt information. These are then passed through the information enhancement fusion module, which produces the prompt information for the next layer. This prompt information is added to the output of the previous backbone layer before being input into the subsequent layer. Finally, the features obtained from the backbone network are fed into the tracking localization head to predict the current state of the target. The design of the tracking localization head is similar to that of OSTrack.⁶

3.2. ViT backbone

With the development of the Transformer, the Vision Transformer has been widely used in computer vision tasks, including image classification,¹⁷ image denoising,¹⁹ and remote sensing image processing.²⁰ In this paper, the ViT used for image classification serves as the backbone network. The initial visible and thermal infrared video frames are used as template images $Z_{rgb} \in R^{H_{z} \times W_{z} \times 3}$, $Z_{t} \in R^{H_{z} \times W_{z} \times 3}$, and the visible and thermal infrared images in subsequent frames are used as search images $X_{rgb} \in R^{H_{x} \times W_{x} \times 3}$, $X_{t} \in R^{H_{x} \times W_{x} \times 3}$. The template and search images are first split into patches, flattened, and embedded to give $H_{rgb}^{z}, H_{t}^{z} \in R^{N_{z} \times C}$ and $H_{rgb}^{x}, H_{t}^{x} \in R^{N_{x} \times C}$, where $N_{z} = H_{z} W_{z} / P^{2}$, $N_{x} = H_{x} W_{x} / P^{2}$, $P$ is the patch size, and $C$ is the patch dimension. The template patches and search patches are then concatenated, and a one-dimensional learnable position encoding is added to obtain the visible light patches and thermal infrared patches. The visible light patches are input into the backbone network for feature extraction and information interaction. Both the visible light patches and the thermal infrared patches are input into the multi-prompt complementary fusion branch to generate prompt information, which is then input into the backbone network. Finally, the features obtained through the Transformer encoder are input into the localization head to produce the final tracking result. The computation at each backbone layer is:

$$H_{rgb}^{l} = E^{l}\left(H_{rgb}^{l-1}, P^{l}\right), \quad l = 1, 2, \dots, N \tag{1}$$

where $E^{l}$ represents the Transformer encoder at layer $l$, $H_{rgb}^{l-1}$ and $H_{rgb}^{l}$ represent the input and output sequences of layer $l$, and $P^{l}$ represents the prompt information at layer $l$.
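The layer-wise recursion of Eq. (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `make_toy_encoder` is a hypothetical stand-in for a frozen Transformer encoder layer, and the prompts are precomputed random arrays, whereas in DPCFT each prompt is produced by the prompt and IEF modules from the previous layer's outputs. The element-wise addition of the prompt to the previous layer's output follows the description in Section 3.1.

```python
import numpy as np

def run_backbone(h_rgb, prompts, encoders):
    """Apply Eq. (1) layer by layer: H^l = E^l(H^{l-1}, P^l)."""
    h = h_rgb
    for encoder, prompt in zip(encoders, prompts):
        # The prompt is added to the previous layer's output
        # before it enters the next encoder layer (Section 3.1).
        h = encoder(h + prompt)
    return h

def make_toy_encoder(weight):
    """Hypothetical stand-in for one frozen Transformer encoder layer."""
    def encoder(tokens):
        # A residual non-linear map; a real layer would use MSA, LN and FFN.
        return tokens + np.tanh(tokens @ weight)
    return encoder

rng = np.random.default_rng(0)
num_layers, num_tokens, dim = 3, 320, 16  # toy sizes chosen for illustration
encoders = [make_toy_encoder(rng.normal(scale=0.1, size=(dim, dim)))
            for _ in range(num_layers)]
prompts = [rng.normal(scale=0.01, size=(num_tokens, dim))
           for _ in range(num_layers)]
h0 = rng.normal(size=(num_tokens, dim))
h_out = run_backbone(h0, prompts, encoders)
print(h_out.shape)
```

Because the prompt enters only through an addition, the backbone weights can stay frozen while the prompt-generating modules are trained.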

The backbone network is composed of $N$ Transformer Encoders, with the structure of each Transformer Encoder depicted on the right side of Figure 1. This structure includes Multi-head Self-Attention (MSA), Layer Normalization (LN), a Feed-Forward Network (FFN), and residual connections. The formula for calculating attention is as follows. For simplicity, superscripts are omitted.

$$A = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{C}}\right) V = \mathrm{softmax}\left(\frac{H_{r} W_{q} (H_{r} W_{k})^{T}}{\sqrt{C}}\right) (H_{r} W_{v}) \tag{2}$$

where $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively, and $W_{q}, W_{k}, W_{v}$ represent weight matrices.
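Eq. (2) can be checked numerically with a short NumPy sketch. It covers a single attention head only; the function names and the toy sizes are assumptions made for illustration, and the scaling uses the last dimension of $Q$ in place of $C$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, w_q, w_k, w_v):
    """Single-head attention of Eq. (2) on a token sequence h of shape (N, C)."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    c = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (N, N), rows sum to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 6, 4  # toy sizes
h = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
out, attn = self_attention(h, w_q, w_k, w_v)
print(out.shape, attn.sum(axis=-1))
```

Every token attends to every other token, which is why the template and search tokens can exchange information inside the backbone once they are concatenated.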

Figure 2.

The overall architecture of prompt module.

3.3. Multi-prompt complementary fusion branch

3.3.1. Prompt module

Inspired by ViPT,⁹ the prompt module generates two types of prompt information: visible light self-enhancement prompt information and thermal infrared prompt information that fuses with visible light features. This allows base models pre-trained on large-scale RGB datasets to adapt to downstream RGBT target tracking tasks. Specifically, the visible light prompt is used to enhance visible light features, while the thermal infrared prompt is employed for cross-modal interaction, facilitating partial fusion between visible light and thermal infrared modalities. The design of the prompter is shown in Figure 2, with its generation process as follows:

$$I_{rgb}^{l}, I_{t}^{l} = P^{l}\left(I_{rgb}^{l-1}, I_{t}^{l-1}\right), \quad l = 1, 2, \dots, N \tag{3}$$

where $P^{l}$ represents the prompter module and $I$ represents the prompt information output at each layer.

The prompt module performs identical operations on both the template and search sequences; the generation of thermal infrared prompt information for the search sequence is used here as an example. For simplicity, superscripts are omitted. First, the prompter module applies a $1 \times 1$ convolution to the visible light and thermal infrared sequences, converting them into two-dimensional feature maps with a channel dimension of 8. The visible light features then undergo softmax normalization, and the results are multiplied by the original visible light feature map to obtain a new visible light feature map. Next, the thermal infrared feature map is added to the new visible light feature map, completing the cross-modal interaction between the two feature maps. Finally, a $1 \times 1$ convolution is used again to restore the dimensionality of the mixed features. The visible light prompt information is generated in the same way, except that the thermal infrared feature map is replaced by the visible light feature map. This achieves self-enhancement of the visible light features and thereby minimizes the loss of tracking accuracy when the thermal infrared modality is absent. The specific formulas are as follows:

$$I_{rgb} = \mathrm{conv}\left(\mathrm{conv}(H_{rgb}) + \mathrm{softmax}(\mathrm{conv}(H_{rgb})) \times \mathrm{conv}(H_{rgb})\right) \tag{4}$$

$$I_{t} = \mathrm{conv}\left(\mathrm{conv}(H_{t}) + \mathrm{softmax}(\mathrm{conv}(H_{rgb})) \times \mathrm{conv}(H_{rgb})\right) \tag{5}$$

where $\mathrm{conv}$ represents a $1 \times 1$ convolution operation, and $\mathrm{softmax}$ denotes normalization processing.

In the initial prompt module, the variables $H_{r g b}, H_{t}$ from the aforementioned formulas correspond to $H_{r g b}^{0}, H_{t}^{0}$ respectively. In subsequent prompt modules, $H_{r g b}, H_{t}$ represent the two pieces of prompt information $I_{r g b}, I_{t}$ outputted from the previous layer. The prompt module, when processing inputs, applies weighting to different regions to aid the network in more effectively learning the crucial information within the input features, capturing essential details, and bolstering the network’s modeling capacity.
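Eqs. (4) and (5) can be sketched in NumPy as below. Several choices are assumptions rather than details given in the text: a $1 \times 1$ convolution is written as a per-token linear projection (equivalent on a flattened feature map), the softmax is taken over spatial positions, each modality has its own down- and up-projection, and the parameter names are hypothetical. Only the hidden width of 8 comes from the description above.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prompt_module(h_rgb, h_t, params):
    """Sketch of Eqs. (4)-(5) on token sequences of shape (N, C)."""
    # 1x1 convolutions reduce the channel dimension (C -> 8).
    f_rgb = h_rgb @ params["down_rgb"]
    f_t = h_t @ params["down_t"]
    # Assumption: softmax over spatial positions (axis 0) re-weights
    # the visible-light features.
    enhanced = softmax(f_rgb, axis=0) * f_rgb
    # Eq. (4): visible-light self-enhancement prompt.
    i_rgb = (f_rgb + enhanced) @ params["up_rgb"]
    # Eq. (5): thermal prompt fused with the enhanced visible features.
    i_t = (f_t + enhanced) @ params["up_t"]
    return i_rgb, i_t

rng = np.random.default_rng(0)
n_tokens, dim, hidden = 256, 32, 8  # hidden = 8 as in the text; others are toy sizes
params = {
    "down_rgb": rng.normal(scale=0.1, size=(dim, hidden)),
    "down_t": rng.normal(scale=0.1, size=(dim, hidden)),
    "up_rgb": rng.normal(scale=0.1, size=(hidden, dim)),
    "up_t": rng.normal(scale=0.1, size=(hidden, dim)),
}
h_rgb = rng.normal(size=(n_tokens, dim))
h_t = rng.normal(size=(n_tokens, dim))
i_rgb, i_t = prompt_module(h_rgb, h_t, params)
print(i_rgb.shape, i_t.shape)
```

Note that the visible-light prompt never depends on the thermal input, which matches the stated goal of keeping a usable prompt when the thermal modality is absent.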

3.3.2. Information enhancement fusion module

To make better use of the two prompts generated by the prompt module, this paper draws on the CBAM²¹ attention mechanism and applies both channel and spatial attention to strengthen the model’s focus on important features. Channel attention assigns a weight to each channel and suppresses redundant channels. Spatial attention weights the spatial positions of the feature map to highlight the areas that contribute most to the tracking task. The two types of attention are complementary, and using them together can further improve model performance; the combination has proved effective in several image processing tasks, such as image rain removal²² and image classification.²³ The information enhancement fusion module is illustrated in Figure 3. Each prompt first passes through the channel attention module shown in Figure 3(a): the input features are processed by global average pooling and max pooling separately, both pooled results pass through shared fully connected layers, and the two outputs are added and normalized to obtain the channel attention weights, which weight the features of each channel. The prompt then passes through the spatial attention module shown in Figure 3(b): the input features undergo average pooling and max pooling, the pooled features are concatenated and passed through a convolutional layer, and a normalization operation yields the spatial attention weights, which weight the features of each spatial position. Finally, the attention-weighted features of the two prompts are added together to obtain the enhanced and fused prompt information. The specific formulas are as follows:

$$W_{c} = \mathrm{sigmoid}\left(\mathrm{fc}(\mathrm{relu}(\mathrm{fc}(\mathrm{avg\_pool}(P)))) + \mathrm{fc}(\mathrm{relu}(\mathrm{fc}(\mathrm{max\_pool}(P))))\right) \tag{6}$$

$$W_{s} = \mathrm{sigmoid}\left(\mathrm{conv}(\mathrm{concat}(\mathrm{avg\_pool}(P), \mathrm{max\_pool}(P)))\right) \tag{7}$$

$$P^{*} = (P_{r} * W_{c}^{r}) * W_{s}^{r} + (P_{i} * W_{c}^{i}) * W_{s}^{i} \tag{8}$$

where $\mathrm{avg\_pool}$ represents global average pooling, $\mathrm{max\_pool}$ denotes max pooling, $W_{c}$ is the channel attention weight, and $W_{s}$ is the spatial attention weight.

Figure 3.

The overall architecture of the information enhancement fusion module: (a) the process of generating channel attention weights, (b) the process of generating spatial attention weights, (c) the overall pipeline.

Integrating channel attention mechanisms and spatial attention mechanisms allows the model to more effectively focus on important channel and spatial information within the input feature maps, thereby enhancing the model’s performance and generalization ability.
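The following NumPy sketch walks through Eqs. (6) to (8) on feature maps of shape (C, H, W). It is a simplified illustration under stated assumptions: the spatial convolution is reduced to a $1 \times 1$ kernel over the two pooled maps (CBAM itself uses a larger kernel), the spatial pooling is taken along the channel axis as in CBAM, both attention weights are computed from the unrefined prompt as Eq. (7) is written, and all parameter names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_weights(p, fc1, fc2):
    """Eq. (6): shared two-layer MLP on average- and max-pooled channels."""
    def mlp(v):
        return np.maximum(v @ fc1, 0.0) @ fc2  # fc -> relu -> fc
    avg = p.mean(axis=(1, 2))  # (C,)
    mx = p.max(axis=(1, 2))    # (C,)
    return sigmoid(mlp(avg) + mlp(mx))

def spatial_weights(p, conv_w, conv_b):
    """Eq. (7) with an assumed 1x1 convolution over the two pooled maps."""
    avg = p.mean(axis=0)  # (H, W), pooled along the channel axis
    mx = p.max(axis=0)    # (H, W)
    return sigmoid(conv_w[0] * avg + conv_w[1] * mx + conv_b)

def enhance(p, params):
    w_c = channel_weights(p, params["fc1"], params["fc2"])
    w_s = spatial_weights(p, params["conv_w"], params["conv_b"])
    # Broadcast the channel weights over space and the spatial weights over channels.
    return p * w_c[:, None, None] * w_s[None, :, :]

def ief(p_r, p_i, params_r, params_i):
    """Eq. (8): sum of the two attention-weighted prompts."""
    return enhance(p_r, params_r) + enhance(p_i, params_i)

def make_params(rng, channels, reduction=2):
    return {
        "fc1": rng.normal(scale=0.1, size=(channels, channels // reduction)),
        "fc2": rng.normal(scale=0.1, size=(channels // reduction, channels)),
        "conv_w": rng.normal(scale=0.1, size=2),
        "conv_b": 0.0,
    }

rng = np.random.default_rng(0)
channels, height, width = 8, 16, 16  # toy sizes
p_r = rng.normal(size=(channels, height, width))
p_i = rng.normal(size=(channels, height, width))
fused = ief(p_r, p_i, make_params(rng, channels), make_params(rng, channels))
print(fused.shape)
```

Both weights lie in (0, 1) because of the sigmoid, so the module can only attenuate a prompt element; which prompt dominates at a given position therefore depends on the relative sizes of the two weight products.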

3.4. Loss function

The loss function adopted in this paper is consistent with that of OSTrack,⁶ employing weighted focal loss²⁴ for classification and utilizing L1 loss and generalized IoU loss²⁵ for bounding box regression. The overall loss function is presented as follows:

$$L = L_{cls} + \lambda_{1} L_{giou} + \lambda_{2} L_{1} \tag{9}$$

where $L_{cls}$ represents the weighted focal loss, $L_{giou}$ denotes the generalized IoU loss, $L_{1}$ indicates the L1 loss, and $\lambda_{1}, \lambda_{2}$ represent hyperparameters.
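The weighting in Eq. (9) can be illustrated with a small pure-Python sketch for a single predicted box. The weights $\lambda_{1} = 2$ and $\lambda_{2} = 5$ are the values reported in Section 4.1; everything else is an assumption for illustration: boxes are given as `(x1, y1, x2, y2)` in normalized coordinates, the L1 term is averaged over the four coordinates, and the classification loss is passed in as a precomputed number rather than computed from a score map.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose

def l1_loss(box_a, box_b):
    # Assumption: mean absolute error over the four box coordinates.
    return sum(abs(a - b) for a, b in zip(box_a, box_b)) / len(box_a)

def total_loss(l_cls, pred_box, target_box, lambda_1=2.0, lambda_2=5.0):
    """Eq. (9): L = L_cls + lambda_1 * L_giou + lambda_2 * L_1."""
    l_giou = 1.0 - giou(pred_box, target_box)
    return l_cls + lambda_1 * l_giou + lambda_2 * l1_loss(pred_box, target_box)

pred = (0.10, 0.10, 0.50, 0.50)
target = (0.20, 0.20, 0.60, 0.60)
print(total_loss(0.3, pred, target))
```

When the prediction matches the target exactly, both regression terms vanish and the total reduces to the classification loss.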

4. Experiment and data analysis

This section mainly introduces the specific implementation method of DPCFT, evaluates its performance on multiple datasets, and conducts ablation experiments on the involved modules to verify its effectiveness.

4.1. Implementation details

The models in this paper were implemented in PyTorch and trained on two NVIDIA RTX 2080 Ti GPUs. The LasHeR²⁶ training set was used to train the network for 120 epochs with a batch size of 16, and each epoch contained 60,000 sample pairs. During training, the parameters of the backbone network and the classification head were frozen, and only the parameters of the multi-prompt complementary fusion branch were updated. The frozen parameters were initialised from the baseline model OSTrack,⁶ while the other parameters were initialised using the Xavier uniform initialisation scheme.²⁷ The optimiser is AdamW²⁸ with the weight decay set to 0.0001 and the learning rate set to 0.0004. The input search regions are resized to $256 \times 256$, and the template regions are resized to $128 \times 128$. $λ_{1}$ and $λ_{2}$ are set to 2 and 5, respectively. The other experimental details are similar to those of OSTrack.⁶
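As a quick check of these input sizes against $N = HW / P^{2}$ from Section 3.2, the token counts can be computed directly. The patch size of 16 is an assumption (the common ViT-B/16 configuration); the text does not state $P$.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping patches: N = H * W / P^2."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)

PATCH_SIZE = 16  # assumption: not stated in the text

n_z = num_patches(128, 128, PATCH_SIZE)  # template tokens per modality
n_x = num_patches(256, 256, PATCH_SIZE)  # search tokens per modality
print(n_z, n_x, n_z + n_x)
```

Under this assumption each modality contributes 64 template tokens and 256 search tokens, so the concatenated sequence entering the backbone has 320 tokens.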

4.2. Evaluation on LasHeR dataset

The LasHeR dataset²⁶ is a large-scale short-term RGBT tracking dataset containing a total of 1224 pairs of visible and thermal infrared video sequences with 730K frames. Its test set contains 245 pairs of visible and thermal infrared image sequences. Nine other state-of-the-art trackers are compared with the proposed DPCFT: MANet,¹⁰ mfDiMP,²⁹ MaCNet,³⁰ CAT,³¹ MANet++,³² DMCNet,³³ APFNet,¹² ProTrack,⁸ and ViPT.⁹ The results are reported in Figure 4. DPCFT outperforms the state-of-the-art tracker ViPT by 1.7 and 1.3 percentage points in PR and SR, respectively, and also leads the other trackers, which demonstrates the effectiveness of the proposed DPCFT.

Figure 4.

Overall performance on LasHeR test set.

4.3. Attribute-based evaluation

To evaluate the proposed algorithm more comprehensively, the challenge attributes were evaluated on the LasHeR test set. The LasHeR dataset contains 19 challenge attributes: Background Clutter (BC), Camera Moving (CM), Fast Motion (FM), Motion Blur (MB), Deformation (DEF), Scale Variation (SV), Heavy Occlusion (HO), Total Occlusion (TO), No Occlusion (NO), Partial Occlusion (PO), Out-of-view (OV), Low Illumination (LI), High Illumination (HI), Abrupt Illumination Variation (AIV), Low Resolution (LR), Thermal Crossover (TC), Similar Appearance (SA), Frame Lost (FL), and Aspect Ratio Change (ARC).

The experimental results are shown in Table 1, where the seven best-performing algorithms from the comparison above are selected for further analysis. DPCFT outperforms the other trackers in most of the challenging scenarios, and it performs especially well under illumination challenges such as HI, LI, and AIV. This further illustrates the effectiveness of the proposed dual-prompt complementary fusion idea, which handles extreme illumination in visible target tracking and exploits the advantages of both the visible and thermal infrared modalities. The algorithm also performs better in occlusion scenarios such as HO, TO, and PO.

Table 1.
Attribute-based evaluation on LasHeR dataset. Bold for the best results, italic for the second-best results.

Attributes	MANet++	MaCNet	DMCNet	mfDiMP	APFNet	ProTrack	ViPT	DPCFT
BC	43.6/31.4	42.2/31.9	42.8/33.1	34.9/27.0	44.9/33.7	49.8/38.8	64.9/51.8	67.2/53.5
CM	42.2/29.4	46.7/33.9	46.2/34.0	40.8/30.6	47.7/35.1	54.1/41.6	62.1/50.0	64.5/51.8
FM	41.1/28.9	43.7/33.0	44.0/33.2	41.3/32.4	45.1/33.9	52.0/41.4	63.1/51.4	65.5/53.3
MB	39.7/26.6	40.4/29.8	44.4/32.2	37.6/28.7	45.9/32.8	52.4/39.5	57.3/45.9	58.8/47.1
DEF	39.4/30.8	41.4/34.0	44.4/36.3	40.3/34.2	45.8/36.8	51.9/42.8	67.4/55.7	69.4/57.1
SV	46.4/31.1	48.0/34.8	48.6/35.1	45.2/34.9	49.8/36.0	54.5/42.5	65.0/52.5	66.4/53.7
HO	24.5/24.4	28.1/29.1	21.8/23.3	19.8/23.8	27.1/27.7	40.2/38.6	43.7/43.8	50.5/46.2
TO	35.4/25.4	38.6/29.2	40.2/29.9	32.2/25.0	41.7/31.4	43.9/34.2	57.6/46.1	59.8/47.5
NO	63.6/40.7	74.0/51.7	67.8/46.3	76.5/57.5	66.7/46.7	75.4/58.0	84.0/68.4	87.9/71.0
PO	44.0/30.1	44.6/32.8	45.9/33.8	39.7/30.8	47.3/34.5	50.5/39.6	62.4/50.3	64.0/51.6
OV	28.0/22.0	34.8/36.7	44.5/40.1	40.6/34.9	36.4/34.2	54.8/45.8	76.2/65.0	76.0/65.6
LI	35.8/24.0	36.0/26.7	39.5/30.0	29.6/23.8	41.8/30.8	42.4/33.4	49.8/41.2	53.5/43.8
HI	53.3/34.7	52.0/37.4	55.3/38.6	46.7/35.1	60.4/41.2	59.5/44.4	67.9/54.2	69.5/55.5
AIV	18.8/15.8	17.3/15.6	20.9/22.0	16.6/16.4	32.1/26.2	30.4/26.7	37.5/35.0	42.0/37.9
LR	47.4/26.8	43.9/28.0	45.4/29.7	40.2/25.6	46.1/29.4	46.2/32.1	56.4/41.6	58.0/43.0
TC	40.1/26.8	39.8/28.7	42.9/31.2	38.0/28.8	43.1/31.6	45.8/35.8	57.3/46.0	59.2/47.5
SA	41.1/27.9	40.8/30.4	42.2/31.9	37.2/29.5	42.8/31.7	45.1/36.3	57.3/46.5	59.2/48.0
FL	37.8/21.6	34.6/22.2	39.3/28.3	32.3/25.7	37.6/27.9	52.0/38.6	59.1/46.5	57.6/45.9
ARC	35.5/25.7	36.0/28.5	37.7/29.1	37.8/30.9	40.5/31.0	47.5/39.1	59.3/49.5	60.5/50.4
ALL	46.7/31.4	48.2/35.0	49.0/35.5	44.7/34.3	50.0/36.2	53.8/42.0	65.1/52.5	66.8/53.8

4.4. Evaluation on RGBT234 dataset

The RGBT234 dataset³⁴ contains 234 pairs of visible and thermal infrared image sequences, totaling 234K frames. The comparison of DPCFT with 10 other algorithms is shown in Table 2. The PR and SR of DPCFT are 0.9 and 0.7 percentage points higher than those of ViPT, respectively, and DPCFT also outperforms the other tracking algorithms.

Table 2.
Overall performance on RGBT234 dataset. Bold for the best results, italic for the second-best results.

Methods	Year	Precision	Success
MANet¹⁰	2019	77.7	53.9
mfDiMP²⁹	2019	64.6	42.8
MaCNet³⁰	2020	79.0	55.4
CAT³¹	2020	80.4	56.1
MANet++³²	2021	79.5	55.9
DMCNet³³	2022	83.9	59.3
APFNet¹²	2022	82.7	57.9
ProTrack⁸	2022	79.5	59.9
CMD³⁶	2023	82.4	58.4
ViPT⁹	2023	83.5	61.7
DPCFT	–	84.4	62.4

4.5. Evaluation on GTOT dataset

The GTOT dataset³⁵ is a small RGBT target tracking dataset containing 50 pairs of visible and thermal infrared image sequences. The comparison of DPCFT with 9 other algorithms is shown in Table 3. DPCFT achieves the highest PR and SR among the compared trackers, which illustrates the effectiveness of the proposed algorithm.

Table 3.
Overall performance on GTOT dataset. Bold for the best results, italic for the second-best results.

Methods	Year	Precision	Success
MANet¹⁰	2019	89.4	72.4
mfDiMP²⁹	2019	87.7	73.1
MaCNet³⁰	2020	88.0	71.4
CAT³¹	2020	88.9	71.7
MANet++³²	2021	90.1	72.3
DMCNet³³	2022	90.9	73.3
APFNet¹²	2022	90.5	73.9
CMD³⁶	2023	89.2	73.4
ViPT⁹	2023	92.0	76.4
DPCFT	–	93.1	76.8

4.6. Parametric analysis

To further analyze the practical applicability of the proposed algorithm, a parameter analysis is carried out, and the results are shown in Table 4. To alleviate the cost of a large number of parameters, the prompt learning method is adopted: the parameters of the backbone network are frozen during training, so only a small share of the total parameters is updated and the resource consumption during training is reduced. The method is therefore suited to real-world scenarios that require a balance between computing resources and accuracy.

Table 4.
Parametric analysis. Bold for the best results, italic for the second-best results.

Methods	Year	Total Params (M)	Train Params (M)
MANet¹⁰	2019	7.28	7.28
mfDiMP²⁹	2019	175.82	175.82
MaCNet³⁰	2020	14.86	14.86
MANet++³²	2021	7.38	7.38
APFNet¹²	2022	15.01	15.01
ViPT⁹	2023	92.52	0.84
DPCFT	–	96.437	4.612
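The effect of freezing the backbone is easiest to see as the share of parameters that is actually trained. The sketch below computes that share from the values in Table 4; it assumes the table reports parameter counts in millions (the unit is not stated in the table), and `trainable_share` is an illustrative helper name.

```python
# (total, trainable) parameter counts copied from Table 4.
# Assumption: values are in millions of parameters.
params = {
    "MANet": (7.28, 7.28),
    "mfDiMP": (175.82, 175.82),
    "MaCNet": (14.86, 14.86),
    "MANet++": (7.38, 7.38),
    "APFNet": (15.01, 15.01),
    "ViPT": (92.52, 0.84),
    "DPCFT": (96.437, 4.612),
}


def trainable_share(total, trainable):
    """Percentage of parameters updated during training, to one decimal."""
    return round(100.0 * trainable / total, 1)


for name, (total, trainable) in params.items():
    print(f"{name:8s} total={total:8.3f}  trainable={trainable:8.3f}  "
          f"share={trainable_share(total, trainable):5.1f}%")
```

Under this reading, the fully fine-tuned trackers train 100% of their parameters, while the two prompt-based trackers (ViPT and DPCFT) train only a few percent of theirs, with DPCFT training more parameters than ViPT in absolute terms.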

4.7. Ablation study

4.7.1. Variants comparison

To verify the effectiveness of the individual modules of the proposed model, ablation experiments are performed on the test set of LasHeR. Five variants are compared: model 1 is a single-branch RGB baseline tracker containing a backbone network and a classification head; model 2 adds the prompt (P) module to this baseline; models 3 and 4 add to model 2 only the CA module or only the SA module of the IEF module, respectively; and DPCFT is the full tracker proposed in this paper. The experimental results are shown in Table 5.

With the addition of the prompt module, the precision and success rate improve by 14.7 and 12.2 percentage points, respectively, over the baseline network, which indicates that the design of the prompt module is effective for both visible and thermal infrared modalities. The results of models 3 and 4 show that tracking performance falls slightly below that of model 2 when only the CA or only the SA module is used, whereas adding the complete information enhancement fusion (IEF) module improves performance further. This indicates that the benefit comes from the IEF module as a whole rather than from either attention component alone. Overall, the ablation results show that the modules of the proposed algorithm are effective.

Table 5.
Ablation studies on LasHeR test set. Bold for the best results.

Model	P	CA	SA	IEF	Precision	Success
1					51.5	41.2
2	✓				66.2	53.4
3	✓	✓			65.9	52.8
4	✓		✓		65.8	52.6
DPCFT	✓			✓	66.8	53.8
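The contribution of each component can be read off Table 5 as a difference between variants. The following sketch recomputes those differences from the table values; the `delta` helper and the dictionary layout are illustrative only.

```python
# (precision, success) in percent, copied from Table 5 (LasHeR test set).
ablation = {
    "1": (51.5, 41.2),       # RGB baseline
    "2": (66.2, 53.4),       # baseline + prompt module (P)
    "3": (65.9, 52.8),       # P + CA only
    "4": (65.8, 52.6),       # P + SA only
    "DPCFT": (66.8, 53.8),   # P + full IEF module
}


def delta(results, variant, reference):
    """Change of `variant` relative to `reference` in percentage points."""
    return tuple(round(v - r, 1)
                 for v, r in zip(results[variant], results[reference]))


print("P over baseline:  ", delta(ablation, "2", "1"))
print("CA only over P:   ", delta(ablation, "3", "2"))
print("SA only over P:   ", delta(ablation, "4", "2"))
print("Full IEF over P:  ", delta(ablation, "DPCFT", "2"))
```

The prompt module accounts for most of the gain over the baseline, the CA-only and SA-only variants are slightly negative relative to model 2, and the full IEF module adds a further small positive margin.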

The design of the prompter has a large impact on the performance of the final model. We analyse three prompter designs, illustrated in Figure 5, and report the experimental results in Table 6. Model a, shown in Figure 5(a), uses only the thermal modality as prompt information. Model b, shown in Figure 5(b), uses the visible and thermal modalities as two prompts, each of which is built from the interaction of the two modalities. Model c, shown in Figure 5(c), is the prompter design proposed in this paper: one prompt is built from the interaction of the two modalities, while the other is built from self-interaction enhancement of the visible light features. The experimental results show that the prompter design adopted by our algorithm is more effective than the other two.

Figure 5.

Design drawings of different prompters.

Table 6.

Quantitative comparison between different variants of prompter on the LasHeR dataset. Bold for the best results.

Methods	Precision	Success
a	64.5	52.5
b	65.8	53.1
c	66.2	53.8

4.7.2. Visualization

To further verify the effectiveness of the proposed DPCFT algorithm, tracking results are visualized on several representative sequences, as shown in Figure 6. Figure 6(a) shows the results for the sequence 10runone: under severe occlusion, the other trackers drift to varying degrees, whereas DPCFT still tracks the target accurately. Figure 6(b) shows the results for the sequence small-gai, where the target is a transparent object with other objects in the background; the other two algorithms tend to focus on background features, yet DPCFT still tracks the target effectively. Figure 6(c) shows the results for the sequence truckgonorth, where the scene alternates between low and high illumination; the other two algorithms drift to varying degrees because of the illumination changes, whereas DPCFT still tracks the target well. These visualizations indicate that DPCFT is robust and effective in such scenarios.

Figure 6.

Visualization of the tracking results. Green bounding boxes represent the ground truth, red bounding boxes represent DPCFT, purple bounding boxes represent ViPT, and yellow bounding boxes represent OSTrack.

4.7.3. Failure cases

Figure 7 depicts cases of tracking failure. In the figure, green bounding boxes represent the true positions of the objects, while red bounding boxes depict the predictions made by the proposed method. As shown in Figure 7(a), for the sequence boytakingbasketballfollowing, the tracker performs well under normal conditions. However, when sudden camera movement causes image blur and the target is severely occluded by trees before re-emerging into extreme lighting conditions, the tracker loses the target. Subsequently, the target exits and re-enters the field of view, appearing very small upon re-entry. In such complex scenarios involving multiple simultaneous challenges, the proposed method struggles to maintain effective tracking. As illustrated in Figure 7(b), for the sequence boyunder2baskets, when extreme lighting and thermal crossover occur simultaneously and multiple similar objects are present, the method fails to track the target accurately. This failure is primarily due to the reduced reliability of both modalities under these conditions and the distraction caused by the similar objects, which makes the tracker prone to drift.

Figure 7.

Failure Cases. Green bounding boxes represent the true boundaries, red bounding boxes represent DPCFT. (a) boytakingbasketballfollowing; (b) boyunder2baskets.

5. Conclusion

In this paper, a dual-prompt complementary fusion RGBT tracking network is proposed. It addresses the problem of inaccurate tracking results caused by the failure of a single modality in practical applications. The network makes full use of the complementary properties of the visible and thermal modalities, and the dual-prompt design helps maintain tracking accuracy when one of the modalities is unreliable. In addition, an information enhancement fusion module is introduced to enhance the expressive ability of the prompt information, improving the network's ability to integrate information from different modalities. Meanwhile, a prompt learning strategy is adopted so that the pre-trained base model for visible light tracking can be transferred to RGBT target tracking. This exploits the strong performance of the visible light target tracking model while reducing the computational resources and time required for training. Experimental validation on several benchmark datasets shows that, compared with several existing RGBT target tracking algorithms, the proposed method performs well in scenarios with varying illumination, such as high and low illumination, and in the presence of various kinds of occlusion, and therefore has good practical value. However, the failure cases show that the tracking accuracy of the proposed method is low in scenes where multiple complex challenges occur at the same time. In future work, we will consider improving the handling of lighting changes and the tracking of small targets to further improve the algorithm.

Footnotes

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 52374155, the Anhui Provincial Natural Science Foundation under Grant No. 2308085MF218, and the Natural Science Research Project of Colleges and Universities in Anhui Province under Grant No. 2022AH040113.

ORCID iD

Hongwei Ge

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Zhang, Leung, et al. Object fusion tracking based on visible and infrared images: a comprehensive review. Inf Fusion 2020; 63: 166–187.
2. Vaswani, Shazeer, Parmar, et al. Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems, 2017, pp.6000–6010.
3. Wang, Zhou, Wang, et al. Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp.1571–1580.
4. Chen, Yan, Zhu, et al. Transformer tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp.8126–8135.
5. Yan, Peng, et al. Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp.10448–10457.
6. Chang, et al. Joint feature learning, relation modeling for tracking: a one-stream framework. In: European conference on computer vision, 2022, pp.341–357. Springer.
7. Jia, Tang, Chen B-C, et al. Visual prompt tuning. In: European conference on computer vision, 2022, pp.709–727. Springer.
8. Yang, Zheng, et al. Prompting for multi-modal tracking. In: Proceedings of the 30th ACM international conference on multimedia, 2022, pp.3492–3500.
9. Zhu, Lai, Chen, et al. Visual prompt multi-modal tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp.9516–9526.
10. Zheng, et al. Multi-adapter RGBT tracking. In: Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp.0–0.
11. Wang, Shu, Zhang, et al. MFGNet: dynamic modality-aware filter generation for RGB-T tracking. IEEE Trans Multimedia 2022; 25: 4335–4348.
12. Xiao, Yang, et al. Attribute-based progressive fusion network for RGBT tracking. In: Proceedings of the AAAI conference on artificial intelligence, vol. 36, 2022, pp.2831–2838.
13. Zhang, Peng, et al. SiamFT: an RGB-infrared fusion tracking method via fully convolutional Siamese networks. IEEE Access 2019; 7: 122122.
14. Bertinetto, Valmadre, Henriques, et al. Fully-convolutional Siamese networks for object tracking. In: Computer vision–ECCV 2016 workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14, 2016, pp.850–865. Springer.
15. Zhang, Peng, et al. DSiamMFT: an RGB-T fusion tracking method via dynamic Siamese networks using multi-layer feature fusion. Signal Process Image Commun 2020; 84: 115756.
16. Hui, Xun, Peng, et al. Bridging search region interaction with template for RGB-T tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp.13630–13639.
17. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
18. Wang, Zhang, Ebrahimi, et al. DualPrompt: complementary prompting for rehearsal-free continual learning. In: European conference on computer vision, 2022, pp.631–648. Springer.
19. Tian, Zheng, Zuo, et al. A cross transformer for image denoising. Inf Fusion 2024; 102: 102043.
20. Jiang, Wang, Chen, et al. Magic ELF: image deraining meets association learning and transformer. arXiv preprint arXiv:2207.10455, 2022.
21. Woo, Park, Lee J-Y, et al. CBAM: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), 2018, pp.3–19.
22. Jiang, Wang, et al. Rain-free and residue hand-in-hand: a progressive coupled network for real-time image deraining. IEEE Trans Image Process 2021; 30: 7404–7418.
23. Park, Woo, Lee J-Y, et al. BAM: bottleneck attention module. arXiv preprint arXiv:1807.06514, 2018.
24. Law, Deng. CornerNet: detecting objects as paired keypoints. In: Proceedings of the European conference on computer vision (ECCV), 2018, pp.734–750.
25. Rezatofighi, Tsoi, Gwak, et al. Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.658–666.
26. Xue, Jia, et al. LasHeR: a large-scale high-diversity benchmark for RGBT tracking. IEEE Trans Image Process 2021; 31: 392–404.
27. Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 2010, pp.249–256.
28. Loshchilov, Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
29. Zhang, Danelljan, Gonzalez-Garcia, et al. Multi-modal fusion for end-to-end RGB-T tracking. In: Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp.0–0.
30. Zhang, Zhuo, et al. Object tracking in RGB-T videos using modal-aware attention network and competitive learning. Sensors 2020; 20: 393.
31. Liu, et al. Challenge-aware RGBT tracking. In: European conference on computer vision, 2020, pp.222–237. Springer.
32. Yan, et al. RGBT tracking via multi-adapter network with hierarchical divergence loss. IEEE Trans Image Process 2021; 30: 5613–5625.
33. Qian, et al. Duality-gated mutual condition network for RGBT tracking. IEEE Trans Neural Netw Learn Syst 2022: 1–14.
34. Liang, et al. RGB-T object tracking: benchmark and baseline. Pattern Recognit 2019; 96: 106977.
35. Cheng, et al. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Trans Image Process 2016; 25: 5743–5756.
36. Zhang, Guo, Jiao, et al. Efficient RGB-T tracking via cross-modality distillation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp.5404–5413.
