Using DINOv2 and multimodal fusion to build a photoelectric detection system and precision optimization model

Abstract

Existing photoelectric detection systems often suffer from inadequate image feature extraction accuracy and suboptimal performance in multimodal information fusion, particularly under complex and dynamic environmental conditions. To address these challenges, this paper presents a novel photoelectric detection framework based on the self-supervised vision model DINOv2 and a cross-modal Transformer architecture with deformable attention. First, DINOv2 is employed to extract rich, global semantic features from visible and infrared images, generating high-fidelity visual representations at a unified scale without reliance on large annotated datasets. Second, a deformable cross-modal attention mechanism within a Transformer-based fusion network is designed to enable adaptive spatial alignment and deep integration of heterogeneous modalities, effectively capturing long-range dependencies and local structural correspondences. Finally, a self-supervised fine-tuning module based on contrastive learning is introduced to enhance feature consistency across modalities and improve the robustness of the fused representation under environmental variations. Experimental results on benchmark multimodal detection tasks demonstrate that the proposed method achieves a target recognition accuracy ranging from 86.4% to 93.2%, with a maximum performance gain of 7.8% over baseline models. Moreover, the cross-modal alignment error is reduced to within 2.7%, indicating superior fusion precision. The proposed framework significantly enhances both detection accuracy and fusion consistency, offering a promising solution for the development of high-performance, robust photoelectric detection systems in real-world scenarios.

Keywords

multimodal perception, optical-electronic detection system, Distillation with No labels v2, contrastive learning, accuracy optimization model

Introduction

With the widespread application of intelligent sensing technology in security, remote sensing, military and other fields, optoelectronic detection systems face increasingly demanding requirements for target recognition and tracking accuracy in complex scenes.^1,2 To improve the environmental adaptability and target recognition accuracy of these systems, multimodal information fusion has become a key direction in the development of photoelectric detection systems in recent years.^3,4 By jointly modeling the complementary features of multi-source data such as visible light and infrared images, a system’s perception robustness and discrimination ability in uncertain environments can be significantly enhanced.^5,6 However, current multimodal fusion systems still face technical bottlenecks such as insufficient feature representation capability, poor semantic consistency between modalities, and coarse fusion strategies, which call for systematic optimization of both the underlying modeling methods and the system structure design.^7,8 Therefore, building a new photoelectric detection system with high-precision semantic understanding, multimodal alignment and stable output capabilities has important theoretical significance and engineering value for advancing intelligent sensing technology and achieving high-reliability detection in complex scenarios.^9,10

Current photoelectric detection systems still face three key challenges in complex environments, which seriously restrict their application performance in high-precision perception and engineering deployment.^11,12 Challenge 1: Existing feature extraction methods based on convolutional neural networks are limited by the receptive field and local convolution structure, making it difficult to capture long-range dependencies and global semantic information in images. Especially in low-contrast, partial occlusion, and strong background interference scenes, it is often impossible to extract discriminative deep semantic features, resulting in a significant decrease in recognition accuracy.^13,14 Challenge 2: Traditional multimodal fusion is mostly based on static strategies such as feature splicing and weighted averaging, which do not take into account the semantic offset between modalities, the difference in perceptual granularity and redundant interference information, resulting in blurring and distortion of features after fusion and degradation of discrimination ability.^15,16 Challenge 3: Due to the heterogeneity of infrared and visible light in spatial structure, imaging mechanism and signal-to-noise characteristics, inaccurate modality alignment can easily lead to semantic mismatch and reduce the expression effectiveness of the fused features.^17,18

To solve the above problems, recent studies have attempted to introduce self-supervised visual Transformer models and contrastive learning mechanisms to enhance multimodal perception capabilities.^19,20 For example, self-supervised models such as DINO (Distillation with No labels) and MAE (Masked Autoencoders) have demonstrated superior semantic modeling capabilities on large-scale unlabeled images and have the potential to transfer knowledge to heterogeneous modalities such as infrared/visible light.^21,22 At the same time, the Transformer fusion mechanism based on the deformable cross-modal attention structure and modality contrast loss have made progress in some visual tasks. However, current methods generally have the following shortcomings: First, multimodal features lack spatial structural uniformity in the encoding stage, making it difficult to directly align semantic distribution; second, most fusion strategies are static splicing or unified attention paths, which makes it difficult to perform structural adaptive adjustments based on local differences in input features. To this end, this paper introduces DINOv2 combined with a cross-modal Transformer mechanism, supplemented by a contrastive learning optimization strategy, to improve the system’s adaptability and recognition accuracy for multi-source information.

This study aims to develop a high-precision, multimodal fusion framework for photoelectric detection systems capable of operating reliably in complex environments. To this end, we employ the self-supervised vision model DINOv2 (Self-Distillation with No Labels v2) to extract rich, global semantic features from visible and infrared imagery, achieving high-quality, label-free visual representations under diverse conditions. A Transformer-based fusion architecture incorporating a cross-modal attention mechanism is specifically designed to enable deep, adaptive alignment and integration of heterogeneous modalities, effectively capturing both spatial and semantic correspondences. Furthermore, a contrastive learning-based self-supervised optimization module is introduced to enhance the consistency of fused features across modalities and improve the robustness of target recognition under environmental variations. Experimental results demonstrate that the proposed method achieves a target recognition accuracy ranging from 86.4% to 93.2% in multimodal detection tasks, with the modal alignment error reduced to within 2.7%. These outcomes confirm the framework’s effectiveness in addressing key limitations of existing systems, particularly low fusion accuracy and poor robustness, under challenging operational scenarios. The proposed approach provides a solid foundation for the next generation of intelligent, multimodal photoelectric detection systems.

Construction of optoelectronic detection system optimization model

DINOv2 feature extraction based on spectral adaptation

This paper uses DINOv2 as the basic feature encoder and trains infrared images and visible light images separately under unsupervised conditions to obtain semantic embedding representations with unified dimensions and coordinated distribution.

First, to address the structural heterogeneity of multimodal inputs, two DINOv2 training processes with shared parameters but independent data sources are constructed. For each type of image, multi-scale view pairs are constructed through view generation strategies such as random cropping, color perturbation, and Gaussian blur. The student network and teacher network in the DINOv2 structure are used to process different views respectively. The student network parameters are updated through gradient back propagation, while the teacher network uses exponential moving average for gradient-free update. This mechanism effectively improves the robustness of the model to cross-modal information changes (such as illumination, thermal noise, and texture loss) by maximizing the similarity of different view outputs in the feature space. The parameter configuration is shown in Table 1.

Table 1.

Parameter configuration.

Category	Parameter	Value
Training hyperparameters	Initial learning rate	5e-4
Training hyperparameters	Learning rate scheduling strategy	Cosine annealing
Training hyperparameters	Total number of training rounds (epochs)	100
Training hyperparameters	Batch size	64
View generation strategy	Random cropping	Size range [0.8, 1.0]
View generation strategy	Color perturbation	Brightness, contrast, and saturation change of 0.4
View generation strategy	Gaussian blur	Kernel size 5, standard deviation 1.0
Model structure parameters	Student network update method	Gradient backpropagation
Model structure parameters	Teacher network update method	Exponential moving average (no gradient update)
Optimizer	Optimizer type	AdamW
Optimizer	Weight decay	0.05

Table 1 lists the key parameter configurations of the DINOv2 training process. The initial learning rate is set to 5e-4 and decreases smoothly under a cosine annealing schedule. Training runs for 100 epochs with a batch size of 64 to balance training stability and efficiency. The view generation strategy increases input diversity through random cropping, color perturbation, and Gaussian blur, improving the model’s robustness to cross-modal interference such as lighting and texture changes. The student network updates its parameters through gradient backpropagation, while the teacher network is updated without gradients through an exponential moving average, which keeps training stable and consistent. The AdamW optimizer, combined with a weight decay of 0.05, helps prevent overfitting and promotes generalization. Together, these settings provide a stable training basis for multimodal image feature extraction.
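As an illustration of the schedule in Table 1, the following minimal sketch computes the cosine-annealed learning rate per epoch. The initial rate (5e-4) and the 100 epochs come from Table 1; the function name and the minimum rate of 0 are assumptions, since the paper does not state a floor value or any warm-up phase.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=100, base_lr=5e-4, min_lr=0.0):
    """Learning rate at `epoch` under cosine annealing.

    base_lr and total_epochs follow Table 1; min_lr=0.0 is an assumption.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few epochs.
schedule = {epoch: cosine_annealing_lr(epoch) for epoch in (0, 25, 50, 75, 100)}
```

Under these assumptions the rate starts at 5e-4, reaches half of that value (2.5e-4) at epoch 50, and decays to 0 at epoch 100.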

In terms of model architecture, DINOv2 uses Vision Transformer (ViT) as the encoding backbone, divides the input image into a patch sequence of fixed size, obtains embedding through linear projection, and then inputs the Transformer backbone for global modeling. This structure models the semantic dependencies between arbitrary regions in an image through a multi-head self-attention mechanism, effectively making up for the weakness of the CNN (Convolutional Neural Network) structure in long-distance modeling. The features extracted by this mechanism retain the boundary information, structural semantics and multi-scale context of the target, which is conducive to information alignment and matching in subsequent modal fusion, as shown in Figure 1.

Figure 1.

DINOv2 model architecture.

Figure 1 shows the image encoding process based on Vision Transformer in DINOv2. The input is a visible light or infrared image, which serves as the original data source of the system. The input image is divided into non-overlapping image blocks of fixed size (16 × 16 pixels), converting the two-dimensional image into a one-dimensional patch sequence that the Transformer can process. Each patch is flattened into a one-dimensional vector and mapped to a unified high-dimensional feature space through a linear transformation layer to generate the patch embedding. Because the Transformer structure itself carries no position information, a position encoding is added to each patch embedding vector so that the model can capture the spatial relationships between patches.
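The patch-splitting step can be sketched as follows. This is a simplified illustration for a single-channel image; the 224 × 224 input resolution and the function name are assumptions, and the linear projection and position encoding that follow in the real model are omitted.

```python
def patchify(image, patch_size=16):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order; each one is a flat list of
    patch_size * patch_size pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size) for c in range(patch_size)]
            patches.append(patch)
    return patches

# Hypothetical 224 x 224 single-channel input (the resolution is an assumption).
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
patches = patchify(image)
```

With these assumed sizes the image yields 14 × 14 = 196 patches of 256 values each, which is the sequence length N seen by the Transformer before any class token is added.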

During the training process, in order to ensure that the infrared and visible light images have a good alignment basis in the feature space, this paper introduces an additional inter-modal structural consistency constraint term in addition to the regularized objective function of DINOv2. Specifically, the structure-preserving loss is used to constrain the cosine similarity of the features extracted from images of different modalities but the same scene, so that they are close in the embedding space, thereby alleviating the semantic shift problem caused by the difference in modal distribution. The loss is defined as follows:

L_{align} = 1 - \cos (f_{vis}, f_{ir})

(1)

In formula (1), $f_{vis}$ and $f_{ir}$ represent the global feature vectors extracted from visible light and infrared images by DINOv2, respectively.
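A minimal sketch of the structure-preserving loss in formula (1), written in plain Python for clarity. The function names are illustrative, and the feature vectors here are short hypothetical lists rather than real 768-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(f_vis, f_ir):
    """Formula (1): 1 - cos(f_vis, f_ir); 0 when the two features point the same way."""
    return 1.0 - cosine_similarity(f_vis, f_ir)
```

The loss ranges from 0 (identical direction) through 1 (orthogonal features) to 2 (opposite direction), so minimizing it pulls the two modal embeddings of the same scene together regardless of their magnitudes.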

In addition, DINOv2 itself has the ability to cluster features, that is, the model aggregates the features of semantically consistent areas through self-supervision, effectively improving the semantic consistency under unlabeled conditions. Combined with multimodal input, the infrared image and the visible light image are trained to activate the attention maps of the same area in the Transformer backbone, thereby aligning the structural information and providing high-semantic correlation features for downstream modal fusion.

Finally, each input image is mapped into a semantic embedding vector of uniform dimension (768 dimensions) as the input basis of the multimodal Transformer module. In order to further improve the semantic density and robustness of features extracted from images of different modalities, this paper performs feature normalization after pre-training, and maps all modal features into a unified fusion space through a linear projection layer, thereby constructing a feature base representation with high fusibility and cross-modal consistency.

Deformable cross-modal transformer fusion

In order to achieve high-precision feature fusion and alignment between infrared and visible light modalities, this paper designs and implements a multi-layer Transformer fusion structure based on a deformable cross-modal attention mechanism to complete deep interaction modeling and semantic alignment between modalities. In this module, the infrared image features $F_{ir} \in R^{N \times d}$ and visible light image features $F_{vis} \in R^{N \times d}$ extracted by DINOv2 in the previous stage are sent to the fusion structure for joint encoding. First, the position encoding of the two types of input modal features is performed. A two-dimensional sinusoidal position encoding method compatible with ViT is used to add spatial information to each patch position to keep the spatial structure aligned.

Second, a two-stream cross-attention Transformer structure is constructed, where each layer consists of two independent deformable cross-modal attention modules and a shared Feedforward module. The deformable cross-modal attention module uses visible light as the query and infrared as the key and value to achieve information interaction between modalities. The specific expression is:

{Attention}_{vis \leftarrow ir} (Q_{vis}, K_{ir}, V_{ir}) = softmax (\frac{Q_{vis} K_{ir}^{T}}{\sqrt{d}}) V_{ir}

(2)

In formula (2), $Q_{vis}$ is the query obtained by linear mapping of visible light image features, and $K_{ir}$ and $V_{ir}$ are the key and value obtained by linear transformation of infrared image features. Query, Key, and Value are obtained from the input features through linear mapping, and the parameters are updated independently at each layer to avoid cross-layer interference. Residual connection and LayerNorm operation are used to maintain training stability. The final cross-attention output is:

H_{vis} = LayerNorm ({\tilde{F}}_{vis} + {Attention}_{vis \leftarrow ir})

(3)
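Formulas (2) and (3) can be illustrated with the following single-head sketch. It implements only the scaled dot-product form as written in formula (2); the deformable sampling offsets are not shown, the LayerNorm has no learnable scale or bias, and the projection matrices and toy inputs are assumptions for illustration.

```python
import math

def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def cross_attention(f_vis, f_ir, w_q, w_k, w_v):
    """Visible features query infrared keys/values (formula 2), then residual + LayerNorm (formula 3)."""
    q = matmul(f_vis, w_q)
    k = matmul(f_ir, w_k)
    v = matmul(f_ir, w_v)
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)
    h_vis = [layer_norm([x + a for x, a in zip(f_row, a_row)])
             for f_row, a_row in zip(f_vis, attended)]
    return weights, h_vis

# Toy example: N = 2 patches, d = 2 channels, identity projections (all hypothetical).
identity = [[1.0, 0.0], [0.0, 1.0]]
f_vis = [[1.0, 0.0], [0.0, 1.0]]
f_ir = [[0.5, 0.5], [1.0, -1.0]]
weights, h_vis = cross_attention(f_vis, f_ir, identity, identity, identity)
```

Each row of `weights` sums to 1 and describes how strongly one visible patch attends to each infrared patch; swapping the roles of the two inputs gives the infrared-as-query direction.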

This structure switches the modality direction in the next layer (i.e., infrared is used as the query) to ensure bidirectional semantic alignment. After stacking $L$ layers of cross-attention modules, a deep fusion feature representation $F_{fusion} \in R^{N \times d}$ is generated, in which each patch encoding contains semantic information of both modalities. In order to enhance the complementarity of modalities and improve the discriminability of fusion, this paper introduces a modal attention gating mechanism based on deformable cross-modal attention. This mechanism calculates the dynamic weight of each modality in the fusion output to control the intensity of modal information flow. The gating weight is defined as:

w_{ir}, w_{vis} = σ (W_{g} [F_{ir} \oplus F_{vis}])

(4)

In formula (4), $σ$ is the Sigmoid activation function, $W_{g}$ is the trainable parameter matrix, and $\oplus$ represents the concatenation operation. The final fusion feature is obtained by weighted summation:

F_{fusion} = w_{ir} \cdot F_{ir}^{'} + w_{vis} \cdot F_{vis}^{'}

(5)

In formula (5), $F_{ir}^{'}$ and $F_{vis}^{'}$ are cross-attention output features. This mechanism improves the information expression ability of key modalities in different scenarios and reduces the risk of redundant modalities interfering with system performance.
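The gating in formulas (4) and (5) might look like the sketch below. It assumes one scalar gate per modality per patch, so that $W_{g}$ has shape 2 × 2d; the paper does not specify the shape of $W_{g}$, so this layout and the function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(f_ir, f_vis, f_ir_att, f_vis_att, w_g):
    """Per-patch gated fusion following formulas (4) and (5).

    f_ir, f_vis: N patch vectors of length d used to compute the gates.
    f_ir_att, f_vis_att: cross-attention outputs that are mixed.
    w_g: assumed 2 x 2d matrix; row 0 yields w_ir, row 1 yields w_vis.
    """
    fused, gates = [], []
    for ir, vis, ir_att, vis_att in zip(f_ir, f_vis, f_ir_att, f_vis_att):
        concat = ir + vis  # list concatenation plays the role of the concatenation operator
        w_ir = sigmoid(sum(w * x for w, x in zip(w_g[0], concat)))
        w_vis = sigmoid(sum(w * x for w, x in zip(w_g[1], concat)))
        gates.append((w_ir, w_vis))
        fused.append([w_ir * a + w_vis * b for a, b in zip(ir_att, vis_att)])
    return fused, gates
```

With an all-zero $W_{g}$ both gates equal 0.5 and the fusion reduces to a plain average; training moves the gates away from this neutral point where one modality is more informative. Because each gate passes through its own Sigmoid, the two weights are not forced to sum to 1.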

In addition, in order to improve the accuracy of local area alignment, this paper embeds a local fine-grained matching module in each layer of Transformer. This module takes each patch as the center, constructs the corresponding neighborhood matching graph using the cosine similarity between modalities, and performs local weighted fusion operations on the features in the matching area. The graph matching strategy is used to adaptively align the misaligned regions between modalities at the patch level to optimize the consistent representation of the final features.

During the training process, the modal balance loss function is introduced to constrain the distribution difference of the two modal contributions in the fusion feature, which is defined as:

L_{balance} = \left| \frac{1}{N} \sum_{i = 1}^{N} \left( w_{ir}^{(i)} - w_{vis}^{(i)} \right) \right|

(6)

This loss is used to guide the network to automatically balance the information proportion of different modalities in the fusion process, prevent a specific modality from dominating the attention mechanism for a long time, and improve the expression integrity of multimodal features. Finally, the fusion structure output $F_{fusion}$ is sent to the subsequent self-supervised optimization module for consistency enhancement and decision reasoning. Experiments show that this fusion structure has significant advantages over traditional feature splicing methods in target recognition and modality alignment tasks, especially in complex backgrounds, showing higher semantic discrimination and robustness.
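A short sketch of the modal balance loss in formula (6), reading the sum as covering the per-patch difference between the two gate weights (this reading of the operator scope is an assumption). The gate values below are hypothetical.

```python
def balance_loss(gates):
    """Formula (6): absolute value of the mean per-patch gap between w_ir and w_vis.

    gates: list of (w_ir, w_vis) pairs, one per patch.
    """
    n = len(gates)
    return abs(sum(w_ir - w_vis for w_ir, w_vis in gates) / n)

# Hypothetical gate values for two patches.
one_sided = balance_loss([(0.9, 0.1), (0.8, 0.2)])
offsetting = balance_loss([(0.8, 0.2), (0.2, 0.8)])
```

Because the absolute value is taken after averaging, the loss penalizes a modality that dominates on average over the image, while opposite per-patch preferences cancel out. This leaves the gates free to favor different modalities in different regions.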

Noise-aware contrastive learning optimization

In order to further improve the robustness and consistency of cross-modal fusion features, this paper introduces a self-supervised modality consistency contrastive learning optimization module based on DINOv2 feature extraction and deformable cross-modal attention fusion structure. By constructing positive and negative sample pairs and defining a consistency contrast loss function, the fused features are guided to maintain the convergence of inter-modal representations in the semantic space, thereby alleviating the semantic shift and fusion ambiguity problems between heterogeneous modalities.

In the feature construction stage, the visible light image and infrared image obtained in the same scene are assumed to be $x_{vis}$ and $x_{ir}$ respectively. After being processed by DINOv2 and the cross-modal Transformer, the corresponding fused feature representation $f_{vis}, f_{ir} \in R^{d}$ is obtained. The image pair $(f_{vis}, f_{ir})$ of the same scene is taken as the positive sample pair, and the image pair $(f_{vis}, f_{ir}^{-})$ of different scenes is selected as the negative sample pair to construct a feature contrast learning sample set. To avoid positive and negative sample label dependency, all contrast relationships are constructed under unsupervised conditions, and sample matching is performed based on the spatiotemporal consistency of image acquisition. The contrastive learning loss function is constructed based on the temperature-scaled InfoNCE loss and is defined as follows:

L_{con} = - \log \frac{\exp (\mathrm{sim} (f_{vis}, f_{ir}) / τ)}{\sum_{j = 1}^{K} \exp (\mathrm{sim} (f_{vis}, f_{ir}^{- (j)}) / τ)}

(7)

In formula (7), $\mathrm{sim} (\cdot)$ represents the cosine similarity between normalized vectors, $τ$ is the temperature coefficient, and $K$ is the number of negative samples corresponding to each positive sample. By maximizing the similarity between positive sample pairs and reducing the similarity of negative sample pairs, the model learns the feature aggregation direction between modalities and strengthens the semantic consistency of homologous images.
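The contrastive loss in formula (7) can be sketched as below, taking the similarity term as cosine similarity. As printed, the denominator sums over the K negative samples only, whereas the common InfoNCE formulation also includes the positive term; the `include_positive` flag covering both variants, the function names, and the default temperature of 0.07 are assumptions not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(f_vis, f_ir_pos, f_ir_negs, tau=0.07, include_positive=False):
    """Temperature-scaled contrastive loss for one visible/infrared positive pair.

    include_positive=False follows formula (7) as printed (negatives only in the
    denominator); include_positive=True gives the usual InfoNCE form.
    """
    pos = math.exp(cosine_similarity(f_vis, f_ir_pos) / tau)
    denom = sum(math.exp(cosine_similarity(f_vis, neg) / tau) for neg in f_ir_negs)
    if include_positive:
        denom += pos
    return -math.log(pos / denom)
```

In both variants the loss falls as the positive pair becomes more similar and as the negatives become less similar. The InfoNCE variant is bounded below by 0, while the negatives-only form can become negative. A production implementation would typically use a log-sum-exp formulation for numerical stability.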

In order to ensure the consistency of local areas instead of relying solely on global feature matching, this paper further divides the fused feature $F_{fusion} \in R^{N \times d}$ into Patch units, performs local comparison on the Patch feature $(f_{vis}^{(i)}, f_{ir}^{(i)})$ at each position $i$ , and defines the local comparison loss as:

L_{local} = \frac{1}{N} \sum_{i = 1}^{N} - \log \frac{\exp (\mathrm{sim} (f_{vis}^{(i)}, f_{ir}^{(i)}) / τ)}{\sum_{j = 1}^{K} \exp (\mathrm{sim} (f_{vis}^{(i)}, f_{ir}^{- (j)}) / τ)}

(8)

This loss term strengthens the consistency of semantic responses of local regions in different modalities, effectively improving the detail alignment accuracy and target distinction ability after modal fusion.

In the actual training process, in order to improve the stability of feature contrast learning, the momentum encoder mechanism is used to build a contrast sample library. That is, a replica network with slow parameter updates is constructed for the DINOv2 feature encoder, and its parameter $θ_{m}$ is updated in the following way:

θ_{m} \leftarrow m \cdot θ_{m} + (1 - m) \cdot θ

(9)

In formula (9), $θ$ denotes the parameters of the DINOv2 feature encoder being trained, $θ_{m}$ those of its momentum replica, and $m$ is the momentum coefficient, which is set to 0.999. This mechanism can avoid the training instability caused by the dynamic changes of positive and negative samples, and significantly improve the consistency of the feature embeddings.
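The momentum update in formula (9) is a plain exponential moving average of the encoder parameters. The sketch below applies it to a flat list of parameters; the function name and the one-parameter example are hypothetical.

```python
def momentum_update(theta_m, theta, m=0.999):
    """Formula (9): move each momentum-encoder parameter slightly toward the online encoder."""
    return [m * p_m + (1.0 - m) * p for p_m, p in zip(theta_m, theta)]

# Hypothetical one-parameter example: the online parameter stays at 1.0 while the
# momentum copy starts at 0.0 and drifts toward it.
theta_m, theta = [0.0], [1.0]
for _ in range(1000):
    theta_m = momentum_update(theta_m, theta)
```

With m = 0.999 the momentum copy closes only about 63% of the gap after 1000 updates (1 - 0.999^1000), which shows how slowly the contrast sample library changes relative to the online encoder.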

At the same time, the contrast feature normalization process is introduced. Each feature vector is first normalized by LayerNorm and then L2 normalized to map it to the unit sphere to enhance the separability of different modal features in the angular space and improve the distinguishing ability of cosine similarity:

\hat{f} = \frac{LayerNorm (f)}{∥ LayerNorm (f) ∥_{2}}

(10)
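A minimal sketch of the two-step normalization in formula (10). The LayerNorm here has no learnable scale or bias, which is an assumption; the function names are illustrative.

```python
import math

def layer_norm(f, eps=1e-5):
    """Zero-mean, unit-variance normalization over the feature dimension (no affine terms)."""
    mean = sum(f) / len(f)
    var = sum((x - mean) ** 2 for x in f) / len(f)
    return [(x - mean) / math.sqrt(var + eps) for x in f]

def contrast_normalize(f):
    """Formula (10): LayerNorm followed by L2 normalization onto the unit sphere."""
    normed = layer_norm(f)
    l2 = math.sqrt(sum(x * x for x in normed))
    if l2 == 0.0:
        raise ValueError("constant feature vector cannot be projected onto the unit sphere")
    return [x / l2 for x in normed]

f_hat = contrast_normalize([1.0, 2.0, 3.0, 4.0])
```

After this step every feature has unit length, so the dot product of two normalized features equals their cosine similarity and can be used directly in the contrastive losses.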

In addition, to reduce the impact of pseudo-negative samples, this paper introduces a Hard Negative Mining strategy that dynamically updates the negative sample set based on the feature similarity obtained in the initial training stage. Only sample pairs whose similarity to the positive samples is below a certain threshold enter the loss calculation, which excludes modality-mismatched samples and improves the purity of contrastive learning.
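The threshold-based filtering of negative samples can be sketched as follows. The paper does not give the threshold value, so 0.8 is a hypothetical choice, as are the function names and the example vectors.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_negatives(positive, candidates, threshold=0.8):
    """Keep only candidate negatives whose similarity to the positive sample is below the threshold.

    Candidates that are too similar are treated as likely pseudo-negatives and dropped.
    threshold=0.8 is a hypothetical value.
    """
    return [c for c in candidates if cosine_similarity(positive, c) < threshold]

# Hypothetical features: the first candidate is nearly identical to the positive.
positive = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
kept = filter_negatives(positive, candidates)
```

Here the first candidate has a cosine similarity of about 0.995 to the positive sample and is excluded, while the two dissimilar candidates are retained for the loss calculation.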

Finally, the total loss is a weighted combination of the global contrast loss $L_{con}$, the local consistency loss $L_{local}$, and the modality balance loss $L_{balance}$:

L_{total} = λ_{1} L_{con} + λ_{2} L_{local} + λ_{3} L_{balance}

(11)

In formula (11), each weight $λ_{i}$ is determined by cross-validation to ensure that the optimization objectives between multiple tasks do not conflict. The above optimization module is jointly trained with the main network structure in the end-to-end process to continuously constrain modal consistency and regional alignment.

Implementation of software-hardware collaborative system

In order to achieve accurate perception, deep fusion and efficient recognition of multimodal images in complex scenes, this paper deeply integrates the DINOv2 feature extraction module, the cross-modal Transformer fusion structure and the modality consistency contrastive learning optimization module to build an end-to-end integrated photoelectric detection system framework.^23,24 The system follows an “input-perception-fusion-discrimination” main pipeline. Through modular deployment and a unified training strategy, it addresses the problems of information fragmentation, unstable fusion and weak feature representation in traditional systems.

The overall architecture of the system consists of four core sub-modules: the input preprocessing module, the DINOv2 dual-channel encoding module, the deformable cross-modal attention fusion module, and the modal consistency optimization and decision output module. The sub-modules are connected in series along a unified data flow channel, and end-to-end training and inference are achieved through a GPU (Graphics Processing Unit) parallel acceleration framework, as shown in Table 2.

Table 2.

Computing resource allocation.

Module name	Computational cost (GFLOPs)	Memory usage (MB)	Computation time (ms)	GPU utilization (%)	Parallelization method
Input preprocessing module	1.2	150	8	15	CPU and GPU hybrid
DINOv2 dual-channel encoding module	25	1200	140	92	GPU single card parallelism
Deformable cross-modal attention fusion module	15	900	100	85	Multi-GPU data parallelism
Modal consistency optimization and decision output module	7	600	60	78	GPU single card parallelism

Table 2 shows the computing resources used by each core module of the photoelectric detection system. The input preprocessing module has a low computational workload, a small memory footprint, and a short computation time. It is mainly responsible for data format conversion and preliminary screening, and uses hybrid CPU (Central Processing Unit) and GPU computing for efficiency. The DINOv2 dual-channel encoding module, as the backbone of feature extraction, has the largest computational cost (25 GFLOPs), the highest memory usage (1200 MB), the longest running time (140 ms), and a GPU utilization of 92%, which makes it the computational bottleneck of the system; it runs in parallel on a single GPU card. The deformable cross-modal attention fusion module is at a medium level in computation and memory consumption, and uses multi-GPU data parallelism to accelerate the deep interaction and alignment process. The modality consistency optimization and decision output module has relatively low computational requirements but still maintains a high GPU utilization (78%); it is responsible for feature consistency reinforcement and the final decision.
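The figures in Table 2 can be aggregated with a few lines of arithmetic. The sketch below assumes the four modules run strictly in series without overlap, so that their times add up, and that their memory is resident at the same time; both are simplifying assumptions rather than statements from the paper, and the shortened module names are illustrative.

```python
# Values copied from Table 2; the dictionary keys are shortened module names.
modules = {
    "Input preprocessing": {"gflops": 1.2, "memory_mb": 150, "time_ms": 8},
    "DINOv2 dual-channel encoding": {"gflops": 25, "memory_mb": 1200, "time_ms": 140},
    "Deformable cross-modal attention fusion": {"gflops": 15, "memory_mb": 900, "time_ms": 100},
    "Consistency optimization and decision output": {"gflops": 7, "memory_mb": 600, "time_ms": 60},
}

total_gflops = sum(m["gflops"] for m in modules.values())
total_memory_mb = sum(m["memory_mb"] for m in modules.values())
total_time_ms = sum(m["time_ms"] for m in modules.values())

# The module with the longest computation time and its share of the serial total.
bottleneck = max(modules, key=lambda name: modules[name]["time_ms"])
bottleneck_share = modules[bottleneck]["time_ms"] / total_time_ms
```

Under these assumptions the pipeline totals 48.2 GFLOPs, 2850 MB and 308 ms per input pair, and the DINOv2 encoding module accounts for roughly 45% of the serial computation time.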

In the input stage, the system receives the visible light image $I_{vis}$ and the infrared image $I_{ir}$ synchronously collected from the photoelectric sensor device. After normalization and resolution alignment,^25,26 the two modal images are each sent to the weight-shared DINOv2 backbone network. The DINOv2 ViT-B/16 model, pre-trained under unsupervised conditions, is used as the feature extractor to encode $I_{vis}$ and $I_{ir}$ separately and to extract patch-level semantic features $F_{vis}, F_{ir} \in R^{N \times d}$ of unified dimensions, ensuring that subsequent fusion is performed in a homogeneous space. Subsequently, the system feeds the bimodal features into the designed multi-layer deformable cross-modal attention Transformer fusion structure.

The fused features then enter the modality consistency contrastive learning module. The module uses the constructed contrastive loss function to constrain the consistency of the fused features of $F_{vis}$ and $F_{ir}$ in the spatial dimension. At the same time, a local loss term based on patch-level contrast is introduced to strengthen spatial alignment. During training, the momentum encoder and the negative sample filtering mechanism jointly improve the robustness of contrastive learning and the sample quality, ensuring that corresponding features from the two modalities cluster tightly in the feature space.

In order to realize the photoelectric detection output at the system task level, this paper introduces a task head based on Transformer Decoder at the end of the fusion module.^27,28 The task head contains two sub-modules: one is the classification branch based on Class Token, which outputs the target category confidence; the other is the spatial prediction branch based on position encoding, which generates the center coordinates and bounding box parameters of the target. The multi-task joint loss $L_{t}$ is used in the training phase, which integrates the classification loss, box regression loss and modality consistency loss to achieve end-to-end optimization:

L_{t} = λ_{cls} L_{cls} + λ_{reg} L_{reg} + λ_{con} L_{con}

(12)

In formula (12), $λ_{cls}$, $λ_{reg}$, and $λ_{con}$ control the optimization weights of the respective loss branches, balancing feature aggregation against target discrimination. System training uses the AdamW optimizer with an initial learning rate of 5e-4 decayed by a cosine annealing schedule, for a total of 100 epochs with a batch size of 64. At deployment, the feature extraction, fusion and discrimination processes are encapsulated into a single inference engine with a unified interface to support real-time processing. Through tightly coupled structural optimization and collaborative training of the loss functions across modules, this process achieves adaptive fusion of multimodal optoelectronic image information, enhanced feature consistency, and integrated modeling of the task output.^29,30

Figure 2 shows the learning rate and the training loss during model training. The left vertical axis is the learning rate: under the cosine annealing strategy, it decreases smoothly from its initial value along a cosine curve, which helps training converge stably. The right vertical axis is the training loss, plotted on a reversed axis; the loss decreases from about 1 to about 0.05 as the model parameters are optimized over successive epochs. With this learning rate schedule, training proceeds smoothly and reaches a low final loss, which supports the effectiveness of the training strategy and the convergence of the model.

Figure 2.

Training evaluation.

Evaluation

Accuracy of multimodal target recognition

This paper constructs a multimodal image dataset that contains visible light images and infrared images, covering five typical environmental scenes: sunny, cloudy, rainy, foggy, and night. The image sizes are aligned and the pixel values are normalized. The training set and test set are constructed separately to ensure a balanced distribution of scenes, and target categories are annotated as supervisory signals for the classification task. The comparison methods are: Method 1, a traditional CNN baseline model; Method 2, ResNet50 (ImageNet pre-training with fine-tuning); Method 3, DINOv2 (self-supervised pre-trained Vision Transformer); and Method 4, DINOv2 + Deformable Cross-modal Attention Transformer (the proposed method). The evaluation indicator is the multimodal object recognition accuracy, defined for each scene as the number of test samples correctly classified by the model divided by the total number of test samples in that scene. For each method, the target recognition accuracy is evaluated independently on the test sets of the five scenes. All methods use the same data partitioning and experimental parameters to ensure a fair comparison. Each group of experiments is repeated 3 times and the average value is taken as the final result.
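The evaluation protocol described above (per-scene accuracy averaged over three repeated runs) can be sketched as follows. The labels and predictions are hypothetical placeholders, not data from the paper, and the function names are illustrative.

```python
def accuracy(predictions, labels):
    """Fraction of test samples whose predicted category matches the annotated label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def mean_accuracy_over_runs(runs):
    """Average the per-run accuracy; `runs` is a list of (predictions, labels) pairs."""
    return sum(accuracy(p, y) for p, y in runs) / len(runs)

# Hypothetical labels and predictions for one scene over three repeated runs.
labels = ["vehicle", "person", "vehicle", "person"]
runs = [
    (["vehicle", "person", "vehicle", "vehicle"], labels),  # 3 of 4 correct
    (["vehicle", "person", "vehicle", "person"], labels),   # 4 of 4 correct
    (["person", "person", "vehicle", "vehicle"], labels),   # 2 of 4 correct
]
scene_accuracy = mean_accuracy_over_runs(runs)
```

In this toy case the three runs score 0.75, 1.0 and 0.5, giving a reported scene accuracy of 0.75. The same computation would be repeated for each of the five scenes and each of the four methods.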

Figure 3 shows the target recognition accuracy of the four methods in the five typical environmental scenes. The horizontal axis represents the five common photoelectric detection scenes (sunny, cloudy, rainy, foggy, and night). In every scene, DINOv2 + Transformer has the highest recognition accuracy, staying within the range of 86.4% to 93.2%, which demonstrates superior cross-modal modeling capability and robustness. As the complexity of the environment increases (for example, in fog and at night), the accuracy of all methods decreases, but the proposed method degrades the least, indicating that it is more stable under complex conditions. The accuracy of the traditional CNN method is generally low, reaching only 68.3% at night, and it performs poorly on weak-texture and low-contrast targets. ResNet50 is slightly more accurate than the CNN in all scenes. DINOv2 alone shows good semantic modeling capability. DINOv2 + Transformer combines global semantic features with a cross-modal alignment strategy and achieves the best result in every scene. It also has the smallest accuracy range (maximum 93.2%, minimum 86.4%, a spread of only 6.8 percentage points), which reflects its ability to maintain high recognition performance across complex environments. In contrast, the recognition accuracy of the CNN method varies by about 10 percentage points, so its stability is poor.

Figure 3.

Multimodal target recognition accuracy.

Average modal alignment error

This paper collects multimodal data consisting of paired visible-light and infrared images covering five typical environmental scenes: sunny, cloudy, rainy, foggy, and night. Four methods are used to process the data: the traditional CNN model, the ResNet50 model, the DINOv2 self-supervised Vision Transformer, and the proposed fusion architecture that combines DINOv2 with a deformable cross-modal attention Transformer. For each method, model training or feature learning is completed on the training set. For each pair of multimodal images in the test set, the fused feature representation is extracted, and the spatial or semantic distance between the features of the two modalities is calculated to obtain the modal alignment error. The errors of all test samples in each scene are then averaged to obtain the average modal alignment error of the method in that scene.
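The per-scene averaging described above can be sketched as follows. The text does not specify the distance used, so cosine distance between the visible and infrared feature vectors is assumed here purely for illustration. The function names and the two-dimensional toy features are also hypothetical.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """1 - cosine similarity; 0 for identical directions, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_alignment_error_by_scene(samples):
    """samples: iterable of (scene, visible_feature, infrared_feature) tuples."""
    per_scene = defaultdict(list)
    for scene, visible, infrared in samples:
        per_scene[scene].append(cosine_distance(visible, infrared))
    # Average the per-pair errors within each scene.
    return {scene: sum(errs) / len(errs) for scene, errs in per_scene.items()}


# Toy features (assumed): real features would come from the trained model.
samples = [
    ("sunny", [1.0, 0.0], [1.0, 0.0]),  # perfectly aligned pair
    ("sunny", [1.0, 0.0], [0.0, 1.0]),  # orthogonal pair
    ("night", [1.0, 1.0], [1.0, 0.0]),  # pair at a 45 degree angle
]
errors = mean_alignment_error_by_scene(samples)
```

Any other normalized distance could replace `cosine_distance` without changing the averaging step.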

Figure 4 shows the modal alignment error of the four methods in the five typical environmental scenes. The horizontal axis lists the four methods: CNN, ResNet50, DINOv2, and the proposed DINOv2 + Transformer fusion structure. The vertical axis lists the five photoelectric detection scenes: sunny, cloudy, rainy, foggy, and night. The proposed method (DINOv2 + Transformer) achieves the lowest modal alignment error in all scenes, never exceeding 2.7%, and therefore the best fusion effect. In contrast, the traditional CNN and ResNet50 methods generally have higher errors, shown as darker cells in the figure, which indicates poor cross-modal alignment capability. When DINOv2 is used alone, the error decreases, but the result is neither as low nor as stable as after fusion with the Transformer. The errors of all methods increase with scene complexity (for example, in fog and at night), which reflects the greater difficulty of modal alignment in complex environments. The proposed method still maintains a low error in these scenes, indicating that its fusion strategy is robust and adaptable.

Figure 4.

Modal alignment error.

Feature consistency

This paper collects and preprocesses multimodal photoelectric detection data, including visible-light and infrared images, to ensure data alignment and complete annotation. The self-supervised DINOv2 model is applied to the visible-light and infrared images to extract high-level semantic feature representations of a unified dimension. The extracted multimodal features are fed into a multi-layer deformable cross-modal attention Transformer, which performs deep cross-modal interaction encoding and spatial-semantic alignment to generate fused features. Positive and negative sample pairs are then constructed: positive samples are homologous image pairs, and negative samples are heterologous image pairs. Using the designed contrastive loss function, the fused features are trained in a self-supervised manner to strengthen feature consistency between modalities. After contrastive learning, the distance distributions of positive and negative sample pairs in the feature space are calculated as the basis for evaluating feature consistency. Kernel density estimation is used to draw the probability density curves of the feature distances for positive and negative pairs, which makes the difference between the two distributions directly visible. Finally, a feature consistency score is calculated from the distance distributions, and the scores of the different methods are compared to evaluate how well each method optimizes modal consistency.
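The role of positive and negative pairs in the contrastive step can be sketched with an InfoNCE-style loss. This is a stand-in chosen for illustration: the loss designed in this paper may differ in form. The temperature of 0.1, the function names, and the toy features are all assumptions. Row i of the visible features and row i of the infrared features are treated as a homologous (positive) pair, and every other combination as a heterologous (negative) pair.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(visible_feats, infrared_feats, temperature=0.1):
    """Mean InfoNCE loss where (visible[i], infrared[i]) is the positive pair."""
    vis = [l2_normalize(v) for v in visible_feats]
    ir = [l2_normalize(v) for v in infrared_feats]
    total = 0.0
    for i, anchor in enumerate(vis):
        # Cosine similarity of the anchor to every infrared feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(anchor, other)) / temperature for other in ir]
        # Numerically stable log-sum-exp over positives and negatives.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(vis)


# Toy features (assumed): two targets with orthogonal representations.
visible = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(visible, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce_loss(visible, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each visible feature matches its own infrared counterpart and large when the pairing is swapped. This is the pressure that pulls homologous pairs together and pushes heterologous pairs apart.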

The left panel of Figure 5 shows the feature distance distributions of positive and negative sample pairs. The blue solid line represents the feature distance of homologous image pairs. It is concentrated at small values, indicating that after optimization the feature representations of the same target in different modalities are highly consistent. The red dotted line represents the feature distance of heterologous image pairs. It is more spread out and larger, indicating that the features of different targets are clearly distinct. The clear separation of the two distributions verifies that the designed contrastive learning loss effectively improves feature consistency between modalities. The right panel shows the feature consistency scores of the different methods. The proposed method achieves the highest score, 0.94, which is clearly better than the traditional CNN and ResNet50 baseline models. This indicates that the framework combining DINOv2 with multimodal fusion and contrastive learning optimization is more capable in feature consistency modeling.

Figure 5.

Feature consistency.
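The density curves of positive-pair and negative-pair distances used in this subsection can be sketched with a plain Gaussian kernel density estimate. The distances, the bandwidth of 0.05, and the function name are made-up illustrative values, not data from this paper. The consistency score itself is not reproduced here because its formula is not given in this section.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Assumed toy distances: homologous pairs are close, heterologous pairs are far apart.
positive_distances = [0.10, 0.12, 0.15, 0.18, 0.20]
negative_distances = [0.70, 0.80, 0.85, 0.90, 1.00]

pos_density = gaussian_kde(positive_distances, bandwidth=0.05)
neg_density = gaussian_kde(negative_distances, bandwidth=0.05)

# Evaluate both curves on a common grid, as one would before plotting them.
grid = [i / 100 for i in range(121)]
pos_curve = [pos_density(x) for x in grid]
neg_curve = [neg_density(x) for x in grid]
```

Well-separated curves, with the positive curve peaking at a small distance and the negative curve at a large one, correspond to the separation described for the left panel of Figure 5.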

Robustness evaluation

This paper designs three types of input variation: light intensity, weather conditions, and noise level. Each type is subdivided into multiple levels to simulate the range of conditions found in real complex environments. A multimodal photoelectric image dataset covering these levels is collected, balanced across levels, and preprocessed with image normalization and alignment. The trained photoelectric detection system based on DINOv2 and cross-modal Transformer fusion is run independently under each condition, performing feature extraction, fusion, and target recognition on the input images. The tests under each condition are repeated multiple times, and the consistency of the outputs across repetitions is analyzed statistically to obtain the system output stability rate, that is, the proportion of outputs that do not fluctuate significantly under the same condition.

The model stability index (System Stability Rate) measures the consistency and robustness of the outputs of the photoelectric detection system under a given environment or under changes in input conditions. It is calculated as the number of repeated tests in which the system output remains unchanged or fluctuates only within the preset tolerance range, divided by the total number of tests.
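The stability rate defined above can be sketched as a simple ratio. The reference output, the tolerance of 0.05, the function name, and the ten scores are assumed for illustration, since the preset tolerance range and the form of the system output are not specified in this section.

```python
def stability_rate(outputs, reference, tolerance):
    """Share of repeated tests whose output stays within `tolerance` of `reference`."""
    if not outputs:
        raise ValueError("at least one test output is required")
    stable = sum(abs(o - reference) <= tolerance for o in outputs)
    return stable / len(outputs)


# Assumed toy data: recognition confidence from ten repeated tests in one scenario.
scores = [0.91, 0.90, 0.92, 0.89, 0.91, 0.80, 0.90, 0.91, 0.93, 0.90]
rate = stability_rate(scores, reference=0.90, tolerance=0.05)
```

In this toy run only the 0.80 output falls outside the tolerance band, so nine of the ten tests count as stable.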

Figure 6 shows the stability of the photoelectric detection system under three types of environmental variation: light intensity, weather conditions, and noise level. From the lighting sub-figure, it can be seen that the system has the highest stability rate (about 0.90) under medium lighting, and that stability decreases in extreme lighting (very low and extreme). Lighting changes therefore affect the robustness of the system to some extent, but stability remains high overall. The weather sub-figure shows that the stability rate is relatively high (over 0.89) in clear and cloudy conditions. As the weather deteriorates (rain, fog, and storms), the stability rate gradually decreases, indicating that the system adapts less well to severe weather. The noise sub-figure shows the response of the system to different noise intensities. The stability rate decreases gradually from a peak of 0.95 in the noise-free environment to 0.75 under extreme noise. Noise interference thus has a significant impact on system performance, but the system still maintains good stability over a wide range of noise levels.

Figure 6.

Robustness evaluation.

Inference latency

This paper collects a multimodal photoelectric detection dataset of visible-light and infrared images, chosen for diversity and representativeness, and deploys four target recognition models (CNN, ResNet50, DINOv2, and DINOv2 + Transformer) on the same NVIDIA Jetson AGX Orin platform so that hardware configuration and system state are consistent. Each model performs repeated inference on a fixed test set, and the delay from image input to result output is recorded for every run. The recorded delays are then used to characterize the inference latency distribution of each model, including the median, quartiles, and outliers.
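The measurement procedure above can be sketched as follows. The timing loop accepts any callable as the model. The 1.5 x IQR rule for outliers is the usual box-plot convention and is assumed here, as are the function names and the sample latencies. None of these details are stated in this section.

```python
import statistics
import time


def measure_latencies_ms(infer, inputs, repeats=5):
    """Time each call of `infer` from input to output, in milliseconds."""
    latencies = []
    for _ in range(repeats):
        for x in inputs:
            start = time.perf_counter()
            infer(x)
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize_latencies(latencies_ms):
    """Median, quartiles, and box-plot outliers (beyond 1.5 x IQR from the quartiles)."""
    q1, median, q3 = statistics.quantiles(latencies_ms, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [t for t in latencies_ms if t < low or t > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Assumed toy latencies in milliseconds, with one slow run.
sample = [34.0, 35.0, 36.0, 35.0, 34.0, 36.0, 35.0, 60.0]
summary = summarize_latencies(sample)
```

On an embedded GPU platform, a few warm-up runs are usually discarded and the device is synchronized before the timer is stopped. Both steps are omitted here to keep the sketch independent of any framework.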

Figure 7 shows the inference latency distributions of the four methods. The DINOv2 + Transformer method has the lowest median inference latency, about 35 milliseconds, which is clearly lower than that of the other three methods and makes it the fastest overall. The traditional CNN model has the highest inference latency, and ResNet50 is slightly better. DINOv2 alone, with its self-supervised visual Transformer structure, is considerably more efficient. Combining the DINOv2 model with the Transformer structure yields the lowest inference latency, which meets the strict real-time requirements of the photoelectric detection system.

Figure 7.

Delay distribution.

Conclusion

This paper addresses the critical challenges of limited perception accuracy and unstable multimodal fusion in photoelectric detection systems operating under complex environmental conditions. To overcome these limitations, we propose an end-to-end optimization framework that integrates DINOv2-based self-supervised semantic feature extraction with a cross-modal Transformer fusion architecture. Specifically, DINOv2 is employed to extract high-level semantic features from visible and infrared modalities at a unified spatial scale, enabling robust representation learning without reliance on extensive labeled data. A multi-layer deformable cross-modal attention mechanism is then introduced to achieve fine-grained spatial alignment and deep semantic fusion across modalities. Furthermore, a modality consistency-aware contrastive learning strategy is incorporated to enhance the coherence and robustness of the fused feature space. The integrated framework effectively constructs a high-precision, high-stability multimodal photoelectric detection system. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches in both multimodal recognition accuracy and cross-modal alignment error. Nevertheless, limitations remain in real-time processing efficiency and generalization to large-scale, diverse datasets. Future work will focus on developing lightweight network architectures and adaptive modality selection mechanisms to improve computational efficiency and scalability, thereby enhancing the system’s practicality for deployment in real-world, resource-constrained scenarios.

Footnotes

ORCID iD

Shangbing Huang

Author contributions

All authors contributed to the study conception and design. Material preparation and data analysis were performed by Hong Xiao, Kexuan Wang, and Han Deng. The first draft of the manuscript was written by Hong Xiao, Zijian Gao, and Shangbing Huang. All authors read and approved the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The authors confirm that the data supporting the findings of this study are available within the article.
