Abstract
The proliferation of unmanned aerial vehicles (UAVs) in both civilian and military domains has intensified the need for autonomous counter-drone systems capable of operating without reliance on ground infrastructure. Existing ground-based and hybrid approaches suffer from high latency and fail completely under communication jamming or denial. This paper proposes a fully onboard (OB) architecture for autonomous drone-to-drone detection in both the visible (RGB) and thermal (IR) domains, where all perception and decision-making tasks are executed exclusively on the embedded computational resources of the UAV. In particular, this work analyzes the features of the physical components of the architecture (i.e., the Single-Board Computers, or SBCs for short, the available sensors, and the UAV platforms) and their performance in the experimental settings. Various computational platforms are tested to assess their impact on the detection pipeline, evaluating inference speed (fps), inference time (ms), power consumption (W) and operational autonomy. To enable a comprehensive evaluation, a ground-based (GB) counterpart was also implemented, in which real-time video streams are transmitted from the drone to a ground station for processing and control commands are subsequently sent back. The onboard architecture offers significantly lower latency and complete independence from radio links and controllers, making it particularly suitable for applications requiring high robustness in communication-denied or contested environments. In particular, the findings highlight the advantages of the Jetson Orin Nano platform, which achieves inference speeds of up to 80.93 fps (12.36 ms per frame) with quantized YOLOv8n models, surpassing state-of-the-art performance. To the best of our knowledge, this is the first fully onboard RGB-IR drone-to-drone visual detection architecture in the literature.
Keywords
Introduction
The rapid proliferation of Unmanned Aerial Vehicles (UAVs) in both civilian and military sectors has significantly increased the risk of intrusions and accidents, highlighting the need for reliable drone-to-drone detection systems.1,2 Ground-based processing of UAV video streams is common and easier to implement, thanks to its access to virtually unlimited computational resources. However, it suffers from high latency, single points of failure in communication links and a complete loss of functionality where communication is jammed or denied. 3 Hence, the main novelty of this paper is the proposal of a fully onboard architecture for real-time drone-to-drone visual detection, where all perception, detection, tracking and decision-making tasks are executed autonomously using embedded hardware on board the UAV. The proposed system thus achieves a fully independent operational capability, without the need for external communications, ensuring continuous operation even when the link with the ground station is interrupted or actively jammed.
In order to rigorously validate the performance of the onboard approach, different state-of-the-art onboard platforms for UAVs have been tested, each impacting the performance in terms of detection speed, inference time and operational autonomy. In addition, a ground-based version has been implemented for comparison, with the master drone (i.e., the in-air detector in this configuration) streaming real-time video to a ground control station (GCS) for processing. This is a configuration widely used in semi-autonomous UAV systems today. Both the onboard and ground-based systems share the same detection and tracking pipelines, to ensure a fair comparison of their respective strengths and weaknesses.
The experimental results, obtained from extensive practical tests, show that the onboard system significantly outperforms the ground-based approach in responsiveness and robustness, while maintaining high real-time detection performance (especially in the thermal domain) thanks to the optimized models adopted. This makes it particularly suitable for applications in contested environments.
Synopsis
This paper proposes and evaluates a fully onboard architecture for real-time drone-to-drone visual detection. All perception and decision-making tasks are performed autonomously on the UAV without relying on any ground infrastructure or real-time radio link, except for the startup phase of the UAV.
In order to rigorously quantify its benefits and limitations, the onboard solution is compared with a conventional ground-based system that offloads processing to a ground control station, an approach widely used in current systems that involve UAV navigation and interception.
The rest of the paper is organized as follows. Section 2 reviews the state of the art in drone-based detection systems, with particular emphasis on fully autonomous onboard approaches. Section 3 briefly presents the proposed system architecture with its main components. Section 3.1 describes the onboard physical architecture in detail, covering hardware selection and physical implementation, while Section 3.2 shows the logical design of the system, presenting the software pipeline, real-time processing and decision-making steps. Section 4 presents the experimental setup and the results obtained from extensive tests using different onboard SBCs and computer vision models. Beyond the implementation of a specific onboard detection platform, this work aims to distill broader architectural principles for autonomous UAV perception in communication-constrained environments. In particular, the study identifies transferable design guidelines concerning computation placement, modality integration, hardware selection, and model optimization under Size–Weight–Power (SWaP) constraints. These principles are intended to support future development of communication-independent aerial perception systems beyond the specific platform evaluated here (see Section 4.5).
Finally, Section 5 summarizes the evidence obtained from the comparison of the different paradigms, discussing their respective strengths and weaknesses.
Therefore, this paper provides clear and actionable insights into the fundamental trade-offs between onboard and ground-based perception strategies, in the development of truly autonomous UAV systems for contested and communication-denied environments.
Related work
The rapid proliferation of UAVs has dramatically increased the need for reliable drone detection systems capable of operating in shared and contested airspace. In particular, drone-to-drone autonomous approaches are currently studied, since they could represent possible solutions to apply in dangerous areas where humans cannot operate. More broadly, recent research in robotics has explored the integration of learning-based techniques to improve autonomy in articulated robotic systems, including the optimization of locomotion patterns and the learning of joint motion constraints in complex mechanisms. Although these studies primarily address motion generation rather than perception tasks, they highlight the growing role of machine learning methods in enabling adaptive and intelligent robotic platforms.4–6
UAV detection system architectures
This section reviews the state-of-the-art in airborne UAV detection, with particular emphasis on architectural choices and their implications for real-time performance, autonomy and deployment in communication-denied environments.
Ground-based
These counter-UAV and air-traffic awareness systems (e.g., Dedrone, 7 Robin Radar, 8 Aaronia AARTOS 9 ) rely on radar, radio-frequency (RF) and camera-based networks installed on towers or fixed sites. They offer long range and high accuracy at the price of high cost, limited coverage in remote or dynamic scenarios, and an inability to protect the detecting UAV itself when it operates far from the protected perimeter. 10 In ground-processing configurations, the UAV is merely a flying camera that transmits compressed video (usually H.264/H.265 over 4G/5G or dedicated RF links) to a ground control station (GCS) equipped with high-end GPUs.11,12 Detection and tracking are performed on the GCS and commands are sent back via MAVLink or similar protocols. This enables the use of heavy models such as YOLOv8x 13 and Faster R-CNN, which achieve mAP@0.5 above 93%. However, these configurations also introduce latencies of 200–800 ms and fail completely when the communication link is lost or jammed.14,15
Hybrid and dual-stage
Computation is split between the drone and the ground station. Ntousis et al. 14 proposed a dual-stage pipeline: lightweight detection onboard (YOLOv5n) for low-latency candidate generation, followed by heavy re-identification and tracking on the ground server when bandwidth permits. Similar ideas appear in Kwon and Lee 16 and Liu et al. 17 These systems still collapse to zero capability when the link is broken, and they add architectural complexity.
Fully onboard
Recently, embedded neural processors and lightweight computer vision models18,19 have favored moving processing onto the UAV itself. Early works focused on general obstacle avoidance rather than drone-to-drone detection.20–22 Vrba et al. 23 and subsequent works from the CTU Prague MRS group 24 shifted to LiDAR-based flying-object detection on the Intel Up Squared board with an Ouster OS1 sensor. Despite being robust at long ranges, these solutions added weight, power draw and cost, making them impractical for small-to-medium commercial platforms.
Chen et al. 25 demonstrated real-time object detection on a Snapdragon 855 platform with embedded NPU, achieving 28 fps with MobileNetV2-SSD, but only on static or slowly moving objects. Vision-only onboard systems using Jetson Nano appeared in recent studies.26–28 Typical performance ranged between 8–25 fps, with mAP@0.5 of 0.64–0.79 and power consumption of 10–25 W. This came at the expense of reduced flight time or reliance on ROS2-to-ground telemetry for decision-making, thus giving up communication independence.
Finally, some recent research integrates audio and video sensors to improve UAV detection, localization and tracking in open environments. 29 Specifically, the cited work relies on audio detection of the target for a preliminary estimate of the direction of arrival, which is used to command a rotation of the drone/camera towards the target. The target is then detected by a CNN model, and prediction confidence is enhanced by the complementary information provided by audio and video.
To the best of our knowledge, no prior work has demonstrated a fully onboard drone-to-drone detection system combining both RGB and IR modalities without any ground link dependency.
The importance of domain-specific datasets
The broader literature suggests that model initialization and adaptation strategies should not be selected solely on the basis of benchmark popularity. Evidence from adjacent computer vision domains shows that domain-specific transfer learning can outperform standard ImageNet-based pretraining when visual characteristics differ substantially from natural-image distributions. This observation is particularly relevant for aerial RGB–IR detection, where operational imagery often diverges from canonical datasets and may benefit from specialized fine-tuning protocols.30,31 This is the main reason behind the choice of the dataset used in this work (see Section 4.1).
Proposed drone-to-drone detection system
The proposed system architecture integrates several key components, which are highlighted in Figure 1. The flow of information starts from the UAV platform itself, which integrates a plug board to which the main onboard controller (the Linux Single Board Computer) can be connected through the power connector and the two USB ports. Power is supplied to the whole system by the main UAV batteries via a dedicated DC-DC converter, ensuring stable operation even during high-thrust maneuvers. Data exchange, on the other hand, relies on the two USB peripherals. One is exclusively dedicated to information sharing between the software development kit (SDK) and the drone, through a USB-to-TTL (Transistor-Transistor Logic) adapter. The other USB cable is used for managing advanced sensors (such as RGB-IR multispectral cameras) and for capturing live videos and photos. These can then be saved in the SBC's memory or shown directly in the SBC's operating system, if it also integrates a graphical user interface (GUI).

Figure 1. Overall proposed architecture, describing how the autonomous system interacts with the external environment.
This section presents the fully onboard drone-to-drone detection architecture, also visible in Figure 2. The system is specifically designed to operate autonomously in communication-denied or contested environments, eliminating any real-time dependency on ground infrastructure for perception, detection, tracking or decision-making. All processing stages, from sensor data acquisition to adversary drone detection, are executed exclusively on embedded hardware carried by the UAV, ensuring continuous functionality even under active RF jamming or beyond-radio-line-of-sight conditions.

Figure 2. Physical architecture of the system, in which the main components are highlighted and summarized, together with their connections.
The proposed solution has been implemented and evaluated on a robust and industrial-grade M200 DJI platform. Its modular payload interface and high payload capacity (more than 2 kg) make it an ideal testbed for real-world deployment of autonomous onboard perception systems on UAV platforms.
The architecture is deliberately sensor-agnostic, in order to maximize flexibility and applicability across operational scenarios. It supports seamless integration with a variety of payloads, including the ones cited before. However, our selection has been focused on multimodal optical sensors, such as the one presented in Section 3.1.1. In particular, the choice of a single, proprietary camera model stems from the high cost and limited flexibility of mounting multiple modules on the same UAV. Nevertheless, the dataset 32 used for model training includes images captured by a variety of sensors, ensuring a broader and more generalized data representation. In particular, it comprises over 20,000 RGB and IR image pairs collected across 20 distinct outdoor scenarios, with an 80/20 train/test split.
Moreover, the NVIDIA Jetson Orin Nano has been selected as the main onboard processing unit, instead of the Raspberry Pi 4B and Raspberry Pi 5, due to its significantly higher computational performance while maintaining a similar form factor. As will be seen in Section 4, the Jetson Orin Nano provides onboard computing capabilities comparable to those of a ground station, making it a powerful solution for onboard real-time processing. The differences in computational performance and power consumption are quantitatively assessed in the same section.
In addition, the modular design ensures that the same software stack can be deployed on both SBCs with minimal reconfiguration, facilitating comparative studies and technology transfer to lower-SWaP platforms. (SWaP stands for Size Weight and Power, a standard acronym used in aerospace engineering.)
Modern UAV sensors can be broadly classified according to their hardware architecture and software ecosystem. A fundamental distinction exists between open-source payload sensors and closed-source (proprietary) payload sensors, each offering specific advantages and limitations depending on the intended application and use case. In the following, we briefly introduce some popular ones, with respect to the goal of drone-to-drone detection.
Closed source:
DJI: optical sensors such as the DJI Zenmuse XT2 and DJI Zenmuse Z30 are considered closed source, since they have been developed to give optimal performance in specific working environments. In addition, it is quite complicated to develop and integrate software on these cameras, as DJI removes support a few years after its products first appear on the market.
YUNEEC: standard RGB cameras, like E30Z/X (30
FLIR: the most relevant unmanned payloads are the Teledyne FLIR Vue TV128 (dual thermal and visible camera payload) and the EO/IR MK-II (multi-sensor imaging payload).
Open source:
SIYI: open gimbal payloads such as the SIYI A8 mini (3-axis gimbal camera) and A2 mini (ultra wide angle FPV gimbal camera), which allow smooth movements thanks to a dedicated gimbal structure and are programmable for a wide range of needs thanks to their open-source ecosystem.
Pi Camera: one of the best known cameras on the market, due to its low cost and compatibility with many boards. However, due to its limited size, it does not guarantee high resolutions.
Furthermore, additional advanced payload modules enable drone detection through alternative sensing techniques, such as multi-channel image analysis. These include multispectral cameras, such as the MicaSense RedEdge-MX and Altum-PT, which support combined visible and near-infrared (VNIR) sensing. Such capabilities are particularly effective in cluttered or densely vegetated environments, where single-band optical sensors may suffer from reduced detection performance.
Beyond multispectral systems, there also exist hyperspectral cameras, such as those produced by Resonon. These sensors provide even finer spectral resolution by capturing narrow spectral bands across a continuous wavelength range, which allows detailed material discrimination and advanced classification techniques. However, Resonon hyperspectral payloads cannot be considered fully open source. While the manufacturer provides an internal software development kit (SDK) to access data and control the modules, the underlying hardware design, firmware and schematics are proprietary and not publicly available. Finally, all sensors are summarized in Table 3 of Appendix A, in order to complete the comparison.
In order to develop an onboard architecture, it is also essential to evaluate and select the appropriate hardware components. In particular, main boards for data processing are a crucial element for developing an autonomous system, as they must offer good computational capabilities embedded into a small and energy saving form factor.
In particular, for UAV onboard computing, ARM-based SBCs have a relatively low power draw compared to x86 systems, making them suitable for battery-constrained platforms, while offering adequate computing resources for control and perception tasks. Moreover, they feature rich I/O interfaces for sensors (USB 3.0, CSI camera interface, GPIO, and networking), which simplify integration with cameras, IMUs, GPS, and radios, together with an extensive community, ROS support, and libraries for robotics and UAV development.
Conversely, x86 systems provide significantly higher CPU throughput for tasks such as mapping, planning, or multi-sensor processing, while still featuring broad compatibility. Finally, NVIDIA platforms open the possibility of dedicated, energy-efficient edge AI acceleration, balancing performance with UAV battery limits.
Thus, exploring the market, the single-board computers (SBCs) reported in Table 4 in Appendix A have been identified as the main and most relevant solutions feasible for onboard systems. Although some of them have already been tested in previous research, 28 our comparison has been extended to more diverse architectures, while focusing on specific parameters for comparing the boards.
Raspberry Pi 4B: a cost-effective and low-power solution suitable for lightweight and optimized computer vision models (mostly nano and darknet versions of YOLO 33 ), which has been our first choice.
Raspberry Pi 5: an evolution of the previous board, more suitable for AI tasks. The 8 GB RAM version has been selected instead of the 16 GB one, since the models load mostly the CPU and GPU rather than the RAM. In addition, this choice makes the test results easier to compare.
Intel NUC 7: a small form-factor x64 PC built on an Intel processor with integrated graphics, which guarantees higher performance but lacks autonomy and power-saving features.
NVIDIA Jetson Orin Nano: usually selected as the primary platform due to its significantly higher performance, thanks to its CUDA-capable graphics module. For this reason, the board supports advanced tracking and prediction algorithms, with a power consumption comparable to that of the Raspberry Pi 5 (10 W to 15 W under load).
Although these solutions are quite promising for developing an autonomous system, many competitors still rely on ground-based systems for drone monitoring, which often feature very powerful stations for running heavy AI models. Most of them are equipped with NVIDIA RTX GPUs, which are specifically designed with large and fast memories. An example is the x64 Dell workstation used in the experimental phase of this research, which features an Intel i9-11900 processor and an RTX 3090 GPU. It outperforms the other systems presented above, at the expense of power consumption, form factor and portability.
The logical architecture
Figure 3 illustrates the end-to-end logical architecture of the proposed onboard processing pipeline. All stages delimited by the green area are executed entirely on the UAV, eliminating the need for any ground-based computation or communication links.

Figure 3. Logical architecture and real-time processing pipeline. Once the UAV startup command is given by the user, the whole system operates autonomously, booting up the OS and initiating the drone detection pipeline. However, due to EASA regulations, the pilot remains ready to intervene if needed.

Figure 4. Comparison between raw frames (before) and aligned frames (after).
Furthermore, in order to clearly illustrate the whole process, the individual steps are described in detail in the subsequent paragraphs.
The pipeline begins after UAV startup by the operator. In this phase the onboard camera is also automatically calibrated. During take-off, the SBC’s operating system boots up enabling connections between its ports and UAV control unit. Only after this step, camera stream and video acquisition start through the UAV software development kit (SDK).
Stream acquisition and alignment
At this point the SBC is able to acquire the video stream and process each frame through its internal computational unit, which enables GPU-accelerated processing. The live video feed is then captured by the software pipeline, which must also align the RGB and infrared streams. The alignment operation is crucial and depends on the sensor features. In particular, it is based on the following steps:
the focal ratio should then be used to obtain the scaling factors
after the previous step, the RGB image is rescaled with respect to the IR one using
since cameras focuses are physically separated, the distance between them (
before cropping the RGB image with respect to the IR one, the position where to crop should be computed using the following method.
finally, the RGB frame is rescaled to the same resolution of the IR one in order to have the two frames aligned (Figure 4).
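The alignment steps above can be sketched as a small geometry computation. This is a minimal illustration under a pinhole-camera assumption, not the exact procedure used in this work: the function name `plan_alignment`, the focal lengths expressed in pixels, the frame sizes and the pixel offset standing in for the physical separation of the two lenses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentPlan:
    scale: float        # factor applied to the RGB frame
    scaled_size: tuple  # (width, height) of the RGB frame after scaling
    crop_box: tuple     # (x0, y0, x1, y1) window matching the IR field of view


def plan_alignment(rgb_size, ir_size, f_rgb_px, f_ir_px, offset_px=(0, 0)):
    """Compute how to scale and crop an RGB frame so it overlays the IR frame.

    Assumption: both cameras follow a pinhole model, so the ratio of their
    focal lengths (in pixels) gives the scaling factor between the two images.
    offset_px is a hypothetical shift accounting for the lens separation.
    """
    rgb_w, rgb_h = rgb_size
    ir_w, ir_h = ir_size

    # Step 1: scaling factor from the focal ratio.
    scale = f_ir_px / f_rgb_px
    scaled_w, scaled_h = round(rgb_w * scale), round(rgb_h * scale)
    if scaled_w < ir_w or scaled_h < ir_h:
        raise ValueError("scaled RGB frame is smaller than the IR frame")

    # Step 2: crop position, centred and shifted by the lens offset,
    # clamped so that the window stays inside the scaled RGB frame.
    dx, dy = offset_px
    x0 = min(max((scaled_w - ir_w) // 2 + dx, 0), scaled_w - ir_w)
    y0 = min(max((scaled_h - ir_h) // 2 + dy, 0), scaled_h - ir_h)

    # Step 3: the cropped window already has the IR resolution.
    return AlignmentPlan(scale, (scaled_w, scaled_h), (x0, y0, x0 + ir_w, y0 + ir_h))


# Hypothetical example values: a 1920x1080 RGB frame and a 640x512 IR frame.
plan = plan_alignment((1920, 1080), (640, 512),
                      f_rgb_px=1600.0, f_ir_px=800.0, offset_px=(10, -4))
print(plan)
```

In a real pipeline the resulting scale and crop box would be handed to an image library to resample and crop each RGB frame; since they depend only on the sensor geometry, they could be computed once and reused for every frame.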
The Ultralytics YOLO detector then performs inference on each input image received.
After detection, non-maximum suppression and tracking modules refine the results, while Kalman filters predict object trajectories. In particular, these filters estimate the new state in which the UAV will be after a movement, taking into account the system model and its uncertainty. More details on the process, including the Kalman filter equations, can be found in Section D of the Appendix.
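As an illustration of the prediction step, the following is a minimal sketch of a standard constant-velocity linear Kalman filter tracking the centre of a detection box. It is an assumption made for illustration and not the formulation given in the Appendix: the class name, the state layout and all noise values are hypothetical.

```python
import numpy as np


class ConstantVelocityKalman:
    """Linear Kalman filter with state [x, y, vx, vy] and measurement [x, y]."""

    def __init__(self, x, y, dt=1.0, process_var=0.01, meas_var=1.0):
        self.state = np.array([x, y, 0.0, 0.0], dtype=float)
        self.P = np.eye(4) * 100.0          # large initial uncertainty
        self.F = np.eye(4)                  # constant-velocity motion model
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.eye(2, 4)               # only the position is observed
        self.Q = np.eye(4) * process_var    # process noise (hypothetical value)
        self.R = np.eye(2) * meas_var       # measurement noise (hypothetical value)

    def predict(self):
        """Propagate the state one step ahead and return the predicted position."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2].copy()

    def update(self, measurement):
        """Correct the prediction with a detected box centre (x, y)."""
        residual = np.asarray(measurement, dtype=float) - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ residual
        self.P = (np.eye(4) - K @ self.H) @ self.P


# Synthetic target moving 3 px right and 1 px up per frame.
kf = ConstantVelocityKalman(x=100.0, y=50.0)
for k in range(1, 51):
    kf.predict()
    kf.update((100.0 + 3.0 * k, 50.0 - 1.0 * k))
next_position = kf.predict()
print(next_position)
```

When a frame yields no detection, calling `predict()` without `update()` lets the tracker coast on the motion model, which is one common way to bridge short detection gaps.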
Technical implementation
The onboard software environment was established on an NVIDIA Jetson Orin Nano, selected as the embedded computer for the UAV platform, running a dedicated version of Ubuntu. All integration steps between software and UAV sensors closely followed the official UAV SDK documentation, which is quite standard and can be applied to many other drone platforms in the market.
To capture video from the drone camera, the legacy FFmpeg 2.8.15 library was chosen. Since it is no longer available in the standard Debian repositories but is still compatible and stable for our purposes, it was compiled from source.
Moreover, UART port access was granted by adding the main user of the SBC to the dialout group and by creating the necessary udev rule inside the path /etc/udev/rules.d/. The serial interface was subsequently enabled while keeping Bluetooth disabled, since it might occupy the communication bus reserved to UART.
The advanced sensing feature of our UAV SDK provides direct low-latency access to live video streams from both the FPV camera and the main payload camera via the USB-Serial link, bypassing the delays inherent in streaming through the UAV Radio Controller. Therefore, delay while capturing live video feeds has to be taken into account when building a new autonomous system from scratch.
Finally, particular attention was given to the optimization of drone detection models, in order to reach the highest performance possible on the board. Starting from PyTorch standard models saved in
Standard ONNX format is suitable for CPU-based computations on SBCs that do not integrate dedicated graphics processing units, whereas the Engine format enables GPU-accelerated execution using NVIDIA TensorRT libraries on compatible boards. Although PyTorch natively supports GPU inference through the
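The choice between the two export formats can be sketched with the Ultralytics export API. This is a hedged example rather than the script used in this work: the helper names and the weights file name in the usage note are hypothetical, and the conversion itself requires the `ultralytics` package (plus TensorRT on the target board for the Engine format).

```python
def export_settings(has_nvidia_gpu: bool) -> dict:
    """Choose Ultralytics export arguments for the target board."""
    if has_nvidia_gpu:
        # TensorRT engine with FP16 weights for CUDA-capable boards.
        return {"format": "engine", "half": True}
    # Plain ONNX graph for CPU-only boards.
    return {"format": "onnx"}


def export_model(weights_path: str, has_nvidia_gpu: bool) -> str:
    """Export a trained model and return the path of the exported file."""
    # Imported lazily so that export_settings stays usable without the package.
    from ultralytics import YOLO

    model = YOLO(weights_path)
    return model.export(**export_settings(has_nvidia_gpu))


print(export_settings(has_nvidia_gpu=True))
print(export_settings(has_nvidia_gpu=False))
```

For instance, `export_model("yolov8n.pt", has_nvidia_gpu=True)` would be run on the Jetson itself, since TensorRT engines are built for the specific GPU on which they are generated.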
Experimental results
This section reports the experimental evaluation of all the embedded boards, focusing on evaluation procedure, system integration, power consumption, computational performance and detection accuracy.
Evaluation procedure and setup
First of all, several Ultralytics YOLO models have been trained using our own multimodal dataset for drone detection, 32 which is available for public download together with the models tested here. These models have then been adopted for measuring the performance of the boards on UAV detection.
In order to evaluate the differences between the platforms, several tests have been conducted running all our pretrained models and performing inference on a 3-min MPEG4-compressed H264 30 FPS video, with a resolution of 640x512 pixels for a total flow of 5,400 frames. The test video may or may not contain a target drone (in our case a DJI Mavic Mini 4K), in order to verify the behavior of the models when the target UAV leaves the frame.
It is important to clarify that this benchmarking phase and the real onboard deployment correspond to two different but complementary stages of the evaluation. For the comparison of inference speed, latency and power consumption across five highly heterogeneous hardware platforms, the input stimulus had to be strictly identical; therefore, all boards were tested on the same prerecorded video stream. Running independent live flights for each board would have introduced uncontrolled environmental variability, such as different target trajectories, wind-induced payload oscillations and lighting changes, thus compromising the fairness of the comparison. At the same time, as described in Sections 3.1 and 3.2, the full system is effectively deployed onboard the DJI Matrice 210, powered directly by the UAV batteries and processing the live stream in real time through the DJI SDK.
The range of pixel resolutions of the target has also been calculated to have an additional reference of the actual pixel dimensions relative to the distance in meters between the target UAV and the camera. The following formulas were used to compute the real-world dimensions in pixels for different distances:
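One common way to obtain such a pixel footprint is the pinhole camera model. The sketch below is an assumption made for illustration and not necessarily the formulation used in this work: the function name, the target size and the field of view are hypothetical values, and only the 640-pixel image width is taken from the test video.

```python
import math


def target_size_px(target_size_m, distance_m, image_size_px, fov_deg):
    """Apparent size of a target in pixels under a pinhole camera model.

    The scene extent covered by the image at a given distance is
    2 * d * tan(FOV / 2); the target occupies the same fraction of the
    image as it does of that extent.
    """
    scene_extent_m = 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)
    return target_size_m * image_size_px / scene_extent_m


# Hypothetical example: a 0.25 m wide drone seen by a 640 px wide sensor
# with a 45 degree horizontal field of view.
for distance in (5, 10, 20, 40):
    width_px = target_size_px(0.25, distance, 640, 45.0)
    print(f"{distance:>3} m -> {width_px:.1f} px")
```

Under this model the apparent size is inversely proportional to the distance, so doubling the range halves the number of pixels available to the detector.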
As measurements are derived from a single full-pass evaluation over the 5,400-frame test sequence, run-to-run variance is not separately quantified; this is acknowledged as a limitation of the current evaluation protocol.
Initial tests were conducted using a Raspberry Pi 4B as the main onboard SBC. However, due to its limited processing capabilities, the system achieved only 2–3 frames per second, highlighting the need for more powerful hardware in real-time applications. Therefore, other boards and a whole GCS have been tested, in order to have a complete setup for comparisons.
A critical aspect of the pipeline concerns frame alignment. In fact, our camera integrates two separate optical channels (an RGB sensor and an infrared sensor) which are physically displaced and operate at different zoom levels. As shown in Figure 2, the lenses are not co-aligned, requiring real-time geometric alignment of the RGB and IR frames. This step introduces a significant computational overhead, further reducing the effective throughput of the system.
Another important consideration is the additional power demand introduced by the detection pipeline. The Python script must simultaneously manage the real-time video stream provided by the UAV SDK, perform frame alignment between RGB and IR channels and execute object detection across both spectra. This workload results in a measurable increase in current consumption. For example, while idle operation requires approximately 0.59 A with the Raspberry Pi 4B, the full detection pipeline increases consumption to 1.09 A, confirming the impact of onboard processing on energy efficiency.
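To put these currents in perspective, they can be converted into power figures. The short calculation below assumes a nominal 5 V supply for the Raspberry Pi 4B; the actual voltage at the measurement point is not stated here, so the wattages are indicative only.

```python
SUPPLY_VOLTAGE_V = 5.0  # assumption: nominal Raspberry Pi 4B supply voltage


def power_w(current_a, voltage_v=SUPPLY_VOLTAGE_V):
    """Electrical power drawn at a given current and supply voltage."""
    return current_a * voltage_v


idle_current_a = 0.59      # idle operation (reported above)
pipeline_current_a = 1.09  # full detection pipeline (reported above)

idle_w = power_w(idle_current_a)
pipeline_w = power_w(pipeline_current_a)
relative_increase = (pipeline_current_a - idle_current_a) / idle_current_a

print(f"idle:     {idle_w:.2f} W")
print(f"pipeline: {pipeline_w:.2f} W")
print(f"increase: {relative_increase:.0%}")
```

Under this assumption the pipeline adds roughly 2.5 W on top of an idle draw of about 3 W, i.e. an increase of around 85% in current.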
Differences in performance and power consumption across the tested systems are also linked to the underlying hardware architectures. The Intel NUC 7 and the workstation employed in our tests are x64 machines, an architecture following the CISC (Complex Instruction Set Computing) paradigm, which in these desktop-class designs generally comes with higher power consumption. On the other hand, the Raspberry Pi boards and the NVIDIA Jetson Orin Nano are based on ARM, a RISC (Reduced Instruction Set Computing) architecture whose designs typically target low-power operation. This is consistent with the lower consumption measured on them. As seen in our tests, the Intel NUC 7 performs slightly better than the Raspberry Pi 5 at the cost of higher consumption, reaching over 38 W compared to approximately 15 W for the Raspberry Pi 5.
Computational performance
The mean inference speed in frames per second and the mean inference time per frame were measured, while monitoring each board's power consumption. All results are summarized in Table 1, which also reports the corresponding detection accuracy on both RGB and IR data in terms of mAP@50 and mAP@50–95.
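The relation between the two timing metrics can be sketched as follows. This is a minimal illustration and not the benchmarking script used in this work: the function names are hypothetical, and the throughput is derived here from the mean per-frame time, which is only one of the possible averaging conventions.

```python
from statistics import mean
from time import perf_counter


def time_inference(infer, frames):
    """Run a callable on every frame and return the per-frame durations in seconds."""
    durations = []
    for frame in frames:
        start = perf_counter()
        infer(frame)
        durations.append(perf_counter() - start)
    return durations


def summarize(durations_s):
    """Return (mean inference time in ms, equivalent throughput in fps)."""
    mean_ms = 1000.0 * mean(durations_s)
    return mean_ms, 1000.0 / mean_ms


# Example with synthetic durations of 12.36 ms per frame.
mean_ms, fps = summarize([0.01236] * 5400)
print(f"{mean_ms:.2f} ms per frame -> {fps:.2f} fps")
```

With this convention a mean inference time of 12.36 ms corresponds to about 80.9 fps, which is consistent with the best figures reported for the Jetson Orin Nano.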
Mean inference speeds, per-frame times, power consumption and detection accuracy for YOLO models across different hardware platforms.
CUDA execution offers higher performance, since Ultralytics can move the model to the GPU and run operations directly on it when one is available. The last two columns report, for each model, mAP@50 and mAP@50–95 on the RGB and IR datasets. All tested TensorRT models have been quantized to half-precision floating point (FP16). Power consumption has been measured using software tools integrated in the operating systems, such as tegrastats on the Jetson Orin Nano.
The disparity between Raspberry Pi and Jetson Orin Nano performances is quite evident. The latter is able to offer much higher throughput in terms of processed frames per second, while maintaining a compact form factor suitable for onboard deployment. However, Raspberry Pi 5 results are still promising, since the pretrained models have been exported in ONNX format, which enables substantially better performances than standard PyTorch execution on resource-constrained platforms. Moreover, there are still clear differences when comparing the Jetson Orin Nano board with an X64 machine with a dedicated CUDA GPU. However, this study is specifically related to onboard architecture development, which necessarily implies compromises with respect to a ground-based machine.
A key factor behind the large FPS increase observed for the TensorRT models is the adoption of half-precision floating point (FP16) inference. FP16 reduces the amount of data moved between memory and compute units, lowers bandwidth pressure and allows the NVIDIA GPU to exploit hardware-optimized execution paths, including Tensor Cores on supported devices. As a consequence, more operations can be processed in parallel and each frame takes less time to traverse the pipeline. In our experiments, this effect is especially evident on the Jetson Orin Nano, where the TensorRT FP16 versions consistently outperform the corresponding CUDA models while preserving nearly unchanged detection accuracy on both RGB and IR benchmarks.
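The data-volume argument can be made concrete with a small calculation. The sketch below uses only the Python standard library to compare the storage required by one input tensor in FP32 and FP16, and to show the rounding introduced by a half-precision round trip. The 1×3×640×640 input shape is an assumption chosen for illustration, not a value taken from our configuration, and the helper names are hypothetical.

```python
import struct


def tensor_bytes(shape, fmt):
    """Bytes needed to store a dense tensor; fmt is a struct format code."""
    count = 1
    for dim in shape:
        count *= dim
    return count * struct.calcsize(fmt)


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Assumed input: one 3-channel 640x640 frame (illustrative shape only).
input_shape = (1, 3, 640, 640)
fp32_bytes = tensor_bytes(input_shape, "<f")  # 4 bytes per element
fp16_bytes = tensor_bytes(input_shape, "<e")  # 2 bytes per element

# Rounding error introduced by storing 0.1 in half precision.
rounding_error = abs(to_fp16(0.1) - 0.1)
```

Halving the bytes per element halves the traffic between memory and compute units for the same tensor, while the rounding error stays small, which is consistent with the nearly unchanged accuracy reported above.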
Despite the evident differences across all the computer modules we tested, the detection accuracy of the models remained substantially consistent across the considered deployment scenarios. In particular, the
Table 2 summarizes the comparison of our proposal with the related work cited in Section 2. Notice that the IR channel performance is the one driving the detection, because the system fuses both modalities and IR provides the primary detection signal in the evaluated scenarios.
Comparison of the proposed system with state-of-the-art onboard UAV detection approaches.
N/A indicates that the metric was not reported in the original work.
The results in Tables 1 and 2 allow a practitioner to navigate the configuration space across four key dimensions: inference speed, latency, power consumption and detection accuracy. For energy-constrained deployments, the Jetson Orin Nano with YOLOv8n TensorRT offers the best speed–power trade-off (80.93 fps, 5.1 W, mAP@0.5 of 98.2% on IR). When detection reliability is the primary constraint, YOLOv11s TensorRT provides marginally higher accuracy at a moderate cost in throughput. A full Pareto-front analysis across these dimensions remains an interesting direction for future work.
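As a pointer to what such a Pareto-front analysis could look like, the following minimal sketch filters a list of configurations down to the non-dominated ones over three of the dimensions (speed, power and accuracy; latency is omitted here because it is tightly coupled to speed). Only the first entry uses figures quoted in the text. The other entries are hypothetical placeholders, not measured results, and all names are illustrative assumptions.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    fps: float      # higher is better
    power_w: float  # lower is better
    map50: float    # higher is better


def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    no_worse = a.fps >= b.fps and a.power_w <= b.power_w and a.map50 >= b.map50
    better = a.fps > b.fps or a.power_w < b.power_w or a.map50 > b.map50
    return no_worse and better


def pareto_front(configs):
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


configs = [
    # Figures quoted in the text for YOLOv8n TensorRT on the Jetson Orin Nano.
    Config("orin-nano-yolov8n-trt", 80.93, 5.1, 98.2),
    # Hypothetical placeholders, NOT measured values.
    Config("hypothetical-a", 60.0, 6.0, 98.5),
    Config("hypothetical-b", 10.0, 15.0, 97.0),
    Config("hypothetical-c", 40.0, 38.0, 98.0),
]

front_names = [c.name for c in pareto_front(configs)]
```

With these placeholder values, the front keeps the first two entries: one is faster and draws less power, the other is slightly more accurate, so neither dominates the other.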
Regarding the reproducibility of the experiments, Appendices A and B report all the details of the hardware components and configuration settings. The models and all the scripts for the video pipeline and camera calibration will be provided on request, and the dataset is publicly available. 32
In conclusion, several tests were conducted under a range of outdoor conditions including different times of day, background types (trees, ground, sky), and zoom levels (1

Comparison of detections with different zoom levels and backgrounds, in both RGB and IR spectra.
Abstracting from the experience gained in implementing and experimenting with our onboard architecture, we propose some general design principles that highlight good practices and design choices beyond the specific implementation described.
Principle 1—Compute locality should match mission criticality
Tasks requiring immediate response under contested or degraded communication conditions should be executed at the sensing edge, minimizing dependency on remote infrastructure.
Principle 2—SWaP-constrained optimization is a system-level problem
Hardware selection should not be based solely on inference speed, but on joint optimization across computational throughput, power draw, payload mass, and mission endurance.
Principle 3—Model optimization can outweigh hardware scaling
Software-level acceleration strategies (e.g., TensorRT conversion, FP16 quantization) can yield larger operational gains than upgrading to higher-power platforms.
Principle 4—Multimodal sensing improves robustness only if alignment overhead is controlled
The benefit of RGB–IR fusion depends on efficient cross-modal registration; otherwise, preprocessing costs may offset detection gains.
Principle 5—Architectural resilience requires communication independence
In adversarial or infrastructure-limited environments, operational continuity depends more on autonomy than on peak computational performance.
Conclusions
The proposed system architecture has been successfully implemented and validated, demonstrating promising performance in real-world counter-UAV scenarios. The overall detection pipeline, sensor fusion framework and onboard decision-making logic operate reliably under operational conditions, confirming the feasibility of a fully autonomous, vision-based interception system deployed on an aerial platform.
Nevertheless, real-time computational performance was constrained by some of the SBCs during the tests. Although the Raspberry Pi 4B and Pi 5 are capable single-board computers for many embedded applications, their processing capabilities proved insufficient for sustaining high-frame-rate multi-object tracking and deep-learning inference. Indeed, 15 fps is the bare minimum for less critical monitoring, while the ideal range is 25–30 fps. For this reason, the Raspberry Pi boards turned out to be inadequate for the most demanding operational configurations. The NVIDIA Jetson Orin Nano, on the other hand, is the most capable of the boards analyzed and tested: it achieved promising performance, with results comparable to those of our tested ground-based solution. In addition, thanks to the model optimization carried out with the TensorRT libraries, it was possible to exceed current state-of-the-art performance, as seen in the results presented.
Overall, with a distributed detection architecture, radio traffic can be reduced to the minimum needed for telemetry and command communication, since video streams are not necessary. Even in heavily jammed environments, drones can continue their pre-programmed missions, logging detection events on board. Such events can then be downloaded to the GCS when communication is restored, e.g., when flying back towards the GCS. All the SBCs used weigh less than 200 g, and with an additional 4S LiPo battery (14.8 V, 4000–5000 mAh) a Jetson Orin Nano can operate for one hour, beyond the average flight time of most drones. Hence, the impact on payload and battery is limited.
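The battery figures above can be checked with simple arithmetic: the energy stored in a pack is its nominal voltage multiplied by its capacity, and dividing it by the target runtime gives the average power that can be sustained. The sketch below is an estimate under stated assumptions. In particular, the 80% usable fraction (covering depth-of-discharge limits and conversion losses) is an assumed value and not a measured one, and the function names are illustrative.

```python
def battery_energy_wh(voltage_v, capacity_mah):
    """Nominal energy of a battery pack in watt-hours."""
    return voltage_v * capacity_mah / 1000.0


def max_average_power_w(energy_wh, runtime_h, usable_fraction=1.0):
    """Average power draw sustainable for runtime_h hours."""
    return energy_wh * usable_fraction / runtime_h


# 4S LiPo at 14.8 V nominal, 4000-5000 mAh, as stated in the text.
energy_low_wh = battery_energy_wh(14.8, 4000)
energy_high_wh = battery_energy_wh(14.8, 5000)

# Assumption: only 80% of the nominal energy is usable in practice.
one_hour_budget_w = max_average_power_w(energy_low_wh, 1.0, usable_fraction=0.8)
```

Under these assumptions, the smaller pack stores about 59 Wh and supports an average draw of roughly 47 W for one hour, well above the 5.1 W measured for YOLOv8n TensorRT inference on the Jetson Orin Nano, which leaves margin for the camera interface and other peripherals.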
Therefore, future development will focus on further optimizing the entire software stack on this modern, high-performance edge-AI platform. For instance, a useful addition to the onboard detection software could be the capability of switching between different models, depending on the available computing resources. For example, a small UAV with a Raspberry Pi 5 could use a smaller, simpler model to monitor the external perimeter of a sensitive area, while more powerful drones could actively patrol the inner, more critical parts with more precise models. Moreover, implementing a policy to identify false positives and false negatives, giving feedback to the base system, could also be a worthwhile future development.
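One possible form of such a model-switching policy is sketched below: among the models profiled on a given board, it picks the most accurate one whose measured speed meets a minimum frame rate, using the 15 fps lower bound mentioned above as the default. This is a hypothetical sketch of a future feature, not part of the implemented system, and the profile entries are placeholder values rather than measurements.

```python
def select_model(profiles, min_fps=15.0):
    """Return the most accurate profile that meets min_fps, or None."""
    feasible = [p for p in profiles if p["fps"] >= min_fps]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties in favor of higher speed.
    return max(feasible, key=lambda p: (p["map50"], p["fps"]))


# Hypothetical profiles for one board (placeholder values, not measured).
board_profiles = [
    {"name": "small-model", "fps": 18.0, "map50": 95.0},
    {"name": "larger-model", "fps": 9.0, "map50": 97.0},
]

perimeter_choice = select_model(board_profiles)                # default 15 fps
relaxed_choice = select_model(board_profiles, min_fps=5.0)     # looser bound
strict_choice = select_model(board_profiles, min_fps=30.0)     # no model fits
```

Returning None when no model meets the threshold makes the failure explicit, so the caller can decide whether to lower the requirement or assign the task to a more capable platform.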
Finally, a more statistically grounded validation with repeated runs and variability analysis will be necessary to strengthen the robustness of the reported results.
Footnotes
Acknowledgements
This work was partially supported by the FSE fund, by the Departmental Strategic Plan (PSD) of the University of Udine – Interdepartmental Project on Artificial Intelligence (2020–25) and by the Italian Minister of Defence PNRM project “ARGOS” (2023–2025).
Ethical considerations
All the research meets the ethical guidelines and legal requirements specified by the Integrated Computer-Aided Engineering journal.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Author contributions
All authors have had an active part in the study and the manuscript preparation. All authors have approved the manuscript, and agree with its submission to the Integrated Computer-Aided Engineering journal.
Funding
This work was partially supported by the FSE fund, by the Departmental Strategic Plan (PSD) of the University of Udine – Interdepartmental Project on Artificial Intelligence (2020–25) and by the Italian Minister of Defence PNRM project “ARGOS” (2023–2025).
Declaration of conflicting interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Hardware
The following tables summarize the most widely used UAV sensors together with their features and capabilities. A further table summarizes the hardware platforms adopted in our tests.
Configuration details
Our proposed architecture was developed on the DJI Matrice 210 RTK v1.4 UAV platform together with a Zenmuse XT2 RGB-IR camera, whose specifications are listed in the following table.
Pixel resolution
The following table contains all the values obtained from the pixel resolution calculation steps.
Kalman filters
The Kalman filter predicts the new state of the autonomous UAV system by applying the following steps and equations:
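For reference, the standard textbook form of the discrete-time linear Kalman filter is recalled below. The notation is an assumption introduced here and may differ from the one adopted in our implementation: \hat{x} is the state estimate, P its covariance, F the state-transition matrix, B and u the control matrix and input, Q and R the process and measurement noise covariances, H the observation matrix, z the measurement and K the Kalman gain.

```latex
% Prediction step
\hat{x}_{k|k-1} = F_k \, \hat{x}_{k-1|k-1} + B_k \, u_k
P_{k|k-1} = F_k \, P_{k-1|k-1} \, F_k^{\top} + Q_k

% Update step
y_k = z_k - H_k \, \hat{x}_{k|k-1}
S_k = H_k \, P_{k|k-1} \, H_k^{\top} + R_k
K_k = P_{k|k-1} \, H_k^{\top} \, S_k^{-1}
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \, y_k
P_{k|k} = (I - K_k H_k) \, P_{k|k-1}
```

The prediction step propagates the previous estimate through the motion model, while the update step corrects it with the innovation y_k, weighted by the gain K_k.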
Appendix E. Validation metrics
The following table summarizes the validation metrics of the quantized UAV detection models that have been tested on our computer modules.
