Abstract
Accurate monitoring of crop phenotypic traits is essential for efficient farm management and automation in agriculture. Multi-object tracking (MOT) and video instance segmentation (VIS) offer promising approaches to enhance agricultural robotic-vision systems, yet a major limitation is the scarcity of high-quality spatial-temporal datasets. In this paper, we introduce BUP-ST20, a novel weakly labelled spatial-temporal dataset for sweet pepper tracking and segmentation captured on a robotic platform. Our dataset is generated by leveraging still image annotations and utilizing a neural radiance field approach (PAg-NeRF) to automatically obtain consistent object semantics and identities across video sequences. BUP-ST20 contains 16,240 images from 275 sequences, with weak labels for training and validation, and human-annotated ground truth for evaluation. We describe how this pseudo-labelling approach can be adapted to any robotic platform with the required inputs, greatly reducing the annotation requirements for dataset creation, with a focus on agriculture and horticulture. Utilizing BUP-ST20, we evaluate state-of-the-art MOT approaches and propose two novel tracklet matching criteria, enhancing robustness in frame-skipped scenarios and with low frame rate cameras. When we decrease the frame rate to approximately 1 frame per second, our offline MOT-based matching criterion improves performance by an absolute value of 19.63, supporting its validity as a tracklet aggregation technique in this scenario. Our experiments demonstrate the effectiveness of the dataset in benchmarking MOT and VIS techniques within the agricultural domain. This also allows us to highlight challenges such as occlusion, shape variations, and weak-labelling limitations. BUP-ST20 serves as a valuable resource for further advancements in robotic crop monitoring and agricultural automation, while demonstrating the ability to create future weakly labelled datasets using robotic platforms.
Keywords
Introduction
Agriculture, a cornerstone of human civilization, has continually evolved through technological advancements, with a specific interest in quality and efficiency. Even with these technological advances, it is still common for farmers to visually inspect their crops to inform key decisions, which is a key expenditure in most traditional farms (ABARES, 2018). Generally, farmers undertake manual labour (including the use of large machinery) to validate and recognize the phenotypic properties, pests, or diseases attributed to their crops (LWK-Rheinland, 2020). This is a labour-intensive task and is heavily reliant on a farmer’s experience. In recent years, to reduce this reliance researchers have developed robotic vision-based techniques for seeding (Azmi et al., 2021), monitoring (Smitt et al., 2021), weed intervention (Ahmadi et al., 2022), and harvesting (Lehnert et al., 2020).
A key concept in farm management is accurate monitoring of crops to assess phenotypical properties. Automating this process via robotics requires accurate robotic-vision techniques that can efficiently and accurately estimate crop traits. If the goal is to ascertain crop health, monitoring techniques need to be able to detect pests or diseases (Cubero et al., 2020). To determine harvest timing, crop quality and ripeness must be accurately assessed (Halstead et al., 2018). Techniques have also been proposed for automated weed detection and classification via robotic platforms (Ahmadi et al., 2022, 2024), allowing for minimal herbicide use while enhancing sustainability. Typically, these robotic-vision techniques rely on single images to perform their task (pest classification, ripeness, and weed identification). To advance robotics in the agricultural field more detailed information is required, such as multiple views of the same object (Smitt et al., 2023). Multi-object tracking (MOT) is one method to aggregate multiple instances (samples) of the same object taken from different views.
MOT approaches can provide fine-grained detail about the state of a field which could be exploited by a farmer for precise crop management. An example of this is recent work by Pan et al. (2023), which combines MOT with mapping and 3D reconstruction techniques. This rich information can be pivotal for on-farm decision making, from yield estimation to crop health. Such approaches are in stark contrast to only performing yield estimation, which produces much coarser spatial information where only the predicted amount or total weight of crop to be harvested is estimated (Nuske et al., 2011). In addition to MOT, video instance segmentation (VIS) techniques play a crucial role in advancing robotic perception in agriculture. VIS techniques provide precise and temporally consistent segmentation masks, enabling a deeper understanding of crop structure and phenotypic variations. By leveraging both MOT and VIS approaches, robotic-vision systems can generate comprehensive spatial-temporal insights for enhanced crop monitoring and management. However, a stymying factor for broad deployment of these approaches is access to labelled spatial-temporal data to train the requisite machine learning models.
In this paper, we apply a neural radiance field (NeRF) approach to obtain weakly labelled data from real-world agricultural data, consisting of sweet pepper, acquired by a robotic system (PATHoBot; Smitt et al., 2021). For this work, we take the existing BUP20 data from Smitt et al. (2021) and train a still image detector to label video frames whose panoptic identities are then automatically obtained using PAg-NeRF (Smitt et al., 2023). An overview of this process is given in Figure 1. We also show how this approach can be generalized to any robotic platform that outputs the required information for weak-label-based dataset creation. While we use PAg-NeRF to generate the training and validation subsets, our evaluation set starts from the same pseudo-labels and is manually corrected by a human in the loop. Creating the evaluation set in this manner considerably decreases the time requirements for manual labelling, because the pseudo-labels are accurate and the human in the loop is only required to fix erroneous segmentation masks, missing objects, and incorrect temporal IDs. This leads to weakly labelled training and validation data while the evaluation data is fully labelled. We refer to this resultant spatial-temporal dataset as BUP-ST20. Using this new dataset, we evaluate several state-of-the-art MOT techniques, considering both online and offline methods. Further, we propose two new approaches to matching tracklets throughout the sequences. We extensively evaluate these two techniques (including how to obtain the elliptical shape used) and outline where they are appropriate considering two scenarios: random frame skipping and low frame rates. Both of these situations occur frequently in deployed robotic systems.
Our new matching approaches are based on the movement of the object throughout the scene, captured either using the variance inherent in a Kalman filter (Kalman, 1960) or by the average moving distance based on the dynamic radius approach of Halstead et al. (2021). This leads to the following contributions. We: (1) provide a new weakly labelled spatial-temporal dataset, BUP-ST20, for tracking and video instance segmentation in the agricultural domain; (2) show how our approach can be extended to any robotic platform with the required outputs; (3) perform extensive evaluations of state-of-the-art tracking approaches on the dataset; and (4) propose two new tracklet matching criteria to improve MOT when encountering random frame skips and low frame rates.

Figure 1. BUP-ST20 pipeline. For each sequence, we first obtain instance-based semantic segmentation masks using Mask2Former. These segmentation masks, along with RGB images and odometry data, are fed to PAg-NeRF, which generates a 3D representation of the scene and produces panoptic segmentation masks, ensuring temporal consistency across the sequence. For simplicity, odometry data as an input and the 3D depth representation as an output are not shown in the figure.

The newly proposed dataset, BUP-ST20, presents new challenges to the research community, such as facilitating the development of novel tracking or spatial-temporal panoptic segmentation approaches which exploit weak labels for training. Furthermore, this paper outlines how weak labels can be obtained in agriculture, and similar domains, by exploiting methods such as PAg-NeRF, allowing for future large-scale pseudo-label based datasets to be created from robotic platforms for agriculture and horticulture.
This paper is organized as follows: first, we review MOT as a technique, outline various MOT datasets, and finally we describe object detection and segmentation. Next, we describe the BUP-ST20 dataset, how it was captured and organized, and how the weak labels were produced. We then introduce our two novel tracklet matching criteria. Our results section displays state-of-the-art tracking approaches on the BUP-ST20 dataset and our ablation studies explore the potential of this new dataset. Finally, we conclude our work and outline the strengths and limitations of both our novel approaches and the dataset itself.
Prior work
In this section we review state-of-the-art MOT approaches and their limitations, explain why our novel dataset BUP-ST20 is required for the agricultural domain, and finish with a review of object detection and segmentation techniques relevant to this paper.
Multi-object tracking
Multi-object tracking (MOT) is a fundamental task in computer vision and artificial intelligence. It has gained particular interest in the agricultural domain due to its ability to supply farmers with important information about their crop (Halstead et al., 2021; Smitt et al., 2023), particularly the yield and quality of the crop. While agriculture is one domain that uses MOT, its use is widespread throughout the computer vision community. This includes common tasks such as autonomous driving (Luo et al., 2021), surveillance (Denman et al., 2015), sporting analytics (Cui et al., 2023), and robotics (Ahmadi et al., 2022).
The primary challenge in MOT is accurate association of detected objects across sequential frames, whether from a moving camera or from time-lapse footage. The key to this challenge is the ability to recognize and track objects in spite of varying appearances, occlusions, and distance between images (such as from dropped frames) (Zhang et al., 2021).
In this domain there are generally two forms of MOT: online (Bewley et al., 2016; Du et al., 2023; Zhang et al., 2022) and offline (Halstead et al., 2021; Smitt et al., 2023). Online approaches only consider the current and previous frames, while offline approaches consider the full (or partially full) set of frames at once, meaning they can use past, present, and future frames to associate and track objects.
Within these two distinct frameworks exist multiple MOT approaches. A simple and common form of MOT is tracking-via-detection or -segmentation (Du et al., 2023; Halstead et al., 2021; Zhang et al., 2022), which creates and propagates tracklets based on the overlap between detected bounding boxes or segmented regions. However, these approaches often fail when the robot moves considerably or there are large differences between frames, leading to poor tracking due to the lack of overlap between frames. Halstead et al. (2021) aimed to overcome some of this performance degradation by introducing a dynamic radius search region rather than directly using bounding box overlap. In their work they found that even with intermittent frame drops they were better able to track smaller objects in the scene compared to their contemporaries.
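To make the failure mode of overlap-based association concrete, the following is a minimal sketch of greedy IoU matching between existing tracks and new detections. It is an illustration under stated assumptions and not the implementation of any of the cited trackers: the function names `box_iou` and `associate`, the box format, and the threshold of 0.3 are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: Dict[int, Box], detections: List[Box],
              iou_threshold: float = 0.3) -> Dict[int, int]:
    """Greedily match track IDs to detection indices in order of descending IoU.

    The threshold is an illustrative value, not one taken from the paper.
    """
    pairs = sorted(
        ((box_iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for iou, tid, j in pairs:
        if iou < iou_threshold:
            break  # all remaining pairs overlap even less
        if tid in matches or j in used:
            continue  # each track and each detection is matched at most once
        matches[tid] = j
        used.add(j)
    return matches


tracks = {1: (10.0, 10.0, 50.0, 50.0)}
small_shift = [(15.0, 10.0, 55.0, 50.0)]   # object moved 5 px between frames
large_shift = [(60.0, 10.0, 100.0, 50.0)]  # object moved 50 px, e.g. after skipped frames

print(associate(tracks, small_shift))  # the track is continued
print(associate(tracks, large_shift))  # no overlap, so the track is lost
```

With the small shift the boxes still overlap and the track is continued; with the large shift the overlap is zero and no match is possible, which is the situation that search-region approaches such as the dynamic radius aim to address.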
Other approaches to MOT include incorporating re-identification (Zhang et al., 2021) in the pipeline. This method is able to aggregate the similarity of objects in the scene, which was shown to be beneficial in scenes where the appearance of objects does not change greatly throughout the sequence. Cheng et al. (2021) used spatial-temporal masked attention in their tracking approach, based on detections from Mask2Former (Cheng et al., 2022). This approach achieved promising results on two similar datasets. Heo et al. (2022) employ object token based instance association for MOT, which they call VITA. This is a three-step end-to-end MOT approach that detects and assigns tracklets. The method achieved state-of-the-art results at the time across a set of datasets. Finally, Ying et al. (2023) incorporate contrastive learning into their MOT approach. This aims to ensure that objects that look similar obtain a similar feature vector, while different objects obtain a vastly different representation through the contrastive loss. Their approach was able to accurately track different objects (pedestrians and animals) in complex scenes and achieve state-of-the-art tracking performance on three different datasets.
While all of these approaches generated state-of-the-art results at the time of publication, in most cases they have similar limitations where horticulture is concerned. Fruit in a glasshouse, like sweet pepper, can appear similar, meaning re-identification or contrastive losses do not have enough information to develop unique feature vectors. This results in different objects appearing similar in these approaches. Similarly, for tracking-via-detection and -segmentation approaches in particular, overlap between objects is limited if there are dropped frames or large robot movements between frames. This results in poor performance, although Halstead et al. (2021) did mitigate this limitation somewhat. In this paper, we introduce a new sweet pepper dataset and evaluate different online and offline MOT approaches to gauge how they deal with common agricultural difficulties.
Multi-object tracking datasets
There are a vast number of multi-object tracking datasets, from pedestrian tracking (Leal-Taixé et al., 2015) to vehicle tracking (Geiger et al., 2012). However, in the agriculture domain there are very few large tracking datasets, particularly with the additional information typically provided by a robotic platform (e.g. depth images). In general, existing agricultural datasets are small in number of sequences (Saraceni et al., 2024) or constrained in the data they capture (Kirk et al., 2021). With the BUP-ST20 dataset, we aim to provide a large, multifunctional dataset suitable for tracking and video instance segmentation. Alternatively, the weak labels can be used to train image-based segmentation models to enhance performance.
Due to the rapid growth of the self-driving car industry, tracking datasets are centred around autonomous navigation. For instance, KITTI (Geiger et al., 2012), MOT15 (Leal-Taixé et al., 2015), and MOT16 (Milan et al., 2016) contain image sequences featuring vehicles and pedestrians. These datasets utilize static and moving cameras for data collection and include 50, 22, and 14 sequences, respectively. Recently, large-scale datasets, such as nuScenes (Caesar et al., 2020) with 1000 sequences and the Waymo Open Dataset (Sun et al., 2020) with 1150 sequences, have emerged. They provide more comprehensive tracking data for various driving scenarios. All these datasets provide bounding box annotations and object labels per sequence.
To enable tracking-via-segmentation, also known as multi-object tracking and segmentation (MOTS), Voigtlaender et al. (2019) extend KITTI and MOT16 by providing temporally consistent instance segmentation masks. The DAVIS 2017 (Pont-Tuset et al., 2017) and YouTube-VOS (Xu et al., 2018) datasets contain 150 and 3252 sequences, respectively, providing semantic segmentation masks and object identities across frames. However, these datasets do not support tracking of individual instances. In contrast, YouTube-VIS datasets (Yang et al., 2019, 2021) offer instance segmentation masks with consistent object identities. They include 2883 and 3859 sequences covering 40 categories, although the categories do not include agricultural objects.
In the field of agriculture, object tracking is more challenging due to the homogeneity of the objects in the scene. While some existing MOT datasets contain a substantial number of images, they often consist of fewer, longer sequences, which may restrict the diversity of scenes and environmental conditions captured. Additionally, many provide only bounding boxes, limiting their applicability to tasks beyond traditional object tracking. In contrast, our dataset includes both instance segmentation masks and temporal identities, making it well-suited for more complex tasks such as video instance segmentation and tracking-by-segmentation. Saraceni et al. (2024) present a table vineyard dataset, containing 800 RGB images in total across four sequences. Kirk et al. (2021) introduce a strawberry dataset with four sequences, each containing 500 frames captured by an RGB-D camera. These MOT datasets provide bounding box annotations with consistent identities. LettuceMOTS (Hu et al., 2024) consists of 1308 RGB images across 12 sequences. Similarly, De Jong et al. (2022) present APPLE MOTS, which initially contained 1673 images and has recently been increased to 2198 images across 12 sequences. These offer fully annotated instance segmentation masks with unique IDs instead of bounding boxes. Our proposed dataset, BUP-ST20, contains 16,240 images across 275 sequences, each with bounding boxes, instance segmentation masks, and temporal identities. To the best of our knowledge, BUP-ST20 is the largest spatial-temporal dataset in agriculture and the first in horticulture. The dataset has weakly labelled training and validation sets, while the evaluation set includes 3810 frames with hand-labelled ground-truth annotations, making it larger than existing datasets of its kind.
Object detection and segmentation
Object detection (Carion et al., 2020; Ren et al., 2015) is a crucial task in computer vision. It aims to identify and locate all relevant objects in a scene and describe them via a bounded region. This is particularly important in agriculture as we are only concerned with the crop or fruit of interest and want to ignore background information. Early CNN-based object detection approaches such as Faster R-CNN (Ren et al., 2015) and YOLO (Redmon et al., 2016) revolutionized performance. More recently, we have seen a shift to transformer-based architectures such as DETR (Carion et al., 2020), which has once again improved detection accuracy.
Object segmentation expands beyond simple detection to locate and classify an object on a per-pixel basis. Object segmentation can be further partitioned into semantic segmentation (Ronneberger et al., 2015), instance-based semantic segmentation (He et al., 2017), and panoptic segmentation (Kirillov et al., 2019b). Semantic segmentation assigns class labels to pixels. This allows each class to be represented by a single image mask; however, it does not distinguish between unique instances. Semantic segmentation methods have recently been superseded by both instance-based semantic segmentation and panoptic segmentation due to their ability to relay more useful information, particularly in agriculture.
Instance-based semantic segmentation expands the pixel labels such that each unique object (instance) is assigned an individual ID (dog 1 is assigned label 1, dog 2 is assigned label 2). A pioneering approach is Mask R-CNN (He et al., 2017), which builds upon Faster R-CNN by incorporating an additional branch to predict instance segmentation masks. Alternative methods, such as SOLO (Wang et al., 2020a), segment objects based on spatial position, dividing the image into grids and assigning masks accordingly. These advancements make instance segmentation more applicable to real-world scenarios, particularly in agriculture, where distinguishing between individual fruits or plants is crucial for tasks such as yield estimation and disease detection.
Panoptic segmentation combines semantic and instance-based segmentation, so that both an instance ID and a semantic map are available in the same image. Generally, objects considered as ‘things’ (fruit, vehicles, or people) are given a unique instance ID, while ‘stuff’ (soil, sky, or background vegetation) is only given a semantic label. This dual representation enables more comprehensive scene understanding. One of the earliest approaches is Panoptic FPN by Kirillov et al. (2019a), which incorporates a semantic segmentation branch in Mask R-CNN, allowing separate instance and semantic segmentation tasks to be merged. Similarly, Cheng et al. (2020) extend DeepLab (Chen et al., 2018) with an instance segmentation branch. It uses object centre estimates and clustering to segment instances before merging semantic and instance predictions. UPSNet (Xiong et al., 2019) employs a parameter-free panoptic head at the end of the network to improve results. Beyond CNN-based models, vision transformers have recently received wide use. MaX-DeepLab (Wang et al., 2021) integrates CNN and transformer architectures, extending Axial-DeepLab (Wang et al., 2020b) to predict instance masks directly. More recently, Cheng et al. (2022) introduced a transformer-based universal segmentation model, Mask2Former. Its key innovation is the use of masked attention in the transformer decoder to extract localized features. Notably, Mask2Former can also be applied to VIS without requiring architectural modifications (Cheng et al., 2021).

VIS extends segmentation to the temporal domain, enabling the consistent tracking and segmentation of objects across frames. MaskTrack R-CNN (Yang et al., 2019) was the first VIS model, building upon Mask R-CNN by introducing a tracking branch to associate instances across a sequence. However, early approaches struggled to consistently maintain identity over long sequences.
Transformer models that leverage self-attention mechanisms have been proposed to improve temporal consistency. MinVIS (Huang et al., 2022) utilizes an image-based transformer detector with bipartite matching to enhance object association. In contrast, VITA (Heo et al., 2022), introduced in the MOT section, generates object tokens from spatial features and implicitly incorporates temporal information to effectively capture motion dynamics. More recent VIS models such as IDOL (Wu et al., 2022) have applied contrastive learning to enhance object representations and refine tracking accuracy. Expanding on this idea, CTVIS (Ying et al., 2023), which can be used as a MOT approach or for VIS and was introduced in the MOT section, employs contrastive learning and further optimizes object tracking by integrating a memory bank during training, improving object association and overall segmentation performance. To train these VIS models (excluding MinVIS) for agricultural scenarios, we require ID-consistent spatial-temporal datasets, such as BUP-ST20.
As our work creates pseudo-labels for panoptic segmentation, we also investigate neural radiance fields (NeRF). NeRF (Mildenhall et al., 2021) is an MLP-based neural network that takes 3D positions and associated viewing directions as input and outputs the colour and density at the corresponding points in the 3D scene. By leveraging NeRF, we can generate a more coherent segmentation across frames, ensuring that objects retain their identities throughout a scene. Prior works, such as Semantic-NeRF (Zhi et al., 2021), extend NeRF by adding semantic labels into the network, yielding a semantic 3D view of the scene. Further advancing this approach, Kundu et al. (2022) and Fu et al. (2022) integrate semantic and instance-level information to obtain a panoptic scene representation. More recent approaches have improved NeRF’s efficiency by using feature grids as input. For instance, PAg-NeRF by Smitt et al. (2023) employs a permutohedral grid to encode poses and angles, and generates a 3D colour, depth, and panoptic view of the scene. In our study, we utilize PAg-NeRF to create ID-consistent panoptic predictions, ensuring robust performance in challenging agricultural environments.
Materials
The University of Bonn sweet pepper dataset (BUP20) was captured in 2020 at the simulated commercial glasshouse at Campus Klein Altendorf (CKA). It contains two sweet pepper cultivars: Mareva, which ripens from green to yellow, and allrounder, which ripens from green to red. As a result, BUP20 and, by extension, BUP-ST20 present a more complex task than a standard commercial glasshouse, where a cell usually contains only one cultivar. Similar to most commercial glasshouses, CKA is fitted with a shade sail that can be opened and closed manually or automatically. This ensures that the sun/shading is consistent throughout the day, creating constant lighting throughout the entire cell. One complicating factor for glasshouses is the morning and afternoon sun, which comes in at an angle to the crop. BUP20 accounts for this, as data was captured at varying times of day, making the dataset robust to this challenge. Data was captured by the automated monitoring robot PATHoBot (Smitt et al., 2021) (see Figure 2). PATHoBot is equipped with three Intel RealSense D435i cameras capable of capturing RGB-D images. The glasshouse environment consists of six crop-rows and each row is 34 m long.

Figure 2. The top image shows PATHoBot working at CKA in a sweet pepper cell; the bottom image outlines the design specifications of the robotic platform, including the assumed constant distance from the radiators.
PATHoBot is a piperail-based robotic platform designed for monitoring and intervention in a commercial glasshouse. It is based on a scissor lift platform where the floor can have various payloads mounted for intervention, with the option of raising and lowering the platform depending on the location of the fruit. The three Intel RealSense D435i cameras are mounted 1.35 m away from the heat rails, and due to the parallel movement of the robot (on the piperails) to the crop this distance is assumed to be constant (as with all commercial glasshouses). This ensures the maximum distance a fruit in the current row can be located away from the cameras is approximately 1.2 m; this was shown to be the case in Smitt et al. (2021) and Halstead et al. (2021). The layout specifications of PATHoBot can be seen in Figure 2 in the bottom image, and the top image shows it working in the actual glasshouse at CKA. In addition to the information gathered by the cameras, PATHoBot also outputs robot odometry based on encoders located on the piperail wheels. All of this information, RGB-D, IMU, camera intrinsics and extrinsics, and odometry, is essential for the PAg-NeRF system to output the pseudo-labels.
The BUP20 dataset contains fully annotated still images, along with depth and odometry data for training, validation, and evaluation of sweet pepper detection and segmentation. The dataset comprises a total of 280 images across all three subsets. While these still images are sufficient for the original detection and segmentation task, one of the goals of this paper is to introduce a dataset that contains sequences of horticultural images for MOT and VIS. As such, we leverage the additional images captured by PATHoBot to create sequences for MOT and VIS purposes. We also describe how this process can be adapted to other robotic platforms to create large-scale weak-labelled datasets for agriculture and horticulture.
In total we have approximately 18k RGB-D images captured in 10 videos over multiple days using the six different rows at CKA in the sweet pepper chamber. We split these videos into 275 non-overlapping sequences (16,240 images), taking into account that this data overlaps with the BUP20 still image annotations. Data is split such that sequences in the training and validation sets include at most two annotated images in a sequence, and in 55 sequences there are no annotated images. Sequences in the evaluation set can only contain one labelled BUP20 frame, positioned in the middle of the sequence. This ensures that the evaluation set is not compromised by including labelled training or validation information. Sequences that contain superfluous background noise, such as people in the scene, are removed to ensure that individual data privacy is respected. The training set contains 127 sequences with 7914 images, and has the following sequence lengths: one sequence each of 32, 39, 45, 53, and 59 frames, and 122 sequences of 63 frames. The validation set contains 72 sequences and 4516 images, with one sequence of 46 frames, one sequence of 60 frames, and 70 sequences of 63 frames. The evaluation set contains 76 sequences whose lengths range from 29 to 77 images, comprising 3810 images. The overall distribution for the evaluation set can be seen in Figure 3, where we see a greater range of sequence lengths compared to the other two subsets.

Figure 3. The count of each sequence length in the evaluation set. The x-axis relates to the index of the specific sequence in the BUP-ST20 dataset and the y-axis represents the count of frames in the current sequence.
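The split sizes quoted above can be cross-checked with a few lines of arithmetic. The sketch below only uses the sequence lengths and totals stated in the text; the variable and function names are hypothetical, and the evaluation set is entered as totals because its per-sequence lengths are only given in the figure.

```python
# Sequence lengths as stated in the text: {frames_per_sequence: number_of_sequences}
train_lengths = {32: 1, 39: 1, 45: 1, 53: 1, 59: 1, 63: 122}
val_lengths = {46: 1, 60: 1, 63: 70}


def tally(lengths):
    """Return (number of sequences, number of images) for one subset."""
    n_sequences = sum(lengths.values())
    n_images = sum(length * count for length, count in lengths.items())
    return n_sequences, n_images


train = tally(train_lengths)
val = tally(val_lengths)
evaluation = (76, 3810)  # totals from the text; individual lengths are not listed

total_sequences = train[0] + val[0] + evaluation[0]
total_images = train[1] + val[1] + evaluation[1]

print("train:", train)
print("validation:", val)
print("total:", (total_sequences, total_images))
```

The per-subset tallies reproduce the stated 127 and 72 sequences with 7914 and 4516 images, and the three subsets sum to the 275 sequences and 16,240 images reported for BUP-ST20.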
To create a spatial-temporal dataset for tracking and video-based segmentation, we need to create annotated data for each sequence. For the training and validation set we use pseudo-labels to expedite the process. For the evaluation set we initialize the annotations with pseudo-labels and then perform manual re-annotation only where necessary (i.e., when the sequence identity is wrong, when there are missing segmentation masks or gross mistakes for the segmentation masks). Generating pseudo-labels consists of two main steps: (1) obtaining instance-based semantic segmentation results in all of the images and (2) then post-processing this information to obtain geometrically consistent instances along with the sequence consistent identity. Algorithm 1 outlines the pseudo-code to create the pseudo-labels for all three subsets.
To obtain the initial instance-based segmentation results, we employ Mask2Former. To train a Mask2Former model we leverage the BUP20 dataset and use the pre-trained COCO (Lin et al., 2014) weights. We then fine-tune the model for 1k iterations with a learning rate of 0.0001, utilizing 4 GPUs and a total batch size of 16. For all other training parameters, we retain the default settings provided in the official Mask2Former implementation. We also do not apply any dataset-specific normalization (e.g. pixel-wise mean and standard deviation), in order to avoid overfitting to the specific distribution of the sweet pepper images (red, green, and yellow) and to promote generalization. This fine-tuned model is then used to predict instance-based segmentation masks for each sweet pepper in each image.

Figure 4. A comparison of the pseudo-labels obtained from PAg-NeRF and Mask2Former. PAg-NeRF exhibits superior detection performance, particularly for small objects, compared to Mask2Former. Green, red, and blue bounding boxes illustrate accurate detections, false positives, and missed detections, respectively.
We employ PAg-NeRF (Smitt et al., 2023) to post-process Mask2Former labels and generate geometrically and temporally consistent masks for each sequence. PAg-NeRF is an end-to-end trainable neural radiance field-based system. We acknowledge that NeRF-based systems are computationally expensive; however, in our work we consider all tasks (Mask2Former and PAg-NeRF) to be offline tasks and thus do not optimize them for speed or resources during training or inference. Along with panoptic masks (with frame-wise inconsistent IDs), training PAg-NeRF requires the RGB images, robot odometry poses, camera parameters (intrinsics and extrinsics), and optionally the depth images as inputs. The model then outputs 3D scene representations for colour
This data creation technique is generalizable and can be applied across different agricultural settings, crop varieties, and robotic platforms. The pipeline utilizes the official open-source implementations of both Mask2Former and PAg-NeRF without any architectural modifications. Provided that the necessary inputs – RGB images, robot odometry, and camera parameters – are available, the same approach can be applied to other crops, such as tomatoes and sugar beets. Depth images are optional and can be omitted if object filtering is not necessary. Researchers can leverage this strategy with their robotic platforms to efficiently generate spatial-temporal datasets with weak labels, thereby minimizing annotation efforts.
Using PAg-NeRF in this manner creates pseudo-labels for all three subsets. The dataset is released online (see footnote 1), along with the code required to run the various evaluations and the hyper-parameters used. To create an evaluation set we use a human in the loop to correct errors in the weak labels. The human operator checks each sequence in turn for ID-consistency of the pseudo-labels, adding annotations for any missed or incorrect detection, and ensuring ID-consistency for all detections in the sequence (see Figure 5). Creating the evaluation subset in this manner ensures all sweet pepper are included and allows all tracking and detection/segmentation metrics to be run on actual ground truth.
Figure 5. Evaluation set corrections. Left: Weakly labelled instances generated by PAg-NeRF. Right: Corrected version of the left image. Red boxes indicate erroneous detections, while green boxes represent the ground truth. As a depth filtering of 1.2 m was applied to the dataset, it provides annotations exclusively for foreground peppers.
Evaluation of instance-based pseudo-labels from PAg-NeRF against manually corrected ground-truth on the evaluation set.
M IoU represents mask-based IoU.
Number of tracklets in BUP-ST20 and number of instances for each sub-class.
Methods
In this paper, we introduce the BUP-ST20 dataset for MOT and VIS tasks. To demonstrate the value of this dataset we benchmark various state-of-the-art tracking algorithms. We also evaluate two novel tracklet matching paradigms, one using the second order statistics of a Kalman filter, and the other based on extending the dynamic radius approach of Halstead et al. (2021). We explore both methods to assess how well tracklets can be aggregated with and without access to a Kalman filter. These approaches aim to rectify the known issues within tracking-via-detection where tracklets are lost or identities are switched when frame skips occur or a low frame rate camera is employed. Frame skips and low frame rates are common issues when deploying robots in challenging real-world settings such as agriculture.
MOT algorithms
We consider both online and offline MOT methods. Online approaches (Bewley et al., 2016; Du et al., 2023; Zhang et al., 2022) process each image in the sequence as it appears, meaning they only deal with the past and current images. By contrast, offline MOT approaches (Halstead et al., 2021; Smitt et al., 2023) consider the sequence in its entirety. This allows offline methods to aggregate information from the future, generally making offline approaches more accurate. However, their use is often inappropriate in real-world situations such as a robot actively sensing in the field.
We evaluate various offline and online approaches. We employ algorithms based on tracking-via-detection (Halstead et al., 2021), approaches that utilize re-identification (Zhang et al., 2021) and contrastive learning (Ying et al., 2023), spatial-temporal masked attention (Cheng et al., 2021), and object token based instance association (Heo et al., 2022). All algorithms are deployed on the same dataset, BUP-ST20, with the same inputs, and we evaluate them in a consistent manner using common MOT metrics that outline the performance of our chosen approaches on our novel dataset.
Metrics
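The chosen MOT based metrics are: higher order tracking accuracy (HOTA) (Luiten et al., 2021), IDF1 (Ristani et al., 2016), and multiple object tracking accuracy (MOTA) (Bernardin et al., 2006). HOTA is the primary metric as it comprehensively evaluates both detection and association accuracy, and has been shown to produce the most complete evaluation of tracking approaches. Following the standard definition of Luiten et al. (2021), for a localization threshold α this metric can be expressed as
$$\mathrm{HOTA}_{\alpha} = \sqrt{\frac{\sum_{c \in \mathrm{TP}} \mathcal{A}(c)}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|}}, \qquad \mathcal{A}(c) = \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|},$$
where TP, FN, and FP are the true positive, false negative, and false positive detections, and TPA(c), FNA(c), and FPA(c) are the true positive, false negative, and false positive associations of a true positive c. The final HOTA score is the average of HOTA_α over the range of thresholds α.
The MOTA metric is defined by
$$\mathrm{MOTA} = 1 - \frac{\sum_{t} \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_{t} \mathrm{GT}_t},$$
where FN_t, FP_t, and IDSW_t are the false negatives, false positives, and identity switches in frame t, and GT_t is the number of ground-truth objects in frame t.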
For the final two ablation studies we use video and image-based average precision metrics (AP) to assess segmentation performance. Video-based AP evaluates the segmentation accuracy across an entire sequence, considering spatial and temporal consistency. This metric is crucial for VIS tasks as it ensures that objects are correctly segmented over time while also maintaining their identities across frames. Similarly, image-based AP measures segmentation performance on a per-image basis, focussing on pixel-wise accuracy. This metric is particularly useful for evaluating the effectiveness of instance segmentation models, where precise object boundaries and classification accuracy are critical. Both metrics are reported at different intersection-over-union (IoU) thresholds, such as AP50 (IoU > 50%) and AP75 (IoU > 75%), providing a detailed performance breakdown.
Novel MOT approaches
In this section we describe the two novel assignment methods for MOT that can be added to tracking-via-detection approaches. First, we describe the standard tracking-via-detection approach that uses intersection-over-union (IoU) to aggregate tracklets over a sequence. Then we describe the dynamic radius (DR) approach of Halstead et al. (2021) as this forms the basis of our two novel dynamic radius-extended (DRE) algorithms, which are presented thereafter. We describe two DRE based approaches based on the presence of a Kalman filter or not. DRE-Sigma relies on the Kalman filter to dynamically update the search position, while DRE-Delta only relies on the detected motion and is needed for methods that do not explicitly include a Kalman filter.
Intersection-over-union
The most common form of tracking-via-detection uses the IoU to match two objects into tracklets. These approaches assume that if objects are detected at similar or slightly offset locations in consecutive frames, then they are the same object. While some techniques have been developed to reduce the shift between images, such as re-projection (Smitt et al., 2022) or a Kalman filter (Zhang et al., 2022), IoU approaches generally assume good connectivity between images.
Figure 6 (top row) outlines the IoU approach. At image t − 1 the detection algorithm locates objects of interest in the scene. We perform the same object detection routine on image t. Then objects that overlap more than a threshold τ (or the objects with the greatest overlap) are considered the same object and tracklets are created or updated (if they already exist). This tracklet creation and update process can be done in real-time.
Figure 6. An illustration of IoU (top row) and dynamic radius (bottom row) tracklet aggregation approaches. For both approaches at t − 1 and t sweet pepper are detected (green bounding box in the left column images). For the IoU the detections at t are compared to that at t − 1 based on the IoU value (blue box in the top right). The detections with the highest IoU values are used to extend a tracklet. For the DR approach a radius (purple arrow) based on the largest edge of the detected bounding box (the height in this instance) is calculated. This radius is then used to search for the centroid (red dots in the bottom right) of objects within the search radius. The centroid can be calculated from either the centre location of the bounding box, or the mean location of segmented pixels as done in Halstead et al. (2021). If an object falls within this radius, the closest object based on the Euclidean distance is selected as the match, and tracklets are updated.
This approach fails when there is a low frame rate or large movement between images. Methods such as ByteTrack (Zhang et al., 2022) attempt to alleviate this limitation by using multiple IoU matching criteria and including a Kalman filter. This Kalman filter shifts detections at t − 1 to a new position at t to allow a better IoU score. Re-projection can also be used to address this challenge, but requires camera and robot movement parameters that are not always available (and are rarely supplied in MOT datasets).
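The IoU matching step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the evaluated trackers: the function names, the greedy assignment order, and the default threshold value are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_iou(prev: List[Box], curr: List[Box], tau: float = 0.3) -> List[Optional[int]]:
    """For each box at t-1, return the index of its match at t, or None.

    Greedy assignment (an assumption): candidate pairs are visited from the
    highest IoU downwards and each detection is used at most once.
    """
    pairs = sorted(
        ((iou(p, c), i, j) for i, p in enumerate(prev) for j, c in enumerate(curr)),
        reverse=True,
    )
    matches: List[Optional[int]] = [None] * len(prev)
    used = set()
    for score, i, j in pairs:
        if score <= tau:  # only overlaps above the threshold count as matches
            break
        if matches[i] is None and j not in used:
            matches[i] = j
            used.add(j)
    return matches
```

A box that moves further than its own width between frames has zero overlap with its previous position, so `match_iou` returns `None` for it. This is the failure mode under frame skips that the text describes.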
Dynamic radius
IoU matching, without a Kalman filter or re-projection, also requires objects to be of consistent size to match between frames. DR was first proposed by Halstead et al. (2021), and it aims to alleviate some of the limitations of IoU matching by instead matching within a radius based on the object size and a constant, τDR. As shown in Figure 6 (bottom row), first the radius is calculated as the largest edge of the bounding box of the current detected object. In the next frame, all objects are detected and their centres calculated: either the centre of the bounding boxes or the mean location of the segmented objects. The Euclidean distance between the current tracklet (from the previous frame) and the new object centres is calculated. If this distance falls within the dynamic radius, the closest object is considered a match and the tracklet is updated. The radius is likewise updated using the largest edge of the new bounding box, which allows the radius to scale as objects enter or exit the scene.
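A minimal Python sketch of the DR matching rule follows. It assumes that τDR acts as a multiplicative scale on the largest box edge and that centres are taken from the bounding boxes; neither detail is specified above, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def centre(box: Box) -> Tuple[float, float]:
    """Centre of a bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def dynamic_radius(box: Box, tau_dr: float = 1.0) -> float:
    """Largest box edge scaled by tau_dr (the scaling is an assumption)."""
    return tau_dr * max(box[2] - box[0], box[3] - box[1])


def match_dr(track_box: Box, detections: List[Box], tau_dr: float = 1.0) -> Optional[int]:
    """Index of the closest detection whose centre lies within the radius."""
    cx, cy = centre(track_box)
    best, best_dist = None, dynamic_radius(track_box, tau_dr)
    for j, det in enumerate(detections):
        dx, dy = centre(det)
        dist = math.hypot(dx - cx, dy - cy)
        if dist <= best_dist:
            best, best_dist = j, dist
    return best
```

After a successful match the tracklet would store the matched box, so that the next call derives its radius from the new detection.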
One of the benefits of this approach, as stated in Halstead et al. (2021), was that it was better able to track small objects. This is due to the movement of the robot/camera making smaller objects seem to ‘move’ further making IoU less likely to accurately assign objects to a tracklet. However, this approach (without re-projection) still fails when there is large movement in the scene (through robot movement or frame drops) as it still relies on the spatial distance to be within the search radius. If the objects appear outside of this search radius then no match will occur.
To address this limitation we explore two extensions to the DR approach. First, using the second order statistics in the Kalman filter we create dynamic radius-extended sigma (DRE-S). This extends the DR search radius to include the variance allowing for a greater search area, thus being able to track objects through the scene where greater movement appears (random frame skips, or low frame rate cameras). Also, for situations where the second order statistics are not available, we introduce dynamic radius-extended delta (DRE-D). This similarly extends the search radius, however, it is based on the average physical movement between frames of the tracked objects. In both of our novel tracking extensions, we no longer use a standard circle and instead transform the search area into an ellipse that better describes the object movement.
Dynamic radius extended - Sigma
MOT approaches such as ByteTrack include a Kalman filter to predict object movement in the scene. This allows for the bounding box at t − 1 to be shifted to an estimated position at time t based on the covariance matrix. We exploit the internal Kalman filter to extend the DR approach outlined in the previous section; we call this dynamic radius extended - sigma (DRE-S).
The Kalman filter we employ has an 8 × 8 covariance matrix. The first four dimensions relate to the movement of the objects in 4D space multiplied by a heuristic weight (1/10). The next four dimensions are the velocity of the objects in 4D space multiplied by a second heuristic weight (1/80). Both weights come from the publicly released ByteTrack code and are used to maintain fair reproduction with the original work as stated in Zhang et al. (2022). In contrast to the ByteTrack filter, we use the associated movement in all directions as the input to the Kalman filter, whereas they only used movement in the fourth dimension and a static constant for the third dimension. We empirically found that using the actual movement and velocity of the objects improved results slightly, and thus we use this Kalman filter formulation for all evaluations. Our modified code can be found in our GitHub repository (see footnote 1), where we will release the two novel search criteria, the modified offline trackers of Halstead et al. (2018, 2021), and details on how to modify the ByteTrack (Zhang et al., 2022) code to be used with the updated Kalman filter. This GitHub repository will also outline the location of the repositories for the other MOT approaches used in this work. It will also describe how to download and use the dataset and provide the evaluation results with the best hyper-parameters used for each of the evaluated MOT techniques.
This Kalman filter formulation allows us to directly associate each value in the covariance matrix with a movement or velocity in a given direction. Primarily, we are concerned with movement in the x (left to right) and y (top to bottom) directions. To get the movement in these directions, we extract the diagonal of the covariance matrix and calculate the square root of the elements. The first two values in the resulting vector are the movement in the x and y directions respectively.
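The extraction of the movement terms can be illustrated with a short Python sketch. It assumes the covariance is available as an 8 × 8 nested list whose first two diagonal entries correspond to the x and y position states, as described above; the function name is hypothetical.

```python
import math
from typing import List, Tuple


def movement_sigmas(covariance: List[List[float]]) -> Tuple[float, float]:
    """Return (sigma_x, sigma_y) from an 8 x 8 Kalman covariance matrix.

    The diagonal holds the per-state variances; their square roots are the
    standard deviations, and the first two belong to the x and y positions.
    """
    if len(covariance) != 8 or any(len(row) != 8 for row in covariance):
        raise ValueError("expected an 8 x 8 covariance matrix")
    std = [math.sqrt(covariance[i][i]) for i in range(8)]
    return std[0], std[1]
```

For example, a covariance with diagonal entries 4 and 9 in the first two positions yields `(2.0, 3.0)`.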
In both of our novel DRE based matching criteria we create an ellipse rather than the circle used in DR. For DRE-S we compute this ellipse using two distinct methods: maximum and summation. We evaluate these two methods to see which is best able to aggregate tracklets, and present the results in the following section. For the maximum approach we compute the x and y values of the ellipse using the standard radius of the DR method and the variance previously calculated; in this case either the x or y direction can be the major radius. The radii of the ellipse are calculated in the following manner,
The search ellipse can also be formed using a summation operation. In this case the two radii are calculated such that
Finally, the changed search region necessitates a different method to check if a detected object is within the search region of existing tracklets. We use the standard ellipse calculation for this with the following modifications:
This ellipse function now indicates if an object is within the search region of an existing tracklet. If the value of D is in the range [0, 1] then the new object is within the ellipse, if it is greater than one then it is outside the search range. Only objects within the field of search are considered matches, and the object with the smallest D value is added to the tracklet.
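A minimal Python sketch of the elliptical search, under stated assumptions: the maximum variant is taken as the per-axis maximum of the DR radius and the Kalman term, the summation variant as their per-axis sum, and D as the unmodified standard ellipse expression. The modifications mentioned above are not included, and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def ellipse_radii(radius: float, sigma_x: float, sigma_y: float,
                  mode: str = "summation") -> Tuple[float, float]:
    """Per-axis search radii (assumed forms of the two variants)."""
    if mode == "maximum":
        return max(radius, sigma_x), max(radius, sigma_y)
    if mode == "summation":
        return radius + sigma_x, radius + sigma_y
    raise ValueError(f"unknown mode: {mode}")


def ellipse_distance(centre: Point, point: Point, radii: Tuple[float, float]) -> float:
    """Standard ellipse value D; D <= 1 means the point lies inside."""
    gx, gy = radii
    return ((point[0] - centre[0]) / gx) ** 2 + ((point[1] - centre[1]) / gy) ** 2


def match_ellipse(centre: Point, candidates: List[Point],
                  radii: Tuple[float, float]) -> Optional[int]:
    """Index of the candidate with the smallest D in [0, 1], or None."""
    best, best_d = None, 1.0
    for j, p in enumerate(candidates):
        d = ellipse_distance(centre, p, radii)
        if d <= best_d:
            best, best_d = j, d
    return best
```

With a DR radius of 10 and Kalman terms of 4 (x) and 15 (y), the maximum variant gives radii (10, 15) and the summation variant (14, 25), so the summation ellipse is never smaller than the maximum one.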
Dynamic radius extended - Delta
Our second novel matching criterion is designed for MOT algorithms that do not have a Kalman filter, but still aggregate tracklets over time. Dynamic radius extended delta (DRE-D) is a tracking-via-detection addition that uses a moving average of the distance between tracked objects in images to create an elliptical search region. Depending on the predicted movement, this allows for larger search areas that help maintain tracklets when there are large differences between object detections (either from frame skips or from large robot movements).
While DRE-S takes the σ∗ values from the covariance matrix as movement parameters, DRE-D takes a different approach. At initialization, the two ellipse radii (dx,0 and dy,0) are calculated as
To match new objects to existing tracklets we employ three different criteria for DRE-D: default, maximum, and summation. The default criterion is calculated using
To match a new detection to the tracklet we calculate if it falls within the ellipse by using the same equation as that in the DRE-S method (equation (6)), with values of Γ∗ derived using the default, maximum, or summation criterion. Similar to DRE-S, if D is in the range [0, 1] this indicates that an object is close enough to the tracklet to be matched. Finally, once an object has been attached to a tracklet we update d∗ in both directions, using a first-in-first-out approach to ensure only N distances are retained (N = 3 by default).
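The first-in-first-out update of d∗ can be sketched in Python with a bounded queue. This sketch covers only the moving-average bookkeeping: the initial radii are passed in rather than computed, the per-axis absolute displacement of the centre is an assumed reading of "distance", and the class name is hypothetical.

```python
from collections import deque
from typing import Tuple

Point = Tuple[float, float]


class DeltaHistory:
    """Keeps the last n per-axis displacements of a tracklet."""

    def __init__(self, dx0: float, dy0: float, n: int = 3) -> None:
        # deque(maxlen=n) discards the oldest entry automatically (FIFO).
        self._dx = deque([dx0], maxlen=n)
        self._dy = deque([dy0], maxlen=n)

    def update(self, prev_centre: Point, new_centre: Point) -> None:
        """Record the displacement between two consecutive matched centres."""
        self._dx.append(abs(new_centre[0] - prev_centre[0]))
        self._dy.append(abs(new_centre[1] - prev_centre[1]))

    def mean(self) -> Tuple[float, float]:
        """Moving average of the retained displacements in x and y."""
        return sum(self._dx) / len(self._dx), sum(self._dy) / len(self._dy)
```

After more than N updates the initial values drop out of the average, so the search region follows the recently observed motion of the tracklet.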
The offline tracking approach
Finally, we modify the tracking-via-detection approach of Halstead et al. (2021) to suit our purpose. In Halstead et al. (2021) the authors incorporated a kill zone and an initialisation zone; however, we found these zones unnecessary. This was because the Mask2Former predictions were far more robust than the Mask R-CNN (He et al., 2017) detections used in Halstead et al. (2018, 2021). Removing these zones allows us to track objects more accurately throughout the scene.
This offline tracker forms the basis for our DRE matching criteria as it showcases the ability to better track objects in the presence of large frame skips. Our GitHub repository includes our version of the code; however, in our evaluations we refer to this method using the original citations.
Results and Discussions
Based on the novel dataset, BUP-ST20, we evaluate multiple state-of-the-art tracking techniques to challenge researchers to improve MOT on a highly challenging, robotically captured dataset. We perform quantitative and qualitative analysis along with three ablation studies that fully demonstrate our novel MOT based matching criteria and the dataset.
The primary evaluation of BUP-ST20 as a tracking dataset only uses the super-class (sweet pepper), even though the dataset does contain sub-classes (ripeness). We deploy twelve state-of-the-art tracking algorithms, and modify ByteTrack to consider different matching criteria (DR, DRE-D, DRE-S) and evaluate if IoU matching is the best policy. The offline tracker described in the method section is deployed with both the DR and DRE-D matching criteria (as it does not have a Kalman filter). For all tracking approaches (both ours and the state-of-the-art implementations) we perform a grid search for the best hyper-parameters on the validation set of BUP-ST20. In the evaluation we only report the best-performing approach and not all values of the grid search. In the released GitHub repository we will clearly state which hyper-parameters were selected for each approach.
Following this, we present our first ablation study, which evaluates tracker performance when a random number of frames is skipped or when simulating reduced frame rates. In Ablation II, we employ state-of-the-art video instance segmentation models on BUP-ST20 to demonstrate its suitability for VIS and evaluate the segmentation accuracy of these models. Finally, in Ablation III, we use the best-performing image-based instance segmentation model on the BUP-ST20 and BUP20 datasets to compare the performance of weakly labelled and fully annotated datasets. Additionally, we assess the mask accuracy of both image-based and video-based segmentation models using the weakly labelled dataset as a reference point.
Multi-object tracking on BUP-ST20
The primary evaluation compares state-of-the-art tracking algorithms and our two novel matching criteria on BUP-ST20. We chose 12 MOT algorithms for this evaluation based on their performance on various tracking datasets. Ordered by the year of their release we include: Bewley et al. (2016); Wojke et al. (2017); Halstead et al. (2018); Zhang et al. (2021); Halstead et al. (2021); Cheng et al. (2021); Zhang et al. (2022); Heo et al. (2022); Du et al. (2023); Smitt et al. (2023); Cao et al. (2023); Ying et al. (2023). These selected MOT algorithms are used to validate the novel dataset as a challenging task due to the unconstrained shapes and occlusions of the sweet peppers.
Quantitative results
Table 3. Tracking results on BUP-ST20, with methods ordered by the year they were released. Best performances are highlighted in bold.
As expected and as outlined in Table 3, the best-performing technique is an offline algorithm. This is due to the ability to consider past, present and future images, resulting in more robust tracking. For online approaches we see the standard ByteTrack (Zhang et al., 2022) algorithm achieves the best HOTA score, which is the principal tracking metric. With a score of 81.35, it beats the next best online system by an absolute margin of 0.87; however, in general all approaches achieve good results. Interestingly, FairMOT (Zhang et al., 2021), which includes re-identification of detected objects, achieves the worst HOTA score at 68.52. We attribute this to the fact that even though sweet pepper are highly deformable, their appearance for re-identification is not sufficiently unique for one-to-one matching. For the other metrics, IDF1 and MOTA, OCSort (Cao et al., 2023) achieves the best performance with only a small reduction in HOTA.
For the offline approaches we see that the modified offline tracking approach (Halstead et al., 2018) with IoU achieves the best results with a HOTA score of 82.12. Compared to PAg-NeRF, which we used to create BUP-ST20’s pseudo-labels, we see an improvement of 0.9; however, PAg-NeRF achieves the best MOTA score, indicating that it has fewer missed tracks. Unfortunately, for M2F-VIS and VITA we see a considerable drop off in performance despite being offline methods. We attribute this performance drop to the challenging nature of BUP-ST20, which comprises a significant number of natural occlusions (other sweet pepper, leaves, and stems). These challenges are natural within a horticultural dataset, and as a result these methods frequently miss detections or detect false positives. Furthermore, the homogeneity of the sweet peppers leads to inaccurate object associations, further hampering performance.
Overall, our novel matching criteria, while achieving close to the best performance, are in general not able to improve upon IoU matching. For the online DRE based approaches we saw that the summation approach was best for DRE-D and the maximum was best for DRE-S. For the offline DRE-D approach, the best-performing ellipse creator was obtained by using the maximum technique; however, there was only a small difference when compared to the summation technique. As such we recommend selecting the summation approach for all of our novel DRE based matching criteria, and for all ablation studies using the DRE criteria we have used this ellipse creation technique. The lack of improvement over IoU is in contrast to Halstead et al. (2021), who found that DR performed better than IoU for their application. We attribute some of this performance degradation to the described benefits of the DR approach, primarily that DR is better for small objects. As we have previously stated, Mask2Former has a known issue in that it does not detect small objects very accurately, and in BUP-ST20 there are a number of small objects in the scene. This is particularly evident in the right most image of Figure 7, where the ground-truth object is not detected. Missed detections of this type limit tracking ability as we do not detect all relevant objects that would otherwise improve tracker performance (small objects). In future, a more accurate detection model which better detects small objects could in turn improve the performance of the DR and DRE matching criteria.
Figure 7. Various examples of the challenges of the dataset; red bounding boxes indicate the detections and green bounding boxes represent the ground truth. In images (a) and (b) only the detections are displayed as there are no ground truth sweet pepper in the current row. (a) An example of detections prior to depth filtering, where all objects in the scene are more than 1.2 m from the camera and should not be detected. (b) Demonstrates the same image as in (a) after depth filtering; however, only a single sweet pepper in the deeper rows is detected. In the middle left of the image there is a single red incorrect bounding box. (c) Shows where leaves or peduncles are detected as sweet pepper. Also in the middle top there is an example of the detection routine failing due to occlusion. Finally, (d) is an example of where Mask2Former fails with small objects. There are three green bounding boxes (of small objects) which are completely missed.
Overall, the best approach for accuracy is the offline tracking algorithm based on Halstead et al. (2018), achieving best performance in HOTA and IDF1 and comparable performance for MOTA. However, offline trackers are not always appropriate, particularly for online robotic platforms. If an online tracker is required, OCSort or ByteTrack offer the best tradeoff between HOTA, MOTA, and IDF1.
Qualitative results
Here we illustrate some of the issues we encountered during the evaluation. First, while Mask2Former has some limitations, it is still a good detection algorithm for both foreground and background objects. Due to this we need to employ depth filtering on the detections from the Mask2Former model.
Figure 7(a) shows Mask2Former detections prior to depth filtering. A number of sweet pepper in background rows are ‘accurately’ detected. To only track objects in the current row, depth filtering is essential, the results of which can be seen in Figure 7(b). In Figure 7(b) we also note a single failure case of the depth filtering: there is one red bounding box on the left hand side below the centre. Figure 7(a) and (b) are the same image (without and with depth filtering), showcasing that even though we employ depth filtering, extra errors can be introduced into the detection pipeline. This outlines the complexity of a horticultural scene and the difficulties associated with both detecting and tracking objects through the scene.
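How the 1.2 m depth filter is applied to each instance is not detailed in this section. One plausible Python sketch is given below, assuming each detection comes with the depth readings (in metres) of the pixels inside its mask and that the median of those readings decides whether the instance is kept. The function name, the median rule, and the treatment of zero as a missing reading are all assumptions.

```python
from statistics import median
from typing import Dict, List, Sequence


def filter_by_depth(instances: Dict[int, Sequence[float]],
                    max_depth_m: float = 1.2) -> List[int]:
    """Return the ids of instances whose median mask depth is within range.

    `instances` maps an instance id to the depth values under its mask.
    """
    keep = []
    for inst_id, depths in instances.items():
        valid = [d for d in depths if d > 0.0]  # zero = missing reading (assumed)
        if valid and median(valid) <= max_depth_m:
            keep.append(inst_id)
    return keep
```

A rule of this kind can still pass a background object whose mask overlaps foreground structure, which is consistent with the single failure case noted for Figure 7(b).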
Another common issue in agriculture is the detection of non-objects as objects. This is especially common when target objects have similar appearance to other objects in the scene, i.e. green leaves can look like green sweet pepper. In Figure 7(c) we can see examples of this. It is clear that certain objects are incorrectly detected, particularly green leaves and a peduncle. We also see in this figure a case where one sweet pepper (centre top) is split into multiple objects due to occlusion.
Finally, Figure 7(d) shows where small objects are not detected. In the top of the image we see two small sweet pepper that are not detected, showing one of the major limitations of the Mask2Former model. Overall, while Mask2Former offers good performance, the challenges of a horticultural dataset for MOT and panoptic segmentation are not insubstantial. This outlines the challenge to other researchers to improve detection, segmentation, and MOT on our novel dataset.
Ablation I: Skipping frames
In the first ablation study we evaluate the considered state-of-the-art MOT algorithms when we introduce random frame skipping and low frame rates in the BUP-ST20 dataset. This simulates real-world situations where edge devices such as robots have limitations in processing capacity or speed inconsistencies. The random frame-skipping experiment is based on the Poisson distribution with λ = 1; in all evaluations we use the same set of randomly distributed frames, ensuring fair comparison between MOT techniques. Using λ = 1 ensures that the distribution of the frames is centred around having either 0 or 1 skips, with a smaller probability of having larger skips. This is more reflective of real-world situations, where most frames have no skip or only a very small skip. For the low frame rate evaluations we decimate the frames by 2, 5, and 10. As the BUP20 dataset was captured at 15 frames per second, this yields approximately 7, 3, and 1 frames per second, respectively.
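The two sampling schemes can be sketched in Python as follows. This is an illustration under assumptions: the gap after each kept frame is drawn independently from a Poisson distribution, a fixed seed stands in for the shared set of frames, and the function names are hypothetical. Knuth's algorithm is used for the Poisson draw so that only the standard library is needed.

```python
import math
import random
from typing import List


def poisson_sample(rng: random.Random, lam: float) -> int:
    """Draw one Poisson-distributed integer using Knuth's algorithm."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def random_skip_indices(n_frames: int, lam: float = 1.0, seed: int = 0) -> List[int]:
    """Keep a frame, skip a Poisson-distributed number of frames, repeat."""
    rng = random.Random(seed)  # fixed seed: every tracker sees the same frames
    kept, i = [], 0
    while i < n_frames:
        kept.append(i)
        i += 1 + poisson_sample(rng, lam)
    return kept


def decimate_indices(n_frames: int, step: int) -> List[int]:
    """Keep every step-th frame to simulate a lower frame rate."""
    return list(range(0, n_frames, step))
```

Starting from 15 frames per second, steps of 2, 5, and 10 correspond to 7.5, 3, and 1.5 frames per second, which the text reports as approximately 7, 3, and 1.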
Table 4. Random frame skipping and low frame rate results, with the HOTA metric reported for each method. Best performances are highlighted in bold.
Finally, we only select a subset of MOT techniques for this ablation study. We select ByteTrack (Zhang et al., 2022) (with IoU, DR, DRE-S, and DRE-D) as this best represents online approaches that use the state-of-the-art Kalman filter. We include OCSort (Cao et al., 2023) as the best-performing approach for MOTA and IDF1 in Table 3, and FairMOT (Zhang et al., 2021) for robustness, as it is a considerably different re-identification based approach. For the offline approaches we select Halstead et al. (2018) for the IoU matching and Halstead et al. (2021) for DR and DRE-D. We do not further evaluate PAg-NeRF in this evaluation as it has a learnt matching criterion and would therefore need to be retrained and optimized on this specific data. This would mean that the assumption of ‘random skipping’ is removed as it would be trained specifically for these sequences.
Table 4 once again splits the evaluated MOT techniques into offline and online, where the bold numbers indicate the best-performing approach for each column. Starting with the online approaches we see that for random frame skips, adding the DRE-S criterion to the already accurate Kalman filter of ByteTrack performs the strongest. It achieves an absolute performance improvement of 2.2 over the standard IoU based approach. This is in contrast to the results seen in Table 3, where the IoU based ByteTrack performed the best. This outlines the benefit of incorporating movement based on the Kalman filter with an elliptical shape when there is potential for frame skipping. In this and the reduced frame rate experiment we used the summation approach for the ellipse formation. We attribute this to the ability to provide richer movement context in the search radius. For the reduced frame rate experiment we see that, in general, the standard dynamic radius approach outperforms the others. This is particularly noticeable in the 5f evaluation, where it scores an absolute improvement of approximately 10 over the DRE-S approach. However, for the 10f decimation method we see that the DR and DRE-D approaches are commensurate. For both FairMOT and OCSort we see considerable degradation in performance when either frame skipping or low frame rates are present. Overall, where intermittent frame skipping is possible it is clear that selecting the DRE-S method is beneficial, although for consistently low frame rates it is best to consider the standard dynamic radius approach.
The three offline approaches tell a slightly different story to the online approaches which we attribute to the possibility of looking into the future as well as the past. For these approaches, due to the fact that the Kalman filter is not available, the DRE-D approach achieves either considerable performance boosts (λ1, 5f, and 10f) or commensurate performance (2f). Once again, for the DRE-D approach the summation method is used for shape creation. For the 5f and 10f the selection of DRE-D is clear; we are able to achieve an absolute improvement of approximately 10 (5f) and 20 (10f), outlining its ability to better aggregate tracklets when the frame rate is low. The HOTA scores for these two low frame rate evaluations (5f and 10f) are the best across both the offline and online techniques.
For the online approaches no single approach offers best performance across all scenarios. Where frame skipping may occur the DRE-S approach is beneficial with the summation ellipse calculation. In low frame rate situations there is actually a degradation in performance when selecting this approach and in fact the standard DR approach is more appropriate. For the offline tracking the performance gains of the DRE-D method are considerable, both for frame skipping and low frame rates. This makes it, along with the summation approach, the clear criterion choice.
From a qualitative standpoint there are obvious situations where our novel criteria succeed and fail. The most common scenarios are outlined in Figure 8, and are consistent across both random frame skips and low frame rates.
Figure 8. Success and failure cases of the DRE methods. Both sets of images were taken from the same sequence (412) in the evaluation set; however, they display a common trend. From f0 to f2 there are small skips and the IoU version performs well; for f3 there is a large skip and IoU fails in both (a) and (b). In f∗ we denote the ground truth tracklets with green lines, the IoU tracklets with red lines, and the DRE tracklets with blue lines; we have slightly offset the IoU and DRE lines for visual purposes. (a) An example of where IoU beats DRE; even with a small detection (erroneously small) there is enough IoU to be considered a match. The centre location of the bounding box falls outside of the DRE search radius and thus it fails. (b) An example of where DRE is better than IoU. When there are two objects close together and the movement of the bounding box is not sufficient to tell them apart, the DRE method still only detects objects that are within the search radius and closest in Euclidean distance, whereas the IoU method solely aggregates tracklets based on the highest overlap (which is incorrect in this example at f2). In (b), when we consider f3, the DRE method is still able to update the tracklets even though there is greater movement between frames.
Figure 8(a) is an example of where the IoU method is able to outperform the DRE methods. In this example, the detection from f1 (which can be shifted depending on the base technique) is used at f2. The detections at f2 are then compared to the f1-based detections; in this case, the f2 detections are erroneously small (they do not cover the entire object). Even though the overlap between the f1 and f2 bounding boxes is small, it is enough to be considered a match, and the tracklet is updated. For the DRE methods, however, even though the ellipse is shifted based on the movement parameters, the centre of the new object in f2 is outside of the search radius and thus a match is not found. This trend could be somewhat rectified with an improved object detection routine that accounts for occlusion. It should be noted that in f3, due to the considerable frame skip in the scene, the IoU-based method completely fails and the tracklet is no longer retained.
The novel DRE criterion shows considerable gains over IoU-based matching when there are occlusions from other sweet pepper. In Figure 8(b) we see a clear example of this. There are two sweet pepper close together and the IoU matching simply aggregates the object with the highest overlap, which is not always correct. The DRE method accounts for this by shifting the ellipse and searching, based on the Euclidean distance, for objects that are within the search radius. Due to this internal shifting and search radius, the DRE method is also able to track the object as it shifts greater distances: f2 → f3.
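The contrast between the two matching styles can be illustrated with a minimal sketch. This is not the DRE formulation defined earlier in the paper; it is a simplified stand-in that assumes an axis-aligned elliptical gate, a known per-frame shift, and greedy single-track matching. The function names (`match_iou`, `match_gated_distance`) and the parameters `shift`, `radii`, and `min_iou` are hypothetical and chosen only for illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def centre(box):
    """Centre point (cx, cy) of a box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def match_iou(track_box, detections, min_iou=0.1):
    """Index of the detection with the highest IoU above min_iou, else None."""
    best, best_score = None, min_iou
    for i, det in enumerate(detections):
        score = iou(track_box, det)
        if score > best_score:
            best, best_score = i, score
    return best


def match_gated_distance(track_box, detections, shift, radii):
    """Simplified distance-based matching (an assumption, not the paper's DRE).

    The track centre is moved by `shift` (dx, dy), detections whose centre
    lies inside an ellipse with semi-axes `radii` (rx, ry) are kept, and the
    closest one in Euclidean distance is returned (None if the gate is empty).
    """
    cx, cy = centre(track_box)
    cx, cy = cx + shift[0], cy + shift[1]
    best, best_dist = None, math.inf
    for i, det in enumerate(detections):
        dx, dy = centre(det)[0] - cx, centre(det)[1] - cy
        inside = (dx / radii[0]) ** 2 + (dy / radii[1]) ** 2 <= 1.0
        dist = math.hypot(dx, dy)
        if inside and dist < best_dist:
            best, best_dist = i, dist
    return best


# Two nearby objects: index 0 is the true continuation, index 1 a neighbour.
track = (100, 100, 140, 140)
true_obj = (132, 100, 172, 140)
neighbour = (110, 100, 150, 140)
print(match_iou(track, [true_obj, neighbour]))
print(match_gated_distance(track, [true_obj, neighbour], (30, 0), (25, 25)))
```

In this toy configuration the overlap-based matcher selects the neighbouring object, while the shifted, gated distance matcher selects the true continuation, mirroring the situation described for Figure 8(b). With a large shift and no remaining overlap, only the gated matcher can still return a match.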
Ablation II: Video instance segmentation
This ablation study aims to show the applicability of our large-scale spatial-temporal dataset for VIS models. Existing datasets in the agricultural domain are often limited in size and lack comprehensive annotations, making it challenging to develop robust segmentation models. In contrast, our dataset offers a vast number of annotations, including bounding boxes, segmentation masks, and consistent IDs across frames, which makes it well-suited for VIS tasks.
We evaluate state-of-the-art VIS models including CTVIS (Ying et al., 2023), Mask2Former-VIS (Cheng et al., 2021), and VITA (Heo et al., 2022). These models are chosen due to their high performance on benchmark datasets such as YouTube-VIS 2021 (Yang et al., 2021) and OVIS (Qi et al., 2022). We use 8 NVIDIA A100 GPUs with 80 GB of memory each, and follow the baseline settings for each model. We use Swin-L as the backbone, a frame resolution of 960 × 540, and a learning rate of 0.00005 for all models. CTVIS, Mask2Former-VIS, and VITA are trained on BUP-ST20 for 8k, 11k, and 14k iterations respectively, with a batch size of 8. Pre-trained weights are used for each model.
In this ablation study we use the video-based average precision (AP) metric. While tracking metrics (HOTA, MOTA, IDF1) focus on the accuracy of object associations across frames, AP provides an assessment of the segmentation accuracy of each tracklet. As the primary goal is to evaluate how effectively the models segment objects over time rather than just track their identities, AP serves as a more suitable metric here. In addition to the VIS performance comparison, we also calculate AP scores for the standard ByteTrack approach and Halstead et al. (2018), which have the two best HOTA scores in Table 3. While video instance segmentation models provide segmentation masks, this MOT evaluation relies on bounding boxes. To enable a direct evaluation, we extract bounding boxes from the predicted segmentation masks and compute the AP accordingly. This allows us to assess the advantages of video-based segmentation models while maintaining comparability with existing MOT methods.
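The mask-to-box conversion mentioned above can be sketched as follows. This is a minimal illustration assuming binary masks stored as 2D NumPy arrays and an inclusive pixel-coordinate box convention; the function name `mask_to_bbox` is hypothetical and the exact convention used in the evaluation is not stated here.

```python
import numpy as np


def mask_to_bbox(mask):
    """Tight box (x_min, y_min, x_max, y_max) around a binary mask.

    Coordinates are inclusive pixel indices (an assumed convention).
    Returns None when the mask contains no foreground pixels.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# A 5 x 6 mask with a foreground block on rows 1-2 and columns 2-4.
example = np.zeros((5, 6), dtype=np.uint8)
example[1:3, 2:5] = 1
print(mask_to_bbox(example))
```

Boxes obtained this way can then be scored with the same bounding-box AP protocol as the MOT outputs.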
Overall, from Table 5 we see that CTVIS (Ying et al., 2023) is consistently able to outperform other approaches in all metrics by a considerable margin. This performance can be attributed to the use of contrastive learning and a memory bank during training, which enables enhanced object representation and segmentation accuracy. However, as shown in Figure 9, CTVIS still encounters object association errors. In the main evaluation on BUP-ST20 we saw that M2F-VIS and VITA performed commensurate with the best-performing approaches, yet this is not the case here. For VITA this is particularly evident as it does not explicitly incorporate temporal information, limiting its ability to handle complex horticultural scenarios. In both of these approaches there are missed/incorrect detections and a number of object association errors caused by a myriad of reasons including insufficient object representations.
Figure 9. An image sample from the BUP-ST20 evaluation set illustrates that standard ByteTrack’s dual-threshold approach leads to false positives. In VIS models’ outputs, object association errors are evident, as multiple distinct instances are incorrectly merged into a single detection. Green, red, and blue bounding boxes demonstrate accurate detections, false positives, and missed detections respectively. For simplicity, the object IDs are not visualized in the image.
Table 5. AP results on the BUP-ST20 evaluation set.
AP^b and AP^m scores represent bbox and mask, respectively.
Based on this ablation study, it is clear that for video instance segmentation, CTVIS offers the most robust performance. This, in conjunction with the results of Table 3, highlights CTVIS as an effective tool for both MOT and VIS. These results also demonstrate that BUP-ST20, as a weakly labelled spatial-temporal dataset, can be utilized for video instance segmentation in the agricultural domain. Furthermore, beyond traditional tracking-by-detection methods, BUP-ST20 provides segmentation masks, enabling the application of tracking-by-segmentation approaches. This capability allows for more precise object tracking, as it leverages pixel-level segmentation.
Ablation III: Instance-based semantic segmentation improvement
In Ablation III, we leverage the pseudo-labels in BUP-ST20 to explore whether they can improve instance-based semantic segmentation. Generally, hand-labelled masks are used to train instance-based semantic segmentation models. Creating these annotations is a laborious process, however, and employing weak labels instead of fully annotated data can substantially reduce annotation time while maintaining or even enhancing performance. Our novel dataset has 7914 weakly labelled images compared to just 124 images in BUP20, which means that we have almost two orders of magnitude more images to train the models on.
We explore the use of three techniques to exploit these pseudo-labels. First, we train a Mask2Former model using the filtered BUP20 dataset, with 124 annotated images; we refer to this as the baseline. We then train a Mask2Former model using BUP-ST20, with 7914 weakly labelled images, to directly compare the benefits of including the weak labels. Finally, we train CTVIS on the video sequences obtained from the weak labels and then make use of the underlying Mask2Former model; CTVIS consists of a Mask2Former model with an additional re-identification component.
We evaluate all the models on the hand-labelled evaluation set of BUP20. To maintain evaluation conformity, we depth filter all masks and reject any masks further than 1.2 m from the camera. The metric used in this ablation study is image-based average precision (AP). We break this metric down into multiple values: AP, AP50, AP75, APs (small), APm (medium), and APl (large); APs is [0², 72²], APm is [72², 144²], and APl is [144², ∞) pixels.
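The depth filtering and size bucketing described above can be sketched as follows. The text does not specify how a mask's distance is measured, so this sketch assumes the median of the valid depth values under the mask; it also assumes half-open intervals at the 72² and 144² boundaries. The names `keep_mask` and `size_bucket` are hypothetical.

```python
import numpy as np

MAX_DEPTH_M = 1.2            # rejection distance stated in the text
SMALL_MAX_AREA = 72 ** 2     # upper bound of the "small" bucket, in pixels
MEDIUM_MAX_AREA = 144 ** 2   # upper bound of the "medium" bucket, in pixels


def keep_mask(mask, depth_m, max_depth_m=MAX_DEPTH_M):
    """Keep a mask if its median valid depth is within max_depth_m.

    Using the median is an assumption; invalid depth readings (non-finite
    or non-positive) are ignored, and a mask without valid depth is rejected.
    """
    values = depth_m[mask.astype(bool)]
    values = values[np.isfinite(values) & (values > 0)]
    if values.size == 0:
        return False
    return float(np.median(values)) <= max_depth_m


def size_bucket(area_px):
    """Assign a mask area in pixels to the small, medium, or large bucket."""
    if area_px < SMALL_MAX_AREA:
        return "small"
    if area_px < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# A 2 x 2 mask on a toy 4 x 4 depth image that is 1.0 m everywhere.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
depth = np.full((4, 4), 1.0)
print(keep_mask(mask, depth), size_bucket(int(mask.sum())))
```

A stricter rule, such as rejecting a mask when any of its pixels exceeds the limit, would be an equally plausible reading of the text and only requires replacing the median with a maximum.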
Instance segmentation for Mask2Former models on the BUP20 filtered evaluation set.
The best performance is highlighted in bold and the underlined values represent the highest scores achieved using only image-based training.
For the video-based segmentation approach, we observe that CTVIS outperforms image-based segmentation methods in both single- and multi-class scenarios. Specifically, it surpasses the baseline in overall AP for the single- and multi-class tasks by 5.4 and 12.8 points, respectively. This considerable performance improvement occurs across all metrics, and we attribute it to the ability of CTVIS to extract spatio-temporally consistent features through contrastive learning to perform re-identification.
The results demonstrate that BUP-ST20 can enhance image-based instance segmentation performance, particularly for small objects, which is crucial in robotic based agricultural applications. Rather than using fully annotated small datasets, exploiting a large-scale weakly labelled dataset improves the model’s robustness while reducing annotation effort. Beyond image-based segmentation, BUP-ST20 is well-suited for more complex tasks such as VIS and MOT. These tasks inherently require a deeper understanding of spatial-temporal consistency and object association, making them more challenging than image-based segmentation. The large-scale and structured nature of BUP-ST20 provides an opportunity to develop and evaluate models specifically designed for these complex scenarios.
Overall, the results displayed in the primary and ablation studies show that our novel BUP-ST20 dataset is a valuable resource to the agricultural domain. It contains considerably more hand-labelled evaluation images that can be used for MOT, VIS, and instance-based semantic segmentation. It also includes a large training set composed entirely of pseudo-labels, which can be highly valuable for training more accurate models across a multitude of tasks. We show here that while the dataset is a valuable addition to this domain, there is still considerable work that needs to be completed before phenotypic traits can be accurately recognized in agricultural scenes.
Conclusion
In this paper, we introduce and release BUP-ST20, the largest weakly labelled spatial-temporal dataset for tracking and segmentation in agriculture, specifically focussing on sweet pepper, and captured on a robotic platform. We also outline how this large-scale pseudo-label based dataset creation technique can be abstracted to other robotic platforms or crop types depending on the outputs from the robotic platform. Through extensive evaluations of existing MOT and VIS algorithms, we demonstrated the dataset’s applicability and outlined key challenges such as occlusion and object similarity. Additionally, we proposed two novel matching criteria for improved tracklet association under random frame-skipping and low frame rate conditions, which showed promise in maintaining object identity over sequences. Our novel matching criteria can be used with existing MOT techniques that either do (DRE-Sigma) or do not (DRE-Delta) include a Kalman filter. Our results indicate that weak-labelling methods, such as those based on PAg-NeRF, can effectively expand training datasets and improve model performance. Future work will explore self-supervised learning, contrastive methods, and higher-fidelity pseudo-labelling techniques to further improve tracking and segmentation performance in complex agricultural environments. We believe BUP-ST20 will prove to be a valuable resource in advancing robotic perception in agriculture, thereby fostering innovation in automated crop monitoring.
Footnotes
Acknowledgements
We would like to acknowledge the people that assisted in annotating the BUP-ST20 dataset, particularly Claus Smitt, Patrick Zimmer, and Alireza Ahmadi.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) 459376902 and under Germany’s Excellence Strategy - EXC 2070 – 390732324, and partially funded by the German Federal Ministry of Education and Research (BMBF) in the project ‘Robotics Institute Germany’, grant number 16ME0999.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
