Dynamic Scale Perception-Driven Multi-UAV Collaborative 3D Object Detection Method
Abstract: Multi-UAV collaborative 3D object detection is a core technology in low-altitude intelligent perception, and the Bird's Eye View (BEV) feature representation paradigm provides global spatial consistency for this task. In practice, however, remote sensing imagery is characterized by small target scales and sparse target distributions. Existing Transformer-based BEV perception methods process the whole image homogeneously, which either wastes substantial computing resources or loses the fine-grained features of small targets, making it difficult to balance computational efficiency and detection accuracy. To address this problem, this paper proposes a dynamic scale-aware detection network for multi-UAV collaborative scenarios, whose core idea is a scale-differentiated feature processing mechanism that jointly optimizes computational efficiency and detection accuracy. Two core modules are designed: a Dynamic Scale-Aware BEV Generation (DSBG) module and an Adaptive Collaborative BEV-Feature Aggregation (ACFA) module. The DSBG module dynamically generates multi-resolution BEV features according to the target distribution in each UAV's feature map; the ACFA module adaptively weights and fuses the multi-resolution BEV features into a globally consistent collaborative BEV feature, which is then fed into the detection decoder for target prediction. Experiments show that the proposed network performs strongly on the AeroCollab3D and Air-Co-Pred multi-UAV collaborative simulation datasets, reaching mean Average Precision (mAP) of 64.0% and 80.6%, improvements of 1.5% and 7.2% over other state-of-the-art methods, while reducing computational cost by up to 41.6%, achieving an efficient balance between computational efficiency and detection accuracy.

Abstract:
Objective  Multi-UAV collaborative 3D object detection is a core technology in low-altitude intelligent perception, with the Bird's-Eye-View (BEV) feature representation paradigm providing global spatial consistency. However, in practical UAV remote sensing scenarios, targets are extremely small in scale and sparse in distribution, and background occupies a large proportion of each image. Existing Transformer-based BEV perception methods adopt a full-image homogeneous feature processing strategy, which not only wastes substantial computing resources on large background regions but also allows background noise to dilute small-target features, making it difficult to balance computational efficiency and detection accuracy. Meanwhile, multi-UAV collaboration requires cross-device information interaction to achieve view complementarity and information gain, yet this interaction is prone to introducing redundant information and even feature conflicts. Traditional fixed-weight aggregation methods cannot accurately screen effective information and suppress redundancy, which degrades the consistency of the global BEV features and the accuracy of collaborative detection. Developing a detection network adapted to multi-UAV aerial scenarios is therefore of great practical significance.

Methods  A dynamic scale-aware detection network is proposed to achieve efficient and accurate 3D object detection through two core modules: the Dynamic Scale-Aware BEV Generation (DSBG) module and the Adaptive Collaborative BEV-Feature Aggregation (ACFA) module. The network constructs an end-to-end pipeline of "multi-view image input, dynamic scale-adaptive feature encoding, BEV-space 3D detection" (Fig. 1). First, the images collected by each UAV are independently processed by a parameter-sharing ResNet-50 backbone to generate structurally consistent feature maps.
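The parameter-sharing stage above can be illustrated with a minimal sketch. This is not the paper's implementation: the toy `shared_backbone` function (block pooling plus a channel projection) merely stands in for the shared ResNet-50, showing that one weight set applied to every UAV view yields structurally consistent feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_backbone(image, w):
    """Toy stand-in for the shared backbone: 4x4 block pooling followed by a
    channel projection, downsampling H and W by 4. Not the paper's network."""
    H, W, C = image.shape
    blocks = image.reshape(H // 4, 4, W // 4, 4, C).mean(axis=(1, 3))  # spatial pooling
    return blocks @ w                                                  # channel projection

n_uavs, H, W = 3, 32, 32
w = rng.standard_normal((3, 64))                # one weight set, shared by all UAVs
views = rng.standard_normal((n_uavs, H, W, 3))  # one image per UAV

# Every view passes through the SAME weights, so all outputs share one structure.
feats = np.stack([shared_backbone(v, w) for v in views])
print(feats.shape)  # (3, 8, 8, 64)
```

Because the weights are shared, the downstream DSBG module can treat the per-UAV feature maps interchangeably.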
The DSBG module then takes these feature maps as input, calculates the amplitude of feature responses in each spatial region via its Local Scale-Aware Unit to estimate the target distribution, and dynamically allocates differentiated BEV grid encoding: high-resolution dense grids for high-response target regions to preserve fine-grained features, and low-resolution sparse grids for low-response background regions to reduce invalid computation. It simultaneously generates target query vectors carrying spatial position priors. The ACFA module receives the multi-resolution BEV features from the DSBG module, stacks the dual-resolution features of different UAVs along the channel dimension, upsamples the low-resolution features to align with the high-resolution ones, models the local correlations of the two scales through a 3×3 convolution, and obtains a globally consistent BEV feature map via element-wise weighted summation. Finally, the global BEV features are fed into a DETR decoder to complete 3D target prediction, with Focal Loss used for classification and Smooth L1 Loss for regression (Eqs. 5–6).

Results and Discussions  Extensive experiments are carried out on two public multi-UAV collaborative simulation datasets, AeroCollab3D and Air-Co-Pred. The results demonstrate that the proposed method achieves outstanding performance on both. Compared with current state-of-the-art methods and baseline models, it not only improves mean Average Precision (mAP) significantly (by up to 7.2 percentage points) but also markedly reduces key error metrics such as mean size error (by more than 48%), mean localization error, and mean orientation error. It shows particular advantages in small-target detection and fine-grained category recognition, improving pedestrian detection accuracy by nearly 10 percentage points. Ablation experiments verify the effectiveness of the DSBG and ACFA modules.
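The response-amplitude-driven grid allocation that DSBG performs (described in the Methods above) can be sketched as follows. The function names, the quantile-based selection rule, and the concrete resolutions are illustrative assumptions, not the paper's code; the sketch only shows the idea of assigning dense grids to high-response regions and sparse grids to background.

```python
import numpy as np

rng = np.random.default_rng(1)

def region_response(feat, region=4):
    """Mean L2 response amplitude of the feature map within each region x region block."""
    H, W, C = feat.shape
    amp = np.linalg.norm(feat, axis=-1)  # (H, W) per-location amplitude
    return amp.reshape(H // region, region, W // region, region).mean(axis=(1, 3))

def allocate_grids(feat, dense_res=4, sparse_res=1, q=0.75):
    """Per-region BEV resolution map: dense where the response is high.
    The quantile threshold q is an assumed stand-in for the paper's rule."""
    resp = region_response(feat)
    thresh = np.quantile(resp, q)
    return np.where(resp >= thresh, dense_res, sparse_res)

feat = rng.standard_normal((32, 32, 64))
feat[8:16, 8:16] *= 5.0                # a synthetic high-response "target" region
grid = allocate_grids(feat)
print(grid.shape, grid.max(), grid.min())  # (8, 8) 4 1
```

Regions covering the synthetic target receive the dense resolution, while most of the background stays sparse, which is the mechanism the Cost column in Tables 1 and 4 reflects.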
The proposed method steadily improves detection accuracy while reducing computational cost by up to 41.6%, realizing the synergistic optimization of accuracy and efficiency. Visualization results (Fig. 3) indicate that the predicted bounding boxes of the proposed method align more closely with the ground truth, effectively alleviating the target overlap and missed detections common in traditional methods. Figure 4 demonstrates the advantage of multi-UAV collaborative detection more intuitively: even targets occluded by obstacles are detected reliably, which enhances comprehensive perception of the global region.

Conclusions  This paper proposes a dynamic scale-aware detection network for multi-UAV collaborative 3D object detection, addressing the core challenges of unbalanced efficiency-accuracy trade-offs and poor feature consistency in traditional methods. The DSBG module dynamically matches the BEV encoding scale to the target distribution, reducing redundant computation, while the ACFA module optimizes multi-scale, multi-view feature aggregation, ensuring globally consistent and accurate features. Experimental results on two datasets confirm that the proposed method outperforms existing advanced methods in detection accuracy, computational efficiency, and robustness. Future work will focus on optimizing the dynamic scale adjustment strategy with temporal information and exploring multi-sensor fusion with lightweight LiDAR data to enhance detection stability in complex scenes.
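The ACFA aggregation step described in the Methods (upsample the low-resolution BEV map, model local correlations with a 3×3 convolution, then fuse by element-wise weighted summation) can be sketched as below. The gating formula and single-channel convolution are illustrative assumptions; the paper's actual weight computation is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling over the spatial dims of an (H, W, C) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv3x3(x, k):
    """Single-channel 3x3 convolution with zero padding; x is (H, W)."""
    H, W = x.shape
    p = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + H, j:j + W]
    return out

def acfa_fuse(hi, lo, k):
    """Element-wise weighted fusion of a high-res and an upsampled low-res BEV map."""
    lo_up = upsample2x(lo)
    # Assumed correlation cue: 3x3 conv over the combined response magnitudes.
    corr = conv3x3(np.abs(hi).mean(-1) + np.abs(lo_up).mean(-1), k)
    w = 1.0 / (1.0 + np.exp(-corr))                 # per-location gate in (0, 1)
    return w[..., None] * hi + (1.0 - w[..., None]) * lo_up

hi = rng.standard_normal((16, 16, 8))   # high-resolution BEV features
lo = rng.standard_normal((8, 8, 8))     # low-resolution BEV features
k = rng.standard_normal((3, 3)) / 9.0
fused = acfa_fuse(hi, lo, k)
print(fused.shape)  # (16, 16, 8)
```

Because the gate lies in (0, 1), each fused value is a convex combination of the two scales, so neither resolution can dominate the global BEV map outright.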
Table 1 Comparative experimental results on the AeroCollab3D dataset
| Method | BEV grid size | mAP↑ (%) | mATE↓ (m) | mASE↓ | mAOE↓ | Cost↓ |
|---|---|---|---|---|---|---|
| BEVDet [10] | 128×128 | 55.4 | 0.512 | 0.196 | 0.498 | 4.712 |
| BEVDet4D [11] | 128×128 | 58.7 | 0.499 | 0.102 | 0.317 | 4.712 |
| BEVLongTerm [10] | 128×128 | 33.5 | 0.527 | 0.298 | 0.515 | 4.712 |
| BEVDepth [12] | 128×128 | 59.9 | 0.489 | 0.106 | 0.495 | 4.712 |
| Where2comm [25] | 128×128 | 52.3 | 0.473 | 0.199 | 0.415 | 4.712 |
| UCDNet [21] | 128×128 | 62.5 | 0.487 | 0.188 | 0.399 | 4.712 |
| Proposed method | − | 64.0 | 0.460 | 0.086 | 0.288 | 3.505 |

Table 2 Fine-grained detection results on the AeroCollab3D dataset
| Target class | mAP↑ (%) | mATE↓ (m) | mASE↓ | mAOE↓ |
|---|---|---|---|---|
| Car | 79.7 | 0.300 | 0.096 | 0.043 |
| Truck | 64.2 | 0.515 | 0.089 | 0.049 |
| Bus | 57.6 | 0.493 | 0.070 | 0.050 |
| Pedestrian | 54.7 | 0.536 | 0.093 | 1.011 |

Table 3 Comparative experimental results on the Air-Co-Pred dataset
Table 4 Fine-grained baseline comparison results on the AeroCollab3D dataset
| Method | BEV grid size | car_AP (%) | truck_AP (%) | bus_AP (%) | pedestrian_AP (%) | mAP↑ (%) | Interaction transmission ratio | Cost↓ |
|---|---|---|---|---|---|---|---|---|
| Baseline | 50×50 | 71.4 | 57.4 | 55.0 | 36.4 | 55.0 | 0.0625 | 2.000 |
| Baseline | 200×200 | 74.6 | 63.3 | 56.2 | 45.5 | 58.5 | 1.0000 | 6.000 |
| +DSBG | − | 72.7 | 62.3 | 54.8 | 35.3 | 56.2 | 0.1775 | 3.317 |
| +DSBG+ACFA | − | 79.7 | 64.2 | 57.6 | 54.7 | 64.0 | 0.1787 | 3.505 |
[1] ZONG Zhuofan, JIANG Dongzhi, SONG Guanglu, et al. Temporal enhanced training of multi-view 3D object detector via historical object prediction[C]. Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 3758–3767. doi: 10.1109/ICCV51070.2023.00350.
[2] HE Jiang, YU Wanxin, HUANG Hao, et al. Joint task allocation, communication base station association and flight strategy optimization design for distributed sensing unmanned aerial vehicles[J]. Journal of Electronics & Information Technology, 2025, 47(5): 1402–1417. doi: 10.11999/JEIT240738.
[3] YANG Dingkang, YANG Kun, WANG Yuzheng, et al. How2comm: Communication-efficient and collaboration-pragmatic multi-agent perception[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1093.
[4] HU Senkang, FANG Zhengru, DENG Yiqin, et al. Collaborative perception for connected and autonomous driving: Challenges, possible solutions and opportunities[J]. IEEE Wireless Communications, 2025, 32(5): 228–234. doi: 10.1109/MWC.002.2400348.
[5] LI Xueping, TUPAYACHI J, SHARMIN A, et al. Drone-aided delivery methods, challenge, and the future: A methodological review[J]. Drones, 2023, 7(3): 191. doi: 10.3390/drones7030191.
[6] LI Zhenxin, LAN Shiyi, ALVAREZ J M, et al. BEVNeXt: Reviving dense BEV frameworks for 3D object detection[C]. Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 20113–20123. doi: 10.1109/CVPR52733.2024.01901.
[7] WANG Xiaoming, CHEN Hao, CHU Xiangxiang, et al. AODet: Aerial object detection using transformers for foreground regions[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 4106711. doi: 10.1109/TGRS.2024.3407815.
[8] WANG Yuchao, WANG Zhirui, CHENG Peirui, et al. AVCPNet: An AAV-vehicle collaborative perception network for 3-D object detection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5615916. doi: 10.1109/TGRS.2025.3546669.
[9] PHILION J and FIDLER S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D[C]. Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 2020: 194–210. doi: 10.1007/978-3-030-58568-6_12.
[10] HUANG Junjie, HUANG Guan, ZHU Zheng, et al. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view[EB/OL]. https://arxiv.org/abs/2112.11790, 2021.
[11] HUANG Junjie and HUANG Guan. BEVDet4D: Exploit temporal cues in multi-camera 3D object detection[EB/OL]. https://arxiv.org/abs/2203.17054, 2022.
[12] LI Yinhao, GE Zheng, YU Guanyi, et al. BEVDepth: Acquisition of reliable depth for multi-view 3D object detection[C]. Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, USA, 2023: 1477–1485. doi: 10.1609/aaai.v37i2.25233.
[13] WANG Yue, GUIZILINI V C, ZHANG Tianyuan, et al. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries[C]. Proceedings of the 5th Conference on Robot Learning, London, UK, 2022: 180–191.
[14] LI Zhiqi, WANG Wenhai, LI Hongyang, et al. BEVFormer: Learning bird's-eye-view representation from LiDAR-camera via spatiotemporal transformers[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, 47(3): 2020–2036. doi: 10.1109/TPAMI.2024.3515454.
[15] ZHU Pengfei, ZHENG Jiayu, DU Dawei, et al. Multi-drone-based single object tracking with agent sharing network[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(10): 4058–4070. doi: 10.1109/TCSVT.2020.3045747.
[16] CAO Yaru, HE Zhijian, WANG Lujia, et al. VisDrone-DET2021: The vision meets drone object detection challenge results[C]. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops, Montreal, Canada, 2021: 2847–2854. doi: 10.1109/ICCVW54120.2021.00319.
[17] YAO Tingting, ZHAO Hengxin, FENG Zihao, et al. A context-aware multiple receptive field fusion network for oriented object detection in remote sensing images[J]. Journal of Electronics & Information Technology, 2025, 47(1): 233–243. doi: 10.11999/JEIT240560.
[18] ZHU Xizhou, SU Weijie, LU Lewei, et al. Deformable DETR: Deformable transformers for end-to-end object detection[C]. 9th International Conference on Learning Representations, Vienna, Austria, 2021.
[19] KINGMA D P and BA J. Adam: A method for stochastic optimization[C]. 3rd International Conference on Learning Representations, San Diego, USA, 2015.
[20] WANG Zhechao, CHENG Peirui, CHEN Mingxin, et al. Drones help drones: A collaborative framework for multi-drone object trajectory prediction and beyond[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 2061.
[21] CHEN Mingxin, WANG Zhirui, WANG Zhechao, et al. C2F-Net: Coarse-to-fine multidrone collaborative perception network for object trajectory prediction[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025, 18: 6314–6328. doi: 10.1109/JSTARS.2025.3541249.
[22] TIAN Pengju, WANG Zhirui, CHENG Peirui, et al. UCDNet: Multi-UAV collaborative 3-D object detection network by reliable feature mapping[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5602016. doi: 10.1109/TGRS.2024.3517594.
[23] CAESAR H, BANKITI V, LANG A H, et al. nuScenes: A multimodal dataset for autonomous driving[C]. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11618–11628. doi: 10.1109/CVPR42600.2020.01164.
[24] LIANG Yan, YANG Huilin, and SHAO Kai. A vehicle-infrastructure cooperative 3D object detection scheme based on adaptive feature selection[J]. Journal of Electronics & Information Technology, 2025, 47(12): 5214–5225. doi: 10.11999/JEIT250601.
[25] HU Yue, FANG Shaoheng, LEI Zixing, et al. Where2comm: Communication-efficient collaborative perception via spatial confidence maps[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 352.