A Spatial-semantic Combine Perception for Infrared UAV Target Tracking

YU Guodong; JIANG Yichun; LIU Yunqing; WANG Yijun; ZHAN Weida; WANG Chunyang; FENG Jianghai; HAN Yueyi

doi:10.11999/JEIT250613

Volume 47 Issue 11

Nov. 2025

Turn off MathJax

Article Contents

Article Navigation > Journal of Electronics & Information Technology > 2025 > 47(11): 4242-4253

YU Guodong, JIANG Yichun, LIU Yunqing, WANG Yijun, ZHAN Weida, WANG Chunyang, FENG Jianghai, HAN Yueyi. A Spatial-semantic Combine Perception for Infrared UAV Target Tracking[J]. Journal of Electronics & Information Technology, 2025, 47(11): 4242-4253. doi: 10.11999/JEIT250613

Citation:

YU Guodong, JIANG Yichun, LIU Yunqing, WANG Yijun, ZHAN Weida, WANG Chunyang, FENG Jianghai, HAN Yueyi. A Spatial-semantic Combine Perception for Infrared UAV Target Tracking[J]. Journal of Electronics & Information Technology, 2025, 47(11): 4242-4253. doi: 10.11999/JEIT250613

Citation:

YU Guodong, JIANG Yichun, LIU Yunqing, WANG Yijun, ZHAN Weida, WANG Chunyang, FENG Jianghai, HAN Yueyi. A Spatial-semantic Combine Perception for Infrared UAV Target Tracking[J]. Journal of Electronics & Information Technology, 2025, 47(11): 4242-4253. doi: 10.11999/JEIT250613

PDF( 6284 KB)

A Spatial-semantic Combine Perception for Infrared UAV Target Tracking

doi: 10.11999/JEIT250613 cstr: 32379.14.JEIT250613

1.
Unit 63869, People’s Liberation Army, Baicheng 137001, China
2.
School of Electronical and Information Engineering, Changchun University of Science and Technology, Changchun 130022, China

Funds: Jilin Province Development and Reform Commission Special Fund for Innovation Capacity Development (2024C021-8)

Received Date: 2025-07-01
Rev Recd Date: 2025-10-13

Available Online: 2025-10-16

Publish Date: 2025-11-10

Abstract

Abstract

Objective In recent years, infrared image-based UAV target tracking technology has attracted widespread attention. In real-world scenarios, infrared UAV target tracking still faces significant challenges due to factors such as complex backgrounds, UAV target deformation, and camera movement. Siamese network-based tracking methods have made breakthroughs in balancing tracking accuracy and efficiency. However, existing approaches rely solely on high-level feature outputs from deep networks to predict target positions, neglecting the effective use of low-level features. This leads to the loss of spatial detail features of infrared UAV targets, severely affecting tracking performance. To efficiently utilize low-level features, some methods have incorporated Feature Pyramid Networks (FPN) into the tracking framework, progressively fusing cross-layer feature maps in a top-down manner, thereby effectively enhancing tracking performance for multi-scale targets. Nevertheless, these methods directly adopt traditional FPN channel reduction operations, which result in significant loss of spatial contextual information and channel semantic information. To address the above issues, a novel infrared UAV target tracking method based on spatial-semantic combine perception is proposed. By capturing spatial multi-scale features and channel semantic information, the proposed approach enhances the model’s capability to track infrared UAV targets in complex backgrounds. Methods The proposed method comprises four main components: a backbone network, multi-scale feature fusion, template-search feature interaction, and a detection head. Initially, template and search images containing infrared UAV targets are input into a weight-sharing backbone network to extract features. Subsequently, an FPN is constructed, within which a Spatial-semantic Combine Attention Module (SCAM) is integrated to efficiently fuse multi-scale features. Finally, a Dual-branch global Feature interaction Module (DFM) is employed to facilitate feature interaction between the template and search branches, and the final tracking results are obtained through the detection head. The proposed SCAM enhances the network’s focus on spatial and semantic information by jointly leveraging spatial and channel attention mechanisms, thereby mitigating the loss of spatial and semantic information in low-level features caused by channel dimensionality reduction in traditional FPN. SCAM primarily consists of two components: the Spatial Multi-scale Attention module (SMA) and the Global-local Channel Semantic Attention module (GCSA). The SMA captures long-range multi-scale dependencies efficiently through axial positional embedding and multi-branch grouped feature extraction, thereby improving the network’s perception of global contextual information. GCSA adopts a dual-branch design to effectively integrate global and local information across feature channels, suppress irrelevant background noise, and enable more rational channel-wise feature weighting. The proposed DFM treats the template branch features as the query source for the search branch and applies global cross-attention to capture more comprehensive features of infrared UAV targets. This enhances the tracking network’s ability to attend to the spatial location and boundary details of infrared UAV targets. Results and Discussions The proposed method has been validated on the infrared UAV benchmark dataset (Anti-UAV). Quantitative analysis (Table 1) demonstrates that, compared to 10 state-of-the-art methods, the proposed approach achieves the highest average normalized precision score of 76.9%, surpassing the second-best method, LGTrack, by 4.4%. Qualitative analysis (Figs. 6～8) further confirms that the proposed method exhibits strong adaptability and robustness when addressing various typical challenges in infrared UAV tracking, such as out of view, distracting objects and complex backgrounds. The collaborative design of the individual modules significantly enhances the model’s ability to perceive and represent small targets and dynamic scenes. In addition, qualitative experiments (Fig. 9) conducted on a self-constructed infrared UAV tracking dataset demonstrate the effectiveness and generalization capability of the proposed method in real-world tracking scenarios. Ablation studies (Tables 2～5) reveal that integrating any individual proposed module consistently improves tracking performance. Conclusions This paper conducts a systematic theoretical analysis and experimental validation addressing the issue of spatial and semantic information loss in infrared UAV target tracking. Focusing on the limitations of existing FPN-based infrared UAV tracking methods, particularly the drawbacks associated with channel reduction in multi-scale low-level features, a novel infrared UAV target tracking method based on spatial-semantic combine perception is proposed which fully leverages the complementary advantages of spatial and channel attention mechanisms. This method enhances the network’s focus on spatial context and critical semantic information, thereby improving overall tracking performance. The following main conclusions are obtained: (1) The proposed SCAM combining SMA and GCSA, where SMA captures spatial long-range feature dependencies through position coordinate embedding and one-dimensional convolution operations, ensuring the acquisition of multi-scale contextual information, while GCSA achieves more comprehensive semantic feature attention by interacting local and global channel features. (2) The designed DFM, which realizes feature interaction between search branch features and template branch features through global cross-attention, enabling the dual-branch features to complement each other and enhancing network tracking performance. (3) Extensive experimental results demonstrate that the proposed algorithm outperforms existing advanced methods in both quantitative evaluation and qualitative analysis, with an average state accuracy of 0.769, success rate of 0.743, and precision of 0.935, achieving more accurate tracking of infrared UAV targets. Although the algorithm in this paper has been optimized in terms of computing resource utilization efficiency, further research is needed on efficient deployment strategies for embedded and mobile devices to improve real-time performance and computing adaptability.
- Target tracking,
- Semantic information,
- Spatial context information,
- Long-range dependence,
- Feature interaction

FullText(HTML)

References(26)

References

[1]	聂伟, 张中洋, 杨小龙, 等. 基于梅尔倒谱系数的无人机探测与识别方法[J]. 电子与信息学报, 2025, 47(4): 1076–1084. doi: 10.11999/JEIT241111. NIE Wei, ZHANG Zhongyang, YANG Xiaolong, et al. Unmanned aerial vehicles detection and recognition method based on Mel frequency cepstral coefficients[J]. Journal of Electronics & Information Technology, 2025, 47(4): 1076–1084. doi: 10.11999/JEIT241111.
[2]	GUO Dongyan, WANG Jun, CUI Ying, et al. SiamCAR: Siamese fully convolutional classification and regression for visual tracking[C]. The 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6268–6276. doi: 10.1109/CVPR42600.2020.00630.
[3]	ZHANG Zhipeng, PENG Houwen, FU Jianlong, et al. Ocean: Object-aware anchor-free tracking[C]. 16th European Conference on Computer Vision – ECCV 2020, Glasgow, UK, 2020: 771–787. doi: 10.1007/978-3-030-58589-1_46.
[4]	侯志强, 王卓, 马素刚, 等. 长时视觉跟踪中基于双模板Siamese结构的目标漂移判定网络[J]. 电子与信息学报, 2024, 46(4): 1458–1467. doi: 10.11999/JEIT230496. HOU Zhiqiang, WANG Zhuo, MA Sugang, et al. Target drift discriminative network based on dual-template Siamese structure in long-term tracking[J]. Journal of Electronics & Information Technology, 2024, 46(4): 1458–1467. doi: 10.11999/JEIT230496.
[5]	JIANG Nan, WANG Kuiran, PENG Xiaoke, et al. Anti-UAV: A large multi-modal benchmark for UAV tracking[J]. arXiv preprint arXiv: 2101.08466, 2021. doi: 10.48550/arXiv.2101.08466.
[6]	GAO Shenyuan, ZHOU Chunluan, MA Chao, et al. AiATrack: Attention in attention for transformer visual tracking[C]. The 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 146–164. doi: 10.1007/978-3-031-20047-2_9.
[7]	YAN Bin, JIANG Yi, SUN Peize, et al. Towards grand unification of object tracking[C]. 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 733–751. doi: 10.1007/978-3-031-19803-8_43.
[8]	计忠平, 王相威, 何志伟, 等. 集成全局局部特征交互与角动量机制的端到端多目标跟踪算法[J]. 电子与信息学报, 2024, 46(9): 3703–3712. doi: 10.11999/JEIT240277. JI Zhongping, WANG Xiangwei, HE Zhiwei, et al. End-to-end multi-object tracking algorithm integrating global local feature interaction and angular momentum mechanism[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3703–3712. doi: 10.11999/JEIT240277.
[9]	FANG Houzhang, WANG Xiaolin, LIAO Zikai, et al. A real-time anti-distractor infrared UAV tracker with channel feature refinement module[C]. The 2021 IEEE/CVF International Conference on Computer Vision Workshops, Montreal, Canada, 2021: 1240. doi: 10.1109/ICCVW54120.2021.00144.
[10]	李华耀, 钟小勇, 杨智能, 等. 结合孪生网络和Transformer的轻量级无人机目标跟踪算法[J]. 电光与控制, 2025, 32(6): 31–37. doi: 10.3969/j.issn.1671-637X.2025.06.005. LI Huayao, ZHONG Xiaoyong, YANG Zhineng, et al. A lightweight UAV tracking algorithm combining Siamese network with Transformer[J]. Electronics Optics & Control, 2025, 32(6): 31–37. doi: 10.3969/j.issn.1671-637X.2025.06.005.
[11]	齐咏生, 姜政廷, 刘利强, 等. SiamMT: 基于自适应特征融合机制的可修正RGBT目标跟踪算法[J]. 控制与决策, 2025, 40(4): 1312–1320. doi: 10.13195/j.kzyjc.2024.0205. QI Yongsheng, JIANG Zhengting, LIU Liqiang, et al. SiamMT: Amodifiable RGBT target tracking algorithm based on adaptive feature fusion mechanism[J]. Control and Decision, 2025, 40(4): 1312–1320. doi: 10.13195/j.kzyjc.2024.0205.
[12]	SHAN Yunxiao, ZHOU Xiaomei, LIU Shanghua, et al. SiamFPN: A deep learning method for accurate and real-time maritime ship tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(1): 315–325. doi: 10.1109/TCSVT.2020.2978194.
[13]	LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]. The 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 936–944. doi: 10.1109/CVPR.2017.106.
[14]	HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
[15]	HU Jie, SHEN Li, and SUN Gang. Squeeze-and-excitation networks[C]. The 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132–7141. doi: 10.1109/CVPR.2018.00745.
[16]	SHEN Zhuoran, ZHANG Mingyuan, ZHAO Haiyu, et al. Efficient attention: Attention with linear complexities[C]. The 2021 IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2021: 3530–3538. doi: 10.1109/WACV48630.2021.00357.
[17]	JOCHER G, STOKEN A, BOROVEC J, et al. “YOLOv5,” https://github.com/ultralytics/yolov5, 2021.
[18]	LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]. The 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2999–3007. doi: 10.1109/ICCV.2017.324.
[19]	REZATOFIGHI H, TSOI N, GWAK J, et al. Generalized intersection over union: A metric and a loss for bounding box regression[C]. The 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 658–666. doi: 10.1109/CVPR.2019.00075.
[20]	HUANG Lianghua, ZHAO Xin, and HUANG Kaiqi. GlobalTrack: A simple and strong baseline for long-term tracking[C]. The Thirty-Fourth AAAI Conference on Artificial Intelligence, Palo Alto, USA, 2020: 11037–11044. doi: 10.1609/aaai.v34i07.6758.
[21]	GAO Shenyuan, ZHOU Chunluan, and ZHANG Jun. Generalized relation modeling for transformer tracking[C]. The 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 18686–18695. doi: 10.1109/CVPR52729.2023.01792.
[22]	YE Botao, CHANG Hong, MA Bingpeng, et al. Joint feature learning and relation modeling for tracking: A one-stream framework[C]. 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 341–357. doi: 10.1007/978-3-031-20047-2_20.
[23]	GU Fengwei, LU Jun, CAI Chengtao, et al. EANTrack: An efficient attention network for visual tracking[J]. IEEE Transactions on Automation Science and Engineering, 2024, 21(4): 5911–5928. doi: 10.1109/TASE.2023.3319676.
[24]	LIU Chang, ZHAO Jie, BO Chunjuan, et al. LGTrack: Exploiting local and global properties for robust visual tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9): 8161–8171. doi: 10.1109/TCSVT.2024.3390054.
[25]	WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module[C]. 15th European Conference on Computer Vision – ECCV, Munich, Germany, 2018: 3–19. doi: 10.1007/978-3-030-01234-2_1.
[26]	HUANG Hejun, CHEN Zuguo, ZOU Ying, et al. Channel prior convolutional attention for medical image segmentation[J]. Computers in Biology and Medicine, 2024, 178: 108784. doi: 10.1016/j.compbiomed.2024.108784.