Dynamic Focus and Semantic Prompt Network for Fine-Grained Pest Classification

LIU Changyuan; ZHAO Haijian; WU Haibin

doi:10.11999/JEIT260044


LIU Changyuan, ZHAO Haijian, WU Haibin. Dynamic Focus and Semantic Prompt Network for Fine-Grained Pest Classification[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260044



doi: 10.11999/JEIT260044 cstr: 32379.14.JEIT260044

College of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China

Funds: Scientific and Technological Project of the Department of Transportation of Heilongjiang Province (HJK2024B002)

Received Date: 2026-01-13
Accepted Date: 2026-04-13
Revision Received Date: 2026-04-13

Available Online: 2026-04-30

Abstract

Objective: Agricultural pest images are affected by complex background interference, large appearance differences across morphological stages, diverse shooting angles, and wide scale variations. As a result, existing fine-grained classification models extract features inadequately and adapt poorly to morphological change. To address these challenges, an Agricultural Pest Multi-dimensional Dataset (APMD) covering multiple morphological stages, viewing angles, and object scales is constructed, and a fine-grained pest classification network based on dynamic focus and semantic prompts (DFS-PestNet) is proposed. The network uses a decoupled parallel architecture that combines a main feature stream with a prompt enhancement stream. A Spatial Dependency Perception (SDP) module dynamically focuses on discriminative regions (e.g., pest spots and wing veins) to improve the extraction of subtle local features under complex backgrounds. An Advanced Haptic-Visual Prompting (AHVP) module explicitly integrates category semantics and spatial position information into shallow and middle-level features, improving adaptability to morphological variation across developmental stages. Dual-Branch Saliency Sampling (DSS) adaptively aggregates features of essential pest body parts through learnable prototype components and dual-branch saliency fusion, which improves the recognition of small targets such as tiny pests and early-stage larvae. Experimental results show that the proposed model outperforms baseline and mainstream methods on both public and self-constructed datasets. These results support its effectiveness and application potential in complex agricultural scenarios and provide a technical reference for intelligent pest monitoring and precise control in smart agriculture.

Methods: To address the insufficient classification accuracy of existing models under complex background interference and multi-morphological conditions, the Agricultural Pest Multi-dimensional Dataset (APMD) is first constructed. The dataset spans multiple morphological stages, viewing angles, and scales. It contains 15,680 images covering 58 species, divided into training, validation, and testing sets at a ratio of 7:2:1 (Fig. 1, Table 1), and provides a resource for further research on fine-grained pest classification. The Dynamic Focus and Semantic Prompt Network for Fine-Grained Pest Classification (DFS-PestNet) is then proposed. Within the network, the Spatial Dependency Perception (SDP) module adaptively locates and structurally enhances the key discriminative regions of pests, which mitigates pose variation and complex background interference and yields more accurate fine-grained feature extraction. The Advanced Haptic-Visual Prompting (AHVP) module embeds category semantics and spatial position information, guiding the network to focus consistently on discriminative features across morphological periods and improving robustness to the dramatic morphological changes over the pest life cycle. Finally, Dual-Branch Saliency Sampling (DSS) adaptively aggregates the features of essential pest body parts, which strengthens the recognition of small targets, a known difficulty in fine-grained pest classification.

Results and Discussions: DFS-PestNet is evaluated through multi-dimensional experiments. First, in qualitative visualization, Grad-CAM heatmaps indicate that, whereas the baseline model is easily distracted by complex farmland backgrounds and plant stems, DFS-PestNet suppresses background noise and focuses on fine-grained discriminative parts such as pest heads and antennae (Fig. 6). The advantage is clearest for tiny targets (e.g., leafhopper nymphs) and for pests in different life stages (e.g., Chilo suppressalis hidden within stems). The t-SNE dimensionality reduction results further indicate that the model alleviates feature confusion in multi-morphological scenarios: the high-dimensional features show clearer inter-class separation and tighter intra-class clustering in the two-dimensional projection (Fig. 7). Second, ablation studies validate the complementary effect of the three modules, SDP, AHVP, and DSS (Table 2). Combined, they raise the classification accuracy of the baseline model by 2.21% to 77.24%, with all core evaluation metrics reaching their best values. Hyperparameter experiments identify 6 prompt position tokens and a feature dropout rate of 0.2 as optimal (Fig. 8). This configuration preserves complete semantic expression while balancing the simulation of natural occlusion against model robustness. Finally, in comparisons with mainstream Convolutional Neural Network (CNN) and Transformer architectures, such as Gate-ViT and EST, DFS-PestNet achieves the highest accuracies, 77.24% on the large-scale public dataset IP102 and 98.01% on the self-constructed multi-dimensional dataset APMD (Table 3, Table 4), and it also leads on the other fine-grained classification metrics. At these accuracies, the inference speed reaches 158 frames/s and 164 frames/s, respectively. DFS-PestNet thus combines high classification accuracy with high inference efficiency for pest feature extraction across scales and morphological stages, which supports deployment in practical smart agriculture scenarios.

Conclusions: To address multi-morphological variation and small target recognition in fine-grained pest classification, the multi-dimensional dataset APMD is constructed, and the DFS-PestNet model is proposed based on the MPSA baseline. The SDP module adaptively focuses on pose- and morphology-invariant discriminative features; the AHVP module embeds category semantics and spatial position information into shallow and middle-level layers; and the DSS module adaptively aggregates body part features to improve small target recognition. Experimental results verify the advantage of DFS-PestNet over mainstream models on both the IP102 and APMD datasets across developmental stages, angles, and scales. Future work will explore lightweight model modifications for efficient edge deployment and investigate open-set recognition so that early warnings can be issued for unknown pest categories in complex real-world environments.
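Two of the figures quoted in the abstract can be unpacked with simple arithmetic: the nominal size of each APMD subset implied by the 7:2:1 ratio, and the per-image latency implied by the reported frame rates. The following is a minimal sketch, not the authors' code. The helper names `split_counts` and `latency_ms` are hypothetical, and the per-subset counts actually used in the paper are given in its Table 1, which is not reproduced here and may differ from these nominal values.

```python
def split_counts(total, ratios):
    """Split `total` items by integer `ratios` (largest-remainder rounding)."""
    weight = sum(ratios)
    # Integer share of each part, rounded down.
    counts = [total * r // weight for r in ratios]
    # Items lost to rounding go to the parts with the largest remainders.
    remainders = [(total * r) % weight for r in ratios]
    leftover = total - sum(counts)
    order = sorted(range(len(ratios)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def latency_ms(frames_per_second):
    """Average time per image in milliseconds at a given throughput."""
    return 1000.0 / frames_per_second


# Nominal APMD subset sizes for 15,680 images at 7:2:1 (assumed, see lead-in).
train, val, test = split_counts(15680, (7, 2, 1))
print(train, val, test)  # 10976 3136 1568

# Reported throughput: 158 frames/s (IP102) and 164 frames/s (APMD).
print(round(latency_ms(158), 2), round(latency_ms(164), 2))  # 6.33 6.1
```

At the reported speeds, each image takes roughly 6 ms on average. The abstract does not state the hardware or batch size behind these numbers, so the latency is only an average implied by the throughput.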
Keywords

- Agricultural pest
- Fine-grained classification
- Spatial Dependency Perception (SDP)
- Advanced Haptic-Visual Prompting (AHVP)
- Small target

References

[1]	LU Yanhui, LIU Yang, YANG Xianming, et al. Advances in integrated management of agricultural insect pests in China: 2018-2022[J]. Plant Protection, 2023, 49(5): 145–166. doi: 10.16688/j.zwbh.2023207.
[2]	ZHAO Xueru, LI Hui, HU Xinyi, et al. Survey of automatic identification of field pests based on deep learning[J]. Journal of Image and Signal Processing, 2023, 12(2): 77–88. doi: 10.12677/JISP.2023.122008.
[3]	WU Xiaoping, ZHAN Chi, LAI Yukun, et al. IP102: A large-scale benchmark dataset for insect pest recognition[C]. The 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 8779–8788. doi: 10.1109/CVPR.2019.00899.
[4]	CHEN Lei, LIU Libo, and WANG Xiaoli. A dataset of image-text cross-modal retrieval of Lycium barbarum pests in Ningxia in 2020[J]. China Scientific Data, 2022, 7(3): 1–8. doi: 10.11922/11-6035.nasdc.2021.0058.zh.
[5]	LI Yanfen, WANG Hanxiang, DANG L M, et al. Crop pest recognition in natural scenes using convolutional neural networks[J]. Computers and Electronics in Agriculture, 2020, 169: 105174. doi: 10.1016/j.compag.2019.105174.
[6]	BOLLIS E, PEDRINI H, and AVILA S. Weakly supervised learning guided by activation mapping applied to a novel citrus pest benchmark[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, 2020: 310–319. doi: 10.1109/CVPRW50498.2020.00043.
[7]	FANG Mingwei, TAN Zhiping, TANG Yu, et al. Pest-ConFormer: A hybrid CNN-Transformer architecture for large-scale multi-class crop pest recognition[J]. Expert Systems with Applications, 2024, 255: 124833. doi: 10.1016/j.eswa.2024.124833.
[8]	CHENG Zekai and XIA Wan. Fine-grained image classification on agricultural pest larvae[J]. IOP Conference Series: Earth and Environmental Science, 2021, 792: 012037. doi: 10.1088/1755-1315/792/1/012037.
[9]	AMARATHUNGA D C, RATNAYAKE M N, GRUNDY J, et al. Fine-grained image classification of microscopic insect pest species: Western Flower thrips and Plague thrips[J]. Computers and Electronics in Agriculture, 2022, 203: 107462. doi: 10.1016/j.compag.2022.107462.
[10]	WANG Linfeng, LIU Yong, LI Jiayao, et al. Based on the multi-scale information sharing network of fine-grained attention for agricultural pest detection[J]. PLoS One, 2023, 18(10): e0286732. doi: 10.1371/journal.pone.0286732.
[11]	ZHAO Feng, GENG Miaomiao, LIU Hanqiang, et al. Convolutional neural network and vision Transformer-driven cross-layer multi-scale fusion network for hyperspectral image classification[J]. Journal of Electronics & Information Technology, 2024, 46(5): 2237–2248. doi: 10.11999/JEIT231209.
[12]	WEN Hongli, HU Qinghao, HUANG Liwei, et al. Few-shot remote sensing image classification based on parameter-efficient vision transformer and multimodal guidance[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4689–4703. doi: 10.11999/JEIT250996.
[13]	SONG Wanying, LIU Yuchen, WANG Jie, et al. Entropy-driven adaptive fusion network for scene classification of high-resolution remote sensing images[J/OL]. Journal of Electronics & Information Technology, https://link.cnki.net/urlid/11.4494.TN.20260405.2112.010, 2025.
[14]	HAN Yuantao, ZHANG Cong, ZHAN Xiaoyun, et al. Crossing multiple life stages: Fine-grained classification of agricultural pests[J]. Plant Methods, 2024, 20(1): 191. doi: 10.1186/s13007-024-01317-w.
[15]	WANG Jiahui, XU Qin, JIANG Bo, et al. Multi-granularity part sampling attention for fine-grained visual classification[J]. IEEE Transactions on Image Processing, 2024, 33: 4529–4542. doi: 10.1109/TIP.2024.3441813.
[16]	DAI Jifeng, QI Haozhi, XIONG Yuwen, et al. Deformable convolutional networks[C]. The 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 764–773. doi: 10.1109/ICCV.2017.89.
[17]	KIRKLAND E J. Advanced Computing in Electron Microscopy[M]. 2nd ed. New York: Springer, 2010: 261–263. doi: 10.1007/978-1-4419-6533-2.
[18]	NANDHINI C and BRINDHA M. Visual regenerative fusion network for pest recognition[J]. Neural Computing and Applications, 2024, 36(6): 2867–2882. doi: 10.1007/s00521-023-09173-w.
[19]	MALIK P and PARIDA M K. Classification of insect pest using transfer learning mechanism[C]. The 8th International Conference on Computer Vision and Image Processing, Jammu, India, 2023: 78–89. doi: 10.1007/978-3-031-58535-7_7.
[20]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. The 9th International Conference on Learning Representations, 2021.
[21]	XU Qin, WANG Jiahui, JIANG Bo, et al. Fine-grained visual classification via internal ensemble learning transformer[J]. IEEE Transactions on Multimedia, 2023, 25: 9015–9028. doi: 10.1109/TMM.2023.3244340.
[22]	ZHANG Wenli and SONG Wei. Fine-grained image classification based on feature fusion and ensemble learning[J]. Laser & Optoelectronics Progress, 2024, 61(22): 2237010. doi: 10.3788/LOP240759.
[23]	LIU Honglin, ZHAN Yongzhao, XIA Huifen, et al. Self-supervised transformer-based pre-training method using latent semantic masking auto-encoder for pest and disease classification[J]. Computers and Electronics in Agriculture, 2022, 203: 107448. doi: 10.1016/j.compag.2022.107448.
[24]	LU Xiaowei, WANG Kanqi, WANG Peiyu, et al. Gate-ViT: Gated vision transformer for fine-grained visual classification[C]. The 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 2025: 468–479. doi: 10.1007/978-981-96-8180-8_37.
[25]	LIU Wei and ZHANG Ao. Plant disease detection algorithm based on efficient Swin transformer[J]. Computers, Materials & Continua, 2025, 82(2): 3045–3068. doi: 10.32604/cmc.2024.058640.
[26]	RIOS E A, YUANDA J C, GHANZ V L, et al. Cross-layer cache aggregation for token reduction in ultra-fine-grained image recognition[C]. 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025: 1–5. doi: 10.1109/icassp49660.2025.10890489.
[27]	RIOS E A, HU Minchun, and LAI Bocheng. Global-local similarity for efficient fine-grained image recognition with vision transformers[C]. 2025 IEEE International Symposium on Circuits and Systems (ISCAS), London, UK, 2025: 1–5. doi: 10.1109/ISCAS56072.2025.11043866.