Multi-granularity Text Perception and Hierarchical Feature Interaction Method for Visual Grounding
Abstract: Existing visual grounding methods exhibit notable shortcomings in text-guided target localization and feature fusion: they fail to make full use of textual information, and their overall performance depends excessively on the fusion stage that follows feature extraction. To address this problem, this paper proposes a visual grounding method based on multi-granularity text perception and hierarchical feature interaction (ThiVG). A hierarchical feature interaction module is introduced into the image branch, where textual information is used to enhance text-relevant image features; a multi-granularity text perception module mines the semantic content of the text and generates weighted text with spatial and semantic enhancement. On this basis, a preliminary fusion strategy based on the Hadamard product fuses the weighted text with the image features, providing a finer-grained image representation for cross-modal feature fusion. A Transformer encoder then performs cross-modal feature fusion, and a multilayer perceptron regresses the localization coordinates. Experimental results show that the proposed method achieves significant accuracy gains on five classical visual grounding datasets and overcomes the performance bottleneck caused by traditional methods' over-reliance on the feature fusion module.
Objective: Visual grounding requires effective use of textual information for accurate target localization. Traditional methods primarily emphasize feature fusion but often neglect the guiding role of text, which limits localization accuracy. To address this limitation, a Multi-granularity Text Perception and Hierarchical Feature Interaction method for Visual Grounding (ThiVG) is proposed. In this method, a hierarchical feature interaction module is progressively incorporated into the image encoder to enhance the semantic representation of image features. A multi-granularity text perception module is designed to generate weighted text with spatial and semantic enhancement, and a preliminary Hadamard product–based fusion strategy is applied to refine image features for cross-modal fusion. Experimental results show that the proposed method substantially improves localization accuracy and effectively alleviates the performance bottleneck arising from over-reliance on feature fusion modules in conventional approaches.

Methods: The proposed method comprises an image–text feature extraction network, a hierarchical feature interaction module, a multi-granularity text perception module, and an image–text cross-modal fusion and target localization network (Fig. 1). The image–text feature extraction network includes image and text branches for extracting their respective features (Fig. 2). In the image branch, text features are incorporated into the image encoder through the hierarchical feature interaction module (Fig. 3). This enables textual information to filter and update image features, thereby strengthening their semantic expressiveness. The multi-granularity text perception module employs three perception mechanisms to fully extract spatial and semantic information from the text (Fig. 4). It generates weighted text, which is preliminarily fused with image features through a Hadamard product–based strategy, providing fine-grained image features for subsequent cross-modal fusion. The image–text cross-modal fusion module then deeply integrates image and text features using a Transformer encoder (Fig. 5), capturing their complex relationships. Finally, a Multilayer Perceptron (MLP) performs regression to predict the bounding-box coordinates of the target. The method not only integrates image and text information effectively but also improves accuracy and robustness in visual grounding through hierarchical feature interaction and deep cross-modal fusion, offering a new approach to complex localization problems.

Results and Discussions: Comparison experiments demonstrate that the proposed method achieves substantial accuracy gains across five benchmark visual grounding datasets (Tables 1 and 2), with particularly strong performance on the long-text RefCOCOg dataset. Although the model has a larger parameter count, comparisons of parameter counts and training–inference times indicate that its overall performance still exceeds that of traditional methods (Table 3). Ablation studies further verify the contribution of each key module (Table 4). The hierarchical feature interaction module improves the semantic representation of image features by incorporating textual information into the image encoder (Table 5). The multi-granularity text perception module enhances attention to key textual components through perception mechanisms and adaptive weighting (Table 6). By avoiding excessive modification of the text structure, it markedly strengthens the model's capacity to process long texts and complex sentences. Experiments on the number of encoder layers in the cross-modal fusion module show that a 6-layer deep fusion encoder effectively filters irrelevant background information (Table 7), yielding a more precise feature representation for the localization regression MLP. Generalization tests and visualization analyses further demonstrate that the proposed method maintains high adaptability and accuracy across diverse and challenging localization scenarios (Figs. 6 and 7).

Conclusions: This study proposes a visual grounding algorithm that integrates multi-granularity text perception with hierarchical feature interaction, effectively addressing the under-utilization of textual information and the reliance on a single feature-fusion stage in existing approaches. Key innovations include the hierarchical feature interaction module in the image branch, which markedly enhances the semantic representation of image features; the multi-granularity text perception module, which fully exploits textual information to generate weighted text with spatial and semantic enhancement; and a preliminary Hadamard product–based fusion strategy, which provides fine-grained image representations for cross-modal fusion. Experimental results show that the proposed method achieves substantial accuracy improvements on classical visual grounding datasets and demonstrates strong adaptability and robustness across diverse and complex localization scenarios. Future work will focus on extending the method to more diverse text inputs and further improving localization performance in challenging visual environments.
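To make the text branch described above concrete, the following PyTorch-style sketch illustrates one plausible realization of the multi-granularity text perception module with adaptive text weighting: BERT token features pass through a Bi-GRU, a linear head produces per-token perception scores, directional-word excitation and stop-word suppression biases (defaults taken from the best setting in Table 6) adjust those scores, and the normalized weights re-weight the text features. The class name, hidden sizes, the way the three perception scores are combined, and the mask inputs are illustrative assumptions rather than the paper's exact formulation (Fig. 4).

```python
import torch
import torch.nn as nn


class MultiGranularityTextPerception(nn.Module):
    """Sketch of multi-granularity text perception (MTP) with adaptive text
    weighting (ATW). Hidden sizes, the fusion of the three perception scores,
    and the mask inputs are illustrative assumptions; the bias defaults follow
    the best setting reported in Table 6."""

    def __init__(self, dim=256, delta_d=2.0, delta_s=-2.0):
        super().__init__()
        self.bigru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.perceive = nn.Linear(dim, 3)  # three per-token perception scores
        self.delta_d = delta_d             # directional-word excitation bias
        self.delta_s = delta_s             # stop-word suppression bias

    def forward(self, text_feat, dir_mask, stop_mask):
        # text_feat: (B, L, dim) token features from the text encoder (e.g. BERT)
        # dir_mask, stop_mask: (B, L) float masks marking directional / stop words
        hidden, _ = self.bigru(text_feat)           # (B, L, dim) hidden states
        scores = self.perceive(hidden).sum(dim=-1)  # fuse the three scores
        scores = scores + self.delta_d * dir_mask + self.delta_s * stop_mask
        weights = torch.softmax(scores, dim=-1)     # adaptive token weights
        return text_feat * weights.unsqueeze(-1)    # weighted text (B, L, dim)
```

The weighted text produced by such a module is what the preliminary Hadamard-product fusion consumes before deep cross-modal fusion.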
Algorithm 1 ThiVG
Input: original image ${\boldsymbol{I}}$, original text ${\boldsymbol{T}}$
Output: visual grounding box coordinates $(x,y,h,w)$
(1) ${\mathrm{Function}}\; {\mathrm{ThiVG}}({\boldsymbol{I}},{\boldsymbol{T}})$
(2) // Image and text feature extraction
(3) $ {{\boldsymbol{I}}_{\rm m}} \leftarrow {\boldsymbol{I}},\;{{\boldsymbol{T}}_{\rm m}} \leftarrow {\boldsymbol{T}} $ // generate the masked image and masked text from the originals
(4) $ {{\boldsymbol{F}}_{\rm T}} \leftarrow {\mathrm{BERT}}({\boldsymbol{T}},{{\boldsymbol{T}}_{\rm m}}); $
(5) $ {\boldsymbol{Z}},{{\boldsymbol{Z}}_{\rm m}} \leftarrow {\mathrm{ResNet}}({\boldsymbol{I}},{{\boldsymbol{I}}_{\rm m}}); $
(6) ${\mathrm{for}}\; n = 1 \;{\mathrm{to}}\; 6\; {\mathrm{do}}$
(7)   ${\boldsymbol{F}}_{\mathrm{I}}^n \leftarrow {\mathrm{Transformer}}({\boldsymbol{Z}},{{\boldsymbol{Z}}_{\rm m}},{{\boldsymbol{F}}_{\rm T}});$
(8)   ${\boldsymbol{F}}_{{\mathrm{IT}}}^n \leftarrow {\mathrm{HFI}}({\boldsymbol{F}}_{\mathrm{I}}^n,{{\boldsymbol{F}}_{\rm T}});$ // hierarchical feature interaction
(9)   $ {\boldsymbol{Z}} \leftarrow {\boldsymbol{F}}_{{\mathrm{IT}}}^n; $ // feature update
(10) ${{\boldsymbol{F}}'_{{\mathrm{IT}}}} \leftarrow {\boldsymbol{Z}};$ // text-filtered image features
(11) // Multi-granularity text perception
(12) ${\boldsymbol{F}}_{\rm T}^{{\text{bi-gru}}} \leftarrow {\text{Bi-GRU}}({{\boldsymbol{F}}_{\rm T}});$ // text hidden-state features
(13) ${{\boldsymbol{W}}_{{\mathrm{CDS}}}} \leftarrow {\mathrm{MTP}}({\mathrm{Linear}}({\boldsymbol{F}}_{\rm T}^{{\text{bi-gru}}}));$ // perception weight generation
(14) ${{\boldsymbol{F}}'_{\rm T}} \leftarrow {\mathrm{ATW}}({{\boldsymbol{W}}_{{\mathrm{CDS}}}},{{\boldsymbol{F}}_{\rm T}});$ // adaptive text weighting
(15) ${{\boldsymbol{F}}_u} \leftarrow {\mathrm{Hadamard}}({{\boldsymbol{F}}'_{{\mathrm{IT}}}},{{\boldsymbol{F}}'_{\rm T}});$ // Hadamard product (preliminary fusion)
(16) // Image–text cross-modal fusion (deep fusion)
(17) ${\mathrm{for}}\; n = 1\; {\mathrm{to}}\; 6\; {\mathrm{do}}$
(18)   ${\boldsymbol{F}}_{\mathrm{u}}^n \leftarrow {\mathrm{Transformer}}({{\boldsymbol{F}}_u},{{\boldsymbol{Z}}_{\rm m}},{{\boldsymbol{F}}_{\rm T}});$
(19)   ${\mathrm{REG}} \leftarrow {\boldsymbol{F}}_{\mathrm{u}}^n;$ // update the REG token
(20) Regress the bounding box $(x,y,h,w) \leftarrow {\mathrm{MLP}}({\mathrm{REG}});$
(21) Compute the total loss ${L_{{\mathrm{total}}}} = {L_{{\mathrm{L1}}}} + {L_{{\mathrm{GIOU}}}};$
(22) return $(x,y,h,w)$
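The following PyTorch-style sketch mirrors the control flow of Algorithm 1 under simplifying assumptions: the ResNet and BERT encoders are treated as given (their token features are the inputs), the hierarchical feature interaction is realized as cross-attention with a residual update, and the weighted text is mean-pooled before the Hadamard product so the shapes match. All class and parameter names are hypothetical; the actual modules follow Figs. 3–5, and training minimizes $L_{\mathrm{total}} = L_{\mathrm{L1}} + L_{\mathrm{GIOU}}$ as in step (21).

```python
import torch
import torch.nn as nn


class HFILayer(nn.Module):
    """Assumed realization of hierarchical feature interaction: image tokens
    are filtered and updated by attending to the text features."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        attended, _ = self.cross_attn(img_tokens, text_tokens, text_tokens)
        return self.norm(img_tokens + attended)  # residual text-guided update


class ThiVGSketch(nn.Module):
    """Minimal sketch of Algorithm 1: six visual encoder layers interleaved
    with HFI, Hadamard preliminary fusion, a six-layer cross-modal fusion
    encoder with a learnable [REG] token, and an MLP regressing (x, y, h, w)."""

    def __init__(self, dim=256, heads=8, num_layers=6):
        super().__init__()

        def enc_layer():
            return nn.TransformerEncoderLayer(dim, heads,
                                              dim_feedforward=4 * dim,
                                              batch_first=True)

        self.visual_layers = nn.ModuleList([enc_layer() for _ in range(num_layers)])
        self.hfi_layers = nn.ModuleList([HFILayer(dim, heads) for _ in range(num_layers)])
        self.fusion = nn.TransformerEncoder(enc_layer(), num_layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, img_tokens, weighted_text):
        # img_tokens: (B, N, dim) flattened backbone features
        # weighted_text: (B, L, dim) output of the text perception module
        z = img_tokens
        for visual, hfi in zip(self.visual_layers, self.hfi_layers):
            z = visual(z)              # visual self-attention
            z = hfi(z, weighted_text)  # text-guided filtering and update (HFI)
        # Preliminary fusion: Hadamard product with a pooled text vector
        fused = z * weighted_text.mean(dim=1, keepdim=True)
        # Deep cross-modal fusion with a learnable [REG] token
        reg = self.reg_token.expand(fused.size(0), -1, -1)
        tokens = self.fusion(torch.cat([reg, fused, weighted_text], dim=1))
        return self.mlp(tokens[:, 0])  # predicted box (B, 4) = (x, y, h, w)
```

For example, a forward pass with `img_tokens` of shape (2, 400, 256) and `weighted_text` of shape (2, 20, 256) returns a (2, 4) box tensor, which would be supervised with the L1 and GIoU losses of step (21).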
Table 1 Comparison experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets (%)

| Method type | Model | Backbone | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Two-stage | CMN[3] | VGG16 | - | 71.03 | 65.77 | - | 54.32 | 47.76 | 57.47 | - | - |
| | MAttNet[4] | ResNet101 | 76.65 | 71.14 | 69.99 | 65.33 | 71.62 | 56.02 | - | 66.58 | 67.27 |
| | DGA[5] | VGG16 | - | 78.42 | 65.53 | - | 69.07 | 51.99 | - | - | 63.28 |
| | Ref-NMS[6] | ResNet101 | 80.70 | 84.00 | 76.04 | 68.25 | 73.68 | 59.42 | - | 70.55 | 70.62 |
| One-stage | FAOA[9] | DarkNet53 | 72.54 | 74.35 | 68.50 | 56.81 | 60.23 | 49.60 | 56.12 | 61.33 | 60.36 |
| | RCCF[10] | DLA34 | - | 81.06 | 71.85 | - | 70.35 | 56.32 | - | - | 65.73 |
| | ReSC-Large[11] | DarkNet53 | 77.63 | 80.45 | 72.30 | 63.59 | 68.36 | 56.81 | 63.12 | 67.30 | 67.20 |
| | LBYL-Net[12] | DarkNet53 | 79.63 | 82.91 | 74.15 | 68.64 | 73.38 | 59.48 | 62.70 | - | - |
| One-stage Transformer | TransVG[14] | ResNet101 | 81.02 | 82.72 | 78.35 | 64.82 | 70.70 | 56.94 | 67.02 | 68.67 | 67.73 |
| | VLTVG[15] | ResNet101 | 84.77 | 87.24 | 80.49 | 74.19 | 78.93 | 65.17 | 72.98 | 76.04 | 74.18 |
| | VGTR[16] | ResNet101 | 79.30 | 82.16 | 74.38 | 64.40 | 70.85 | 55.84 | 64.05 | 66.83 | 67.28 |
| | CMI[17] | ResNet101 | 81.92 | 83.40 | 77.37 | 68.49 | 72.18 | 60.30 | 68.39 | 69.08 | 69.04 |
| | TransCP[18] | ResNet50 | 84.25 | 87.38 | 79.78 | 73.07 | 78.05 | 63.35 | 72.60 | - | - |
| | ScanFormer[19] | ViLT | 83.40 | 85.86 | 78.81 | 72.96 | 77.57 | 62.50 | - | 74.10 | 74.14 |
| | EEVG[20] | ResNet101 | 82.19 | 85.34 | 77.18 | 71.35 | 76.76 | 60.73 | - | 70.18 | 71.28 |
| | ThiVG (ours) | ResNet50 | 84.78 | 87.68 | 80.04 | 73.94 | 78.55 | 64.55 | 73.93 | 76.14 | 74.88 |
| | ThiVG (ours) | ResNet101 | 85.01 | 87.96 | 80.47 | 74.44 | 79.21 | 65.27 | 74.41 | 77.01 | 75.21 |

Note: red, green, and blue numbers mark the top three values in the original typeset table; "-" indicates that no result was reported.
Table 2 Comparison experiments on the ReferItGame and Flickr30k Entities datasets (%)

| Method type | Model | Backbone | ReferItGame test | Flickr30k Entities test |
|---|---|---|---|---|
| Two-stage | CMN[3] | VGG16 | 28.33 | - |
| | MAttNet[4] | ResNet101 | 29.04 | - |
| | CITE[7] | ResNet101 | 35.07 | 61.33 |
| | DDPN[8] | ResNet101 | 63.00 | 73.30 |
| One-stage | FAOA[9] | DarkNet53 | 60.67 | 68.71 |
| | RCCF[10] | DLA34 | 63.79 | - |
| | ReSC-Large[11] | DarkNet53 | 64.60 | 69.28 |
| | LBYL-Net[12] | DarkNet53 | 64.47 | - |
| One-stage Transformer | TransVG[14] | ResNet101 | 70.73 | 79.10 |
| | VLTVG[15] | ResNet101 | 71.98 | 79.84 |
| | VGTR[16] | ResNet101 | - | 75.32 |
| | CMI[17] | ResNet101 | 71.07 | 79.15 |
| | TransCP[18] | ResNet50 | 72.05 | 80.04 |
| | ScanFormer[19] | ViLT | 68.85 | - |
| | LMSVA[21] | ResNet50 | 72.98 | - |
| | ThiVG (ours) | ResNet50 | 73.42 | 80.69 |
| | ThiVG (ours) | ResNet101 | 74.03 | 81.26 |

Note: red, green, and blue numbers mark the top three values in the original typeset table; "-" indicates that no result was reported.

Table 3 Comparison of ThiVG with classical models
Table 4 Ablation results for different combinations of the proposed modules (%)

| Model | HFI | MTP (w/o ATW) | MTP (w/ ATW) | Accuracy (val) | Accuracy (testA) | Accuracy (testB) |
|---|---|---|---|---|---|---|
| ThiVG(1) | | | | 80.23 | 83.37 | 76.38 |
| ThiVG(2) | √ | | | 83.15 | 86.31 | 78.65 |
| ThiVG(3) | | √ | | 81.36 | 84.49 | 77.91 |
| ThiVG(4) | | | √ | 82.07 | 85.21 | 78.45 |
| ThiVG(5) | √ | √ | | 84.17 | 87.34 | 79.61 |
| ThiVG(6) | √ | | √ | 84.78 | 87.68 | 80.04 |

Note: HFI = hierarchical feature interaction; MTP = multi-granularity text perception; ATW = adaptive text weighting.
Table 5 Results for different numbers of hierarchical feature interaction layers (%)

| Number of layers | Accuracy (val) | Accuracy (testA) | Accuracy (testB) |
|---|---|---|---|
| 0 | 82.07 | 85.21 | 78.45 |
| 1 | 83.89 | 86.88 | 79.25 |
| 2 | 84.17 | 87.02 | 79.53 |
| 4 | 84.52 | 87.41 | 79.90 |
| 6 | 84.78 | 87.68 | 80.04 |
Table 6 Results for different perception biases (%)

| Directional-word excitation ($\delta_{\mathrm{D}}$) | Stop-word suppression ($\delta_{\mathrm{S}}$) | w/o ATW (val) | w/o ATW (testA) | w/o ATW (testB) | w/ ATW (val) | w/ ATW (testA) | w/ ATW (testB) |
|---|---|---|---|---|---|---|---|
| 0.5 | –0.5 | 84.03 | 87.21 | 79.49 | 84.57 | 87.42 | 79.78 |
| 1.0 | –1.0 | 84.27 | 87.41 | 79.69 | 84.69 | 87.60 | 79.95 |
| 1.5 | –1.5 | 84.21 | 87.38 | 79.66 | 84.73 | 87.63 | 79.97 |
| 2.0 | –2.0 | 84.17 | 87.34 | 79.61 | 84.78 | 87.68 | 80.04 |
| 2.5 | –2.5 | 83.97 | 87.16 | 79.45 | 84.72 | 87.65 | 80.03 |
Table 7 Results for different numbers of Transformer encoder layers in the image–text cross-modal fusion module (%)

| Encoder layers | Accuracy (val) | Accuracy (testA) | Accuracy (testB) |
|---|---|---|---|
| 1 | 56.21 | 61.27 | 52.63 |
| 2 | 68.74 | 72.03 | 63.37 |
| 4 | 77.52 | 80.25 | 72.91 |
| 6 | 84.78 | 87.68 | 80.04 |
[1] XIAO Linhui, YANG Xiaoshan, LAN Xiangyuan, et al. Towards visual grounding: A survey[EB/OL]. https://arxiv.org/abs/2412.20206, 2024.
[2] LI Yong, WANG Yuanzhi, and CUI Zhen. Decoupled multimodal distilling for emotion recognition[C]. The 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 6631–6640. doi: 10.1109/CVPR52729.2023.00641.
[3] HU Ronghang, ROHRBACH M, ANDREAS J, et al. Modeling relationships in referential expressions with compositional modular networks[C]. The 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4418–4427. doi: 10.1109/CVPR.2017.470.
[4] YU Licheng, LIN Zhe, SHEN Xiaohui, et al. MAttNet: Modular attention network for referring expression comprehension[C]. The 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1307–1315. doi: 10.1109/CVPR.2018.00142.
[5] YANG Sibei, LI Guanbin, and YU Yizhou. Dynamic graph attention for referring expression comprehension[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4643–4652. doi: 10.1109/ICCV.2019.00474.
[6] CHEN Long, MA Wenbo, XIAO Jun, et al. Ref-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding[C]. The 35th AAAI Conference on Artificial Intelligence, 2021: 1036–1044. doi: 10.1609/aaai.v35i2.16188.
[7] PLUMMER B A, KORDAS P, KIAPOUR M H, et al. Conditional image-text embedding networks[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 258–274. doi: 10.1007/978-3-030-01258-8_16.
[8] YU Zhou, YU Jun, XIANG Chenchao, et al. Rethinking diversified and discriminative proposal generation for visual grounding[C]. The 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018: 1114–1120. doi: 10.24963/ijcai.2018/155.
[9] YANG Zhengyuan, GONG Boqing, WANG Liwei, et al. A fast and accurate one-stage approach to visual grounding[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4682–4692. doi: 10.1109/ICCV.2019.00478.
[10] LIAO Yue, LIU Si, LI Guanbin, et al. A real-time cross-modality correlation filtering method for referring expression comprehension[C]. The 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 10877–10886. doi: 10.1109/CVPR42600.2020.01089.
[11] YANG Zhengyuan, CHEN Tianlang, WANG Liwei, et al. Improving one-stage visual grounding by recursive sub-query construction[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 387–404. doi: 10.1007/978-3-030-58568-6_23.
[12] HUANG Binbin, LIAN Dongze, LUO Weixin, et al. Look before you leap: Learning landmark features for one-stage visual grounding[C]. The 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 16883–16892. doi: 10.1109/CVPR46437.2021.01661.
[13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
[14] DENG Jiajun, YANG Zhengyuan, CHEN Tianlang, et al. TransVG: End-to-end visual grounding with transformers[C]. The 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 1749–1759. doi: 10.1109/ICCV48922.2021.00179.
[15] YANG Li, XU Yan, YUAN Chunfeng, et al. Improving visual grounding with visual-linguistic verification and iterative reasoning[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 9489–9498. doi: 10.1109/CVPR52688.2022.00928.
[16] DU Ye, FU Zehua, LIU Qingjie, et al. Visual grounding with transformers[C]. The 2022 IEEE International Conference on Multimedia and Expo, Taipei, China, 2022: 1–6. doi: 10.1109/ICME52920.2022.9859880.
[17] LI Kun, LI Jiaxiu, GUO Dan, et al. Transformer-based visual grounding with cross-modality interaction[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(6): 183. doi: 10.1145/3587251.
[18] TANG Wei, LI Liang, LIU Xuejing, et al. Context disentangling and prototype inheriting for robust visual grounding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(5): 3213–3229. doi: 10.1109/TPAMI.2023.3339628.
[19] SU Wei, MIAO Peihan, DOU Huanzhang, et al. ScanFormer: Referring expression comprehension by iteratively scanning[C]. The 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 13449–13458. doi: 10.1109/CVPR52733.2024.01277.
[20] CHEN Wei, CHEN Long, and WU Yu. An efficient and effective transformer decoder-based framework for multi-task visual grounding[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 125–141. doi: 10.1007/978-3-031-72995-9_8.
[21] YAO Haibo, WANG Lipeng, CAI Chengtao, et al. Language conditioned multi-scale visual attention networks for visual grounding[J]. Image and Vision Computing, 2024, 150: 105242. doi: 10.1016/j.imavis.2024.105242.
[22] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 213–229. doi: 10.1007/978-3-030-58452-8_13.
[23] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. The 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
[24] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
[25] DING Bo, ZHANG Libao, QIN Jian, et al. Zero-shot 3D shape classification based on semantic-enhanced language-image pre-training model[J]. Journal of Electronics & Information Technology, 2024, 46(8): 3314–3323. doi: 10.11999/JEIT231161.
[26] JAIN R, RAI R S, JAIN S, et al. Real time sentiment analysis of natural language using multimedia input[J]. Multimedia Tools and Applications, 2023, 82(26): 41021–41036. doi: 10.1007/s11042-023-15213-3.
[27] HAN Tian, ZHANG Zhu, REN Mingyuan, et al. Text emotion recognition based on XLNet-BiGRU-Att[J]. Electronics, 2023, 12(12): 2704. doi: 10.3390/electronics12122704.
[28] ZHANG Xiangsen, WU Zhongqiang, LIU Ke, et al. Text sentiment classification based on BERT embedding and sliced multi-head self-attention Bi-GRU[J]. Sensors, 2023, 23(3): 1481. doi: 10.3390/s23031481.
[29] YU Jun, ZHANG Donglin, SHU Zhenqiu, et al. Adaptive multi-modal fusion hashing via Hadamard matrix[J]. Applied Intelligence, 2022, 52(15): 17170–17184. doi: 10.1007/s10489-022-03367-w.
[30] YU Licheng, POIRSON P, YANG Shan, et al. Modeling context in referring expressions[C]. The 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 69–85. doi: 10.1007/978-3-319-46475-6_5.
[31] MAO Junhua, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 11–20. doi: 10.1109/CVPR.2016.9.
[32] KAZEMZADEH S, ORDONEZ V, MATTEN M, et al. ReferItGame: Referring to objects in photographs of natural scenes[C]. The 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 787–798. doi: 10.3115/v1/D14-1086.
[33] PLUMMER B A, WANG Liwei, CERVANTES C M, et al. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models[C]. The 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 2641–2649. doi: 10.1109/ICCV.2015.303.