Multi-granularity Text Perception and Hierarchical Feature Interaction Method for Visual Grounding
Abstract: Existing visual grounding methods exhibit notable shortcomings in text-guided target localization and feature fusion: they fail to make full use of textual information, and their overall performance depends excessively on the fusion stage that follows feature extraction. To address this problem, this paper proposes a visual grounding method based on multi-granularity text perception and hierarchical feature interaction (ThiVG). A hierarchical feature interaction module is introduced into the image branch, where textual information is used to enhance text-relevant image features; a multi-granularity text perception module mines the semantic content of the text and generates weighted text with spatial and semantic enhancement. On this basis, a preliminary fusion strategy based on the Hadamard product fuses the weighted text with the image features, providing a finer-grained image representation for cross-modal feature fusion. A Transformer encoder then performs cross-modal feature fusion, and a multilayer perceptron regresses the localization coordinates. Experimental results show that the proposed method achieves significant accuracy gains on five classical visual grounding datasets and overcomes the performance bottleneck caused by traditional methods' over-reliance on the feature fusion module.
Objective: Visual grounding requires effective use of textual information for accurate target localization. Traditional methods primarily emphasize feature fusion but often neglect the guiding role of text, which limits localization accuracy. To address this limitation, a Multi-granularity Text Perception and Hierarchical Feature Interaction method for Visual Grounding (ThiVG) is proposed. In this method, a hierarchical feature interaction module is progressively incorporated into the image encoder to enhance the semantic representation of image features. A multi-granularity text perception module is designed to generate weighted text with spatial and semantic enhancement, and a preliminary Hadamard product–based fusion strategy is applied to refine image features for cross-modal fusion. Experimental results show that the proposed method substantially improves localization accuracy and effectively alleviates the performance bottleneck arising from over-reliance on feature fusion modules in conventional approaches.

Methods: The proposed method comprises an image–text feature extraction network, a hierarchical feature interaction module, a multi-granularity text perception module, and an image–text cross-modal fusion and target localization network (Fig. 1). The image–text feature extraction network includes image and text branches for extracting their respective features (Fig. 2). In the image branch, text features are incorporated into the image encoder through the hierarchical feature interaction module (Fig. 3). This enables textual information to filter and update image features, thereby strengthening their semantic expressiveness. The multi-granularity text perception module employs three perception mechanisms to fully extract spatial and semantic information from the text (Fig. 4). It generates weighted text, which is preliminarily fused with image features through a Hadamard product–based strategy, providing fine-grained image features for subsequent cross-modal fusion. The image–text cross-modal fusion module then deeply integrates image and text features using a Transformer encoder (Fig. 5), capturing their complex relationships. Finally, a Multilayer Perceptron (MLP) performs regression to predict the bounding-box coordinates of the target. The method not only integrates image and text information effectively but also improves accuracy and robustness in visual grounding through hierarchical feature interaction and deep cross-modal fusion, offering a new approach to complex localization problems.

Results and Discussions: Comparison experiments demonstrate that the proposed method achieves substantial accuracy gains across five benchmark visual grounding datasets (Tables 1 and 2), with particularly strong performance on the long-text RefCOCOg dataset. Although the model has a larger parameter count, comparisons of parameter counts and training–inference times indicate that its overall performance still exceeds that of traditional methods (Table 3). Ablation studies further verify the contribution of each key module (Table 4). The hierarchical feature interaction module improves the semantic representation of image features by incorporating textual information into the image encoder (Table 5). The multi-granularity text perception module enhances attention to key textual components through perception mechanisms and adaptive weighting (Table 6). By avoiding excessive modification of the text structure, it markedly strengthens the model's capacity to process long texts and complex sentences. Experiments on the number of encoder layers in the cross-modal fusion module show that a 6-layer deep fusion encoder effectively filters irrelevant background information (Table 7), yielding a more precise feature representation for the localization regression MLP. Generalization tests and visualization analyses further demonstrate that the proposed method maintains high adaptability and accuracy across diverse and challenging localization scenarios (Figs. 6 and 7).

Conclusions: This study proposes a visual grounding algorithm that integrates multi-granularity text perception with hierarchical feature interaction, effectively addressing the under-utilization of textual information and the reliance on a single feature-fusion stage in existing approaches. Key innovations include the hierarchical feature interaction module in the image branch, which markedly enhances the semantic representation of image features; the multi-granularity text perception module, which fully exploits textual information to generate weighted text with spatial and semantic enhancement; and a preliminary Hadamard product–based fusion strategy, which provides fine-grained image representations for cross-modal fusion. Experimental results show that the proposed method achieves substantial accuracy improvements on classical visual grounding datasets and demonstrates strong adaptability and robustness across diverse and complex localization scenarios. Future work will focus on extending the method to more diverse text inputs and further improving localization performance in challenging visual environments.
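To make the text branch described above concrete, the following PyTorch-style sketch illustrates one plausible realization of the multi-granularity text perception module with adaptive text weighting: BERT token features pass through a Bi-GRU, a linear head produces per-token perception scores, directional-word excitation and stop-word suppression biases (defaults taken from the best setting in Table 6) adjust those scores, and the normalized weights re-weight the text features. The class name, hidden sizes, the way the three perception scores are combined, and the mask inputs are illustrative assumptions rather than the paper's exact formulation (Fig. 4).

```python
import torch
import torch.nn as nn


class MultiGranularityTextPerception(nn.Module):
    """Sketch of multi-granularity text perception (MTP) with adaptive text
    weighting (ATW). Hidden sizes, the fusion of the three perception scores,
    and the mask inputs are illustrative assumptions; the bias defaults follow
    the best setting reported in Table 6."""

    def __init__(self, dim=256, delta_d=2.0, delta_s=-2.0):
        super().__init__()
        self.bigru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.perceive = nn.Linear(dim, 3)  # three per-token perception scores
        self.delta_d = delta_d             # directional-word excitation bias
        self.delta_s = delta_s             # stop-word suppression bias

    def forward(self, text_feat, dir_mask, stop_mask):
        # text_feat: (B, L, dim) token features from the text encoder (e.g. BERT)
        # dir_mask, stop_mask: (B, L) float masks marking directional / stop words
        hidden, _ = self.bigru(text_feat)           # (B, L, dim) hidden states
        scores = self.perceive(hidden).sum(dim=-1)  # fuse the three scores
        scores = scores + self.delta_d * dir_mask + self.delta_s * stop_mask
        weights = torch.softmax(scores, dim=-1)     # adaptive token weights
        return text_feat * weights.unsqueeze(-1)    # weighted text (B, L, dim)
```

The weighted text produced by such a module is what the preliminary Hadamard-product fusion consumes before deep cross-modal fusion.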
Algorithm 1 ThiVG
Input: original image ${\boldsymbol{I}}$, original text ${\boldsymbol{T}}$
Output: visual grounding box coordinates $(x,y,h,w)$
(1) ${\mathrm{Function}}\; {\mathrm{ThiVG}}({\boldsymbol{I}},{\boldsymbol{T}})$
(2) // Image and text feature extraction
(3) $ {{\boldsymbol{I}}_{\rm m}} \leftarrow {\boldsymbol{I}},\;{{\boldsymbol{T}}_{\rm m}} \leftarrow {\boldsymbol{T}} $ // generate the masked image and masked text from the originals
(4) $ {{\boldsymbol{F}}_{\rm T}} \leftarrow {\mathrm{BERT}}({\boldsymbol{T}},{{\boldsymbol{T}}_{\rm m}}); $
(5) $ {\boldsymbol{Z}},{{\boldsymbol{Z}}_{\rm m}} \leftarrow {\mathrm{ResNet}}({\boldsymbol{I}},{{\boldsymbol{I}}_{\rm m}}); $
(6) ${\mathrm{for}}\; n = 1 \;{\mathrm{to}}\; 6\; {\mathrm{do}}$
(7)   ${\boldsymbol{F}}_{\mathrm{I}}^n \leftarrow {\mathrm{Transformer}}({\boldsymbol{Z}},{{\boldsymbol{Z}}_{\rm m}},{{\boldsymbol{F}}_{\rm T}});$
(8)   ${\boldsymbol{F}}_{{\mathrm{IT}}}^n \leftarrow {\mathrm{HFI}}({\boldsymbol{F}}_{\mathrm{I}}^n,{{\boldsymbol{F}}_{\rm T}});$ // hierarchical feature interaction
(9)   $ {\boldsymbol{Z}} \leftarrow {\boldsymbol{F}}_{{\mathrm{IT}}}^n; $ // feature update
(10) ${{\boldsymbol{F}}'_{{\mathrm{IT}}}} \leftarrow {\boldsymbol{Z}};$ // text-filtered image features
(11) // Multi-granularity text perception
(12) ${\boldsymbol{F}}_{\rm T}^{{\text{bi-gru}}} \leftarrow {\text{Bi-GRU}}({{\boldsymbol{F}}_{\rm T}});$ // text hidden-state features
(13) ${{\boldsymbol{W}}_{{\mathrm{CDS}}}} \leftarrow {\mathrm{MTP}}({\mathrm{Linear}}({\boldsymbol{F}}_{\rm T}^{{\text{bi-gru}}}));$ // perception weight generation
(14) ${{\boldsymbol{F}}'_{\rm T}} \leftarrow {\mathrm{ATW}}({{\boldsymbol{W}}_{{\mathrm{CDS}}}},{{\boldsymbol{F}}_{\rm T}});$ // adaptive text weighting
(15) ${{\boldsymbol{F}}_u} \leftarrow {\mathrm{Hadamard}}({{\boldsymbol{F}}'_{{\mathrm{IT}}}},{{\boldsymbol{F}}'_{\rm T}});$ // Hadamard product (preliminary fusion)
(16) // Image–text cross-modal fusion (deep fusion)
(17) ${\mathrm{for}}\; n = 1\; {\mathrm{to}}\; 6\; {\mathrm{do}}$
(18)   ${\boldsymbol{F}}_{\mathrm{u}}^n \leftarrow {\mathrm{Transformer}}({{\boldsymbol{F}}_u},{{\boldsymbol{Z}}_{\rm m}},{{\boldsymbol{F}}_{\rm T}});$
(19)   ${\mathrm{REG}} \leftarrow {\boldsymbol{F}}_{\mathrm{u}}^n;$ // update the REG token
(20) Regress the bounding box $(x,y,h,w) \leftarrow {\mathrm{MLP}}({\mathrm{REG}});$
(21) Compute the total loss ${L_{{\mathrm{total}}}} = {L_{{\mathrm{L1}}}} + {L_{{\mathrm{GIOU}}}};$
(22) return $(x,y,h,w)$
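The following PyTorch-style sketch mirrors the control flow of Algorithm 1 under simplifying assumptions: the ResNet and BERT encoders are treated as given (their token features are the inputs), the hierarchical feature interaction is realized as cross-attention with a residual update, and the weighted text is mean-pooled before the Hadamard product so the shapes match. All class and parameter names are hypothetical; the actual modules follow Figs. 3–5, and training minimizes $L_{\mathrm{total}} = L_{\mathrm{L1}} + L_{\mathrm{GIOU}}$ as in step (21).

```python
import torch
import torch.nn as nn


class HFILayer(nn.Module):
    """Assumed realization of hierarchical feature interaction: image tokens
    are filtered and updated by attending to the text features."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        attended, _ = self.cross_attn(img_tokens, text_tokens, text_tokens)
        return self.norm(img_tokens + attended)  # residual text-guided update


class ThiVGSketch(nn.Module):
    """Minimal sketch of Algorithm 1: six visual encoder layers interleaved
    with HFI, Hadamard preliminary fusion, a six-layer cross-modal fusion
    encoder with a learnable [REG] token, and an MLP regressing (x, y, h, w)."""

    def __init__(self, dim=256, heads=8, num_layers=6):
        super().__init__()

        def enc_layer():
            return nn.TransformerEncoderLayer(dim, heads,
                                              dim_feedforward=4 * dim,
                                              batch_first=True)

        self.visual_layers = nn.ModuleList([enc_layer() for _ in range(num_layers)])
        self.hfi_layers = nn.ModuleList([HFILayer(dim, heads) for _ in range(num_layers)])
        self.fusion = nn.TransformerEncoder(enc_layer(), num_layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, img_tokens, weighted_text):
        # img_tokens: (B, N, dim) flattened backbone features
        # weighted_text: (B, L, dim) output of the text perception module
        z = img_tokens
        for visual, hfi in zip(self.visual_layers, self.hfi_layers):
            z = visual(z)              # visual self-attention
            z = hfi(z, weighted_text)  # text-guided filtering and update (HFI)
        # Preliminary fusion: Hadamard product with a pooled text vector
        fused = z * weighted_text.mean(dim=1, keepdim=True)
        # Deep cross-modal fusion with a learnable [REG] token
        reg = self.reg_token.expand(fused.size(0), -1, -1)
        tokens = self.fusion(torch.cat([reg, fused, weighted_text], dim=1))
        return self.mlp(tokens[:, 0])  # predicted box (B, 4) = (x, y, h, w)
```

For example, a forward pass with `img_tokens` of shape (2, 400, 256) and `weighted_text` of shape (2, 20, 256) returns a (2, 4) box tensor, which would be supervised with the L1 and GIoU losses of step (21).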
Table 1 Comparison experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets (%)

| Method type | Model | Backbone | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Two-stage | CMN[3] | VGG16 | - | 71.03 | 65.77 | - | 54.32 | 47.76 | 57.47 | - | - |
| | MAttNet[4] | ResNet101 | 76.65 | 71.14 | 69.99 | 65.33 | 71.62 | 56.02 | - | 66.58 | 67.27 |
| | DGA[5] | VGG16 | - | 78.42 | 65.53 | - | 69.07 | 51.99 | - | - | 63.28 |
| | Ref-NMS[6] | ResNet101 | 80.70 | 84.00 | 76.04 | 68.25 | 73.68 | 59.42 | - | 70.55 | 70.62 |
| One-stage | FAOA[9] | DarkNet53 | 72.54 | 74.35 | 68.50 | 56.81 | 60.23 | 49.60 | 56.12 | 61.33 | 60.36 |
| | RCCF[10] | DLA34 | - | 81.06 | 71.85 | - | 70.35 | 56.32 | - | - | 65.73 |
| | ReSC-Large[11] | DarkNet53 | 77.63 | 80.45 | 72.30 | 63.59 | 68.36 | 56.81 | 63.12 | 67.30 | 67.20 |
| | LBYL-Net[12] | DarkNet53 | 79.63 | 82.91 | 74.15 | 68.64 | 73.38 | 59.48 | 62.70 | - | - |
| One-stage Transformer | TransVG[14] | ResNet101 | 81.02 | 82.72 | 78.35 | 64.82 | 70.70 | 56.94 | 67.02 | 68.67 | 67.73 |
| | VLTVG[15] | ResNet101 | 84.77 | 87.24 | 80.49 | 74.19 | 78.93 | 65.17 | 72.98 | 76.04 | 74.18 |
| | VGTR[16] | ResNet101 | 79.30 | 82.16 | 74.38 | 64.40 | 70.85 | 55.84 | 64.05 | 66.83 | 67.28 |
| | CMI[17] | ResNet101 | 81.92 | 83.40 | 77.37 | 68.49 | 72.18 | 60.30 | 68.39 | 69.08 | 69.04 |
| | TransCP[18] | ResNet50 | 84.25 | 87.38 | 79.78 | 73.07 | 78.05 | 63.35 | 72.60 | - | - |
| | ScanFormer[19] | ViLT | 83.40 | 85.86 | 78.81 | 72.96 | 77.57 | 62.50 | - | 74.10 | 74.14 |
| | EEVG[20] | ResNet101 | 82.19 | 85.34 | 77.18 | 71.35 | 76.76 | 60.73 | - | 70.18 | 71.28 |
| | ThiVG (ours) | ResNet50 | 84.78 | 87.68 | 80.04 | 73.94 | 78.55 | 64.55 | 73.93 | 76.14 | 74.88 |
| | ThiVG (ours) | ResNet101 | 85.01 | 87.96 | 80.47 | 74.44 | 79.21 | 65.27 | 74.41 | 77.01 | 75.21 |

Note: red, green, and blue numbers mark the top three values in the original typeset table; "-" indicates that no result was reported.
Table 2 Comparison experiments on the ReferItGame and Flickr30k Entities datasets (%)

| Method type | Model | Backbone | ReferItGame test | Flickr30k Entities test |
|---|---|---|---|---|
| Two-stage | CMN[3] | VGG16 | 28.33 | - |
| | MAttNet[4] | ResNet101 | 29.04 | - |
| | CITE[7] | ResNet101 | 35.07 | 61.33 |
| | DDPN[8] | ResNet101 | 63.00 | 73.30 |
| One-stage | FAOA[9] | DarkNet53 | 60.67 | 68.71 |
| | RCCF[10] | DLA34 | 63.79 | - |
| | ReSC-Large[11] | DarkNet53 | 64.60 | 69.28 |
| | LBYL-Net[12] | DarkNet53 | 64.47 | - |
| One-stage Transformer | TransVG[14] | ResNet101 | 70.73 | 79.10 |
| | VLTVG[15] | ResNet101 | 71.98 | 79.84 |
| | VGTR[16] | ResNet101 | - | 75.32 |
| | CMI[17] | ResNet101 | 71.07 | 79.15 |
| | TransCP[18] | ResNet50 | 72.05 | 80.04 |
| | ScanFormer[19] | ViLT | 68.85 | - |
| | LMSVA[21] | ResNet50 | 72.98 | - |
| | ThiVG (ours) | ResNet50 | 73.42 | 80.69 |
| | ThiVG (ours) | ResNet101 | 74.03 | 81.26 |

Note: red, green, and blue numbers mark the top three values in the original typeset table; "-" indicates that no result was reported.

Table 3 Comparison of ThiVG with classical models
Table 4 Ablation results for different combinations of the proposed modules (%)

| Model | HFI | MTP (w/o ATW) | MTP (w/ ATW) | Accuracy (val) | Accuracy (testA) | Accuracy (testB) |
|---|---|---|---|---|---|---|
| ThiVG(1) | | | | 80.23 | 83.37 | 76.38 |
| ThiVG(2) | √ | | | 83.15 | 86.31 | 78.65 |
| ThiVG(3) | | √ | | 81.36 | 84.49 | 77.91 |
| ThiVG(4) | | | √ | 82.07 | 85.21 | 78.45 |
| ThiVG(5) | √ | √ | | 84.17 | 87.34 | 79.61 |
| ThiVG(6) | √ | | √ | 84.78 | 87.68 | 80.04 |

Note: HFI = hierarchical feature interaction; MTP = multi-granularity text perception; ATW = adaptive text weighting.
Table 5 Results for different numbers of hierarchical feature interaction layers (%)

| Number of layers | Accuracy (val) | Accuracy (testA) | Accuracy (testB) |
|---|---|---|---|
| 0 | 82.07 | 85.21 | 78.45 |
| 1 | 83.89 | 86.88 | 79.25 |
| 2 | 84.17 | 87.02 | 79.53 |
| 4 | 84.52 | 87.41 | 79.90 |
| 6 | 84.78 | 87.68 | 80.04 |
Table 6 Results for different perception biases (%)

| Directional-word excitation ($\delta_{\mathrm{D}}$) | Stop-word suppression ($\delta_{\mathrm{S}}$) | w/o ATW (val) | w/o ATW (testA) | w/o ATW (testB) | w/ ATW (val) | w/ ATW (testA) | w/ ATW (testB) |
|---|---|---|---|---|---|---|---|
| 0.5 | –0.5 | 84.03 | 87.21 | 79.49 | 84.57 | 87.42 | 79.78 |
| 1.0 | –1.0 | 84.27 | 87.41 | 79.69 | 84.69 | 87.60 | 79.95 |
| 1.5 | –1.5 | 84.21 | 87.38 | 79.66 | 84.73 | 87.63 | 79.97 |
| 2.0 | –2.0 | 84.17 | 87.34 | 79.61 | 84.78 | 87.68 | 80.04 |
| 2.5 | –2.5 | 83.97 | 87.16 | 79.45 | 84.72 | 87.65 | 80.03 |
Table 7 Results for different numbers of Transformer encoder layers in the image–text cross-modal fusion module (%)

| Encoder layers | Accuracy (val) | Accuracy (testA) | Accuracy (testB) |
|---|---|---|---|
| 1 | 56.21 | 61.27 | 52.63 |
| 2 | 68.74 | 72.03 | 63.37 |
| 4 | 77.52 | 80.25 | 72.91 |
| 6 | 84.78 | 87.68 | 80.04 |
[1] XIAO Linhui, YANG Xiaoshan, LAN Xiangyuan, et al. Towards visual grounding: A survey[EB/OL]. https://arxiv.org/abs/2412.20206, 2024.
[2] LI Yong, WANG Yuanzhi, and CUI Zhen. Decoupled multimodal distilling for emotion recognition[C]. The 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 6631–6640. doi: 10.1109/CVPR52729.2023.00641.
[3] HU Ronghang, ROHRBACH M, ANDREAS J, et al. Modeling relationships in referential expressions with compositional modular networks[C]. The 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4418–4427. doi: 10.1109/CVPR.2017.470.
[4] YU Licheng, LIN Zhe, SHEN Xiaohui, et al. MAttNet: Modular attention network for referring expression comprehension[C]. The 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1307–1315. doi: 10.1109/CVPR.2018.00142.
[5] YANG Sibei, LI Guanbin, and YU Yizhou. Dynamic graph attention for referring expression comprehension[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4643–4652. doi: 10.1109/ICCV.2019.00474.
[6] CHEN Long, MA Wenbo, XIAO Jun, et al. Ref-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding[C]. The 35th AAAI Conference on Artificial Intelligence, 2021: 1036–1044. doi: 10.1609/aaai.v35i2.16188.
[7] PLUMMER B A, KORDAS P, KIAPOUR M H, et al. Conditional image-text embedding networks[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 258–274. doi: 10.1007/978-3-030-01258-8_16.
[8] YU Zhou, YU Jun, XIANG Chenchao, et al. Rethinking diversified and discriminative proposal generation for visual grounding[C]. The 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018: 1114–1120. doi: 10.24963/ijcai.2018/155.
[9] YANG Zhengyuan, GONG Boqing, WANG Liwei, et al. A fast and accurate one-stage approach to visual grounding[C]. The 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), 2019: 4682–4692. doi: 10.1109/ICCV.2019.00478.
[10] LIAO Yue, LIU Si, LI Guanbin, et al. A real-time cross-modality correlation filtering method for referring expression comprehension[C]. The 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 10877–10886. doi: 10.1109/CVPR42600.2020.01089.
[11] YANG Zhengyuan, CHEN Tianlang, WANG Liwei, et al. Improving one-stage visual grounding by recursive sub-query construction[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 387–404. doi: 10.1007/978-3-030-58568-6_23.
[12] HUANG Binbin, LIAN Dongze, LUO Weixin, et al. Look before you leap: Learning landmark features for one-stage visual grounding[C]. The 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 16883–16892. doi: 10.1109/CVPR46437.2021.01661.
[13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
[14] DENG Jiajun, YANG Zhengyuan, CHEN Tianlang, et al. TransVG: End-to-end visual grounding with transformers[C]. The 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 1749–1759. doi: 10.1109/ICCV48922.2021.00179.
[15] YANG Li, XU Yan, YUAN Chunfeng, et al. Improving visual grounding with visual-linguistic verification and iterative reasoning[C]. The 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 9489–9498. doi: 10.1109/CVPR52688.2022.00928.
[16] DU Ye, FU Zehua, LIU Qingjie, et al. Visual grounding with transformers[C]. The 2022 IEEE International Conference on Multimedia and Expo, Taipei, China, 2022: 1–6. doi: 10.1109/ICME52920.2022.9859880.
[17] LI Kun, LI Jiaxiu, GUO Dan, et al. Transformer-based visual grounding with cross-modality interaction[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(6): 183. doi: 10.1145/3587251.
[18] TANG Wei, LI Liang, LIU Xuejing, et al. Context disentangling and prototype inheriting for robust visual grounding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(5): 3213–3229. doi: 10.1109/TPAMI.2023.3339628.
[19] SU Wei, MIAO Peihan, DOU Huanzhang, et al. ScanFormer: Referring expression comprehension by iteratively scanning[C]. The 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 13449–13458. doi: 10.1109/CVPR52733.2024.01277.
[20] CHEN Wei, CHEN Long, and WU Yu. An efficient and effective transformer decoder-based framework for multi-task visual grounding[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 125–141. doi: 10.1007/978-3-031-72995-9_8.
[21] YAO Haibo, WANG Lipeng, CAI Chengtao, et al. Language conditioned multi-scale visual attention networks for visual grounding[J]. Image and Vision Computing, 2024, 150: 105242. doi: 10.1016/j.imavis.2024.105242.
[22] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 213–229. doi: 10.1007/978-3-030-58452-8_13.
[23] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. The 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
[24] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
[25] DING Bo, ZHANG Libao, QIN Jian, et al. Zero-shot 3D shape classification based on semantic-enhanced language-image pre-training model[J]. Journal of Electronics & Information Technology, 2024, 46(8): 3314–3323. doi: 10.11999/JEIT231161.
[26] JAIN R, RAI R S, JAIN S, et al. Real time sentiment analysis of natural language using multimedia input[J]. Multimedia Tools and Applications, 2023, 82(26): 41021–41036. doi: 10.1007/s11042-023-15213-3.
[27] HAN Tian, ZHANG Zhu, REN Mingyuan, et al. Text emotion recognition based on XLNet-BiGRU-Att[J]. Electronics, 2023, 12(12): 2704. doi: 10.3390/electronics12122704.
[28] ZHANG Xiangsen, WU Zhongqiang, LIU Ke, et al. Text sentiment classification based on BERT embedding and sliced multi-head self-attention Bi-GRU[J]. Sensors, 2023, 23(3): 1481. doi: 10.3390/s23031481.
[29] YU Jun, ZHANG Donglin, SHU Zhenqiu, et al. Adaptive multi-modal fusion hashing via Hadamard matrix[J]. Applied Intelligence, 2022, 52(15): 17170–17184. doi: 10.1007/s10489-022-03367-w.
[30] YU Licheng, POIRSON P, YANG Shan, et al. Modeling context in referring expressions[C]. The 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 69–85. doi: 10.1007/978-3-319-46475-6_5.
[31] MAO Junhua, HUANG J, TOSHEV A, et al. Generation and comprehension of unambiguous object descriptions[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 11–20. doi: 10.1109/CVPR.2016.9.
[32] KAZEMZADEH S, ORDONEZ V, MATTEN M, et al. ReferItGame: Referring to objects in photographs of natural scenes[C]. The 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 787–798. doi: 10.3115/v1/D14-1086.
[33] PLUMMER B A, WANG Liwei, CERVANTES C M, et al. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models[C]. The 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 2641–2649. doi: 10.1109/ICCV.2015.303.