Citation: CHI Wei, XU Jin. From Touch to Semantics: A Cross-Modal Framework for Zero-Shot Spiking Tactile Object Recognition[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260158

From Touch to Semantics: A Cross-Modal Framework for Zero-Shot Spiking Tactile Object Recognition

doi: 10.11999/JEIT260158 cstr: 32379.14.JEIT260158
Funds: The National Major Scientific Research Instrument Development Project (62427811), the General Program of the National Natural Science Foundation of China (62572008), the Youth Program of the National Natural Science Foundation of China (62403011, 62502025), and the Key Program of the National Natural Science Foundation of China (62332006)
  • Accepted Date: 2026-04-17
  • Rev Recd Date: 2026-04-17
  • Available Online: 2026-04-30
Objective  Tactile perception is essential for robots to understand object properties and enable dexterous interaction. However, tactile data acquisition is costly and difficult to scale, which limits the applicability of conventional supervised learning in open-world scenarios. Zero-shot learning (ZSL) offers a promising alternative by transferring knowledge from seen to unseen categories via semantic representations, yet existing tactile ZSL methods either rely on auxiliary visual information or depend on manually designed attributes, which are often subjective and generalize poorly. Meanwhile, event-based tactile sensors produce sparse, asynchronous spiking signals with rich spatiotemporal dynamics, posing additional challenges for semantic modeling, and systematic studies on zero-shot recognition of such data remain limited. To address these issues, we propose a zero-shot object recognition framework for event-based spiking tactile perception that bridges low-level tactile dynamics and high-level semantics in a scalable manner.

Methods  The proposed framework consists of three key components (Fig. 1): spiking tactile feature extraction, semantic prototype construction, and cross-modal tactile–semantic alignment. First, a biomimetic spiking graph neural network models the raw event-based tactile signals. By integrating leaky integrate-and-fire (LIF) neurons with graph-based message passing, it captures both the temporal firing dynamics and the spatial relationships among tactile sensing units, producing discriminative and biologically interpretable high-level tactile embeddings. Second, instead of relying on manually annotated attributes, large language models (LLMs) generate structured, fine-grained, and extensible tactile attribute descriptions for each object category. These textual descriptions are encoded into continuous semantic vectors, forming class-level semantic prototypes with a consistent dimensionality across categories; this strategy enables flexible semantic expansion and avoids labor-intensive attribute engineering. Third, a bidirectional tactile–semantic alignment mechanism enhances generalization to unseen categories: a forward mapping projects tactile embeddings into the semantic space for classification, a reverse mapping reconstructs tactile features from semantic representations, and a cycle-consistency constraint between the two mappings enforces structural coherence and semantic stability across modalities. The framework is trained on seen categories only, and zero-shot inference is performed by matching the tactile embeddings of unseen samples with their corresponding semantic prototypes in the shared embedding space.
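The abstract describes the tactile encoder only at a high level. As an illustration of the first component, the following is a minimal PyTorch-style sketch of a spiking graph layer that combines message passing over the tactile sensing units with LIF neuron dynamics; the adjacency construction, layer sizes, rate-coded readout, and the omission of a surrogate gradient are simplifying assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LIFGraphLayer(nn.Module):
    """One spiking graph layer: message passing over the taxel graph followed by
    leaky integrate-and-fire (LIF) dynamics. Forward pass only; training such a
    model would additionally require a surrogate gradient for the threshold."""

    def __init__(self, in_dim: int, out_dim: int, tau: float = 2.0, v_th: float = 1.0):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)   # shared per-taxel transform
        self.tau = tau                          # membrane time constant
        self.v_th = v_th                        # firing threshold

    def forward(self, x_seq: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x_seq: (T, N, in_dim) event features for N taxels over T time bins
        # adj:   (N, N) normalized adjacency encoding the sensor layout
        T, N, _ = x_seq.shape
        v = torch.zeros(N, self.lin.out_features, device=x_seq.device)
        spikes = []
        for t in range(T):
            msg = adj @ self.lin(x_seq[t])      # spatial message passing
            v = v + (msg - v) / self.tau        # leaky integration of the input current
            s = (v >= self.v_th).float()        # emit a spike where the threshold is crossed
            v = v * (1.0 - s)                   # hard reset of neurons that fired
            spikes.append(s)
        return torch.stack(spikes)              # (T, N, out_dim) output spike trains

# Toy usage: 8 taxels, 20 time bins, uniform adjacency, sparse binary events.
if __name__ == "__main__":
    T, N, D = 20, 8, 4
    layer = LIFGraphLayer(D, 16)
    adj = torch.ones(N, N) / N
    events = (torch.rand(T, N, D) < 0.2).float()
    out = layer(events, adj)
    tactile_embedding = out.mean(dim=(0, 1))    # simple rate-coded readout, shape (16,)
    print(out.shape, tactile_embedding.shape)
```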
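The second and third components can be sketched in the same spirit. In the snippet below, LLM-generated attribute descriptions are assumed to have already been encoded into fixed-size class prototypes by some sentence encoder (random vectors stand in for them here), and the forward and reverse mappings are plain linear layers tied together by a cycle-consistency term; the loss weighting, temperature, and projection heads are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalAlignment(nn.Module):
    """Forward map (tactile -> semantic) with prototype matching, reverse map
    (semantic -> tactile) for reconstruction, and a cycle-consistency penalty."""

    def __init__(self, tactile_dim: int, semantic_dim: int):
        super().__init__()
        self.fwd = nn.Linear(tactile_dim, semantic_dim)   # tactile embedding -> semantic space
        self.rev = nn.Linear(semantic_dim, tactile_dim)   # semantic space -> tactile embedding

    def loss(self, z_tac, prototypes, labels, lam=0.5, temp=0.1):
        # z_tac: (B, tactile_dim), prototypes: (C, semantic_dim), labels: (B,)
        s_hat = self.fwd(z_tac)
        logits = F.normalize(s_hat, dim=-1) @ F.normalize(prototypes, dim=-1).T
        cls = F.cross_entropy(logits / temp, labels)      # align with the correct class prototype
        cycle = F.mse_loss(self.rev(s_hat), z_tac)        # reconstruct the tactile feature
        return cls + lam * cycle

    @torch.no_grad()
    def predict(self, z_tac, prototypes):
        # Zero-shot inference: nearest semantic prototype in the shared space.
        s_hat = F.normalize(self.fwd(z_tac), dim=-1)
        return (s_hat @ F.normalize(prototypes, dim=-1).T).argmax(dim=-1)

# Toy usage: random stand-ins for tactile embeddings and LLM-derived prototypes.
if __name__ == "__main__":
    B, C, Dt, Ds = 16, 5, 32, 64
    model = BidirectionalAlignment(Dt, Ds)
    z_tac = torch.randn(B, Dt)                  # e.g. pooled output of the spiking encoder
    prototypes = torch.randn(C, Ds)             # would come from encoding attribute text
    labels = torch.randint(0, C, (B,))
    print(model.loss(z_tac, prototypes, labels).item(), model.predict(z_tac, prototypes).shape)
```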
Results and Discussions  The proposed method is evaluated under a strict zero-shot setting on event-based spiking tactile datasets with disjoint seen and unseen class sets. Performance is assessed using mean class accuracy, Top-k accuracy, and a semantic alignment score, and the framework consistently outperforms state-of-the-art tactile ZSL baselines on all metrics. Ablation studies validate each component: removing the spiking graph neural network leads to a notable performance drop, confirming the importance of explicitly modeling spatiotemporal tactile dynamics, while replacing the LLM-generated semantics with manually defined attributes reduces generalization, highlighting the advantage of structured, semantically rich language-driven representations. t-SNE visualization shows that cycle-consistent alignment yields more compact intra-class clusters and clearer inter-class boundaries for unseen categories, and the bidirectional alignment mechanism also improves semantic stability and reduces projection bias. These results indicate that combining biologically inspired spiking models with language-driven semantics offers a robust solution for open-set tactile perception.

Conclusions  This paper presents a novel zero-shot object recognition framework for spiking tactile perception that integrates spiking graph neural networks with LLM-derived semantic representations. The proposed method addresses the limitations of existing tactile ZSL approaches by avoiding reliance on visual data and manual attribute design while effectively modeling the spatiotemporal dynamics of spiking tactile signals. Experimental results demonstrate superior performance under strict zero-shot settings, confirming the effectiveness and robustness of the approach. This work establishes a strong baseline for zero-shot spiking tactile recognition and offers a principled pathway toward open-world tactile cognition in robotic systems. Future work will explore multimodal extensions and real-world robotic deployment under noisy and dynamic sensing conditions.
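As a supplement to the Results above, the two accuracy metrics can be computed with a short, generic routine such as the one below; this is not the paper's evaluation code, and the semantic alignment score is omitted because its exact definition is not given in the abstract.

```python
import torch

def mean_class_accuracy(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    """Average of per-class accuracies, so every unseen class counts equally."""
    accs = []
    for c in range(num_classes):
        mask = target == c
        if mask.any():
            accs.append((pred[mask] == c).float().mean())
    return torch.stack(accs).mean().item()

def topk_accuracy(scores: torch.Tensor, target: torch.Tensor, k: int = 3) -> float:
    """Fraction of samples whose true class is among the k best-matching prototypes."""
    topk = scores.topk(k, dim=-1).indices                     # (B, k)
    return (topk == target.unsqueeze(-1)).any(dim=-1).float().mean().item()

# Toy usage with random prototype-matching scores over 5 unseen classes.
if __name__ == "__main__":
    scores = torch.randn(100, 5)
    target = torch.randint(0, 5, (100,))
    print(mean_class_accuracy(scores.argmax(-1), target, 5), topk_accuracy(scores, target))
```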