高级搜索

留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

LLMA-GCN:语义增强的层次化时空图卷积动作识别网络

贾桂敏 周熹龙

贾桂敏, 周熹龙. LLMA-GCN:语义增强的层次化时空图卷积动作识别网络[J]. 电子与信息学报. doi: 10.11999/JEIT260154
引用本文: 贾桂敏, 周熹龙. LLMA-GCN:语义增强的层次化时空图卷积动作识别网络[J]. 电子与信息学报. doi: 10.11999/JEIT260154
JIA Guimin, ZHOU Xilong. LLMA-GCN: Semantic-Enhanced Hierarchical Spatiotemporal Graph Convolutional Network for Action Recognition[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260154
Citation: JIA Guimin, ZHOU Xilong. LLMA-GCN: Semantic-Enhanced Hierarchical Spatiotemporal Graph Convolutional Network for Action Recognition[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT260154

LLMA-GCN:语义增强的层次化时空图卷积动作识别网络

doi: 10.11999/JEIT260154 cstr: 32379.14.JEIT260154
基金项目: 天津市自然科学基金联合基金(25JCLMJC00930)
详细信息
    作者简介:

    贾桂敏:女,副教授,硕士生导师,研究方向为模式识别,多模态信息融合认知

    周熹龙:男,硕士,研究方向为多模态特征融合和动作识别

    通讯作者:

    贾桂敏 gmjia@cauc.edu.cn

  • 中图分类号: TN911.73; TP391.41; TP18

LLMA-GCN: Semantic-Enhanced Hierarchical Spatiotemporal Graph Convolutional Network for Action Recognition

  • 摘要: 针对当前骨架动作识别方法在引入大语言模型(LLM)时存在的语义引导与空间拓扑学习脱节、时间建模缺乏层次化语义支持、传统分类范式泛化能力不足的三大问题,本文提出一种基于LLM增强的骨架动作识别新方法(LLMA-GCN)。该方法首先提出一种基于LLM的混合图拓扑学习策略,将物理骨架结构先验知识与LLM提取的动作语义信息深度融合以指导空间建模;其次,将混合图拓扑学习策略与层次化视觉序列编码器相结合,通过设计的LLM精炼块实现语义引导下的多尺度时空特征提取;最后,构建了LLM驱动的文本原型引导决策学习机制,实现语义引导的视觉-文本对齐和分类决策联合学习,从而提升模型整体性能。在NTU RGB+D 60、NTU RGB+D 120以及PKU-MMD公开数据集上进行了大量对比实验及消融实验,证明了新方法的有效性和先进性。
  • 图  1  LLMA-GCN整体结构图

    图  2  层次化视觉序列编码器结构图

    图  3  LRB精炼块结构图

    图  4  邻接矩阵可视化结果

    表  1  动作文本原型提示词配置

    构建维度提示词模板侧重点
    描述性"Describe the action '{action name}': A person is performing {action name}."动作的执行过程与动态流程
    观察性"What does '{action name}' look like? Someone is doing {action name}."动作的视觉表征与外观特征
    陈述性"Action description for '{action name}': The person is {action name}."动作的定义与状态
    下载: 导出CSV

    表  2  三个公开数据集上的对比实验结果(%)

    动作识别模型NTU RGB+D 60NTU RGB+D 120PKU-MMDⅠ
    X-subX-viewX-subX-setX-subX-view
    ST-GCN[8](2018)81.588.370.773.2--
    DS-STGCN[10](2024)93.297.589.491.2--
    DSDC-GCN[22](2024)93.097.189.990.697.6-
    STA-GCN-Transformer[9](2025)86.091.9----
    BlockGCN[27](2024)93.197.090.391.5--
    SMS-GCN[28](2025)92.696.989.390.7--
    TSGCNeXt[29](2025)93.297.090.291.7--
    GAP+CTR-GCN[11](2023)92.997.089.991.1--
    MICA[12](2025)85.390.677.476.093.0-
    CEPCLR[13](2025)86.991.380.581.995.3-
    HS-Rep[14](2025)87.893.778.982.2--
    LLM-AR[15](2024)95.098.488.791.5--
    KEHCN[23](2025)93.597.390.491.8--
    MMNet[24](2025)96.098.892.994.497.498.6
    MMINet[25](2022)94.396.591.792.693.694.2
    TBCNet[26](2025)96.397.191.592.993.893.6
    LLMA-GCN96.597.591.892.598.397.4
    下载: 导出CSV

    表  3  LLMA-GCN框架消融实验结果(%)

    实验设置 M S T X-sub 实验设置 M S T X-sub
    B × × × 90.8 B+M+S × 97.0
    B+M × × 93.9 B+M+T × 97.3
    B+S × × 96.8 B+S+T × 97.8
    B+T × × 94.3 LLMA-GCN 98.3
    下载: 导出CSV

    表  4  混合邻接矩阵中融合权重实验分析结果(%)

    α初始值α固定取值α自动学习
    0.2093.5396.1
    0.3093.2398.3
    0.4092.7195.6
    下载: 导出CSV

    表  5  层次化时间特征提取模块细粒度消融实验结果(%)

    具体设置357911(3,5)(3,7)(5,7)(3,5,7)(5,7,9)(3,5,9)(3,7,11)
    准确率94.5695.1295.2394.2394.0594.2394.9395.1998.3095.0494.9393.57
    下载: 导出CSV

    表  6  模型参数量及计算量分析实验结果

    方法模型参数量(M)GFLOPs方法模型参数量(M)GFLOPs
    DS-STGCN[10]5.647.88TSGCNeXt[29]15.247.05
    DSDC-GCN[22]7.729.42GAP+CTR-GCN[11]8.4834.40
    LLMA-GCN14.6119.37
    下载: 导出CSV
  • [1] GODASE V V. Edge AI for smart surveillance: Real-time human activity recognition on low-power devices[J]. International Journal of AI and Machine Learning Innovations in Electronics and Communication Technology, 2025, 1(1): 29–46. doi: 10.2139/ssrn.5383804.
    [2] 孙中华, 吴双, 贾克斌, 等. 基于对比学习的动作识别研究综述[J]. 电子与信息学报, 2025, 47(8): 2473–2485. doi: 10.11999/JEIT250131.

    SUN Zhonghua, WU Shuang, JIA Kebin, et al. A review on action recognition based on contrastive learning[J]. Journal of Electronics & Information Technology, 2025, 47(8): 2473–2485. doi: 10.11999/JEIT250131.
    [3] SUN Zehua, KE Qiuhong, RAHMANI H, et al. Human action recognition from various data modalities: A review[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(3): 3200–3225. doi: 10.1109/TPAMI.2022.3183112.
    [4] ZHANG Yumin and WANG Yanyong. A comprehensive survey on RGB-D-based human action recognition: Algorithms, datasets, and popular applications[J]. EURASIP Journal on Image and Video Processing, 2025, 2025(1): 15. doi: 10.1186/s13640–025-00677–0.
    [5] XIA Lu, CHEN C C, and AGGARWAL J K. View invariant human action recognition using histograms of 3D joints[C]. 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, USA, 2012: 20–27. doi: 10.1109/CVPRW.2012.6239233.
    [6] HUSSEIN M E, TORKI M, GOWAYYED M A, et al. Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations[C]. Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 2013: 2466–2472.
    [7] KE Qiuhong, BENNAMOUN M, AN Senjian, et al. A new representation of skeleton sequences for 3D action recognition[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4570–4579. doi: 10.1109/CVPR.2017.486.
    [8] YAN Sijie, XIONG Yuanjun, and LIN Dahua. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]. Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018. doi: 10.1609/aaai.v32i1.12328.
    [9] 韩宗旺, 杨涵, 吴世青, 等. 时空自适应图卷积与Transformer结合的动作识别网络[J]. 电子与信息学报, 2024, 46(6): 2587–2595. doi: 10.11999/JEIT230551.

    HAN Zongwang, YANG Han, WU Shiqing, et al. Action recognition network combining spatio-temporal adaptive graph convolution and transformer[J]. Journal of Electronics & Information Technology, 2024, 46(6): 2587–2595. doi: 10.11999/JEIT230551.
    [10] XIE Jianyang, MENG Yanda, ZHAO Yitian, et al. Dynamic semantic-based spatial-temporal graph convolution network for skeleton-based human action recognition[J]. IEEE Transactions on Image Processing, 2024, 33: 6691–6704. doi: 10.1109/TIP.2024.3497837.
    [11] XIANG Wangmeng, LI Chao, ZHOU Yuxuan, et al. Generative action description prompts for skeleton-based action recognition[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 10242–10251. doi: 10.1109/ICCV51070.2023.00943.
    [12] XIN Wentian, TENG Yue, ZHANG Jikang, et al. Modeling the internal and contextual attention for self-supervised skeleton-based action recognition[J]. Sensors, 2025, 25(21): 6532. doi: 10.3390/s25216532.
    [13] XIN Wentian, LIU Yi, FU Xianping, et al. LLMs encounter critical elements prompts: Semantically guided partial supervision skeleton-based action recognition[J]. IEEE Sensors Journal, 2025, 25(10): 17350–17363. doi: 10.1109/JSEN.2025.3556580.
    [14] WANG Hongsong, MA Xiaoyan, KUANG Jidong, et al. Heterogeneous skeleton-based action representation learning[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 19154–19164. doi: 10.1109/CVPR52734.2025.01784.
    [15] QU Haoxuan, CAI Yujun, and LIU Jun. LLMs are good action recognizers[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 18395–18406. doi: 10.1109/CVPR52733.2024.01741.
    [16] YAN Tingbing, ZENG Wenzheng, XIAO Yang, et al. CrossGLG: LLM guides one-shot skeleton-based 3D action recognition in a cross-level manner[C]. Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, 2024: 113–131. doi: 10.1007/978-3-031-72661-3_7.
    [17] REIMERS N and GUREVYCH I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks[C]. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 2019: 3982–3992. doi: 10.18653/v1/D19-1410.
    [18] SHAHROUDY A, LIU Jun, NG T T, et al. NTU RGB+D: A large scale dataset for 3D human activity analysis[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 1010–1019. doi: 10.1109/CVPR.2016.115.
    [19] LIU Jun, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684–2701. doi: 10.1109/TPAMI.2019.2916873.
    [20] LIU Chunhui, HU Yueyu, LI Yanghao, et al. PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding[C]. Proceedings of the Workshop on Visual Analysis in Smart and Connected Communities, Mountain View, USA, 2017: 1–8. doi: 10.1145/3132734.3132739.
    [21] YANG An, LI Anfeng, YANG Baosong, et al. Qwen3 technical report[J]. arXiv preprint arXiv: 2505.09388, 2025. (查阅网上资料, 不确定文献类型及格式是否正确, 请确认).
    [22] ZHUANG Tianming, QIN Zhen, DING Yi, et al. DSDC-GCN: Decoupled static-dynamic co-occurrence graph convolutional networks for skeleton-based action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2025, 35(3): 2101–2117. doi: 10.1109/TCSVT.2024.3491133.
    [23] MA Nan, SUN Beining, HAN Yiheng, et al. Kinematic enhanced hypergraph convolutional network for skeleton-based human action recognition with LLM training guides[C]. Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 2025: 1920–1928. doi: 10.1145/3746027.3755538.
    [24] YU B X B, LIU Yan, ZHANG Xiang, et al. MMNet: A model-based multimodal network for human action recognition in RGB-D videos[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3522–3538. doi: 10.1109/TPAMI.2022.3177813.
    [25] CHENG Qin, LIU Zhen, REN Ziliang, et al. Spatial-temporal information aggregation and cross-modality interactive learning for RGB-D-based human action recognition[J]. IEEE Access, 2022, 10: 104190–104201. doi: 10.1109/ACCESS.2022.3201227.
    [26] YANG Yingyuan, LIANG Guoyuan, WANG Can, et al. Trunk-branch contrastive network with multi-view deformable aggregation for multi-view action recognition[J]. Pattern Recognition, 2026, 169: 111923. doi: 10.1016/j.patcog.2025.111923.
    [27] ZHOU Yuxuan, YAN Xudong, CHENG Zhiqi, et al. BlockGCN: Redefine topology awareness for skeleton-based action recognition[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 2049–2058. doi: 10.1109/CVPR52733.2024.00200.
    [28] 曹毅, 李杰, 叶培涛, 等. 利用可选择多尺度图卷积网络的骨架行为识别[J]. 电子与信息学报, 2025, 47(3): 839–849. doi: 10.11999/JEIT240702.

    CAO Yi, LI Jie, YE Peitao, et al. Skeleton-based action recognition with selective multi-scale graph convolutional network[J]. Journal of Electronics & Information Technology, 2025, 47(3): 839–849. doi: 10.11999/JEIT240702.
    [29] LIU Dongjingdian, LI Xiaomeng, CAI Zijie, et al. TSGCNeXt: Dynamic-static multi-graph convolution for efficient skeleton-based action recognition[J]. Expert Systems with Applications, 2025, 276: 127081. doi: 10.1016/j.eswa.2025.127081.
  • 加载中
图(4) / 表(6)
计量
  • 文章访问数:  9
  • HTML全文浏览量:  4
  • PDF下载量:  2
  • 被引次数: 0
出版历程
  • 收稿日期:  2026-02-05
  • 修回日期:  2026-05-29
  • 录用日期:  2026-05-29
  • 网络出版日期:  2026-06-10

目录

    /

    返回文章
    返回