Mamba-YOWO: An Efficient Spatio-Temporal Representation Framework for Action Detection

MA Li, XIN Jiangbo, WANG Lu, DAI Xinguan, SONG Shuang

Citation: MA Li, XIN Jiangbo, WANG Lu, DAI Xinguan, SONG Shuang. Mamba-YOWO: An Efficient Spatio-Temporal Representation Framework for Action Detection[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251124


doi: 10.11999/JEIT251124 cstr: 32379.14.JEIT251124
Funds: General Program of the National Natural Science Foundation of China (52574269)
Details
    Author biographies:

    MA Li: Female, Associate Professor. Her research interests include computer vision, spatio-temporal action analysis, deep learning, and intelligent safety monitoring.

    XIN Jiangbo: Male, Master's degree candidate. His research interests include computer vision, spatio-temporal action analysis, and deep learning.

    Corresponding author:

    XIN Jiangbo, vincent@stu.xust.edu.cn

  • CLC number: TN911.73; TP391.41

  • Abstract: In spatio-temporal action detection, existing methods struggle to jointly and efficiently model appearance semantics and dynamic motion features within a unified framework, and mainstream frameworks, constrained by high computational complexity and local receptive fields, find it difficult to reconcile long-range temporal dependency modeling with real-time inference. To address these problems, this paper proposes Mamba-YOWO, a lightweight spatio-temporal action detection framework based on selective state space models. First, a Mamba module is introduced to rebuild the temporal modeling backbone of YOWOv3, capturing long-range temporal dependencies while retaining linear computational complexity. Second, an efficient multi-scale spatio-temporal fusion module is designed to fuse multi-scale spatial features with dynamic temporal context and strengthen discriminative representations. Finally, experiments are conducted on the UCF101-24 and JHMDB datasets. The results show that, compared with YOWOv3, the proposed method reduces the number of parameters by 7.3% and the computational cost (FLOPs) by 5.4%, while reaching frame-level mAP of 90.24% and 83.2%, respectively, significantly outperforming existing real-time detection methods. These results verify the advantage of the proposed method in balancing accuracy and efficiency for real-time spatio-temporal action detection.
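
The paper's code is not reproduced on this page. As a rough illustration of the selective state-space idea the abstract builds on (a linear-time scan over the clip's temporal axis with input-dependent parameters), the sketch below is a minimal, simplified PyTorch rendering. It is not the authors' implementation; every name in it (SelectiveSSMTemporal, state_dim, to_delta, and so on) is an assumption made for the example.

```python
# Illustrative sketch only, not the paper's code: a minimal selective-SSM-style
# temporal block over clip features of shape (B, T, C). It discretizes the
# recurrence h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t and reads out
# y_t = <C_t, h_t>, so the scan cost grows linearly with the clip length T.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveSSMTemporal(nn.Module):
    """Hypothetical, simplified selective state-space (Mamba-style) temporal block."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.to_delta = nn.Linear(dim, dim)                      # input-dependent step size
        self.to_gate = nn.Linear(dim, dim)                       # output gate
        self.A_log = nn.Parameter(torch.zeros(dim, state_dim))   # state decay, log-space
        self.to_B = nn.Linear(dim, state_dim)                    # input-dependent input matrix
        self.to_C = nn.Linear(dim, state_dim)                    # input-dependent readout
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) per-frame clip features; the scan is linear in T.
        b, t, c = x.shape
        u = self.in_proj(x)
        delta = F.softplus(self.to_delta(u))          # (B, T, C)
        A = -torch.exp(self.A_log)                    # (C, N), negative -> decaying state
        Bu = self.to_B(u)                             # (B, T, N)
        Cu = self.to_C(u)                             # (B, T, N)
        h = x.new_zeros(b, c, A.shape[-1])            # hidden state (B, C, N)
        outs = []
        for i in range(t):  # sequential scan for clarity; efficient kernels parallelize this
            dA = torch.exp(delta[:, i, :, None] * A)                           # (B, C, N)
            dBx = delta[:, i, :, None] * Bu[:, i, None, :] * u[:, i, :, None]  # (B, C, N)
            h = dA * h + dBx
            outs.append((h * Cu[:, i, None, :]).sum(-1))                       # (B, C)
        y = torch.stack(outs, dim=1)                                           # (B, T, C)
        return self.out_proj(y * torch.sigmoid(self.to_gate(u)))


# Example: an 8-frame clip with 256-channel features keeps its (B, T, C) shape.
# block = SelectiveSSMTemporal(dim=256)
# out = block(torch.randn(2, 8, 256))   # -> torch.Size([2, 8, 256])
```
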
  • Figure 1  Overall architecture of the Mamba-YOWO framework

    Figure 2  Temporal modeling network

    Figure 3  Spatio-temporal interleaved scanning strategy

    Figure 4  Comparison of spatial serialization scanning strategies

    Figure 5  Efficient multi-scale spatio-temporal fusion module

    Figure 6  Visualization of detection results from Mamba-YOWO and YOWOv3/L

    Figure 7  On-device deployment performance comparison of the models

    Table 1  Performance comparison with existing methods on the UCF101-24 and JHMDB datasets (%)

    Year  Method         UCF101-24                                  JHMDB
                         F-mAP  V-mAP                               F-mAP  V-mAP
                                @0.2   @0.5   @0.75  0.5:0.95              @0.2   @0.5   @0.75  0.5:0.95
    2020  MOC[31]        78.00  82.80  53.80  29.60  28.30          70.80  77.30  77.20  71.70  59.10
    2020  Li et al.[32]   -     71.10  54.00  21.80   -              -     76.10  74.30  56.40   -
    2021  SAMOC[33]      79.30  80.50  53.50  30.30  28.70          73.10  79.20  78.30  70.50  58.70
    2022  TubeR[34]      81.30  85.30  60.20   -     29.70           -     81.80  80.70   -      -
    2023  HIT[35]        84.80  88.80  74.30   -      -             83.80  89.70  88.10   -      -
    2023  EVAD[36]       85.10   -      -      -      -             83.80   -      -      -      -
    2023  STMixer[37]    86.70   -      -      -      -             83.70   -      -      -      -
    2024  OSTAD[38]      87.20   -     58.30   -      -             78.60   -     86.50   -      -
    2024  YWOM[39]       84.60   -     49.60   -      -             73.30   -     83.70   -      -
     -    Ours           90.24  87.90  87.90  31.20  30.20          83.20  87.90  86.70  78.52  59.25

    Table 2  Performance comparison with YOWO-series models on the UCF101-24 dataset (%)

    Year  Model         Backbone                 Params (M)  FLOPs (GFLOPs)   F-mAP  V-mAP
                                                                                     @0.2   @0.5   @0.75  0.5:0.95
    2019  YOWO[23]      DarkNet+ShuffleNetV2     78.98       7.10             71.40   -      -      -      -
    2019  YOWO[23]      DarkNet+ResNext-101      121.40      43.70            80.40  75.80  48.80   -      -
    2024  YOWOv2/L[24]  FreeYolo-L+ResNext-101   109.70      92.00            85.20  80.42  52.00  23.50  24.76
    2025  YOWOv3/L[25]  Yolov8-M+I3D             59.80       39.80            88.33  85.20  54.32  28.70  27.24
     -    Ours          Yolov8-M+DBFM            55.43       37.65            90.24  87.90  60.32  31.20  30.20
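    Note: the efficiency gains quoted in the abstract follow directly from this table. Relative to YOWOv3/L, the parameter count falls from 59.80 M to 55.43 M, i.e. (59.80 − 55.43)/59.80 ≈ 7.3%, and the computational cost falls from 39.80 GFLOPs to 37.65 GFLOPs, i.e. (39.80 − 37.65)/39.80 ≈ 5.4%.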

    Table 3  Ablation study of the core network modules (%)

    Configuration  3D branch architecture  Scanning strategy  Feature fusion module  UCF101-24          JHMDB
                                                                                     F-mAP  V-mAP@0.5   F-mAP  V-mAP@0.5
    Baseline       I3D                     -                  CFAM                   88.33  54.32       78.62  82.30
    Variant A      DBFM network (N=4)      1D-Scan            CFAM                   88.70  58.37       81.30  83.24
    Variant B      DBFM network (N=4)      1D-Scan            EMSTF                  89.15  60.54       81.37  85.21
    Variant C      DBFM network (N=4)      STIS               CFAM                   89.93  75.36       82.47  86.26
    Full model     DBFM network (N=4)      STIS               EMSTF                  90.24  87.90       83.20  86.70
  • [1] WANG Cailing, YAN Jingjing, and ZHANG Zhidong. Review on human action recognition methods based on multimodal data[J]. Computer Engineering and Applications, 2024, 60(9): 1–18. doi: 10.3778/j.issn.1002-8331.2310-0090.
    [2] ZHEN Rui, SONG Wenchao, HE Qiang, et al. Human-computer interaction system: A survey of talking-head generation[J]. Electronics, 2023, 12(1): 218. doi: 10.3390/electronics12010218.
    [3] SIMONYAN K and ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]. Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, Montreal, Canada, 2014: 568–576.
    [4] WANG Limin, XIONG Yuanjun, WANG Zhe, et al. Temporal segment networks: Towards good practices for deep action recognition[C]. Proceedings of 14th European Conference on Computer Vision – ECCV 2016, Amsterdam, The Netherlands, 2016: 20–36. doi: 10.1007/978-3-319-46484-8_2.
    [5] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 4489–4497. doi: 10.1109/ICCV.2015.510.
    [6] CARREIRA J and ZISSERMAN A. Quo Vadis, action recognition? A new model and the kinetics dataset[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4724–4733. doi: 10.1109/CVPR.2017.502.
    [7] QIAN Huimin, CHEN Shi, and HUANGFU Xiaoying. Human activities recognition based on two-stream NonLocal spatial temporal residual convolution neural network[J]. Journal of Electronics & Information Technology, 2024, 46(3): 1100–1108. doi: 10.11999/JEIT230168.
    [8] WANG Xiaolong, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7794–7803. doi: 10.1109/CVPR.2018.00813.
    [9] DONAHUE J, ANNE HENDRICKS L, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 2625–2634. doi: 10.1109/CVPR.2015.7298878.
    [10] CAO Yi, LI Jie, YE Peitao, et al. Skeleton-based action recognition with selective multi-scale graph convolutional network[J]. Journal of Electronics & Information Technology, 2025, 47(3): 839–849. doi: 10.11999/JEIT240702.
    [11] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. Proceedings of 9th International Conference on Learning Representations, Austria, 2021. (Venue location not found in online sources; please confirm.)
    [12] BERTASIUS G, WANG Heng, and TORRESANI L. Is space-time attention all you need for video understanding?[C]. Proceedings of the 38th International Conference on Machine Learning, 2021: 813–824. (Venue location for this reference not found in online sources; please confirm.)
    [13] ZHANG Chenlin, WU Jianxin, and LI Yin. ActionFormer: Localizing moments of actions with transformers[C]. Proceedings of 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, 2022: 492–510. doi: 10.1007/978-3-031-19772-7_29.
    [14] HAN Zongwang, YANG Han, WU Shiqing, et al. Action recognition network combining spatio-temporal adaptive graph convolution and Transformer[J]. Journal of Electronics & Information Technology, 2024, 46(6): 2587–2595. doi: 10.11999/JEIT230551.
    [15] GU A and DAO T. Mamba: Linear-time sequence modeling with selective state spaces[C]. Conference on Language Modeling, Philadelphia, 2024. (Full publication details for this reference not found in online sources; please confirm.)
    [16] ZHU Lianghui, LIAO Bencheng, ZHANG Qian, et al. Vision mamba: Efficient visual representation learning with bidirectional state space model[C]. Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 2024.
    [17] LIU Yue, TIAN Yunjie, ZHAO Yuzhong, et al. VMamba: Visual state space model[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 3273.
    [18] LI Kunchang, LI Xinhao, WANG Yi, et al. VideoMamba: State space model for efficient video understanding[C]. Proceedings of the 18th European Conference on Computer Vision – ECCV 2024, Milan, Italy, 2025: 237–255. doi: 10.1007/978-3-031-73347-5_14.
    [19] LEE S H, SON T, SEO S W, et al. JARViS: Detecting actions in video using unified actor-scene context relation modeling[J]. Neurocomputing, 2024, 610: 128616. doi: 10.1016/j.neucom.2024.128616.
    [20] REKA A, BORZA D L, REILLY D, et al. Introducing gating and context into temporal action detection[C]. Proceedings of Computer Vision – ECCV 2024 Workshops, Milan, Italy, 2025: 322–334. doi: 10.1007/978-3-031-91581-9_23.
    [21] KWON D, KIM I, and KWAK S. Boosting semi-supervised video action detection with temporal context[C]. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, USA, 2025: 847–858. doi: 10.1109/WACV61041.2025.00092.
    [22] ZHOU Xuyang, WANG Ye, TAO Fei, et al. Hierarchical chat-based strategies with MLLMs for Spatio-temporal action detection[J]. Information Processing & Management, 2025, 62(4): 104094. doi: 10.1016/j.ipm.2025.104094.
    [23] KÖPÜKLÜ O, WEI Xiangyu, and RIGOLL G. You only watch once: A unified CNN architecture for real-time spatiotemporal action localization[J]. arXiv preprint arXiv: 1911.06644, 2019. (Reference type could not be verified from online sources; please confirm.)
    [24] JIANG Zhiqiang, YANG Jianhua, JIANG Nan, et al. YOWOv2: A stronger yet efficient multi-level detection framework for real-time spatio-temporal action detection[C]. Proceedings of 17th International Conference on Intelligent Robotics and Applications, Xi’an, China, 2025: 33–48. doi: 10.1007/978-981-96-0774-7_3.
    [25] NGUYEN D D M, NHAN B D, WANG J C, et al. YOWOv3: An efficient and generalized framework for spatiotemporal action detection[J]. IEEE Intelligent Systems, 2026, 41(1): 75–85. doi: 10.1109/MIS.2025.3581100.
    [26] VARGHESE R and M S. YOLOv8: A novel object detection algorithm with enhanced performance and robustness[C]. 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 2024: 1–6. doi: 10.1109/ADICS58448.2024.10533619.
    [27] LIU Ze, NING Jia, CAO Yue, et al. Video swin transformer[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 3192–3201. doi: 10.1109/CVPR52688.2022.00320.
    [28] XIAO Haoke, TANG Lv, JIANG Pengtao, et al. Boosting vision state space model with fractal scanning[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 8646–8654. doi: 10.1609/aaai.v39i8.32934.
    [29] SOOMRO K, ZAMIR A R, and SHAH M. UCF101: A dataset of 101 human actions classes from videos in the wild[J]. arXiv preprint arXiv: 1212.0402, 2012. (Reference type could not be verified from online sources; please confirm.)
    [30] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]. 2011 International Conference on Computer Vision, Barcelona, Spain, 2011: 2556–2563. doi: 10.1109/ICCV.2011.6126543.
    [31] LI Yixuan, WANG Zixu, WANG Limin, et al. Actions as moving points[C]. Proceedings of 16th European Conference on Computer Vision – ECCV 2020, Glasgow, UK, 2020: 68–84. doi: 10.1007/978-3-030-58517-4_5.
    [32] LI Yuxi, LIN Weiyao, WANG Tao, et al. Finding action tubes with a sparse-to-dense framework[C]. Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, 2020: 11466–11473. doi: 10.1609/aaai.v34i07.6811.
    [33] MA Xurui, LUO Zhigang, ZHANG Xiang, et al. Spatio-temporal action detector with self-attention[C]. 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 2021: 1–8. doi: 10.1109/IJCNN52387.2021.9533300.
    [34] ZHAO Jiaojiao, ZHANG Yanyi, LI Xinyu, et al. TubeR: Tubelet transformer for video action detection[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 13588–13597. doi: 10.1109/CVPR52688.2022.01323.
    [35] FAURE G J, CHEN M H, and LAI Shanghong. Holistic interaction transformer network for action detection[C]. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, 2023: 3329–3339. doi: 10.1109/WACV56688.2023.00334.
    [36] CHEN Lei, TONG Zhan, SONG Yibing, et al. Efficient video action detection with token dropout and context refinement[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 10354–10365. doi: 10.1109/ICCV51070.2023.00953.
    [37] WU Tao, CAO Mengqi, GAO Ziteng, et al. STMixer: A one-stage sparse action detector[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 14720–14729. doi: 10.1109/CVPR52729.2023.01414.
    [38] SU Shaowen and GAN Minggang. Online spatio-temporal action detection with adaptive sampling and hierarchical modulation[J]. Multimedia Systems, 2024, 30(6): 349. doi: 10.1007/s00530-024-01543-1.
    [39] QIN Yefeng, CHEN Lei, BEN Xianye, et al. You watch once more: A more effective CNN architecture for video spatio-temporal action localization[J]. Multimedia Systems, 2024, 30(1): 53. doi: 10.1007/s00530-023-01254-z.
Publication history
  • Revised: 2026-02-05
  • Accepted: 2026-02-05
  • Published online: 2026-02-13
