Research on Monophonic Speech Separation Method Using Time-Frequency Domain Multi-Scale Information Interaction Strategy

LAN Chaofeng, YANG Guotao, CHEN Yingqi, GUO Xiaoxia

Citation: LAN Chaofeng, YANG Guotao, CHEN Yingqi, GUO Xiaoxia. Research on Monophonic Speech Separation Method Using Time-Frequency Domain Multi-Scale Information Interaction Strategy[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251340

doi: 10.11999/JEIT251340 cstr: 32379.14.JEIT251340
Funds: The National Natural Science Foundation of China (11804068), Heilongjiang Provincial Outstanding Young Teachers Basic Research Support Programme (YQJH2024064)
Details
    Author biographies:

    LAN Chaofeng: Female, Professor and Ph.D. supervisor. Her research interests include acoustic signal analysis and processing, and intelligent task planning and decision-making

    YANG Guotao: Male, Master's student. His research interest is speech separation

    CHEN Yingqi: Male, Ph.D. candidate. His research interest is acoustic signal analysis and processing

    GUO Xiaoxia: Female, Ph.D. Her research interest is acoustic signal analysis and processing

    Corresponding author:

    LAN Chaofeng, lanchaofeng@hrbust.edu.cn

  • CLC number: TP391.42

  • Abstract: To address the limitations of existing attention-based monaural speech separation models in multi-scale feature interaction and time-frequency information fusion, this paper proposes a Multi-Scale Attention model integrating Time-Frequency domain information (MSA-TF). The model builds a time-frequency fusion module and a multi-scale interaction separator, introducing band splitting, dynamic gating, and cross-attention so that time-domain and frequency-domain features complement each other efficiently. Cross-scale residual connections and an adaptive pooling strategy promote bidirectional flow between global semantics and local detail, strengthening the joint modeling of long- and short-term dependencies in speech. Experiments are conducted on the WSJ0-2mix and Libri-2mix datasets, with the Signal-to-Distortion Ratio (SDR) and the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) as evaluation metrics. On the WSJ0-2mix test set, MSA-TF improves SI-SNR by 2.3 dB on average over the Conv-TasNet baseline; on the Libri-2mix test set, which is unseen during training, its generalization performance matches that of baselines trained on that set. These results show that the proposed method complements and fuses time-domain and frequency-domain information extracted at multiple scales, promotes joint modeling of global semantics and local detail, and yields better, more generalizable separation.
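For reference, the sketch below shows how the SI-SNR metric quoted above is conventionally computed; it is a generic illustration of the metric, not code released with the paper.

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-Invariant Signal-to-Noise Ratio (SI-SNR) in dB for 1-D signals."""
    # Zero-mean both signals so DC offsets do not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: s_target = (<s_hat, s> / ||s||^2) * s.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    # Everything left over after the projection counts as error.
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))

# Scale invariance in action: a rescaled copy of the target scores near-perfectly.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
print(si_snr(0.5 * s, s))  # very large positive dB value
```

The abstract attributes the time-frequency complementarity to dynamic gating and cross-attention. The module below is a minimal, hypothetical sketch of one way such a coupling can be wired in PyTorch; the class name, tensor shapes, and the sigmoid gating form are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Hypothetical sketch: inject frequency-domain context into the
    time-domain stream via cross-attention, modulated by a dynamic gate."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The gate decides, per time step, how much attended frequency
        # context to mix back into the time-domain features.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, t_feat: torch.Tensor, f_feat: torch.Tensor) -> torch.Tensor:
        # t_feat: (batch, time_steps, dim)   time-domain features (queries)
        # f_feat: (batch, freq_frames, dim)  frequency-domain features (keys/values)
        cross, _ = self.attn(query=t_feat, key=f_feat, value=f_feat)
        g = self.gate(torch.cat([t_feat, cross], dim=-1))
        return t_feat + g * cross  # gated residual injection of TF context
```

A gate output near zero leaves the time-domain path untouched, while a gate near one fully admits the cross-attended frequency context; this is one plausible realization of the dynamic gating the abstract describes.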
  • Figure 1  Time-frequency fusion module

    Figure 2  Multi-scale interaction separator

    Figure 3  MSA-TF model architecture

    Table 1  Ablation study results

    Model         SI-SNR (dB)  SDR (dB)  Params (M)
    TDAnet        14.5         14.8      2.3
    TDAnet+TF     14.8         15.2      3.1
    TDAnet+MS     17.0         17.5      14.8
    TDAnet+TF+MS  17.6         17.8      15.6

    Table 2  Performance and efficiency under different numbers of GF-LF iterations

    Model configuration  Iterations (N)  SI-SNR (dB)  SDR (dB)  GMACs
    TDAnet+TF+MS         1               15.1         15.5      17.47
    TDAnet+TF+MS         3               16.8         17.2      52.21
    TDAnet+TF+MS         6               17.6         17.8      104.32

    Table 3  Results on the WSJ0-2mix test set

    Model                    SI-SNR (dB)  SDR (dB)  Params (M)
    Two-stage PARIS[21]      14.6         15.0      n.a.
    Conv-TasNet[1]           15.3         15.6      5.1
    SeliNet[22]              15.61        15.8      2.1
    Two-Step CTN[23]         16.1         n.a.      8.6
    Tiny-Sepformer S-32[24]  15.2         16.0      5.3
    MGST[6]                  17.0         17.3      n.a.
    SuperFormer[25]          17.6         17.8      12.8
    MSA-TF                   17.6         17.8      15.6

    Table 4  Results on the Libri-2mix test set

    Model            SI-SNR (dB)  SDR (dB)  Params (M)
    BLSTM-TasNet[9]  13.5         13.9      23.6
    Conv-TasNet[9]   14.4         14.7      8.8
    ESEDNet[26]      13.24        14.08     2.31
    MSA-TF           14.2         14.7      15.6

    Table 5  Computational efficiency and resource consumption of each model

    Model        Params (M)  GMACs   GPU memory (MB)
    Conv-TasNet  5.1         20.47   1290.29
    SepFormer    26.0        145.87  8159.30
    MSA-TF       15.62       104.32  3179.57

    Table 6  Robustness analysis of MSA-TF under different input speech lengths

    Audio length  Samples  SI-SNR (dB)  Average inference time (s)
    3–4 s         12       18.44        0.1048
    4–5 s         87       17.35        0.1142
    5–6 s         285      17.31        0.1217
    6–7 s         497      17.49        0.1361
    7–8 s         357      17.44        0.1503
    8–9 s         427      17.65        0.1693
    9–10 s        439      17.54        0.1871
    10–15 s       857      17.33        0.2259
    15–20 s       39       18.31        0.3441
  • [1] LI Kai, CHEN Guo, SANG Wendi, et al. Advances in speech separation: Techniques, challenges, and future trends[J]. arXiv preprint arXiv: 2508.10830, 2025. doi: 10.48550/arXiv.2508.10830.
    [2] LUO Yi and MESGARANI N. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256–1266. doi: 10.1109/TASLP.2019.2915167.
    [3] ZHANG Liwen, SHI Ziqiang, HAN Jiqing, et al. FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks[C]. Proceedings of the 26th International Conference on Multimedia Modeling, Daejeon, South Korea, 2020: 653–665. doi: 10.1007/978-3-030-37731-1_53.
    [4] SHI Huiyu, CHEN Xi, KONG Tianlong, et al. GLMSnet: Single channel speech separation framework in noisy and reverberant environments[C]. Proceedings of 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 2021: 663–670. doi: 10.1109/ASRU51503.2021.9688217.
    [5] LUO Yi, CHEN Zhuo, and YOSHIOKA T. Dual-Path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 46–50. doi: 10.1109/ICASSP40776.2020.9054266.
    [6] SUBAKAN C, RAVANELLI M, CORNELL S, et al. Attention is all you need in speech separation[C]. Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021: 21–25. doi: 10.1109/ICASSP39728.2021.9413901.
    [7] ZHAO Yucheng, LUO Chong, ZHA Zhengjun, et al. Multi-scale group transformer for long sequence modeling in speech separation[C]. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 2021: 450.
    [8] RIXEN J and RENZ M. SFSRNet: Super-resolution for single-channel audio source separation[C]. Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022: 11220–11228. doi: 10.1609/aaai.v36i10.21372.
    [9] TONG Weinan, ZHU Jiaxu, CHEN Jun, et al. TFCnet: Time-frequency domain corrector for speech separation[C]. Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096785.
    [10] ROUARD S, MASSA F, and DÉFOSSEZ A. Hybrid transformers for music source separation[C]. Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096956.
    [11] TZINIS E, WANG Zhepei, and SMARAGDIS P. Sudo rm -rf: Efficient networks for universal audio source separation[C]. Proceedings of 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), Espoo, Finland, 2020: 1–6. doi: 10.1109/MLSP49062.2020.9231900.
    [12] LI Kai, YANG Runxuan, and HU Xiaolin. An efficient encoder-decoder architecture with top-down attention for speech separation[J]. arXiv preprint arXiv: 2209.15200, 2022. doi: 10.48550/arXiv.2209.15200.
    [13] GOEL K, GU A, DONAHUE C, et al. It’s raw! Audio generation with state-space models[C]. Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, 2022: 7616–7633.
    [14] CHEN Chen, YANG C H H, LI Kai, et al. A neural state-space modeling approach to efficient speech separation[C]. Proceedings of the 24th Annual Conference of the International Speech Communication Association, Dublin, Ireland, 2023: 3784–3788.
    [15] XU Mohan, LI Kai, CHEN Guo, et al. TIGER: Time-frequency interleaved gain extraction and reconstruction for efficient speech separation[C]. Proceedings of the 13th International Conference on Learning Representations, Singapore, 2025.
    [16] OH H, YI J, and LEE Y. Papez: Resource-efficient speech separation with auditory working memory[C]. Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095136.
    [17] HUA Weizhe, DAI Zihang, LIU Hanxiao, et al. Transformer quality in linear time[C]. Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, 2022: 9099–9117.
    [18] ZHAO Shengkui and MA Bin. MossFormer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions[C]. Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096646.
    [19] ZHAO Shengkui, MA Yukun, NI Chongjia, et al. MossFormer2: Combining transformer and RNN-free recurrent network for enhanced time-domain monaural speech separation[C]. Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, South Korea, 2024: 10356–10360. doi: 10.1109/ICASSP48485.2024.10445985.
    [20] HU Xiaolin, LI Kai, ZHANG Weiyi, et al. Speech separation using an asynchronous fully recurrent convolutional neural network[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 1724.
    [21] PAN Zexu, WICHERN G, GERMAIN F G, et al. PARIS: Pseudo-AutoRegressIve Siamese training for online speech separation[C]. Proceedings of the 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024.
    [22] TAN H M, VU D Q, and WANG J C. SeliNet: A lightweight model for single channel speech separation[C]. Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10097121.
    [23] TZINIS E, VENKATARAMANI S, WANG Zhepei, et al. Two-step sound source separation: Training on learned latent targets[C]. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 2020: 31–35. doi: 10.1109/ICASSP40776.2020.9054172.
    [24] LUO Jian, WANG Jianzong, CHENG Ning, et al. Tiny-Sepformer: A tiny time-domain transformer network for speech separation[J]. arXiv preprint arXiv: 2206.13689, 2022. doi: 10.48550/arXiv.2206.13689.
    [25] JIANG Yanji, QIU Youli, SHEN Xueli, et al. SuperFormer: Enhanced multi-speaker speech separation network combining channel and spatial adaptability[J]. Applied Sciences, 2022, 12(15): 7650. doi: 10.3390/app12157650.
    [26] LIU Debang, ZHANG Tianqi, CHRISTENSEN M G, et al. Efficient time-domain speech separation using short encoded sequence network[J]. Speech Communication, 2025, 166: 103150. doi: 10.1016/j.specom.2024.103150.
    [27] HOU Jin, SHENG Yaobao, and ZHANG Bo. DOA estimation of direction vector estimation algorithm based on second-order statistical properties[J]. Journal of Electronics & Information Technology, 2024, 46(2): 697–704. doi: 10.11999/JEIT230172.
    [28] TIAN Haoyuan, CHEN Yuxuan, CHEN Beijing, et al. Defeating voice conversion forgery by active defense with diffusion reconstruction[J]. Journal of Electronics & Information Technology, 2026, 48(2): 818–828. doi: 10.11999/JEIT250709.
    [29] LIU Jia, ZHANG Yangrui, CHEN Dapeng, et al. Bimodal emotion recognition method based on dual-stream attention and adversarial mutual reconstruction[J]. Journal of Electronics & Information Technology, 2026, 48(1): 277–286. doi: 10.11999/JEIT250424.
Publication history
  • Revised:  2026-02-25
  • Accepted:  2026-03-03
  • Available online:  2026-03-14
