Research on a Monaural Speech Separation Method Using a Time-Frequency Domain Multi-Scale Information Interaction Strategy
Abstract: To address the shortcomings of existing attention-based monaural speech separation models in multi-scale feature interaction and time-frequency information fusion, this paper proposes a Multi-Scale Attention model integrating Time-Frequency domain information (MSA-TF). The model builds a time-frequency fusion module and a multi-scale interaction separator, introducing frequency-band splitting, dynamic gating, and cross-attention to achieve efficient complementarity between time-domain and frequency-domain features. Cross-scale residual connections and an adaptive pooling strategy promote bidirectional flow between global semantics and local details, strengthening the joint modeling of long- and short-term dependencies in speech signals. Experiments on the WSJ0-2mix and Libri-2mix datasets use Signal-to-Distortion Ratio (SDR) and Scale-Invariant Signal-to-Noise Ratio (SI-SNR) as evaluation metrics. On WSJ0-2mix, MSA-TF improves SI-SNR by an average of 2.3 dB over the Conv-TasNet baseline; on the unseen Libri-2mix test set, its generalization performance is comparable to baselines trained on that set. These results show that the proposed method complements and fuses time-domain and frequency-domain information extracted at different scales, promotes joint modeling of global semantics and local details, and yields better and more generalizable separation results.
Abstract:
Objective  Monaural speech separation, which aims to extract individual speaker signals from a single-channel mixture, is a core technology for solving the "cocktail party problem" and holds significant application value in low-resource, low-latency scenarios such as mobile voice assistants, teleconferencing, and hearing aids. However, the inherent lack of spatial cues in single-channel signals and the substantial overlap of multiple speakers' voices in both time-domain waveforms and frequency-domain spectra make accurate separation while preserving the integrity and clarity of target speech extremely challenging. Current deep learning-based models often face limitations in three interrelated aspects: effective multi-scale dependency coordination, efficient time-frequency (T-F) information fusion, and computational complexity control. To address these challenges holistically, this paper proposes a novel Multi-Scale Attention model integrating Time-Frequency domain information (MSA-TF), designed to enhance separation performance, computational efficiency, and generalization capability simultaneously.
Methods  The MSA-TF model incorporates three key innovative components. First, a lightweight Time-Frequency (TF) Fusion Module is designed. This module begins by splitting the frequency band into four sub-bands based on speech priors (e.g., low-frequency energy concentration, high-frequency detail sensitivity) to efficiently extract spectral features. A dynamic gating mechanism, employing decomposed convolutions and SiLU activation, is then used to adaptively enhance speaker-discriminative features and suppress redundant channels associated with noise. Finally, a cross-attention mechanism facilitates deep interaction between time-domain and frequency-domain features during the encoding stage: the time-domain global semantic information guides the selection and weighting of useful frequency-domain features, enabling mutual correction and complementarity.
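As a rough illustration of the band-splitting and gating steps just described (the four sub-band boundaries and the scalar per-band gate below are hypothetical simplifications; the paper's module operates on learned feature maps with decomposed convolutions):

```python
import math

def silu(x):
    """SiLU activation used by the gating mechanism: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def split_bands(spectrum, edges=(0.125, 0.25, 0.5)):
    """Split an F-bin spectrum into 4 sub-bands, narrower at low
    frequencies (hypothetical boundaries following speech priors)."""
    f = len(spectrum)
    cuts = [0] + [int(e * f) for e in edges] + [f]
    return [spectrum[cuts[i]:cuts[i + 1]] for i in range(4)]

def gate_band(band, weight):
    """Dynamic gating reduced to one scalar per band: scale every bin by a
    SiLU-activated statistic of the band (a stand-in for the learned,
    convolutional gate in the paper)."""
    g = silu(weight * sum(band) / len(band))
    return [g * x for x in band]
```

With a 64-bin spectrum this yields sub-bands of 8, 8, 16, and 32 bins, so low-frequency structure is resolved more finely than high-frequency detail.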
This entire module adds only 0.8M parameters. Second, a Multi-scale Interaction Separator is proposed to overcome the limitations of sequential or loosely coupled multi-scale processing found in models like SepFormer. It extracts multi-granularity features (from frame-level F1 to syllable-level semantic F4) using cascaded dilated convolutions. Its core is the "GF-LF Iterative Feedback" mechanism: the Global Flash (GF) module, utilizing efficient FLASH attention, captures long-range dependencies and syllable-level context. This global information is up-sampled and injected into local features (Fk) via residual connections. Subsequently, Local Flash (LF) modules, also based on FLASH attention, process these enhanced local features (F'k) to model fine-grained structures and suppress frame-level noise. The updated local features are then fed back (after adaptive pooling) to refine the global representation in the next iteration. This closed-loop bidirectional flow ensures deep synergy between global semantics and local details. A gated fusion mechanism at the end dynamically balances the contributions from different scales. Third, to control computational complexity, an efficient hierarchical grouped attention mechanism is adopted, reducing the complexity from quadratic to nearly linear with respect to sequence length. The overall MSA-TF architecture is end-to-end, comprising a 1D convolutional encoder, the integrated TF and Multi-scale modules, a mask network, and a symmetric decoder.
Results and Discussions  Extensive experiments were conducted on the standard WSJ0-2mix and Libri-2mix datasets, using Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Signal-to-Distortion Ratio (SDR) as evaluation metrics. Ablation studies (Table 1) conclusively confirmed the individual and synergistic contributions of the proposed modules.
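On the complexity point (quadratic reduced to nearly linear), a back-of-the-envelope count of pairwise attention scores illustrates the saving of group-restricted attention; the group size of 256 is an illustrative assumption, not a value from the paper:

```python
def full_attention_scores(seq_len):
    """Pairwise scores in vanilla self-attention: quadratic in length."""
    return seq_len * seq_len

def grouped_attention_scores(seq_len, group=256):
    """Scores when attention is confined to non-overlapping groups:
    (seq_len / group) groups, each costing group^2 -> linear in length."""
    n_groups = -(-seq_len // group)  # ceiling division
    return n_groups * group * group

# Doubling the sequence doubles (not quadruples) the grouped cost.
for L in (1024, 8192):
    print(L, full_attention_scores(L), grouped_attention_scores(L))
```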
Adding only the TF module to the TDAnet baseline improved SI-SNR by 0.3 dB and SDR by 0.4 dB with minimal parameters, validating its role in supplementing signal structure modeling, particularly for high-frequency details. Incorporating the Multi-scale (MS) interaction module alone resulted in a substantial gain of 2.5 dB SI-SNR and 2.7 dB SDR, highlighting its critical role in long-term dependency modeling. The combination of TF and MS modules in the full MSA-TF core achieved a synergistic effect (17.6 dB SI-SNR), surpassing the sum of individual gains, demonstrating that the "dual-dimensional features" from TF fusion and the "deep dependency modeling" from multi-scale interaction mutually empower each other. Visual analysis of spectrograms (Fig. 4) further verified that the TF module effectively suppressed residual high-frequency noise and resulted in clearer spectral contours for the target speech. On the WSJ0-2mix test set (Table 3), MSA-TF achieved a state-of-the-art performance of 17.6 dB SI-SNR and 17.8 dB SDR, matching the performance of SuperFormer and significantly outperforming strong baselines like Conv-TasNet (by 2.3 dB SI-SNR) while maintaining a reasonable parameter count (15.6M). Compared to models with larger parameter sizes (e.g., SignPredictionNet at 55.2M), MSA-TF demonstrated more efficient modeling. For generalization testing on the completely unseen Libri-2mix dataset (Table 4), MSA-TF, trained solely on WSJ0-2mix, achieved 14.2 dB SI-SNR and 14.7 dB SDR, performing comparably to Conv-TasNet models specifically trained on Libri-2mix (14.4 dB SI-SNR) and outperforming BLSTM-TasNet trained on Libri-2mix.
This robust cross-dataset adaptability underscores the model's success in capturing universal time-frequency characteristics and multi-scale dependency structures inherent in speech signals, rather than overfitting to a specific dataset distribution.
Conclusions  This study presents MSA-TF, a novel model that effectively addresses key challenges in monaural speech separation by deeply integrating multi-scale time-frequency information interaction. The proposed lightweight Time-Frequency Fusion Module efficiently complements time-domain features with discriminative frequency-domain information. The innovative Multi-scale Interaction Separator with its iterative feedback mechanism enables dynamic, bidirectional flow of information across scales, significantly enhancing the joint modeling of both short-term details and long-term dependencies. Coupled with efficient attention design, the model achieves superior performance without prohibitive computational cost. Experimental results demonstrate that MSA-TF achieves leading separation performance on standard benchmarks and exhibits strong generalization ability to unseen data distributions, validating its comprehensive approach. The model provides an efficient, robust, and generalizable solution potentially suitable for practical low-resource application scenarios. Future work will explore advanced cross-modal fusion techniques and dynamic scale adjustment strategies to further enhance robustness and performance in more complex and variable acoustic environments.
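The GF-LF iterative feedback loop summarized above can be sketched as follows (the pooling factor and the identity/residual stand-ins for the GF and LF blocks are illustrative; the actual modules use FLASH attention):

```python
def adaptive_pool(x, r):
    """Average-pool a 1-D feature sequence by factor r (local -> global)."""
    return [sum(x[i:i + r]) / r for i in range(0, len(x), r)]

def upsample(x, r):
    """Nearest-neighbour upsampling (global -> local residual injection)."""
    return [v for v in x for _ in range(r)]

def gf_lf_loop(local, n_iters=6, r=4, alpha=0.5):
    """Closed-loop bidirectional flow between global and local features.
    GF/LF modelling is replaced by identity maps here; only the wiring
    (pool -> global -> inject -> local -> feed back) follows the paper."""
    for _ in range(n_iters):
        global_feat = adaptive_pool(local, r)               # GF input
        injected = upsample(global_feat, r)                 # residual injection
        local = [l + alpha * g for l, g in zip(local, injected)]  # LF update
    return local
```

Each pass refines the global representation from the updated local features and re-injects it, which is the bidirectional flow the separator relies on; the paper's best configuration uses six such iterations (Table 2).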
Table 1  Ablation study results

| Model | SI-SNR (dB) | SDR (dB) | Params (M) |
|---|---|---|---|
| TDAnet | 14.5 | 14.8 | 2.3 |
| TDAnet+TF | 14.8 | 15.2 | 3.1 |
| TDAnet+MS | 17.0 | 17.5 | 14.8 |
| TDAnet+TF+MS | 17.6 | 17.8 | 15.6 |

Table 2  Performance and efficiency under different numbers of GF-LF iterations
| Configuration | Iterations (N) | SI-SNR (dB) | SDR (dB) | GMACs |
|---|---|---|---|---|
| TDAnet+TF+MS | 1 | 15.1 | 15.5 | 17.47 |
| TDAnet+TF+MS | 3 | 16.8 | 17.2 | 52.21 |
| TDAnet+TF+MS | 6 | 17.6 | 17.8 | 104.32 |

Table 3  Results on the WSJ0-2mix test set
Table 4  Results on the Libri-2mix test set
Table 5  Computational efficiency and resource consumption of each model

| Model | Params (M) | GMACs | GPU memory (MB) |
|---|---|---|---|
| Conv-TasNet | 5.1 | 20.47 | 1290.29 |
| SepFormer | 26.0 | 145.87 | 8159.30 |
| MSA-TF | 15.62 | 104.32 | 3179.57 |

Table 6  Robustness of MSA-TF across different input speech lengths
| Duration | Samples | SI-SNR (dB) | Avg. inference time (s) |
|---|---|---|---|
| 3–4 s | 12 | 18.44 | 0.1048 |
| 4–5 s | 87 | 17.35 | 0.1142 |
| 5–6 s | 285 | 17.31 | 0.1217 |
| 6–7 s | 497 | 17.49 | 0.1361 |
| 7–8 s | 357 | 17.44 | 0.1503 |
| 8–9 s | 427 | 17.65 | 0.1693 |
| 9–10 s | 439 | 17.54 | 0.1871 |
| 10–15 s | 857 | 17.33 | 0.2259 |
| 15–20 s | 39 | 18.31 | 0.3441 |
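For reference, the SI-SNR reported throughout these tables is computed by projecting the estimate onto the reference signal and comparing the energies of the projection and the residual; a minimal implementation (assuming zero-mean inputs for brevity):

```python
import math

def si_snr(est, ref):
    """Scale-Invariant SNR in dB between an estimated and a reference
    waveform, given as equal-length sequences of samples."""
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / ref_energy
    target = [scale * r for r in ref]            # projection onto reference
    noise = [e - t for e, t in zip(est, target)] # residual distortion
    return 10.0 * math.log10(
        sum(t * t for t in target) / sum(n * n for n in noise)
    )
```

Because the estimate is first rescaled by the projection coefficient, multiplying it by any constant leaves the score unchanged, which is exactly what makes the metric scale-invariant.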
[1] LI Kai, CHEN Guo, SANG Wendi, et al. Advances in speech separation: Techniques, challenges, and future trends[J]. arXiv preprint arXiv:2508.10830, 2025. doi: 10.48550/arXiv.2508.10830.
[2] LUO Yi and MESGARANI N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256–1266. doi: 10.1109/TASLP.2019.2915167.
[3] ZHANG Liwen, SHI Ziqiang, HAN Jiqing, et al. FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks[C]. Proceedings of the 26th International Conference on Multimedia Modeling, Daejeon, South Korea, 2020: 653–665. doi: 10.1007/978-3-030-37731-1_53.
[4] SHI Huiyu, CHEN Xi, KONG Tianlong, et al. GLMSnet: Single channel speech separation framework in noisy and reverberant environments[C]. Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 2021: 663–670. doi: 10.1109/ASRU51503.2021.9688217.
[5] LUO Yi, CHEN Zhuo, and YOSHIOKA T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]. Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 46–50. doi: 10.1109/ICASSP40776.2020.9054266.
[6] SUBAKAN C, RAVANELLI M, CORNELL S, et al. Attention is all you need in speech separation[C]. Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021: 21–25. doi: 10.1109/ICASSP39728.2021.9413901.
[7] ZHAO Yucheng, LUO Chong, ZHA Zhengjun, et al. Multi-scale group transformer for long sequence modeling in speech separation[C]. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 2021: 450.
[8] RIXEN J and RENZ M. SFSRNet: Super-resolution for single-channel audio source separation[C]. Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022: 11220–11228. doi: 10.1609/aaai.v36i10.21372.
[9] TONG Weinan, ZHU Jiaxu, CHEN Jun, et al. TFCnet: Time-frequency domain corrector for speech separation[C]. Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096785.
[10] ROUARD S, MASSA F, and DÉFOSSEZ A. Hybrid transformers for music source separation[C]. Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096956.
[11] TZINIS E, WANG Zhepei, and SMARAGDIS P. Sudo RM-RF: Efficient networks for universal audio source separation[C]. Proceedings of the 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), Espoo, Finland, 2020: 1–6. doi: 10.1109/MLSP49062.2020.9231900.
[12] LI Kai, YANG Runxuan, and HU Xiaolin. An efficient encoder-decoder architecture with top-down attention for speech separation[J]. arXiv preprint arXiv:2209.15200, 2022. doi: 10.48550/arXiv.2209.15200.
[13] GOEL K, GU A, DONAHUE C, et al. It's raw! Audio generation with state-space models[C]. Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, 2022: 7616–7633.
[14] CHEN Chen, YANG C H H, LI Kai, et al. A neural state-space modeling approach to efficient speech separation[C]. Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH), Dublin, Ireland, 2023: 3784–3788.
[15] XU Mohan, LI Kai, CHEN Guo, et al. TIGER: Time-frequency interleaved gain extraction and reconstruction for efficient speech separation[C]. Proceedings of the 13th International Conference on Learning Representations, Singapore, 2025.
[16] OH H, YI J, and LEE Y. Papez: Resource-efficient speech separation with auditory working memory[C]. Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095136.
[17] HUA Weizhe, DAI Zihang, LIU Hanxiao, et al. Transformer quality in linear time[C]. Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, 2022: 9099–9117.
[18] ZHAO Shengkui and MA Bin. MossFormer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions[C]. Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096646.
[19] ZHAO Shengkui, MA Yukun, NI Chongjia, et al. MossFormer2: Combining transformer and RNN-free recurrent network for enhanced time-domain monaural speech separation[C]. Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 2024: 10356–10360. doi: 10.1109/ICASSP48485.2024.10445985.
[20] HU Xiaolin, LI Kai, ZHANG Weiyi, et al. Speech separation using an asynchronous fully recurrent convolutional neural network[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 1724.
[21] PAN Zexu, WICHERN G, GERMAIN F G, et al. PARIS: Pseudo-autoregressive siamese training for online speech separation[C]. Proceedings of the 25th Annual Conference of the International Speech Communication Association (INTERSPEECH), Kos, Greece, 2024.
[22] TAN H M, VU D Q, and WANG J C. SeliNet: A lightweight model for single channel speech separation[C]. Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10097121.
[23] TZINIS E, VENKATARAMANI S, WANG Zhepei, et al. Two-step sound source separation: Training on learned latent targets[C]. Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 31–35. doi: 10.1109/ICASSP40776.2020.9054172.
[24] LUO Jian, WANG Jianzong, CHENG Ning, et al. Tiny-Sepformer: A tiny time-domain transformer network for speech separation[J]. arXiv preprint arXiv:2206.13689, 2022. doi: 10.48550/arXiv.2206.13689.
[25] JIANG Yanji, QIU Youli, SHEN Xueli, et al. SuperFormer: Enhanced multi-speaker speech separation network combining channel and spatial adaptability[J]. Applied Sciences, 2022, 12(15): 7650. doi: 10.3390/app12157650.
[26] LIU Debang, ZHANG Tianqi, CHRISTENSEN M G, et al. Efficient time-domain speech separation using short encoded sequence network[J]. Speech Communication, 2025, 166: 103150. doi: 10.1016/j.specom.2024.103150.
[27] HOU Jin, SHENG Yaobao, and ZHANG Bo. DOA estimation of direction vector estimation algorithm based on second-order statistical properties[J]. Journal of Electronics & Information Technology, 2024, 46(2): 697–704. doi: 10.11999/JEIT230172. (in Chinese)
[28] TIAN Haoyuan, CHEN Yuxuan, CHEN Beijing, et al. Defeating voice conversion forgery by active defense with diffusion reconstruction[J]. Journal of Electronics & Information Technology, 2026, 48(2): 818–828. doi: 10.11999/JEIT250709. (in Chinese)
[29] LIU Jia, ZHANG Yangrui, CHEN Dapeng, et al. Bimodal emotion recognition method based on dual-stream attention and adversarial mutual reconstruction[J]. Journal of Electronics & Information Technology, 2026, 48(1): 277–286. doi: 10.11999/JEIT250424. (in Chinese)