A comprehensive study on supervised single-channel noisy speech separation with multi-task learning

Bibliographic Details
Published in:	Speech communication Vol. 167; p. 103162
Main Authors:	Dang, Shaoxiang; Matsumoto, Tetsuya; Takeuchi, Yoshinori; Kudo, Hiroaki
Format:	Journal Article
Language:	English
Published:	Elsevier B.V., 01.02.2025
Subjects:	Multi-task learning; Separation priority pipeline; Speech enhancement; Speech separation; Supervised learning
ISSN:	0167-6393

Description
Summary:	This research presents a comprehensive investigation and comparison of noisy speech separation methods using multi-task learning. First, we categorize all methods into two pipelines, the enhancement priority pipeline (EPP) and the separation priority pipeline (SPP), based on whether enhancement or separation is prioritized. Next, we classify each pipeline into a shared encoder–decoder scheme (SEDS) and an independent encoder–decoder scheme (IEDS), depending on whether the two modules share the same encoder and decoder. Additionally, we introduce two types of intermediate structures between the two modules: one uses time–frequency (T–F) representations, while the other uses T–F masks. This article elaborates on the strengths and weaknesses of each approach, particularly in mitigating over-suppression and improving computational efficiency. Our experiments show substantial improvements in SPP with IEDS across multiple metrics on the LibriXmix dataset. In addition, by replacing the synthesis-based trick in the enhancement module, the model achieves superior generalization on the LibriCSS dataset.
• We extend the SEDS structure for speech enhancement (SE) and speech separation (SS) by transitioning features to masks.
• We propose negative gradient modulation as a simpler alternative to projection methods.
• We mitigate over-suppression with a pipeline ensuring uncompromised input for separation.
DOI:	10.1016/j.specom.2024.103162
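
The two pipeline orderings named in the summary can be illustrated with a toy sketch. This is not the authors' model: `enhance` and `separate` are hypothetical stand-ins that apply hand-made T–F masks to a magnitude spectrogram, and the sketch assumes that "priority" means the prioritized module runs first. It only shows how EPP and SPP differ in the order of composition.

```python
import numpy as np

def enhance(spec):
    """Toy enhancement (hypothetical): a soft T-F mask in [0, 1) that
    attenuates bins whose energy is low relative to the mean."""
    noise_floor = spec.mean()
    mask = spec / (spec + noise_floor)
    return spec * mask

def separate(spec):
    """Toy two-speaker separation (hypothetical): complementary T-F masks
    that sum to 1 in every bin, so the sources add up to the input."""
    n_freq = spec.shape[0]
    ramp = np.linspace(0.0, 1.0, n_freq)[:, None]  # shape (n_freq, 1)
    masks = [1.0 - ramp, ramp]
    return [spec * m for m in masks]

def epp(mixture):
    """Enhancement priority pipeline (assumed order): enhance, then separate."""
    return separate(enhance(mixture))

def spp(mixture):
    """Separation priority pipeline (assumed order): separate, then enhance
    each estimated source."""
    return [enhance(source) for source in separate(mixture)]

# Small non-negative "magnitude spectrogram": 5 frequency bins x 4 frames.
rng = np.random.default_rng(0)
mixture = rng.random((5, 4)) + 0.1

epp_sources = epp(mixture)
spp_sources = spp(mixture)
```

In this toy setting, the separator in `spp` sees the unmodified mixture, whereas in `epp` it sees a spectrogram already attenuated by the enhancement mask. That is the kind of situation the summary's remark about over-suppression and uncompromised input for separation refers to.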