A comprehensive study on supervised single-channel noisy speech separation with multi-task learning

Bibliographic details
Published in:	Speech Communication, Vol. 167, p. 103162
Main authors:	Dang, Shaoxiang; Matsumoto, Tetsuya; Takeuchi, Yoshinori; Kudo, Hiroaki
Format:	Journal Article
Language:	English
Published:	Elsevier B.V., 1 February 2025
Subjects:	Multi-task learning; Separation priority pipeline; Speech enhancement; Speech separation; Supervised learning
ISSN:	0167-6393

Description
Summary:	This research presents a comprehensive investigation and comparison of noisy speech separation methods using multi-task learning. First, we categorize all methods into two pipelines, the enhancement priority pipeline (EPP) and the separation priority pipeline (SPP), according to whether enhancement or separation is prioritized. Next, we divide each pipeline into a shared encoder–decoder scheme (SEDS) and an independent encoder–decoder scheme (IEDS), depending on whether the two modules share the same encoder and decoder. Additionally, we introduce two types of intermediate structures between the two modules: one uses time–frequency (T–F) representations, while the other uses T–F masks. This article elaborates on the strengths and weaknesses of each approach, particularly in mitigating over-suppression and improving computational efficiency. Our experiments show substantial improvements in SPP with IEDS across multiple metrics on the LibriXmix dataset. In addition, by replacing the synthesis-based trick in the enhancement module, the model achieves superior generalization on the LibriCSS dataset.
• We extend the SEDS structure for SE and SS by transitioning features to masks.
• We propose negative gradient modulation as a simpler alternative to projection methods.
• We mitigate over-suppression with a pipeline ensuring uncompromised input for separation.
DOI:	10.1016/j.specom.2024.103162