Efficient time-domain speech separation using short encoded sequence network


Published in: Speech Communication, Vol. 166, p. 103150
Main authors: Liu, Debang; Zhang, Tianqi; Christensen, Mads Græsbøll; Ma, Baoze; Deng, Pan
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.01.2025
ISSN: 0167-6393
Description
Summary: The key to single-channel time-domain speech separation lies in encoding the mixed speech into latent feature representations with an encoder and obtaining accurate estimates of the target speaker masks with a separation network. Although advanced separation networks help separate the target speech, the limitations of the time-domain encoder–decoder framework mean that these models commonly improve separation performance by using a small encoder convolution kernel size to lengthen the encoded sequence, which increases the model's computational complexity and training cost. Therefore, in this paper, we propose an efficient time-domain speech separation model using a short-sequence encoder–decoder framework (ESEDNet). In this model, we construct a novel encoder–decoder framework to accommodate short encoded sequences: the encoder consists of multiple convolution and downsampling operations that reduce the length of the high-resolution sequence, while the decoder uses the encoded features to reconstruct the fine-detailed speech sequence of the target speaker. Because the output sequence of the encoder is shorter, ESEDNet, when combined with our proposed multi-temporal resolution Transformer separation network (MTRFormer), can efficiently obtain separation masks for the short encoded feature sequence. Experiments show that, compared with previous state-of-the-art (SOTA) methods, ESEDNet is more efficient in terms of computational complexity, training speed and GPU memory usage, while maintaining competitive separation performance.
• We introduce an encoder-decoder framework that ensures a short encoded sequence while achieving excellent separation performance.
• We design the separation network, which combines with an encoder-decoder network to achieve effective target source separation.
• ESEDNet has a smaller model size and lower training cost, and is easy to extend to other networks.
DOI: 10.1016/j.specom.2024.103150
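
The summary's point about kernel size and sequence length can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the function names, the 4-second 8 kHz input, and every kernel size and stride are assumptions chosen only for illustration. It counts the frames produced by an unpadded 1-D convolution, chains several such layers to mimic repeated convolution and downsampling, and uses the squared sequence length as a rough proxy for the cost of full self-attention (the actual MTRFormer cost may differ).

```python
from typing import Sequence, Tuple


def encoded_length(num_samples: int, kernel_size: int, stride: int) -> int:
    """Frames produced by an unpadded 1-D convolution over num_samples."""
    if num_samples < kernel_size:
        return 0
    return (num_samples - kernel_size) // stride + 1


def stacked_length(num_samples: int, layers: Sequence[Tuple[int, int]]) -> int:
    """Length after a chain of (kernel_size, stride) convolution layers."""
    length = num_samples
    for kernel_size, stride in layers:
        length = encoded_length(length, kernel_size, stride)
    return length


def attention_pairs(length: int) -> int:
    """Pairwise interactions in full self-attention (quadratic in length)."""
    return length * length


if __name__ == "__main__":
    samples = 4 * 8000  # assumed: 4 s of audio at 8 kHz

    # Assumed settings: a very small kernel versus three stride-2 stages.
    long_seq = encoded_length(samples, kernel_size=2, stride=1)
    short_seq = stacked_length(samples, [(4, 2), (4, 2), (4, 2)])

    print("small-kernel encoder frames:", long_seq)
    print("stacked downsampling frames:", short_seq)
    print("attention cost ratio:", attention_pairs(long_seq) / attention_pairs(short_seq))
```

Under these assumed settings the stacked encoder hands the separation network a sequence roughly eight times shorter, which is the kind of reduction the summary credits for the lower computational and memory cost.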