Efficient time-domain speech separation using short encoded sequence network

Bibliographic Details
Published in:	Speech communication Vol. 166; p. 103150
Main Authors:	Liu, Debang, Zhang, Tianqi, Christensen, Mads Græsbøll, Ma, Baoze, Deng, Pan
Format:	Journal Article
Language:	English
Published:	Elsevier B.V. 01.01.2025
Subjects:	Computational complexity, Multi-temporal resolution Transformer, Short sequence encoder–decoder framework, Speech separation
ISSN:	0167-6393

Description
Summary:	The key to single-channel time-domain speech separation lies in encoding the mixed speech into latent feature representations with an encoder and accurately estimating the target speaker masks with a separation network. Although advanced separation networks help separate the target speech, the limitations of the time-domain encoder–decoder framework mean that these models commonly improve separation performance by using a small encoder convolution kernel size, which increases the length of the encoded sequence and, in turn, the computational complexity and training cost of the model. In this paper, we therefore propose an efficient time-domain speech separation model using a short-sequence encoder–decoder framework (ESEDNet). In this model, we construct a novel encoder–decoder framework to accommodate short encoded sequences: the encoder consists of multiple convolution and downsampling operations that reduce the length of the high-resolution sequence, while the decoder uses the encoded features to reconstruct the finely detailed speech sequence of the target speaker. Because the encoder's output sequence is shorter, ESEDNet, when combined with our proposed multi-temporal resolution Transformer separation network (MTRFormer), can efficiently obtain separation masks for the short encoded feature sequence. Experiments show that, compared with previous state-of-the-art (SOTA) methods, ESEDNet is more efficient in terms of computational complexity, training speed, and GPU memory usage, while maintaining competitive separation performance.
• We introduce an encoder–decoder framework that ensures a short encoded sequence while achieving excellent separation performance.
• We design a separation network that combines with the encoder–decoder network to achieve effective target source separation.
• ESEDNet has a smaller model size and lower training cost, and is easy to extend to other networks.
DOI:	10.1016/j.specom.2024.103150
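
The summary's point about sequence length can be illustrated with simple arithmetic. The sketch below is not taken from the paper: the kernel sizes, strides, number of layers, 8 kHz sampling rate, and all function names are assumptions chosen only for illustration. It compares the number of frames produced by a single small-kernel encoder with the number produced by a stack of strided (downsampling) convolutions, and then compares the number of pairwise interactions a full self-attention layer would compute over each sequence, which grows with the square of the sequence length.

```python
def conv1d_out_len(n_samples: int, kernel: int, stride: int) -> int:
    """Frames produced by an unpadded 1-D convolution over n_samples inputs."""
    if n_samples < kernel:
        return 0
    return (n_samples - kernel) // stride + 1


def stacked_out_len(n_samples: int, layers: list[tuple[int, int]]) -> int:
    """Output length after applying several (kernel, stride) layers in order."""
    length = n_samples
    for kernel, stride in layers:
        length = conv1d_out_len(length, kernel, stride)
    return length


def attention_pairs(seq_len: int) -> int:
    """Pairwise interactions in one full self-attention layer (quadratic)."""
    return seq_len * seq_len


# Hypothetical input: 4 seconds of audio at an assumed 8 kHz sampling rate.
n_samples = 4 * 8000

# Hypothetical small-kernel encoder: kernel 2, stride 1 (long encoded sequence).
long_len = conv1d_out_len(n_samples, kernel=2, stride=1)

# Hypothetical downsampling encoder: three layers, each kernel 4, stride 2.
short_len = stacked_out_len(n_samples, [(4, 2), (4, 2), (4, 2)])

ratio = attention_pairs(long_len) / attention_pairs(short_len)
print(f"long encoded sequence:  {long_len} frames")
print(f"short encoded sequence: {short_len} frames")
print(f"attention cost ratio (long / short): {ratio:.1f}x")
```

Under these assumed settings the encoded sequence shrinks by roughly a factor of eight, so the quadratic attention term shrinks by roughly the square of that factor. The actual ESEDNet and MTRFormer configurations are given in the full paper and may differ from this sketch.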