A noise-robust voice conversion method with controllable background sounds
| Title: | A noise-robust voice conversion method with controllable background sounds |
|---|---|
| Authors: | Lele Chen, Xiongwei Zhang, Yihao Li, Meng Sun, Weiwei Chen |
| Source: | Complex & Intelligent Systems, Vol 10, Iss 3, Pp 3981-3994 (2024) |
| Publisher Information: | Springer Science and Business Media LLC, 2024. |
| Publication Year: | 2024 |
| Subject Terms: | Noise-robust voice conversion, Speech disentanglement, Dual-decoder structure, Bridge module, Cycle loss, Electronic computers. Computer science (QA75.5-76.95), Information technology (T58.5-58.64), 01 natural sciences, 0103 physical sciences, 02 engineering and technology, 0202 electrical engineering, electronic engineering, information engineering |
| Description: | Background noise is usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate clean speech prior to the conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion module. In this paper, a noise-robust voice conversion model is proposed in which a user can freely choose to retain or remove the background sounds. Firstly, a speech separation module with a dual-decoder structure is proposed, where two decoders decode the denoised speech and the background sounds, respectively. A bridge module captures the interactions between the denoised speech and the background sounds in parallel layers through information exchange. Subsequently, a voice conversion module with multiple encoders converts the clean speech estimated by the separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function combining cycle loss and mutual information loss, aiming to improve the decoupling among speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines: the speech naturalness and speaker similarity of the converted speech reach 3.47 and 3.43, respectively. (A minimal illustrative sketch of the dual-decoder separation design follows this record.) |
| Document Type: | Article |
| Language: | English |
| ISSN: | 2198-6053, 2199-4536 |
| DOI: | 10.1007/s40747-024-01375-6 |
| Access URL: | https://doaj.org/article/7aa48dec97f54bab9a1585965cf39cdd |
| Rights: | CC BY |
| Accession Number: | edsair.doi.dedup.....60a4b7cc7565d8d8339d5c766c3cd0ec |
| Database: | OpenAIRE |
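
The abstract outlines a separation front end with a shared encoder and two decoders, one for the denoised speech and one for the background sounds, whose parallel layers exchange information through a bridge module. Below is a minimal PyTorch sketch of that dual-decoder idea only, not the authors' implementation: the layer widths, the linear bridge design, and the mel-spectrogram input shape are all illustrative assumptions, since the abstract does not specify them.

```python
# Illustrative sketch (not the paper's code): a shared encoder feeds two
# parallel decoder stacks -- speech and background -- and a bridge module
# exchanges information between the two streams at each layer.
import torch
import torch.nn as nn


class Bridge(nn.Module):
    """Exchanges information between the two decoder streams at one layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.speech_from_noise = nn.Linear(dim, dim)
        self.noise_from_speech = nn.Linear(dim, dim)

    def forward(self, h_speech, h_noise):
        # Each stream receives a learned, gated projection of the other.
        return (h_speech + torch.tanh(self.speech_from_noise(h_noise)),
                h_noise + torch.tanh(self.noise_from_speech(h_speech)))


class DualDecoderSeparator(nn.Module):
    """Shared encoder, two decoders, bridges between parallel layers."""

    def __init__(self, n_mels: int = 80, dim: int = 256, n_layers: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU())
        self.speech_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_layers)])
        self.noise_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_layers)])
        self.bridges = nn.ModuleList([Bridge(dim) for _ in range(n_layers)])
        self.speech_out = nn.Linear(dim, n_mels)   # denoised-speech decoder head
        self.noise_out = nn.Linear(dim, n_mels)    # background-sound decoder head

    def forward(self, noisy_mel):
        h = self.encoder(noisy_mel)
        h_s, h_n = h, h
        for speech_layer, noise_layer, bridge in zip(
                self.speech_layers, self.noise_layers, self.bridges):
            h_s, h_n = speech_layer(h_s), noise_layer(h_n)
            h_s, h_n = bridge(h_s, h_n)  # information exchange between streams
        return self.speech_out(h_s), self.noise_out(h_n)


if __name__ == "__main__":
    x = torch.randn(4, 120, 80)              # (batch, frames, mel bins)
    speech, background = DualDecoderSeparator()(x)
    print(speech.shape, background.shape)    # both torch.Size([4, 120, 80])
```

In the paper's full pipeline, the two outputs of such a separator would feed the multi-encoder conversion module, with both parts trained jointly under the combined cycle and mutual-information loss; the sketch stops at separation because the abstract does not detail the conversion encoders or the loss weighting.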