Multi-MelGAN voice conversion for the creation of under-resourced child speech synthesis

Bibliographic details
Title:	Multi-MelGAN voice conversion for the creation of under-resourced child speech synthesis
Authors:	Govender, Avashna
Source:	IST-Africa 2022 Conference Proceedings, Virtual, South Africa, 16-20 May 2022
Publication year:	2022
Collection:	Council for Scientific and Industrial Research (South Africa): CSIR Research Space
Keywords:	Voice conversion, Text-to-speech voices, MelGAN-VC, Melspectrogram Generative Adversarial Network, Speech data
Description:	Voice conversion (VC) is an important technique for developing text-to-speech voices when speech resources are scarce. VC converts an audio signal from a source speaker to a specific target speaker whilst preserving the linguistic information. The benefit of VC is that it requires only a small amount of target data, which makes it possible to build high-quality text-to-speech voices from a limited amount of speech data. In this work, we implement VC using a Melspectrogram Generative Adversarial Network called MelGAN-VC. This technique does not require parallel data and has been proven successful on as little as 1 hour of target speech data. The aim of this work was to build child voices by modifying the original one-to-one MelGAN-VC model into a many-to-many model and to determine whether there is any gain in using such a model. We found that the many-to-many model performs better than the baseline one-to-one model in terms of speaker similarity and naturalness of the output speech when using only 24 minutes of speech data. ; 7 ; Paper presented at the IST-Africa 2022 Conference, Virtual, South Africa, 16-20 May 2022 ; Next Generation Enterprises & Institutions ; Voice Computing
Publication type:	conference object
File description:	Fulltext; application/pdf
Language:	English
Relation:	http://www.ist-africa.org/Conference2022/default.asp?page=schedule-print&schedule.id=4569&schedule.expanded=yes; http://hdl.handle.net/10204/12496; 25900
Availability:	http://hdl.handle.net/10204/12496
Document code:	edsbas.4AB867A0
Database:	BASE
