Joint Speech-Text Embeddings for Multitask Speech Processing
Saved in:
| Title: | Joint Speech-Text Embeddings for Multitask Speech Processing |
|---|---|
| Authors: | Michael Gian Gonzales, Peter Corcoran, Naomi Harte, Michael Schukat |
| Contributors: | Science Foundation Ireland, University of Galway Research Repository |
| Source: | IEEE Access, Vol. 12, pp. 145955-145967 (2024) |
| Publisher: | Institute of Electrical and Electronics Engineers (IEEE), 2024 |
| Publication year: | 2024 |
| Subjects: | joint speech-text, voice conversion, speaker recognition, automatic speech recognition, text-to-speech, speech processing, Electrical engineering. Electronics. Nuclear engineering (TK1-9971) |
| Description: | Devices that use speech as the communication medium between humans and computers have been emerging for the past few years. The technologies behind this interface are Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), two distinct fields in speech signal processing that have independently made great strides in recent years. This paper proposes an architecture that takes advantage of the two modalities present in ASR and TTS, speech and text, while simultaneously training three tasks, adding speaker recognition to the underlying ASR and TTS tasks. This architecture not only reduces the memory footprint required to run all tasks, but also achieves performance comparable to single-task models. The model is trained and evaluated on the CSTR VCTK Corpus. Results show 97.64% accuracy on the speaker recognition task; word and character error rates of 18.18% and 7.95% for the ASR task; and, for the TTS task, a mel cepstral distortion of 4.31 and two predicted MOS of 2.98 and 3.28. While voice conversion is not among the training tasks, the architecture is capable of it, achieving a mel cepstral distortion of 5.22 and predicted MOS of 2.98 and 2.73. |
| Document type: | Article |
| File description: | application/pdf |
| ISSN: | 2169-3536 |
| DOI: | 10.1109/access.2024.3473743 |
| DOI (repository): | 10.13025/29260 |
| Access URL: | https://doaj.org/article/351d1fbfd3be4abfa53e00946ba37d3e https://hdl.handle.net/10379/18466 https://doi.org/10.13025/29260 |
| Rights: | CC BY |
| Accession number: | edsair.doi.dedup.....f81c0c4471abe7983f81f8c178d5a7a8 |
| Database: | OpenAIRE |