Joint Speech-Text Embeddings for Multitask Speech Processing

Detailed Bibliography
Title: Joint Speech-Text Embeddings for Multitask Speech Processing
Authors: Michael Gian Gonzales, Peter Corcoran, Naomi Harte, Michael Schukat
Contributors: Science Foundation Ireland, University of Galway Research Repository
Source: IEEE Access, Vol. 12, pp. 145955-145967 (2024)
Publisher Information: Institute of Electrical and Electronics Engineers (IEEE), 2024.
Publication Year: 2024
Topics: joint speech-text, voice conversion, speaker recognition, Automatic speech recognition, Electrical engineering. Electronics. Nuclear engineering, text-to-speech, TK1-9971, speech processing
Description: Devices that use speech as the communication medium between humans and computers have been emerging over the past few years. The technologies behind this interface are Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The two are distinct fields in speech signal processing that have independently made great strides in recent years. This paper proposes an architecture that takes advantage of the two modalities present in ASR and TTS, speech and text, while simultaneously training three tasks, adding speaker recognition to the underlying ASR and TTS tasks. This architecture not only reduces the memory footprint required to run all tasks, but also achieves performance comparable to single-task models. The dataset used to train and evaluate the model is the CSTR VCTK Corpus. Results show 97.64% accuracy on the speaker recognition task, word and character error rates of 18.18% and 7.95% for the ASR task, and a mel cepstral distortion of 4.31 with two predicted MOS scores of 2.98 and 3.28 for the TTS task. While voice conversion is not one of the training tasks, the architecture is capable of performing it and was evaluated at a mel cepstral distortion of 5.22 and predicted MOS scores of 2.98 and 2.73.
Document Type: Article
File Description: application/pdf
ISSN: 2169-3536
DOI: 10.1109/access.2024.3473743
DOI: 10.13025/29260
Access URL: https://doaj.org/article/351d1fbfd3be4abfa53e00946ba37d3e
https://hdl.handle.net/10379/18466
https://doi.org/10.13025/29260
Rights: CC BY
Accession Number: edsair.doi.dedup.....f81c0c4471abe7983f81f8c178d5a7a8
Database: OpenAIRE
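The abstract above describes one architecture whose shared speech-text embedding space feeds three jointly trained task heads: speaker recognition, ASR, and TTS. The record does not describe the paper's actual layers, so the following is a minimal PyTorch sketch of that idea only; the layer sizes, the CTC-style character head, the mean-pooled speaker classifier, and the linear mel decoder are all illustrative assumptions, not the authors' design.

```python
# Minimal sketch of a joint speech-text multitask model (PyTorch).
# All architectural choices below are assumptions for illustration;
# the record only states that speech and text share an embedding
# space and that ASR, TTS, and speaker recognition train jointly.

import torch
import torch.nn as nn


class JointSpeechTextModel(nn.Module):
    def __init__(self, n_mels=80, vocab_size=40, n_speakers=110, d_model=256):
        # n_speakers=110 is a placeholder roughly matching the size of
        # the CSTR VCTK Corpus named in the abstract.
        super().__init__()
        # Both encoders project their modality into the same
        # d_model-dimensional joint embedding space.
        self.speech_encoder = nn.Sequential(
            nn.Linear(n_mels, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.text_encoder = nn.Embedding(vocab_size, d_model)
        # Three task heads share the joint embeddings.
        self.asr_head = nn.Linear(d_model, vocab_size)  # char logits (e.g. CTC)
        self.tts_head = nn.Linear(d_model, n_mels)      # mel-spectrogram frames
        self.spk_head = nn.Linear(d_model, n_speakers)  # speaker logits

    def forward(self, mels=None, tokens=None):
        out = {}
        if mels is not None:                  # (batch, frames, n_mels)
            z = self.speech_encoder(mels)
            out["asr_logits"] = self.asr_head(z)
            out["spk_logits"] = self.spk_head(z.mean(dim=1))  # pool over time
        if tokens is not None:                # (batch, chars)
            z = self.text_encoder(tokens)
            out["pred_mels"] = self.tts_head(z)
        return out


model = JointSpeechTextModel()
out = model(mels=torch.randn(2, 120, 80), tokens=torch.randint(0, 40, (2, 30)))
print({k: tuple(v.shape) for k, v in out.items()})
```

In a setup like this, joint training would sum a loss per head (e.g. CTC for ASR, cross-entropy for the speaker, a spectrogram regression loss for TTS), which is what lets one set of weights serve all three tasks and keeps the memory footprint below that of three single-task models, as the abstract claims.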