Data Driven Grapheme-to-Phoneme Representations for a Lexicon-Free Text-to-Speech

Bibliographic details
Published in:	Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 11091-11095
Main authors:	Garg, Abhinav; Kim, Jiyeon; Khyalia, Sushil; Kim, Chanwoo; Gowda, Dhananjaya
Format:	Conference paper
Language:	English
Published:	IEEE, 14 April 2024
Subjects:	Acoustics; data-driven G2P; Grapheme-to-Phoneme; lexicon-free TTS; Linguistics; Self-supervised learning; Signal processing; Speech processing; Text-to-Speech
ISSN:	2379-190X

Description
Summary:	Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as well as or even marginally better than conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise.
DOI:	10.1109/ICASSP48485.2024.10446275