Data Driven Grapheme-to-Phoneme Representations for a Lexicon-Free Text-to-Speech

Bibliographic Details
Published in:	Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 11091 - 11095
Main Authors:	Garg, Abhinav; Kim, Jiyeon; Khyalia, Sushil; Kim, Chanwoo; Gowda, Dhananjaya
Format:	Conference Proceeding
Language:	English
Published:	IEEE 14.04.2024
Subjects:	Acoustics; data-driven G2P; Grapheme-to-Phoneme; lexicon-free TTS; Linguistics; Self-supervised learning; Signal processing; Speech processing; Text-to-Speech
ISSN:	2379-190X

Abstract	Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. First, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Second, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as well as, or even marginally better than, the conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS), while using no prior language lexicon or phoneme set, i.e., no linguistic expertise.
Author	Garg, Abhinav; Kim, Chanwoo; Gowda, Dhananjaya; Khyalia, Sushil; Kim, Jiyeon
Author_xml	
– sequence: 1; givenname: Abhinav; surname: Garg; fullname: Garg, Abhinav; email: abhinavg@stanford.edu; organization: Stanford University
– sequence: 2; givenname: Jiyeon; surname: Kim; fullname: Kim, Jiyeon; email: jstacey7.kim@samsung.com; organization: Samsung Research
– sequence: 3; givenname: Sushil; surname: Khyalia; fullname: Khyalia, Sushil; email: skhyalia@andrew.cmu.edu; organization: Carnegie Mellon University
– sequence: 4; givenname: Chanwoo; surname: Kim; fullname: Kim, Chanwoo; email: chanwcom@korea.ac.kr; organization: Korea University
– sequence: 5; givenname: Dhananjaya; surname: Gowda; fullname: Gowda, Dhananjaya; email: d.gowda@samsung.com; organization: Samsung Research
ContentType	Conference Proceeding
DBID	6IE 6IH CBEJK RIE RIO
DOI	10.1109/ICASSP48485.2024.10446275
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present
Database_xml	– sequence: 1 dbid: RIE name: IEEE Xplore url: https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
Discipline	Engineering
EISBN	9798350344851
EISSN	2379-190X
EndPage	11095
ExternalDocumentID	10446275
Genre	orig-research
GroupedDBID	23M 6IE 6IF 6IH 6IK 6IL 6IM 6IN AAJGR AAWTH ABLEC ACGFS ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP IPLJI M43 OCL RIE RIL RIO RNS
ID	FETCH-LOGICAL-i242t-ffdf9f4cf581da5f65a2fe91f936a86b294beb782cb7faaec80a2a89ad47eb613
IEDL.DBID	RIE
ISICitedReferencesCount	0
IngestDate	Wed Aug 27 02:33:51 EDT 2025
IsPeerReviewed	false
IsScholarly	true
Language	English
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-i242t-ffdf9f4cf581da5f65a2fe91f936a86b294beb782cb7faaec80a2a89ad47eb613
PageCount	5
ParticipantIDs	ieee_primary_10446275
PublicationCentury	2000
PublicationDate	2024-04-14
PublicationDateYYYYMMDD	2024-04-14
PublicationDate_xml	– month: 04 year: 2024 text: 2024-04-14 day: 14
PublicationDecade	2020
PublicationTitle	Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998)
PublicationTitleAbbrev	ICASSP
PublicationYear	2024
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssj0008748
Score	2.2741132
SourceID	ieee
SourceType	Publisher
StartPage	11091
Title	Data Driven Grapheme-to-Phoneme Representations for a Lexicon-Free Text-to-Speech
URI	https://ieeexplore.ieee.org/document/10446275