Data Driven Grapheme-to-Phoneme Representations for a Lexicon-Free Text-to-Speech
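| Published in: | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998), pp. 11091 - 11095 |
|---|---|
| Main authors: | Abhinav Garg, Jiyeon Kim, Sushil Khyalia, Chanwoo Kim, Dhananjaya Gowda |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 14.04.2024 |
| Subjects: | Acoustics; data-driven G2P; Grapheme-to-Phoneme; lexicon-free TTS; Linguistics; Self-supervised learning; Signal processing; Speech processing; Text-to-Speech |
| ISSN: | 2379-190X |
| Online access: | Get full text |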
| Abstract | Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as well as or even marginally better than the conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise. |
|---|---|
| Author | Garg, Abhinav; Kim, Chanwoo; Gowda, Dhananjaya; Khyalia, Sushil; Kim, Jiyeon |
| Author_xml | 1. Abhinav Garg (Stanford University, abhinavg@stanford.edu); 2. Jiyeon Kim (Samsung Research, jstacey7.kim@samsung.com); 3. Sushil Khyalia (Carnegie Mellon University, skhyalia@andrew.cmu.edu); 4. Chanwoo Kim (Korea University, chanwcom@korea.ac.kr); 5. Dhananjaya Gowda (Samsung Research, d.gowda@samsung.com) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ICASSP48485.2024.10446275 |
| Discipline | Engineering |
| EISBN | 9798350344851 |
| EISSN | 2379-190X |
| EndPage | 11095 |
| ExternalDocumentID | 10446275 |
| Genre | orig-research |
| ISICitedReferencesCount | 0 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 5 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-04-14 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) |
| PublicationTitleAbbrev | ICASSP |
| PublicationYear | 2024 |
| Publisher | IEEE |
| StartPage | 11091 |
| SubjectTerms | Acoustics; data-driven G2P; Grapheme-to-Phoneme; lexicon-free TTS; Linguistics; Self-supervised learning; Signal processing; Speech processing; Text-to-Speech |
| Title | Data Driven Grapheme-to-Phoneme Representations for a Lexicon-Free Text-to-Speech |
| URI | https://ieeexplore.ieee.org/document/10446275 |