Data Driven Grapheme-to-Phoneme Representations for a Lexicon-Free Text-to-Speech

Bibliographic Details
Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 11091 - 11095
Main Authors: Garg, Abhinav, Kim, Jiyeon, Khyalia, Sushil, Kim, Chanwoo, Gowda, Dhananjaya
Format: Conference Proceeding
Language: English
Published: IEEE 14.04.2024
ISSN: 2379-190X
Abstract Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. First, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Second, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as well as, or even marginally better than, the conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise.
Author Garg, Abhinav
Kim, Chanwoo
Gowda, Dhananjaya
Khyalia, Sushil
Kim, Jiyeon
Author_xml – sequence: 1
  givenname: Abhinav
  surname: Garg
  fullname: Garg, Abhinav
  email: abhinavg@stanford.edu
  organization: Stanford University
– sequence: 2
  givenname: Jiyeon
  surname: Kim
  fullname: Kim, Jiyeon
  email: jstacey7.kim@samsung.com
  organization: Samsung Research
– sequence: 3
  givenname: Sushil
  surname: Khyalia
  fullname: Khyalia, Sushil
  email: skhyalia@andrew.cmu.edu
  organization: Carnegie Mellon University
– sequence: 4
  givenname: Chanwoo
  surname: Kim
  fullname: Kim, Chanwoo
  email: chanwcom@korea.ac.kr
  organization: Korea University
– sequence: 5
  givenname: Dhananjaya
  surname: Gowda
  fullname: Gowda, Dhananjaya
  email: d.gowda@samsung.com
  organization: Samsung Research
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/ICASSP48485.2024.10446275
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISBN 9798350344851
EISSN 2379-190X
EndPage 11095
ExternalDocumentID 10446275
Genre orig-research
ID FETCH-LOGICAL-i242t-ffdf9f4cf581da5f65a2fe91f936a86b294beb782cb7faaec80a2a89ad47eb613
IEDL.DBID RIE
ISICitedReferencesCount 0
IngestDate Wed Aug 27 02:33:51 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i242t-ffdf9f4cf581da5f65a2fe91f936a86b294beb782cb7faaec80a2a89ad47eb613
PageCount 5
ParticipantIDs ieee_primary_10446275
PublicationCentury 2000
PublicationDate 2024-04-14
PublicationDateYYYYMMDD 2024-04-14
PublicationDate_xml – month: 04
  year: 2024
  text: 2024-04-14
  day: 14
PublicationDecade 2020
PublicationTitle Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998)
PublicationTitleAbbrev ICASSP
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0008748
Score 2.2741132
SourceID ieee
SourceType Publisher
StartPage 11091
SubjectTerms Acoustics
data-driven G2P
Grapheme-to-Phoneme
lexicon-free TTS
Linguistics
Self-supervised learning
Signal processing
Speech processing
Text-to-Speech
Title Data Driven Grapheme-to-Phoneme Representations for a Lexicon-Free Text-to-Speech
URI https://ieeexplore.ieee.org/document/10446275