Preserving Speaker Identity in Speech-to-Speech Translation: An Exploration of Attention-Based Approaches

Effective speech-to-speech translation (STST) requires not only accurate linguistic conversion but also preservation of the speaker's unique vocal identity. This paper investigates the efficacy of attention-based encoder-decoder architectures in achieving this goal. The impact of incorp...

Bibliographic Details
Published in:	International Conference on Computing Communication Control and Automation (Online), pp. 1-6
Main Authors:	Jaybhaye, S. M; Lale, Yogesh; Kulkarni, Parth; Diwnale, Tanvi; Kota, Apurva
Format:	Conference paper
Language:	English
Published:	IEEE, 23 August 2024
Subjects:	Accuracy; Attention mechanisms; Automation; Bridges; Computer architecture; Decoding; encoder-decoder architectures; Focusing; Measurement; speaker embeddings; speaker identity; Speaker recognition; speech-to-speech translation
ISSN:	2771-1358

Abstract	Effective speech-to-speech translation (STST) requires not only accurate linguistic conversion but also preservation of the speaker's unique vocal identity. This paper investigates the efficacy of attention-based encoder-decoder architectures in achieving this goal. It explores the impact of incorporating speaker embeddings through various attention mechanisms, including speaker-aware self-attention, cross-attention with speaker embeddings, and a dedicated speaker attention module within the decoder. Using the CVSS multilingual dataset, the approach is rigorously evaluated through objective metrics (BLEU, WER, speaker recognition accuracy, cosine similarity, FID) and subjective human perception studies. The results demonstrate that dedicated speaker attention and cross-attention mechanisms within the decoder significantly enhance speaker identity preservation without compromising translation accuracy. These results pave the way for the development of STST systems that deliver both accurate content and natural, personalized communication experiences.
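One of the objective metrics named in the abstract, cosine similarity, can be illustrated with a short sketch. This is not the paper's implementation, which this record does not describe. The function name `cosine_similarity` and the three toy embedding vectors are hypothetical, chosen only to show how similarity between a source-speaker embedding and an output-speech embedding might be scored.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (real speaker embeddings have far more dimensions).
source_speaker = [0.2, 0.7, 0.1]        # embedding of the original speech
translated_speaker = [0.25, 0.65, 0.05]  # embedding of the translated speech
different_speaker = [0.9, 0.1, 0.4]      # embedding of an unrelated voice

same = cosine_similarity(source_speaker, translated_speaker)
other = cosine_similarity(source_speaker, different_speaker)
print(f"source vs. translated: {same:.3f}")
print(f"source vs. different:  {other:.3f}")
```

Under this sketch, a score closer to 1 for the source and translated pair than for the unrelated pair would indicate that the translated speech stays closer to the original voice.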
Author	Kulkarni, Parth; Jaybhaye, S. M; Lale, Yogesh; Kota, Apurva; Diwnale, Tanvi
Author_xml	1. Jaybhaye, S. M; sangita.jaybhaye@vit.edu; Vishwakarma Institute of Technology, Computer Science of Engineering (Artificial Intelligence), Pune, India
	2. Lale, Yogesh; yogesh.lale22@vit.edu; Vishwakarma Institute of Technology, Computer Science of Engineering (Artificial Intelligence), Pune, India
	3. Kulkarni, Parth; parth.kulkarni22@vit.edu; Vishwakarma Institute of Technology, Computer Science of Engineering (Artificial Intelligence), Pune, India
	4. Diwnale, Tanvi; tanvi.diwnale22@vit.edu; Vishwakarma Institute of Technology, Computer Science of Engineering (Artificial Intelligence), Pune, India
	5. Kota, Apurva; apurva.kota22@vit.edu; Vishwakarma Institute of Technology, Computer Science of Engineering (Artificial Intelligence), Pune, India
ContentType	Conference Proceeding
DOI	10.1109/ICCUBEA61740.2024.10774862
EISBN	9798350391770
EISSN	2771-1358
EndPage	6
ExternalDocumentID	10774862
Genre	orig-research
IsPeerReviewed	false
IsScholarly	false
Language	English
PageCount	6
PublicationCentury	2000
PublicationDate	2024-Aug.-23
PublicationDateYYYYMMDD	2024-08-23
PublicationDate_xml	– month: 08 year: 2024 text: 2024-Aug.-23 day: 23
PublicationDecade	2020
PublicationTitle	International Conference on Computing Communication Control and Automation (Online)
PublicationTitleAbbrev	ICCUBEA
PublicationYear	2024
Publisher	IEEE
Publisher_xml	– name: IEEE
SourceID	ieee
SourceType	Publisher
StartPage	1
Title	Preserving Speaker Identity in Speech-to-Speech Translation: An Exploration of Attention-Based Approaches
URI	https://ieeexplore.ieee.org/document/10774862