Preserving Speaker Identity in Speech-to-Speech Translation: An Exploration of Attention-Based Approaches
Effective speech-to-speech translation (STST) requires not only accurate linguistic conversion but also preservation of the speaker's unique vocal identity. This paper investigates the efficacy of attention-based encoder-decoder architectures in achieving this goal. The impact of incorp...
| Published in: | International Conference on Computing Communication Control and Automation (Online), pp. 1-6 |
|---|---|
| Main authors: | Jaybhaye, S. M; Lale, Yogesh; Kulkarni, Parth; Diwnale, Tanvi; Kota, Apurva |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 23.08.2024 |
| Subject: | |
| ISSN: | 2771-1358 |
| Online access: | Get full text |
| Abstract | Effective speech-to-speech translation (STST) requires not only accurate linguistic conversion but also preservation of the speaker's unique vocal identity. This paper investigates the efficacy of attention-based encoder-decoder architectures in achieving this goal. The impact of incorporating speaker embeddings through various attention mechanisms is explored, including speaker-aware self-attention, cross-attention with speaker embeddings, and a dedicated speaker attention module within the decoder. Using the CVSS multilingual dataset, the approach is evaluated through objective metrics (BLEU, WER, speaker recognition accuracy, cosine similarity, FID) and subjective human perception studies. The results demonstrate that dedicated speaker attention and cross-attention mechanisms within the decoder significantly enhance speaker identity preservation without compromising translation accuracy. These results pave the way for the development of STST systems that deliver both accurate content and natural, personalized communication experiences. |
|---|---|
| Author | Kulkarni, Parth; Jaybhaye, S. M; Lale, Yogesh; Kota, Apurva; Diwnale, Tanvi |
| Author_xml | – sequence: 1 givenname: S. M surname: Jaybhaye fullname: Jaybhaye, S. M email: sangita.jaybhaye@vit.edu organization: Vishwakarma Institute of Technology,Computer Science of Engineering (Artificial Intelligence),Pune,India – sequence: 2 givenname: Yogesh surname: Lale fullname: Lale, Yogesh email: yogesh.lale22@vit.edu organization: Vishwakarma Institute of Technology,Computer Science of Engineering (Artificial Intelligence),Pune,India – sequence: 3 givenname: Parth surname: Kulkarni fullname: Kulkarni, Parth email: parth.kulkarni22@vit.edu organization: Vishwakarma Institute of Technology,Computer Science of Engineering (Artificial Intelligence),Pune,India – sequence: 4 givenname: Tanvi surname: Diwnale fullname: Diwnale, Tanvi email: tanvi.diwnale22@vit.edu organization: Vishwakarma Institute of Technology,Computer Science of Engineering (Artificial Intelligence),Pune,India – sequence: 5 givenname: Apurva surname: Kota fullname: Kota, Apurva email: apurva.kota22@vit.edu organization: Vishwakarma Institute of Technology,Computer Science of Engineering (Artificial Intelligence),Pune,India |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/ICCUBEA61740.2024.10774862 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 9798350391770 |
| EISSN | 2771-1358 |
| EndPage | 6 |
| ExternalDocumentID | 10774862 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IF 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK OCL RIE RIL |
| ID | FETCH-LOGICAL-i91t-4668b582a1bd9c87a1d0b8b8982a1f6d5d7a0a9b47196bf29ea3e8bcbf486d233 |
| IEDL.DBID | RIE |
| IngestDate | Wed Jan 15 06:21:30 EST 2025 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i91t-4668b582a1bd9c87a1d0b8b8982a1f6d5d7a0a9b47196bf29ea3e8bcbf486d233 |
| PageCount | 6 |
| ParticipantIDs | ieee_primary_10774862 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-Aug.-23 |
| PublicationDateYYYYMMDD | 2024-08-23 |
| PublicationDate_xml | – month: 08 year: 2024 text: 2024-Aug.-23 day: 23 |
| PublicationDecade | 2020 |
| PublicationTitle | International Conference on Computing Communication Control and Automation (Online) |
| PublicationTitleAbbrev | ICCUBEA |
| PublicationYear | 2024 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0003204065 |
| Score | 1.8804108 |
| Snippet | Effective speech-to-speech translation (STST) requires not only accurate linguistic conversion but also preservation of the speaker's unique vocal identity.... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1 |
| SubjectTerms | Accuracy; Attention mechanisms; Automation; Bridges; Computer architecture; Decoding; encoder-decoder architectures; Focusing; Measurement; speaker embeddings; speaker identity; Speaker recognition; speech-to-speech translation |
| Title | Preserving Speaker Identity in Speech-to-Speech Translation: An Exploration of Attention-Based Approaches |
| URI | https://ieeexplore.ieee.org/document/10774862 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| link | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV09T8MwELUAMTABoohveWB1SezUsdnSqhUsVSWK1K3yxxkipLRqUyT-PbaTFjEwsDk3RMnZzsud37tD6N4HOCoxGRDLjCKZ1IYowxRJbGqFos4vJB2bTeTjsZjN5KQVq0ctDABE8hl0wzCe5duF2YRUmd_h_mdFhC_ufp7zRqy1S6gw6tcj77WFRdNEPjwPBq_9YeExOkt8JEiz7vYGv1qpRCQZHf_zGU5Q50eThyc7tDlFe1CdoTIwKMJur97wyxLUB6xwK739wmUVbGDeSb0gzQhHaGrob4-4qHBDwYvXeOFwUdcN_ZH0PbpZXLQVx2HdQdPRcDp4Im3zBFLKtCYZ50L3BFWpttKIXKU20UILGUyO257NVaKk9tgkuXZUgmIgtNHOv5yljJ2jg2pRwQXClGXUx995onOVOcMlpBIgNY4L73TrLlEnuGm-bMpjzLceuvrDfo2OwmSExCxlN-igXm3gFh2az7pcr-7ipH4D3nSk0g |
| linkProvider | IEEE |
| linkToHtml | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV3PS8MwFA4yBT2pOPG3OXjNbNO0Tbx1Y2PDOQZO2G3kx6sWoR1bJ_jfm7R14sGDt_QdSvuS9Ot7-b73ELqzAY70NANiAi0JE0oTqQNJPOMbLmlqF5Kqmk3Ekwmfz8W0EatXWhgAqMhn0HHD6izfFHrjUmV2h9ufFe6-uLshY9Sr5VrblEpA7YqMwqa0qO-J-1Gv99LtJxalmWdjQco637f41UylwpLB4T-f4gi1f1R5eLrFm2O0A_kJyhyHwu33_BU_L0G-wwo34ttPnOXOBvqNlAWpR7gCp5oA94CTHNckvOoaFylOyrImQJKuxTeDk6bmOKzbaDboz3pD0rRPIJnwS8KiiKuQU-krIzSPpW88xRUXzpRGJjSx9KRQFp1EpFIqQAbAlVapfTlDg-AUtfIihzOEacCojcBjT8WSpToS4AsAX6cRt0436TlqOzctlnWBjMW3hy7-sN-i_eHsabwYjyaPl-jATYxL09LgCrXK1Qau0Z7-KLP16qaa4C-eqKgZ |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=International+Conference+on+Computing+Communication+Control+and+Automation+%28Online%29&rft.atitle=Preserving+Speaker+Identity+in+Speech-to-Speech+Translation%3A+An+Exploration+of+Attention-Based+Approaches&rft.au=Jaybhaye%2C+S.+M&rft.au=Lale%2C+Yogesh&rft.au=Kulkarni%2C+Parth&rft.au=Diwnale%2C+Tanvi&rft.date=2024-08-23&rft.pub=IEEE&rft.eissn=2771-1358&rft.spage=1&rft.epage=6&rft_id=info:doi/10.1109%2FICCUBEA61740.2024.10774862&rft.externalDocID=10774862 |
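
The abstract names cosine similarity as one of the objective measures of speaker identity preservation and cross-attention with speaker embeddings as one of the mechanisms studied. The record does not describe how the paper implements either, so the following is only a minimal sketch of the two general techniques in plain Python. The function names `cosine_similarity` and `speaker_cross_attention`, the toy vectors, and the use of a single unprojected query are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors.

    In a speaker-preservation setting, `a` and `b` would be embeddings of
    the source and the translated speech (hypothetical usage).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def speaker_cross_attention(query, speaker_embeddings):
    """Scaled dot-product attention of one decoder query over speaker embeddings.

    Assumption: no learned projections; the embeddings serve as both keys
    and values. Returns (attention_weights, context_vector).
    """
    d = len(query)
    # Similarity of the query to each speaker embedding, scaled by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, emb)) / math.sqrt(d)
        for emb in speaker_embeddings
    ]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the speaker embeddings.
    context = [
        sum(w * emb[i] for w, emb in zip(weights, speaker_embeddings))
        for i in range(d)
    ]
    return weights, context


if __name__ == "__main__":
    source_embedding = [0.2, 0.7, 0.1]      # toy values, not real data
    translated_embedding = [0.25, 0.65, 0.05]
    print(cosine_similarity(source_embedding, translated_embedding))

    weights, context = speaker_cross_attention(
        [1.0, 0.0, 0.0], [source_embedding, translated_embedding]
    )
    print(weights, context)
```

A similarity close to 1 would indicate that the two embeddings point in nearly the same direction, which is the usual reading of this metric when comparing voices. In a trained model the query, keys, and values would normally pass through learned linear projections first.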