Preserving Speaker Identity in Speech-to-Speech Translation: An Exploration of Attention-Based Approaches

Published in: International Conference on Computing Communication Control and Automation (Online), pp. 1-6
Main authors: Jaybhaye, S. M; Lale, Yogesh; Kulkarni, Parth; Diwnale, Tanvi; Kota, Apurva
Format: Conference paper
Language: English
Publication details: IEEE, 23 August 2024
ISSN: 2771-1358
Abstract: Effective speech-to-speech translation (STST) requires not only accurate linguistic conversion but also preservation of the speaker's unique vocal identity. This paper investigates the efficacy of attention-based encoder-decoder architectures in achieving this goal. It explores the impact of incorporating speaker embeddings through various attention mechanisms, including speaker-aware self-attention, cross-attention with speaker embeddings, and a dedicated speaker attention module within the decoder. Using the CVSS multilingual dataset, the approach is rigorously evaluated through objective metrics (BLEU, WER, speaker recognition accuracy, cosine similarity, FID) and subjective human perception studies. The results demonstrate that dedicated speaker attention and cross-attention mechanisms within the decoder significantly enhance speaker identity preservation without compromising translation accuracy. These results pave the way for the development of STST systems that deliver both accurate content and natural, personalized communication experiences.
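Note on the metrics: two of the objective metrics named in the abstract, cosine similarity and WER, have standard textbook definitions that can be sketched in a few lines. The Python below is an illustrative sketch only and is not taken from the paper. The function names `cosine_similarity` and `word_error_rate` are hypothetical, embeddings are assumed to be plain lists of floats, and WER is assumed to be computed on whitespace-separated words; the authors' actual evaluation code and speaker-embedding model are not described in this record.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

In a speaker-preservation setting, the first function would typically be applied to an embedding of the source speaker and an embedding of the translated output, with values closer to 1 suggesting a closer vocal match; the second compares a reference transcript with a recognized or translated one, with lower values being better.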
Author affiliation (all authors): Vishwakarma Institute of Technology, Computer Science of Engineering (Artificial Intelligence), Pune, India
Author contacts (in author order):
– Jaybhaye, S. M: sangita.jaybhaye@vit.edu
– Lale, Yogesh: yogesh.lale22@vit.edu
– Kulkarni, Parth: parth.kulkarni22@vit.edu
– Diwnale, Tanvi: tanvi.diwnale22@vit.edu
– Kota, Apurva: apurva.kota22@vit.edu
DOI: 10.1109/ICCUBEA61740.2024.10774862
EISBN: 9798350391770
EISSN: 2771-1358
Subject terms: Accuracy; Attention mechanisms; Automation; Bridges; Computer architecture; Decoding; encoder-decoder architectures; Focusing; Measurement; speaker embeddings; speaker identity; Speaker recognition; speech-to-speech translation
URI: https://ieeexplore.ieee.org/document/10774862