Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors


Bibliographic Details
Published in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5274-5278
Main Authors: Saito, Yuki; Ijima, Yusuke; Nishida, Kyosuke; Takamichi, Shinnosuke
Format: Conference Proceeding
Language: English
Published: IEEE 01.04.2018
Subjects:
ISSN: 2379-190X
Abstract: This paper proposes novel frameworks for non-parallel voice conversion (VC) using variational autoencoders (VAEs). Although conventional VAE-based VC models can be trained on non-parallel speech corpora with given speaker representations, the phonetic content of the converted speech tends to vanish because of an over-regularization issue often observed in the latent variables of VAEs. To overcome this issue, the paper proposes VAE-based non-parallel VC conditioned not only on the speaker representations but also on the phonetic content of speech, represented as phonetic posteriorgrams (PPGs). Because the phonetic content is given during training, the VC models can be expected to learn speaker-independent latent features of speech effectively. Building on this, the paper also extends conventional VAE-based non-parallel VC to many-to-many VC, which can convert the characteristics of an arbitrary speaker into those of another arbitrary speaker. We investigate two methods for estimating speaker representations for speakers not included in the speech corpora used to train the VC models: 1) adapting conventional speaker codes, and 2) using d-vectors as the speaker representations. Experimental results demonstrate that 1) PPGs improve both the naturalness and the speaker similarity of the converted speech, and 2) both speaker codes and d-vectors can be adopted in VAE-based many-to-many non-parallel VC.
Author Nishida, Kyosuke
Ijima, Yusuke
Takamichi, Shinnosuke
Saito, Yuki
Author_xml – sequence: 1
  givenname: Yuki
  surname: Saito
  fullname: Saito, Yuki
  organization: Graduate School of Information Science and Technology, The University of Tokyo
– sequence: 2
  givenname: Yusuke
  surname: Ijima
  fullname: Ijima, Yusuke
  organization: NTT Media Intelligence Laboratories, NTT Corporation, Japan
– sequence: 3
  givenname: Kyosuke
  surname: Nishida
  fullname: Nishida, Kyosuke
  organization: NTT Media Intelligence Laboratories, NTT Corporation, Japan
– sequence: 4
  givenname: Shinnosuke
  surname: Takamichi
  fullname: Takamichi, Shinnosuke
  organization: NTT Media Intelligence Laboratories, NTT Corporation, Japan
ContentType Conference Proceeding
DOI 10.1109/ICASSP.2018.8461384
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISBN 9781538646588
1538646587
EISSN 2379-190X
EndPage 5278
ExternalDocumentID 8461384
Genre orig-research
ISICitedReferencesCount 99
IngestDate Wed Aug 27 02:54:52 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 5
ParticipantIDs ieee_primary_8461384
PublicationCentury 2000
PublicationDate 2018-04
PublicationDateYYYYMMDD 2018-04-01
PublicationDate_xml – month: 04
  year: 2018
  text: 2018-04
PublicationDecade 2010
PublicationTitle 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublicationTitleAbbrev ICASSP
PublicationYear 2018
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 5274
SubjectTerms Backpropagation algorithms
d-vectors
Decoding
Focusing
Graphical models
many-to-many VC
phonetic posteriorgrams
Phonetics
Speech coding
Training
VAE-based non-parallel VC
Title Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors
URI https://ieeexplore.ieee.org/document/8461384