F0-Consistent Many-To-Many Non-Parallel Voice Conversion Via Conditional Autoencoder
| Published in: | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 6284 - 6288 |
|---|---|
| Main Authors: | Qian, Kaizhi; Jin, Zeyu; Hasegawa-Johnson, Mark; Mysore, Gautham J. |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.05.2020 |
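| Subjects: | Acoustics; autoencoder; Conferences; F0-conversion; Generative adversarial networks; Signal processing; Speech processing; Task analysis; Tuning; voice-conversion; WaveNet-vocoder |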
| ISSN: | 2379-190X |
| Online Access: | https://ieeexplore.ieee.org/document/9054734 |
| Abstract | Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Many style-transfer-inspired methods, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), have been proposed. Recently, AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity and speech content using information-constraining bottlenecks, and it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice. However, we found that while speaker identity is disentangled from speech content, a significant amount of prosodic information, such as source F0, leaks through the bottleneck, causing the target F0 to fluctuate unnaturally. Furthermore, AutoVC has no control over the converted F0 and is thus unsuitable for many applications. In this paper, we modify and improve autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time. As a result, we can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity. We support our improvement through quantitative and qualitative analysis. |
|---|---|
| Author | Hasegawa-Johnson, Mark; Jin, Zeyu; Mysore, Gautham J.; Qian, Kaizhi |
| Author_xml | 1. Qian, Kaizhi (University of Illinois at Urbana-Champaign, IL, USA); 2. Jin, Zeyu (Adobe Research, CA, USA); 3. Hasegawa-Johnson, Mark (University of Illinois at Urbana-Champaign, IL, USA); 4. Mysore, Gautham J. (Adobe Research, CA, USA) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ICASSP40776.2020.9054734 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present |
| Discipline | Engineering |
| EISBN | 9781509066315 1509066314 |
| EISSN | 2379-190X |
| EndPage | 6288 |
| ExternalDocumentID | 9054734 |
| Genre | orig-research |
| ISICitedReferencesCount | 69 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 5 |
| PublicationCentury | 2000 |
| PublicationDate | 2020-May |
| PublicationDateYYYYMMDD | 2020-05-01 |
| PublicationDate_xml | – month: 05 year: 2020 text: 2020-May |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) |
| PublicationTitleAbbrev | ICASSP |
| PublicationYear | 2020 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 6284 |
| SubjectTerms | Acoustics; autoencoder; Conferences; F0-conversion; Generative adversarial networks; Signal processing; Speech processing; Task analysis; Tuning; voice-conversion; WaveNet-vocoder |
| Title | F0-Consistent Many-To-Many Non-Parallel Voice Conversion Via Conditional Autoencoder |
| URI | https://ieeexplore.ieee.org/document/9054734 |
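
Note on the abstract: the goal of generating speech "with F0 consistent with the target speaker" can be illustrated with a log-domain mean-variance transformation of the F0 contour. This is a widely used baseline technique, swapped in here for illustration only; the record does not describe how the paper itself conditions on or normalizes F0. The function names `logf0_stats` and `convert_f0`, the convention that unvoiced frames are marked with 0, and the sample contours are all assumptions made for this sketch.

```python
import math
from statistics import mean, pstdev


def logf0_stats(f0_hz):
    """Return (mean, std) of log-F0 over voiced frames.

    Assumption: unvoiced frames are marked with F0 <= 0.
    """
    voiced = [math.log(f) for f in f0_hz if f > 0]
    return mean(voiced), pstdev(voiced)


def convert_f0(src_f0_hz, src_stats, tgt_stats):
    """Map a source F0 contour onto the target speaker's log-F0 statistics.

    Each voiced frame is z-normalized with the source statistics and then
    rescaled with the target statistics; unvoiced frames stay at 0.
    """
    src_mu, src_sigma = src_stats
    tgt_mu, tgt_sigma = tgt_stats
    converted = []
    for f in src_f0_hz:
        if f <= 0:
            converted.append(0.0)  # keep unvoiced frames unvoiced
            continue
        # Guard against a flat source contour (zero variance).
        z = (math.log(f) - src_mu) / src_sigma if src_sigma > 0 else 0.0
        converted.append(math.exp(tgt_mu + tgt_sigma * z))
    return converted


# Hypothetical contours in Hz: a lower-pitched source, a higher-pitched target.
source = [0.0, 110.0, 120.0, 130.0, 0.0, 115.0]
target = [210.0, 0.0, 230.0, 250.0, 220.0]
converted = convert_f0(source, logf0_stats(source), logf0_stats(target))
```

The converted contour keeps the shape of the source melody (rises stay rises) while its log-F0 mean and spread match those of the target speaker. In an autoencoder-based system, such a contour would typically be one of the inputs supplied to the decoder alongside the content code and the speaker embedding; that wiring is not shown here.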