Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders
| Published in: | Proceedings / IEEE International Conference on Computer Vision, pp. 11273-11282 |
|---|---|
| Main authors: | Jing Li (Harbin Institute of Technology, Shenzhen), Di Kang (Tencent AI Lab), Wenjie Pei (Harbin Institute of Technology, Shenzhen), Xuefei Zhe (Tencent AI Lab), Ying Zhang (Tencent AI Lab), Zhenyu He (Harbin Institute of Technology, Shenzhen), Linchao Bao (Tencent AI Lab) |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 01.10.2021 |
| Topics: | Action and behavior recognition; Bicycles; Codes; Computer vision; Correlation; Gestures and body pose; Speech coding; Three-dimensional displays; Training; Vision + other modalities |
| ISSN: | 2380-7504 |
| Online access: | https://ieeexplore.ieee.org/document/9710107 |
| Abstract: | Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network that facilitates random sampling, along with other techniques including relaxed motion loss, bicycle constraint, and diversity loss, is designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures. |
| CODEN: | IEEPAD |
| DOI: | 10.1109/ICCV48922.2021.01110 |
| EISBN: | 9781665428125; 1665428120 |
| Cited by (Web of Science): | 64 |
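The following PyTorch sketch illustrates the split-latent-code idea described in the abstract: the motion latent is divided into a shared code, which the audio must be able to predict, and a motion-specific code, which is sampled via a mapping network at inference time to produce diverse gestures. Everything here is an illustrative assumption, not the authors' implementation: the per-frame MLPs (the real model operates on sequences), all module names and dimensions, and the loss weights are placeholders, and the relaxed motion loss, bicycle constraint, and diversity loss mentioned in the abstract are omitted. The authors' actual code is linked from the project page above.

```python
# Minimal sketch of a conditional VAE with a split latent code (assumed
# architecture for illustration only; see the authors' released code for
# their actual model).
import torch
import torch.nn as nn

AUDIO_DIM, MOTION_DIM, SHARED_DIM, SPECIFIC_DIM = 64, 51, 16, 16

class Audio2GesturesSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Audio encoder: predicts only the shared code (the audio-correlated
        # content, e.g. beat/rhythm information).
        self.audio_enc = nn.Sequential(nn.Linear(AUDIO_DIM, 128), nn.ReLU(),
                                       nn.Linear(128, SHARED_DIM))
        # Motion encoder: yields a shared code plus a Gaussian posterior over
        # the motion-specific code (the VAE part).
        self.motion_enc = nn.Sequential(nn.Linear(MOTION_DIM, 128), nn.ReLU())
        self.to_shared = nn.Linear(128, SHARED_DIM)
        self.to_mu = nn.Linear(128, SPECIFIC_DIM)
        self.to_logvar = nn.Linear(128, SPECIFIC_DIM)
        # Mapping network: maps unit-Gaussian noise to plausible
        # motion-specific codes so we can sample at inference time.
        self.mapping = nn.Sequential(nn.Linear(SPECIFIC_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, SPECIFIC_DIM))
        # Decoder: reconstructs motion from shared + specific codes.
        self.dec = nn.Sequential(nn.Linear(SHARED_DIM + SPECIFIC_DIM, 128),
                                 nn.ReLU(), nn.Linear(128, MOTION_DIM))

    def forward(self, audio, motion):
        h = self.motion_enc(motion)
        z_shared_m = self.to_shared(h)                 # shared code from motion
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z_specific = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparam.
        z_shared_a = self.audio_enc(audio)             # shared code from audio
        recon = self.dec(torch.cat([z_shared_a, z_specific], dim=-1))
        return recon, z_shared_a, z_shared_m, mu, logvar

    @torch.no_grad()
    def sample(self, audio):
        # Inference: the shared code comes from audio; the specific code is
        # drawn from noise via the mapping network, so each call on the same
        # audio can yield a different gesture.
        z_shared = self.audio_enc(audio)
        z_specific = self.mapping(torch.randn(audio.shape[0], SPECIFIC_DIM))
        return self.dec(torch.cat([z_shared, z_specific], dim=-1))

# Illustrative training step (loss weights are placeholder assumptions):
# reconstruction + KL on the specific code + alignment of the two shared codes.
model = Audio2GesturesSketch()
audio, motion = torch.randn(8, AUDIO_DIM), torch.randn(8, MOTION_DIM)
recon, z_a, z_m, mu, logvar = model(audio, motion)
loss = (nn.functional.mse_loss(recon, motion)
        + 0.01 * (-0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(-1).mean())
        + nn.functional.mse_loss(z_a, z_m))
loss.backward()
```

Calling `model.sample(audio)` repeatedly on the same audio returns different motions, which is the point of separating the motion-specific code: the deterministic audio-to-shared-code path keeps the motion synchronized with the speech, while the sampled code carries the one-to-many diversity.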