Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Bibliographic Details
Published in: Proceedings / IEEE International Conference on Computer Vision, pp. 11273-11282
Main authors: Li, Jing; Kang, Di; Pei, Wenjie; Zhe, Xuefei; Zhang, Ying; He, Zhenyu; Bao, Linchao
Format: Conference paper
Language: English
Publication details: IEEE, 01.10.2021
Subjects: Action and behavior recognition; Bicycles; Codes; Computer vision; Correlation; Gestures and body pose; Speech coding; Three-dimensional displays; Training; Vision + other modalities
ISSN: 2380-7504
Online access: Get full text
Abstract Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. To overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network facilitating random sampling, along with other techniques including a relaxed motion loss, a bicycle constraint, and a diversity loss, is designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, both quantitatively and qualitatively. Finally, we demonstrate that our method can readily be used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures.
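To make the split-latent idea in the abstract concrete, here is a minimal PyTorch-style sketch, not the authors' implementation (their code is linked above): the class name SplitLatentVAE, the layer sizes, and the frame-level (non-sequential) encoders are illustrative assumptions, and the paper's mapping network, relaxed motion loss, bicycle constraint, and diversity loss are omitted.

```python
# Minimal sketch (assumed shapes and sizes) of a conditional VAE whose motion
# latent code is split into an audio-predictable "shared" part and a freely
# sampled "motion-specific" part, per the abstract's description.
import torch
import torch.nn as nn

class SplitLatentVAE(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, audio_dim=128, motion_dim=64,
                 shared_dim=16, specific_dim=16):
        super().__init__()
        self.shared_dim, self.specific_dim = shared_dim, specific_dim
        latent_dim = shared_dim + specific_dim
        # Motion encoder outputs mean and log-variance for the full latent code.
        self.motion_enc = nn.Sequential(
            nn.Linear(motion_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim))
        # Audio encoder predicts only the shared half, so audio-correlated
        # structure (e.g. synchronized beats) is pushed into that part.
        self.audio_enc = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, shared_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, motion_dim))

    def forward(self, audio, motion):
        mu, logvar = self.motion_enc(motion).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        z_shared, z_spec = z.split([self.shared_dim, self.specific_dim], dim=-1)
        shared_from_audio = self.audio_enc(audio)  # trained to match z_shared
        recon = self.decoder(torch.cat([z_shared, z_spec], dim=-1))
        return recon, mu, logvar, shared_from_audio, z_shared

    @torch.no_grad()
    def generate(self, audio):
        # Inference: shared code from audio, specific code sampled at random,
        # so one audio input can yield many different plausible motions.
        z_spec = torch.randn(audio.size(0), self.specific_dim)
        return self.decoder(torch.cat([self.audio_enc(audio), z_spec], dim=-1))
```

Training such a model would presumably combine a reconstruction loss and a KL term on (mu, logvar) with a term pulling shared_from_audio toward z_shared; the paper's relaxed motion loss, bicycle constraint, and diversity loss refine this basic recipe, and resampling z_spec at inference is what produces diverse gestures for the same audio.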
Author Li, Jing (Harbin Institute of Technology, Shenzhen)
Author Kang, Di (Tencent AI Lab)
Author Pei, Wenjie (Harbin Institute of Technology, Shenzhen)
Author Zhe, Xuefei (Tencent AI Lab)
Author Zhang, Ying (Tencent AI Lab)
Author He, Zhenyu (Harbin Institute of Technology, Shenzhen; email: zhenyuhe@hit.edu.cn)
Author Bao, Linchao (Tencent AI Lab)
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ICCV48922.2021.01110
Discipline Applied Sciences
EISBN 9781665428125
1665428120
EISSN 2380-7504
EndPage 11282
ExternalDocumentID 9710107
Genre orig-research
ISICitedReferencesCount 64
Language English
PageCount 10
PublicationDate 2021-Oct.
PublicationTitle Proceedings / IEEE International Conference on Computer Vision
PublicationTitleAbbrev ICCV
PublicationYear 2021
Publisher IEEE
StartPage 11273
SubjectTerms Action and behavior recognition
Bicycles
Codes
Computer vision
Correlation
Gestures and body pose
Speech coding
Three-dimensional displays
Training
Vision + other modalities
Title Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders
URI https://ieeexplore.ieee.org/document/9710107