Generating Diverse and Natural 3D Human Motions from Text

Detailed Bibliography
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 5142-5151
Main authors: Guo, Chuan; Zou, Shihao; Zuo, Xinxin; Wang, Sen; Ji, Wei; Li, Xingyu; Cheng, Li
Format: Conference paper
Language: English
Published: IEEE, 01.06.2022
ISSN: 1063-6919
Description
Abstract: Automated generation of 3D human motions from text is a challenging problem. The generated motions are expected to be sufficiently diverse to explore the text-grounded motion space and, more importantly, to accurately depict the content of the prescribed text descriptions. Here we tackle this problem with a two-stage approach: text2length sampling and text2motion generation. Text2length involves sampling from the learned distribution function of motion lengths conditioned on the input text. This is followed by our text2motion module, which uses a temporal variational autoencoder to synthesize a diverse set of human motions of the sampled lengths. Instead of directly engaging with pose sequences, we propose the motion snippet code as our internal motion representation, which captures local semantic motion contexts and is empirically shown to facilitate the generation of plausible motions faithful to the input text. Moreover, a large-scale dataset of scripted 3D human motions, HumanML3D, is constructed, consisting of 14,616 motion clips and 44,970 text descriptions.
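
The abstract describes a two-stage pipeline: text2length sampling followed by text2motion generation over motion snippet codes. Below is a minimal PyTorch sketch of that inference flow; every module name, dimension, and the GRU-based decoder are illustrative assumptions and do not reproduce the authors' released implementation (for simplicity, each snippet code here decodes to a single pose frame).

import torch
import torch.nn as nn


class Text2Length(nn.Module):
    # Stage 1: predict a categorical distribution over discretized motion
    # lengths conditioned on a text embedding, then sample a length from it.
    def __init__(self, text_dim=512, num_length_bins=50):
        super().__init__()
        self.head = nn.Linear(text_dim, num_length_bins)

    def sample_length(self, text_emb):
        logits = self.head(text_emb)                      # (B, num_length_bins)
        return torch.distributions.Categorical(logits=logits).sample() + 1


class Text2Motion(nn.Module):
    # Stage 2: a temporal-VAE-style decoder that autoregressively produces
    # motion snippet codes (local motion context vectors) and maps each one
    # to a pose frame of the output sequence.
    def __init__(self, text_dim=512, latent_dim=128, snippet_dim=256, pose_dim=263):
        super().__init__()
        self.latent_dim = latent_dim
        self.gru = nn.GRUCell(text_dim + latent_dim, snippet_dim)
        self.snippet_to_pose = nn.Linear(snippet_dim, pose_dim)

    def generate(self, text_emb, num_frames):
        B = text_emb.size(0)
        h = torch.zeros(B, self.gru.hidden_size)          # running snippet code
        frames = []
        for _ in range(num_frames):
            z = torch.randn(B, self.latent_dim)           # per-step latent sample
            h = self.gru(torch.cat([text_emb, z], dim=-1), h)
            frames.append(self.snippet_to_pose(h))
        return torch.stack(frames, dim=1)                 # (B, num_frames, pose_dim)


# Usage: embed the text with any sentence encoder (a random vector stands in
# here), sample a plausible motion length, then generate a motion of that length.
text_emb = torch.randn(1, 512)
t2l, t2m = Text2Length(), Text2Motion()
length = int(t2l.sample_length(text_emb).item())          # e.g. 27 frames
motion = t2m.generate(text_emb, length)
print(motion.shape)                                       # e.g. torch.Size([1, 27, 263])

Sampling a length first lets the per-step latent variables account only for motion content, which is one way to read the abstract's claim that diverse motions are generated at the sampled lengths.
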
DOI: 10.1109/CVPR52688.2022.00509