SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement

Bibliographic details
Published in:	Neural Processing Letters, Vol. 55, No. 6, pp. 7529-7542
Main authors:	Yang, Shuai; Qiao, Kai; Shi, Shuhao; Yang, Jie; Ma, Dekui; Hu, Guoen; Yan, Bin; Chen, Jian
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 1 December 2023; Springer Nature B.V.
Subjects:	Artificial Intelligence; Audio data; Complex Systems; Computational Intelligence; Computer Science; Deep learning; Design; Encoders-Decoders; Feature extraction; Head; Head movement; Methods; Motion perception; Mouth; Realism; Semantics; Speech; Synchronism; Talking; Talking face generation; Feature learning; Generative adversarial networks; Encoder-decoder
ISSN:	1370-4621, 1573-773X

Description
Abstract:	Talking face generation is widely used in education, entertainment, shopping, and other social settings. Existing methods focus on matching the speaker’s mouth shape to the speech content, but little research addresses automatically extracting potential head motion features from speech, so the generated results lack naturalness. This paper proposes SATFace, a subject agnostic talking face generation method with natural head movement. To model the complicated and critical features of a talking face (identity, background, mouth shape, head posture, etc.), SATFace uses an encoder-decoder as its primary network architecture. A long short-time feature learning network is then designed to make better use of the global and local information in the audio when generating reasonable head movement. In addition, a modular training process is proposed to improve the effectiveness and efficiency of learning explicit and implicit features. Experimental comparisons show that SATFace improves by at least about 9.8% in cumulative probability of blur detection and 8.2% in synchronization confidence over mainstream methods. Mean opinion scores show that SATFace has advantages in lip sync quality, head movement naturalness, and video realness.
DOI:	10.1007/s11063-023-11272-7