Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Detailed bibliography
Published in: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1-5
Main authors: Bohy, Hugo; Tran, Minh; El Haddad, Kevin; Dutoit, Thierry; Soleymani, Mohammad
Format: Conference paper
Language: English
Published: IEEE, 27.05.2024
ISSN: 2770-8330
Description
Summary: Human social behaviors are inherently multi-modal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on different social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
DOI: 10.1109/FG59268.2024.10581940
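
To make the masked audiovisual pre-training described in the summary more concrete, the following is a minimal conceptual sketch of random patch masking over video-frame and audio-spectrogram tokens, as done in CAV-MAE-style models. It is not taken from the authors' repository; the shapes, the 75% mask ratio, and all names are illustrative assumptions.

# Conceptual sketch (not the authors' code) of random patch masking
# for an audiovisual masked autoencoder of the CAV-MAE family.
import torch

def random_mask(tokens, mask_ratio):
    # Keep a random subset of patch tokens; return the visible tokens and their indices.
    batch, num_tokens, dim = tokens.shape
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    scores = torch.rand(batch, num_tokens, device=tokens.device)  # random priority per token
    keep_idx = scores.argsort(dim=1)[:, :num_keep]                # lowest scores stay visible
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return kept, keep_idx

# Toy clip: 8 video frames of 196 patches each, plus 512 audio-spectrogram patches.
video_tokens = torch.randn(2, 8 * 196, 768)   # (batch, tokens, embedding dim)
audio_tokens = torch.randn(2, 512, 768)
video_visible, _ = random_mask(video_tokens, mask_ratio=0.75)
audio_visible, _ = random_mask(audio_tokens, mask_ratio=0.75)
# The visible tokens from both modalities would then pass through modality-specific
# and joint transformer encoders, with a decoder reconstructing the masked patches.
print(video_visible.shape, audio_visible.shape)

Extending the number of input frames, as the paper does relative to the original CAV-MAE, would simply enlarge the video token sequence (the 8 in the toy example) while the masking and reconstruction objective stay the same.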