A vector quantized masked autoencoder for audiovisual speech emotion recognition



Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 257, p. 104362
Main Authors: Sadok, Samir; Leglaive, Simon; Séguier, Renaud
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.06.2025
ISSN: 1077-3142, 1090-235X
Description
Summary: An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder-decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., frame-level) and global (i.e., sequence-level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech to reconstruct randomly masked audiovisual speech tokens, combined with a contrastive learning strategy. Through this pre-training, the encoder learns to extract a representation of audiovisual speech that can subsequently be leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.

Highlights:
• We present a self-supervised model for audiovisual speech emotion recognition.
• The model operates on discrete audiovisual speech tokens.
• A multimodal masked autoencoder with attention fuses the audio and visual modalities.
• The model achieves state-of-the-art audiovisual speech emotion recognition results.
• Ablation studies reveal the importance of each model component.
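As a rough illustration of the masked-token reconstruction objective described in the summary, the sketch below trains a toy Transformer encoder to predict randomly masked discrete audio and visual tokens. This is not the authors' implementation: the class name, vocabulary sizes, dimensions, and masking ratio are placeholders, and the VQ-VAE tokenizers, the contrastive learning strategy, the decoder details, and the local/global representation heads from the paper are omitted.

```python
# Illustrative sketch only (assumed placeholders, not the VQ-MAE-AV code):
# masked reconstruction of discrete audiovisual tokens with a shared encoder.
import torch
import torch.nn as nn

class ToyAVMaskedAutoencoder(nn.Module):
    def __init__(self, audio_vocab=512, visual_vocab=512, dim=256, depth=4, heads=4):
        super().__init__()
        self.audio_embed = nn.Embedding(audio_vocab, dim)
        self.visual_embed = nn.Embedding(visual_vocab, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.audio_head = nn.Linear(dim, audio_vocab)    # predicts masked audio tokens
        self.visual_head = nn.Linear(dim, visual_vocab)  # predicts masked visual tokens

    def forward(self, audio_tokens, visual_tokens, mask_ratio=0.5):
        # Embed the discrete tokens of both modalities and concatenate the sequences.
        a = self.audio_embed(audio_tokens)
        v = self.visual_embed(visual_tokens)
        x = torch.cat([a, v], dim=1)
        # Randomly replace a fraction of positions with a learned mask token.
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        h = self.encoder(x)
        # Split the encoded sequence back per modality and predict original token ids.
        na = audio_tokens.shape[1]
        logits_a = self.audio_head(h[:, :na])
        logits_v = self.visual_head(h[:, na:])
        mask_a, mask_v = mask[:, :na], mask[:, na:]
        # Reconstruction loss is computed on masked positions only.
        loss_a = nn.functional.cross_entropy(logits_a[mask_a], audio_tokens[mask_a])
        loss_v = nn.functional.cross_entropy(logits_v[mask_v], visual_tokens[mask_v])
        return loss_a + loss_v

# Example: random index sequences standing in for VQ-VAE codebook tokens.
model = ToyAVMaskedAutoencoder()
audio = torch.randint(0, 512, (2, 40))
visual = torch.randint(0, 512, (2, 40))
print(model(audio, visual).item())
```

In the paper, the representation learned by such an encoder is then reused: a small classifier is trained on top of the (frozen or fine-tuned) encoder for the downstream emotion recognition task.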
DOI: 10.1016/j.cviu.2025.104362