Multimodal Multi-View Spectral-Spatial-Temporal Masked Autoencoder for Self-Supervised Emotion Recognition

Bibliographic Details
Published in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), pp. 1926-1930
Authors: Gao, Pengxuan; Liu, Tianyu; Liu, Jia-Wen; Lu, Bao-Liang; Zheng, Wei-Long
Format: Conference paper
Language: English
Published: IEEE, 14 April 2024
ISSN: 2379-190X
Online access: Full text
Description
Abstract: Emotion recognition is a primary and complex task in emotional intelligence. Because of the complexity of human emotions, multimodal fusion methods can enhance performance by leveraging the complementary properties of different modalities. In this paper, we propose a Multimodal Multi-view Spectral-Spatial-Temporal Masked Autoencoder (Multimodal MV-SSTMA) with self-supervised learning to investigate multimodal emotion recognition based on electroencephalogram (EEG) and eye movement signals. Our experimental process comprises three stages: 1) in the pre-training stage, we employ MV-SSTMA to train feature extractors for EEG and eye movement signals; 2) in the fine-tuning stage, labeled data are fed into the feature extractors to fuse and fine-tune the features; 3) in the testing stage, the model recognizes emotions on test data to compute the accuracies of the different methods. Our experimental results demonstrate that the multimodal fusion model outperforms the unimodal model on both the SEED-IV and SEED-V datasets. In addition, the proposed model can still effectively recognize emotions under various ratios of missing data. These results underscore the effectiveness of multimodal self-supervised learning and data fusion in emotion recognition.
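
As a rough illustration of the three-stage process described in the abstract, the sketch below pre-trains per-modality masked autoencoders on unlabeled EEG and eye-movement features, fine-tunes a fused classifier on labeled data, and evaluates accuracy on held-out data. It is a minimal PyTorch-style sketch under stated assumptions: the module names (MaskedAutoencoder), the feature dimensions (e.g., 310-dimensional EEG and 33-dimensional eye-movement features), and the simple concatenation fusion are illustrative choices, not the authors' released implementation.

# Minimal sketch of the three-stage pipeline; names and dimensions are assumptions,
# not the paper's actual architecture or code.
import torch
import torch.nn as nn

class MaskedAutoencoder(nn.Module):
    """Placeholder masked-autoencoder feature extractor for one modality."""
    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.decoder = nn.Linear(hid_dim, in_dim)

    def forward(self, x: torch.Tensor, mask_ratio: float = 0.5):
        # Randomly mask a fraction of input features, encode the rest,
        # and reconstruct the full input (self-supervised objective).
        mask = (torch.rand_like(x) > mask_ratio).float()
        z = self.encoder(x * mask)
        recon = self.decoder(z)
        return z, recon, mask

# Stage 1: self-supervised pre-training of per-modality feature extractors.
eeg_enc, eye_enc = MaskedAutoencoder(310), MaskedAutoencoder(33)
eeg, eye = torch.randn(32, 310), torch.randn(32, 33)  # dummy unlabeled batch
opt = torch.optim.Adam(list(eeg_enc.parameters()) + list(eye_enc.parameters()), lr=1e-3)
for _ in range(10):
    _, eeg_rec, _ = eeg_enc(eeg)
    _, eye_rec, _ = eye_enc(eye)
    loss = nn.functional.mse_loss(eeg_rec, eeg) + nn.functional.mse_loss(eye_rec, eye)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: fine-tuning -- fuse the two encoders' features and train a classifier
# on labeled data (simple concatenation here; the paper uses a fusion module).
classifier = nn.Linear(128 * 2, 4)  # e.g., 4 emotion classes as in SEED-IV
labels = torch.randint(0, 4, (32,))  # dummy labels
ft_params = list(eeg_enc.parameters()) + list(eye_enc.parameters()) + list(classifier.parameters())
ft_opt = torch.optim.Adam(ft_params, lr=1e-4)
for _ in range(10):
    z = torch.cat([eeg_enc(eeg, mask_ratio=0.0)[0], eye_enc(eye, mask_ratio=0.0)[0]], dim=-1)
    ce = nn.functional.cross_entropy(classifier(z), labels)
    ft_opt.zero_grad()
    ce.backward()
    ft_opt.step()

# Stage 3: testing -- predict emotions on held-out data and compute accuracy.
with torch.no_grad():
    z = torch.cat([eeg_enc(eeg, mask_ratio=0.0)[0], eye_enc(eye, mask_ratio=0.0)[0]], dim=-1)
    accuracy = (classifier(z).argmax(dim=-1) == labels).float().mean()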
DOI: 10.1109/ICASSP48485.2024.10447194