Multimodal Emotion Recognition based on Face and Speech using Deep Convolution Neural Network and Long Short Term Memory

Detailed bibliography
Published in: Circuits, Systems, and Signal Processing, Volume 44, Issue 9, pp. 6622-6649
Main authors: Taware, Shwetkranti; Thakare, Anuradha D.
Format: Journal Article
Language: English
Publication details: New York: Springer US, 01.09.2025; Springer Nature B.V.
ISSN: 0278-081X, 1531-5878
Description
Summary: Multimodal emotion recognition (MER) is crucial for analyzing a person's mental behavior and health and for enhancing the performance of human-computer interaction systems. Various deep learning-based MER systems have been presented over the last decade. However, the outcomes of these MER schemes are limited by poor feature representation, weak correlation between short- and long-term features, security issues, low generalization capability, low reliability of emotional modality systems, and the high computational complexity of deep learning models. This paper presents MER based on facial images and speech data using a parallel deep convolution neural network (PDCNN) and bidirectional long short-term memory (BiLSTM) to improve the system's reliability, security, and robustness. The PDCNN provides superior generalization capability and feature representation, while the BiLSTM captures long-term dependencies, temporal representation, and the correlation between the short- and long-term attributes of the multimodal data. A novel hybrid Particle Swarm Optimization based on Multi-Attribute Utility Theory and the Archimedes Optimization Algorithm (PMA) is used to select the crucial features of the facial-expression and speech data, minimizing the computational complexity of the PDCNN-LSTM framework. The method achieves an overall accuracy of 99.22%, precision of 0.9967, recall of 0.9933, and F1-score of 0.9949 for MER on the BAUM dataset, improving on traditional techniques.
DOI: 10.1007/s00034-025-03080-2
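
To illustrate the kind of architecture described in the summary above, the following is a minimal sketch, not the authors' released code, of a parallel-CNN-plus-BiLSTM multimodal emotion classifier. It assumes face frames and speech spectrogram segments are already aligned per time step; the layer sizes, the seven-class output, the input shapes, and all class names are illustrative assumptions rather than values from the paper, and the PMA feature-selection stage is omitted.

```python
# Hedged sketch of a parallel-CNN + BiLSTM multimodal emotion classifier.
# All dimensions and names below are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn


class ModalityCNN(nn.Module):
    """Small CNN branch mapping one modality's frame to a feature vector."""

    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        return self.fc(self.conv(x).flatten(1))


class ParallelCNNBiLSTM(nn.Module):
    """Face and speech branches run in parallel; fused features feed a BiLSTM."""

    def __init__(self, num_classes: int = 7, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.face_cnn = ModalityCNN(in_channels=3, feat_dim=feat_dim)    # RGB face frames
        self.speech_cnn = ModalityCNN(in_channels=1, feat_dim=feat_dim)  # speech spectrograms
        self.bilstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, faces, spectrograms):
        # faces: (B, T, 3, H, W); spectrograms: (B, T, 1, H, W)
        B, T = faces.shape[:2]
        f = self.face_cnn(faces.flatten(0, 1)).view(B, T, -1)
        s = self.speech_cnn(spectrograms.flatten(0, 1)).view(B, T, -1)
        fused = torch.cat([f, s], dim=-1)        # per-step multimodal feature
        out, _ = self.bilstm(fused)              # bidirectional temporal modeling over T steps
        return self.classifier(out[:, -1])       # emotion-class logits from the last step


if __name__ == "__main__":
    model = ParallelCNNBiLSTM()
    faces = torch.randn(2, 8, 3, 64, 64)         # 2 clips, 8 aligned frames each
    specs = torch.randn(2, 8, 1, 64, 64)
    print(model(faces, specs).shape)             # torch.Size([2, 7])
```

The two CNN branches here mirror the "parallel" feature extractors, and the BiLSTM provides the long-term temporal dependency modeling the abstract credits to BiLSTM; any feature-selection step such as PMA would sit between the CNN outputs and the BiLSTM in a faithful reproduction.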