A Residual Multi-Scale Convolutional Neural Network With Transformers for Speech Emotion Recognition.

Saved in:
Detailed bibliography
Title: A Residual Multi-Scale Convolutional Neural Network With Transformers for Speech Emotion Recognition.
Authors: Yan, Tianhao; Meng, Hao; Parada-Cabaleiro, Emilia; Tao, Jianhua; Li, Taihao; Schuller, Björn W.
Source: IEEE Transactions on Affective Computing; Apr-Jun 2025, Vol. 16, Issue 2, p915-932, 18p
Abstract: The great variety of human emotional expression, as well as the differences in how people perceive and annotate emotions, make Speech Emotion Recognition (SER) an ambiguous and challenging task. With the development of deep learning, long-term progress has been made in SER systems. However, existing convolutional neural networks present certain limitations, such as their inability to capture global features well, even though these features contain important emotional information. Moreover, the position encoding in the Transformer structure is relatively fixed and encodes only the time-domain dimension, so it cannot effectively obtain the position information of discriminative features in the frequency-domain dimension. To overcome these limitations, we propose an end-to-end Residual Multi-Scale Convolutional Neural Network (RMSCNN) with Transformer network. Additionally, to further validate the effectiveness of RMSCNN in extracting multi-scale features and delivering pertinent emotion localization data, we developed the RMSC_down network in conjunction with the Wav2Vec 2.0 model. The prediction results for Arousal, Valence, and Dominance on popular corpora demonstrate the superiority and robustness of our approach for SER, showing an improvement in recognition accuracy on the public dataset MSP-Podcast version 1.9. [ABSTRACT FROM AUTHOR]
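The record does not give the paper's exact architecture, but the abstract's central idea of a residual multi-scale convolution can be illustrated with a minimal PyTorch sketch: parallel convolutions with different kernel sizes capture emotional cues at several receptive-field scales over a spectrogram-like input, and a skip connection preserves the original features. The class name, kernel sizes, and fusion layer below are illustrative assumptions, not the authors' RMSCNN design.

```python
import torch
import torch.nn as nn

class ResidualMultiScaleBlock(nn.Module):
    """Illustrative sketch (not the paper's RMSCNN): parallel convolutions
    with different kernel sizes extract multi-scale features, a 1x1
    convolution fuses them, and a residual connection preserves the input."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One branch per kernel size; odd kernels with k//2 padding keep
        # the time-frequency dimensions unchanged.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the multi-scale branch outputs along the channel axis,
        # fuse them back to the input width, and add the skip connection.
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return torch.relu(self.fuse(multi_scale) + x)

# Example: a batch of 4 spectrogram-like inputs (channels, frequency, time).
block = ResidualMultiScaleBlock(channels=16)
out = block(torch.randn(4, 16, 64, 128))
print(out.shape)  # torch.Size([4, 16, 64, 128])
```

Because the block preserves its input shape, several such blocks can be stacked before a Transformer encoder, which is the general arrangement the abstract describes.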
Copyright of IEEE Transactions on Affective Computing is the property of IEEE and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
ISSN: 1949-3045
DOI: 10.1109/TAFFC.2024.3481253