Fusion-ConvBERT: Parallel Convolution and BERT Fusion for Speech Emotion Recognition

Bibliographic details
Published in:	Sensors (Basel, Switzerland) Vol. 20; no. 22; p. 6688
Main authors:	Lee, Sanghyun; Han, David K.; Ko, Hanseok
Format:	Journal Article
Language:	English
Published:	Switzerland: MDPI AG, 23 November 2020
Subjects:	Accuracy; bidirectional encoder representations from transformers (BERT); convolutional neural networks (CNNs); Deep learning; Emotions; Experiments; fusion model; Humans; Neural networks; Neural Networks, Computer; representation; Signal processing; spatiotemporal representation; Speech; speech emotion recognition; transformer
ISSN:	1424-8220

Description
Abstract:	Speech emotion recognition predicts the emotional state of a speaker based on the person’s speech. It brings an additional element for creating more natural human–computer interactions. Earlier studies on emotional recognition have been primarily based on handcrafted features and manual labels. With the advent of deep learning, there have been some efforts in applying the deep-network-based approach to the problem of emotion recognition. As deep learning automatically extracts salient features correlated to speaker emotion, it brings certain advantages over the handcrafted-feature-based methods. There are, however, some challenges in applying them to the emotion recognition problem, because data required for properly training deep networks are often lacking. Therefore, there is a need for a new deep-learning-based approach which can exploit available information from given speech signals to the maximum extent possible. Our proposed method, called “Fusion-ConvBERT”, is a parallel fusion model consisting of bidirectional encoder representations from transformers and convolutional neural networks. Extensive experiments were conducted on the proposed model using the EMO-DB and Interactive Emotional Dyadic Motion Capture Database emotion corpus, and it was shown that the proposed method outperformed state-of-the-art techniques in most of the test configurations.
DOI:	10.3390/s20226688