Attention based convolutional recurrent neural network for environmental sound classification

Bibliographic Details
Published in:	Neurocomputing (Amsterdam) Vol. 453; pp. 896 - 903
Main Authors:	Zhang, Zhichao, Xu, Shugong, Zhang, Shunqing, Qiao, Tianhao, Cao, Shan
Format:	Journal Article
Language:	English
Published:	Elsevier B.V. 17.09.2021
Subjects:	Attention mechanism; Convolutional recurrent neural network; Environmental sound classification
ISSN:	0925-2312, 1872-8286

Description
Summary:	• We employ an attention model to automatically focus on the semantically relevant frames for ESC. • We propose a novel convolutional RNN model to analyze temporal relations for ESC. • We apply a data augmentation pipeline to further improve performance for ESC.
Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The classification performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model that focuses on the semantically relevant and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend our convolutional RNN model with a frame-level attention mechanism to learn discriminative feature representations for ESC. We investigated the classification performance when using different attention scaling functions and when applying attention at different layers. Experiments were conducted on the ESC-50 and ESC-10 datasets. Experimental results demonstrated the effectiveness of the proposed method, which achieved state-of-the-art or competitive classification accuracy with lower computational complexity. We also visualized our attention results and observed that the proposed attention mechanism was able to lead the network to focus on the semantically relevant parts of environmental sounds.
DOI:	10.1016/j.neucom.2020.08.069
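
Illustration (not part of the record): the frame-level attention mentioned in the summary can be pictured as weighted pooling over time. The sketch below is a minimal, assumed formulation, not the authors' implementation: the scoring vector `w`, the dot-product scoring, and the choice of softmax as the scaling function are all hypothetical stand-ins, and the toy `frames` values are made up for demonstration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Collapse T frame vectors into one clip-level vector.

    frames: list of T feature vectors of length D (e.g. per-frame RNN outputs)
    w:      hypothetical learned scoring vector of length D (an assumption)
    """
    # One scalar relevance score per frame: dot product with w.
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frames]
    # Softmax scaling (assumed) turns scores into weights that sum to 1.
    weights = softmax(scores)
    # Weighted sum over time: high-weight frames dominate the result.
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Toy input: a silent frame, a strongly active frame, a weakly active frame.
frames = [[0.0, 0.0], [1.0, 2.0], [0.1, 0.0]]
w = [1.0, 1.0]
pooled, weights = attention_pool(frames, w)
```

In this toy case the second frame receives the largest weight and the all-zero frame the smallest, which mirrors the idea of down-weighting silent or irrelevant frames before classification.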