Attention Based Convolutional Neural Network with Multi-frequency Resolution Feature for Environment Sound Classification

Bibliographic Details
Published in: Neural Processing Letters, Vol. 55, No. 4, pp. 4291-4306
Main Authors: Li, Minze; Huang, Wu; Zhang, Tao
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.08.2023
ISSN: 1370-4621, 1573-773X
DOI: 10.1007/s11063-022-11041-y
Description
Summary: Environmental sound classification has great research significance in fields such as intelligent audio monitoring. This paper proposes a novel multi-frequency resolution (MFR) feature to address the problem that existing single-frequency-resolution time-frequency features cannot effectively express the characteristics of many types of sound. The MFR feature is composed of three features with different frequency resolutions, each compressed to a different degree along the time dimension; this not only acts as a form of data augmentation but also captures more context during feature extraction. The MFR features of the Log-Mel Spectrogram, Cochleagram, and Constant-Q Transform are then combined into a multi-channel MFR feature. In addition, a network named SacNet is built to cope with the large amount of uninformative content in time-frequency feature maps of sound. Its basic structural unit consists of two parallel branches: one uses depthwise separable convolution as the main feature extractor, and the other uses a spatial attention module to extract more useful information. Experiments demonstrate that the proposed method achieves state-of-the-art accuracies of 97.5%, 93.1%, and 95.3% on the ESC10, ESC50, and UrbanSound8K benchmark datasets, improvements of 3.3%, 0.5%, and 2.3%, respectively, over previous advanced methods.
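
The record gives no implementation details, but the MFR construction described in the summary can be sketched. Below is a minimal Python illustration (librosa, scipy), assuming three STFT window sizes with proportional hop lengths and a linear resize of the time axis so the three resolutions can be stacked as channels; these parameters and the resize strategy are assumptions, not the paper's. Only the Log-Mel channel is shown: the Constant-Q channel would use librosa.cqt analogously, and the Cochleagram channel would need a gammatone filterbank from a separate package.

```python
import numpy as np
import librosa
from scipy.ndimage import zoom

def log_mel(y, sr, n_fft, hop, n_mels=128):
    """Log-Mel spectrogram at one frequency resolution."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

def mfr_feature(y, sr, target_frames=128):
    """Stack three frequency resolutions as channels of one feature map."""
    configs = [(1024, 512), (2048, 1024), (4096, 2048)]  # (n_fft, hop): assumed values
    channels = []
    for n_fft, hop in configs:
        S = log_mel(y, sr, n_fft, hop)
        # Larger hops yield fewer frames ("compressed at the time dimension");
        # resize the time axis so all resolutions align for stacking.
        S = zoom(S, (1, target_frames / S.shape[1]), order=1)
        channels.append(S)
    return np.stack(channels, axis=0)  # shape: (3, n_mels, target_frames)

y, sr = librosa.load(librosa.ex('trumpet'))  # any mono clip works here
print(mfr_feature(y, sr).shape)              # (3, 128, 128)
```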
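
The two-branch basic unit of SacNet can be sketched in the same hedged spirit. The PyTorch code below pairs a depthwise separable convolution branch with a CBAM-style spatial attention branch whose sigmoid mask reweights the convolutional features; the fusion rule, kernel sizes, and channel counts are guesses, since the record does not specify them.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool over channels, conv, sigmoid mask."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)   # (B, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class SacUnit(nn.Module):
    """Assumed basic unit: DS-conv features reweighted by a spatial mask."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch_conv = DepthwiseSeparableConv(in_ch, out_ch)
        self.branch_attn = SpatialAttention()

    def forward(self, x):
        # The mask (broadcast over channels) can suppress the uninformative
        # time-frequency regions the abstract calls "invalid information".
        return self.branch_conv(x) * self.branch_attn(x)

x = torch.randn(4, 3, 128, 128)  # a batch of 3-channel MFR features
print(SacUnit(3, 64)(x).shape)   # torch.Size([4, 64, 128, 128])
```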