A hybrid CNN-LSTM model for environmental sound classification: Leveraging feature engineering and transfer learning

Bibliographic Details
Published in:	Digital signal processing Vol. 163; p. 105234
Main Authors:	Akter, Rubaiya, Islam, Md. Rezwanul, Debnath, Sumon Kumar, Sarker, Prodip Kumar, Uddin, Md. Kamal
Format:	Journal Article
Language:	English
Published:	Elsevier Inc 01.08.2025
Subjects:	Deep learning; Environmental sound classification; Evaluation metrics; Hybrid model; Mel-frequency cepstral coefficients
ISSN:	1051-2004

Description
Summary:	Environmental Sound Classification (ESC) remains a significant challenge due to the dynamic nature of audio data, the scarcity of labeled datasets, and the complexity of feature extraction. Existing deep learning approaches, such as CNNs and RNNs, frequently struggle to manage diverse sound features, achieve real-time processing, and provide interpretability and generalization across varied environments. To overcome these limitations, we propose a novel hybrid CNN-LSTM architecture that integrates CNNs for spatial feature extraction with LSTMs for temporal sequence modeling, enabling superior classification performance. We also present an improved Patch Transformer model that uses self-attention mechanisms to further improve spectro-temporal feature learning. To achieve this, we extract Mel-frequency Cepstral Coefficients (MFCC) and transform them into Mel spectrograms so that the models can accurately represent spectro-temporal information for ESC. We incorporate batch normalization, feature representation maps, and transfer learning to enhance the model's accuracy and generalization. Evaluated on the UrbanSound8K and ESC-50 datasets, our model surpasses conventional CNN, RNN, and ANN approaches, achieving state-of-the-art accuracy. The findings demonstrate the importance of hybrid architectures and optimized feature extraction techniques in ESC. This study provides a robust framework to improve environmental sound recognition, with potential applications in smart surveillance, healthcare monitoring, and urban noise analysis.
DOI:	10.1016/j.dsp.2025.105234
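
As a rough illustration of the feature-extraction step mentioned in the summary, the sketch below computes a log-Mel spectrogram from a waveform and derives MFCCs from it with a DCT-II, which is the standard relationship between the two representations. It is not the authors' pipeline: the function names, the frame size (`n_fft`), hop length, number of Mel bands, number of coefficients, and the synthetic test tone are all assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style Hz -> Mel conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose edges are evenly spaced on the Mel scale
    # between 0 Hz and the Nyquist frequency.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(signal, sr, n_fft=1024, hop=512, n_mels=40):
    # Assumed (illustrative) defaults; the paper's settings are not given here.
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)            # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T   # (n_frames, n_mels)
    return 10.0 * np.log10(mel + 1e-10).T               # (n_mels, n_frames), in dB


def mfcc_from_log_mel(log_mel, n_mfcc=13):
    # Unnormalised DCT-II along the Mel-band axis.
    n_mels = log_mel.shape[0]
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return basis @ log_mel                              # (n_mfcc, n_frames)


if __name__ == "__main__":
    sr = 22050
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)   # synthetic 440 Hz test tone
    log_mel = log_mel_spectrogram(tone, sr)
    mfcc = mfcc_from_log_mel(log_mel)
    print(log_mel.shape, mfcc.shape)
```

With these assumed defaults, a one-second signal at 22,050 Hz yields a 40-band by 42-frame log-Mel spectrogram. A two-dimensional array of this kind is the sort of input a CNN front end can treat as an image, while the frame axis provides the sequence an LSTM would consume.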