A hybrid CNN-LSTM model for environmental sound classification: Leveraging feature engineering and transfer learning

Bibliographic details
Published in:	Digital Signal Processing, Vol. 163, p. 105234
Main authors:	Akter, Rubaiya; Islam, Md. Rezwanul; Debnath, Sumon Kumar; Sarker, Prodip Kumar; Uddin, Md. Kamal
Format:	Journal Article
Language:	English
Publication details:	Elsevier Inc 01.08.2025
Subjects:	Deep learning; Environmental sound classification; Evaluation metrics; Hybrid model; Mel-frequency cepstral coefficients
ISSN:	1051-2004

Description
Summary:	Environmental Sound Classification (ESC) remains a significant challenge due to the dynamic nature of audio data, the scarcity of labeled datasets, and the complexity of feature extraction. Existing deep learning approaches, such as CNNs and RNNs, frequently struggle to manage diverse sound features, achieve real-time processing, and provide interpretability and generalization across varied environments. To overcome these limitations, we propose a novel hybrid CNN-LSTM architecture, integrating CNNs for spatial feature extraction and LSTMs for temporal sequence modeling, enabling superior classification performance. We also present an improved Patch Transformer model that uses self-attention mechanisms to further improve spectro-temporal feature learning. To achieve this, we extract Mel-frequency Cepstral Coefficients (MFCCs) and transform them into Mel spectrograms so that the models can accurately represent spectro-temporal information for ESC. We incorporate batch normalization, feature representation maps, and transfer learning to enhance the model's accuracy and generalization. Evaluated on the UrbanSound8K and ESC-50 datasets, our model surpasses conventional CNN, RNN, and ANN approaches, achieving state-of-the-art accuracy. The findings demonstrate the importance of hybrid architectures and optimized feature extraction techniques in ESC. This study provides a robust framework to improve environmental sound recognition, with potential applications in smart surveillance, healthcare monitoring, and urban noise analysis.
DOI:	10.1016/j.dsp.2025.105234
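
The summary names Mel spectrograms and MFCCs as the input features but gives no parameters, so the following is only a minimal, standard-library sketch of the textbook relation between the two: a triangular mel filterbank applied to one power-spectrum frame, followed by a DCT-II of the log-mel energies. It is not the authors' pipeline. The function names, the HTK-style mel formula, the unnormalized DCT, and the sizes used in the usage note (8 filters, 64-point FFT, 16 kHz) are all assumptions chosen for illustration.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style mel scale (an assumed convention): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over the n_fft // 2 + 1 one-sided FFT bins."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, evenly spaced on the mel scale from 0 Hz to Nyquist.
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)    # slope up from lo to mid
            falling = (hi - f) / (hi - mid)   # slope down from mid to hi
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel_energies(power_frame, bank, eps=1e-10):
    """Apply the filterbank to one power-spectrum frame and take the natural log."""
    return [math.log(sum(w * p for w, p in zip(row, power_frame)) + eps)
            for row in bank]


def mfcc_from_log_mel(log_mel, n_mfcc):
    """Unnormalized DCT-II of the log-mel energies; keep the first n_mfcc coefficients."""
    n = len(log_mel)
    return [sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(log_mel))
            for k in range(n_mfcc)]
```

With these hypothetical sizes, `mel_filterbank(8, 64, 16000)` yields 8 filters over 33 frequency bins. Stacking the `log_mel_energies` output of successive frames gives a log-Mel spectrogram, the kind of time-frequency map a CNN front end would typically consume, and `mfcc_from_log_mel` compresses each frame further into cepstral coefficients.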