A hybrid CNN-LSTM model for environmental sound classification: Leveraging feature engineering and transfer learning
| Published in: | Digital Signal Processing, Vol. 163, p. 105234 |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Publication details: | Elsevier Inc, 1 August 2025 |
| Subject: | |
| ISSN: | 1051-2004 |
| Summary: | Environmental Sound Classification (ESC) remains a significant challenge due to the dynamic nature of audio data, the scarcity of labeled datasets, and the complexity of feature extraction. Existing deep learning approaches, such as CNNs and RNNs, frequently struggle to manage diverse sound features, achieve real-time processing, and provide interpretability and generalization across varied environments. To overcome these limitations, we propose a novel hybrid CNN-LSTM architecture that integrates CNNs for spatial feature extraction and LSTMs for temporal sequence modeling, enabling superior classification performance. We also present an improved Patch Transformer model that uses self-attention mechanisms to further improve spectro-temporal feature learning. To this end, we extract Mel-frequency Cepstral Coefficients (MFCC) and transform them into Mel spectrograms so that the models can accurately represent spectro-temporal information for ESC. We incorporate batch normalization, feature representation maps, and transfer learning to enhance the model's accuracy and generalization. Evaluated on the UrbanSound8K and ESC-50 datasets, our model surpasses conventional CNN, RNN, and ANN approaches, achieving state-of-the-art accuracy. The findings demonstrate the importance of hybrid architectures and optimized feature extraction techniques in ESC. This study provides a robust framework to improve environmental sound recognition, with potential applications in smart surveillance, healthcare monitoring, and urban noise analysis. |
|---|---|
| DOI: | 10.1016/j.dsp.2025.105234 |
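
The summary mentions Mel spectrograms and MFCCs as input features but this record gives no implementation details. As a rough illustration only, the sketch below shows the conventional relation between the two (a log-Mel spectrogram computed from a short-time Fourier transform, then MFCCs obtained from it with a DCT-II). It is not the authors' pipeline: the function names, the NumPy-only implementation, and all parameter values (`n_fft`, `hop`, `n_mels`, `n_mfcc`, the 22050 Hz test tone) are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common (HTK-style) Mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge points, equally spaced on the Mel scale.
    hz_points = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    )
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(y, sr, n_fft=1024, hop=512, n_mels=40):
    """Log-Mel spectrogram of a mono signal, shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, bins)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T      # (n_frames, n_mels)
    return np.log(mel + 1e-10).T                           # small offset avoids log(0)

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """MFCCs via an unnormalized DCT-II over the Mel axis, shape (n_mfcc, n_frames)."""
    n_mels = log_mel.shape[0]
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return dct_basis @ log_mel

# Usage on a synthetic one-second 440 Hz tone (hypothetical input).
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)
log_mel = log_mel_spectrogram(y, sr)   # shape (40, 42)
mfcc = mfcc_from_log_mel(log_mel)      # shape (13, 42)
```

A two-dimensional array like `log_mel` (frequency bands by time frames) is the kind of image-like input a CNN front end typically consumes, with the time axis then passed to a recurrent layer such as an LSTM. How the paper actually arranges these stages is not stated in this record.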