A hybrid CNN-LSTM model for environmental sound classification: Leveraging feature engineering and transfer learning
| Published in: | Digital signal processing, vol. 163, p. 105234 |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc, 1 August 2025 |
| Subjects: | |
| ISSN: | 1051-2004 |
| Online access: | Full text |
| Abstract: | Environmental Sound Classification (ESC) remains a significant challenge due to the dynamic nature of audio data, scarcity of labeled datasets, and the complexity of feature extraction. Existing deep learning approaches, such as CNNs and RNNs, frequently struggle to manage diverse sound features, achieve real-time processing, and provide interpretability and generalization across varied environments. To overcome these limitations, we propose a novel hybrid CNN-LSTM architecture, integrating CNNs for spatial feature extraction and LSTMs for temporal sequence modeling, enabling superior classification performance. We also present an improved Patch Transformer model that uses self-attention mechanisms to further improve spectro-temporal feature learning. To achieve this, we extract Mel-frequency Cepstral Coefficients (MFCC) and transform them into Mel spectrograms so that the models can accurately represent spectro-temporal information for ESC. We incorporate batch normalization, feature representation maps, and transfer learning to enhance the model's accuracy and generalization. Evaluated on the UrbanSound8K and ESC-50 datasets, our model surpasses conventional CNN, RNN, and ANN approaches, achieving state-of-the-art accuracy. The findings demonstrate the importance of hybrid architectures and optimized feature extraction techniques in ESC. This study provides a robust framework to improve environmental sound recognition, with potential applications in smart surveillance, healthcare monitoring, and urban noise analysis. |
| DOI: | 10.1016/j.dsp.2025.105234 |
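
The abstract describes CNNs extracting spatial features from a spectrogram and LSTMs modeling the resulting temporal sequence. As a rough illustration of how such a handoff is commonly wired, the sketch below traces tensor shapes only: it computes how many time steps and how many features per step a stack of conv + pool stages would pass to an LSTM. The record gives no layer sizes, so `ConvBlock`, its defaults, and the input dimensions (128 Mel bands, 172 frames) are hypothetical values chosen for the example, not the authors' configuration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ConvBlock:
    """One hypothetical conv + max-pool stage (sizes are illustrative only)."""
    out_channels: int
    kernel: int = 3
    padding: int = 1
    pool: int = 2


def conv_out(size: int, kernel: int, padding: int, stride: int = 1) -> int:
    """Standard output-length formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def lstm_input_shape(n_mels: int, n_frames: int,
                     blocks: List[ConvBlock]) -> Tuple[int, int]:
    """Return (time_steps, features_per_step) handed from the CNN to the LSTM."""
    freq, time, channels = n_mels, n_frames, 1
    for b in blocks:
        # Each stage convolves, then pools, along both frequency and time.
        freq = conv_out(freq, b.kernel, b.padding) // b.pool
        time = conv_out(time, b.kernel, b.padding) // b.pool
        channels = b.out_channels
    # Channels and frequency bins are flattened into one feature vector per
    # remaining time step, so the LSTM reads a sequence along the time axis.
    return time, channels * freq


# Assumed input: a 128-band Mel spectrogram with 172 frames (hypothetical).
blocks = [ConvBlock(32), ConvBlock(64)]
steps, features = lstm_input_shape(n_mels=128, n_frames=172, blocks=blocks)
print(steps, features)  # 43 2048
```

Under these assumptions, two pooling stages shorten the sequence from 172 frames to 43 steps, each carrying 64 × 32 = 2048 features. Pooling less aggressively along time would keep a longer sequence for the LSTM at the cost of more recurrent computation.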