A hybrid CNN-LSTM model for environmental sound classification: Leveraging feature engineering and transfer learning
| Published in: | Digital Signal Processing, Vol. 163, p. 105234 |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc, 1 August 2025 |
| Subjects: | |
| ISSN: | 1051-2004 |
| Summary: | Environmental Sound Classification (ESC) remains a significant challenge due to the dynamic nature of audio data, scarcity of labeled datasets, and the complexity of feature extraction. Existing deep learning approaches, such as CNNs and RNNs, frequently struggle to manage diverse sound features, achieve real-time processing, and provide interpretability and generalization across varied environments. To overcome these limitations, we propose a novel hybrid CNN-LSTM architecture, integrating CNNs for spatial feature extraction and LSTMs for temporal sequence modeling, enabling superior classification performance. We also present an improved Patch Transformer model that uses self-attention mechanisms to further enhance spectro-temporal feature learning. To achieve this, we extract Mel-frequency Cepstral Coefficients (MFCC) and transform them into Mel spectrograms so that the models can accurately represent spectro-temporal information for ESC. We incorporate batch normalization, feature representation maps, and transfer learning to enhance the model's accuracy and generalization. Evaluated on the UrbanSound8K and ESC-50 datasets, our model surpasses conventional CNN, RNN, and ANN approaches, achieving state-of-the-art accuracy. The findings demonstrate the importance of hybrid architectures and optimized feature extraction techniques in ESC. This study provides a robust framework to improve environmental sound recognition, with potential applications in smart surveillance, healthcare monitoring, and urban noise analysis. |
| DOI: | 10.1016/j.dsp.2025.105234 |