Spectro-Temporal-CNN Fusion for Deepfake Speech Detection and Spoof System Attribution
Saved in:
| Title: | Spectro-Temporal-CNN Fusion for Deepfake Speech Detection and Spoof System Attribution |
|---|---|
| Authors: | Zuhal Can, Buket Soyhan |
| Source: | IEEE Access, Vol. 13, pp. 185802-185817 (2025) |
| Publisher Information: | IEEE, 2025. |
| Publication Year: | 2025 |
| Collection: | LCC:Electrical engineering. Electronics. Nuclear engineering |
| Subjects: | Convolutional neural networks, deepfake audio detection, ensemble classification, spectro-temporal features, speech synthesis, voice conversion, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 |
| Description: | Neural speech synthesis now produces speech that can sound convincingly human, challenging security and forensics. We propose a detector that fuses an interpretable 51-dimensional spectro-temporal vector (13 MFCCs, 13 $\Delta$MFCCs, 12 chroma, 7 spectral-contrast, 6 tonnetz) with compact CNN embeddings (EfficientNet-B1/B4, EfficientNet-V2-S/M, Xception, ResNet-50). Evaluation spans two complementary datasets: a controlled ESOGU corpus (real vs. synthetic from CoquiTTS, DiffVC, FreeVC) and the public ASVspoof2021-LA benchmark (bonafide vs. spoof across 13 attack systems, A07–A19). Duration controls remove utterance-length cues, and interpretability analyses verify reliance on formant structure and spectral texture rather than recording quirks. On ESOGU, the 51-D vector alone achieves 100% binary accuracy and 99.65% three-class recognition; with fusion, EfficientNet-B1 reaches 100% Stage-1 and 99.75% Stage-2 accuracy. These perfect scores are confined to ESOGU under our protocol; performance on ASVspoof2021-LA is lower. On ASVspoof2021-LA, where codec/channel diversity makes detection harder, fusion raises performance where it matters most: EfficientNet-V2-M attains 94.59% binary accuracy and 87.09% 13-way spoof attribution, and the bonafide-class F1 improves by $\approx +0.019$ to $+0.035$ over MFCC-only methods. Permutation importance highlights low-order MFCCs, $\Delta$MFCC dynamics, and spectral-contrast bands as principal cues, and Grad-CAMs corroborate attention to characteristic peak-valley structure. These results show that well-chosen, interpretable acoustics, joined with lightweight CNN representations, deliver robust and explainable synthetic-speech detection without resorting to ever-larger end-to-end models. |
| Document Type: | article |
| File Description: | electronic resource |
| Language: | English |
| ISSN: | 2169-3536 |
| Relation: | https://ieeexplore.ieee.org/document/11218048/; https://doaj.org/toc/2169-3536 |
| DOI: | 10.1109/ACCESS.2025.3625746 |
| Access URL: | https://doaj.org/article/1aadaf5bfeaa4b9cb6b6f3a6484880f4 |
| Accession Number: | edsdoj.1aadaf5bfeaa4b9cb6b6f3a6484880f4 |
| Database: | Directory of Open Access Journals |
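The abstract specifies the exact composition of the handcrafted feature vector: 13 MFCCs + 13 $\Delta$MFCCs + 12 chroma + 7 spectral-contrast + 6 tonnetz = 51 dimensions. The sketch below reconstructs such a vector with librosa defaults; it is an illustration under stated assumptions (librosa's default parameters, mean pooling over time frames), not the authors' released code.

```python
# Hypothetical reconstruction of the 51-D spectro-temporal vector
# named in the abstract, using librosa defaults. Mean pooling over
# time frames is an assumption; the paper may pool differently.
import numpy as np
import librosa

def spectro_temporal_vector(path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # (13, T)
    d_mfcc = librosa.feature.delta(mfcc)                           # (13, T) delta-MFCC dynamics
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)               # (12, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)       # (7, T): default n_bands=6, +1
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # (6, T)
    frames = np.vstack([mfcc, d_mfcc, chroma, contrast, tonnetz])  # (51, T)
    return frames.mean(axis=1)                                     # (51,) utterance-level vector
```

In the simplest reading of the abstract, fusion with a CNN embedding then reduces to concatenation before a classifier head, e.g. `np.concatenate([spectro_temporal_vector(wav), cnn_embedding])`; the paper's exact fusion layer is not specified in this record.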
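The abstract also reports a permutation-importance analysis that ranks low-order MFCCs, $\Delta$MFCC dynamics, and spectral-contrast bands as the principal cues. Below is a minimal sketch of such an analysis over the 51-D features, assuming a scikit-learn workflow with a RandomForest stand-in classifier and placeholder data; the paper's actual classifiers are CNN-based.

```python
# Minimal permutation-importance sketch over the 51-D features.
# X and y are synthetic placeholders; the classifier is an assumption.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 51))       # placeholder: (utterances, 51-D features)
y = rng.integers(0, 2, size=200)     # placeholder: bonafide (0) vs. spoof (1)

clf = RandomForestClassifier(random_state=0).fit(X, y)
res = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

# Feature index map for the 51-D vector: 0-12 MFCC, 13-25 delta-MFCC,
# 26-37 chroma, 38-44 spectral contrast, 45-50 tonnetz.
ranking = np.argsort(res.importances_mean)[::-1]
print(ranking[:10])                  # ten most important dimensions
```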