Spectro-Temporal-CNN Fusion for Deepfake Speech Detection and Spoof System Attribution

Saved in:
Detailed bibliography
Title: Spectro-Temporal-CNN Fusion for Deepfake Speech Detection and Spoof System Attribution
Authors: Zuhal Can, Buket Soyhan
Source: IEEE Access, Vol 13, Pp 185802-185817 (2025)
Publisher Information: IEEE, 2025.
Publication Year: 2025
Collection: LCC:Electrical engineering. Electronics. Nuclear engineering
Subject Terms: Convolutional neural networks, deepfake audio detection, ensemble classification, spectro-temporal features, speech synthesis, voice conversion, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Description: Neural speech synthesis now produces speech that can sound convincingly human, challenging security and forensics. We propose a detector that fuses an interpretable 51-dimensional spectro-temporal vector (13 MFCCs, 13 $\Delta$MFCCs, 12 chroma, 7 spectral-contrast, 6 tonnetz) with compact CNN embeddings (EfficientNet-B1/B4, EfficientNet-V2-S/M, Xception, ResNet-50). Evaluation spans two complementary datasets: a controlled ESOGU corpus (real vs. synthetic from CoquiTTS, DiffVC, FreeVC) and the public ASVspoof2021-LA benchmark (bonafide vs. spoof across 13 attack systems, A07-A19). Duration controls remove utterance-length cues, and interpretability analyses verify reliance on formant structure and spectral texture rather than recording quirks. On ESOGU, the 51-D vector alone achieves 100% binary accuracy and 99.65% three-class recognition; with fusion, EfficientNet-B1 reaches 100% Stage-1 and 99.75% Stage-2 accuracy. These perfect scores are confined to ESOGU under our protocol; performance on ASVspoof2021-LA is lower. On ASVspoof2021-LA, where codec/channel diversity makes detection harder, fusion raises performance where it matters most: EfficientNet-V2-M attains 94.59% binary accuracy and 87.09% 13-way spoof attribution, and the bonafide-class F1 improves by approximately 0.019 to 0.035 over MFCC-only methods. Permutation importance highlights low-order MFCCs, $\Delta$MFCC dynamics, and spectral-contrast bands as principal cues, and Grad-CAMs corroborate attention to characteristic peak-valley structure. These results show that well-chosen, interpretable acoustics, joined with lightweight CNN representations, deliver robust and explainable synthetic-speech detection without resorting to ever-larger end-to-end models.
Document Type: article
File Description: electronic resource
Language: English
ISSN: 2169-3536
Relation: https://ieeexplore.ieee.org/document/11218048/; https://doaj.org/toc/2169-3536
DOI: 10.1109/ACCESS.2025.3625746
Access URL: https://doaj.org/article/1aadaf5bfeaa4b9cb6b6f3a6484880f4
Accession Number: edsdoj.1aadaf5bfeaa4b9cb6b6f3a6484880f4
Database: Directory of Open Access Journals
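A minimal sketch of the 51-dimensional spectro-temporal vector described in the abstract (13 MFCCs, 13 delta-MFCCs, 12 chroma, 7 spectral-contrast, 6 tonnetz), assuming frame-level librosa features averaged over time at a 16 kHz sample rate; the paper's exact windowing and aggregation settings are not given in this record, so treat this as an illustration rather than the authors' pipeline.

import numpy as np
import librosa

def spectro_temporal_vector(path, sr=16000):
    # Hypothetical reconstruction of the 51-D feature vector: per-frame
    # features are averaged over time (an assumed aggregation).
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # 13 MFCCs
    d_mfcc = librosa.feature.delta(mfcc)                       # 13 delta-MFCCs
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # 12 chroma bins
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)   # 7 bands (6 + overall)
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)              # 6 tonal centroid dims
    feats = [mfcc, d_mfcc, chroma, contrast, tonnetz]
    return np.concatenate([f.mean(axis=1) for f in feats])     # 13+13+12+7+6 = 51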
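The abstract also reports a permutation-importance analysis pointing to low-order MFCCs, delta-MFCC dynamics, and spectral-contrast bands as the principal cues. A minimal sketch of such an analysis with scikit-learn follows; the random data, classifier choice, and feature names are placeholders, not the authors' experimental setup.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Placeholder data standing in for 51-D feature vectors and real/synthetic labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 51))
y = rng.integers(0, 2, size=400)

names = ([f"mfcc_{i}" for i in range(13)] + [f"dmfcc_{i}" for i in range(13)]
         + [f"chroma_{i}" for i in range(12)] + [f"contrast_{i}" for i in range(7)]
         + [f"tonnetz_{i}" for i in range(6)])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:10]
for i in top:
    print(f"{names[i]:12s} {result.importances_mean[i]:.4f}")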