Spectro-Temporal-CNN Fusion for Deepfake Speech Detection and Spoof System Attribution

Saved in:
Detailed bibliography
Title: Spectro-Temporal-CNN Fusion for Deepfake Speech Detection and Spoof System Attribution
Authors: Zuhal Can, Buket Soyhan
Source: IEEE Access, Vol 13, Pp 185802-185817 (2025)
Publisher Information: IEEE, 2025.
Publication Year: 2025
Collection: LCC:Electrical engineering. Electronics. Nuclear engineering
Subject Terms: Convolutional neural networks, deepfake audio detection, ensemble classification, spectro-temporal features, speech synthesis, voice conversion, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Description: Neural speech synthesis now produces speech that can sound convincingly human, challenging security and forensics. We propose a detector that fuses an interpretable 51-dimensional spectro-temporal vector (13 MFCCs, 13 $\Delta$MFCCs, 12 chroma, 7 spectral-contrast, 6 tonnetz) with compact CNN embeddings (EfficientNet-B1/B4, EfficientNet-V2-S/M, Xception, ResNet-50). Evaluation spans two complementary datasets: a controlled ESOGU corpus (real vs. synthetic from CoquiTTS, DiffVC, FreeVC) and the public ASVspoof2021-LA benchmark (bonafide vs. spoof across 13 attack systems, A07-A19). Duration controls remove utterance-length cues, and interpretability analyses verify reliance on formant structure and spectral texture rather than recording quirks. On ESOGU, the 51-D vector alone achieves 100% binary accuracy and 99.65% three-class recognition; with fusion, EfficientNet-B1 reaches 100% Stage-1 and 99.75% Stage-2 accuracy. These perfect scores are confined to ESOGU under our protocol; performance on ASVspoof2021-LA is lower. On ASVspoof2021-LA, where codec/channel diversity makes detection harder, fusion raises performance where it matters most: EfficientNet-V2-M attains 94.59% binary accuracy and 87.09% 13-way spoof attribution, and the bonafide-class F1 improves by approximately 0.019 to 0.035 over MFCC-only methods. Permutation importance highlights low-order MFCCs, $\Delta$MFCC dynamics, and spectral-contrast bands as principal cues, and Grad-CAMs corroborate attention to characteristic peak-valley structure. These results show that well-chosen, interpretable acoustics, joined with lightweight CNN representations, deliver robust and explainable synthetic-speech detection without resorting to ever-larger end-to-end models.
Document Type: article
File Description: electronic resource
Language: English
ISSN: 2169-3536
Relation: https://ieeexplore.ieee.org/document/11218048/; https://doaj.org/toc/2169-3536
DOI: 10.1109/ACCESS.2025.3625746
Access URL: https://doaj.org/article/1aadaf5bfeaa4b9cb6b6f3a6484880f4
Accession Number: edsdoj.1aadaf5bfeaa4b9cb6b6f3a6484880f4
Database: Directory of Open Access Journals
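A minimal sketch of the 51-dimensional spectro-temporal vector described in the abstract (13 MFCCs, 13 delta-MFCCs, 12 chroma, 7 spectral-contrast, 6 tonnetz), assuming frame-level librosa features averaged over time at a 16 kHz sample rate; the paper's exact windowing and aggregation settings are not given in this record, so treat this as an illustration rather than the authors' pipeline.

import numpy as np
import librosa

def spectro_temporal_vector(path, sr=16000):
    # Hypothetical reconstruction of the 51-D feature vector: per-frame
    # features are averaged over time (an assumed aggregation).
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # 13 MFCCs
    d_mfcc = librosa.feature.delta(mfcc)                       # 13 delta-MFCCs
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # 12 chroma bins
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)   # 7 bands (6 + overall)
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)              # 6 tonal centroid dims
    feats = [mfcc, d_mfcc, chroma, contrast, tonnetz]
    return np.concatenate([f.mean(axis=1) for f in feats])     # 13+13+12+7+6 = 51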
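The abstract also reports a permutation-importance analysis pointing to low-order MFCCs, delta-MFCC dynamics, and spectral-contrast bands as the principal cues. A minimal sketch of such an analysis with scikit-learn follows; the random data, classifier choice, and feature names are placeholders, not the authors' experimental setup.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Placeholder data standing in for 51-D feature vectors and real/synthetic labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 51))
y = rng.integers(0, 2, size=400)

names = ([f"mfcc_{i}" for i in range(13)] + [f"dmfcc_{i}" for i in range(13)]
         + [f"chroma_{i}" for i in range(12)] + [f"contrast_{i}" for i in range(7)]
         + [f"tonnetz_{i}" for i in range(6)])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:10]
for i in top:
    print(f"{names[i]:12s} {result.importances_mean[i]:.4f}")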