Video deepfake detection using a hybrid CNN-LSTM-Transformer model for identity verification

Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 84, No. 33, pp. 40617-40636
Main Authors: Petmezas, Georgios, Vanian, Vazgken, Konstantoudakis, Konstantinos, Almaloglou, Elena E. I., Zarpalas, Dimitris
Format: Journal Article
Language: English
Published: New York: Springer US, 01.10.2025
Springer Nature B.V.
ISSN: 1380-7501, 1573-7721
Description
Summary: The proliferation of deepfake technology poses significant challenges due to its potential for misuse in creating highly convincing manipulated videos. Deep learning (DL) techniques have emerged as powerful tools for analyzing and identifying the subtle inconsistencies that distinguish genuine content from deepfakes. This paper introduces a novel approach to video deepfake detection that integrates 3D Morphable Models (3DMMs) with a hybrid CNN-LSTM-Transformer model, aimed at enhancing both detection accuracy and efficiency. Our model leverages 3DMMs for detailed facial feature extraction, a CNN for fine-grained spatial analysis, an LSTM for short-term temporal dynamics, and a Transformer for capturing long-term dependencies in sequential data. This architecture effectively addresses critical challenges in current detection systems by handling both local and global temporal information. The proposed model employs an identity verification approach, comparing test videos with reference videos containing genuine footage of the individuals in question. Trained and validated on the VoxCeleb2 dataset, with further testing on three additional datasets, our model demonstrates performance superior to existing state-of-the-art methods, maintaining robustness across different video qualities, compression levels, and manipulation types. Additionally, it operates efficiently in time-sensitive scenarios, significantly outperforming existing methods in inference speed. By relying solely on pristine, unmanipulated data for training, our approach enhances adaptability to new and sophisticated manipulations, setting a new benchmark for video deepfake detection technologies. This study not only advances the framework for detecting deepfakes but also underscores its potential for practical deployment in areas critical to digital forensics and media integrity.
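
Note on the architecture: the summary describes a staged pipeline of per-frame 3DMM feature extraction, a CNN for spatial analysis, an LSTM for short-term dynamics, and a Transformer for long-range dependencies, with detection framed as identity verification against genuine reference footage. The PyTorch sketch below illustrates one plausible wiring of such a hybrid under those assumptions; the class name HybridDeepfakeEmbedder, all layer sizes (including the 257-dim 3DMM feature vector), the mean-pooling step, and the 0.5 cosine-similarity threshold are illustrative choices of this note, not details taken from the paper.

# Minimal sketch (assumptions noted above): embed a clip of per-frame
# 3DMM features with CNN -> LSTM -> Transformer stages, then verify
# identity by cosine similarity against a genuine reference clip.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridDeepfakeEmbedder(nn.Module):
    """Maps a sequence of per-frame 3DMM feature vectors to one clip embedding."""

    def __init__(self, feat_dim=257, cnn_dim=128, lstm_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # 1D CNN over each frame's 3DMM parameter vector: fine-grained
        # analysis of the per-frame representation (assumed layout).
        self.cnn = nn.Sequential(
            nn.Conv1d(1, cnn_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # LSTM: short-term temporal dynamics across consecutive frames.
        self.lstm = nn.LSTM(cnn_dim, lstm_dim, batch_first=True)
        # Transformer encoder: long-term dependencies over the whole clip.
        enc_layer = nn.TransformerEncoderLayer(d_model=lstm_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, frames, feat_dim) per-frame 3DMM features.
        b, t, d = x.shape
        f = self.cnn(x.reshape(b * t, 1, d)).squeeze(-1)  # (b*t, cnn_dim)
        f = f.reshape(b, t, -1)                           # (b, t, cnn_dim)
        f, _ = self.lstm(f)                               # (b, t, lstm_dim)
        f = self.transformer(f)                           # (b, t, lstm_dim)
        # Mean-pool over time and L2-normalize to get one clip embedding.
        return F.normalize(f.mean(dim=1), dim=-1)

if __name__ == "__main__":
    model = HybridDeepfakeEmbedder()
    reference = torch.randn(1, 32, 257)  # genuine footage of the claimed identity
    test = torch.randn(1, 32, 257)       # video under examination
    sim = F.cosine_similarity(model(reference), model(test))
    # Placeholder decision rule: low similarity suggests manipulation.
    print("genuine" if sim.item() > 0.5 else "suspected deepfake")

Because the abstract states that training relies solely on pristine VoxCeleb2 footage, a faithful implementation would presumably learn the embedding with a metric or reconstruction objective on genuine videos rather than a fake/real classifier; the threshold comparison above merely stands in for that final verification decision.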
DOI: 10.1007/s11042-024-20548-6