SEDN: A Spatiotemporal Encoder-Decoder Network for End-to-End Object Removal Forgery Detection in High-Resolution Videos

Bibliographic Details
Published in:	IEEE Transactions on Multimedia, Vol. 27, pp. 2503-2515
Main Authors:	Xiong, Lizhi; Ding, Linsen; Cao, Mengqi; Xia, Zhihua; Shi, Yun-Qing
Format:	Journal Article
Language:	English
Published:	IEEE, 2025
Subjects:	3D convolution; Accuracy; Data mining; Feature extraction; Forgery; hybrid encoder-decoder model; Location awareness; Object detection; Object removal forgery detection; spatiotemporal localization; Spatiotemporal phenomena; Transformers; Transforms; Vectors; video passive forensics
ISSN:	1520-9210, 1941-0077

Description
Summary:	With the growing popularity of high-resolution (HR) video and the continuous growth of network bandwidth, object removal detection in HR videos has attracted significant attention. Expert forgers exploit the rich detail in HR videos to manipulate pixels meticulously and apply sophisticated postprocessing techniques to hide high-frequency artifacts, which makes forgery detection and localization more difficult for existing schemes. Additionally, an end-to-end framework would simplify the detection and localization process, but this has not been considered in previous work. To address these issues, a spatiotemporal encoder-decoder network (SEDN) is proposed for end-to-end object removal forgery detection in HR videos. The SEDN introduces a new model composed of a 3D asymmetric dual-stream network (3D-ADSN) and a Transformer. The 3D-ADSN serves as the encoder and fully integrates the high-frequency and low-frequency spatiotemporal information of videos. The Transformer serves as the decoder and captures the global structural spatiotemporal information of the long-range feature sequence produced by the encoder. This combination achieves simultaneous detection in the temporal and spatial domains without any additional postprocessing. The experimental results demonstrate that the SEDN achieves better performance at different resolutions.
DOI:	10.1109/TMM.2024.3521804