DCTFormer: A Dual-Branch Transformer With Cloze Tests for Video Anomaly Detection

Bibliographic details
Published in:	IEEE Transactions on Multimedia, pp. 1-11
Main authors:	Chen, Pengzhan; Du, Shengdong; Zhao, Xiaole; Hu, Jie; Li, Jingjing; Li, Tianrui
Format:	Journal Article
Language:	English
Published:	IEEE 2025
Subjects:	Anomaly detection; Autoencoders; conditional variational autoencoder; Correlation; Dynamics; Feature extraction; Optical flow; Semantics; Training; transformer residual autoencoder; Transformers; unsupervised learning; Video anomaly detection; Videos
ISSN:	1520-9210, 1941-0077

Description
Summary:	Video anomaly detection is essential in safety-critical scenarios. The key challenge is to effectively capture the spatio-temporal features of videos and learn normal patterns from the training data. However, existing methods often fall short in modelling intra-channel and inter-channel correlations as well as dynamic dependencies between video frames, which limits model robustness and generalization. To address these issues, we propose DCTFormer, a dual-branch framework that integrates RGB and optical flow branches for video anomaly detection. First, we design a novel module, TRAECT (Transformer-based Residual Autoencoder with Cloze Tests), which incorporates high-level semantics and temporal context information to improve the learning of spatio-temporal relationships by capturing intra-channel and inter-channel correlations. More importantly, we propose a new optical flow completion approach, conditioned on the RGB branch, that incorporates richer motion dynamics and uses a conditional variational autoencoder to learn dynamic dependencies between video frames and optical flows. Finally, we introduce an ensemble strategy that computes anomaly scores for both branches and thus fully exploits the modality information of each branch. Experiments on three challenging benchmark datasets demonstrate the effectiveness of our framework, which outperforms current state-of-the-art approaches in anomaly detection performance.
DOI:	10.1109/TMM.2025.3613082