Midframe-centric token merging for efficient video transformer.

Bibliographic details
Title:	Midframe-centric token merging for efficient video transformer.
Authors:	Zhang, Qian¹ (AUTHOR) 20060075@upc.edu.cn, Yang, Zuosui¹ (AUTHOR) z22070057@s.upc.edu.cn, Shao, Mingwen¹ (AUTHOR) mwshao@upc.edu.cn, Liang, Hong¹ (AUTHOR) liangh@upc.edu.cn
Source:	Cluster Computing. Oct2025, Vol. 28 Issue 10, p1-14. 14p.
Subject terms:	VIDEO processing, ECONOMIC efficiency, SPATIOTEMPORAL processes, OPTIMIZATION algorithms
Abstract:	The Video Transformer has become the standard architecture for video recognition due to its powerful spatio-temporal modeling capability. However, its high computational cost makes it difficult to apply in real-world scenarios. In this paper, we design a lightweight module called Midframe-Centric Token Merging (MCTM). It can be embedded in some video transformers to adaptively merge unimportant and similar tokens during inference. Specifically, MCTM consists of two progressive sub-modules. The first sub-module divides tokens into two groups based on their importance: important tokens and unimportant tokens. The second sub-module applies a token merging strategy to the unimportant tokens, merging those that are similar. We performed several sets of experiments to validate the effectiveness of MCTM. For example, we embedded MCTM in Joint Video Transformer and ran experiments on the Kinetics-400 dataset, where it reduces the computational cost by 37% while maintaining similar accuracy. [ABSTRACT FROM AUTHOR]
Database:	Academic Search Index

Description
ISSN:	13867857
DOI:	10.1007/s10586-025-05478-8
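
Illustrative sketch: the abstract describes a two-stage procedure, first splitting tokens into important and unimportant groups, then merging similar tokens within the unimportant group. The record does not say how importance is scored or how similar tokens are paired, so the Python sketch below is only an assumption-laden illustration and not the authors' implementation. The importance scores are taken as a given input, the pairing uses a simple bipartite match on cosine similarity (a ToMe-style stand-in), merged tokens are averaged, and all function and parameter names (split_by_importance, merge_similar, reduce_tokens, keep, r) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_by_importance(tokens, scores, keep):
    """Stage 1 (assumed form): the `keep` highest-scoring tokens are 'important'.

    `scores` is a hypothetical per-token importance value supplied by the caller;
    the record does not specify how the paper computes it.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(order[:keep])
    important = [t for i, t in enumerate(tokens) if i in top]
    unimportant = [t for i, t in enumerate(tokens) if i not in top]
    return important, unimportant


def merge_similar(tokens, r):
    """Stage 2 (stand-in technique): merge up to `r` of the most similar pairs.

    Tokens are split alternately into a source set and a destination set; each
    source token is matched to its most similar destination token, and the `r`
    best matches are merged by averaging.
    """
    src, dst = tokens[0::2], tokens[1::2]
    if not dst:
        return list(tokens)
    matches = []
    for i, s in enumerate(src):
        j, sim = max(((j, cosine(s, d)) for j, d in enumerate(dst)),
                     key=lambda p: p[1])
        matches.append((sim, i, j))
    matches.sort(reverse=True)  # most similar pairs first

    groups = {j: [d] for j, d in enumerate(dst)}
    merged_src = set()
    for _, i, j in matches[:r]:
        groups[j].append(src[i])
        merged_src.add(i)

    # Unmerged source tokens pass through; each destination becomes its group mean.
    out = [s for i, s in enumerate(src) if i not in merged_src]
    for j in range(len(dst)):
        g = groups[j]
        out.append([sum(col) / len(g) for col in zip(*g)])
    return out


def reduce_tokens(tokens, scores, keep, r):
    """Keep important tokens untouched and merge within the unimportant ones."""
    important, unimportant = split_by_importance(tokens, scores, keep)
    return important + merge_similar(unimportant, r)


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.9, 0.0], [0.0, 1.1], [0.3, 1.0]]
    scores = [0.9, 0.8, 0.1, 0.2, 0.1, 0.3]  # made-up importance values
    print(len(reduce_tokens(toks, scores, keep=2, r=1)))  # 6 tokens -> 5
```

In this sketch, fewer tokens reach the later transformer layers, which is the general mechanism by which token merging lowers computation; the 37% figure in the abstract comes from the authors' own experiments and is not reproduced by this toy example.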