PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference


Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 48, No. 1, pp. 1-17
Main Authors: Li, Ye; Tang, Chen; Meng, Yuan; Fan, Jiajun; Chai, Zenghao; Ma, Xinzhu; Wang, Zhi; Zhu, Wenwu
Format: Journal Article
Language: English
Published: United States: IEEE, 2025
ISSN: 0162-8828, 1939-3539, 2160-9292
Online Access: Full text
Description
Abstract: The large model size and the quadratic computational complexity in the number of tokens pose significant deployment challenges for Vision Transformers (ViTs) in practical applications. Although recent advances in model pruning and token reduction speed up ViT inference, these approaches either adopt a fixed sparsity ratio or overlook the meaningful interplay between architectural optimization and token selection. Consequently, such static, single-dimension compression often causes pronounced accuracy degradation under aggressive compression rates, as it fails to fully exploit redundancies across these two orthogonal dimensions. We therefore introduce PRANCE, a framework that jointly optimizes activated channels and tokens on a per-sample basis, accelerating ViT inference from a unified data and architectural perspective. The joint framework, however, poses challenges on both the architectural and decision-making sides. First, while ViTs inherently support variable-token inference, they do not support dynamic computation over variable channel counts. To overcome this limitation, we propose a meta-network that uses weight-sharing techniques to support arbitrary channel widths in the Multi-Head Self-Attention (MHSA) and Multi-Layer Perceptron (MLP) layers, serving as a foundational model for architectural decision-making. Second, simultaneously optimizing the model structure and the input data constitutes a combinatorial optimization problem with an extremely large decision space, reaching roughly 10^14 configurations, which makes supervised learning infeasible. To this end, we design a lightweight selector employing the Proximal Policy Optimization (PPO) algorithm for efficient decision-making.
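The weight-sharing idea behind the meta-network can be illustrated with a minimal sketch: a single full-width weight matrix is trained once, and any smaller channel count is served at inference by slicing it. The class and parameter names below are hypothetical illustrations, not PRANCE's actual API.

```python
import numpy as np

class SharedLinear:
    """Weight-shared linear layer: one full-width weight matrix from which
    any smaller output-channel count can be sliced at inference time.
    (Illustrative sketch only; not the paper's implementation.)"""

    def __init__(self, in_dim, max_out, seed=0):
        rng = np.random.default_rng(seed)
        # Shared parameters sized for the widest configuration.
        self.W = rng.standard_normal((in_dim, max_out)) * 0.02
        self.b = np.zeros(max_out)

    def __call__(self, x, out_channels):
        # Use only the first `out_channels` columns of the shared weights.
        return x @ self.W[:, :out_channels] + self.b[:out_channels]

layer = SharedLinear(in_dim=192, max_out=768)
x = np.ones((4, 192))        # 4 tokens, embedding dim 192
y_full = layer(x, 768)       # full-width MLP hidden size
y_slim = layer(x, 384)       # half the channels, same shared weights
assert y_full.shape == (4, 768) and y_slim.shape == (4, 384)
# Slicing columns commutes with the matmul, so the slim output is exactly
# the first 384 channels of the full output:
assert np.allclose(y_slim, y_full[:, :384])
```

Because every width reuses the same underlying parameters, the selector can pick a per-sample channel count without storing a separate model per configuration.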
Furthermore, we introduce a novel "Result-to-Go" training mechanism that models the ViT inference process as a Markov decision process, significantly reducing the action space and mitigating delayed-reward issues during training. Our framework also supports multiple token optimization methods, including pruning, merging, and sequential pruning-merging strategies. Extensive experiments demonstrate the effectiveness of PRANCE: it reduces FLOPs by approximately 50% and retains only about 10% of tokens while achieving lossless Top-1 accuracy.
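A sequential pruning-merging step of the kind the abstract mentions can be sketched as follows: keep the highest-scoring tokens and fold the rest into one score-weighted summary token. The scoring rule and function name are hypothetical assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def prune_then_merge(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens; merge the remainder,
    weighted by their scores, into a single appended summary token.
    (Illustrative sketch only; not the paper's exact rule.)"""
    order = np.argsort(scores)[::-1]          # indices by descending score
    kept, dropped = order[:keep], order[keep:]
    if dropped.size == 0:
        return tokens[kept]
    w = scores[dropped] / scores[dropped].sum()   # score-weighted average
    merged = (w[:, None] * tokens[dropped]).sum(axis=0, keepdims=True)
    return np.concatenate([tokens[kept], merged], axis=0)

tokens = np.arange(12, dtype=float).reshape(6, 2)   # 6 tokens, dim 2
scores = np.array([0.9, 0.1, 0.5, 0.2, 0.8, 0.05])
out = prune_then_merge(tokens, scores, keep=3)
assert out.shape == (4, 2)   # 3 kept tokens + 1 merged summary token
```

Pure pruning is the special case where the dropped tokens are discarded instead of merged; the sequential strategy preserves some information from low-score tokens at the cost of one extra token.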
DOI:10.1109/TPAMI.2025.3605239