Dynamic Hierarchical Token Merging for Vision Transformers

Full bibliographic record
Title: Dynamic Hierarchical Token Merging for Vision Transformers
Authors: Haroun, Karim, Allenet, Thibault, Ben Chehida, Karim, Martinet, Jean
Contributors: Haroun, Karim
Source: Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 677-684
Publisher: SCITEPRESS - Science and Technology Publications, 2025.
Publication year: 2025
Subjects: Vision Transformers, [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], [INFO.INFO-CV] Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], Neural network compression, Dynamic neural networks, Token merging
Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision, excelling in tasks such as image classification, segmentation, and object detection. However, their quadratic complexity $O(N^2)$, where $N$ is the token sequence length, poses challenges when deployed on resource-limited devices. To address this issue, dynamic token merging has emerged as an effective strategy, progressively reducing the token count during inference to achieve computational savings. Some strategies consider all tokens in the sequence as merging candidates, without focusing on spatially close tokens. Other strategies either limit token merging to a local window or constrain it to pairs of adjacent tokens, and thus fail to capture more complex feature relationships. In this paper, we propose Dynamic Hierarchical Token Merging (DHTM), a novel token merging approach, where we advocate that spatially close tokens share more information than distant tokens and consider all pairs of spatially close candidates instead of imposing fixed windows. Moreover, our approach draws on the principles of Hierarchical Agglomerative Clustering (HAC): we iteratively merge tokens in each layer, fusing a fixed number of selected neighbor token pairs based on their similarity. Our proposed approach is off-the-shelf, i.e., it does not require additional training. We evaluate our approach on the ImageNet-1K dataset for classification, achieving substantial computational savings while minimizing accuracy reduction, surpassing existing token merging methods.
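The abstract describes the method only at a high level. As an illustration, one HAC-style layer step, merging a fixed number r of the most similar spatially adjacent token pairs, could be sketched as below. The 4-neighbourhood on the token grid, cosine similarity, size-weighted feature averaging, and the rule that a cluster joins at most one merge per step are assumptions made for this sketch; they are not details confirmed by the paper.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_step(feats, sizes, cells, r):
    """One HAC-style step: merge the r most similar pairs of clusters
    that are spatially adjacent (4-neighbourhood on the token grid).

    feats: list of per-cluster feature vectors (np.ndarray)
    sizes: list of cluster sizes (number of original tokens absorbed)
    cells: list of sets of (row, col) grid cells owned by each cluster
    Note: inputs are mutated in place; this is a sketch, not a library API.
    """
    # Map each grid cell to the cluster that currently owns it.
    owner = {c: cid for cid, cs in enumerate(cells) for c in cs}
    # Candidate pairs: clusters touching through adjacent grid cells.
    pairs = set()
    for (y, x), cid in owner.items():
        for dy, dx in ((0, 1), (1, 0)):
            nb = owner.get((y + dy, x + dx))
            if nb is not None and nb != cid:
                pairs.add((min(cid, nb), max(cid, nb)))
    # Rank candidate pairs by descending feature similarity.
    scored = sorted(pairs, key=lambda p: -cosine(feats[p[0]], feats[p[1]]))
    absorbed, used, n_merged = set(), set(), 0
    for a, b in scored:
        if n_merged == r:
            break
        if a in used or b in used:
            continue  # each cluster participates in at most one merge per step
        used.update((a, b))
        # Size-weighted average so the merged token stays a mean of its members.
        w = sizes[a] + sizes[b]
        feats[a] = (sizes[a] * feats[a] + sizes[b] * feats[b]) / w
        sizes[a] = w
        cells[a] |= cells[b]
        absorbed.add(b)
        n_merged += 1
    keep = [i for i in range(len(feats)) if i not in absorbed]
    return ([feats[i] for i in keep], [sizes[i] for i in keep],
            [cells[i] for i in keep])

# Usage: a 4x4 token grid (16 tokens), merge r=3 neighbor pairs -> 13 tokens.
rng = np.random.default_rng(0)
feats = [rng.normal(size=8) for _ in range(16)]
sizes = [1] * 16
cells = [{(i // 4, i % 4)} for i in range(16)]
feats, sizes, cells = merge_step(feats, sizes, cells, r=3)
```

Applying such a step in every layer yields the progressive token-count reduction the abstract claims, without any retraining, since merging only averages existing token features.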
Document type: Article; Conference object
File format: application/pdf
DOI: 10.5220/0013284100003912
Access URLs: https://hal.science/hal-04885469v1
https://doi.org/10.5220/0013284100003912
https://hal.science/hal-04885469v1/document
Accession number: edsair.doi.dedup.....120db99c8bc74fa9f40732c2e6245163
Database: OpenAIRE