Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features


Bibliographic Details
Published in: Neurocomputing (Amsterdam), Vol. 630, p. 129696
Main Authors: Guan, Yijia; Wang, Kundong
Format: Journal Article
Language: English
Published: Elsevier B.V., 14 May 2025
ISSN: 0925-2312
Online Access: Full text
Description
Summary: The original self-attention has quadratic computational complexity. In this paper, we propose a novel tokenization paradigm that decouples the token scope from the spatial dimension. This new approach introduces dynamic tokens, which reduce computational complexity to linear while capturing multi-scale features. The paradigm is implemented in the proposed Dynamic Channel Token Vision Transformer (DCT-ViT), which combines Window Self-Attention (WSA) and Dynamic Channel Self-Attention (DCSA) to capture both fine-grained and coarse-grained features. Our hierarchical window settings in DCSA prioritize small tokens. DCT-ViT-S/B achieves 82.9%/84.3% Top-1 accuracy on ImageNet-1k (Deng et al., 2009), and 47.9/49.8 mAP^b and 43.4/44.6 mAP^m on COCO 2017 (Lin et al., 2014) with Mask R-CNN (He et al., 2017) under the 3× schedule. Visualization of the features in DCSA shows that dynamic channel tokens recognize objects at very early stages.
Highlights:
• A new model, DCT-ViT, based on dynamic channel tokens.
• DCT-ViT enables linear computational complexity and multi-scale features.
• DCT-ViT-S achieves 82.9% Top-1 on ImageNet and 47.9 mAP^b / 43.4 mAP^m on COCO under the 3× schedule.
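The abstract's linear-complexity claim rests on attending over channels rather than spatial positions: the attention map is then C × C (channels) instead of N × N (positions), so cost grows linearly in the number of spatial tokens N. Below is a minimal PyTorch-style sketch of generic channel self-attention for illustration only; the class name, learnable temperature, and normalization are assumptions borrowed from common channel-attention designs, and the paper's actual DCSA (dynamic tokens, hierarchical window settings) is not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    # Illustrative channel self-attention (hypothetical, not the paper's DCSA):
    # the attention map is C x C over channels, so cost is linear in the
    # number of spatial tokens N rather than quadratic.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # learnable per-head temperature, a common choice in channel attention
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):  # x: (B, N, C), N spatial tokens
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)        # each: (B, heads, C/heads, N)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, C/h, C/h)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)  # back to (B, N, C)
        return self.proj(out)

# usage: tokens from a 14x14 feature map with 384 channels
x = torch.randn(2, 14 * 14, 384)
y = ChannelSelfAttention(dim=384, num_heads=8)(x)
assert y.shape == x.shape

Doubling N here doubles the work in the q @ k^T and attn @ v products, whereas standard spatial self-attention would quadruple it; that scaling behavior, not the specific layer layout, is the point of the sketch.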
DOI: 10.1016/j.neucom.2025.129696