Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features
| Published in: | Neurocomputing (Amsterdam) Vol. 630; p. 129696 |
|---|---|
| Main Authors: | |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 14.05.2025 |
| ISSN: | 0925-2312 |
| DOI: | 10.1016/j.neucom.2025.129696 |
| Summary: | Original self-attention suffers from quadratic computational complexity. In this paper, we propose a novel tokenization paradigm that decouples the token scope from the spatial dimension. The approach introduces dynamic tokens, which reduce the computational complexity to linear while capturing multi-scale features. This paradigm is implemented in the proposed Dynamic Channel Token Vision Transformer (DCT-ViT), which combines Window Self-Attention (WSA) and Dynamic Channel Self-Attention (DCSA) to capture both fine-grained and coarse-grained features. Our hierarchical window settings in DCSA prioritize small tokens. DCT-ViT-S/B achieves 82.9%/84.3% Top-1 accuracy on ImageNet-1k (Deng et al., 2009), and 47.9/49.8 mAPᵇ and 43.4/44.6 mAPᵐ on COCO 2017 (Lin et al., 2014) with the Mask R-CNN (He et al., 2017) 3× schedule. Visualization of the features in DCSA shows that dynamic channel tokens recognize objects at very early stages. |
|---|---|
| Highlights: | • A new model, DCT-ViT, based on dynamic channel tokens. • DCT-ViT enables linear computation complexity and multi-scale features. • DCT-ViT-S achieves 82.9% Top-1 on ImageNet and 47.9 mAPᵇ / 43.4 mAPᵐ on COCO for the 3× schedule. |
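The linear-complexity claim can be made concrete with a small sketch. Standard spatial self-attention over N = H×W tokens builds an N×N attention matrix at O(N²·C) cost; attending over C channel tokens instead builds a C×C matrix at O(N·C²) cost, linear in the spatial size. The PyTorch module below is a minimal illustration under our own assumptions (the module name, QKV layout, and scaling factor are not taken from the paper); the paper's DCSA additionally makes the channel tokens dynamic and applies hierarchical window settings, which this sketch omits.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Self-attention over channel tokens (illustrative only).

    The attention matrix is C x C instead of N x N, so the cost is
    O(N * C^2): linear in the number of spatial positions N and
    quadratic only in the channel count C.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N = H * W spatial positions.
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)          # each (B, N, C)
        # Transpose so that channels act as the tokens: (B, C, N).
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        # C x C attention map; scaling by sqrt(N) is our assumption
        # (some channel-attention variants L2-normalize q and k instead).
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)  # (B, C, C)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2)                # back to (B, N, C)
        return self.proj(out)
```

For example, `ChannelSelfAttention(96)(torch.randn(2, 56 * 56, 96))` processes a 56×56 feature map with 96 channels, and doubling the spatial resolution only doubles the attention cost rather than quadrupling it, which is the property the abstract's linear-complexity claim rests on.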