Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency

In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the...

Full description

Saved in:

Bibliographic Details
Published in:	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) Vol. 2025; pp. 30157 - 30166
Main Authors:	Wang, Feng, Yang, Timing, Yu, Yaodong, Ren, Sucheng, Wei, Guoyizhe, Wang, Angtian, Shao, Wei, Zhou, Yuyin, Yuille, Alan, Xie, Cihang
Format:	Conference Proceeding Journal Article
Language:	English
Published:	United States IEEE 01.06.2025
Subjects:	Complexity theory Computational modeling Computer vision Pattern recognition Predictive models Throughput Training Transformers Visualization
ISSN:	1063-6919, 1063-6919
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies highlight that compared with the existing plain architectures such as DeiT [46] and Vim [57], Adventurer offers an optimal efficiency-accuracy trade-off. For example, our Adventurer-Base attains a competitive test accuracy of 84.3% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 3.8× and 6.2× faster than Vim and DeiT to achieve the same result. As Adventurer offers great computation and memory efficiency and allows scaling with linear complexity, we hope this architecture can benefit future explorations in modeling long sequences for high-resolution or fine-grained images. Code is available at https://github.com/wangf3014/Adventurer.
Bibliography:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23
ISSN:	1063-6919 1063-6919
DOI:	10.1109/CVPR52734.2025.02807