Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency


Detailed Description

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), 2025, pp. 30157-30166
Main Authors: Wang, Feng; Yang, Timing; Yu, Yaodong; Ren, Sucheng; Wei, Guoyizhe; Wang, Angtian; Shao, Wei; Zhou, Yuyin; Yuille, Alan; Xie, Cihang
Format: Conference Proceeding; Journal Article
Language: English
Published: United States: IEEE, 01.06.2025
ISSN: 1063-6919
Online Access: Full Text
Abstract: In this work, we introduce the Adventurer series models, in which we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity in the sequence length, which effectively addresses the memory and computation explosion posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies show that, compared with existing plain architectures such as DeiT [46] and Vim [57], Adventurer offers an optimal efficiency-accuracy trade-off. For example, our Adventurer-Base attains a competitive test accuracy of 84.3% on the standard ImageNet-1k benchmark with a training throughput of 216 images/s, reaching that result 3.8× and 6.2× faster than Vim and DeiT, respectively. As Adventurer offers great computation and memory efficiency and scales with linear complexity, we hope this architecture can benefit future explorations in modeling long sequences for high-resolution or fine-grained images. Code is available at https://github.com/wangf3014/Adventurer.
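The abstract names two architecture-level designs: a global pooling token prepended to the patch sequence, and a flipping operation between every two layers so a uni-directional model eventually sees both scan orders. The following is a minimal NumPy sketch of how these two ideas compose; it uses a toy running-mean operator as a stand-in for the actual causal Mamba mixer, and all function and variable names here are illustrative assumptions, not the repository's API.

```python
import numpy as np

def causal_mixer(x):
    """Toy uni-directional mixer with a residual: each token is updated with
    the running mean of itself and all previous tokens. Linear time, causal;
    a stand-in for the real Mamba block."""
    csum = np.cumsum(x, axis=0)                     # prefix sums over tokens
    counts = np.arange(1, x.shape[0] + 1)[:, None]  # 1, 2, ..., L
    return x + csum / counts

def adventurer_sketch(patches, depth=4):
    """Sketch of the two designs from the abstract (assumptions, not the
    paper's code): (1) prepend a global pooling token, here the mean of all
    patch tokens, and (2) flip the patch order between every two layers.

    patches: (L, D) array of patch tokens; returns the (L+1, D) sequence."""
    pool = patches.mean(axis=0, keepdims=True)      # global pooling token
    x = np.concatenate([pool, patches], axis=0)     # token sits at the front
    for i in range(depth):
        x = causal_mixer(x)
        if i % 2 == 1:  # flipping operation between every two layers
            # keep the pooling token first, reverse only the patch tokens
            x = np.concatenate([x[:1], x[1:][::-1]], axis=0)
    return x
```

The readout (which token feeds a classification head) is left open here, since the abstract does not specify it; the flip deliberately excludes the pooling token so that it stays at the start of the causal scan in every layer.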
Author Affiliations:
1. Wang, Feng (Johns Hopkins University)
2. Yang, Timing (Johns Hopkins University)
3. Yu, Yaodong (UC Berkeley)
4. Ren, Sucheng (Johns Hopkins University)
5. Wei, Guoyizhe (Johns Hopkins University)
6. Wang, Angtian (Johns Hopkins University)
7. Shao, Wei (University of Florida)
8. Zhou, Yuyin (UC Santa Cruz)
9. Yuille, Alan (Johns Hopkins University)
10. Xie, Cihang (UC Santa Cruz)
PubMed: https://www.ncbi.nlm.nih.gov/pubmed/41179969
CODEN: IEEPAD
DOI: 10.1109/CVPR52734.2025.02807
Discipline: Applied Sciences; Computer Science
EISBN: 9798331543648
EISSN: 1063-6919
External IDs: PubMed 41179969; IEEE Xplore 11094828
Genre: Original Research; Journal Article
Funding: NEI NIH HHS, grant R01 EY037193
Subjects: Complexity theory; Computational modeling; Computer vision; Pattern recognition; Predictive models; Throughput; Training; Transformers; Visualization
URI: https://ieeexplore.ieee.org/document/11094828
     https://www.ncbi.nlm.nih.gov/pubmed/41179969
     https://www.proquest.com/docview/3268200389
Volume: 2025