Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), Vol. 2025; pp. 30157-30166 |
|---|---|
| Main authors: | Wang, Feng; Yang, Timing; Yu, Yaodong; Ren, Sucheng; Wei, Guoyizhe; Wang, Angtian; Shao, Wei; Zhou, Yuyin; Yuille, Alan; Xie, Cihang |
| Format: | Conference Proceeding; Journal Article |
| Language: | English |
| Published: | United States: IEEE, 01.06.2025 |
| Subjects: | Complexity theory; Computational modeling; Computer vision; Pattern recognition; Predictive models; Throughput; Training; Transformers; Visualization |
| ISSN: | 1063-6919 |
| Online access: | Full text |
| Abstract | In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies highlight that compared with the existing plain architectures such as DeiT [46] and Vim [57], Adventurer offers an optimal efficiency-accuracy trade-off. For example, our Adventurer-Base attains a competitive test accuracy of 84.3% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 3.8× and 6.2× faster than Vim and DeiT to achieve the same result. As Adventurer offers great computation and memory efficiency and allows scaling with linear complexity, we hope this architecture can benefit future explorations in modeling long sequences for high-resolution or fine-grained images. Code is available at https://github.com/wangf3014/Adventurer. |
|---|---|
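The abstract describes two architectural elements: a global pooling token prepended to the patch-token sequence and a flip of the token order between every two layers. The minimal sketch below (not the authors' implementation) illustrates how these two designs can be wired into a causal backbone; the layer internals, readout choice, and hyperparameters are placeholders and assumptions. See https://github.com/wangf3014/Adventurer for the official code.

```python
# Minimal sketch of the two designs described in the abstract: a global
# pooling token at the start of the patch sequence and a flip of the token
# order between every two layers. The causal mixing layer is a stand-in
# (nn.Linear) for the actual selective state-space (Mamba-style) blocks,
# and the readout of the pooling token is an assumption.
import torch
import torch.nn as nn


class AdventurerSketch(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 12, num_patches: int = 196):
        super().__init__()
        self.pool_token = nn.Parameter(torch.zeros(1, 1, dim))  # global pooling token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim) patch embeddings
        b = patches.shape[0]
        x = torch.cat([self.pool_token.expand(b, -1, -1), patches], dim=1)
        x = x + self.pos_embed
        reversed_order = False
        for i, layer in enumerate(self.layers):
            x = layer(x)                  # placeholder for causal token mixing
            if i % 2 == 1:                # flip token order between every two layers
                x = torch.flip(x, dims=[1])
                reversed_order = not reversed_order
        # Read out the global pooling token; its position depends on the
        # current token order after the flips.
        return x[:, -1] if reversed_order else x[:, 0]


# Usage sketch: 196 patch tokens from a 224x224 image with 16x16 patches.
if __name__ == "__main__":
    feats = AdventurerSketch()(torch.randn(2, 196, 768))
    print(feats.shape)  # torch.Size([2, 768])
```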
| Author | Wang, Feng (Johns Hopkins University); Yang, Timing (Johns Hopkins University); Yu, Yaodong (UC Berkeley); Ren, Sucheng (Johns Hopkins University); Wei, Guoyizhe (Johns Hopkins University); Wang, Angtian (Johns Hopkins University); Shao, Wei (University of Florida); Zhou, Yuyin (UC Santa Cruz); Yuille, Alan (Johns Hopkins University); Xie, Cihang (UC Santa Cruz) |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding Journal Article |
| DOI | 10.1109/CVPR52734.2025.02807 |
| Discipline | Applied Sciences Computer Science |
| EISBN | 9798331543648 |
| EISSN | 1063-6919 |
| EndPage | 30166 |
| ExternalDocumentID | 41179969 11094828 |
| Genre | orig-research Journal Article |
| GrantInformation | NEI NIH HHS, grant R01 EY037193 |
| ISSN | 1063-6919 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PMID | 41179969 |
| PageCount | 10 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-Jun |
| PublicationDateYYYYMMDD | 2025-06-01 |
| PublicationDecade | 2020 |
| PublicationPlace | United States |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationTitleAlternate | Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit |
| PublicationYear | 2025 |
| Publisher | IEEE |
| SourceID | proquest pubmed ieee |
| SourceType | Aggregation Database Index Database Publisher |
| StartPage | 30157 |
| SubjectTerms | Complexity theory Computational modeling Computer vision Pattern recognition Predictive models Throughput Training Transformers Visualization |
| Title | Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency |
| URI | https://ieeexplore.ieee.org/document/11094828 https://www.ncbi.nlm.nih.gov/pubmed/41179969 https://www.proquest.com/docview/3268200389 |
| Volume | 2025 |