TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation


Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 12073-12083
Main Authors: Zhang, Wenqiang; Huang, Zilong; Luo, Guozhong; Chen, Tao; Wang, Xinggang; Liu, Wenyu; Yu, Gang; Shen, Chunhua
Format: Conference Paper
Language: English
Published: IEEE, 1 June 2022
ISSN: 1063-6919
Abstract Although vision transformers (ViTs) have achieved great success in computer vision, the heavy computational cost hampers their applications to dense prediction tasks such as semantic segmentation on mobile devices. In this paper, we present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer). The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation. Experimental results demonstrate that our method significantly outperforms CNN- and ViT-based networks across several semantic segmentation datasets and achieves a good trade-off between accuracy and latency. On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device. Furthermore, the tiny version of TopFormer achieves real-time inference on an ARM-based mobile device with competitive results. The code and models are available at: https://github.com/hustvl/TopFormer.
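The abstract reports accuracy as mIoU (mean intersection over union). As a reading aid, the metric can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example, and it does not handle the "ignore" label that benchmark evaluation scripts typically support.

```python
from collections import Counter


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred and target are flat sequences of integer class labels, one per pixel
    (hypothetical input layout chosen for this sketch).
    """
    intersection = Counter()   # pixels where prediction and ground truth agree, per class
    pred_count = Counter()     # pixels predicted as each class
    target_count = Counter()   # pixels labeled as each class
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union == 0:
            continue  # class absent from both maps; skipped by assumption
        ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy six-pixel example with three classes.
    print(mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. A "5% higher mIoU" in the abstract refers to a difference in this averaged score on ADE20K.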
Author Zhang, Wenqiang
Shen, Chunhua
Wang, Xinggang
Luo, Guozhong
Yu, Gang
Chen, Tao
Huang, Zilong
Liu, Wenyu
Author_xml – sequence: 1
  givenname: Wenqiang
  surname: Zhang
  fullname: Zhang, Wenqiang
  organization: Huazhong University of Science and Technology,China
– sequence: 2
  givenname: Zilong
  surname: Huang
  fullname: Huang, Zilong
  organization: Tencent PCG,China
– sequence: 3
  givenname: Guozhong
  surname: Luo
  fullname: Luo, Guozhong
  organization: Tencent PCG,China
– sequence: 4
  givenname: Tao
  surname: Chen
  fullname: Chen, Tao
  organization: Fudan University,China
– sequence: 5
  givenname: Xinggang
  surname: Wang
  fullname: Wang, Xinggang
  email: xgwang@hust.edu.cn
  organization: Huazhong University of Science and Technology,China
– sequence: 6
  givenname: Wenyu
  surname: Liu
  fullname: Liu, Wenyu
  email: liuwy@hust.edu.cn
  organization: Huazhong University of Science and Technology,China
– sequence: 7
  givenname: Gang
  surname: Yu
  fullname: Yu, Gang
  organization: Tencent PCG,China
– sequence: 8
  givenname: Chunhua
  surname: Shen
  fullname: Shen, Chunhua
  organization: Zhejiang University,China
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR52688.2022.01177
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 1665469463
9781665469463
EISSN 1063-6919
EndPage 12083
ExternalDocumentID 9880208
Genre orig-research
GrantInformation_xml – fundername: NSFC
  grantid: 61733007,61876212,62071127,61773176
  funderid: 10.13039/501100001809
IEDL.DBID RIE
ISICitedReferencesCount 258
IngestDate Wed Aug 27 02:15:10 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i203t-23dba77dae0721d661a2315157ae7d6d5201cfb98c0df706974bd1a338fde9af3
PageCount 11
ParticipantIDs ieee_primary_9880208
PublicationCentury 2000
PublicationDate 2022-June
PublicationDateYYYYMMDD 2022-06-01
PublicationDate_xml – month: 06
  year: 2022
  text: 2022-June
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2022
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
Score 2.6487713
SourceID ieee
SourceType Publisher
StartPage 12073
SubjectTerms Computer architecture
Computer vision
Deep learning
Deep learning architectures and techniques; Segmentation, grouping and shape analysis
Mobile handsets
Semantics
Shape
Transformers
Title TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
URI https://ieeexplore.ieee.org/document/9880208
hasFullText 1
inHoldings 1
isFullTextHit
isPrint