TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation

Bibliographic Details
Published in:	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 12073-12083
Main Authors:	Zhang, Wenqiang; Huang, Zilong; Luo, Guozhong; Chen, Tao; Wang, Xinggang; Liu, Wenyu; Yu, Gang; Shen, Chunhua
Format:	Conference paper
Language:	English
Published:	IEEE, 1 June 2022
Subjects:	Computer architecture; Computer vision; Deep learning; Deep learning architectures and techniques; Segmentation, grouping and shape analysis; Mobile handsets; Semantics; Shape; Transformers
ISSN:	1063-6919

Abstract	Although vision transformers (ViTs) have achieved great success in computer vision, the heavy computational cost hampers their applications to dense prediction tasks such as semantic segmentation on mobile devices. In this paper, we present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer). The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation. Experimental results demonstrate that our method significantly outperforms CNN- and ViT-based networks across several semantic segmentation datasets and achieves a good trade-off between accuracy and latency. On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device. Furthermore, the tiny version of TopFormer achieves real-time inference on an ARM-based mobile device with competitive results. The code and models are available at: https://github.com/hustvl/TopFormer.
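The abstract reports accuracy as mIoU (mean intersection-over-union) without defining it. As a rough illustration only, the sketch below computes mIoU for flat lists of per-pixel class labels. The function name `mean_iou`, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example; they are not taken from the paper or its repository, and benchmark tooling typically accumulates counts over a whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes (illustrative sketch).

    pred and target are equal-length sequences of integer class labels,
    one per pixel. Classes that appear in neither sequence are skipped
    (an assumption of this sketch).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, 3))  # per-class IoUs are 1/3, 2/3 and 1/2
```

Averaging per class rather than per pixel means that small classes count as much as large ones, which is why the metric is commonly used for segmentation benchmarks.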
Authors and affiliations:
  1. Zhang, Wenqiang (Huazhong University of Science and Technology, China)
  2. Huang, Zilong (Tencent PCG, China)
  3. Luo, Guozhong (Tencent PCG, China)
  4. Chen, Tao (Fudan University, China)
  5. Wang, Xinggang (Huazhong University of Science and Technology, China), xgwang@hust.edu.cn
  6. Liu, Wenyu (Huazhong University of Science and Technology, China), liuwy@hust.edu.cn
  7. Yu, Gang (Tencent PCG, China)
  8. Shen, Chunhua (Zhejiang University, China)
CODEN	IEEPAD
ContentType	Conference Proceeding
DOI	10.1109/CVPR52688.2022.01177
Discipline	Applied Sciences
EISBN	1665469463 9781665469463
EISSN	1063-6919
EndPage	12083
ExternalDocumentID	9880208
Genre	orig-research
Funding	NSFC, grants 61733007, 61876212, 62071127, 61773176 (funder ID: 10.13039/501100001809)
ISICitedReferencesCount	258
IsPeerReviewed	false
IsScholarly	true
Language	English
PageCount	11
PublicationTitleAbbrev	CVPR
StartPage	12073
URI	https://ieeexplore.ieee.org/document/9880208