TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation

Although vision transformers (ViTs) have achieved great success in computer vision, the heavy computational cost hampers their applications to dense prediction tasks such as semantic segmentation on mobile devices. In this paper, we present a mobile-friendly architecture named Token Pyramid Vision T...

Bibliographic details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 12073 - 12083
Main authors: Zhang, Wenqiang; Huang, Zilong; Luo, Guozhong; Chen, Tao; Wang, Xinggang; Liu, Wenyu; Yu, Gang; Shen, Chunhua
Format: Conference paper
Language: English
Published: IEEE, 1 June 2022
ISSN:1063-6919
Abstract Although vision transformers (ViTs) have achieved great success in computer vision, the heavy computational cost hampers their applications to dense prediction tasks such as semantic segmentation on mobile devices. In this paper, we present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer). The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation. Experimental results demonstrate that our method significantly outperforms CNN- and ViT-based networks across several semantic segmentation datasets and achieves a good trade-off between accuracy and latency. On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device. Furthermore, the tiny version of TopFormer achieves real-time inference on an ARM-based mobile device with competitive results. The code and models are available at: https://github.com/hustvl/TopFormer.
Author Zhang, Wenqiang
Shen, Chunhua
Wang, Xinggang
Luo, Guozhong
Yu, Gang
Chen, Tao
Huang, Zilong
Liu, Wenyu
Author_xml – sequence: 1
  givenname: Wenqiang
  surname: Zhang
  fullname: Zhang, Wenqiang
  organization: Huazhong University of Science and Technology,China
– sequence: 2
  givenname: Zilong
  surname: Huang
  fullname: Huang, Zilong
  organization: Tencent PCG,China
– sequence: 3
  givenname: Guozhong
  surname: Luo
  fullname: Luo, Guozhong
  organization: Tencent PCG,China
– sequence: 4
  givenname: Tao
  surname: Chen
  fullname: Chen, Tao
  organization: Fudan University,China
– sequence: 5
  givenname: Xinggang
  surname: Wang
  fullname: Wang, Xinggang
  email: xgwang@hust.edu.cn
  organization: Huazhong University of Science and Technology,China
– sequence: 6
  givenname: Wenyu
  surname: Liu
  fullname: Liu, Wenyu
  email: liuwy@hust.edu.cn
  organization: Huazhong University of Science and Technology,China
– sequence: 7
  givenname: Gang
  surname: Yu
  fullname: Yu, Gang
  organization: Tencent PCG,China
– sequence: 8
  givenname: Chunhua
  surname: Shen
  fullname: Shen, Chunhua
  organization: Zhejiang University,China
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR52688.2022.01177
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 1665469463
9781665469463
EISSN 1063-6919
EndPage 12083
ExternalDocumentID 9880208
Genre orig-research
GrantInformation_xml – fundername: NSFC
  grantid: 61733007,61876212,62071127,61773176
  funderid: 10.13039/501100001809
GroupedDBID 6IE
6IH
6IL
6IN
AAWTH
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IJVOP
OCL
RIE
RIL
RIO
ID FETCH-LOGICAL-i203t-23dba77dae0721d661a2315157ae7d6d5201cfb98c0df706974bd1a338fde9af3
IEDL.DBID RIE
ISICitedReferencesCount 261
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000870759105016&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:15:10 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i203t-23dba77dae0721d661a2315157ae7d6d5201cfb98c0df706974bd1a338fde9af3
PageCount 11
ParticipantIDs ieee_primary_9880208
PublicationCentury 2000
PublicationDate 2022-June
PublicationDateYYYYMMDD 2022-06-01
PublicationDate_xml – month: 06
  year: 2022
  text: 2022-June
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2022
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
Score 2.6485999
Snippet Although vision transformers (ViTs) have achieved great success in computer vision, the heavy computational cost hampers their applications to dense prediction...
SourceID ieee
SourceType Publisher
StartPage 12073
SubjectTerms Computer architecture
Computer vision
Deep learning
Deep learning architectures and techniques; Segmentation, grouping and shape analysis
Mobile handsets
Semantics
Shape
Transformers
Title TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
URI https://ieeexplore.ieee.org/document/9880208
WOSCitedRecordID wos000870759105016&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV09T8MwED2VioGpQIugfMgDI2lT58Mxa9WKhSqCgLpVjn1BEWpSpS0S_55zGoqQWJhiZYl0sf3es-_dAdymoSsMCu1IEfgkULLQUYbWI6F9Jjnhi8G0bjYhZrNoPpdxC-72XhhErJPPcGCH9V2-KfXWHpUNJU02bp29B0KInVdrf57ikZIJZdS440auHI5f4ydbzMQmcHE-sMXPfvdQqSFk2vnfx4-h9-PFY_EeZU6ghcUpdBryyJqlue7CJClXUyKgWN2zpHzHgsWflVrmhiXf3BQrRg_2WKa0FbBnXFJUc02Dt2XjQCp68DKdJOMHp-mR4OTc9TYO90yqhDAKbaEzQ2iriLERSREKhQlNQACvs1RG2jWZcEOSD6kZKRKmmUGpMu8M2kVZ4DkwbUiZBrTfceX5SnAVBSpSLvo-AZjU2QV0bVQWq10ZjEUTkP7fry_hyIZ9l1V1Be1NtcVrONQfm3xd3dT_7gu5CZsU
linkProvider IEEE
linkToHtml http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV3PT8IwFG4ImugJFYy_7cGjg9H96OqVQDACWXQabqRr38xC2MgAE_97X8fEmHjxtGaXJq9rv-_r3vceIXexb3MNXFmCey4KlMS3pMb9iGifCIb4oiEum03wySSYTkVYI_c7LwwAlMln0DbD8l--ztXGXJV1BH5szDh79zzXZd2tW2t3o-KglvFFUPnjurbo9N7CZ1POxKRwMdY25c9-d1EpQWTQ-N_0R6T148aj4Q5njkkNshPSqOgjrTbnqkn6Ub4cIAWF4oFG-RwyGn4WcpFqGn2zUygoPug4j_EwoC-wwLimCgfvi8qDlLXI66Af9YZW1SXBSpntrC3m6FhyriWYUmca8VYiZ0OawiVw7WsPIV4lsQiUrRNu-yggYt2VKE0TDUImzimpZ3kGZ4QqjdrUwxOPSceVnMnAk4G0wXURwoRKzknTRGW23BbCmFUBufj79S05GEbj0Wz0OHm6JIdmCbY5Vlekvi42cE321cc6XRU35Tp-AfnFnls
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=TopFormer%3A+Token+Pyramid+Transformer+for+Mobile+Semantic+Segmentation&rft.au=Zhang%2C+Wenqiang&rft.au=Huang%2C+Zilong&rft.au=Luo%2C+Guozhong&rft.au=Chen%2C+Tao&rft.date=2022-06-01&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=12073&rft.epage=12083&rft_id=info:doi/10.1109%2FCVPR52688.2022.01177&rft.externalDocID=9880208