Multi-class Token Transformer for Weakly Supervised Semantic Segmentation

This paper proposes a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS). Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to for...

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 4300-4309
Main Authors: Xu, Lian, Ouyang, Wanli, Bennamoun, Mohammed, Boussaid, Farid, Xu, Dan
Format: Conference Proceeding
Language: English
Published: IEEE 01.06.2022
Subjects:
ISSN: 1063-6919
Abstract This paper proposes a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS). Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate whether the transformer model can also effectively capture class-specific attention for more discriminative object localization by learning multiple class tokens within the transformer. To this end, we propose a Multi-class Token Transformer, termed MCTformer, which uses multiple class tokens to learn interactions between the class tokens and the patch tokens. The proposed MCTformer can successfully produce class-discriminative object localization maps from the class-to-patch attentions corresponding to different class tokens. We also propose to use a patch-level pairwise affinity, which is extracted from the patch-to-patch transformer attention, to further refine the localization maps. Moreover, the proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets. These results underline the importance of the class token for WSSS. Code: https://github.com/xulianuwa/MCTformer
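The two attention-based steps named in the abstract (class-to-patch attention as a localization map, and refinement with a patch-to-patch affinity) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which layers or heads are used or how they are fused, so the sketch assumes a single, already-averaged attention matrix whose first C rows and columns belong to the class tokens and whose remaining rows and columns belong to the patch tokens. The function name `class_localization_maps`, the token ordering, and the refinement-by-matrix-product step are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def class_localization_maps(attn, num_classes, grid_hw):
    """Split one attention matrix into per-class maps (hypothetical helper).

    attn        : (C + N, C + N) attention matrix, class tokens first (assumed).
    num_classes : C, the number of class tokens.
    grid_hw     : (h, w) patch grid with h * w == N.
    Returns the raw and affinity-refined maps, each of shape (C, h, w).
    """
    h, w = grid_hw
    c = num_classes
    assert attn.shape == (c + h * w, c + h * w)

    class_to_patch = attn[:c, c:]  # (C, N): how each class token attends to patches
    patch_to_patch = attn[c:, c:]  # (N, N): pairwise patch affinity

    # Assumed refinement: each patch's score becomes an affinity-weighted
    # sum of the scores of all patches, computed per class.
    refined = class_to_patch @ patch_to_patch.T  # (C, N)

    return class_to_patch.reshape(c, h, w), refined.reshape(c, h, w)


# Toy input: random logits turned into row-stochastic attention.
rng = np.random.default_rng(0)
C, H, W = 3, 4, 4
attn = softmax(rng.normal(size=(C + H * W, C + H * W)))
raw_maps, refined_maps = class_localization_maps(attn, C, (H, W))
```

With an identity patch-to-patch affinity the refined maps equal the raw maps, which is a quick sanity check that the refinement only redistributes scores among patches that attend to each other.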
Author Bennamoun, Mohammed
Boussaid, Farid
Xu, Lian
Ouyang, Wanli
Xu, Dan
Author_xml – sequence: 1
  givenname: Lian
  surname: Xu
  fullname: Xu, Lian
  email: lian.xu@uwa.edu.au
  organization: The University of Western Australia
– sequence: 2
  givenname: Wanli
  surname: Ouyang
  fullname: Ouyang, Wanli
  email: wanli.ouyang@sydney.edu.au
  organization: The University of Sydney, SenseTime Computer Vision Group, Australia
– sequence: 3
  givenname: Mohammed
  surname: Bennamoun
  fullname: Bennamoun, Mohammed
  email: mohammed.bennamoun@uwa.edu.au
  organization: The University of Western Australia
– sequence: 4
  givenname: Farid
  surname: Boussaid
  fullname: Boussaid, Farid
  email: farid.boussaid@uwa.edu.au
  organization: The University of Western Australia
– sequence: 5
  givenname: Dan
  surname: Xu
  fullname: Xu, Dan
  email: danxu@cse.ust.hk
  organization: Hong Kong University of Science and Technology
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR52688.2022.00427
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 1665469463
9781665469463
EISSN 1063-6919
EndPage 4309
ExternalDocumentID 9879800
Genre orig-research
GrantInformation_xml – fundername: Australian Research Council
  grantid: DP210101682,DP210102674,DP200103223
  funderid: 10.13039/501100000923
IEDL.DBID RIE
ISICitedReferencesCount 120
IngestDate Wed Aug 27 02:15:10 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 10
ParticipantIDs ieee_primary_9879800
PublicationCentury 2000
PublicationDate 2022-June
PublicationDateYYYYMMDD 2022-06-01
PublicationDate_xml – month: 06
  year: 2022
  text: 2022-June
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2022
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
Score 2.6479986
SourceID ieee
SourceType Publisher
StartPage 4300
SubjectTerms Computer vision
grouping and shape analysis
Location awareness
Object detection
Pattern recognition
Segmentation
Semantics
Shape
Transformers
Title Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
URI https://ieeexplore.ieee.org/document/9879800