Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 4300 - 4309 |
|---|---|
| Main Authors: | Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, Dan Xu |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.06.2022 |
| Subjects: | |
| ISSN: | 1063-6919 |
| Online Access: | https://ieeexplore.ieee.org/document/9879800 |
| Abstract | This paper proposes a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS). Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate whether the transformer model can also effectively capture class-specific attention for more discriminative object localization by learning multiple class tokens within the transformer. To this end, we propose a Multi-class Token Transformer, termed MCTformer, which uses multiple class tokens to learn interactions between the class tokens and the patch tokens. The proposed MCTformer can successfully produce class-discriminative object localization maps from the class-to-patch attentions corresponding to different class tokens. We also propose to use a patch-level pairwise affinity, which is extracted from the patch-to-patch transformer attention, to further refine the localization maps. Moreover, the proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets. These results underline the importance of the class token for WSSS. Code: https://github.com/xulianuwa/MCTformer |
|---|---|
| Author | Bennamoun, Mohammed; Boussaid, Farid; Xu, Lian; Ouyang, Wanli; Xu, Dan |
| Author_xml | 1. Xu, Lian (lian.xu@uwa.edu.au), The University of Western Australia; 2. Ouyang, Wanli (wanli.ouyang@sydney.edu.au), The University of Sydney, SenseTime Computer Vision Group, Australia; 3. Bennamoun, Mohammed (mohammed.bennamoun@uwa.edu.au), The University of Western Australia; 4. Boussaid, Farid (farid.boussaid@uwa.edu.au), The University of Western Australia; 5. Xu, Dan (danxu@cse.ust.hk), Hong Kong University of Science and Technology |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/CVPR52688.2022.00427 |
| Discipline | Applied Sciences |
| EISBN | 1665469463, 9781665469463 |
| EISSN | 1063-6919 |
| EndPage | 4309 |
| ExternalDocumentID | 9879800 |
| Genre | orig-research |
| GrantInformation_xml | Funder: Australian Research Council (funder ID 10.13039/501100000923); grant IDs: DP210101682, DP210102674, DP200103223 |
| ISICitedReferencesCount | 120 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 10 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-June |
| PublicationDateYYYYMMDD | 2022-06-01 |
| PublicationDate_xml | – month: 06 year: 2022 text: 2022-June |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationYear | 2022 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 4300 |
| SubjectTerms | Computer vision; grouping and shape analysis; Location awareness; Object detection; Pattern recognition; Segmentation; Semantics; Shape; Transformers |
| Title | Multi-class Token Transformer for Weakly Supervised Semantic Segmentation |
| URI | https://ieeexplore.ieee.org/document/9879800 |
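
The mechanism summarized in the abstract, class-specific maps read from class-to-patch attention and then refined with patch-to-patch affinity, can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation (their code is in the repository linked in the abstract). The token layout (C class tokens first, followed by N patch tokens), the single random attention matrix, and the function names `class_to_patch_maps` and `refine_with_affinity` are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def class_to_patch_maps(attn, num_classes, grid_hw):
    """Slice the class-to-patch block out of a (C+N, C+N) attention matrix.

    Assumption: the first `num_classes` tokens are class tokens and the
    remaining N = H*W tokens are patch tokens in row-major order.
    Returns an array of shape (C, H, W), one map per class token.
    """
    h, w = grid_hw
    c2p = attn[:num_classes, num_classes:]  # (C, N)
    return c2p.reshape(num_classes, h, w)


def refine_with_affinity(maps, attn, num_classes):
    """Refine each class map with the patch-to-patch attention block.

    Each patch's refined score is the attention-weighted sum of the
    scores of the patches it attends to (an assumed, simple form of
    affinity-based refinement).
    """
    c, h, w = maps.shape
    p2p = attn[num_classes:, num_classes:]  # (N, N) pairwise affinity
    refined = maps.reshape(c, h * w) @ p2p.T  # (C, N)
    return refined.reshape(c, h, w)


# Toy example: 3 class tokens and a 4x4 grid of patch tokens.
rng = np.random.default_rng(0)
C, H, W = 3, 4, 4
T = C + H * W
attn = softmax(rng.normal(size=(T, T)), axis=-1)  # stand-in attention

maps = class_to_patch_maps(attn, C, (H, W))
refined = refine_with_affinity(maps, attn, C)
print(maps.shape, refined.shape)
```

In a real model the attention would typically be averaged over heads and possibly over several layers before slicing; the sketch uses one matrix only to keep the indexing visible.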