Token Contrast for Weakly-Supervised Semantic Segmentation
Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically utilizes Class Activation Map (CAM) to generate the pseudo labels. Limited by the local structure perception of CNN, CAM usually cannot identify the integral object regions. Though the recent Vision Transformer (ViT) c...
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 3093 - 3102 |
|---|---|
| Main Authors: | Ru, Lixiang; Zheng, Heliang; Zhan, Yibing; Du, Bo |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.06.2023 |
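| Subjects: | Codes; Computer vision; Image analysis; Pattern recognition; Scene analysis and understanding; Semantic segmentation; Semantics; Transformers |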
| ISSN: | 1063-6919 |
| Abstract | Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically uses Class Activation Maps (CAM) to generate pseudo labels. Limited by the local structure perception of CNNs, CAM usually cannot identify complete object regions. Although the recent Vision Transformer (ViT) can remedy this flaw, we observe that it also introduces an over-smoothing issue, i.e., the final patch tokens tend to become uniform. In this work, we propose Token Contrast (ToCo) to address this issue and further explore the strengths of ViT for WSSS. First, motivated by the observation that intermediate layers in ViT still retain semantic diversity, we design a Patch Token Contrast module (PTC). PTC supervises the final patch tokens with pseudo token relations derived from intermediate layers, allowing them to align with the semantic regions and thus yield more accurate CAM. Second, to further differentiate the low-confidence regions in CAM, we devise a Class Token Contrast module (CTC), inspired by the fact that class tokens in ViT can capture high-level semantics. CTC encourages representation consistency between uncertain local regions and global objects by contrasting their class tokens. Experiments on the PASCAL VOC and MS COCO datasets show that the proposed ToCo substantially outperforms other single-stage competitors and achieves performance comparable to state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo. |
|---|---|
| Author | Ru, Lixiang; Du, Bo; Zheng, Heliang; Zhan, Yibing |
| Author_xml | – sequence: 1 givenname: Lixiang surname: Ru fullname: Ru, Lixiang email: rulixiang@whu.edu.cn organization: Institute of Artificial Intelligence, School of Computer Science, National Engineering Research Center for Multimedia Software, Wuhan University, Hubei Key Laboratory of Multimedia and Network Communication Engineering, China – sequence: 2 givenname: Heliang surname: Zheng fullname: Zheng, Heliang email: zhengheliang@jd.com organization: JD Explore Academy, China – sequence: 3 givenname: Yibing surname: Zhan fullname: Zhan, Yibing email: zhanyibing@jd.com organization: JD Explore Academy, China – sequence: 4 givenname: Bo surname: Du fullname: Du, Bo email: dubo@whu.edu.cn organization: Institute of Artificial Intelligence, School of Computer Science, National Engineering Research Center for Multimedia Software, Wuhan University, Hubei Key Laboratory of Multimedia and Network Communication Engineering, China |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IH CBEJK RIE RIO |
| DOI | 10.1109/CVPR52729.2023.00302 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE/IET Electronic Library (IEL) (UW System Shared) IEEE Proceedings Order Plans (POP) 1998-present |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE/IET Electronic Library url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Applied Sciences |
| EISBN | 9798350301298 |
| EISSN | 1063-6919 |
| EndPage | 3102 |
| ExternalDocumentID | 10204562 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: National Natural Science Foundation of China grantid: 62225113,62002090 funderid: 10.13039/501100001809 |
| GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
| ID | FETCH-LOGICAL-i204t-48baf2a310608900772e5ea61ac2db8ad0555c57fb8fe7ee8cc621b5e7b604ac3 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 133 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001058542603040&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:56:29 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i204t-48baf2a310608900772e5ea61ac2db8ad0555c57fb8fe7ee8cc621b5e7b604ac3 |
| PageCount | 10 |
| ParticipantIDs | ieee_primary_10204562 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-June |
| PublicationDateYYYYMMDD | 2023-06-01 |
| PublicationDate_xml | – month: 06 year: 2023 text: 2023-June |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationYear | 2023 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0003211698 |
| Score | 2.6227999 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 3093 |
| SubjectTerms | Codes; Computer vision; Image analysis; Pattern recognition; Scene analysis and understanding; Semantic segmentation; Semantics; Transformers |
| Title | Token Contrast for Weakly-Supervised Semantic Segmentation |
| URI | https://ieeexplore.ieee.org/document/10204562 |
| WOSCitedRecordID | wos001058542603040&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=Token+Contrast+for+Weakly-Supervised+Semantic+Segmentation&rft.au=Ru%2C+Lixiang&rft.au=Zheng%2C+Heliang&rft.au=Zhan%2C+Yibing&rft.au=Du%2C+Bo&rft.date=2023-06-01&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=3093&rft.epage=3102&rft_id=info:doi/10.1109%2FCVPR52729.2023.00302&rft.externalDocID=10204562 |
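
The abstract describes PTC as supervising the final patch tokens with pseudo token relations derived from intermediate layers. The following is a minimal sketch of that idea only, not the authors' implementation (which is available at the repository named in the abstract). Everything in it is an assumption made for illustration: the function names `pseudo_relations` and `patch_contrast_loss` are hypothetical, relations are obtained here by simply thresholding the cosine similarity of intermediate-layer tokens, and the loss form is a generic pull/push contrast rather than the one used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def _mean(values):
    """Mean of a list, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def pseudo_relations(intermediate_tokens, threshold=0.5):
    """Mark each token pair as related (1) or unrelated (0).

    Assumption: relations are derived by thresholding the cosine
    similarity of intermediate-layer tokens. The paper may derive
    its pseudo token relations differently.
    """
    n = len(intermediate_tokens)
    return [
        [
            1 if cosine(intermediate_tokens[i], intermediate_tokens[j]) > threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


def patch_contrast_loss(final_tokens, relations):
    """Pull related final tokens together and push unrelated ones apart.

    Assumption: a generic contrast term, chosen for illustration only.
    """
    positive_terms, negative_terms = [], []
    n = len(final_tokens)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(final_tokens[i], final_tokens[j])
            if relations[i][j]:
                positive_terms.append(1.0 - sim)      # related pair: want sim near 1
            else:
                negative_terms.append(max(sim, 0.0))  # unrelated pair: want sim <= 0
    return _mean(positive_terms) + _mean(negative_terms)


# Toy data: four patch tokens, two per semantic region.
intermediate = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
over_smoothed = [[1.0, 1.0], [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]

relations = pseudo_relations(intermediate)
loss_over_smoothed = patch_contrast_loss(over_smoothed, relations)
loss_aligned = patch_contrast_loss(aligned, relations)
```

In this toy setup the nearly uniform (over-smoothed) final tokens receive a higher loss than final tokens that follow the two regions found in the intermediate layer, which is the behaviour the abstract attributes to PTC. A real implementation would operate on batched tensors in a deep-learning framework rather than on Python lists.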