Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models is both time-intensive and computationally costly, and the tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features, i.e., patterns that correlate with the target in the training data but are not related to the true labeling function; and (iii) existing studies on mitigating reliance on spurious features, largely based on the assumption that such features can be identified, do not provide definitive assurance for real-world applications. As a pilot study, this work focuses on mitigating CLIP's reliance on spurious features without using any group annotations. To this end, we systematically study the existence of spurious correlations in CLIP and CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness on pretrained CLIP. In view of these findings, we advocate a lightweight representation calibration method for fine-tuning CLIP: first generating a calibration set using the pretrained CLIP, and then calibrating the representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, largely reducing reliance on spurious features and significantly boosting model generalization. Our code will be available here.
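Two method details in the abstract lend themselves to illustration. First, the Deep Feature Reweighting (DFR) check: last-layer retraining means freezing the CLIP backbone and refitting only a final linear classifier, then judging group robustness by worst-group accuracy. The sketch below shows the usual shape of that procedure; it is not the paper's released code, and every data name in it (`feats_balanced`, `eval_groups`, etc.) is a hypothetical placeholder for arrays the caller must supply.

```python
# Illustrative sketch of DFR-style last-layer retraining on frozen CLIP
# features. NOT the paper's code; feats_balanced, labels_balanced, eval_feats,
# eval_labels, and eval_groups are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def last_layer_retrain(feats_balanced, labels_balanced):
    """Refit only the final linear layer on a small group-balanced split
    (DFR itself assumes group labels are available for this split)."""
    clf = LogisticRegression(C=1.0, max_iter=1000)
    clf.fit(feats_balanced, labels_balanced)
    return clf

def worst_group_accuracy(clf, eval_feats, eval_labels, eval_groups):
    """Group robustness is conventionally reported as worst-group accuracy."""
    accs = []
    for g in np.unique(eval_groups):
        m = eval_groups == g
        accs.append(clf.score(eval_feats[m], eval_labels[m]))
    return min(accs)
```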
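Second, the annotation-free calibration recipe: the record only says that a calibration set is generated with the pretrained CLIP and that representations in it are then calibrated via contrastive learning. A plausible minimal sketch follows, assuming (a) the calibration set consists of samples the zero-shot CLIP misclassifies (a common group-label-free heuristic for surfacing minority-group examples) and (b) calibration uses a supervised contrastive loss over class labels; neither assumption is confirmed by this record. The `clip` package calls reflect OpenAI's public API; all helper names are illustrative.

```python
# Minimal sketch under stated assumptions; not the paper's implementation.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone stays frozen

@torch.no_grad()
def zero_shot_preds(images, class_prompts):
    """Zero-shot classification with the frozen pretrained CLIP."""
    text = clip.tokenize(class_prompts).to(device)
    img = F.normalize(model.encode_image(images.to(device)).float(), dim=-1)
    txt = F.normalize(model.encode_text(text).float(), dim=-1)
    return (img @ txt.T).argmax(dim=-1).cpu()

def build_calibration_set(loader, class_prompts):
    """Step 1 (assumed): collect samples the pretrained CLIP gets wrong --
    likely minority-group examples -- without touching group annotations."""
    calib = []
    for images, labels in loader:
        wrong = zero_shot_preds(images, class_prompts) != labels
        calib += list(zip(images[wrong], labels[wrong]))
    return calib

def supcon_loss(z, y, tau=0.1):
    """Step 2 objective (assumed): pull same-class representations together,
    push different-class ones apart (per-positive-pair SupCon variant)."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.T / tau
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~eye
    logprob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    return -logprob[pos].mean()  # assumes each batch has >= 1 positive pair

# Train only a small calibration head; ViT-B/32 image features are 512-d.
head = torch.nn.Linear(512, 512).to(device)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

def calibrate(calib_loader, epochs=5):
    for _ in range(epochs):
        for images, labels in calib_loader:
            with torch.no_grad():
                feats = model.encode_image(images.to(device)).float()
            loss = supcon_loss(head(feats), labels.to(device))
            opt.zero_grad(); loss.backward(); opt.step()
```

Training only a small head on frozen features keeps the procedure lightweight, consistent with the abstract's emphasis on avoiding full-model fine-tuning.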
Saved in:
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) [CVPR], vol. 2024, pp. 26140-26150 (11 pages) |
|---|---|
| Main authors: | You, Chenyu; Min, Yifei; Dai, Weicheng; Sekhon, Jasjeet S.; Staib, Lawrence; Duncan, James S. (all Yale University) |
| Format: | Conference proceeding; journal article |
| Language: | English |
| Published: | United States: IEEE, 01.06.2024 |
| Subjects: | Annotations; Benchmark testing; Computational modeling; Contrastive learning; Group Robustness; Multi-Modal Learning; Spurious Correlations; Refining; Training data; Visualization |
| Discipline: | Applied Sciences; Computer Science |
| ISSN/EISSN: | 1063-6919 |
| EISBN: | 9798350353006 |
| DOI: | 10.1109/CVPR52733.2024.02470 |
| Identifiers: | PMID 39640960; PMCID PMC11620289; IEEE document 10656450; CODEN IEEPAD |
| Funding: | NCATS NIH HHS, grant UL1 TR001863; NCI NIH HHS, grant R01 CA206180 |
| Notes: | Equal contribution. |
| Online access: | Full text: https://ieeexplore.ieee.org/document/10656450; open access: https://www.ncbi.nlm.nih.gov/pmc/articles/11620289 |