Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

Bibliographic details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) Vol. 2024; pp. 26140 - 26150
Main authors: You, Chenyu, Min, Yifei, Dai, Weicheng, Sekhon, Jasjeet S., Staib, Lawrence, Duncan, James S.
Format: Conference Proceeding; Journal Article
Language: English
Published: United States: IEEE, 01.06.2024
Subjects:
ISSN: 1063-6919
Online access: Full text
Abstract Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models is both time-intensive and computationally costly, and the tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features - patterns that correlate with the target in the training data but are not related to the true labeling function; and (iii) existing studies on mitigating reliance on spurious features, largely based on the assumption that such features can be identified, do not provide definitive assurance for real-world applications. As a pilot study, this work focuses on mitigating the reliance on spurious features for CLIP without using any group annotations. To this end, we systematically study the existence of spurious correlations in CLIP and CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness on pre-trained CLIP. In view of this, we advocate a lightweight representation calibration method for fine-tuning CLIP: first generating a calibration set using the pre-trained CLIP, and then calibrating the representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, largely reducing the reliance on spurious features and significantly boosting model generalization. Our code will be available here.
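The pipeline described in the abstract (freeze the pre-trained CLIP encoder, retrain only the last layer DFR-style, then build a calibration set and apply a contrastive loss over it) can be illustrated with a minimal, self-contained sketch. This is not the authors' released implementation: the placeholder encoder, the error-based calibration_idx heuristic, and the SupCon-style loss below are illustrative assumptions standing in for CLIP image features, the paper's calibration-set construction, and its contrastive objective.

    # Minimal sketch (not the authors' code): frozen features + last-layer
    # retraining + contrastive calibration over a pseudo-labelled set.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, d, num_classes = 512, 64, 2

    # Placeholder standing in for a frozen, pre-trained CLIP image encoder.
    encoder = nn.Linear(128, d)
    for p in encoder.parameters():
        p.requires_grad = False

    images = torch.randn(n, 128)                 # stand-in for image batches
    labels = torch.randint(0, num_classes, (n,))

    with torch.no_grad():
        feats = F.normalize(encoder(images), dim=-1)   # frozen representations

    # (1) DFR-style last-layer retraining: only a linear head is optimised.
    head = nn.Linear(d, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=0.1, weight_decay=1e-3)
    for _ in range(100):
        opt.zero_grad()
        F.cross_entropy(head(feats), labels).backward()
        opt.step()

    # (2) Build a calibration set without group labels; using the frozen
    #     model's own misclassified samples is one common heuristic.
    with torch.no_grad():
        preds = head(feats).argmax(dim=-1)
    calibration_idx = (preds != labels).nonzero(as_tuple=True)[0]

    # (3) Contrastive calibration over the set: pull together samples sharing
    #     a class label, push apart the rest (SupCon-style loss).
    def supcon_loss(z: torch.Tensor, y: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
        z = F.normalize(z, dim=-1)
        sim = z @ z.t() / tau
        self_mask = 1.0 - torch.eye(len(y))            # exclude self-pairs
        pos_mask = (y[:, None] == y[None, :]).float() * self_mask
        log_prob = sim - torch.logsumexp(sim + torch.log(self_mask + 1e-12), dim=1, keepdim=True)
        return -(pos_mask * log_prob).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()

    proj = nn.Linear(d, d)                        # lightweight trainable projection
    opt2 = torch.optim.Adam(proj.parameters(), lr=1e-3)
    opt2.zero_grad()
    cal_loss = supcon_loss(proj(feats[calibration_idx]), labels[calibration_idx])
    cal_loss.backward()
    opt2.step()
    print(f"calibration set size: {len(calibration_idx)}, loss: {cal_loss.item():.3f}")

Only the linear head and the small projection layer receive gradients here, which is what keeps such an approach lightweight relative to full fine-tuning of the CLIP backbone.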
Author Min, Yifei
Dai, Weicheng
Sekhon, Jasjeet S.
Staib, Lawrence
You, Chenyu
Duncan, James S.
Author_xml – sequence: 1
  givenname: Chenyu
  surname: You
  fullname: You, Chenyu
  organization: Yale University
– sequence: 2
  givenname: Yifei
  surname: Min
  fullname: Min, Yifei
  organization: Yale University
– sequence: 3
  givenname: Weicheng
  surname: Dai
  fullname: Dai, Weicheng
  organization: Yale University
– sequence: 4
  givenname: Jasjeet S.
  surname: Sekhon
  fullname: Sekhon, Jasjeet S.
  organization: Yale University
– sequence: 5
  givenname: Lawrence
  surname: Staib
  fullname: Staib, Lawrence
  organization: Yale University
– sequence: 6
  givenname: James S.
  surname: Duncan
  fullname: Duncan, James S.
  organization: Yale University
BackLink https://www.ncbi.nlm.nih.gov/pubmed/39640960 (View this record in MEDLINE/PubMed)
CODEN IEEPAD
ContentType Conference Proceeding
Journal Article
DBID 6IE
6IH
CBEJK
RIE
RIO
NPM
7X8
5PM
DOI 10.1109/CVPR52733.2024.02470
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Xplore
IEEE Proceedings Order Plans (POP) 1998-present
PubMed
MEDLINE - Academic
PubMed Central (Full Participant titles)
DatabaseTitle PubMed
MEDLINE - Academic
DatabaseTitleList PubMed
MEDLINE - Academic
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: RIE
  name: IEEE Xplore
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
– sequence: 3
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
Computer Science
EISBN 9798350353006
EISSN 1063-6919
EndPage 26150
ExternalDocumentID PMC11620289
39640960
10656450
Genre orig-research
Journal Article
GrantInformation_xml – fundername: NCATS NIH HHS
  grantid: UL1 TR001863
– fundername: NCI NIH HHS
  grantid: R01 CA206180
IEDL.DBID RIE
ISICitedReferencesCount 3
ISSN 1063-6919
IngestDate Tue Sep 30 17:06:38 EDT 2025
Thu Sep 04 14:30:25 EDT 2025
Sun Mar 16 01:22:05 EDT 2025
Wed Aug 27 02:00:48 EDT 2025
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
Equal contribution.
OpenAccessLink https://www.ncbi.nlm.nih.gov/pmc/articles/11620289
PMID 39640960
PQID 3146536075
PQPubID 23479
PageCount 11
ParticipantIDs pubmedcentral_primary_oai_pubmedcentral_nih_gov_11620289
proquest_miscellaneous_3146536075
ieee_primary_10656450
pubmed_primary_39640960
PublicationCentury 2000
PublicationDate 20240601
PublicationDateYYYYMMDD 2024-06-01
PublicationDate_xml – month: 6
  year: 2024
  text: 20240601
  day: 1
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationTitleAlternate Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
ssj0023720
SourceID pubmedcentral
proquest
pubmed
ieee
SourceType Open Access Repository
Aggregation Database
Index Database
Publisher
StartPage 26140
SubjectTerms Annotations
Benchmark testing
Computational modeling
Contrastive learning
Group Robustness; Multi-Modal Learning; Spurious Correlations
Refining
Training data
Visualization
Title Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
URI https://ieeexplore.ieee.org/document/10656450
https://www.ncbi.nlm.nih.gov/pubmed/39640960
https://www.proquest.com/docview/3146536075
https://pubmed.ncbi.nlm.nih.gov/PMC11620289
Volume 2024
WOSCitedRecordID wos001344387502046
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE