Vulnerability Detection with Code Language Models: How Far are We?

Bibliographic Details
Published in: Proceedings / International Conference on Software Engineering, pp. 1729–1741
Main Authors: Ding, Yangruibo, Fu, Yanjun, Ibrahim, Omniyyah, Sitawarin, Chawin, Chen, Xinyun, Alomair, Basel, Wagner, David, Ray, Baishakhi, Chen, Yizheng
Format: Conference Proceeding
Language: English
Published: IEEE, 26.04.2025
Subjects:
ISSN:1558-1225
Abstract In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
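The de-duplication, chronological splitting, and F1 evaluation the abstract mentions can be sketched in a few lines. Everything below (the sample records, cutoff date, and helper names) is a hypothetical illustration, not PrimeVul's actual pipeline:

```python
from datetime import date

# Hypothetical records: (function_id, commit_date, label); 1 = vulnerable.
records = [
    ("f1", date(2019, 3, 1), 1),
    ("f2", date(2019, 6, 1), 0),
    ("f2", date(2019, 6, 1), 0),   # exact duplicate, dropped below
    ("f3", date(2021, 1, 1), 1),
    ("f4", date(2022, 5, 1), 0),
]

# De-duplication: keep one copy of each unique record.
unique = sorted(set(records), key=lambda r: r[1])

# Chronological split: train on older commits, test on newer ones,
# so training data never postdates the test set (mitigating leakage).
cutoff = date(2020, 1, 1)
train = [r for r in unique if r[1] < cutoff]
test = [r for r in unique if r[1] >= cutoff]

def f1_score(y_true, y_pred):
    """Plain F1: harmonic mean of precision and recall."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A trivial "flag everything" model still earns a nonzero F1 here,
# which is why a stringent, realistic test split matters.
y_true = [label for _, _, label in test]
y_pred = [1] * len(test)
print(len(train), len(test), round(f1_score(y_true, y_pred), 2))
```

Splitting by commit date rather than at random is what prevents a model from being tested on functions that are near-duplicates of (or older than) its training examples.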
Author_xml – sequence: 1
  givenname: Yangruibo
  surname: Ding
  fullname: Ding, Yangruibo
  organization: Columbia University
– sequence: 2
  givenname: Yanjun
  surname: Fu
  fullname: Fu, Yanjun
  organization: University of Washington
– sequence: 3
  givenname: Omniyyah
  surname: Ibrahim
  fullname: Ibrahim, Omniyyah
  organization: King Abdulaziz City for Science and Technology
– sequence: 4
  givenname: Chawin
  surname: Sitawarin
  fullname: Sitawarin, Chawin
  organization: Google DeepMind
– sequence: 5
  givenname: Xinyun
  surname: Chen
  fullname: Chen, Xinyun
  organization: UC Berkeley
– sequence: 6
  givenname: Basel
  surname: Alomair
  fullname: Alomair, Basel
  organization: King Abdulaziz City for Science and Technology
– sequence: 7
  givenname: David
  surname: Wagner
  fullname: Wagner, David
  organization: UC Berkeley
– sequence: 8
  givenname: Baishakhi
  surname: Ray
  fullname: Ray, Baishakhi
  organization: Columbia University
– sequence: 9
  givenname: Yizheng
  surname: Chen
  fullname: Chen, Yizheng
  organization: University of Maryland
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ICSE55347.2025.00038
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798331505691
EISSN 1558-1225
EndPage 1741
ExternalDocumentID 11029911
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation
  grantid: 2229876,2154873,2221943,2313055,1845893,2107405
  funderid: 10.13039/100000001
ISICitedReferencesCount 5
IngestDate Wed Aug 27 01:40:09 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 13
ParticipantIDs ieee_primary_11029911
PublicationCentury 2000
PublicationDate 2025-April-26
PublicationDateYYYYMMDD 2025-04-26
PublicationDate_xml – month: 04
  year: 2025
  text: 2025-April-26
  day: 26
PublicationDecade 2020
PublicationTitle Proceedings / International Conference on Software Engineering
PublicationTitleAbbrev ICSE
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0006499
SourceID ieee
SourceType Publisher
StartPage 1729
SubjectTerms Accuracy
Benchmark testing
Codes
Data integrity
Detectors
Labeling
Measurement
Security
Software engineering
Training
Title Vulnerability Detection with Code Language Models: How Far are We?
URI https://ieeexplore.ieee.org/document/11029911