Vulnerability Detection with Code Language Models: How Far are We?

Bibliographic Details
Published in:	Proceedings / International Conference on Software Engineering, pp. 1729-1741
Main Authors:	Ding, Yangruibo; Fu, Yanjun; Ibrahim, Omniyyah; Sitawarin, Chawin; Chen, Xinyun; Alomair, Basel; Wagner, David; Ray, Baishakhi; Chen, Yizheng
Format:	Conference Proceeding
Language:	English
Published:	IEEE 26.04.2025
Subjects:	Accuracy; Benchmark testing; Codes; Data integrity; Detectors; Labeling; Measurement; Security; Software engineering; Training
ISSN:	1558-1225

Abstract	In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce Primevul, a new dataset for training and evaluating code LMs for vulnerability detection. Primevul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on Primevul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on Primevul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
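The abstract names de-duplication and chronological splitting as its defenses against data leakage but does not say how either is implemented. The Python sketch below is one plausible reading, not the authors' code: the `Sample` record, the whitespace-stripping `normalize` step, the SHA-256 hashing, and the `train_frac` parameter are all assumptions made for illustration.

```python
import hashlib
import re
from typing import NamedTuple


class Sample(NamedTuple):
    """Hypothetical record layout; the real dataset schema may differ."""
    code: str         # function source text
    label: int        # 1 = vulnerable, 0 = benign
    commit_date: str  # ISO date string, so string order equals time order


def normalize(code: str) -> str:
    """Crude normalization (an assumption): drop all whitespace so that
    formatting-only variants of a function hash to the same value."""
    return re.sub(r"\s+", "", code)


def deduplicate(samples):
    """Keep only the first sample seen for each normalized-code hash."""
    seen = set()
    unique = []
    for s in samples:
        digest = hashlib.sha256(normalize(s.code).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique


def chronological_split(samples, train_frac=0.75):
    """Train on the oldest samples and hold out the newest ones, so the
    model is never evaluated on code that predates its training data."""
    ordered = sorted(samples, key=lambda s: s.commit_date)
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Toy data: the second entry is a reformatted copy of the first.
samples = [
    Sample("int f(){return 0;}", 0, "2020-01-01"),
    Sample("int f() { return 0; }", 0, "2022-05-01"),
    Sample("void g(char*p){strcpy(b,p);}", 1, "2021-06-01"),
    Sample("int h(int x){return x+1;}", 0, "2019-03-01"),
    Sample("void k(){free(p);free(p);}", 1, "2023-02-01"),
]

unique = deduplicate(samples)
train, held_out = chronological_split(unique)
```

Splitting by date rather than at random means a near-copy of a held-out function cannot appear in training with an earlier or later timestamp shuffled in, which is the leakage the abstract describes.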
Affiliations:	Ding, Yangruibo (Columbia University); Fu, Yanjun (University of Washington); Ibrahim, Omniyyah (King Abdulaziz City for Science and Technology); Sitawarin, Chawin (Google DeepMind); Chen, Xinyun (UC Berkeley); Alomair, Basel (King Abdulaziz City for Science and Technology); Wagner, David (UC Berkeley); Ray, Baishakhi (Columbia University); Chen, Yizheng (University of Maryland)
CODEN	IEEPAD
DOI	10.1109/ICSE55347.2025.00038
EISBN	9798331505691
Funding:	National Science Foundation (grants 2229876, 2154873, 2221943, 2313055, 1845893, 2107405)
URI	https://ieeexplore.ieee.org/document/11029911