Vulnerability Detection with Code Language Models: How Far are We?
| Published in: | Proceedings / International Conference on Software Engineering pp. 1729 - 1741 |
|---|---|
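| Main Authors: | Ding, Yangruibo; Fu, Yanjun; Ibrahim, Omniyyah; Sitawarin, Chawin; Chen, Xinyun; Alomair, Basel; Wagner, David; Ray, Baishakhi; Chen, Yizheng |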
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 26.04.2025 |
| ISSN: | 1558-1225 |
| Abstract | In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain. |
|---|---|
| Author | Fu, Yanjun; Alomair, Basel; Ray, Baishakhi; Sitawarin, Chawin; Chen, Yizheng; Wagner, David; Ibrahim, Omniyyah; Ding, Yangruibo; Chen, Xinyun |
| Author_xml | 1. Ding, Yangruibo (Columbia University); 2. Fu, Yanjun (University of Washington); 3. Ibrahim, Omniyyah (King Abdulaziz City for Science and Technology); 4. Sitawarin, Chawin (Google DeepMind); 5. Chen, Xinyun (UC Berkeley); 6. Alomair, Basel (King Abdulaziz City for Science and Technology); 7. Wagner, David (UC Berkeley); 8. Ray, Baishakhi (Columbia University); 9. Chen, Yizheng (University of Maryland) |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ICSE55347.2025.00038 |
| Discipline | Computer Science |
| EISBN | 9798331505691 |
| EISSN | 1558-1225 |
| EndPage | 1741 |
| ExternalDocumentID | 11029911 |
| Genre | orig-research |
| GrantInformation_xml | Funder: National Science Foundation (funder ID 10.13039/100000001); grant IDs: 2229876, 2154873, 2221943, 2313055, 1845893, 2107405 |
| ISICitedReferencesCount | 5 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 13 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-April-26 |
| PublicationDateYYYYMMDD | 2025-04-26 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings / International Conference on Software Engineering |
| PublicationTitleAbbrev | ICSE |
| PublicationYear | 2025 |
| Publisher | IEEE |
| StartPage | 1729 |
| SubjectTerms | Accuracy; Benchmark testing; Codes; Data integrity; Detectors; Labeling; Measurement; Security; Software engineering; Training |
| Title | Vulnerability Detection with Code Language Models: How Far are We? |
| URI | https://ieeexplore.ieee.org/document/11029911 |
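
The abstract mentions data de-duplication and chronological data splitting as ways to limit leakage between training and test data. The following is only a minimal illustrative sketch of those two ideas, not the authors' pipeline: the record fields `code` and `commit_date`, the whitespace-stripping normalization, the hash choice, and the `train_frac` parameter are all assumptions made for this example.

```python
import hashlib
from datetime import date


def normalize(code: str) -> str:
    # Assumed normalization: drop all whitespace so that copies differing
    # only in formatting produce the same hash.
    return "".join(code.split())


def deduplicate(samples):
    # Keep the first sample seen for each normalized-code hash.
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(normalize(sample["code"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def chronological_split(samples, train_frac=0.8):
    # Sort by date and cut once, so every training sample is at least
    # as old as every test sample.
    ordered = sorted(samples, key=lambda s: s["commit_date"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Hypothetical toy records, invented purely for illustration.
samples = [
    {"code": "int f(int a) { return a; }", "commit_date": date(2019, 5, 1)},
    {"code": "int f(int a)  {\n  return a;\n}", "commit_date": date(2021, 3, 9)},
    {"code": "void g(char *p) { free(p); }", "commit_date": date(2020, 1, 15)},
    {"code": "int h(void) { return 0; }", "commit_date": date(2018, 7, 4)},
    {"code": "void k(int n) { while (n--) {} }", "commit_date": date(2022, 11, 30)},
]

unique = deduplicate(samples)
train, test = chronological_split(unique, train_frac=0.75)
```

Splitting by time rather than at random means the model is evaluated only on samples newer than anything it was trained on, which is closer to how a detector would be used in practice.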