When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection
| Published in: | IEEE/ACM International Conference on Automated Software Engineering : [proceedings], pp. 345-357 |
|---|---|
| Main authors: | Wen, Xin-Cheng; Wang, Xinchen; Gao, Cuiyun; Wang, Shaohua; Liu, Yang; Gu, Zhaoquan |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 11 September 2023 |
| ISSN: | 2643-1572 |
| Abstract | Automated code vulnerability detection has gained increasing attention in recent years. Deep learning (DL)-based methods, which implicitly learn vulnerable code patterns, have proven effective in vulnerability detection. The performance of DL-based methods usually relies on the quantity and quality of labeled data. However, current labeled data are generally collected automatically, for example by crawling human-generated commits, making it hard to ensure label quality. Prior studies have demonstrated that non-vulnerable code (i.e., negative labels) tends to be unreliable in commonly used datasets, while vulnerable code (i.e., positive labels) is more reliable. Given the large amount of unlabeled data available in practice, it is worth exploring how to leverage positive data together with unlabeled data for more accurate vulnerability detection. In this paper, we focus on the Positive and Unlabeled (PU) learning problem for vulnerability detection and propose a novel model named PILOT, i.e., Positive and unlabeled Learning mOdel for vulnerability deTection. PILOT learns only from positive and unlabeled data. It contains two main modules: (1) a distance-aware label selection module, which generates pseudo-labels for selected unlabeled data using an inter-class distance prototype and progressive fine-tuning; and (2) a mixed-supervision representation learning module, which further alleviates the influence of noise and makes the representations more discriminative. Extensive experiments on real-world vulnerability datasets evaluate the effectiveness of PILOT. The experimental results show that PILOT outperforms popular weakly supervised methods by 2.78%-18.93% in the PU learning setting. Compared with state-of-the-art methods, PILOT also improves the F1 score by 1.34%-12.46% in the supervised setting. In addition, manual checking confirms that PILOT identifies 23 mislabeled samples in the FFMPeg+Qemu dataset in the PU learning setting. |
|---|---|
| Author | Wen, Xin-Cheng; Wang, Xinchen; Liu, Yang; Wang, Shaohua; Gao, Cuiyun; Gu, Zhaoquan |
| Author_xml | – sequence: 1 givenname: Xin-Cheng surname: Wen fullname: Wen, Xin-Cheng email: xiamenwxc@foxmail.com organization: School of Computer Science and Technology, Harbin Institute of Technology,Shenzhen,China – sequence: 2 givenname: Xinchen surname: Wang fullname: Wang, Xinchen email: 200111115@stu.hit.edu.cn organization: School of Computer Science and Technology, Harbin Institute of Technology,Shenzhen,China – sequence: 3 givenname: Cuiyun surname: Gao fullname: Gao, Cuiyun email: gaocuiyun@hit.edu.cn organization: School of Computer Science and Technology, Harbin Institute of Technology,Shenzhen,China – sequence: 4 givenname: Shaohua surname: Wang fullname: Wang, Shaohua email: davidshwang@ieee.org organization: Central University of Finance and Economics,China – sequence: 5 givenname: Yang surname: Liu fullname: Liu, Yang email: yangliu@ntu.edu.sg organization: School of Computer Science and Engineering, Nanyang Technological University,China – sequence: 6 givenname: Zhaoquan surname: Gu fullname: Gu, Zhaoquan email: guzhaoquan@hit.edu.cn organization: School of Computer Science and Technology, Harbin Institute of Technology,Shenzhen,China |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/ASE56229.2023.00144 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science |
| EISBN | 9798350329964 |
| EISSN | 2643-1572 |
| EndPage | 357 |
| ExternalDocumentID | 10298363 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IF 6IH 6IK 6IL 6IM 6IN 6J9 AAJGR AAWTH ABLEC ACREN ADYOE ADZIZ AFYQB ALMA_UNASSIGNED_HOLDINGS AMTXH BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IPLJI M43 OCL RIE RIL |
| ID | FETCH-LOGICAL-a284t-81aadc0ce18cf4bcbc1eaca5199f4cd7f4a8e57781d048f42af92b2986d82fc83 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 14 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001103357200028&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:32:41 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-a284t-81aadc0ce18cf4bcbc1eaca5199f4cd7f4a8e57781d048f42af92b2986d82fc83 |
| PageCount | 13 |
| ParticipantIDs | ieee_primary_10298363 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-Sept.-11 |
| PublicationDateYYYYMMDD | 2023-09-11 |
| PublicationDate_xml | – month: 09 year: 2023 text: 2023-Sept.-11 day: 11 |
| PublicationDecade | 2020 |
| PublicationTitle | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] |
| PublicationTitleAbbrev | ASE |
| PublicationYear | 2023 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0051577 ssib057256115 |
| Score | 2.406248 |
| Snippet | Automated code vulnerability detection has gained increasing attention in recent years. The deep learning (DL)-based methods, which implicitly learn vulnerable... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 345 |
| SubjectTerms | Benchmark testing; Codes; Deep learning; Measurement; positive and unlabeled learning; Prototypes; Representation learning; Software vulnerability detection; source code representation; Source coding |
| Title | When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection |
| URI | https://ieeexplore.ieee.org/document/10298363 |
| WOSCitedRecordID | wos001103357200028&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
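
The abstract describes a distance-aware label selection module that assigns pseudo-labels to selected unlabeled samples using class prototypes. The record gives no algorithmic detail, so the following is only a minimal sketch of a generic prototype-distance heuristic for PU learning, not PILOT's actual method. The function names, the `neg_fraction` and `margin` parameters, and the use of plain Euclidean distance on precomputed embeddings are all assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_pseudo_labels(positives, unlabeled, neg_fraction=0.3, margin=0.5):
    """Assign pseudo-labels to a subset of unlabeled embeddings.

    Hypothetical heuristic (not taken from the paper):
      1. Build a positive prototype from the labeled vulnerable samples.
      2. Treat the unlabeled samples farthest from it as seed negatives.
      3. Build a negative prototype from those seeds.
      4. Label a remaining sample only when it is clearly closer to one
         prototype than to the other; ambiguous samples stay unlabeled.

    Returns a dict {index_in_unlabeled: label}, 1 = vulnerable, 0 = not.
    """
    pos_proto = centroid(positives)

    # Rank unlabeled samples by distance to the positive prototype, farthest first.
    ranked = sorted(
        range(len(unlabeled)),
        key=lambda i: math.dist(unlabeled[i], pos_proto),
        reverse=True,
    )
    k = max(1, int(len(unlabeled) * neg_fraction))
    seed_negatives = ranked[:k]
    neg_proto = centroid([unlabeled[i] for i in seed_negatives])

    labels = {i: 0 for i in seed_negatives}
    for i in ranked[k:]:
        d_pos = math.dist(unlabeled[i], pos_proto)
        d_neg = math.dist(unlabeled[i], neg_proto)
        if d_pos < margin * d_neg:
            labels[i] = 1  # clearly closer to the positive prototype
        elif d_neg < margin * d_pos:
            labels[i] = 0  # clearly closer to the negative prototype
    return labels


if __name__ == "__main__":
    # Toy 2-D embeddings; real inputs would come from a code encoder.
    positives = [[1.0, 1.0], [1.2, 0.8]]
    unlabeled = [[1.1, 0.9], [5.0, 5.0], [5.2, 4.8], [3.0, 3.0]]
    print(select_pseudo_labels(positives, unlabeled, neg_fraction=0.5))
```

In this toy run the sample lying midway between the two prototypes receives no pseudo-label. Leaving ambiguous samples unlabeled is one simple way to limit label noise, and a progressive scheme could relax `margin` over successive fine-tuning rounds.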