When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection

Bibliographic Details
Published in: IEEE/ACM International Conference on Automated Software Engineering: [proceedings], pp. 345-357
Main Authors: Wen, Xin-Cheng; Wang, Xinchen; Gao, Cuiyun; Wang, Shaohua; Liu, Yang; Gu, Zhaoquan
Format: Conference Proceeding
Language: English
Published: IEEE, 11.09.2023
Subjects:
ISSN: 2643-1572
Abstract Automated code vulnerability detection has gained increasing attention in recent years. Deep learning (DL)-based methods, which implicitly learn vulnerable code patterns, have proven effective in vulnerability detection. The performance of DL-based methods usually relies on the quantity and quality of labeled data. However, current labeled data are generally collected automatically, for example by crawling human-generated commits, which makes it hard to ensure the quality of the labels. Prior studies have demonstrated that non-vulnerable code (i.e., negative labels) tends to be unreliable in commonly used datasets, while vulnerable code (i.e., positive labels) is more reliable. Given the large amount of unlabeled data available in practice, it is worth exploring how to leverage positive data and large amounts of unlabeled data for more accurate vulnerability detection. In this paper, we focus on the Positive and Unlabeled (PU) learning problem for vulnerability detection and propose a novel model named PILOT, i.e., Positive and unlabeled Learning mOdel for vulnerability deTection. PILOT learns only from positive and unlabeled data. It mainly contains two modules: (1) a distance-aware label selection module, which generates pseudo-labels for selected unlabeled data and involves the inter-class distance prototype and progressive fine-tuning; (2) a mixed-supervision representation learning module, which further alleviates the influence of noise and enhances the discrimination of representations. Extensive experiments are conducted to evaluate the effectiveness of PILOT on real-world vulnerability datasets. The experimental results show that PILOT outperforms popular weakly supervised methods by 2.78%-18.93% in the PU learning setting. Compared with state-of-the-art methods, PILOT also improves performance by 1.34%-12.46% in F1 score in the supervised setting. In addition, PILOT identifies 23 mislabeled samples in the FFmpeg+Qemu dataset in the PU learning setting, based on manual checking.
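The idea of selecting pseudo-labels by distance, as mentioned in the abstract, can be illustrated with a minimal sketch. This is not PILOT's implementation: the abstract does not specify the algorithm, so everything below is an assumption made for illustration, including the function names `mean_vector` and `select_pseudo_labels`, the parameters `n_pos` and `n_neg`, the use of a single mean vector as the positive-class prototype, and the choice of Euclidean distance. Unlabeled samples closest to the prototype are treated as pseudo-positive and those farthest away as pseudo-negative.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def select_pseudo_labels(pos_emb, unl_emb, n_pos, n_neg):
    """Return (pseudo_pos, pseudo_neg) index lists into unl_emb.

    Hypothetical helper, not the paper's method: ranks unlabeled
    embeddings by Euclidean distance to the mean positive embedding.
    """
    if n_pos + n_neg > len(unl_emb):
        raise ValueError("n_pos + n_neg exceeds the number of unlabeled samples")
    # Assumed prototype: mean embedding of the known positive samples.
    prototype = mean_vector(pos_emb)
    # Distance from every unlabeled sample to the prototype.
    dist = [math.dist(u, prototype) for u in unl_emb]
    # Indices of unlabeled samples, closest to the prototype first.
    order = sorted(range(len(unl_emb)), key=lambda i: dist[i])
    pseudo_pos = order[:n_pos]               # nearest: pseudo-positive
    pseudo_neg = order[len(order) - n_neg:]  # farthest: pseudo-negative
    return pseudo_pos, pseudo_neg


# Toy 2-D embeddings; the values are made up for the example.
positives = [[1.0, 1.0], [1.0, 3.0]]
unlabeled = [[1.0, 2.0], [5.0, 5.0], [0.0, 2.0], [9.0, 9.0]]
pseudo_pos, pseudo_neg = select_pseudo_labels(positives, unlabeled, n_pos=1, n_neg=2)
```

In a progressive scheme, a model could be fine-tuned on the selected samples and the selection repeated with updated embeddings; how PILOT actually does this is described in the paper, not in this record.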
Authors and affiliations
1. Wen, Xin-Cheng (xiamenwxc@foxmail.com): School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
2. Wang, Xinchen (200111115@stu.hit.edu.cn): School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
3. Gao, Cuiyun (gaocuiyun@hit.edu.cn): School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
4. Wang, Shaohua (davidshwang@ieee.org): Central University of Finance and Economics, China
5. Liu, Yang (yangliu@ntu.edu.sg): School of Computer Science and Engineering, Nanyang Technological University, China
6. Gu, Zhaoquan (guzhaoquan@hit.edu.cn): School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ASE56229.2023.00144
Discipline Computer Science
EISBN 9798350329964
EISSN 2643-1572
EndPage 357
ExternalDocumentID 10298363
Genre orig-research
ISICitedReferencesCount 14
IsPeerReviewed false
IsScholarly true
Language English
PageCount 13
PublicationCentury 2000
PublicationDate 2023-Sept.-11
PublicationDateYYYYMMDD 2023-09-11
PublicationDate_xml – month: 09
  year: 2023
  text: 2023-Sept.-11
  day: 11
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 345
SubjectTerms Benchmark testing
Codes
Deep learning
Measurement
positive and unlabeled learning
Prototypes
Representation learning
Software vulnerability detection
source code representation
Source coding
Title When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection
URI https://ieeexplore.ieee.org/document/10298363