1+1>2: Integrating Deep Code Behaviors with Metadata Features for Malicious PyPI Package Detection

PyPI, the official package registry for Python, has seen a surge in the number of malicious package uploads in recent years. Prior studies have demonstrated the effectiveness of learning-based solutions in malicious package detection. However, manually-crafted expert rules are expensive and struggle...

Bibliographic details
Published in:	IEEE/ACM International Conference on Automated Software Engineering : [proceedings], pp. 1159-1170
Main authors:	Sun, Xiaobing; Gao, Xingan; Cao, Sicong; Bo, Lili; Wu, Xiaoxue; Huang, Kaifeng
Format:	Conference paper
Language:	English
Published:	ACM, 27 October 2024
Subjects:	BERT; Codes; Feature extraction; Malicious Packages; Malware; Metadata; Open-Source Software; Privacy; PyPI; Python; Security; Semantics; Software engineering; Surges
ISSN:	2643-1572

Abstract	PyPI, the official package registry for Python, has seen a surge in the number of malicious package uploads in recent years. Prior studies have demonstrated the effectiveness of learning-based solutions in malicious package detection. However, manually crafted expert rules are expensive and struggle to keep pace with rapidly evolving malicious behaviors, while deep features automatically extracted from code are still inaccurate in certain cases. To mitigate these issues, we propose Ea4mp, a novel approach that integrates deep code behaviors with metadata features to detect malicious PyPI packages. Specifically, Ea4mp extracts code behavior sequences from all script files and fine-tunes a BERT model to learn deep semantic features of malicious code. In addition, we recognize the value of metadata information and construct an ensemble classifier that combines the strengths of deep code behavior features and metadata features for more effective detection. We evaluated Ea4mp against three state-of-the-art baselines on a newly constructed dataset. The experimental results show that Ea4mp improves precision by 6.9%-24.6% and recall by 10.5%-18.4%. With Ea4mp, we identified 119 previously unknown malicious packages from a pool of 46,573 newly uploaded packages over a three-week period, and 82 of them have since been removed by PyPI. CCS Concepts: Security and privacy → Malware and its mitigation.
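The abstract names two ingredients, code behavior sequences and an ensemble over code and metadata signals, without saying how either is computed. The sketch below is only an illustration under stated assumptions, not Ea4mp's implementation: it assumes a "behavior sequence" can be approximated by the dotted names of calls in source order, and that the ensemble can be approximated by a weighted average of two probabilities. The names `call_sequence`, `soft_vote`, and `w_code` are hypothetical.

```python
import ast


def _dotted_name(node):
    """Rebuild a dotted name such as 'os.system' from a call target, or return None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call on a subscript or on another call's result


def call_sequence(source: str) -> list[str]:
    """Return the dotted names of all calls, ordered by position in the source.

    Assumption: this stands in for a 'code behavior sequence'; the paper's
    actual extraction procedure is not described in the abstract.
    """
    calls = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name is not None:
                calls.append((node.lineno, node.col_offset, name))
    return [name for _, _, name in sorted(calls)]


def soft_vote(p_code: float, p_meta: float, w_code: float = 0.5) -> float:
    """Weighted average of two maliciousness probabilities (hypothetical ensemble)."""
    return w_code * p_code + (1.0 - w_code) * p_meta


sample = (
    "import os, base64\n"
    "payload = base64.b64decode('aGk=')\n"
    "os.system(payload.decode())\n"
)
sequence = call_sequence(sample)
combined = soft_vote(0.9, 0.5)
```

In a real pipeline the sequence would be tokenized and passed to a fine-tuned language model, and the metadata probability would come from a separate classifier; both are omitted here because the abstract gives no details on them.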
Author	Wu, Xiaoxue; Sun, Xiaobing; Bo, Lili; Gao, Xingan; Cao, Sicong; Huang, Kaifeng
Author_xml	1. Sun, Xiaobing (xbsun@yzu.edu.cn), Yangzhou University, Yangzhou, China; 2. Gao, Xingan (MX120230566@stu.yzu.edu.cn), Yangzhou University, Yangzhou, China; 3. Cao, Sicong (DX120210088@yzu.edu.cn), Yangzhou University, Yangzhou, China; 4. Bo, Lili (lilibo@yzu.edu.cn), Yangzhou University, Yangzhou, China; 5. Wu, Xiaoxue (xiaoxuewu@yzu.edu.cn), Yangzhou University, Yangzhou, China; 6. Huang, Kaifeng (kaifengh@tongji.edu.cn), Tongji University, Shanghai, China
CODEN	IEEPAD
ContentType	Conference Proceeding
DOI	10.1145/3691620.3695493
Discipline	Computer Science
EISBN	9798400712487
EISSN	2643-1572
EndPage	1170
ExternalDocumentID	10764827
Genre	orig-research
GrantInformation_xml	China Scholarship Council (funder ID 10.13039/501100004543); Six Talent Peaks Project in Jiangsu Province (funder ID 10.13039/501100010014); Yangzhou University (funder ID 10.13039/501100007062); National Natural Science Foundation of China (funder ID 10.13039/501100001809); Nanjing University (funder ID 10.13039/501100008048)
IsPeerReviewed	false
IsScholarly	true
Language	English
PageCount	12
PublicationCentury	2000
PublicationDate	2024-Oct.-27
PublicationDateYYYYMMDD	2024-10-27
PublicationDate_xml	– month: 10 year: 2024 text: 2024-Oct.-27 day: 27
PublicationDecade	2020
PublicationTitle	IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev	ASE
PublicationYear	2024
Publisher	ACM
Publisher_xml	– name: ACM
StartPage	1159
URI	https://ieeexplore.ieee.org/document/10764827