1+1>2: Integrating Deep Code Behaviors with Metadata Features for Malicious PyPI Package Detection


Bibliographic details
Published in: IEEE/ACM International Conference on Automated Software Engineering : [proceedings], pp. 1159-1170
Main authors: Sun, Xiaobing; Gao, Xingan; Cao, Sicong; Bo, Lili; Wu, Xiaoxue; Huang, Kaifeng
Format: Conference paper
Language: English
Published: ACM, 27 October 2024
ISSN:2643-1572
Abstract PyPI, the official package registry for Python, has seen a surge in the number of malicious package uploads in recent years. Prior studies have demonstrated the effectiveness of learning-based solutions in malicious package detection. However, manually crafted expert rules are expensive and struggle to keep pace with rapidly evolving malicious behaviors, while deep features automatically extracted from code are still inaccurate in certain cases. To mitigate these issues, we propose Ea4mp, a novel approach that integrates deep code behaviors with metadata features to detect malicious PyPI packages. Specifically, Ea4mp extracts code behavior sequences from all script files and fine-tunes a BERT model to learn deep semantic features of malicious code. In addition, we recognize the value of metadata information and construct an ensemble classifier that combines the strengths of deep code behavior features and metadata features for more effective detection. We evaluated Ea4mp against three state-of-the-art baselines on a newly constructed dataset. The experimental results show that Ea4mp improves precision by 6.9%-24.6% and recall by 10.5%-18.4%. With Ea4mp, we identified 119 previously unknown malicious packages from a pool of 46,573 newly uploaded packages over a three-week period, and 82 of them have been removed by PyPI officials.
CCS Concepts Security and privacy → Malware and its mitigation.
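The abstract describes combining a code-behavior model with a metadata model in an ensemble, but this record does not say how the two are combined. The following is only a minimal sketch of one common option, a weighted soft vote over two probabilities. The names `PackageScores`, `ensemble_prob`, `is_malicious`, the weight, and the threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PackageScores:
    """Hypothetical per-package outputs of two independent classifiers."""
    name: str
    code_prob: float      # assumed P(malicious) from a code-behavior model
    metadata_prob: float  # assumed P(malicious) from a metadata-feature model


def ensemble_prob(scores: PackageScores, code_weight: float = 0.5) -> float:
    """Weighted soft vote over the two probabilities (an assumed rule)."""
    if not 0.0 <= code_weight <= 1.0:
        raise ValueError("code_weight must be in [0, 1]")
    return (code_weight * scores.code_prob
            + (1.0 - code_weight) * scores.metadata_prob)


def is_malicious(scores: PackageScores,
                 code_weight: float = 0.5,
                 threshold: float = 0.5) -> bool:
    """Flag the package when the combined probability reaches the threshold."""
    return ensemble_prob(scores, code_weight) >= threshold


if __name__ == "__main__":
    # Illustrative numbers only: the code model is unsure, the metadata
    # model is confident, and the combined score decides the label.
    pkg = PackageScores(name="example-pkg", code_prob=0.40, metadata_prob=0.90)
    print(pkg.name, ensemble_prob(pkg), is_malicious(pkg))
```

In this sketch a package that the code model alone would pass (0.40) is still flagged once the metadata signal is averaged in, which is the kind of complementarity the title's "1+1>2" alludes to. The actual ensemble in Ea4mp may differ.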
Author Wu, Xiaoxue
Sun, Xiaobing
Bo, Lili
Gao, Xingan
Cao, Sicong
Huang, Kaifeng
Author_xml – sequence: 1
  givenname: Xiaobing
  surname: Sun
  fullname: Sun, Xiaobing
  email: xbsun@yzu.edu.cn
  organization: Yangzhou University,Yangzhou,China
– sequence: 2
  givenname: Xingan
  surname: Gao
  fullname: Gao, Xingan
  email: MX120230566@stu.yzu.edu.cn
  organization: Yangzhou University,Yangzhou,China
– sequence: 3
  givenname: Sicong
  surname: Cao
  fullname: Cao, Sicong
  email: DX120210088@yzu.edu.cn
  organization: Yangzhou University,Yangzhou,China
– sequence: 4
  givenname: Lili
  surname: Bo
  fullname: Bo, Lili
  email: lilibo@yzu.edu.cn
  organization: Yangzhou University,Yangzhou,China
– sequence: 5
  givenname: Xiaoxue
  surname: Wu
  fullname: Wu, Xiaoxue
  email: xiaoxuewu@yzu.edu.cn
  organization: Yangzhou University,Yangzhou,China
– sequence: 6
  givenname: Kaifeng
  surname: Huang
  fullname: Huang, Kaifeng
  email: kaifengh@tongji.edu.cn
  organization: Tongji University,Shanghai,China
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1145/3691620.3695493
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798400712487
EISSN 2643-1572
EndPage 1170
ExternalDocumentID 10764827
Genre orig-research
GrantInformation_xml – fundername: China Scholarship Council
  funderid: 10.13039/501100004543
– fundername: Six Talent Peaks Project in Jiangsu Province
  funderid: 10.13039/501100010014
– fundername: Yangzhou University
  funderid: 10.13039/501100007062
– fundername: National Natural Science Foundation of China
  funderid: 10.13039/501100001809
– fundername: Nanjing University
  funderid: 10.13039/501100008048
ID FETCH-LOGICAL-a163t-60b4f39d3bd43f324e9ec71c983ce180261ee0ce58bb7cc3e62e79e8842980913
IEDL.DBID RIE
IngestDate Wed Jan 15 06:20:43 EST 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a163t-60b4f39d3bd43f324e9ec71c983ce180261ee0ce58bb7cc3e62e79e8842980913
PageCount 12
ParticipantIDs ieee_primary_10764827
PublicationCentury 2000
PublicationDate 2024-Oct.-27
PublicationDateYYYYMMDD 2024-10-27
PublicationDate_xml – month: 10
  year: 2024
  text: 2024-Oct.-27
  day: 27
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2024
Publisher ACM
Publisher_xml – name: ACM
SSID ssib057256116
ssj0051577
Score 2.2994587
SourceID ieee
SourceType Publisher
StartPage 1159
SubjectTerms BERT
Codes
Feature extraction
Malicious Packages
Malware
Metadata
Open-Source Software
Privacy
PyPI
Python
Security
Semantics
Software engineering
Surges
Title 1+1>2: Integrating Deep Code Behaviors with Metadata Features for Malicious PyPI Package Detection
URI https://ieeexplore.ieee.org/document/10764827
hasFullText 1
inHoldings 1
isFullTextHit
isPrint