1+1>2: Integrating Deep Code Behaviors with Metadata Features for Malicious PyPI Package Detection
PyPI, the official package registry for Python, has seen a surge in the number of malicious package uploads in recent years. Prior studies have demonstrated the effectiveness of learning-based solutions in malicious package detection. However, manually-crafted expert rules are expensive and struggle...
| Published in: | IEEE/ACM International Conference on Automated Software Engineering : [proceedings], pp. 1159-1170 |
|---|---|
| Main authors: | Sun, Xiaobing; Gao, Xingan; Cao, Sicong; Bo, Lili; Wu, Xiaoxue; Huang, Kaifeng |
| Format: | Conference paper |
| Language: | English |
| Published: | ACM, 27.10.2024 |
| ISSN: | 2643-1572 |
| Abstract | PyPI, the official package registry for Python, has seen a surge in the number of malicious package uploads in recent years. Prior studies have demonstrated the effectiveness of learning-based solutions in malicious package detection. However, manually crafted expert rules are expensive and struggle to keep pace with rapidly evolving malicious behaviors, while deep features automatically extracted from code are still inaccurate in certain cases. To mitigate these issues, we propose Ea4mp, a novel approach that integrates deep code behaviors with metadata features to detect malicious PyPI packages. Specifically, Ea4mp extracts code behavior sequences from all script files and fine-tunes a BERT model to learn deep semantic features of malicious code. In addition, we recognize the value of metadata information and construct an ensemble classifier that combines the strengths of deep code behavior features and metadata features for more effective detection. We evaluated Ea4mp against three state-of-the-art baselines on a newly constructed dataset. The experimental results show that Ea4mp improves precision by 6.9%-24.6% and recall by 10.5%-18.4%. With Ea4mp, we identified 119 previously unknown malicious packages from a pool of 46,573 newly uploaded packages over a three-week period, and 82 of them have been removed by PyPI officials. CCS Concepts: Security and privacy → Malware and its mitigation. |
|---|---|
| Author | Wu, Xiaoxue; Sun, Xiaobing; Bo, Lili; Gao, Xingan; Cao, Sicong; Huang, Kaifeng |
| Author_xml | 1. Sun, Xiaobing (xbsun@yzu.edu.cn), Yangzhou University, Yangzhou, China; 2. Gao, Xingan (MX120230566@stu.yzu.edu.cn), Yangzhou University, Yangzhou, China; 3. Cao, Sicong (DX120210088@yzu.edu.cn), Yangzhou University, Yangzhou, China; 4. Bo, Lili (lilibo@yzu.edu.cn), Yangzhou University, Yangzhou, China; 5. Wu, Xiaoxue (xiaoxuewu@yzu.edu.cn), Yangzhou University, Yangzhou, China; 6. Huang, Kaifeng (kaifengh@tongji.edu.cn), Tongji University, Shanghai, China |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1145/3691620.3695493 |
| Discipline | Computer Science |
| EISBN | 9798400712487 |
| EISSN | 2643-1572 |
| EndPage | 1170 |
| ExternalDocumentID | 10764827 |
| Genre | orig-research |
| GrantInformation_xml | China Scholarship Council (funder ID: 10.13039/501100004543); Six Talent Peaks Project in Jiangsu Province (funder ID: 10.13039/501100010014); Yangzhou University (funder ID: 10.13039/501100007062); National Natural Science Foundation of China (funder ID: 10.13039/501100001809); Nanjing University (funder ID: 10.13039/501100008048) |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 12 |
| ParticipantIDs | ieee_primary_10764827 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-Oct.-27 |
| PublicationDateYYYYMMDD | 2024-10-27 |
| PublicationDate_xml | – month: 10 year: 2024 text: 2024-Oct.-27 day: 27 |
| PublicationDecade | 2020 |
| PublicationTitle | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] |
| PublicationTitleAbbrev | ASE |
| PublicationYear | 2024 |
| Publisher | ACM |
| Publisher_xml | – name: ACM |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1159 |
| SubjectTerms | BERT; Codes; Feature extraction; Malicious Packages; Malware; Metadata; Open-Source Software; Privacy; PyPI; Python; Security; Semantics; Software engineering; Surges |
| URI | https://ieeexplore.ieee.org/document/10764827 |
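
The abstract describes an ensemble classifier that combines deep code behavior features with metadata features, but this record does not say how the two are combined. The following is a minimal sketch, assuming each model emits a probability that a package is malicious and that the ensemble is a weighted soft vote. The class `PackageScores`, the functions `ensemble_probability` and `is_malicious`, the weight, the threshold, and the example package name and scores are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PackageScores:
    """Hypothetical per-package outputs of two independently trained models."""
    name: str
    code_prob: float      # P(malicious) from a code-behavior model, e.g. a fine-tuned BERT
    metadata_prob: float  # P(malicious) from a model over metadata features


def ensemble_probability(scores: PackageScores, code_weight: float = 0.5) -> float:
    """Weighted soft vote of the two probabilities (an assumed combination rule)."""
    if not 0.0 <= code_weight <= 1.0:
        raise ValueError("code_weight must be in [0, 1]")
    return code_weight * scores.code_prob + (1.0 - code_weight) * scores.metadata_prob


def is_malicious(scores: PackageScores, threshold: float = 0.5,
                 code_weight: float = 0.5) -> bool:
    """Flag the package when the combined probability reaches the threshold."""
    return ensemble_probability(scores, code_weight) >= threshold


# Illustrative values only: the code model alone is unsure, the metadata model is not.
example = PackageScores(name="example-pkg", code_prob=0.4, metadata_prob=0.9)
print(example.name, ensemble_probability(example), is_malicious(example))
```

In this sketch a package that the code model alone would miss at a 0.5 threshold can still be flagged once the metadata signal is added, which is the kind of complementarity the title "1+1>2" alludes to. A trained meta-classifier over both feature sets would be another plausible design.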