Learning to Find Usages of Library Functions in Optimized Binaries
| Published in: | IEEE Transactions on Software Engineering, Vol. 48, no. 10, pp. 3862-3876 |
|---|---|
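| Main authors: | Ahmed, Toufique; Devanbu, Premkumar; Sawant, Anand Ashok |
| Format: | Journal Article |
| Language: | English |
| Publication details: | New York: IEEE, IEEE Computer Society, 1 October 2022 |
| Subjects: | Computer programming; Datasets; Deep learning; Labeling; Libraries; Malware; Open source software; Optimization; Reverse engineering; Software; Software modeling; Source code; Supervised learning; Theft; Training |
| ISSN: | 0098-5589, 1939-3520 |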
| Abstract | Much software, whether beneficent or malevolent, is distributed only as binaries, sans source code. Absent source code, understanding binaries' behavior can be quite challenging, especially when compiled under higher levels of compiler optimization. These optimizations can transform comprehensible, "natural" source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or guilty of intellectual property theft, are important and time-consuming tasks. There is a great deal of interest in tools to "decompile" binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central step in creating binaries is optimizing function calls, using steps such as inlining. Recovering these (possibly inlined) function calls from optimized binaries is an essential task that most state-of-the-art decompiler tools try to do but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering optimized function calls. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with actual function usages. We augment this large but limited labeled dataset with a pre-training step, which learns the decompiled code statistics from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in function call recovery, especially at higher levels of optimization. |
| Author | Sawant, Anand Ashok; Ahmed, Toufique; Devanbu, Premkumar |
| Author_xml | 1. Ahmed, Toufique (ORCID 0000-0002-4427-1350), tfahmed@ucdavis.edu, Department of Computer Science, University of California, Davis, CA, USA; 2. Devanbu, Premkumar, ptdevanbu@ucdavis.edu, Department of Computer Science, University of California, Davis, CA, USA; 3. Sawant, Anand Ashok (ORCID 0000-0002-5816-8020), asawant@ucdavis.edu, Department of Computer Science, University of California, Davis, CA, USA |
| CODEN | IESEDJ |
| CitedBy_id | crossref_primary_10_1109_TSE_2022_3212635 crossref_primary_10_1145_3765521 |
| ContentType | Journal Article |
| Copyright | Copyright IEEE Computer Society 2022 |
| DOI | 10.1109/TSE.2021.3106572 |
| Discipline | Computer Science |
| EISSN | 1939-3520 |
| EndPage | 3876 |
| ExternalDocumentID | 10_1109_TSE_2021_3106572 9520296 |
| Genre | orig-research |
| GrantInformation_xml | National Science Foundation (grant 1414172; funder ID 10.13039/501100008982); Dean's Distinguished Graduate Fellowship; Sandia National Laboratories (funder ID 10.13039/100006234) |
| ISICitedReferencesCount | 5 |
| ISSN | 0098-5589 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 10 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| ORCID | 0000-0002-4427-1350 0000-0002-5816-8020 |
| PageCount | 15 |
| PublicationDate | 2022-10-01 |
| PublicationPlace | New York |
| PublicationTitle | IEEE transactions on software engineering |
| PublicationTitleAbbrev | TSE |
| PublicationYear | 2022 |
| Publisher | IEEE; IEEE Computer Society |
| StartPage | 3862 |
| SubjectTerms | Computer programming; Datasets; Deep learning; Labeling; Libraries; Malware; Open source software; Optimization; Reverse engineering; Software; Software modeling; Source code; Supervised learning; Theft; Training |
| Title | Learning to Find Usages of Library Functions in Optimized Binaries |
| URI | https://ieeexplore.ieee.org/document/9520296 https://www.proquest.com/docview/2726116693 |
| Volume | 48 |