Learning to Find Usages of Library Functions in Optimized Binaries

Bibliographic Details
Published in:IEEE transactions on software engineering Vol. 48; no. 10; pp. 3862 - 3876
Main Authors: Ahmed, Toufique, Devanbu, Premkumar, Sawant, Anand Ashok
Format: Journal Article
Language:English
Published: New York IEEE 01.10.2022
IEEE Computer Society
ISSN:0098-5589, 1939-3520
Abstract: Much software, whether beneficent or malevolent, is distributed only as binaries, sans source code. Absent source code, understanding binaries' behavior can be quite challenging, especially when compiled under higher levels of compiler optimization. These optimizations can transform comprehensible, "natural" source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or guilty of intellectual property theft, is an important and time-consuming task. There is a great deal of interest in tools to "decompile" binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central step in creating binaries is optimizing function calls, using steps such as inlining. Recovering these (possibly inlined) function calls from optimized binaries is an essential task that most state-of-the-art decompiler tools attempt but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering optimized function calls. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with actual function usages. We augment this large but limited labeled dataset with a pre-training step, which learns the decompiled code statistics from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in function call recovery, especially at higher levels of optimization.
Authors:
Ahmed, Toufique (Department of Computer Science, University of California, Davis, CA, USA; tfahmed@ucdavis.edu; ORCID 0000-0002-4427-1350)
Devanbu, Premkumar (Department of Computer Science, University of California, Davis, CA, USA; ptdevanbu@ucdavis.edu)
Sawant, Anand Ashok (Department of Computer Science, University of California, Davis, CA, USA; asawant@ucdavis.edu; ORCID 0000-0002-5816-8020)
CODEN IESEDJ
Copyright Copyright IEEE Computer Society 2022
DOI 10.1109/TSE.2021.3106572
Discipline Computer Science
Genre orig-research
Funding: National Science Foundation (grant 1414172; funder ID 10.13039/501100008982); Dean's Distinguished Graduate Fellowship; Sandia National Laboratories (funder ID 10.13039/100006234)
IsPeerReviewed true
IsScholarly true
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
Subject terms: Computer programming; Datasets; deep learning; Labeling; Libraries; Malware; Open source software; Optimization; Reverse engineering; Software; software modeling; Source code; Supervised learning; Theft; Training
URI https://ieeexplore.ieee.org/document/9520296
https://www.proquest.com/docview/2726116693