Are We There Yet? Filling the Gap Between Binary Similarity Analysis and Binary Software Composition Analysis

Bibliographic Details
Title: Are We There Yet? Filling the Gap Between Binary Similarity Analysis and Binary Software Composition Analysis
Authors: Wang, Huaijin; Liu, Zhibo; Wang, Shuai; Wang, Ying; Tang, Qiyi; Nie, Sen; Wu, Shi
Publisher Information: Institute of Electrical and Electronics Engineers Inc.
Publication Year: 2024
Collection: The Hong Kong University of Science and Technology: HKUST Institutional Repository
Subject Terms: Binary code analysis, Binary similarity analysis, Software composition analysis, Software reuse, Vulnerability detection
Description: Software composition analysis (SCA) has attracted the attention of the industry and academic community in recent years. Given a piece of program source code, SCA facilitates extracting certain components from the input program and matching the extracted components with open-source software (OSS) libraries. Despite the prosperous development of SCA, binary SCA (BSCA) is highly challenging and still underdeveloped. The few available BSCA solutions are either closed source (for commercial usage) or suffer from low performance. Nevertheless, a related line of research, binary similarity analysis (BSA), which decides the similarity between two pieces of binary code, has been progressively developed in academia for decades. De facto BSA techniques, often based on deep learning, efficiently analyze large-scale executables with high accuracy. This study explores bridging the gap between state-of-the-art (SOTA) BSA and BSCA. We spent considerable manual effort building the first large real-world benchmark dataset, containing over 55 million lines of C/C++ code. Then, we establish our BSCA pipeline by extending and calibrating the SOTA SCA pipeline. In particular, we concretize the key procedure of BSCA, namely matching a binary component with OSS, using six SOTA BSA techniques. Evaluation using our benchmark dataset reveals that simply employing BSA in BSCA exhibits less desirable accuracy, as BSCA faces unique challenges. After inspecting the failed cases, we propose three enhancements whose hybrid usage improves the F1 score of BSCA by over 30% and outperforms SOTA commercial BSCA software. Our experiment on 1-day vulnerability detection demonstrates our BSCA framework's effectiveness. We also discuss several open challenges and potential solutions to augment BSCA solutions.
Document Type: conference object
Language: English
Relation: https://doi.org/10.1109/EuroSP60621.2024.00034; http://gateway.isiknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=LinksAMR&SrcApp=PARTNER_APP&DestLinkType=FullRecord&DestApp=WOS&KeyUT=001304430300026
DOI: 10.1109/EuroSP60621.2024.00034
Availability: http://repository.hkust.edu.hk/ir/Record/1783.1-147321
https://doi.org/10.1109/EuroSP60621.2024.00034
http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=&rft.volume=&rft.issue=&rft.date=2024&rft.spage=&rft.aulast=Wang&rft.aufirst=Huaijin&rft.atitle=Are+We+There+Yet%3F+Filling+the+Gap+Between+ML-Based+Binary+Similarity+Analysis+and+Binary+Software+Composition+Analysis&rft.title=
http://www.scopus.com/record/display.url?eid=2-s2.0-85203675956&origin=inward
http://gateway.isiknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=LinksAMR&SrcApp=PARTNER_APP&DestLinkType=FullRecord&DestApp=WOS&KeyUT=001304430300026
Accession Number: edsbas.A0305367
Database: BASE