Compiler Identification with Divisive Analysis and Support Vector Machine.

Gespeichert in:
Bibliographische Detailangaben
Titel: Compiler Identification with Divisive Analysis and Support Vector Machine.
Autoren: Liu, Changlan, Zhang, Yingsong, Zuo, Peng, Wang, Peng
Quelle: Symmetry (20738994); Jun2025, Vol. 17 Issue 6, p867, 25p
Schlagwörter: DISTRIBUTION (Probability theory), SUPPORT vector machines, REVERSE engineering, INFORMATION technology security, COMPUTER software development, COMPILERS (Computer programs)
Abstract: Compilers play a crucial role in software development, as most software must be compiled into binaries before release. Analyzing the compiler version from binary files is of great importance in software reverse engineering, maintenance, traceability, and information security. In this work, we propose a novel framework for compiler version identification. Firstly, we generated 1000 C language source codes using CSmith and subsequently compiled them into 16,000 binary files using 16 distinct versions of compilers. The symmetric distribution of the dataset among different compiler versions may ensure unbiased model training. Then, IDA Pro was used to decompile the binary files into assembly instruction sequences. From these sequences, we extracted frequency-based features via the Bag-of-Words (BOW) model and sequence-based features derived from the grey-level co-occurrence matrix (GLCM). Finally, we introduced a divide-and-conquer framework (DIANA-SVM) to effectively classify compiler versions. The experimental results demonstrate that traditional Support Vector Machine (SVM) models struggle to accurately identify compiler versions using compiled executable files. In contrast, DIANA-SVM's symmetric data separation approach enhances performance, achieving an accuracy of 94% (±0.375%). This framework enables precise identification of high-risk compiler versions, offering a reliable tool for software supply chain security. Theoretically, our GLCM-based sequence modeling and divide-and-conquer framework advance feature extraction methodologies for binary files, offering a scalable solution for similar classification tasks beyond compiler identification. [ABSTRACT FROM AUTHOR]
Copyright of Symmetry (20738994) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Datenbank: Complementary Index
Beschreibung
Abstract:Compilers play a crucial role in software development, as most software must be compiled into binaries before release. Analyzing the compiler version from binary files is of great importance in software reverse engineering, maintenance, traceability, and information security. In this work, we propose a novel framework for compiler version identification. Firstly, we generated 1000 C language source codes using CSmith and subsequently compiled them into 16,000 binary files using 16 distinct versions of compilers. The symmetric distribution of the dataset among different compiler versions may ensure unbiased model training. Then, IDA Pro was used to decompile the binary files into assembly instruction sequences. From these sequences, we extracted frequency-based features via the Bag-of-Words (BOW) model and sequence-based features derived from the grey-level co-occurrence matrix (GLCM). Finally, we introduced a divide-and-conquer framework (DIANA-SVM) to effectively classify compiler versions. The experimental results demonstrate that traditional Support Vector Machine (SVM) models struggle to accurately identify compiler versions using compiled executable files. In contrast, DIANA-SVM's symmetric data separation approach enhances performance, achieving an accuracy of 94% (±0.375%). This framework enables precise identification of high-risk compiler versions, offering a reliable tool for software supply chain security. Theoretically, our GLCM-based sequence modeling and divide-and-conquer framework advance feature extraction methodologies for binary files, offering a scalable solution for similar classification tasks beyond compiler identification. [ABSTRACT FROM AUTHOR]
ISSN:20738994
DOI:10.3390/sym17060867