CrossSimEmb: transformer-based embedding model for cross-compilation binary code similarity detection.

Bibliographic Details
Title: CrossSimEmb: transformer-based embedding model for cross-compilation binary code similarity detection.
Authors: Yu, Gaoqing (qing_yu@cuc.edu.cn); Huang, Wei (huangwei.me@cuc.edu.cn); Lyu, Jiuyang (lyujiuyang@cuc.edu.cn); Wang, Dongxia (wangdongxia@cuc.edu.cn); Cheng, Yixuan (yixuancheng@cuc.edu.cn); Sui, Aina (blockchaincyber@163.com)
Source: Journal of Supercomputing. Oct 2025, Vol. 81, Issue 15, p1-33. 33p.
Abstract: Transformer-based pre-trained models have achieved breakthrough progress in binary code similarity detection tasks by modeling disassembled code as textual sequences, enabling researchers to successfully transfer natural language processing methodologies to binary code analysis. However, existing approaches remain limited by their lack of multi-granularity semantic feature extraction capability and multi-task generalization capability. In this paper, we propose CrossSimEmb, a cross-compilation binary code semantic representation and metric model, which constructs a multi-task joint optimization framework encompassing four pre-training tasks and three fine-tuning tasks. Leveraging high-performance computing resources, our framework is trained on large-scale binary datasets across multiple architectures, enabling effective integration of disassembled text features, instruction-level contextual features, inter-block structural features, and architecture-specific characteristics. In the fine-tuning phase, a Siamese network architecture is implemented to generate function-level code embeddings, while a joint optimization strategy combining binary classification, distance-based metric learning, and multi-label classification is introduced for downstream tasks, effectively improving the model's generalization capability and task adaptability in cross-compilation scenarios. Experimental results, conducted on HPC clusters, demonstrate that CrossSimEmb outperforms state-of-the-art solutions in both binary code similarity detection and outlier detection tasks: achieving 80.34% average accuracy and 0.901 AUC for binary code similarity detection, and attaining 0.961 MCC and 0.981 G-Means for outlier detection. These findings highlight both the effectiveness of our approach and the essential role of HPC resources in enabling large-scale training and evaluation. [ABSTRACT FROM AUTHOR]
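
To make the fine-tuning setup described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch (not taken from the paper): a small Siamese Transformer encoder that pools token states into function-level embeddings, combined with a joint loss over pairwise binary classification, contrastive distance-based metric learning, and multi-label classification. All layer sizes, names, and the specific contrastive formulation are illustrative assumptions.

```python
# Hypothetical sketch only -- not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseCodeEmbedder(nn.Module):
    def __init__(self, vocab_size=30000, dim=256, num_tags=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.cls_head = nn.Linear(2 * dim, 1)       # similar / dissimilar pair
        self.tag_head = nn.Linear(dim, num_tags)    # multi-label tags (e.g. arch/compiler)

    def encode(self, tokens):
        # Mean-pool token states into a single function-level embedding.
        h = self.encoder(self.embed(tokens))
        return h.mean(dim=1)

    def forward(self, tokens_a, tokens_b):
        za, zb = self.encode(tokens_a), self.encode(tokens_b)
        sim_logit = self.cls_head(torch.cat([za, zb], dim=-1)).squeeze(-1)
        tag_logits = self.tag_head(za)              # tags predicted for the anchor function
        return za, zb, sim_logit, tag_logits

def joint_loss(za, zb, sim_logit, tag_logits, is_similar, tags, margin=1.0):
    # 1) binary classification on the pair
    l_cls = F.binary_cross_entropy_with_logits(sim_logit, is_similar)
    # 2) distance-based metric learning (contrastive loss on embedding distance)
    dist = F.pairwise_distance(za, zb)
    l_metric = (is_similar * dist.pow(2)
                + (1 - is_similar) * F.relu(margin - dist).pow(2)).mean()
    # 3) multi-label classification
    l_multi = F.binary_cross_entropy_with_logits(tag_logits, tags)
    return l_cls + l_metric + l_multi
```

A training step under these assumptions would tokenize a pair of disassembled functions, run them through the shared encoder, and backpropagate the summed joint loss; at inference time only `encode` is needed to produce embeddings for similarity search.
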
Database: Academic Search Index
ISSN: 0920-8542
DOI: 10.1007/s11227-025-07895-3