HexT5: Unified Pre-Training for Stripped Binary Code Information Inference

Bibliographic Details
Published in:	IEEE/ACM International Conference on Automated Software Engineering : [proceedings], pp. 774-786
Main authors:	Xiong, Jiaqi; Chen, Guoqiang; Chen, Kejiang; Gao, Han; Cheng, Shaoyin; Zhang, Weiming
Format:	Conference paper
Language:	English
Published:	IEEE, 11 September 2023
Subjects:	Binary codes; Binary Diffing; Computer languages; Data mining; Deep Learning; Information Inference; Natural languages; Object recognition; Programming Language Model; Reverse Engineering; Semantics; Task analysis
ISSN:	2643-1572

Description
Summary:	Decompilation is a widely used process for reverse engineers to significantly enhance code readability by lifting assembly code to a higher-level C-like language, pseudo-code. Nevertheless, the process of compilation and stripping irreversibly discards high-level semantic information that is crucial to code comprehension, such as comments, identifier names, and types. Existing approaches typically recover only one type of information, making them suboptimal for semantic inference. In this paper, we treat pseudo-code as a special programming language, then present a unified pre-trained model, HexT5, that is trained on vast amounts of natural language comments, source identifiers, and pseudo-code using novel pseudo-code-based pre-training objectives. We fine-tune HexT5 on various downstream tasks, including code summarization, variable name recovery, function name recovery, and similarity detection. Comprehensive experiments show that HexT5 achieves state-of-the-art performance on four downstream tasks, and it demonstrates the robust effectiveness and generalizability of HexT5 for binary-related tasks.
DOI:	10.1109/ASE56229.2023.00099