Neural Machine Translation for Recovering ASTs from Binaries
| Published in: | 2023 IEEE 3rd International Conference on Software Engineering and Artificial Intelligence (SEAI), pp. 80-85 |
|---|---|
| Main authors: | , , |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 16.06.2023 |
| Subjects: | |
| Online access: | Get full text |
| Summary: | Recovering higher-level abstractions of source code from binaries is an important task underlying malware identification, program verification, debugging, program comparison, vulnerability detection, and helping subject matter experts understand compiled code. Existing approaches to extracting higher-level structures from lower-level binary code rely on hand-crafted rules and generally require great time and effort from domain experts to design and implement. We present Binary2AST, a framework for generating a structured representation of binary code in the form of an abstract syntax tree (AST) using neural machine translation (NMT). We use the Ghidra binary analysis tool to extract assembly instructions from binaries. A tokenized version of these instructions is then translated by our NMT system into a sequence of symbols that represent an AST. The NMT framework uses deep neural network models that can require large numbers of training examples. To address this, we have developed a C source code generator for a restricted subset of the C language, from which we can sample an arbitrary number of syntactically correct C source code files that in turn can be used to create a parallel data set suitable for NMT training. We evaluate several variant NMT models on their ability to recover AST representations of the original source code from compiled binaries, where the best-performing attention-based model achieves a BLEU score of 0.99 on our corpus. |
| DOI: | 10.1109/SEAI59139.2023.10217602 |
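
The summary above describes translating tokenized assembly into a sequence of symbols that represent an AST. As an illustration only (not the authors' code), the sketch below shows how a C AST could be linearized into such a target-side symbol sequence; the use of pycparser, the bracketed pre-order scheme, and the sample program are assumptions made for the example.

```python
# Illustrative sketch only: linearize a C AST into a flat symbol sequence,
# similar in spirit to the target side of the assembly->AST parallel corpus
# described in the summary. pycparser and this bracketing scheme are
# assumptions, not the paper's tooling.
from pycparser import c_parser


def linearize(node):
    """Pre-order traversal emitting '(NodeType ... )' bracketed symbols."""
    symbols = ["(" + type(node).__name__]
    for _, child in node.children():
        symbols.extend(linearize(child))
    symbols.append(")")
    return symbols


if __name__ == "__main__":
    # Tiny program in the spirit of the restricted C subset mentioned in the
    # summary; the actual grammar used by the authors is not specified here.
    src = "int main(void) { int x = 1 + 2; return x; }"
    ast = c_parser.CParser().parse(src)
    print(" ".join(linearize(ast)))
```

A symbol sequence like this could be paired with the tokenized Ghidra disassembly of the corresponding compiled binary to form one training example for a sequence-to-sequence NMT model such as the attention-based variant the summary reports.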