LLM-based program analysis for source codes, abstract syntax trees and webassembly instructions
| Published in: | Cluster computing Vol. 28; no. 14; p. 892 |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | New York: Springer US, 01.11.2025; Springer Nature B.V |
| ISSN: | 1386-7857, 1573-7543 |
| Summary: | The advancement of Web3.0 technology has created an urgent need to ensure the safety and reliability of software systems. Program analysis, a crucial aspect of software security research, needs a unified solution for programs written in different languages. Previous studies emphasize the necessity of capturing structured features from abstract syntax trees (ASTs), commonly holding that plain text and compiled binary instructions are challenging to represent and identify. This paper proposes three methods for extracting structural information to demonstrate that the underlying principles of source code and instructions align with those of an AST: (1) embedding program instruction files directly into natural language, (2) embedding formatted source code into natural language, and (3) embedding misformatted, non-compilable source code into natural language. We train large language models (LLMs) to distinguish defective from non-defective WebAssembly code. Experimental results demonstrate that program instruction analysis surpasses traditional techniques, achieving state-of-the-art accuracy exceeding 98.1%. Our study also suggests a practical approach to plain-text embedding using a 7.65-billion-parameter LLM. Interestingly, for misformatted source code, which is readable to humans but cannot be compiled, accuracy remains above 98.63%. This paper not only introduces novel instruction and plain-text embedding approaches for future program security analysis, but also provides new insights for subsequent research on the three forms of program analysis: plain text, ASTs, and instructions. |
|---|---|
| DOI: | 10.1007/s10586-025-05557-w |