LLM-based program analysis for source codes, abstract syntax trees and webassembly instructions

Bibliographic details
Published in: Cluster computing, Volume 28; Issue 14; p. 892
Main authors: Deng, Liangjun; Zhong, Qi; Qiu, Yao; Chen, Jingxue; Lei, Hang; Yang, Shunkun; Zhou, Liming; Cheng, Hongyuan
Format: Journal Article
Language: English
Published: New York, Springer US, 01.11.2025
Springer Nature B.V
ISSN:1386-7857, 1573-7543
Description
Summary: The advancement of Web3.0 technology has created an urgent need to ensure the safety and reliability of software systems. Program analysis, a crucial aspect of software security research, needs a unified solution for programs written in different languages. Moreover, previous studies emphasize the necessity of capturing structured features from abstract syntax trees (ASTs), commonly assuming that plain text and compiled binary instructions are difficult to represent and identify. This paper proposes three methods for extracting structural information to demonstrate that the underlying principles of source code and instructions align with those of an abstract syntax tree. These methods are (1) embedding program instruction files directly into natural language, (2) embedding formatted source code into natural language, and (3) embedding misformatted, non-compilable source code into natural language. We train large language models (LLMs) to classify WebAssembly programs as defective or non-defective. Experimental results demonstrate that program instruction analysis surpasses traditional techniques, achieving state-of-the-art accuracy exceeding 98.1%. Our study also suggests a practical approach to plain text embedding using a 7.65-billion-parameter LLM. Interestingly, for misformatted source code, which is readable to humans but cannot be compiled, the accuracy remains above 98.63%. This paper not only introduces novel instruction and plain text embedding approaches for future program security analysis, but also provides new insights for subsequent research on the three forms of program analysis: plain text, ASTs, and instructions.
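Illustration: the three input forms named in the summary can be pictured with a small Python sketch. This is only a rough illustration under stated assumptions, not the authors' pipeline: the prompt wording, the helper names (`misformat`, `compiles`, `build_inputs`), and the toy samples are all hypothetical, and a Python function stands in for the source code because its indentation rules make "readable but non-compilable" easy to show.

```python
# Illustrative sketch only: the prompt wording, helper names, and toy samples
# are assumptions made for this example, not details taken from the paper.

PROMPT = (
    "Does the following {kind} contain a defect? Answer yes or no.\n\n"
    "{body}\n"
)

# Toy WebAssembly text-format snippet standing in for an instruction file.
WAT_SAMPLE = """(module
  (func $add (param i32 i32) (result i32)
    local.get 0
    local.get 1
    i32.add))"""

# Toy Python function standing in for formatted source code.
SOURCE_SAMPLE = "def add(a, b):\n    return a + b\n"


def misformat(source: str) -> str:
    """Strip indentation from every line: still readable, no longer valid Python."""
    return "\n".join(line.lstrip() for line in source.splitlines())


def compiles(source: str) -> bool:
    """Return True if Python's built-in compiler accepts the source text."""
    try:
        compile(source, "<sample>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


def build_inputs() -> dict[str, str]:
    """Wrap each of the three program forms in the same natural-language prompt."""
    return {
        "instructions": PROMPT.format(kind="WebAssembly instruction listing", body=WAT_SAMPLE),
        "formatted": PROMPT.format(kind="source code", body=SOURCE_SAMPLE),
        "misformatted": PROMPT.format(kind="source code", body=misformat(SOURCE_SAMPLE)),
    }


if __name__ == "__main__":
    for name, text in build_inputs().items():
        print(f"--- {name} ---\n{text}")
```

Each resulting string is plain text that could be passed to a language model's tokenizer without building an AST first. How the paper actually constructs and labels its inputs is not stated in the summary.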
DOI:10.1007/s10586-025-05557-w