LLM-based program analysis for source codes, abstract syntax trees and webassembly instructions

Bibliographic Details
Published in:	Cluster computing Vol. 28; no. 14; p. 892
Main Authors:	Deng, Liangjun; Zhong, Qi; Qiu, Yao; Chen, Jingxue; Lei, Hang; Yang, Shunkun; Zhou, Liming; Cheng, Hongyuan
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 01.11.2025; Springer Nature B.V.
Subjects:	Accuracy; Codes; Computer Communication Networks; Computer Science; Deep learning; Defects; Edge computing; Embedding; Internet of Things; Large language models; Medical research; Natural language; Natural language processing; Neural networks; Operating Systems; Processor Architectures; Program verification (computers); Programming languages; Security; Semantics; Software engineering; Software reliability; Source code; Syntax; Webassembly; Group key agreement; Program analysis; Blockchain
ISSN:	1386-7857, 1573-7543

Description
Summary:	The advancement of Web3.0 technology has created an urgent need to ensure the safety and reliability of software systems. Program analysis, a crucial aspect of software security research, needs a unified solution for programs written in different languages. Moreover, previous studies stress the necessity of capturing structural features from abstract syntax trees (ASTs), commonly assuming that plain text and compiled binary instructions are difficult to represent and identify. This paper proposes three methods for extracting structural information to demonstrate that the underlying principles of source code and instructions align with those of an AST. These methods are (1) embedding program instruction files directly as natural language, (2) embedding formatted source code as natural language, and (3) embedding misformatted, non-compilable source code as natural language. We train large language models (LLMs) to distinguish defective from non-defective WebAssembly programs. Experimental results demonstrate that program instruction analysis surpasses traditional techniques, achieving state-of-the-art accuracy exceeding 98.1%. Our study also suggests a practical approach to plain-text embedding using an LLM with 7.65 billion parameters. Interestingly, although misformatted source code is readable to humans but cannot be compiled, accuracy on it remains above 98.63%. This paper not only introduces novel instruction and plain-text embedding approaches for future program security analysis, but also provides new insights for subsequent research on the three forms of program analysis: plain text, ASTs, and instructions.
DOI:	10.1007/s10586-025-05557-w
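
Illustration:	The three input forms listed in the Summary (instruction text, formatted source, misformatted source) can be pictured with a minimal Python sketch. It is not the authors' pipeline: the function names, the prompt wording, the example program, and in particular the misformatting rule (collapsing whitespace and dropping semicolons) are all assumptions made here for illustration, since the record does not describe the actual preprocessing.

```python
import re

# Hypothetical example inputs; neither comes from the paper's dataset.
WAT = "(module\n  (func $add (param i32 i32) (result i32)\n    local.get 0\n    local.get 1\n    i32.add))"
SRC = "int add(int a, int b) {\n    return a + b;   \n}\n"


def instruction_text(wat: str) -> str:
    """Form 1: treat WebAssembly text-format instructions as plain text, unchanged."""
    return wat.strip()


def formatted_source_text(src: str) -> str:
    """Form 2: formatted source code, with only trailing whitespace trimmed."""
    return "\n".join(line.rstrip() for line in src.strip().splitlines())


def misformatted_source_text(src: str) -> str:
    """Form 3 (assumed rule): collapse whitespace runs and drop semicolons,
    leaving text a human can still read but a compiler would reject."""
    collapsed = re.sub(r"\s+", " ", src.strip())
    return collapsed.replace(";", "")


def build_prompt(form_name: str, text: str) -> str:
    """Wrap one form in a natural-language classification prompt (wording assumed)."""
    return (
        f"Classify the following {form_name} as DEFECT or NON-DEFECT.\n\n"
        f"{text}\n\nAnswer:"
    )


if __name__ == "__main__":
    forms = {
        "WebAssembly instructions": instruction_text(WAT),
        "formatted source code": formatted_source_text(SRC),
        "misformatted source code": misformatted_source_text(SRC),
    }
    for name, text in forms.items():
        print(build_prompt(name, text))
        print("-" * 40)
```

Each prompt would then be tokenized and passed to a language model fine-tuned for binary classification; that step is omitted because the record does not say which model or training setup was used.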