Machine Learning-Based Vulnerability Detection in Rust Code Using LLVM IR and Transformer Model

Rust’s growing popularity in high-integrity systems requires automated vulnerability detection in order to maintain its strong safety guarantees. Although Rust’s ownership model and compile-time checks prevent many errors, sometimes unexpected bugs may occasionally pass analysis, underlining the nec...

Celý popis

Uloženo v:

Podrobná bibliografie
Vydáno v:	Machine learning and knowledge extraction Ročník 7; číslo 3; s. 79
Hlavní autoři:	Lee, Young, Boshra, Syeda Jannatul, Yang, Jeong, Cao, Zechun, Liang, Gongbo
Médium:	Journal Article
Jazyk:	angličtina
Vydáno:	Basel MDPI AG 01.09.2025
Témata:	Accuracy Audits code embedding Computer programming Data security Datasets GraphCodeBERT Keywords LLVM IR Machine learning Methods Program development software Programming languages Representations Rust Semantics Software Source code vulnerability detection United States
ISSN:	2504-4990, 2504-4990
On-line přístup:	Získat plný text
Tagy:	Přidat tag Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!

Popis
Shrnutí:	Rust’s growing popularity in high-integrity systems requires automated vulnerability detection in order to maintain its strong safety guarantees. Although Rust’s ownership model and compile-time checks prevent many errors, sometimes unexpected bugs may occasionally pass analysis, underlining the necessity for automated safe and unsafe code detection. This paper presents Rust-IR-BERT, a machine learning approach to detect security vulnerabilities in Rust code by analyzing its compiled LLVM intermediate representation (IR) instead of the raw source code. This approach offers novelty by employing LLVM IR’s language-neutral, semantically rich representation of the program, facilitating robust detection by capturing core data and control-flow semantics and reducing language-specific syntactic noise. Our method leverages a graph-based transformer model, GraphCodeBERT, which is a transformer architecture pretrained model to encode structural code semantics via data-flow information, followed by a gradient boosting classifier, CatBoost, that is capable of handling complex feature interactions—to classify code as vulnerable or safe. The model was evaluated using a carefully curated dataset of over 2300 real-world Rust code samples (vulnerable and non-vulnerable Rust code snippets) from RustSec and OSV advisory databases, compiled to LLVM IR and labeled with corresponding Common Vulnerabilities and Exposures (CVEs) identifiers to ensure comprehensive and realistic coverage. Rust-IR-BERT achieved an overall accuracy of 98.11%, with a recall of 99.31% for safe code and 93.67% for vulnerable code. Despite these promising results, this study acknowledges potential limitations such as focusing primarily on known CVEs. Built on a representative dataset spanning over 2300 real-world Rust samples from diverse crates, Rust-IR-BERT delivers consistently strong performance. Looking ahead, practical deployment could take the form of a Cargo plugin or pre-commit hook that automatically generates and scans LLVM IR artifacts during the development cycle, enabling developers to catch vulnerabilities at an early stage in the development cycle.
Bibliografie:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ISSN:	2504-4990 2504-4990
DOI:	10.3390/make7030079