TEDVIL: Leveraging Transformer-Based Embeddings for Vulnerability Detection in Lifted Code

Bibliographic Details
Published in: IEEE Access, Vol. 13, p. 1
Main Authors: McCully, Gary A., Hastings, John D., Xu, Shengjie
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2025
ISSN: 2169-3536
Description
Abstract: Ransomware and other malware inflict devastating financial and operational damage on organizations worldwide by exploiting deeply embedded, hard-to-detect vulnerabilities in their systems. Detecting these vulnerabilities in compiled code before malicious actors exploit them remains a critical challenge in cybersecurity. This research introduces TEDVIL (Transformer-based Embeddings for Discovering Vulnerabilities in Lifted Code), a novel framework that uses transformer-based embeddings to train neural networks to detect vulnerabilities in lifted code. The framework was implemented using bidirectional (BERT and RoBERTa) and unidirectional (GPT-1 and GPT-2) transformer-based models to generate embeddings for training Long Short-Term Memory (LSTM) neural networks to detect stack-based buffer overflows in Low-Level Virtual Machine (LLVM) intermediate representation code. For comparison, simpler word2vec models (Skip-Gram and Continuous Bag of Words) were also trained, and their embeddings were used to train LSTMs. The results show that the LSTMs using GPT-2 embeddings outperformed those using GPT-1, BERT, RoBERTa, and word2vec embeddings, achieving a top accuracy of 92.5% and an F1-score of 89.7%. Notably, these results were achieved with an embedding model trained on a dataset of just 48,000 functions, demonstrating effectiveness in resource-constrained settings. The findings underscore the effectiveness of TEDVIL in identifying hard-to-detect vulnerabilities in compiled code and lay the groundwork for future research in leveraging transformer-based models for vulnerability detection.
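
The abstract describes a two-stage pipeline: a pretrained transformer produces embeddings for lifted LLVM IR, and an LSTM classifies each function as vulnerable or not. The following is a minimal Python sketch of that idea using Hugging Face Transformers and PyTorch; the model choice ("gpt2"), the VulnLSTM class, the dimensions, and the IR snippet are illustrative assumptions, not the authors' implementation (which, per the abstract, trains its own embedding models on roughly 48,000 functions).

# Illustrative sketch (not the paper's code): GPT-2 token embeddings of
# lifted LLVM IR feeding an LSTM binary classifier.
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
encoder = GPT2Model.from_pretrained("gpt2").eval()

class VulnLSTM(nn.Module):
    """Maps a sequence of token embeddings to one vulnerability logit (hypothetical sizes)."""
    def __init__(self, embed_dim=768, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, embeddings):
        _, (h_n, _) = self.lstm(embeddings)    # final hidden state summarizes the function
        return self.head(h_n[-1]).squeeze(-1)  # one logit per function

# Embed one lifted function (placeholder LLVM IR text) and score it.
ir = "define i32 @f(i8* %s) { %buf = alloca [16 x i8], align 1 ... }"
tokens = tokenizer(ir, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    emb = encoder(**tokens).last_hidden_state  # shape: (1, seq_len, 768)

clf = VulnLSTM()
print(torch.sigmoid(clf(emb)))  # untrained, so the score is meaningless until fit

In this framing, training would fit the LSTM (and optionally fine-tune the encoder) on labeled vulnerable and non-vulnerable functions; the abstract compares this setup across GPT-1, GPT-2, BERT, RoBERTa, and word2vec embedding sources.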
DOI: 10.1109/ACCESS.2025.3565980