Java Code Vulnerability Detection Dataset
| Title: | Java Code Vulnerability Detection Dataset |
|---|---|
| Authors: | Gajjar, Jugal, Subramaniakuppusamy, Kamalasankari, Ranaware, Kaustik |
| Publisher Information: | Zenodo |
| Publication Year: | 2025 |
| Collection: | Zenodo |
| Description: | We curated a large Java dataset from Vul4J, Kaggle CVE Fixes, and Kaggle Vulnerability Fixes, collecting 36,425 code samples. Because the collection was heavily biased toward vulnerable code, we added 6,000 synthetically crafted safe files. After removing duplicates and parse errors, the final dataset contained 35,610 CFG-parsable files (6,234 safe, 29,376 vulnerable). Control Flow Graphs (CFGs) were generated with PROGEX, averaging 20.6 nodes and 20.3 edges per graph, enabling simultaneous analysis of structural complexity and semantic intent via pretrained LLM embeddings. Graph embeddings: node features are constructed from node text (e.g., statement tokens or code snippets). We extract graph-level embeddings with four complementary encoders: GraphSAGE (neighborhood sampling and aggregation), GCN (message passing), GAT (graph attention that learns edge attention coefficients), and Node2Vec (random-walk-based structural embeddings). Each encoder produces node embeddings that are pooled (global mean pooling) into a graph-level vector. We fix the embedding dimension at 128 for all graph pipelines to ensure a fair comparison between encoders. LLM embeddings: for each code sample we obtain a contextual embedding from a local code LLM. To preserve privacy and enable execution on modest hardware, we restrict ourselves to lightweight models (fewer than 4B parameters): Qwen2.5 Coder 3B, DeepSeek Coder 1.3B, Stable Code 3B, and Phi 3.5 Mini. LLM embeddings capture high-order semantics (identifier usage, API semantics, docstrings, idioms) that structural encoders cannot directly capture. Note: the LLM embeddings have not been uploaded to Zenodo due to memory constraints; they can be reproduced with the code provided in the GitHub repository. Minimal, hedged sketches of the embedding pipelines follow this record. |
| Document Type: | dataset |
| Language: | unknown |
| Relation: | https://zenodo.org/records/17259519; oai:zenodo.org:17259519; https://doi.org/10.5281/zenodo.17259519 |
| DOI: | 10.5281/zenodo.17259519 |
| Availability: | https://doi.org/10.5281/zenodo.17259519 ; https://zenodo.org/records/17259519 |
| Rights: | Creative Commons Attribution 4.0 International ; cc-by-4.0 ; https://creativecommons.org/licenses/by/4.0/legalcode |
| Accession Number: | edsbas.E579B25 |
| Database: | BASE |
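
The record above does not quote the authors' extraction code, so the following is a minimal PyTorch Geometric sketch of the GNN side of the pipeline: one encoder class parameterized over GraphSAGE, GCN, and GAT convolutions, with global mean pooling to a 128-d graph vector. The two-layer depth, the 64-d node-text features, and the random toy graph are illustrative assumptions; only the encoder choices, the pooling, and the 128-d output dimension come from the record.

```python
import torch
from torch_geometric.nn import SAGEConv, GCNConv, GATConv, global_mean_pool

class GraphEncoder(torch.nn.Module):
    """Two-layer GNN encoder followed by global mean pooling (depth assumed)."""
    def __init__(self, conv_cls, in_dim, hidden_dim=128, out_dim=128):
        super().__init__()
        self.conv1 = conv_cls(in_dim, hidden_dim)
        self.conv2 = conv_cls(hidden_dim, out_dim)

    def forward(self, x, edge_index, batch):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index)
        # Pool node embeddings into one 128-d vector per graph.
        return global_mean_pool(h, batch)

# Toy CFG: 20 nodes with assumed 64-d node-text features, 40 random edges.
x = torch.randn(20, 64)
edge_index = torch.randint(0, 20, (2, 40))
batch = torch.zeros(20, dtype=torch.long)  # all nodes belong to graph 0

for conv in (SAGEConv, GCNConv, GATConv):
    enc = GraphEncoder(conv, in_dim=64)
    print(conv.__name__, enc(x, edge_index, batch).shape)  # [1, 128]
```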
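Node2Vec, unlike the three message-passing encoders, is trained per graph on random walks. Below is a minimal sketch using `torch_geometric.nn.Node2Vec`; the walk hyperparameters (`walk_length`, `context_size`, `walks_per_node`) and the single training pass are assumptions, with only `embedding_dim=128` and the global mean pooling taken from the record.

```python
import torch
from torch_geometric.nn import Node2Vec

# CFG edges as a [2, num_edges] LongTensor (e.g., parsed from PROGEX output).
edge_index = torch.randint(0, 20, (2, 40))

model = Node2Vec(edge_index, embedding_dim=128, walk_length=10,
                 context_size=5, walks_per_node=10,
                 num_negative_samples=1, sparse=True)
loader = model.loader(batch_size=32, shuffle=True)
optimizer = torch.optim.SparseAdam(list(model.parameters()), lr=0.01)

model.train()
for pos_rw, neg_rw in loader:  # one pass shown; epochs are an assumption
    optimizer.zero_grad()
    loss = model.loss(pos_rw, neg_rw)  # skip-gram loss with negative sampling
    loss.backward()
    optimizer.step()

model.eval()
graph_vec = model().mean(dim=0)  # global mean pooling over node embeddings
print(graph_vec.shape)           # torch.Size([128])
```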
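The LLM embeddings are meant to be reproduced from the authors' GitHub repository rather than downloaded; that code is not quoted in the record, so the sketch below is an assumed variant: load one of the named lightweight models via Hugging Face `transformers` and mean-pool the final hidden states into one vector per file. The specific model ID and the mean-pooling strategy are assumptions, not the authors' confirmed method.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# One of the four models named in the record (ID assumed from the Hugging Face hub).
MODEL_ID = "deepseek-ai/deepseek-coder-1.3b-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
model.eval()

def embed_code(source: str) -> torch.Tensor:
    """Mean-pool the final hidden states into one contextual vector per file."""
    inputs = tokenizer(source, return_tensors="pt",
                       truncation=True, max_length=2048)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, hidden_dim]
    return hidden.mean(dim=1).squeeze(0)

vec = embed_code("public class Demo { void run() { System.out.println(1); } }")
print(vec.shape)  # one vector per code sample
```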