Java Code Vulnerability Detection Dataset
| Title: | Java Code Vulnerability Detection Dataset |
|---|---|
| Authors: | Gajjar, Jugal, Subramaniakuppusamy, Kamalasankari, Ranaware, Kaustik |
| Publisher Information: | Zenodo |
| Publication Year: | 2025 |
| Collection: | Zenodo |
| Description: | We curated a large Java dataset from Vul4J, Kaggle CVE Fixes, and Kaggle Vulnerability Fixes, collecting 36,425 code samples. Because the collection was heavily biased toward vulnerable code, we added 6,000 synthetically crafted safe files. After removing duplicates and parse errors, the final dataset contained 35,610 CFG-parsable files (6,234 safe, 29,376 vulnerable). Control Flow Graphs (CFGs) were generated with PROGEX, averaging 20.6 nodes and 20.3 edges per graph, enabling simultaneous analysis of structural complexity and semantic intent via pretrained LLM embeddings. Graph embeddings: node features are constructed from node text (e.g., statement tokens or code snippets). We extract graph-level embeddings with four complementary encoders: GraphSAGE (neighborhood sampling and aggregation), GCN (message passing), GAT (graph attention that learns edge attention coefficients), and Node2Vec (random-walk-based structural embeddings). Each encoder produces node embeddings that are pooled (global mean pooling) into a graph-level vector. We fix the embedding dimension at 128 for all graph pipelines to ensure a fair comparison between encoders. LLM embeddings: for each code sample we obtain a contextual embedding from a local code LLM. To preserve privacy and enable execution on modest hardware, we restrict ourselves to lightweight models (fewer than 4B parameters): Qwen2.5 Coder 3B, DeepSeek Coder 1.3B, Stable Code 3B, and Phi 3.5 Mini. LLM embeddings capture high-order semantics (identifier usage, API semantics, docstrings, idioms) that structural encoders cannot directly capture. Note: the LLM embeddings have not been uploaded to Zenodo due to memory constraints; they can be reproduced with the code provided in the GitHub repository. Minimal, hedged sketches of the embedding pipelines follow this record. |
| Document Type: | dataset |
| Language: | unknown |
| Relation: | https://zenodo.org/records/17259519; oai:zenodo.org:17259519; https://doi.org/10.5281/zenodo.17259519 |
| DOI: | 10.5281/zenodo.17259519 |
| Availability: | https://doi.org/10.5281/zenodo.17259519 ; https://zenodo.org/records/17259519 |
| Rights: | Creative Commons Attribution 4.0 International ; cc-by-4.0 ; https://creativecommons.org/licenses/by/4.0/legalcode |
| Accession Number: | edsbas.E579B25 |
| Database: | BASE |
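
The record above does not quote the authors' extraction code, so the following is a minimal PyTorch Geometric sketch of the GNN side of the pipeline: one encoder class parameterized over GraphSAGE, GCN, and GAT convolutions, with global mean pooling to a 128-d graph vector. The two-layer depth, the 64-d node-text features, and the random toy graph are illustrative assumptions; only the encoder choices, the pooling, and the 128-d output dimension come from the record.

```python
import torch
from torch_geometric.nn import SAGEConv, GCNConv, GATConv, global_mean_pool

class GraphEncoder(torch.nn.Module):
    """Two-layer GNN encoder followed by global mean pooling (depth assumed)."""
    def __init__(self, conv_cls, in_dim, hidden_dim=128, out_dim=128):
        super().__init__()
        self.conv1 = conv_cls(in_dim, hidden_dim)
        self.conv2 = conv_cls(hidden_dim, out_dim)

    def forward(self, x, edge_index, batch):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index)
        # Pool node embeddings into one 128-d vector per graph.
        return global_mean_pool(h, batch)

# Toy CFG: 20 nodes with assumed 64-d node-text features, 40 random edges.
x = torch.randn(20, 64)
edge_index = torch.randint(0, 20, (2, 40))
batch = torch.zeros(20, dtype=torch.long)  # all nodes belong to graph 0

for conv in (SAGEConv, GCNConv, GATConv):
    enc = GraphEncoder(conv, in_dim=64)
    print(conv.__name__, enc(x, edge_index, batch).shape)  # [1, 128]
```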
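Node2Vec, unlike the three message-passing encoders, is trained per graph on random walks. Below is a minimal sketch using `torch_geometric.nn.Node2Vec`; the walk hyperparameters (`walk_length`, `context_size`, `walks_per_node`) and the single training pass are assumptions, with only `embedding_dim=128` and the global mean pooling taken from the record.

```python
import torch
from torch_geometric.nn import Node2Vec

# CFG edges as a [2, num_edges] LongTensor (e.g., parsed from PROGEX output).
edge_index = torch.randint(0, 20, (2, 40))

model = Node2Vec(edge_index, embedding_dim=128, walk_length=10,
                 context_size=5, walks_per_node=10,
                 num_negative_samples=1, sparse=True)
loader = model.loader(batch_size=32, shuffle=True)
optimizer = torch.optim.SparseAdam(list(model.parameters()), lr=0.01)

model.train()
for pos_rw, neg_rw in loader:  # one pass shown; epochs are an assumption
    optimizer.zero_grad()
    loss = model.loss(pos_rw, neg_rw)  # skip-gram loss with negative sampling
    loss.backward()
    optimizer.step()

model.eval()
graph_vec = model().mean(dim=0)  # global mean pooling over node embeddings
print(graph_vec.shape)           # torch.Size([128])
```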
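The LLM embeddings are meant to be reproduced from the authors' GitHub repository rather than downloaded; that code is not quoted in the record, so the sketch below is an assumed variant: load one of the named lightweight models via Hugging Face `transformers` and mean-pool the final hidden states into one vector per file. The specific model ID and the mean-pooling strategy are assumptions, not the authors' confirmed method.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# One of the four models named in the record (ID assumed from the Hugging Face hub).
MODEL_ID = "deepseek-ai/deepseek-coder-1.3b-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
model.eval()

def embed_code(source: str) -> torch.Tensor:
    """Mean-pool the final hidden states into one contextual vector per file."""
    inputs = tokenizer(source, return_tensors="pt",
                       truncation=True, max_length=2048)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, hidden_dim]
    return hidden.mean(dim=1).squeeze(0)

vec = embed_code("public class Demo { void run() { System.out.println(1); } }")
print(vec.shape)  # one vector per code sample
```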