On Inter-Dataset Code Duplication and Data Leakage in Large Language Models

Bibliographic Details
Published in: IEEE Transactions on Software Engineering, Vol. 51, No. 1, pp. 192-205
Main Authors: López, José Antonio Hernández, Chen, Boqi, Saad, Mootez, Sharma, Tushar, Varró, Dániel
Format: Journal Article
Language: English
Published: New York: IEEE / IEEE Computer Society, 01.01.2025
ISSN: 0098-5589, 1939-3520
Summary:

Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge from large, general-purpose datasets during a pre-training phase, and subsequently refining the model on smaller, task-specific datasets in a fine-tuning phase.

Problem statement. Data leakage, i.e., using information from the test set during model training, is a well-known issue in training machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While intra-dataset code duplication examines this intersection within a given dataset and has been addressed in prior research, inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations, because fine-tuning test samples may already have been encountered during pre-training, resulting in inflated performance metrics.

Contribution. This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating LLMs across diverse SE tasks.

Study design. We conduct an empirical study using the CodeSearchNet dataset (CSN), a widely adopted pre-training dataset, and five fine-tuning datasets used for various SE tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of LLMs on a subset of CSN: a leaky LLM, whose pre-training set includes the identified intersection, and a non-leaky LLM that excludes those samples. Finally, we fine-tune both models and compare their performance on the fine-tuning test samples that belong to the intersection.

Results. Our findings reveal a potential threat to the evaluation of LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as CodeBERT, GraphCodeBERT, and UniXcoder could be affected by inter-dataset duplication. Based on our findings, we discuss prior research that may be susceptible to this threat and offer guidance to SE researchers on strategies to prevent inter-dataset code duplication.
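To make the deduplication and split-construction steps concrete, the following is a minimal Python sketch of how such an intersection could be computed. It assumes a token-based Jaccard-similarity criterion with an illustrative threshold of 0.8; the tokenizer, threshold, and indexing strategy actually used in the study are not stated in this summary, so every name and parameter below is a hypothetical stand-in rather than the authors' implementation.

    import re

    def tokens(code):
        # Crude lexer: identifiers, keywords, and number literals.
        return set(re.findall(r"[A-Za-z_]\w*|\d+", code))

    def jaccard(a, b):
        # Jaccard similarity between two token sets (1.0 = identical sets).
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def find_intersection(pretrain_corpus, finetune_tests, threshold=0.8):
        # Return indices of fine-tuning test samples that near-duplicate
        # at least one pre-training sample. A real pipeline would replace
        # this O(n*m) scan with MinHash/LSH indexing; threshold is assumed.
        pretrain = [tokens(c) for c in pretrain_corpus]
        return [i for i, s in enumerate(finetune_tests)
                if any(jaccard(tokens(s), p) >= threshold for p in pretrain)]

    def split_pretraining(pretrain_corpus, finetune_tests, threshold=0.8):
        # Build the two pre-training variants described in the study design:
        # the "leaky" corpus keeps every sample; the "non-leaky" corpus drops
        # samples that near-duplicate any fine-tuning test sample.
        tests = [tokens(c) for c in finetune_tests]
        non_leaky = [c for c in pretrain_corpus
                     if not any(jaccard(tokens(c), t) >= threshold for t in tests)]
        return list(pretrain_corpus), non_leaky

Comparing a model pre-trained on the leaky corpus against one pre-trained on the non-leaky corpus, evaluated on exactly the flagged test samples, isolates the performance inflation attributable to inter-dataset duplication.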
DOI: 10.1109/TSE.2024.3504286