CODEMORPH: Mitigating Data Leakage in Large Language Model Assessment

Published in: Proceedings of the IEEE/ACM International Conference on Software Engineering Companion (online), pp. 267-278
Main authors: Rao, Hongzhou; Zhao, Yanjie; Zhu, Wenjie; Xiao, Ling; Wang, Meizhen; Wang, Haoyu
Format: Conference paper
Language: English
Published: IEEE, 27 April 2025
ISSN: 2574-1934
Description
Summary: Concerns about benchmark leakage in large language models for code (Code LLMs) have raised issues of data contamination and inflated evaluation metrics. The diversity and inaccessibility of many training datasets make it difficult to prevent data leakage entirely, even with time-lag strategies. Consequently, generating new datasets through code perturbation has become essential. However, existing methods often fail to produce complex and diverse variations, struggle with complex cross-file dependencies, and lack support for multiple programming languages, which limits their effectiveness in enhancing LLM evaluations for coding tasks. To fill this gap, we propose CODEMORPH, an approach designed to support multiple programming languages while preserving cross-file dependencies to mitigate data leakage. CODEMORPH consists of two main components that work together to enhance the perturbation process. The first component employs 26 semantic-preserving transformation methods to iteratively perturb code, generating diverse variations while ensuring that the modified code remains compilable. The second component introduces a genetic algorithm-based selection algorithm, PESO, which identifies the most effective perturbation method for each iteration by targeting lower similarity scores between the perturbed and original code, thereby enhancing overall perturbation effectiveness. Experimental results demonstrate that after applying CODEMORPH, the accuracy of the LLM on code completion tasks across five programming languages decreased by an average of 24.67%, with Python showing the most significant reduction at 45%. The similarity score of code optimized by PESO is, on average, 7.01% lower than that of randomly perturbed code, peaking at a reduction of 42.86%. Additionally, overall accuracy dropped by an average of 15%, with a maximum decrease of 25%. These findings indicate that CODEMORPH effectively reduces data contamination while PESO optimizes perturbation combinations for code.
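The abstract is the only description of PESO available in this record, so the following is a rough illustrative sketch, not the paper's implementation. It pairs two toy semantic-preserving transformations with a greedy, GA-flavoured selection loop that each generation samples candidate variants and keeps the one least similar to the original code. The function names, the similarity metric (difflib's SequenceMatcher), and the loop parameters are all assumptions for illustration; the paper's 26 transformation methods, compilability checks, and cross-file dependency handling are not reproduced here.

```python
import difflib
import random

# Toy semantic-preserving transformations (illustrative only; a real tool
# would operate on the AST and verify the result still compiles).
def rename_identifiers(code: str) -> str:
    # Naive rename; safe renaming would require scope-aware parsing.
    return code.replace("result", "outcome")

def append_noop(code: str) -> str:
    # A trailing no-op statement preserves module-level semantics in Python.
    return code + "\npass"

TRANSFORMS = [rename_identifiers, append_noop]

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; the selection loop drives it down."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def select_perturbation(original: str, generations: int = 10, pop_size: int = 8) -> str:
    """GA-flavoured greedy loop: sample a population of perturbed variants each
    generation and keep the candidate least similar to the original code."""
    best = original
    for _ in range(generations):
        population = [random.choice(TRANSFORMS)(best) for _ in range(pop_size)]
        candidate = min(population, key=lambda c: similarity(c, original))
        if similarity(candidate, original) < similarity(best, original):
            best = candidate  # accept only if dissimilarity improves
    return best

print(select_perturbation("result = compute()\nprint(result)"))
```

In this reading, lower similarity to the original serves as the fitness signal, which matches the abstract's claim that PESO targets lower similarity scores between the perturbed and original code.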
DOI: 10.1109/ICSE-Companion66252.2025.00081