LevDetectCode: Zero-Shot Detection of AI-Generated Code Using Levenshtein Distance.

Bibliographic details
Title: LevDetectCode: Zero-Shot Detection of AI-Generated Code Using Levenshtein Distance.
Authors: Fu, Jiazhou; Shen, Guohua; Huang, Zhiqiu; Yu, Yaoshen; Zhao, Yu
Source: International Journal of Software Engineering & Knowledge Engineering; Dec. 2025, Vol. 35, Issue 12, pp. 1713-1736 (24 pages)
Subject terms: CODE generators, DETECTION algorithms, SIGNAL detection, COMPUTER performance
Abstract: AI-generated code is spreading rapidly, making it harder to trace the true origins of code. Text-based tools such as DetectGPT perform well on natural-language text but struggle with code. We propose a new approach based on one observation: under masked reconstruction, code produced by LLMs exhibits a much wider spread in Levenshtein distance than human-written code. Our key innovation is to treat this dispersion in Levenshtein distance as an indicator of token-probability differences, which leads us to develop LevDetectCode. Our zero-shot detector relies solely on string-level Levenshtein distance and requires neither GPUs nor access to model internals. Experiments on Python snippets from MBPP-train and HumanEval, and on Java data from CodeContest_Java_test, using GPT-3.5, GPT-4, and DeepSeek-V3, show a 5–10 point AUROC improvement over other zero-shot baselines. An additional case study on a C++ dataset also demonstrates strong effectiveness. Furthermore, LevDetectCode runs nearly 3000 times faster and significantly reduces memory usage. These results demonstrate that Levenshtein distance can outperform other zero-shot methods that rely on model-based computations, offering a practical and portable solution for detecting AI-generated code. [ABSTRACT FROM AUTHOR]
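Illustration (not from the article): the abstract names the signal, the spread of Levenshtein distances under masked reconstruction, but this record gives no algorithmic detail. The Python sketch below shows one way such a dispersion score might be computed, under loudly stated assumptions. The masking and reconstruction step is not specified here, so the reconstructed variants are simply passed in. The function names `levenshtein` and `distance_dispersion`, the length normalization, and the use of the population standard deviation as the dispersion measure are all hypothetical choices, not the authors' implementation.

```python
import statistics
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def distance_dispersion(original: str, reconstructions: Sequence[str]) -> float:
    """Spread of edit distances between a snippet and its reconstructions.

    Assumption: distances are normalized by the original's length and the
    spread is measured as a population standard deviation. The article may
    define dispersion differently.
    """
    if not reconstructions:
        raise ValueError("need at least one reconstruction")
    scale = max(len(original), 1)
    distances = [levenshtein(original, r) / scale for r in reconstructions]
    return statistics.pstdev(distances)


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b"
    # Hypothetical reconstructions; in practice these would come from
    # masking spans of the snippet and refilling them.
    variants = [
        "def add(a, b):\n    return a + b",
        "def add(x, y):\n    return x + y",
        "def add(a, b):\n    result = a + b\n    return result",
    ]
    print(distance_dispersion(snippet, variants))
```

According to the abstract, a wider spread points toward LLM-generated code. Any decision threshold would have to be calibrated on labeled data, which this sketch does not attempt.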
Copyright of International Journal of Software Engineering & Knowledge Engineering is the property of World Scientific Publishing Company and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
ISSN: 0218-1940
DOI: 10.1142/S0218194025500524