Distinguishing LLM-Generated from Human-Written Code by Contrastive Learning.
Saved in:
| Title: | Distinguishing LLM-Generated from Human-Written Code by Contrastive Learning. |
|---|---|
| Authors: | Xu, Xiaodan, Ni, Chao, Guo, Xinrong, Liu, Shaoxuan, Wang, Xiaoya, Liu, Kui, Yang, Xiaohu |
| Source: | ACM Transactions on Software Engineering & Methodology; May2025, Vol. 34 Issue 4, p1-31, 31p |
| Subject Terms: | CHATGPT, LANGUAGE models, CODE generators, MACHINE learning |
| Abstract: | Large language models (LLMs), such as ChatGPT released by OpenAI, have attracted significant attention from both industry and academia due to their demonstrated ability to generate high-quality content for various tasks. Despite the impressive capabilities of LLMs, there are growing concerns regarding their potential risks in various fields, such as news, education, and software engineering. Recently, several commercial and open source LLM-generated content detectors have been proposed, which, however, are primarily designed for detecting natural language content without considering the specific characteristics of program code. This article aims to fill this gap by proposing a novel ChatGPT-generated code detector, CodeGPTSensor, based on a contrastive learning framework and a semantic encoder built with UniXcoder. To assess the effectiveness of CodeGPTSensor on differentiating ChatGPT-generated code from human-written code, we first curate a large-scale Human and Machine comparison Corpus (HMCorp), which includes 550k pairs of human-written and ChatGPT-generated code (i.e., 288k Python code pairs and 222k Java code pairs). Based on the HMCorp dataset, our qualitative and quantitative analysis of the characteristics of ChatGPT-generated code reveals the challenge and opportunity of distinguishing ChatGPT-generated code from human-written code with their representative features. Our experimental results indicate that CodeGPTSensor can effectively identify ChatGPT-generated code, outperforming all selected baselines. [ABSTRACT FROM AUTHOR] |
| Copyright of ACM Transactions on Software Engineering & Methodology is the property of Association for Computing Machinery and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) | |
| Database: | Complementary Index |
Be the first to leave a comment!
Full Text Finder
Nájsť tento článok vo Web of Science