Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs.

Saved in:
Bibliographic Details
Title: Empirical Evaluation of Prompting Strategies for Python Syntax Error Detection with LLMs.
Authors: Aloufi, Norah; Aljuhani, Abdulmajeed
Source: Applied Sciences (2076-3417); Aug 2025, Vol. 15, Issue 16, p9223, 19 p.
Subject Terms: PYTHON programming language, LANGUAGE models, COMPUTER software development, EVALUATION methodology, SIGNAL detection
Abstract: As large language models (LLMs) are increasingly integrated into software development, there is a growing need to assess how effectively they address subtle programming errors in real-world environments. Accordingly, this study investigates the effectiveness of LLMs in identifying syntax errors within large Python code repositories. Building on the bug in the code stack (BICS) benchmark, this research expands the evaluation to include additional models, such as DeepSeek and Grok, while assessing their ability to detect errors across varying code lengths and depths. Two prompting strategies—two-shot and role-based prompting—were employed to compare the performance of models including DeepSeek-Chat, DeepSeek-Reasoner, DeepSeek-Coder, and Grok-2-Latest with GPT-4o serving as the baseline. The findings indicate that the DeepSeek models generally outperformed GPT-4o in terms of accuracy (Acc). Notably, DeepSeek-Reasoner exhibited the highest overall performance, achieving an Acc of 86.6% and surpassing all other models, particularly when integrated prompting strategies were used. Nevertheless, all models demonstrated decreased Acc with increasing input length and consistently struggled with certain types of errors, such as missing quotations (MQo). This work provides insight into the current strengths and weaknesses of LLMs within real-world debugging environments, thereby informing ongoing efforts to improve automated software tools. [ABSTRACT FROM AUTHOR]
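To make the two prompting strategies named in the abstract concrete, the sketch below shows how a role-based system prompt and two worked examples (two-shot) might be combined into a single chat-style prompt for syntax-error localization. The prompt wording, the example bugs, and the message layout are illustrative assumptions, not the authors' exact benchmark prompts.

```python
# Minimal sketch of role-based + two-shot prompting for Python syntax-error
# detection. The persona text and the two worked examples are assumptions
# for illustration; they are not taken from the paper's benchmark.

ROLE_PROMPT = (
    "You are an expert Python code reviewer. Given a Python snippet, "
    "report the line number and the type of any syntax error "
    "(e.g., missing colon, missing parenthesis, missing quotation)."
)

# Two-shot examples: each pairs a buggy snippet with the expected answer.
SHOT_1_CODE = "def add(a, b)\n    return a + b\n"
SHOT_1_ANSWER = "Line 1: missing colon after the function definition."

SHOT_2_CODE = 'print("hello)\n'
SHOT_2_ANSWER = "Line 1: missing closing quotation in the string literal."


def build_messages(target_code: str) -> list[dict]:
    """Assemble a chat-completions message list combining both strategies."""
    return [
        {"role": "system", "content": ROLE_PROMPT},        # role-based prompting
        {"role": "user", "content": SHOT_1_CODE},          # shot 1: buggy input
        {"role": "assistant", "content": SHOT_1_ANSWER},   # shot 1: expected output
        {"role": "user", "content": SHOT_2_CODE},          # shot 2: buggy input
        {"role": "assistant", "content": SHOT_2_ANSWER},   # shot 2: expected output
        {"role": "user", "content": target_code},          # code under test
    ]


if __name__ == "__main__":
    buggy = "for i in range(10)\n    print(i)\n"
    # The resulting list can be passed to any chat-completions-style API.
    for msg in build_messages(buggy):
        print(msg["role"], "->", msg["content"][:60])
```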
Database: Complementary Index
ISSN: 2076-3417
DOI: 10.3390/app15169223