A Study of Large Language Models in Detecting Python Code Violations.

Bibliographic Details
Title: A Study of Large Language Models in Detecting Python Code Violations.
Authors: Mohammed Salih, Hekar A.; Sarhan, Qusay I.
Source: ARO: The Scientific Journal of Koya University; 2025, Vol. 13 Issue 2, p215-225, 11p
Subject Terms: LANGUAGE models, QUALITY assurance, COMPUTER software quality control, NATURAL language processing
Abstract: Adhering to good coding practices is critical for enhancing a software system's readability, maintainability, and reliability. Common static code analysis tools for Python, such as Pylint and Flake8, are widely used to enforce code quality by detecting coding violations without executing the code. Yet they often fail to provide deeper semantic understanding and contextual reasoning. This study investigates the effectiveness of large language models (LLMs), compared with traditional static code analysis tools, in detecting Python coding violations. Six state-of-the-art LLMs (ChatGPT, Gemini, Claude Sonnet, DeepSeek, Kimi, and Qwen) were evaluated against the Pylint and Flake8 tools. To do so, a curated dataset of 75 Python code snippets, annotated with 27 common code violations, was used. In addition, three common prompting strategies (structural, chain-of-thought, and role-based) were used to instruct the selected LLMs. The experimental results reveal that Claude Sonnet achieved the highest F1-score (0.81), outperforming Flake8 (0.79) and demonstrating strong precision (0.99) and recall (0.69). However, the LLMs varied in performance, with Qwen and DeepSeek underperforming relative to the others. Moreover, the LLMs performed better at identifying documentation and design violations (such as missing type hints and nested method structures) than at detecting stylistic inconsistencies and violations requiring complex semantic reasoning. The results were heavily influenced by the prompting approach, with structural prompts yielding the most balanced performance in the majority of cases. This research contributes practical empirical evidence on employing LLMs for code quality assurance and demonstrates their potential role as complements to static code analysis tools for Python, with methodologies that may extend to other languages. [ABSTRACT FROM AUTHOR]
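The precision, recall, and F1 figures the abstract reports can be sketched with a small scoring function that compares a detector's reported violations against a hand-annotated ground truth per snippet. This is a minimal illustration, assuming a set-based matching of violation codes; the specific message codes and snippet data below are invented for the example and are not the paper's dataset.

```python
# Illustrative scoring of a violation detector (static tool or LLM)
# against annotated ground truth, per snippet. Violation codes here
# are Pylint/Flake8-style identifiers chosen only as examples.

def score(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for one snippet's detections."""
    tp = len(predicted & gold)   # reported AND annotated
    fp = len(predicted - gold)   # reported but not annotated (false alarm)
    fn = len(gold - predicted)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: one snippet's ground truth vs. a tool's report
gold = {"C0114", "C0116", "E501"}        # annotated violations
predicted = {"C0114", "E501", "W0611"}   # detector's output
p, r, f1 = score(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Aggregating such per-snippet counts over the whole dataset (micro-averaging the TP/FP/FN totals) is one common way to arrive at a single F1 per tool, though the paper's exact aggregation should be checked in the original article.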
Database: Complementary Index
ISSN: 2410-9355
DOI: 10.14500/aro.12395