A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading

Programming education traditionally requires extensive manual assessment of student assignments, which is both time-consuming and resource-intensive for instructors. Recent advances in large language models (LLMs) open opportunities for automating this process and providing timely feedback. This pap...

Full description

Saved in:

Bibliographic Details
Published in:	Applied sciences Vol. 15; no. 18; p. 10055
Main Authors:	Bernik, Andrija, Radošević, Danijel, Čep, Andrej
Format:	Journal Article
Language:	English
Published:	Basel MDPI AG 01.09.2025
Subjects:	Accuracy AI-assisted grading Artificial intelligence automated feedback Automation Chatbots ChatGPT Computer science Education educational technology Essays Feedback Higher education Large language models Learning Learning management systems Mechanization Peers programming education Students Teachers Teachers, Rating of
ISSN:	2076-3417, 2076-3417
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Programming education traditionally requires extensive manual assessment of student assignments, which is both time-consuming and resource-intensive for instructors. Recent advances in large language models (LLMs) open opportunities for automating this process and providing timely feedback. This paper investigates the application of artificial intelligence (AI) tools for preliminary assessment of undergraduate programming assignments. A multi-phase experimental study was conducted across three computer science courses: Introduction to Programming, Programming 2, and Advanced Programming Concepts. A total of 315 Python assignments were collected from the Moodle learning management system, with 100 randomly selected submissions analyzed in detail. AI evaluation was performed using ChatGPT-4 (GPT-4-turbo), Claude 3, and Gemini 1.5 Pro models, employing structured prompts aligned with a predefined rubric that assessed functionality, code structure, documentation, and efficiency. Quantitative results demonstrate high correlation between AI-generated scores and instructor evaluations, with ChatGPT-4 achieving the highest consistency (Pearson coefficient 0.91) and the lowest average absolute deviation (0.68 points). Qualitative analysis highlights AI’s ability to provide structured, actionable feedback, though variability across models was observed. The study identifies benefits such as faster evaluation and enhanced feedback quality, alongside challenges including model limitations, potential biases, and the need for human oversight. Recommendations emphasize hybrid evaluation approaches combining AI automation with instructor supervision, ethical guidelines, and integration of AI tools into learning management systems. The findings indicate that AI-assisted grading can improve efficiency and pedagogical outcomes while maintaining academic integrity.
Bibliography:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ISSN:	2076-3417 2076-3417
DOI:	10.3390/app151810055