Hands-on analysis of using large language models for the auto evaluation of programming assignments

Bibliographic details
Published in: Information Systems (Oxford), Vol. 128, p. 102473
Authors: Mohamed, Kareem; Yousef, Mina; Medhat, Walaa; Mohamed, Ensaf Hussein; Khoriba, Ghada; Arafa, Tamer
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.02.2025
ISSN: 0306-4379

Abstract: The increasing adoption of programming education necessitates efficient and accurate methods for evaluating students’ coding assignments. Traditional manual grading is time-consuming, often inconsistent, and prone to subjective biases. This paper explores the application of large language models (LLMs) for the automated evaluation of programming assignments. LLMs can use advanced natural language processing capabilities to assess code quality, functionality, and adherence to best practices, providing detailed feedback and grades. We demonstrate the effectiveness of LLMs through experiments comparing their performance with human evaluators across various programming tasks. Our study evaluates the performance of several LLMs for automated grading. Gemini 1.5 Pro achieves an exact match accuracy of 86% and a ±1 accuracy of 98%. GPT-4o also demonstrates strong performance, with exact match and ±1 accuracies of 69% and 97%, respectively. Both models correlate highly with human evaluations, indicating their potential for reliable automated grading. However, models such as Llama 3 70B and Mixtral 8×7B exhibit low accuracy and alignment with human grading, particularly in problem-solving tasks. These findings suggest that advanced LLMs are instrumental in scalable, automated educational assessment. Additionally, LLMs enhance the learning experience by offering personalized, instant feedback, fostering an iterative learning process. The results indicate that LLMs could play a pivotal role in the future of programming education, ensuring scalability and consistency in evaluation.

Highlights:
• Analyze programming-education methods for evaluating programming assignments.
• Present the capabilities of LLMs in understanding and evaluating code.
• Compare the performance and consistency of LLMs with human evaluators.
• Discuss the use of LLMs to improve the learning experience and educational outcomes.
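
The abstract reports exact-match accuracy, ±1 accuracy, and correlation between LLM-assigned and human grades. The short Python sketch below is not from the paper; it shows one plausible way such agreement metrics can be computed, assuming grades are integers on a shared scale. All function names, variable names, and the example grade lists are illustrative only.

# Minimal sketch (not the authors' code): agreement between LLM and human grades.
# Assumes integer grades on a shared scale (e.g. 0-10); example data is made up.

from statistics import correlation  # Pearson correlation, Python 3.10+

def exact_match_accuracy(model_grades, human_grades):
    """Fraction of assignments where the LLM grade equals the human grade."""
    matches = sum(m == h for m, h in zip(model_grades, human_grades))
    return matches / len(human_grades)

def within_one_accuracy(model_grades, human_grades):
    """Fraction of assignments where the LLM grade is within +/-1 of the human grade."""
    close = sum(abs(m - h) <= 1 for m, h in zip(model_grades, human_grades))
    return close / len(human_grades)

# Illustrative example with hypothetical grades.
human = [8, 6, 9, 4, 7, 10, 5, 8]
model = [8, 7, 9, 4, 6, 10, 5, 9]

print(f"Exact match: {exact_match_accuracy(model, human):.0%}")
print(f"Within +/-1: {within_one_accuracy(model, human):.0%}")
print(f"Pearson r:   {correlation(model, human):.2f}")
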
DOI: 10.1016/j.is.2024.102473