JorGPT: Instructor-Aided Grading of Programming Assignments with Large Language Models (LLMs)

Bibliographic Details
Published in:	Future Internet, Vol. 17, No. 6, p. 265
Main Authors:	Cisneros-González, Jorge; Gordo-Herrera, Natalia; Barcia-Santos, Iván; Sánchez-Soriano, Javier
Format:	Journal Article
Language:	English
Published:	Basel: MDPI AG, 1 June 2025
Subjects:	academic assessment; AI-helped feedback; Artificial intelligence; automated assessment; automated code assessment; Automation; Comparative analysis; Computer science; Deep learning; Evaluation; Feedback; generative artificial intelligence; Large language models; Machine learning; Mechanization; Multiple choice; Natural language processing; Programming; Science education; Students; Teachers; Teaching; Workloads
ISSN:	1999-5903

Description
Abstract:	This paper explores the application of large language models (LLMs) to automate the evaluation of programming assignments in an undergraduate “Introduction to Programming” course. The study addresses the challenges of manual grading, including time constraints and potential inconsistencies, by proposing a system that integrates several LLMs to streamline the assessment process. The system uses a graphical interface to process student submissions, allowing instructors to select an LLM and customize the grading rubric. A comparative analysis, using LLMs from OpenAI, Google, DeepSeek, and Alibaba to evaluate student code submissions, revealed a strong correlation between LLM-generated grades and those assigned by human instructors. Specifically, the reduced model, which uses only the statistically significant variables, shows high explanatory power, with an adjusted R² of 0.9156 and a Mean Absolute Error of 0.4579, indicating that LLMs can effectively replicate human grading. The findings suggest that LLMs can automate grading when paired with human oversight, drastically reducing instructor workload: a task estimated to take more than 300 h of manual work is completed in less than 15 min of automated processing, improving the efficiency and consistency of assessment in computer science education.
DOI:	10.3390/fi17060265
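
The abstract reports two agreement metrics between LLM-generated and instructor-assigned grades: an adjusted R² and a Mean Absolute Error. The following is a minimal sketch of how such metrics are commonly computed, not the authors' implementation. The function names, the sample grades, and the predictor count are hypothetical, and the predicted values are assumed to be the fitted values of a regression with `n_predictors` explanatory variables.

```python
from typing import Sequence


def mean_absolute_error(human: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between human and predicted grades."""
    if len(human) != len(predicted) or len(human) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(h - p) for h, p in zip(human, predicted)) / len(human)


def adjusted_r2(human: Sequence[float], predicted: Sequence[float],
                n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(human)
    if n != len(predicted):
        raise ValueError("inputs must be of equal length")
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    mean_h = sum(human) / n
    ss_res = sum((h - p) ** 2 for h, p in zip(human, predicted))  # residual sum of squares
    ss_tot = sum((h - mean_h) ** 2 for h in human)                # total sum of squares
    if ss_tot == 0:
        raise ValueError("human grades have zero variance")
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical grades on a 0-10 scale; NOT data from the paper.
human_grades = [5.0, 7.0, 9.0, 6.0, 8.0]
model_grades = [5.5, 6.5, 9.0, 6.0, 8.5]

mae = mean_absolute_error(human_grades, model_grades)
adj = adjusted_r2(human_grades, model_grades, n_predictors=1)
print(f"MAE = {mae:.3f}, adjusted R^2 = {adj:.3f}")
```

The adjustment penalizes R² for each additional predictor, which is why a reduced model restricted to statistically significant variables can be compared fairly against a fuller one.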