JorGPT: Instructor-Aided Grading of Programming Assignments with Large Language Models (LLMs)



Bibliographic Details
Published in: Future Internet, Vol. 17, No. 6, p. 265
Main Authors: Cisneros-González, Jorge; Gordo-Herrera, Natalia; Barcia-Santos, Iván; Sánchez-Soriano, Javier
Format: Journal Article
Language: English
Published: Basel: MDPI AG, 1 June 2025
ISSN: 1999-5903
Description
Summary: This paper explores the application of large language models (LLMs) to automate the evaluation of programming assignments in an undergraduate “Introduction to Programming” course. The study addresses the challenges of manual grading, including time constraints and potential inconsistencies, by proposing a system that integrates several LLMs to streamline the assessment process. The system provides a graphical interface for processing student submissions, allowing instructors to select an LLM and customize the grading rubric. A comparative analysis, using LLMs from OpenAI, Google, DeepSeek, and Alibaba to evaluate student code submissions, revealed a strong correlation between LLM-generated grades and those assigned by human instructors. Specifically, the reduced model using only statistically significant variables shows high explanatory power, with an adjusted R² of 0.9156 and a Mean Absolute Error (MAE) of 0.4579, indicating that LLMs can effectively replicate human grading. The findings suggest that LLMs, paired with human oversight, can automate grading and drastically reduce instructor workload, turning a task estimated at more than 300 hours of manual work into less than 15 minutes of automated processing while improving the efficiency and consistency of assessment in computer science education.
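As a rough, hypothetical illustration of the evaluation reported in the summary, the Python sketch below computes the two cited metrics, Mean Absolute Error and adjusted R², for paired instructor and LLM grades. The grade values and function names are invented for demonstration and are not taken from the paper.

    # Minimal sketch (not the authors' code): MAE and adjusted R^2 between
    # instructor-assigned grades and LLM-generated grades. Data is made up.
    import numpy as np

    def mean_absolute_error(y_true, y_pred):
        """Average absolute deviation between instructor and LLM grades."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return np.mean(np.abs(y_true - y_pred))

    def adjusted_r2(y_true, y_pred, n_predictors=1):
        """R^2 penalized by the number of predictors in the regression model."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        n = len(y_true)
        return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

    if __name__ == "__main__":
        instructor = [7.5, 9.0, 4.0, 6.5, 8.0, 10.0]  # hypothetical human grades
        llm        = [7.0, 9.5, 4.5, 6.0, 8.5, 9.5]   # hypothetical LLM grades
        print(f"MAE:         {mean_absolute_error(instructor, llm):.4f}")
        print(f"Adjusted R2: {adjusted_r2(instructor, llm):.4f}")

In this framing, values close to the reported adjusted R² of 0.9156 and MAE of 0.4579 would indicate that the LLM grades track instructor grades closely on the grading scale used.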
DOI: 10.3390/fi17060265