Evaluation of LLM Tools for Feedback Generation in a Course on Concurrent Programming

Bibliographic Details
Published in:	International journal of artificial intelligence in education Vol. 35; no. 2; pp. 774 - 790
Main Authors:	Estévez-Ayres, Iria; Callejo, Patricia; Hombrados-Herrera, Miguel Ángel; Alario-Hoyos, Carlos; Delgado Kloos, Carlos
Format:	Journal Article
Language:	English
Published:	New York: Springer New York, 01.06.2025; Springer Nature B.V.
Subjects:	Arithmetic; Artificial Intelligence; Chatbots; Cognition & reasoning; Cognitive Processes; Colleges & universities; Computer Science; Computers and Education; Concurrency; Concurrent processing; Education; Educational Technology; Errors; Feedback; Feedback (Response); Generative artificial intelligence; Language; Language Processing; Large language models; Learner Engagement; Learning; Learning Management Systems; Learning Processes; Learning Strategies; Literature Reviews; Natural language processing; Outcomes of Education; Programming; Sequential Approach; Student Characteristics; Student Improvement; Students; Teaching; User Interfaces and Human Computer Interaction; User services; Generative AI; ChatGPT; Bard
ISSN:	1560-4292, 1560-4306

Description
Summary:	The emergence of Large Language Models (LLMs) has marked a significant change in education. These LLMs and their associated chatbots have yielded several advantages for both students and educators, including their use as teaching assistants for content creation or summarisation. This paper evaluates the capacity of LLM chatbots to provide feedback on student exercises in a university programming course. The complexity of the programming topic in this study (concurrency) makes feedback to students even more important. The authors first assessed the exercises submitted by students. Then, ChatGPT (from OpenAI) and Bard (from Google) were used to evaluate each exercise, looking for typical concurrency errors such as starvation, deadlocks, or race conditions. Comparison with the ground-truth evaluations performed by expert teachers shows that neither of the two tools can accurately assess the exercises, despite the generally positive reception of LLMs within the educational sector. All attempts result in an accuracy rate of 50%, meaning that both tools are limited in their ability to evaluate these particular exercises effectively, specifically in finding typical concurrency errors.
DOI:	10.1007/s40593-024-00406-0