Qualitative Coding with GPT-4: Where It Works Better

Bibliographic Details
Title: Qualitative Coding with GPT-4: Where It Works Better
Language: English
Authors: Xiner Liu (ORCID 0009-0004-3796-2251), Andres Felipe Zambrano (ORCID 0000-0003-0692-1209), Ryan S. Baker (ORCID 0000-0002-3051-3232), Amanda Barany (ORCID 0000-0003-2239-2271), Jaclyn Ocumpaugh (ORCID 0000-0002-9667-8523), Jiayi Zhang (ORCID 0000-0002-7334-4256), Maciej Pankiewicz (ORCID 0000-0002-6945-0523), Nidhi Nasiar (ORCID 0009-0006-7063-5433), Zhanlan Wei (ORCID 0009-0002-3931-6398)
Source: Journal of Learning Analytics. 2025 12(1):169-185.
Availability: Society for Learning Analytics Research. 121 Pointe Marsan, Beaumont, AB T4X 0A2, Canada. Tel: +61-429-920-838; e-mail: info@solaresearch.org; Web site: https://learning-analytics.info/index.php/JLA/index
Peer Reviewed: Y
Page Count: 17
Publication Date: 2025
Sponsoring Agency: National Science Foundation (NSF), Division of Research on Learning in Formal and Informal Settings (DRL)
Contract Number: 2301173
Document Type: Journal Articles
Reports - Research
Descriptors: Coding, Artificial Intelligence, Automation, Data Analysis, Educational Research, Engineering, Man Machine Systems, Algebra, Tutoring, Game Based Learning, Troubleshooting, Introductory Courses, Programming, Interrater Reliability, Program Effectiveness, Prompting
ISSN: 1929-7750
Abstract: This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies -- Zero-shot, Few-shot, and Few-shot with contextual information -- as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I semi-personalized tutoring session transcripts, student observations in a game-based learning environment, and debugging behaviours in an introductory programming course. We evaluated the performance of each approach based on its inter-rater agreement with human coders and explored how different methods vary in effectiveness depending on a construct's degree of clarity, concreteness, objectivity, granularity, and specificity. Our findings suggest that while GPT-4 can code a broad range of constructs, no single method consistently outperforms the others, and the selection of a particular method should be tailored to the specific properties of the construct and context being analyzed. We also found that GPT-4 has the most difficulty with the same constructs that human coders find most difficult to reach inter-rater reliability on.
Abstractor: As Provided
Entry Date: 2025
Accession Number: EJ1465623
Database: ERIC