Qualitative Coding with GPT-4: Where It Works Better

Bibliographic Details
Title: Qualitative Coding with GPT-4: Where It Works Better
Language: English
Authors: Xiner Liu (ORCID 0009-0004-3796-2251), Andres Felipe Zambrano (ORCID 0000-0003-0692-1209), Ryan S. Baker (ORCID 0000-0002-3051-3232), Amanda Barany (ORCID 0000-0003-2239-2271), Jaclyn Ocumpaugh (ORCID 0000-0002-9667-8523), Jiayi Zhang (ORCID 0000-0002-7334-4256), Maciej Pankiewicz (ORCID 0000-0002-6945-0523), Nidhi Nasiar (ORCID 0009-0006-7063-5433), Zhanlan Wei (ORCID 0009-0002-3931-6398)
Source: Journal of Learning Analytics. 2025 12(1):169-185.
Availability: Society for Learning Analytics Research. 121 Pointe Marsan, Beaumont, AB T4X 0A2, Canada. Tel: +61-429-920-838; e-mail: info@solaresearch.org; Web site: https://learning-analytics.info/index.php/JLA/index
Peer Reviewed: Y
Page Count: 17
Publication Date: 2025
Sponsoring Agency: National Science Foundation (NSF), Division of Research on Learning in Formal and Informal Settings (DRL)
Contract Number: 2301173
Document Type: Journal Articles
Reports - Research
Descriptors: Coding, Artificial Intelligence, Automation, Data Analysis, Educational Research, Engineering, Man Machine Systems, Algebra, Tutoring, Game Based Learning, Troubleshooting, Introductory Courses, Programming, Interrater Reliability, Program Effectiveness, Prompting
ISSN: 1929-7750
Abstract: This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies -- Zero-shot, Few-shot, and Few-shot with contextual information -- as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I semi-personalized tutoring session transcripts, student observations in a game-based learning environment, and debugging behaviours in an introductory programming course. We evaluated the performance of each approach based on its inter-rater agreement with human coders and explored how different methods vary in effectiveness depending on a construct's degree of clarity, concreteness, objectivity, granularity, and specificity. Our findings suggest that while GPT-4 can code a broad range of constructs, no single method consistently outperforms the others, and the selection of a particular method should be tailored to the specific properties of the construct and context being analyzed. We also found that GPT-4 has the most difficulty with the same constructs that human coders find most difficult to reach inter-rater reliability on.
Abstractor: As Provided
Entry Date: 2025
Accession Number: EJ1465623
Database: ERIC