Qualitative Coding with GPT-4: Where It Works Better
| Title: | Qualitative Coding with GPT-4: Where It Works Better |
|---|---|
| Language: | English |
| Authors: | Xiner Liu |
| Source: | Journal of Learning Analytics. 2025 12(1):169-185. |
| Availability: | Society for Learning Analytics Research. 121 Pointe Marsan, Beaumont, AB T4X 0A2, Canada. Tel: +61-429-920-838; e-mail: info@solaresearch.org; Web site: https://learning-analytics.info/index.php/JLA/index |
| Peer Reviewed: | Y |
| Page Count: | 17 |
| Publication Date: | 2025 |
| Sponsoring Agency: | National Science Foundation (NSF), Division of Research on Learning in Formal and Informal Settings (DRL) |
| Contract Number: | 2301173 |
| Document Type: | Journal Articles; Reports - Research |
| Descriptors: | Coding, Artificial Intelligence, Automation, Data Analysis, Educational Research, Engineering, Man Machine Systems, Algebra, Tutoring, Game Based Learning, Troubleshooting, Introductory Courses, Programming, Interrater Reliability, Program Effectiveness, Prompting |
| ISSN: | 1929-7750 |
| Abstract: | This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies -- Zero-shot, Few-shot, and Few-shot with contextual information -- as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I semi-personalized tutoring session transcripts, student observations in a game-based learning environment, and debugging behaviours in an introductory programming course. We evaluated the performance of each approach based on its inter-rater agreement with human coders and explored how different methods vary in effectiveness depending on a construct's degree of clarity, concreteness, objectivity, granularity, and specificity. Our findings suggest that while GPT-4 can code a broad range of constructs, no single method consistently outperforms the others, and the selection of a particular method should be tailored to the specific properties of the construct and context being analyzed. We also found that GPT-4 has the most difficulty with the same constructs that human coders find more difficult to reach inter-rater reliability on. |
| Abstractor: | As Provided |
| Entry Date: | 2025 |
| Accession Number: | EJ1465623 |
| Database: | ERIC |