Faculty versus artificial intelligence chatbot: a comparative analysis of multiple-choice question quality in physiology.

Bibliographic Details
Title: Faculty versus artificial intelligence chatbot: a comparative analysis of multiple-choice question quality in physiology.
Authors: Dhanvijay AD; Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India., Kumari A; Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India., Pinjar MJ; Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India., Kumari A; Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India., Ganguly A; Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India., Priya A; Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India., Juhi A; Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India., Gupta P; Department of Microbiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India., Mondal H; Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India.
Source: Advances in physiology education [Adv Physiol Educ] 2025 Dec 01; Vol. 49 (4), pp. 1045-1051. Date of Electronic Publication: 2025 Sep 22.
Publication Type: Journal Article; Comparative Study
Language: English
Journal Info: Publisher: American Physiological Society Country of Publication: United States NLM ID: 100913944 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1522-1229 (Electronic) Linking ISSN: 1043-4046 NLM ISO Abbreviation: Adv Physiol Educ Subsets: MEDLINE
Imprint Name(s): Original Publication: Bethesda, MD : American Physiological Society, c1989-
MeSH Terms: Artificial Intelligence*/standards , Physiology*/education , Educational Measurement*/methods , Educational Measurement*/standards , Education, Medical, Undergraduate*/methods , Education, Medical, Undergraduate*/standards , Faculty, Medical*/standards , Students, Medical*, Humans ; Psychometrics/methods ; Male ; Female ; Generative Artificial Intelligence
Abstract: Multiple-choice questions (MCQs) are widely used for assessment in medical education. While human-generated MCQs benefit from pedagogical insight, creating high-quality items is time intensive. With the advent of artificial intelligence (AI), tools like DeepSeek R1 offer potential for automated MCQ generation, though their educational validity remains uncertain. Against this background, this study compared the psychometric quality of Physiology MCQs generated by faculty and by an AI chatbot. A total of 200 MCQs were developed following the standard syllabus and question design guidelines: 100 by Physiology faculty and 100 by the AI chatbot DeepSeek R1. Fifty questions from each group were randomly selected and administered to undergraduate medical students in a 2-hour assessment. Item analysis was conducted postassessment using the difficulty index (DIFI), discrimination index (DI), and number of nonfunctional distractors (NFDs). Statistical comparisons were made using t tests or their nonparametric equivalents, with significance at P < 0.05. Chatbot-generated MCQs had a significantly higher DIFI (0.64 ± 0.22) than faculty MCQs (0.47 ± 0.19; P < 0.0001). No significant difference in DI was found between the groups (P = 0.17). Faculty MCQs had significantly fewer NFDs (median 0) than chatbot MCQs (median 1; P = 0.0063). AI-generated MCQs demonstrated comparable discrimination ability but were generally easier and contained more ineffective distractors. While chatbots show promise in MCQ generation, further refinement is needed to improve distractor quality and item difficulty. AI can complement but not yet replace human expertise in assessment design. NEW & NOTEWORTHY This study contributes to the growing research on artificial intelligence (AI)- versus faculty-generated multiple-choice questions in Physiology. Psychometric analysis showed that AI-generated items were generally easier than faculty-authored questions and contained more nonfunctional distractors, while showing comparable discrimination ability. By focusing on Physiology, this work offers discipline-specific insights and underscores both the potential and the current limitations of AI in assessment development.
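The abstract names the three item-analysis metrics but does not publish the formulas or code used. The sketch below is a minimal Python illustration of one conventional approach: difficulty index as the proportion of examinees answering correctly, discrimination index from the top and bottom 27% of total scorers, and a distractor counted as nonfunctional when chosen by fewer than 5% of examinees. The 27% and 5% cut-offs, and all identifiers and data, are assumptions for illustration, not values taken from the article.

```python
# Illustrative item-analysis sketch; cut-offs (27% extreme groups, <5% distractor
# selection) are common conventions assumed here, not reported in the article.
from typing import Dict, List

def difficulty_index(item_correct: List[bool]) -> float:
    """DIFI: proportion of examinees who answered the item correctly."""
    return sum(item_correct) / len(item_correct)

def discrimination_index(item_correct: List[bool], exam_totals: List[float]) -> float:
    """DI: (correct in top 27% group - correct in bottom 27% group) / group size."""
    n = len(exam_totals)
    k = max(1, round(0.27 * n))
    order = sorted(range(n), key=lambda i: exam_totals[i])
    low, high = order[:k], order[-k:]
    return (sum(item_correct[i] for i in high) - sum(item_correct[i] for i in low)) / k

def nonfunctional_distractors(option_counts: Dict[str, int], key: str,
                              threshold: float = 0.05) -> int:
    """NFD: number of distractors selected by fewer than `threshold` of examinees."""
    n = sum(option_counts.values())
    return sum(1 for opt, c in option_counts.items() if opt != key and c / n < threshold)

# Hypothetical example: 10 examinees on one four-option item with key 'B'
item_correct = [True, True, False, True, False, True, True, False, True, True]
exam_totals  = [38, 41, 22, 35, 18, 40, 33, 20, 37, 44]
choices      = {"A": 1, "B": 7, "C": 2, "D": 0}

print(difficulty_index(item_correct))                  # 0.70 -> moderately easy item
print(discrimination_index(item_correct, exam_totals)) # 1.00 -> discriminates well
print(nonfunctional_distractors(choices, key="B"))     # 1 (option D never chosen)
```

On this interpretation, a higher DIFI (as reported for the chatbot items) means easier questions, and a higher NFD count means more distractors that fail to attract examinees.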
Contributed Indexing: Keywords: artificial intelligence; chatbot; item analysis; large language model; multiple choice question
Entry Date(s): Date Created: 20250922 Date Completed: 20251013 Latest Revision: 20251013
Update Code: 20251013
DOI: 10.1152/advan.00197.2025
PMID: 40981738
Database: MEDLINE