Comparison of ChatGPT and DeepSeek on a Standardized Audiologist Qualification Examination in Chinese: Observational Study.

Bibliographic Details
Title: Comparison of ChatGPT and DeepSeek on a Standardized Audiologist Qualification Examination in Chinese: Observational Study.
Authors: Qi B; Beijing Tongren Hospital, Capital Medical University, Key Laboratory of Otolaryngology - Head and Neck Surgery (Capital Medical University), Ministry of Education, Beijing, China., Zheng Y; Beijing Tongren Hospital, Capital Medical University, Key Laboratory of Otolaryngology - Head and Neck Surgery (Capital Medical University), Ministry of Education, Beijing, China., Wang Y; Beijing Tongren Hospital, Capital Medical University, Key Laboratory of Otolaryngology - Head and Neck Surgery (Capital Medical University), Ministry of Education, Beijing, China., Xu L; Department of Hearing, Speech and Language Sciences, Ohio University, Athens, OH, United States.
Source: JMIR formative research [JMIR Form Res] 2025 Nov 28; Vol. 9, pp. e79534. Date of Electronic Publication: 2025 Nov 28.
Publication Type: Journal Article; Observational Study; Comparative Study
Language: English
Journal Info: Publisher: JMIR Publications Country of Publication: Canada NLM ID: 101726394 Publication Model: Electronic Cited Medium: Internet ISSN: 2561-326X (Electronic) Linking ISSN: 2561326X NLM ISO Abbreviation: JMIR Form Res Subsets: MEDLINE
Imprint Name(s): Original Publication: Toronto, ON, Canada : JMIR Publications, [2017]-
MeSH Terms: Audiology*/education; Audiologists*/education; Audiologists*/standards; Educational Measurement*/methods; Educational Measurement*/standards; Artificial Intelligence*; Humans; Taiwan; Female; Male; Adult; Generative Artificial Intelligence; East Asian People
Abstract: Background: Generative artificial intelligence (GenAI), exemplified by ChatGPT and DeepSeek, is rapidly advancing and reshaping human-computer interaction with its growing reasoning capabilities and broad applications across fields such as medicine and education.
Objective: This study aimed to evaluate the performance of 2 GenAI models (ie, GPT-4-turbo and DeepSeek-R1) on a standardized audiologist qualification examination in Chinese and to explore their potential applicability in audiology education and clinical training.
Methods: The 2024 Taiwan Audiologist Qualification Examination, comprising 300 multiple-choice questions across 6 subject areas (ie, basic hearing science, behavioral audiology, electrophysiological audiology, principles and practice of hearing devices, health and rehabilitation of the auditory and balance systems, and hearing and speech communication disorders [including professional ethics]), was used to assess the performance of the 2 GenAI models. The complete answering process and reasoning paths of the models were recorded, and performance was analyzed by overall accuracy, subject-specific scores, and question-type scores. Statistical comparisons were performed at the item level using the McNemar test.
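A minimal sketch of the item-level comparison described above, assuming each model's per-question results are available as 0/1 correctness scores (the function name and data layout are illustrative and not taken from the paper); it uses the exact McNemar test from statsmodels and Cohen kappa from scikit-learn:

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(scores_a, scores_b):
    """Paired item-level comparison of two models' 0/1 correctness scores."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    # 2x2 paired table: rows = model A correct/incorrect, cols = model B correct/incorrect
    table = np.array([
        [np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
        [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))],
    ])
    result = mcnemar(table, exact=True)   # exact binomial test on the discordant cells
    kappa = cohen_kappa_score(a, b)       # chance-corrected agreement between the models
    return table, result.pvalue, kappa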
Results: ChatGPT and DeepSeek achieved overall accuracies of 80.3% (241/300) and 79.3% (238/300), respectively, both exceeding the passing criterion of the Taiwan Audiologist Qualification Examination (ie, 60% correct answers). The accuracies for the 6 subject areas were 88% (44/50), 70% (35/50), 86% (43/50), 76% (38/50), 82% (41/50), and 80% (40/50) for ChatGPT and 82% (41/50), 72% (36/50), 78% (39/50), 80% (40/50), 80% (40/50), and 84% (42/50) for DeepSeek. No significant differences were found between the two models at the item level (overall P=.79), with a small effect size (accuracy difference=+1%, Cohen h=0.02, odds ratio 0.90, 95% CI 0.53-1.52) and substantial agreement (κ=0.71). ChatGPT scored highest in basic hearing science (88%), whereas DeepSeek performed best in hearing and speech communication disorders (84%). Both models scored lowest in behavioral audiology (ChatGPT: 70%; DeepSeek: 72%). Question-type analysis revealed that both models performed well on reverse logic questions (ChatGPT: 79/95, 83%; DeepSeek: 80/95, 84%) and moderately on complex multiple-choice questions (ChatGPT: 9/17, 53%; DeepSeek: 11/17, 65%), but both performed poorly on graph-based questions (ChatGPT: 2/11, 18%; DeepSeek: 4/11, 36%).
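The effect-size quantities quoted above follow standard formulas; the sketch below is illustrative only (it is not the authors' code and is not claimed to reproduce their exact values):

import math

def cohens_h(p1, p2):
    # Cohen h: difference of arcsine-transformed proportions
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def paired_odds_ratio(b, c, z=1.96):
    # Conditional odds ratio b/c from the discordant pairs of a paired 2x2 table,
    # with an approximate 95% CI based on SE(log OR) = sqrt(1/b + 1/c)
    or_ = b / c
    se = math.sqrt(1 / b + 1 / c)
    return or_, (math.exp(math.log(or_) - z * se), math.exp(math.log(or_) + z * se))

# Example with the reported overall accuracies (241/300 vs 238/300):
h = cohens_h(241 / 300, 238 / 300)   # a small effect for a ~1-point accuracy difference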
Conclusions: Both GenAI models demonstrated strong professional knowledge and stable reasoning ability, meeting the basic requirements of clinical audiologists and suggesting their potential as supportive tools in audiology education. However, the presence of errors underscores the need for cautious use under educator supervision. Future research should explore their performance in open-ended, real-world clinical scenarios to assess practical applicability and limitations.
(©Beier Qi, Yan Zheng, Yuanyuan Wang, Li Xu. Originally published in JMIR Formative Research (https://formative.jmir.org), 28.11.2025.)
Contributed Indexing: Keywords: AI; ChatGPT; DeepSeek; artificial intelligence; audiology; generative artificial intelligence; medical education
Entry Date(s): Date Created: 20251128 Date Completed: 20251128 Latest Revision: 20251128
Update Code: 20251129
DOI: 10.2196/79534
PMID: 41313805
Database: MEDLINE