The Validity of Generative Artificial Intelligence in Evaluating Medical Students in Objective Structured Clinical Examination: Experimental Study.
| Title: | The Validity of Generative Artificial Intelligence in Evaluating Medical Students in Objective Structured Clinical Examination: Experimental Study. |
|---|---|
| Authors: | Yokose M¹, Hirosawa T¹, Sakamoto T¹, Kawamura R¹, Suzuki Y², Harada Y¹, Shimizu T¹ (¹Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan; ²Department of Internal Medicine, Yamagata Prefectural Kahoku Hospital, Yamagata, Japan) |
| Source: | JMIR formative research [JMIR Form Res] 2025 Dec 04; Vol. 9, pp. e79465. Date of Electronic Publication: 2025 Dec 04. |
| Publication Type: | Journal Article |
| Language: | English |
| Journal Info: | Publisher: JMIR Publications; Country of Publication: Canada; NLM ID: 101726394; Publication Model: Electronic; Cited Medium: Internet; ISSN: 2561-326X (Electronic); Linking ISSN: 2561326X; NLM ISO Abbreviation: JMIR Form Res; Subsets: MEDLINE |
| Imprint Name(s): | Original Publication: Toronto, ON, Canada : JMIR Publications, [2017]- |
| MeSH Terms: | Artificial Intelligence*/standards; Students, Medical*/statistics & numerical data; Educational Measurement*/methods; Educational Measurement*/standards; Clinical Competence*/standards; Humans; Japan; Male; Female; Reproducibility of Results; Generative Artificial Intelligence |
| Abstract: | Background: The Objective Structured Clinical Examination (OSCE) has been widely used to evaluate students in medical education. However, it is resource-intensive, presenting challenges in implementation. We hypothesized that generative artificial intelligence (AI) such as ChatGPT-4 could serve as a complementary assessor and alleviate the burden on physicians evaluating the OSCE. Objective: By comparing evaluation scores between generative AI and physicians, this study aims to assess the validity of generative AI as a complementary assessor for the OSCE. Methods: This experimental study was conducted at a medical university in Japan. We recruited 11 fifth-year medical students during the general internal medicine clerkship from April 2023 to December 2023. Participants conducted a mock medical interview with a patient experiencing abdominal pain and wrote patient notes. Four physicians independently evaluated the participants by reviewing medical interview videos and patient notes, while ChatGPT-4 was provided with interview transcripts and notes. Evaluations were conducted using a 6-domain rubric (patient care and communication, history taking, physical examination, patient notes, clinical reasoning, and management). Each domain was scored on a 6-point Likert scale, ranging from 1 (very poor) to 6 (excellent). Median scores were compared using the Wilcoxon signed-rank test, and agreement between ChatGPT-4 and the physicians was assessed using intraclass correlation coefficients (ICCs). P values <.05 were considered statistically significant. Results: Although ChatGPT-4 assigned higher scores than physicians for physical examination (median 4.0, IQR 4.0-5.0 vs median 4.0, IQR 3.0-4.0; P=.02), patient notes (median 6.0, IQR 5.0-6.0 vs median 4.0, IQR 4.0-4.0; P=.002), clinical reasoning (median 5.0, IQR 5.0-5.0 vs median 4.0, IQR 3.0-4.0; P<.001), and management (median 6.0, IQR 5.0-6.0 vs median 4.0, IQR 2.5-4.5; P=.002), there were no significant differences for patient care and communication (median 5.0, IQR 5.0-5.0 vs median 5.0, IQR 4.0-5.0; P=.06) or history taking (median 5.0, IQR 4.0-5.0 vs median 5.0, IQR 4.0-5.0; P>.99). ICC values were low in all domains; even the highest, for history taking, still indicated poor agreement (ICC=0.36, 95% CI -0.32 to 0.78). Conclusions: ChatGPT-4 produced higher evaluation scores than physicians in several OSCE domains, though the agreement between them was poor. Although these preliminary results suggest that generative AI may be able to support assessment in some OSCE domains, further research is needed to establish its reproducibility and validity. Generative AI such as ChatGPT-4 shows potential as a complementary assessor for the OSCE. Trial Registration: University Hospital Medical Information Network Clinical Trials Registry UMIN000050489; https://center6.umin.ac.jp/cgi-open-bin/ctr/ctr_his_list.cgi?recptno=R000057513. (©Masashi Yokose, Takanobu Hirosawa, Tetsu Sakamoto, Ren Kawamura, Yudai Suzuki, Yukinori Harada, Taro Shimizu. Originally published in JMIR Formative Research (https://formative.jmir.org), 04.12.2025.) |
| Contributed Indexing: | Keywords: ChatGPT-4; OSCE; generative artificial intelligence; medical education; medical students; objective structured clinical examination |
| Entry Date(s): | Date Created: 20251204 Date Completed: 20251204 Latest Revision: 20251204 |
| Update Code: | 20251205 |
| DOI: | 10.2196/79465 |
| PMID: | 41343812 |
| Database: | MEDLINE |
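
The Methods in the abstract pair two analyses: a Wilcoxon signed-rank test comparing paired domain scores and intraclass correlation coefficients (ICCs) for rater agreement. Below is a minimal Python sketch of that style of analysis. The scores are hypothetical placeholders rather than the study's data, the four physicians' ratings are assumed to have been combined into one score per student, and the two-way random-effects, absolute-agreement form ICC(2,1) is an assumption, since the abstract does not state which ICC model was used.

```python
# Minimal sketch of the analysis style described in the Methods above.
# All scores are HYPOTHETICAL illustrations, not the study's data.
import numpy as np
from scipy.stats import wilcoxon

# Paired per-student scores for one domain (6-point Likert, 11 students).
gpt4_scores      = np.array([5, 5, 5, 4, 5, 5, 6, 5, 5, 5, 5])  # hypothetical
physician_scores = np.array([4, 3, 4, 4, 3, 4, 5, 4, 3, 4, 4])  # hypothetical

# Wilcoxon signed-rank test for the paired comparison of scores.
stat, p = wilcoxon(gpt4_scores, physician_scores)
print(f"Wilcoxon W={stat:.1f}, P={p:.3f}")

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n subjects x k raters) matrix (Shrout & Fleiss, 1979).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    ms_r = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects
    ms_c = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    resid = (ratings - ratings.mean(axis=1, keepdims=True)
                     - ratings.mean(axis=0, keepdims=True) + grand)
    ms_e = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Treat ChatGPT-4 and the (combined) physician score as two raters.
print(f"ICC(2,1)={icc_2_1(np.column_stack([gpt4_scores, physician_scores])):.2f}")
```

With only 11 paired observations per domain, any such test has limited power, which is consistent with the abstract's framing of the results as preliminary.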