Sailing the Seven Seas: A Multinational Comparison of ChatGPT’s Performance on Medical Licensing Examinations

Bibliographic Details
Title: Sailing the Seven Seas: A Multinational Comparison of ChatGPT’s Performance on Medical Licensing Examinations
Authors: Alfertshofer, Michael; Hoch, Cosima C.; Funk, Paul F.; Hollmann, Katharina; Wollenberg, Barbara; Knoedler, Samuel; Knoedler, Leonard
Source: Ann Biomed Eng
Publisher Information: Springer Science and Business Media LLC, 2023.
Publication Year: 2023
Subject Terms: Letter to the Editor, ChatGPT, OpenAI, Artificial intelligence, Medical education, Clinical decision-making, Medical licensing exams, Italy [MeSH], Education, Medical [MeSH], Humans [MeSH], Licensure, Medical/standards [MeSH], Educational Measurement/methods [MeSH], Good Health and Well-Being [SDG 3]
Description: Purpose: AI-powered technology, particularly OpenAI's ChatGPT, holds significant potential to reshape healthcare and medical education. Despite existing studies on ChatGPT's performance on medical licensing examinations in individual nations, a comprehensive, multinational analysis using rigorous methodology is currently lacking. Our study sought to address this gap by evaluating ChatGPT's performance on six national medical licensing exams and investigating the relationship between test question length and ChatGPT's accuracy. Methods: We manually entered a total of 1,800 test questions (300 each from the US, Italian, French, Spanish, UK, and Indian medical licensing examinations) into ChatGPT and recorded the accuracy of its responses. Results: ChatGPT's accuracy varied significantly across countries, with the highest accuracy on the Italian examination (73% of answers correct) and the lowest on the French examination (22% of answers correct). Question length correlated with ChatGPT's performance on the Italian and French state examinations only. In addition, questions requiring multiple correct answers, as in the French examination, posed a greater challenge to ChatGPT. Conclusion: Our findings underscore the need for future research to further delineate ChatGPT's strengths and limitations in medical test-taking across additional countries, and to develop guidelines to prevent AI-assisted cheating in medical examinations.
Document Type: Article; Other literature type
File Description: application/pdf
Language: English
ISSN: 1573-9686 (electronic); 0090-6964 (print)
DOI: 10.1007/s10439-023-03338-3
Access URL: https://repository.publisso.de/resource/frl:6496359
https://epub.ub.uni-muenchen.de/116284/
https://mediatum.ub.tum.de/doc/1764177/document.pdf
Rights: CC BY
Accession Number: edsair.doi.dedup.....aca299dad538cf62046210c14c7e5fb4
Database: OpenAIRE