Harnessing advanced large language models in otolaryngology board examinations: an investigation using Python and application programming interfaces

Published in: European Archives of Oto-Rhino-Laryngology, Volume 282, Issue 6, pp. 3317-3328
Main authors: Hoch, Cosima C., Funk, Paul F., Guntinas-Lichius, Orlando, Volk, Gerd Fabian, Lüers, Jan-Christoffer, Hussain, Timon, Wirth, Markus, Schmidl, Benedikt, Wollenberg, Barbara, Alfertshofer, Michael
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.06.2025
ISSN: 0937-4477, 1434-4726
Description
Summary:
Purpose: This study aimed to explore the capabilities of advanced large language models (LLMs), including OpenAI’s GPT-4 variants, Google’s Gemini series, and Anthropic’s Claude series, in addressing highly specialized otolaryngology board examination questions. Additionally, the study included a longitudinal assessment of GPT-3.5 Turbo, which had been evaluated on the same set of questions one year earlier, to identify changes in its performance over time.
Methods: We utilized a question bank comprising 2,576 multiple-choice and single-choice questions from a German online education platform tailored for otolaryngology board certification preparation. The questions were submitted to 11 different LLMs, including GPT-3.5 Turbo, GPT-4 variants, Gemini models, and Claude models, through Application Programming Interfaces (APIs) using Python scripts, facilitating efficient data collection and processing.
Results: GPT-4o demonstrated the highest accuracy among all models, particularly excelling in categories such as allergology and head and neck tumors. While the Claude models showed competitive performance, they generally lagged behind the GPT-4 variants. A comparison with GPT-3.5 Turbo’s earlier results revealed a significant decline in its accuracy over the past year. The newer LLMs displayed varied performance levels, with single-choice questions consistently yielding higher accuracy than multiple-choice questions across all models.
Conclusion: While newer LLMs show strong potential in addressing specialized medical content, the observed decline in GPT-3.5 Turbo’s performance over time underscores the necessity for continuous evaluation. This study highlights the critical need for ongoing optimization and efficient API usage to realize the potential of LLMs in medical education and certification.
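The Methods paragraph describes submitting each exam question to the models through their APIs via Python scripts. A minimal sketch of that kind of workflow is given below, using the OpenAI Python SDK; the example question, prompt wording, and answer-checking step are illustrative assumptions, not the authors' actual pipeline or question bank.

    # Sketch: submit one single-choice question to a chat-completion API
    # and score the reply. Requires the openai package (pip install openai)
    # and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical example item, not taken from the study's question bank.
    question = {
        "stem": "Which nerve is at greatest risk during superficial parotidectomy?",
        "options": {"A": "Facial nerve", "B": "Hypoglossal nerve",
                    "C": "Vagus nerve", "D": "Lingual nerve"},
        "correct": "A",
    }

    # Assemble a plain-text prompt: stem, lettered options, answer instruction.
    prompt = question["stem"] + "\n" + "\n".join(
        f"{letter}) {text}" for letter, text in question["options"].items()
    ) + "\nAnswer with the letter of the single best option."

    response = client.chat.completions.create(
        model="gpt-4o",  # one of the models evaluated in the study
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic output aids reproducible scoring
    )

    answer = response.choices[0].message.content.strip()
    print(answer, "-> correct" if answer.startswith(question["correct"])
          else "-> incorrect")

In practice such a loop would iterate over all 2,576 questions and all evaluated models, logging each reply for later accuracy analysis by question type and subspecialty category.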
DOI: 10.1007/s00405-025-09404-x