Reasoning-based LLMs surpass average human performance on medical social skills.
Saved in:
| Title: | Reasoning-based LLMs surpass average human performance on medical social skills. |
|---|---|
| Authors: | Alohali KI; College of Medicine, King Saud University, Riyadh, 11461, Saudi Arabia. khalid.i.alohali@gmail.com., Almusaeeb LA; College of Medicine, King Saud University, Riyadh, 11461, Saudi Arabia., Almubarak AA; College of Medicine, King Saud University, Riyadh, 11461, Saudi Arabia., Alohali AI; College of Computer and Information Sciences, King Saud University, Riyadh, 11461, Saudi Arabia., Muaygil RA; Associate Professor of Healthcare Ethics, Department of Medical Education, College of Medicine, King Saud University, Riyadh, Saudi Arabia. |
| Source: | Scientific reports [Sci Rep] 2025 Oct 17; Vol. 15 (1), pp. 36453. Date of Electronic Publication: 2025 Oct 17. |
| Publication Type: | Journal Article |
| Language: | English |
| Journal Information: | Publisher: Nature Publishing Group Country of Publication: England NLM ID: 101563288 Publication Model: Electronic Cited Medium: Internet ISSN: 2045-2322 (Electronic) Linking ISSN: 20452322 NLM ISO Abbreviation: Sci Rep Subsets: MEDLINE |
| Imprint Name(s): | Original Publication: London : Nature Publishing Group, copyright 2011- |
| MeSH Terms: | Social Skills* , Artificial Intelligence* , Licensure, Medical* , Language*, Humans ; United States ; Communication |
| Abstract: | Competing Interests: Declarations. Competing interests: The authors declare no competing interests. A significant portion of medical licensing examinations assesses key social skills such as communication, ethics, and professionalism, which are vital for quality patient care. Artificial intelligence (AI) has been increasingly integrated into healthcare systems in recent years, raising concerns among regulators, providers, and patients about AI's capacity to handle complex, human-centered scenarios. Previous work has shown that large language models (LLMs) such as GPT-3.5 and GPT-4 perform well on social skills questions from the United States Medical Licensing Examination (USMLE). However, newer models such as GPT-4o, Gemini 1.5 Pro, and o1 have since been introduced, the last of which is designed to mimic human thinking through "chain of thought" reasoning, unlike other LLMs that provide instantaneous answers. The impact of reasoning on LLMs' ability to navigate scenarios requiring social skills remains unclear. Here, we evaluate five LLMs (GPT-4, GPT-4o, Gemini 1.5 Pro, o1-preview, and its full version, o1) using forty USMLE-style social skills questions from the UWORLD question bank covering several categories: communication & interpersonal skills, healthcare policy & economics, system-based practice & quality improvement, and medical ethics & jurisprudence. After each LLM answered, it was given an "Are you sure?" follow-up prompt to test consistency. Our results show that o1, the reasoning model, came out on top with 39 of 40 correct final answers (97.5%). GPT-4o and Gemini 1.5 Pro (87.5%) tied for second place, followed by o1-preview (77.5%) and, lastly, GPT-4 (75%). All LLMs surpassed the UWORLD question bank's 64% average. Domain-specific analysis revealed that, despite equal overall scores, GPT-4o and Gemini 1.5 Pro (developed by two different companies) had differing strengths. GPT-4o was strongest in communication & interpersonal skills and patient safety, while Gemini 1.5 Pro achieved perfect scores in healthcare policy & economics, system-based practice & quality improvement, and medical ethics & jurisprudence. Although o1-preview demonstrated strong initial performance, its inconsistency under skepticism (it changed answers frequently, primarily to incorrect ones) reduced its overall ranking from second to fourth. This phenomenon was not observed in any other model, including the final o1 release, which maintained consistent, high-level performance. These findings, together with prior work, highlight the ability of LLMs to answer knowledge-based social skills questions in a medical context, sometimes surpassing average human performance. As LLMs continue to grow in size and sophistication, their performance is expected to improve further. In particular, the strong performance of reasoning-based LLMs suggests that such architectures hold significant promise for advancing AI's role in socially oriented tasks. These results demonstrate the growing potential of reasoning-based LLMs to complement and enhance clinical training, medical education, and patient care. (© 2025. The Author(s).) |
| References: | Sci Rep. 2024 Nov 10;14(1):27449. (PMID: 39523436) JAMA Intern Med. 2023 Jun 1;183(6):589-596. (PMID: 37115527) BMC Med Educ. 2024 Sep 16;24(1):1013. (PMID: 39285377) Acad Med. 2024 Mar 1;99(3):325-330. (PMID: 37816217) JMIR Med Educ. 2024 Nov 6;10:e63430. (PMID: 39504445) Nature. 2023 Sep 14;:. (PMID: 37704854) BMC Med Educ. 2019 Oct 23;19(1):389. (PMID: 31647012) BMC Med Educ. 2024 Jun 26;24(1):694. (PMID: 38926809) Cureus. 2024 Oct 1;16(10):e70640. (PMID: 39359332) Patterns (N Y). 2023 Aug 04;4(9):100804. (PMID: 37720327) Science. 2019 Oct 25;366(6464):447-453. (PMID: 31649194) Science. 2023 Jun 16;380(6650):1108-1109. (PMID: 37319216) Sci Rep. 2023 Oct 1;13(1):16492. (PMID: 37779171) PLOS Digit Health. 2023 Feb 9;2(2):e0000198. (PMID: 36812645) |
| Contributed Indexing: | Keywords: Artificial intelligence; Large language models (LLMs); Medical education; Medical ethics; Social skills |
| Entry Date(s): | Date Created: 20251017 Date Completed: 20251017 Latest Revision: 20251020 |
| Update Code: | 20251020 |
| PubMed Central ID: | PMC12534372 |
| DOI: | 10.1038/s41598-025-20496-7 |
| PMID: | 41107409 |
| Database: | MEDLINE |