ChatGPT and Claude in Hand Surgery: An Explanatory Evaluation of Clinical Decision Support on Common Surgical Cases

Bibliographic Details
Title: ChatGPT and Claude in Hand Surgery: An Explanatory Evaluation of Clinical Decision Support on Common Surgical Cases
Authors: Gorgos, Pierre; Ternell, Kristian Heder; Hammarstrand, Casper; Wallmon, Anders; Brogren, Elisabeth; Björkman, Anders; Horvath, Alexandra
Contributors: Lund University, Faculty of Medicine, Department of Translational Medicine, Hand Surgery, Malmö (Lunds universitet, Medicinska fakulteten, Institutionen för translationell medicin, Handkirurgi, Malmö); Originator
Source: Hand Surgery and Rehabilitation.
Subjects: Medical and Health Sciences; Clinical Medicine; Orthopaedics
Description: INTRODUCTION: Large language models (LLMs) have gained increasing popularity in several medical disciplines. In orthopedic research, however, their integration into routine practice has been questioned, as they do not seem to outperform experienced clinicians. Research on the role of artificial intelligence in hand surgery, meanwhile, remains limited. This study aims to evaluate two LLMs commonly used in medicine, Chat Generative Pre-trained Transformer (ChatGPT) and Claude, in the clinical hand surgery setting. METHODS: Ten questions pertinent to common hand surgical diagnoses were formulated as prompts and entered into ChatGPT and Claude in a systematic manner. The generated responses were anonymously evaluated by hand surgeons, who assessed the quality of the responses according to the QUEST criteria. Gwet's AC2 was used to evaluate the agreement between raters. RESULTS: In general, ChatGPT and Claude performed statistically similarly across the five dimensions of QUEST: (1) Quality of information, (2) Understanding and reasoning, (3) Expression style and persona, (4) Safety and harm, and (5) Trust and confidence, although with relatively modest scores. Agreement between hand surgeons across all measurements was low according to Gwet's AC2 (0.29). CONCLUSIONS: ChatGPT and Claude perform similarly when given various common hand surgery-related questions. However, they demonstrate significant limitations in clinical accuracy and reliability, which are the foundation of patient safety, treatment efficiency and evidence-based practice. Furthermore, as the perceived performance of ChatGPT and Claude seems to differ between individual hand surgeons, these LLMs in their current state are not suitable for routine clinical use in hand surgery. LEVEL OF EVIDENCE: V.
Access URL: https://doi.org/10.1016/j.hansur.2025.102530
Database: SwePub
ISSN: 2468-1229, 2468-1210
DOI: 10.1016/j.hansur.2025.102530