Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study

Saved in:
Detailed bibliography
Title: Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study
Authors: Pagano, Stefano, Strumolo, Luigi, Michalk, Katrin, Schiegl, Julia, Pulido, Loreto C., Reinhard, Jan, Maderbacher, Guenther, Renkawitz, Tobias, Schuster, Marie
Source: Comput Struct Biotechnol J
Computational and Structural Biotechnology Journal, Vol 28, Iss, Pp 9-15 (2025)
Publisher information: Elsevier BV, 2025.
Publication year: 2025
Subjects: ddc:610, 610 Medizin, TP248.13-248.65, Biotechnology, Research Article, Large Language Models (LLMs), ChatGPT, GPT-4o, Gemini, Llama, Gemma 2, Mistral-Nemo, Hip osteoarthritis, Knee osteoarthritis, Diagnostic sensitivity, Musculoskeletal disorders, Orthopaedic diagnostics, Patient-reported data, Artificial intelligence in healthcare
Description: Background: Large Language Models (LLMs) such as ChatGPT are gaining attention for their potential applications in healthcare. This study aimed to evaluate the diagnostic sensitivity of various LLMs in detecting hip or knee osteoarthritis (OA) using only patient-reported data collected via a structured questionnaire, without prior medical consultation. Methods: A prospective observational study was conducted at an orthopaedic outpatient clinic specialized in hip and knee OA treatment. A total of 115 patients completed a paper-based questionnaire covering symptoms, medical history, and demographic information. The diagnostic performance of several LLMs—including four versions of ChatGPT, two of Gemini, Llama, Gemma 2, and Mistral-Nemo—was analysed. Model-generated diagnoses were compared against those provided by experienced orthopaedic clinicians, which served as the reference standard. Results: GPT-4o achieved the highest diagnostic sensitivity at 92.3 %, significantly outperforming the other LLMs. The completeness of patient responses to symptom-related questions was the strongest predictor of accuracy for GPT-4o (p < 0.001). Inter-model agreement was moderate among GPT-4 versions, whereas models such as Llama-3.1 demonstrated notably lower accuracy and concordance. Conclusions: GPT-4o demonstrated high accuracy and consistency in diagnosing OA based solely on patient-reported questionnaires, underscoring its potential as a supplementary diagnostic tool in clinical settings. Nevertheless, the reliance on patient-reported data without direct physician involvement highlights the critical need for medical oversight to ensure diagnostic accuracy. Further research is needed to refine LLM capabilities and expand their utility in broader diagnostic applications.
Document type: Article
Other literature type
File description: application/pdf
Language: English
ISSN: 2001-0370
DOI: 10.1016/j.csbj.2024.12.013
DOI: 10.5283/epub.74732
Access URL: https://doaj.org/article/05d6d2abb5c54877a32dcafcd7fbb479
https://epub.uni-regensburg.de/74732/
Rights: CC BY
Accession number: edsair.doi.dedup.....439c6bfb10075513e2fd91f39ee369bb
Database: OpenAIRE