Diagnostic Accuracy and Stability of Multimodal Large Language Models for Hand Fracture Detection: A Multi-Run Evaluation on Plain Radiographs.

Saved in:
Bibliographic Details
Title: Diagnostic Accuracy and Stability of Multimodal Large Language Models for Hand Fracture Detection: A Multi-Run Evaluation on Plain Radiographs.
Authors: Güler, Ibrahim, Grieb, Gerrit, Kraus, Armin, Lautenbach, Martin, Stelling, Henrik
Source: Diagnostics (2075-4418); Feb2026, Vol. 16 Issue 3, p424, 15p
Subject Terms: RADIOGRAPHS, HAND injuries, LANGUAGE models, DIAGNOSIS, COMPUTER-aided diagnosis, DIAGNOSTIC imaging
Abstract: Background/Objectives: Multimodal large language models (MLLMs) offer potential for automated fracture detection, yet their diagnostic stability under repeated inference remains underexplored. This study evaluates the diagnostic accuracy, stability, and intra-model consistency of four MLLMs in detecting hand fractures on plain radiographs. Methods: In total, images of hand radiographs of 65 adult patients with confirmed hand fractures (30 phalangeal, 30 metacarpal, 5 scaphoid) were evaluated by four models: GPT-5 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, and Mistral Medium 3.1. Each image was independently analyzed five times per model using identical zero-shot prompts (1300 total inferences). Diagnostic accuracy, inter-run reliability (Fleiss' κ), case-level agreement profiles, subgroup performance, and exploratory demographic inference (age, sex) were assessed. Results: GPT-5 Pro achieved the highest accuracy (64.3%) and consistency (κ = 0.71), followed by Gemini 2.5 Pro (56.9%, κ = 0.57). Mistral Medium 3.1 exhibited high agreement (κ = 0.88) despite low accuracy (38.5%), indicating systematic error ("confident hallucination"). Claude Sonnet 4.5 showed low accuracy (33.8%) and consistency (κ = 0.33), reflecting instability. While phalangeal fractures were reliably detected by top models, scaphoid fractures remained challenging. Demographic analysis revealed poor capabilities, with age estimation errors exceeding 12 years and sex prediction accuracy near random chance. Conclusions: Diagnostic accuracy and consistency are distinct performance dimensions; high intra-model agreement does not imply correctness. While GPT-5 Pro demonstrated the most favorable balance of accuracy and stability, other models exhibited critical failure modes ranging from systematic bias to random instability. At present, MLLMs should be regarded as experimental diagnostic reasoning systems rather than reliable standalone tools for clinical fracture detection. [ABSTRACT FROM AUTHOR]
Copyright of Diagnostics (2075-4418) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Biomedical Index
Description
Abstract:Background/Objectives: Multimodal large language models (MLLMs) offer potential for automated fracture detection, yet their diagnostic stability under repeated inference remains underexplored. This study evaluates the diagnostic accuracy, stability, and intra-model consistency of four MLLMs in detecting hand fractures on plain radiographs. Methods: In total, images of hand radiographs of 65 adult patients with confirmed hand fractures (30 phalangeal, 30 metacarpal, 5 scaphoid) were evaluated by four models: GPT-5 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, and Mistral Medium 3.1. Each image was independently analyzed five times per model using identical zero-shot prompts (1300 total inferences). Diagnostic accuracy, inter-run reliability (Fleiss' κ), case-level agreement profiles, subgroup performance, and exploratory demographic inference (age, sex) were assessed. Results: GPT-5 Pro achieved the highest accuracy (64.3%) and consistency (κ = 0.71), followed by Gemini 2.5 Pro (56.9%, κ = 0.57). Mistral Medium 3.1 exhibited high agreement (κ = 0.88) despite low accuracy (38.5%), indicating systematic error ("confident hallucination"). Claude Sonnet 4.5 showed low accuracy (33.8%) and consistency (κ = 0.33), reflecting instability. While phalangeal fractures were reliably detected by top models, scaphoid fractures remained challenging. Demographic analysis revealed poor capabilities, with age estimation errors exceeding 12 years and sex prediction accuracy near random chance. Conclusions: Diagnostic accuracy and consistency are distinct performance dimensions; high intra-model agreement does not imply correctness. While GPT-5 Pro demonstrated the most favorable balance of accuracy and stability, other models exhibited critical failure modes ranging from systematic bias to random instability. At present, MLLMs should be regarded as experimental diagnostic reasoning systems rather than reliable standalone tools for clinical fracture detection. [ABSTRACT FROM AUTHOR]
ISSN:20754418
DOI:10.3390/diagnostics16030424