Bibliographic Details
| Title: |
A Comparative Assessment of ChatGPT, Gemini, and DeepSeek Accuracy: Examining Visual Medical Assessment in Internal Medicine Cases with and Without Clinical Context. |
| Authors: |
Asiri, Rayah; Ishaqui, Azfar Athar; Ahmad, Salman Ashfaq; Imran, Muhammad; Orayj, Khalid; Iqbal, Adnan
| Source: |
Diagnostics (2075-4418); Feb2026, Vol. 16 Issue 3, p388, 16p |
| Subject Terms: |
DIAGNOSIS, INTERNAL medicine, CHATGPT, MEDICAL records, DIAGNOSTIC imaging, LANGUAGE models, GEMINI (Chatbot) |
| Abstract: |
Background and Aim: Large language models (LLMs) demonstrate significant potential in assisting with medical image interpretation. However, the diagnostic accuracy of general-purpose LLMs on image-based internal medicine cases, and the added value of a brief clinical history, remain unclear. This study evaluated three general-purpose LLMs (ChatGPT, Gemini, and DeepSeek) on expert-curated cases to quantify diagnostic accuracy with image-only input versus image plus brief clinical context. Methods: We conducted a comparative evaluation using 138 expert-curated cases from Harrison's Visual Case Challenge. Each case was presented to the models in two distinct phases: Phase 1 (image only) and Phase 2 (image plus a brief clinical history). The primary endpoint was top-1 diagnostic accuracy for the textbook diagnosis, comparing performance with versus without a brief clinical history. Secondary/exploratory analyses compared models and assessed agreement between model-generated differential lists and the textbook differential. Statistical analysis included Wilson 95% confidence intervals, McNemar's tests, Cochran's Q with Benjamini–Hochberg correction, and Wilcoxon signed-rank tests. Results: The inclusion of clinical history substantially improved diagnostic accuracy for all models. ChatGPT's accuracy increased from 50.7% in Phase 1 to 80.4% in Phase 2, Gemini's improved from 39.9% to 72.5%, and DeepSeek's rose from 30.4% to 75.4%. In Phase 2, diagnostic accuracy reached at least 65% across most disease-nature and organ-system categories. However, agreement with the reference differential diagnoses remained modest, with average overlap rates of 6.99% for ChatGPT, 36.39% for Gemini, and 32.74% for DeepSeek. Conclusions: The provision of a brief clinical history significantly enhances the diagnostic accuracy of large language models on visual internal medicine cases. In this benchmark, performance differences between models were smaller in Phase 2 than in Phase 1. While diagnostic precision improves markedly, the models' ability to generate comprehensive differential diagnoses that align with expert consensus remains limited. These findings underscore the utility of context-aware, multimodal LLMs for educational support and structured diagnostic practice in supervised settings, while also highlighting the need for more sophisticated, semantics-sensitive benchmarks for evaluating diagnostic reasoning. [ABSTRACT FROM AUTHOR]
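The reported accuracies imply integer case counts at n = 138 (e.g., 50.7% ≈ 70/138). As a minimal sketch, assuming those back-calculated counts, the Wilson score intervals named in the Methods can be reproduced in plain Python; the paired McNemar and Cochran's Q comparisons would additionally require per-case outcomes, which the abstract does not report.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Counts back-calculated (assumed) from the reported percentages with n = 138 cases.
n = 138
results = {
    "ChatGPT":  {"phase1": 70,  "phase2": 111},  # 50.7% -> 80.4%
    "Gemini":   {"phase1": 55,  "phase2": 100},  # 39.9% -> 72.5%
    "DeepSeek": {"phase1": 42,  "phase2": 104},  # 30.4% -> 75.4%
}

for model, phases in results.items():
    for phase, k in phases.items():
        lo, hi = wilson_ci(k, n)
        print(f"{model} {phase}: {k}/{n} = {k/n:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

The same intervals are available off the shelf via statsmodels' proportion_confint(count, nobs, method='wilson'); the hand-rolled version above is shown only to make the formula explicit.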
Copyright of Diagnostics (2075-4418) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) |
| Database: |
Biomedical Index |