From Prompts to Practice: Evaluating ChatGPT, Gemini, and Grok Against Plastic Surgeons in Local Flap Decision-Making.

Bibliographic Details
Title: From Prompts to Practice: Evaluating ChatGPT, Gemini, and Grok Against Plastic Surgeons in Local Flap Decision-Making.
Authors: Marcaccini, Gianluca, Corradini, Luca, Shadid, Omar, Seth, Ishith, Rozen, Warren M., Grimaldi, Luca, Cuomo, Roberto
Source: Diagnostics (2075-4418); Oct2025, Vol. 15 Issue 20, p2646, 13p
Subject Terms: PLASTIC surgery, GROK (Chatbot), SURGICAL flaps, CHATGPT, GENERATIVE artificial intelligence, GEMINI (Chatbot), DECISION support systems
Company/Entity: OPENAI Inc.
Abstract: Background: Local flaps are a cornerstone of reconstructive plastic surgery for oncological skin defects, ensuring functional recovery and aesthetic integration. Their selection, however, varies with surgeon experience. Generative artificial intelligence has emerged as a potential decision-support tool, although its clinical role remains uncertain. Methods: We evaluated three generative AI platforms (ChatGPT-5 by OpenAI, Grok by xAI, and Gemini by Google DeepMind) in their free-access versions available in September 2025. Ten preoperative photographs of suspected cutaneous neoplastic lesions from diverse facial and limb sites were submitted to each platform in a two-step task: concise description of site, size, and tissue involvement, followed by the single most suitable local flap for reconstruction. Outputs were compared with the unanimous consensus of experienced plastic surgeons. Results: Performance differed across models. ChatGPT-5 consistently described lesion size accurately and achieved complete concordance with surgeons in flap selection. Grok showed intermediate performance, tending to recognise tissue planes better than lesion size and proposing flaps that were often acceptable but not always the preferred choice. Gemini estimated size well, yet was inconsistent for anatomical site, tissue involvement, and flap recommendation. When partially correct answers were considered acceptable, differences narrowed but the overall ranking remained unchanged. Conclusion: Generative AI can support reconstructive reasoning from clinical images with variable reliability. In this series, ChatGPT-5 was the most dependable for local flap planning, suggesting a potential role in education and preliminary decision-making. Larger studies using standardised image acquisition and explicit uncertainty reporting are needed to confirm clinical applicability and safety. [ABSTRACT FROM AUTHOR]
Database: Biomedical Index
ISSN: 2075-4418
DOI: 10.3390/diagnostics15202646