Evaluation of GPT-4 for chest X-ray impression generation: A reader study on performance and perception

The remarkable generative capabilities of multimodal foundation models are currently being explored for a variety of applications. Generating radiological impressions is a challenging task that could significantly reduce the workload of radiologists. In our study we explored and analyzed the generat...

Bibliographic details
Published in: arXiv.org
Main authors: Ziegelmayer, Sebastian, Marka, Alexander W, Lenhart, Nicolas, Nehls, Nadja, Reischl, Stefan, Harder, Felix, Sauter, Andreas, Makowski, Marcus, Graf, Markus, Gawlitza, Joshua
Format: Paper
Language: English
Published: Ithaca, Cornell University Library, arXiv.org, 12 November 2023
Subjects:
ISSN: 2331-8422
Online access: Full text
Abstract The remarkable generative capabilities of multimodal foundation models are currently being explored for a variety of applications. Generating radiological impressions is a challenging task that could significantly reduce the workload of radiologists. In our study, we explored and analyzed the generative abilities of GPT-4 for chest X-ray impression generation. To generate and evaluate impressions of chest X-rays based on different input modalities (image, text, text and image), a blinded radiological report was written for 25 cases from the publicly available NIH dataset. GPT-4 was given the image, the findings section, or both, sequentially, to generate an input-dependent impression. In a blinded, randomized reading, four radiologists rated the impressions and were asked to classify the origin of each impression (human, AI), providing justification for their decision. Lastly, text model evaluation metrics and their correlation with the radiological score (summation of the 4 dimensions) were assessed. According to the radiological score, the human-written impression was rated highest, although not significantly different from text-based impressions. The automated evaluation metrics showed moderate to substantial correlations with the radiological score for the image impressions; however, individual scores were highly divergent among inputs, indicating insufficient representation of radiological quality. Detection of AI-generated impressions varied by input and was 61% for text-based impressions. Impressions classified as AI-generated had significantly worse radiological scores even when written by a radiologist, indicating potential bias. Our study revealed significant discrepancies between a radiological assessment and common automatic evaluation metrics depending on the model input. The detection of AI-generated findings is subject to a bias whereby highly rated impressions are perceived as human-written.
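The abstract describes a radiological score formed by summing four rating dimensions and then correlating it with automated text metrics. The following is a minimal sketch of that kind of computation, not the authors' analysis: the abstract does not name the correlation coefficient, the rating scale, or the metrics, so the use of Spearman rank correlation, the toy ratings, the toy metric values, and every function name below are assumptions made purely for illustration.

```python
from statistics import mean


def radiological_score(ratings):
    """Sum the per-dimension ratings of one impression (four values here)."""
    return sum(ratings)


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data, NOT taken from the study: four impressions,
# each rated on four dimensions, plus one automated metric value each.
toy_ratings = [[5, 4, 5, 4], [3, 3, 2, 3], [4, 4, 4, 5], [2, 1, 2, 2]]
toy_metric = [0.62, 0.35, 0.58, 0.20]

scores = [radiological_score(r) for r in toy_ratings]
rho = spearman(scores, toy_metric)
print(scores, rho)
```

A rank-based coefficient is used here only because summed ordinal ratings are not guaranteed to be linearly related to a text metric; the study itself may have used a different statistic.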
ContentType Paper
Copyright 2023. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.48550/arxiv.2311.06815
Discipline Physics
EISSN 2331-8422
Genre Working Paper/Pre-Print
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
PublicationDate 20231112
PublicationDateYYYYMMDD 2023-11-12
PublicationDate_xml – month: 11
  year: 2023
  text: 20231112
  day: 12
PublicationDecade 2020
PublicationPlace Ithaca
PublicationPlace_xml – name: Ithaca
PublicationTitle arXiv.org
PublicationYear 2023
Publisher Cornell University Library, arXiv.org
Publisher_xml – name: Cornell University Library, arXiv.org
SubjectTerms Bias
X-rays
URI https://www.proquest.com/docview/2889798498