Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board-style Examination


Bibliographic details
Published in: Radiology, Vol. 311, No. 2, p. e232715
Main authors: Krishna, Satheesh; Bhambra, Nishaant; Bleakney, Robert; Bhayana, Rajesh
Format: Journal Article
Language: English
Published: United States, May 1, 2024
ISSN: 1527-1315
Abstract
Background: ChatGPT (OpenAI) can pass a text-based radiology board-style examination, but its stochasticity and confident language when it is incorrect may limit utility.
Purpose: To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board-style examination.
Materials and Methods: In this exploratory prospective study, 150 radiology board-style multiple-choice text-based questions, previously used to benchmark ChatGPT, were administered to default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1-10 (with 10 being the highest level of confidence and 1 being the lowest) on the third attempt and after each challenge prompt.
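The Materials and Methods paragraph above fully specifies the prompting protocol: one initial question, three adversarial challenges, and a confidence rating at each step. A minimal sketch of that loop follows, approximated with the OpenAI Python SDK; the study itself used the default ChatGPT web interface, so the model identifiers, the appended confidence instruction, and the ask() helper are illustrative assumptions rather than the authors' code (the challenge wording is quoted from the abstract).

```python
# Sketch of the repeated-prompting protocol: one initial question, then three
# adversarial challenges, keeping the conversation history so each challenge
# refers to the previous answer. Assumptions: API access instead of the ChatGPT
# web interface, model names such as "gpt-4", and the confidence instruction text.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CHALLENGE = "Your answer choice is incorrect. Please choose a different option."

def ask(model: str, question: str, n_challenges: int = 3) -> list[str]:
    """Return the model's reply to the initial prompt and to each challenge."""
    messages = [{
        "role": "user",
        "content": question + "\nAlso rate your confidence in your answer from 1 to 10.",
    }]
    replies = []
    for turn in range(1 + n_challenges):
        response = client.chat.completions.create(model=model, messages=messages)
        text = response.choices[0].message.content
        replies.append(text)
        if turn < n_challenges:
            messages.append({"role": "assistant", "content": text})
            messages.append({"role": "user", "content": CHALLENGE})
    return replies
```

Each returned reply would then be scored (answer choice and stated confidence) before computing the reliability, repeatability, and robustness metrics reported below.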
Results: Neither version showed a difference in accuracy over three attempts: for the first, second, and third attempts, accuracy of GPT-3.5 was 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); and accuracy of GPT-4 was 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across the three attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After the challenge prompt, both models changed responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both rated "high confidence" (≥8 on the 1-10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; GPT-4, 77% [27 of 35]; P = .89).
Conclusion: Default GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both had poor repeatability and robustness and were frequently overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but was more influenced by an adversarial prompt.
© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Ballard in this issue.
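The reliability, repeatability, and agreement figures in the Results can be computed from the recorded answer choices. The sketch below is a minimal illustration with toy data, not the study's responses; the abstract does not state which kappa variant was used for intrarater agreement, so mean pairwise Cohen's kappa (via scikit-learn) is an assumption here.

```python
# Per-attempt accuracy (reliability), fraction of questions answered identically
# on all attempts (repeatability/agreement), and a kappa-based agreement estimate.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def accuracy(answers: list[str], key: list[str]) -> float:
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def consistent_fraction(attempts: list[list[str]]) -> float:
    """Fraction of questions with the same answer choice on every attempt."""
    return sum(len(set(choices)) == 1 for choices in zip(*attempts)) / len(attempts[0])

def mean_pairwise_kappa(attempts: list[list[str]]) -> float:
    pairs = list(combinations(attempts, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# Toy data: 3 attempts x 5 questions (illustrative only).
key = ["A", "C", "B", "D", "A"]
attempts = [["A", "C", "B", "D", "B"],
            ["A", "C", "B", "A", "B"],
            ["A", "C", "B", "D", "B"]]
print([round(accuracy(a, key), 2) for a in attempts])  # accuracy on each attempt
print(round(consistent_fraction(attempts), 2))         # agreement across attempts
print(round(mean_pairwise_kappa(attempts), 2))         # intrarater agreement proxy
```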
Author Bleakney, Robert
Krishna, Satheesh
Bhayana, Rajesh
Bhambra, Nishaant
Author_xml – sequence: 1; fullname: Krishna, Satheesh; orcidid: 0000-0001-6603-7621
– sequence: 2; fullname: Bhambra, Nishaant; orcidid: 0000-0003-4966-6578
– sequence: 3; fullname: Bleakney, Robert; orcidid: 0009-0000-4493-2676
– sequence: 4; fullname: Bhayana, Rajesh; orcidid: 0000-0002-8352-7953
  organization (all authors): From the University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Mount Sinai Hospital and Women's College Hospital, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Toronto, ON, Canada M5G 2C4 (S.K., R. Bleakney, R. Bhayana); Department of Medical Imaging, University of Toronto, Toronto, ON, Canada (S.K., R. Bleakney, R. Bhayana); and Department of Family Medicine, University of Ottawa, Ottawa, ON, Canada (N.B.)
BackLink https://www.ncbi.nlm.nih.gov/pubmed/38771184 (View this record in MEDLINE/PubMed)
CitedBy_id crossref_primary_10_2214_AJR_24_32341
crossref_primary_10_1016_j_xcrm_2025_101988
crossref_primary_10_3348_kjr_2025_0599
crossref_primary_10_1148_radiol_241173
crossref_primary_10_1016_j_media_2025_103749
crossref_primary_10_1111_hel_13131
crossref_primary_10_1148_radiol_241532
crossref_primary_10_1148_radiol_241554
crossref_primary_10_7759_cureus_77550
crossref_primary_10_1111_liv_70115
crossref_primary_10_1148_radiol_241711
crossref_primary_10_1093_jamiaopen_ooaf073
crossref_primary_10_1007_s12149_024_01992_8
crossref_primary_10_1007_s11604_024_01718_w
crossref_primary_10_1148_radiol_241516
crossref_primary_10_1007_s10278_024_01233_4
crossref_primary_10_1093_bjr_tqae236
crossref_primary_10_3390_jcm13216512
crossref_primary_10_1016_j_acra_2024_11_028
crossref_primary_10_2196_64348
crossref_primary_10_3390_knowledge5010004
crossref_primary_10_4258_hir_2025_31_3_295
crossref_primary_10_3348_kjr_2024_1096
crossref_primary_10_1055_s_0044_1793914
crossref_primary_10_1111_iej_14217
crossref_primary_10_1148_radiol_242134
crossref_primary_10_1148_radiol_242454
crossref_primary_10_1148_radiol_241766
crossref_primary_10_1016_j_nlp_2025_100145
crossref_primary_10_1097_RCT_0000000000001709
crossref_primary_10_1016_j_acra_2024_09_005
crossref_primary_10_1055_a_2641_3059
crossref_primary_10_2196_69313
crossref_primary_10_1148_radiol_232346
crossref_primary_10_1007_s11604_024_01633_0
crossref_primary_10_1002_INMD_20240063
crossref_primary_10_1111_jerd_13447
ContentType Journal Article
DOI 10.1148/radiol.232715
DatabaseName MEDLINE
MEDLINE (Ovid)
PubMed
MEDLINE - Academic
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
MEDLINE - Academic
DatabaseTitleList MEDLINE
MEDLINE - Academic
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod no_fulltext_linktorsrc
Discipline Medicine
EISSN 1527-1315
ExternalDocumentID 38771184
Genre Journal Article
ISICitedReferencesCount 45
ISSN 1527-1315
IngestDate Fri Sep 05 06:26:32 EDT 2025
Mon Jul 21 05:58:35 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 2
Language English
LinkModel DirectLink
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ORCID 0000-0003-4966-6578
0000-0001-6603-7621
0009-0000-4493-2676
0000-0002-8352-7953
PMID 38771184
PQID 3057693456
PQPubID 23479
ParticipantIDs proquest_miscellaneous_3057693456
pubmed_primary_38771184
PublicationCentury 2000
PublicationDate 2024-05-00
20240501
PublicationDateYYYYMMDD 2024-05-01
PublicationDate_xml – month: 05
  year: 2024
  text: 2024-05-00
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Radiology
PublicationTitleAlternate Radiology
PublicationYear 2024
SSID ssj0014587
SourceID proquest
pubmed
SourceType Aggregation Database
Index Database
StartPage e232715
SubjectTerms Artificial Intelligence
Clinical Competence
Educational Measurement - methods
Humans
Prospective Studies
Radiology
Reproducibility of Results
Specialty Boards
Title Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board-style Examination
URI https://www.ncbi.nlm.nih.gov/pubmed/38771184
https://www.proquest.com/docview/3057693456
Volume 311
WOSCitedRecordID wos001250822700012
linkProvider ProQuest