Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board-style Examination
Background ChatGPT (OpenAI) can pass a text-based radiology board-style examination, but its stochasticity and confident language when it is incorrect may limit utility. Purpose To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeat...
Saved in:
| Published in: | Radiology, Vol. 311, No. 2, p. e232715 |
|---|---|
| Main authors: | Krishna, Satheesh; Bhambra, Nishaant; Bleakney, Robert; Bhayana, Rajesh |
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 2024-05-01 |
| Subjects: | Artificial Intelligence; Clinical Competence; Educational Measurement - methods; Humans; Prospective Studies; Radiology; Reproducibility of Results; Specialty Boards |
| ISSN: | 1527-1315 |
| Online access: | Further information |
| Abstract | Background ChatGPT (OpenAI) can pass a text-based radiology board-style examination, but its stochasticity and confident language when it is incorrect may limit utility. Purpose To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board-style examination. Materials and Methods In this exploratory prospective study, 150 radiology board-style multiple-choice text-based questions, previously used to benchmark ChatGPT, were administered to default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1-10 (with 10 being the highest level of confidence and 1 being the lowest) on the third attempt and after each challenge prompt. Results Neither version showed a difference in accuracy over three attempts: for the first, second, and third attempt, accuracy of GPT-3.5 was 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); and accuracy of GPT-4 was 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across three attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After challenge prompt, both changed responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both rated "high confidence" (≥8 on the 1-10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; and GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; and GPT-4, 77% [27 of 35], respectively; P = .89). Conclusion Default GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both had poor repeatability and robustness and were frequently overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but more influenced by an adversarial prompt. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Ballard in this issue. |
|---|---|
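The protocol described in the abstract (an initial answer per question, three adversarial challenges on the third attempt, and a 1-10 confidence rating after each turn) can be approximated programmatically. The sketch below is illustrative only: it assumes the openai Python client and API model names such as "gpt-4", whereas the study used the default ChatGPT web interface, and everything except the challenge wording quoted in the abstract is an assumption.

```python
# Illustrative sketch of the repeated/adversarial prompting protocol, NOT the authors' setup.
# Assumes the openai Python client (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Challenge wording quoted verbatim in the abstract; confidence prompt paraphrased from it.
CHALLENGE = "Your answer choice is incorrect. Please choose a different option."
CONFIDENCE_PROMPT = ("Rate your confidence in your answer from 1-10, with 10 being the "
                     "highest level of confidence and 1 being the lowest.")

def ask_with_challenges(model: str, question: str, n_challenges: int = 3) -> list[dict]:
    """Pose one multiple-choice question, record the answer and a self-rated confidence,
    then repeat after each adversarial challenge, logging every turn."""
    messages = [{"role": "user", "content": question}]
    turns = []
    for turn in range(n_challenges + 1):
        reply = client.chat.completions.create(model=model, messages=messages)
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})

        # Confidence elicitation after the initial answer and after each challenge.
        messages.append({"role": "user", "content": CONFIDENCE_PROMPT})
        rating = client.chat.completions.create(model=model, messages=messages)
        confidence = rating.choices[0].message.content
        messages.append({"role": "assistant", "content": confidence})

        turns.append({"turn": turn, "answer": answer, "confidence": confidence})

        if turn < n_challenges:
            # Adversarial challenge, applied regardless of whether the answer is correct.
            messages.append({"role": "user", "content": CHALLENGE})
    return turns
```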
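The repeatability figures quoted in the abstract (percent agreement of answer choices across the three attempts and intrarater κ) can be recomputed from raw answer logs. The paper does not state which kappa variant or software was used, so the minimal sketch below, assuming scikit-learn, shows averaged pairwise Cohen's kappa only as one plausible reading of "intrarater agreement".

```python
# Minimal sketch (not the authors' code) of the repeatability metrics quoted in the abstract.
# Assumes answers are letters such as 'A'-'D', in the same question order for every attempt.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def repeatability(attempts: list[list[str]]) -> dict:
    """attempts: one list of answer choices per attempt (e.g., three lists of 150 letters)."""
    n_questions = len(attempts[0])
    # A question counts toward agreement only if the same answer was chosen on every attempt;
    # the abstract reports 115 of 150 (76.7%) such questions for GPT-4 vs 92 of 150 for GPT-3.5.
    consistent = sum(len({attempt[q] for attempt in attempts}) == 1 for q in range(n_questions))
    # Pairwise Cohen's kappa between attempts, averaged (reported κ: 0.78 for GPT-4, 0.64 for GPT-3.5).
    kappas = [cohen_kappa_score(a, b) for a, b in combinations(attempts, 2)]
    return {
        "percent_agreement": consistent / n_questions,
        "mean_pairwise_kappa": sum(kappas) / len(kappas),
    }
```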
| Author | Bleakney, Robert; Krishna, Satheesh; Bhayana, Rajesh; Bhambra, Nishaant |
| Author_xml | – sequence 1: Krishna, Satheesh (ORCID 0000-0001-6603-7621) – sequence 2: Bhambra, Nishaant (ORCID 0000-0003-4966-6578) – sequence 3: Bleakney, Robert (ORCID 0009-0000-4493-2676) – sequence 4: Bhayana, Rajesh (ORCID 0000-0002-8352-7953). Organization (all authors): From the University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Mount Sinai Hospital and Women's College Hospital, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Toronto, ON, Canada M5G 2C4 (S.K., R. Bleakney, R. Bhayana); Department of Medical Imaging, University of Toronto, Toronto, ON, Canada (S.K., R. Bleakney, R. Bhayana); and Department of Family Medicine, University of Ottawa, Ottawa, ON, Canada (N.B.) |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/38771184 (view this record in MEDLINE/PubMed) |
| CitedBy_id | crossref_primary_10_2214_AJR_24_32341 crossref_primary_10_1016_j_xcrm_2025_101988 crossref_primary_10_3348_kjr_2025_0599 crossref_primary_10_1148_radiol_241173 crossref_primary_10_1016_j_media_2025_103749 crossref_primary_10_1111_hel_13131 crossref_primary_10_1148_radiol_241532 crossref_primary_10_1148_radiol_241554 crossref_primary_10_7759_cureus_77550 crossref_primary_10_1111_liv_70115 crossref_primary_10_1148_radiol_241711 crossref_primary_10_1093_jamiaopen_ooaf073 crossref_primary_10_1007_s12149_024_01992_8 crossref_primary_10_1007_s11604_024_01718_w crossref_primary_10_1148_radiol_241516 crossref_primary_10_1007_s10278_024_01233_4 crossref_primary_10_1093_bjr_tqae236 crossref_primary_10_3390_jcm13216512 crossref_primary_10_1016_j_acra_2024_11_028 crossref_primary_10_2196_64348 crossref_primary_10_3390_knowledge5010004 crossref_primary_10_4258_hir_2025_31_3_295 crossref_primary_10_3348_kjr_2024_1096 crossref_primary_10_1055_s_0044_1793914 crossref_primary_10_1111_iej_14217 crossref_primary_10_1148_radiol_242134 crossref_primary_10_1148_radiol_242454 crossref_primary_10_1148_radiol_241766 crossref_primary_10_1016_j_nlp_2025_100145 crossref_primary_10_1097_RCT_0000000000001709 crossref_primary_10_1016_j_acra_2024_09_005 crossref_primary_10_1055_a_2641_3059 crossref_primary_10_2196_69313 crossref_primary_10_1148_radiol_232346 crossref_primary_10_1007_s11604_024_01633_0 crossref_primary_10_1002_INMD_20240063 crossref_primary_10_1111_jerd_13447 |
| ContentType | Journal Article |
| DOI | 10.1148/radiol.232715 |
| DatabaseName | Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed MEDLINE - Academic |
| DatabaseTitle | MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) MEDLINE - Academic |
| DatabaseTitleList | MEDLINE MEDLINE - Academic |
| Discipline | Medicine |
| EISSN | 1527-1315 |
| ExternalDocumentID | 38771184 |
| Genre | Journal Article |
| ISICitedReferencesCount | 45 |
| ISSN | 1527-1315 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 2 |
| Language | English |
| ORCID | 0000-0003-4966-6578 0000-0001-6603-7621 0009-0000-4493-2676 0000-0002-8352-7953 |
| PMID | 38771184 |
| PQID | 3057693456 |
| PQPubID | 23479 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-05-01 |
| PublicationDateYYYYMMDD | 2024-05-01 |
| PublicationDecade | 2020 |
| PublicationPlace | United States |
| PublicationTitle | Radiology |
| PublicationTitleAlternate | Radiology |
| PublicationYear | 2024 |
| Snippet | Background ChatGPT (OpenAI) can pass a text-based radiology board-style examination, but its stochasticity and confident language when it is incorrect may... |
| SourceID | proquest pubmed |
| SourceType | Aggregation Database Index Database |
| StartPage | e232715 |
| SubjectTerms | Artificial Intelligence Clinical Competence Educational Measurement - methods Humans Prospective Studies Radiology Reproducibility of Results Specialty Boards |
| Title | Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board-style Examination |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/38771184 https://www.proquest.com/docview/3057693456 |
| Volume | 311 |
| WOSCitedRecordID | wos001250822700012 |