Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions

Detailed bibliography
Published in: PLOS digital health, Volume 3, Issue 9, p. e0000604
Main authors: Tarabanis, Constantine; Zahid, Sohail; Mamalis, Marios; Zhang, Kevin; Kalampokis, Evangelos; Jankelson, Lior
Format: Journal Article
Language: English
Publication details: United States: Public Library of Science, 01.09.2024
ISSN: 2767-3170
Abstract Ongoing research attempts to benchmark large language models (LLMs) against physicians’ fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on whether knowledge supplied to the models from medical texts improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison’s Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations for 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanations to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5–80.7%, outperforming GPT-3.5, human respondents, LaMDA and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. Performance of both GPT-3.5 and GPT-4.0 decreased by 3.2–5.3% when the LLM was accessed through its API instead of its online chatbot, and increased by 4.5–7.5% when the API-accessed models received additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting model inputs with domain-specific information improved performance, rendering Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses to medical examination questions.
Author summary Following the recent popularization of large language models (LLMs), medical research is attempting to benchmark LLMs’ medical competency against that of practicing physicians. No published studies have investigated LLM performance on the Internal Medicine (IM) Board Examination required for physicians to certify in the IM specialty. We assessed the performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, on 240 IM board-style questions sourced from the Medical Knowledge Self-Assessment Program, a preparatory question bank. GPT-4.0 scored 77.5–80.7%, outperforming GPT-3.5, human respondents, LaMDA and Llama 2, in that order. GPT-4.0 outperformed human respondents on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. Test scores of both GPT-3.5 and GPT-4.0 increased after model input augmentation using Harrison’s Principles of Internal Medicine, a standard medical textbook. The ability to correctly answer an array of multidisciplinary questions while providing supporting explanations speaks to LLMs’ potential as a study aid and medical assistant, and the ability to improve LLM test performance by augmenting model inputs with a standard medical textbook provides a technical approach to improving their factual accuracy.
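The input-augmentation step described above pairs each board-style question with passages retrieved from a medical textbook before the question is sent to the model. The sketch below is not the authors' code; it illustrates the general Retrieval Augmented Generation pattern under simplified assumptions: passages are ranked by naive word overlap rather than by embedding similarity, the textbook snippets and the question are invented placeholders, and the resulting prompt would then be submitted to whichever model API is being evaluated.

# Minimal Python sketch of Retrieval Augmented Generation-style input augmentation
# (illustrative only, not the study's implementation): rank candidate textbook
# passages by word overlap with the question and prepend the best matches to the
# prompt that would be sent to the model's API.

def overlap_score(question: str, passage: str) -> int:
    """Count words shared between the question and a candidate passage."""
    return len(set(question.lower().split()) & set(passage.lower().split()))

def augment_prompt(question: str, passages: list[str], top_k: int = 2) -> str:
    """Build an augmented prompt from the top-k retrieved passages plus the question."""
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return ("Use the following reference text to answer the question.\n\n"
            f"{context}\n\n"
            f"Question:\n{question}\n"
            "Answer with the single best option.")

if __name__ == "__main__":
    # Placeholder snippets standing in for chunks of a medical textbook.
    textbook_passages = [
        "Rheumatoid arthritis is a chronic inflammatory arthritis that typically causes symmetric small-joint swelling.",
        "Community-acquired pneumonia in low-risk outpatients is usually treated empirically with oral antibiotics.",
    ]
    # Placeholder board-style question (invented, not an MKSAP item).
    question = ("A 45-year-old woman has six weeks of symmetric swelling of the small joints of both hands. "
                "Which of the following is the most likely diagnosis?")
    print(augment_prompt(question, textbook_passages))  # this string would be sent to the LLM

In the study itself, retrieval was performed over Harrison's Principles of Internal Medicine and the augmented prompt was submitted through each model's API; a production implementation would typically replace the word-overlap ranking with embedding-based retrieval.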
Author Jankelson, Lior
Tarabanis, Constantine
Kalampokis, Evangelos
Zahid, Sohail
Mamalis, Marios
Zhang, Kevin
AuthorAffiliation 1 Leon H. Charney Division of Cardiology, NYU Langone Health, New York University School of Medicine, New York, New York, United States of America
2 Information Systems Laboratory, University of Macedonia, Thessaloniki, Greece
3 Department of Internal Medicine, NYU Langone Health, New York University School of Medicine, New York, New York, United States of America
Intel Corporation, UNITED STATES OF AMERICA
Author_xml – sequence 1: Tarabanis, Constantine (ORCID 0000-0001-7563-2430)
– sequence 2: Zahid, Sohail
– sequence 3: Mamalis, Marios (ORCID 0009-0000-2680-0442)
– sequence 4: Zhang, Kevin
– sequence 5: Kalampokis, Evangelos
– sequence 6: Jankelson, Lior
BackLink https://www.ncbi.nlm.nih.gov/pubmed/39288137 (view this record in MEDLINE/PubMed)
CitedBy_id crossref_primary_10_3390_jcm14176169
crossref_primary_10_1007_s00405_025_09404_x
crossref_primary_10_3748_wjg_v31_i6_102090
crossref_primary_10_1016_j_mcpdig_2025_100241
crossref_primary_10_1016_j_jaci_2025_02_004
crossref_primary_10_1093_jamia_ocaf008
Cites_doi 10.2196/45312
10.1056/NEJMsr2214184
10.1145/3571884.3604316
10.1145/3477495.3532682
10.1371/journal.pdig.0000198
10.1001/jamapediatrics.2023.2373
10.1016/j.ajo.2023.05.024
10.1186/s12909-020-1974-3
ContentType Journal Article
Copyright Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication: https://creativecommons.org/publicdomain/zero/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.1371/journal.pdig.0000604
DatabaseTitleList
CrossRef
PubMed
MEDLINE - Academic
Publicly Available Content Database
Discipline Medicine
DocumentTitleAlternate LLM performance on internal medicine board exam
EISSN 2767-3170
ExternalDocumentID PMC11407633
39288137
10_1371_journal_pdig_0000604
Genre Journal Article
ISICitedReferencesCount 7
ISSN 2767-3170
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 9
Language English
License Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Notes The authors have declared that no competing interests exist.
ORCID 0009-0000-2680-0442
0000-0001-7563-2430
OpenAccessLink https://www.proquest.com/docview/3251383873
PMID 39288137
PQID 3251383873
PQPubID 6980581
PublicationCentury 2000
PublicationDate 2024-09-01
PublicationDateYYYYMMDD 2024-09-01
PublicationDate_xml – month: 09
  year: 2024
  text: 2024-09-01
  day: 01
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
– name: San Francisco
– name: San Francisco, CA USA
PublicationTitle PLOS digital health
PublicationTitleAlternate PLOS Digit Health
PublicationYear 2024
Publisher Public Library of Science
Publisher_xml – name: Public Library of Science
References P Lewis (pdig.0000604.ref012) 2020
R Thoppilan (pdig.0000604.ref015) 2022
K Beam (pdig.0000604.ref005) 2023; 177
pdig.0000604.ref007
H Touvron (pdig.0000604.ref016) 2023
pdig.0000604.ref009
FN Mirza (pdig.0000604.ref006) 2023
pdig.0000604.ref014
pdig.0000604.ref011
pdig.0000604.ref013
LZ Cai (pdig.0000604.ref001) 2023; 254
S Rayamajhi (pdig.0000604.ref010) 2020; 20
TH Kung (pdig.0000604.ref004) 2023; 2
P Lee (pdig.0000604.ref002) 2023; 388
A Gilson (pdig.0000604.ref003) 2023; 9
J Loscalzo (pdig.0000604.ref008) 2022
References_xml – volume: 9
  start-page: e45312
  year: 2023
  ident: pdig.0000604.ref003
  article-title: How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
  publication-title: JMIR Med Educ
  doi: 10.2196/45312
– volume: 388
  start-page: 1233
  year: 2023
  ident: pdig.0000604.ref002
  article-title: Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. Drazen JM, Kohane IS, Leong T-Y, editors
  publication-title: N Engl J Med
  doi: 10.1056/NEJMsr2214184
– year: 2022
  ident: pdig.0000604.ref008
  article-title: Harrison’s Principles of Internal Medicine.
  publication-title: McGraw Hill / Medical
– ident: pdig.0000604.ref009
  doi: 10.1145/3571884.3604316
– ident: pdig.0000604.ref007
– ident: pdig.0000604.ref011
– year: 2023
  ident: pdig.0000604.ref006
  article-title: Performance of Three Large Language Models on Dermatology Board Examinations
  publication-title: J Invest Dermatol
– ident: pdig.0000604.ref013
  doi: 10.1145/3477495.3532682
– start-page: 9459
  year: 2020
  ident: pdig.0000604.ref012
  article-title: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
  publication-title: Adv Neural Inf Process Syst
– year: 2022
  ident: pdig.0000604.ref015
  publication-title: LaMDA: Language Models for Dialog Applications
– ident: pdig.0000604.ref014
– volume: 2
  start-page: e0000198
  year: 2023
  ident: pdig.0000604.ref004
  article-title: Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. Dagan A, editor
  publication-title: PLOS Digit Heal
  doi: 10.1371/journal.pdig.0000198
– volume: 177
  start-page: 977
  year: 2023
  ident: pdig.0000604.ref005
  article-title: Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination
  publication-title: JAMA Pediatr
  doi: 10.1001/jamapediatrics.2023.2373
– volume: 254
  start-page: 141
  year: 2023
  ident: pdig.0000604.ref001
  article-title: Performance of Generative Large Language Models on Ophthalmology Board–Style Questions
  publication-title: Am J Ophthalmol
  doi: 10.1016/j.ajo.2023.05.024
– volume: 20
  start-page: 79
  year: 2020
  ident: pdig.0000604.ref010
  article-title: Do USMLE steps, and ITE score predict the American Board of Internal Medicine Certifying Exam results?
  publication-title: BMC Med Educ.
  doi: 10.1186/s12909-020-1974-3
– year: 2023
  ident: pdig.0000604.ref016
  publication-title: Llama 2: Open Foundation and Fine-Tuned Chat Models
SourceID pubmedcentral
proquest
pubmed
crossref
SourceType Open Access Repository
Aggregation Database
Index Database
Enrichment Source
StartPage e0000604
SubjectTerms Accuracy
Application programming interface
Artificial intelligence
Benchmarks
Biology and Life Sciences
Chatbots
Infectious diseases
Internal medicine
Large language models
Medicine
Medicine and Health Sciences
Multiple choice
People and Places
Social Sciences
Standardized tests
Title Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions
URI https://www.ncbi.nlm.nih.gov/pubmed/39288137
https://www.proquest.com/docview/3251383873
https://www.proquest.com/docview/3106459394
https://pubmed.ncbi.nlm.nih.gov/PMC11407633
Volume 3
WOSCitedRecordID wos001416887500001