Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions
| Published in: | PLOS digital health, Volume 3, Issue 9, p. e0000604 |
|---|---|
| Main authors: | Tarabanis, Constantine; Zahid, Sohail; Mamalis, Marios; Zhang, Kevin; Kalampokis, Evangelos; Jankelson, Lior |
| Format: | Journal Article |
| Language: | English |
| Published: | United States: Public Library of Science, 01.09.2024 |
| ISSN: | 2767-3170 |
| Abstract | Ongoing research attempts to benchmark large language models (LLMs) against physicians’ fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on how knowledge supplied to the models, derived from medical texts, improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison’s Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations for 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanation to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5–80.7%, outperforming GPT-3.5, human respondents, LaMDA and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There was a 3.2–5.3% decrease in the performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot, and a 4.5–7.5% increase in the performance of both models accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting prompts with domain-specific information improved performance, making Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses to medical examination questions. |
|---|---|
| Author summary | Following the recent popularization of large language models (LLMs), medical research is attempting to benchmark LLMs’ medical competency against that of practicing physicians. No published studies have investigated LLM performance on the Internal Medicine (IM) Board Examination required for physicians to certify their IM specialty. We assessed the performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, on 240 IM board-style questions sourced from the Medical Knowledge Self-Assessment Program, a preparatory question bank. GPT-4.0 scored 77.5–80.7%, outperforming GPT-3.5, human respondents, LaMDA and Llama 2, in that order. GPT-4.0 outperformed human respondents on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There was an increase in the test scores of both GPT-3.5 and GPT-4.0 after model input augmentation using Harrison’s Principles of Internal Medicine, a standard medical textbook. The ability to correctly answer an array of multidisciplinary questions while providing supporting explanations speaks to LLMs’ potential as a study aid and medical assistant. The ability to improve LLM test performance by model input augmentation using a standard medical textbook provides a technical approach to improving their factual accuracy. |
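The model input augmentation described in the abstract follows the general Retrieval Augmented Generation pattern: passages from a reference text are retrieved by similarity to the exam question and prepended to the prompt before it is submitted to the model through its API or chatbot. The sketch below illustrates that pattern with a simple TF-IDF retriever; the chunking scheme, retrieval method, prompt wording, and the omitted API call are illustrative assumptions, not the study’s actual implementation.

```python
# Minimal sketch of retrieval-augmented prompting (assumptions, not the study's code):
# textbook passages most similar to the exam question are prepended to the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_rag_prompt(question: str, textbook_chunks: list[str], k: int = 3) -> str:
    """Select the k textbook chunks most similar to the question and prepend them."""
    # Fit a TF-IDF vocabulary over the chunks plus the question, then score similarity.
    vectorizer = TfidfVectorizer().fit(textbook_chunks + [question])
    chunk_vecs = vectorizer.transform(textbook_chunks)
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, chunk_vecs)[0]

    # Keep the top-k chunks as supporting context.
    top = sorted(range(len(textbook_chunks)), key=lambda i: scores[i], reverse=True)[:k]
    context = "\n\n".join(textbook_chunks[i] for i in top)

    return (
        "Use the following reference material to answer the board-style question.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question:\n{question}\n"
        "Answer with the single best option and a brief explanation."
    )

# The augmented prompt returned here would then be sent to the model through its
# API (e.g., a chat-completion endpoint); the call itself is omitted in this sketch.
```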
| Author | Tarabanis, Constantine; Zahid, Sohail; Mamalis, Marios; Zhang, Kevin; Kalampokis, Evangelos; Jankelson, Lior |
| AuthorAffiliation | 1 Leon H. Charney Division of Cardiology, NYU Langone Health, New York University School of Medicine, New York, New York, United States of America; 2 Information Systems Laboratory, University of Macedonia, Thessaloniki, Greece; 3 Department of Internal Medicine, NYU Langone Health, New York University School of Medicine, New York, New York, United States of America; Intel Corporation, United States of America |
| Copyright | This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication: https://creativecommons.org/publicdomain/zero/1.0/ |
| DOI | 10.1371/journal.pdig.0000604 |
| Discipline | Medicine |
| DocumentTitleAlternate | LLM performance on internal medicine board exam |
| EISSN | 2767-3170 |
| ExternalDocumentID | PMC11407633; 39288137 |
| Genre | Journal Article |
| ISICitedReferencesCount | 7 |
| ISSN | 2767-3170 |
| Issue | 9 |
| Language | English |
| Notes | The authors have declared that no competing interests exist. |
| ORCID | 0000-0001-7563-2430 (Tarabanis, Constantine); 0009-0000-2680-0442 (Mamalis, Marios) |
| PMID | 39288137 |
| PublicationDate | 2024-09-01 |
| PublicationPlace | San Francisco, CA, United States |
| PublicationTitle | PLOS digital health |
| PublicationTitleAlternate | PLOS Digit Health |
| PublicationYear | 2024 |
| Publisher | Public Library of Science |
| StartPage | e0000604 |
| SubjectTerms | Accuracy; Application programming interface; Artificial intelligence; Benchmarks; Biology and Life Sciences; Chatbots; Infectious diseases; Internal medicine; Large language models; Medicine; Medicine and Health Sciences; Multiple choice; People and Places; Social Sciences; Standardized tests |
| Title | Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/39288137 https://www.proquest.com/docview/3251383873 https://www.proquest.com/docview/3106459394 https://pubmed.ncbi.nlm.nih.gov/PMC11407633 |
| Volume | 3 |