Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions
| Published in: | Academic Medicine Vol. 99; no. 5; p. 508 |
|---|---|
| Main Authors: | Laupichler, Matthias Carl; Rother, Johanna Flora; Grunwald Kadow, Ilona C; Ahmadi, Seifollah; Raupach, Tobias |
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 01.05.2024 |
| Subjects: | Education, Medical - methods; Education, Medical, Undergraduate - methods; Educational Measurement - methods; Female; Humans; Language; Male; Students, Medical - statistics & numerical data |
| ISSN: | 1938-808X |
| DOI: | 10.1097/ACM.0000000000005626 |
| PMID: | 38166323 |
| Copyright: | Copyright © 2023 the Association of American Medical Colleges. |
| Abstract | Problem: Creating medical exam questions is time consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions using large language models (LLMs), such as ChatGPT, would therefore be desirable. However, no current studies compare students' performance on LLM-generated questions with questions developed by humans.
Approach: The authors compared student performance on questions generated by ChatGPT (LLM questions) with questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set was written by an experienced medical educator, and the second set was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test offered in the lead-up to the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or by ChatGPT.
Outcomes: The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher for human than for LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.
Next Steps: Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, whether LLMs are suitable for generating other question types, such as key feature questions, should be investigated. |
|---|---|
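The Approach section above reports that the LLM question set was produced with ChatGPT 3.5 after learning objectives and formal specifications had been extracted from the human-written questions. The article does not publish the authors' prompts or tooling, so the following Python sketch only illustrates what such a generation step could look like with the OpenAI chat-completions client; the model name, prompt wording, and the helper `draft_mcq` are assumptions for illustration, not the authors' method.

```python
from openai import OpenAI  # assumes the official openai Python package (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_mcq(learning_objective: str, specifications: str) -> str:
    """Ask the model to draft one single-best-answer MCQ with 5 options.

    The prompt text and model choice below are illustrative; the study's
    actual prompts are not published in the article.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "You write single-best-answer medical exam questions "
                           "with exactly 5 answer options, one of which is correct.",
            },
            {
                "role": "user",
                "content": f"Learning objective: {learning_objective}\n"
                           f"Formal specifications: {specifications}\n"
                           "Write one multiple-choice question and indicate the correct answer.",
            },
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(draft_mcq(
        "Explain the ionic basis of the neuronal action potential.",
        "Vignette-style stem, 5 options (A-E), undergraduate neurophysiology level.",
    ))
```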
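The Outcomes section reports item difficulty and discriminatory power for both question sets. The abstract does not state which formulas were used, but item difficulty is conventionally the proportion of students answering an item correctly, and a common discrimination index is the corrected item-total (point-biserial) correlation. The sketch below computes both from a binary response matrix; the data are simulated and the function name is hypothetical.

```python
import numpy as np


def item_statistics(responses: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Compute per-item difficulty and discrimination for a 0/1 response matrix.

    responses: shape (n_students, n_items), 1 = correct, 0 = incorrect.
    Difficulty is the proportion of students answering the item correctly.
    Discrimination is the corrected item-total correlation: the correlation of
    each item with the total score computed without that item.
    """
    n_students, n_items = responses.shape
    difficulty = responses.mean(axis=0)

    total = responses.sum(axis=1)
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = total - responses[:, j]  # total score excluding item j
        # Note: an item answered identically by all students has zero variance
        # and yields NaN here; such items are usually excluded from analysis.
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination


# Illustrative use with simulated answers from 161 students on 46 items:
rng = np.random.default_rng(0)
simulated = (rng.random((161, 46)) < 0.7).astype(int)
diff, disc = item_statistics(simulated)
print(diff.round(2), disc.round(2))
```

Under these conventional definitions, the reported means (.36 for human vs .24 for LLM questions) would correspond to average item-total correlations across items, although the authors' exact index may differ.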