Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions

Bibliographic Details
Published in: Academic Medicine, Vol. 99, No. 5, p. 508
Main Authors: Laupichler, Matthias Carl, Rother, Johanna Flora, Grunwald Kadow, Ilona C, Ahmadi, Seifollah, Raupach, Tobias
Format: Journal Article
Language: English
Published: United States 01.05.2024
Subjects:
ISSN: 1938-808X
Abstract
Problem: Creating medical exam questions is time consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions using large language models (LLMs), such as ChatGPT, would therefore be desirable. However, there are no current studies that compare students' performance on LLM-generated questions to questions developed by humans.
Approach: The authors compared student performance on questions generated by ChatGPT (LLM questions) with questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set of questions was written by an experienced medical educator, and the second set was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test that was offered leading up to the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or ChatGPT.
Outcomes: The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher in human than LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.
Next Steps: Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, the question of whether LLMs are suitable for generating different question types, such as key feature questions, should be investigated.
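The record reports only summary item statistics; the authors' analysis code is not included. As a minimal illustrative sketch (assuming the classical test theory definitions that usually underlie these indices: item difficulty as the proportion of correct answers, and discriminatory power as the corrected item-total point-biserial correlation), per-item statistics of the kind summarized in the Outcomes section could be computed from a 0/1 response matrix as follows. The function name, the use of NumPy, and the toy data are illustrative assumptions, not taken from the study.

```python
import numpy as np

def item_statistics(responses):
    """Compute per-item difficulty and discrimination for 0/1 scored MCQs.

    responses: array-like of shape (n_students, n_items), 1 = correct, 0 = incorrect.
    Returns (difficulty, discrimination) arrays of length n_items.
    """
    r = np.asarray(responses, dtype=float)
    # Item difficulty: proportion of students answering the item correctly.
    difficulty = r.mean(axis=0)
    total = r.sum(axis=1)
    discrimination = np.empty(r.shape[1])
    for i in range(r.shape[1]):
        # Corrected item-total (point-biserial) correlation: correlate the item
        # with the total score excluding that item, to avoid inflating the estimate.
        rest = total - r[:, i]
        discrimination[i] = np.corrcoef(r[:, i], rest)[0, 1]
    return difficulty, discrimination

# Toy example: 5 students x 3 items (hypothetical data, not from the study).
demo = [[1, 0, 1],
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 0],
        [0, 0, 1]]
diff, disc = item_statistics(demo)
print(diff)  # difficulty per item
print(disc)  # discrimination per item
```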
Author Ahmadi, Seifollah
Grunwald Kadow, Ilona C
Rother, Johanna Flora
Laupichler, Matthias Carl
Raupach, Tobias
Author_xml – sequence: 1
  givenname: Matthias Carl
  orcidid: 0000-0003-3104-1123
  surname: Laupichler
  fullname: Laupichler, Matthias Carl
– sequence: 2
  givenname: Johanna Flora
  orcidid: 0009-0004-3526-3211
  surname: Rother
  fullname: Rother, Johanna Flora
– sequence: 3
  givenname: Ilona C
  orcidid: 0000-0002-9085-4274
  surname: Grunwald Kadow
  fullname: Grunwald Kadow, Ilona C
– sequence: 4
  givenname: Seifollah
  surname: Ahmadi
  fullname: Ahmadi, Seifollah
– sequence: 5
  givenname: Tobias
  orcidid: 0000-0003-2555-8097
  surname: Raupach
  fullname: Raupach, Tobias
BackLink https://www.ncbi.nlm.nih.gov/pubmed/38166323 – View this record in MEDLINE/PubMed
CitedBy_id crossref_primary_10_1016_j_arthro_2024_06_021
crossref_primary_10_2196_65726
crossref_primary_10_1093_postmj_qgae065
crossref_primary_10_3389_fphar_2025_1516381
crossref_primary_10_1007_s40670_024_02218_2
crossref_primary_10_1002_ca_24271
crossref_primary_10_1016_j_crphys_2025_100160
crossref_primary_10_1080_0142159X_2025_2478872
crossref_primary_10_5604_01_3001_0054_9192
crossref_primary_10_1080_0142159X_2024_2382860
crossref_primary_10_1080_0142159X_2025_2497891
crossref_primary_10_1515_gme_2024_0021
crossref_primary_10_1186_s12909_024_05239_y
crossref_primary_10_1080_08998280_2024_2418752
crossref_primary_10_3389_fpubh_2025_1577076
crossref_primary_10_1016_j_jds_2024_08_020
crossref_primary_10_1007_s42979_024_02963_6
crossref_primary_10_1007_s11934_025_01277_1
crossref_primary_10_2196_67244
crossref_primary_10_3928_01484834_20250414_01
crossref_primary_10_1080_0142159X_2025_2513418
crossref_primary_10_1038_s41746_025_01721_z
crossref_primary_10_1177_10815589241257215
crossref_primary_10_1080_10494820_2025_2482588
crossref_primary_10_1109_ACCESS_2025_3590423
crossref_primary_10_1007_s40593_025_00471_z
crossref_primary_10_1007_s40670_025_02396_7
crossref_primary_10_1111_eje_70034
crossref_primary_10_3389_fdgth_2024_1458811
crossref_primary_10_1111_eje_70015
crossref_primary_10_1007_s40670_024_02146_1
crossref_primary_10_3748_wjg_v31_i31_109948
crossref_primary_10_1007_s10405_025_00623_x
crossref_primary_10_1186_s12909_025_06881_w
crossref_primary_10_1016_j_radi_2025_103087
crossref_primary_10_1002_pra2_1054
crossref_primary_10_1186_s12909_025_06862_z
crossref_primary_10_1016_j_jmir_2025_101896
crossref_primary_10_1080_0142159X_2025_2451870
crossref_primary_10_1080_10872981_2025_2554678
ContentType Journal Article
Copyright Copyright © 2023 the Association of American Medical Colleges.
Copyright_xml – notice: Copyright © 2023 the Association of American Medical Colleges.
DBID CGR
CUY
CVF
ECM
EIF
NPM
7X8
DOI 10.1097/ACM.0000000000005626
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
MEDLINE - Academic
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
MEDLINE - Academic
DatabaseTitleList MEDLINE - Academic
MEDLINE
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod no_fulltext_linktorsrc
Discipline Medicine
Education
EISSN 1938-808X
ExternalDocumentID 38166323
Genre Research Support, Non-U.S. Gov't
Journal Article
Comparative Study
IEDL.DBID 7X8
ISICitedReferencesCount 52
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001209002500023&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 1938-808X
IngestDate Thu Oct 02 11:32:18 EDT 2025
Mon Jul 21 06:06:41 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 5
Language English
License Copyright © 2023 the Association of American Medical Colleges.
LinkModel DirectLink
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ORCID 0000-0003-3104-1123
0000-0003-2555-8097
0009-0004-3526-3211
0000-0002-9085-4274
PMID 38166323
PQID 2909609189
PQPubID 23479
ParticipantIDs proquest_miscellaneous_2909609189
pubmed_primary_38166323
PublicationCentury 2000
PublicationDate 2024-05-01
20240501
PublicationDateYYYYMMDD 2024-05-01
PublicationDate_xml – month: 05
  year: 2024
  text: 2024-05-01
  day: 01
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Academic medicine
PublicationTitleAlternate Acad Med
PublicationYear 2024
SSID ssj0006753
Score 2.6008964
SourceID proquest
pubmed
SourceType Aggregation Database
Index Database
StartPage 508
SubjectTerms Education, Medical - methods
Education, Medical, Undergraduate - methods
Educational Measurement - methods
Female
Humans
Language
Male
Students, Medical - statistics & numerical data
Title Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions
URI https://www.ncbi.nlm.nih.gov/pubmed/38166323
https://www.proquest.com/docview/2909609189
Volume 99
WOSCitedRecordID wos001209002500023&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText
inHoldings 1
isFullTextHit
isPrint
link http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwpV1LS8NAEF7UinjxUV_1xQpel7abZpP1IqW0emhLDxVyi_uKFjSppoo_39ns1p4EwRxyWwiTyeTb-Wa_D6FrGgnFYikJU1qRDuxBiMiiFpE8CDUgcsqFqswmovE4ThI-8Q230o9VLmtiVah1oWyPvEl5JY7Wjvnt_I1Y1yjLrnoLjXVUCwDK2KyOkpVaOHMqlIBRYqjEcbI8OsejZrc3ctKF_gIYwH4HmdXPZrD738fcQzseZuKuy4t9tGbyunVo9tMcdbQ18qT6AXoc2mlwPPSdS2zt0V5KPMuxZ3Hwz7ob3HO-hfkT7j2Lxd1kSvCiwBUVQJyENUBY3P8Sr7jqpdqsPkQPg_60d0-88QJRAcA7Io1UmgY6CmELy2VgwlAxrYMOF5GSOrTn0EWmmcwYoBuTAQjQlrGUbd7JTKzoEdrIi9ycICyUYS2lIb4akJfiUmacUdrWIY9hH99qoKtlHFNIbMtWiNwUH2W6imQDHbuXkc6dAkdasZ0BDU7_sPoMbVMAIm5I8RzVMviszQXaVJ-LWfl-WWUM3MeT0Td_FMmP
linkProvider ProQuest
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Large+Language+Models+in+Medical+Education%3A+Comparing+ChatGPT-+to+Human-Generated+Exam+Questions&rft.jtitle=Academic+medicine&rft.au=Laupichler%2C+Matthias+Carl&rft.au=Rother%2C+Johanna+Flora&rft.au=Grunwald+Kadow%2C+Ilona+C&rft.au=Ahmadi%2C+Seifollah&rft.date=2024-05-01&rft.issn=1938-808X&rft.eissn=1938-808X&rft.volume=99&rft.issue=5&rft.spage=508&rft_id=info:doi/10.1097%2FACM.0000000000005626&rft.externalDBID=NO_FULL_TEXT