Evaluation of AI models for radiology exam preparation: DeepSeek vs. ChatGPT-3.5.

Detailed Bibliography
Title: Evaluation of AI models for radiology exam preparation: DeepSeek vs. ChatGPT-3.5.
Authors: Hu N (Department of Radiology, The Affiliated Hospital of Guizhou Medical University, Guiyang, Guizhou Province, People's Republic of China); Luo Y (Department of Anesthesiology, Guizhou Provincial People's Hospital, Guiyang, Guizhou Province, People's Republic of China); Lei P (Department of Radiology, The Affiliated Hospital of Guizhou Medical University, Guiyang, Guizhou Province, People's Republic of China)
Source: Medical Education Online [Med Educ Online] 2025 Dec 31; Vol. 30 (1), pp. 2589679. Date of Electronic Publication: 2025 Nov 28.
Publication type: Journal Article; Comparative Study
Language: English
Journal information: Publisher: Taylor & Francis. Country of Publication: United States. NLM ID: 9806550. Publication Model: Print-Electronic. Cited Medium: Internet. ISSN: 1087-2981 (Electronic). Linking ISSN: 10872981. NLM ISO Abbreviation: Med Educ Online. Subsets: MEDLINE.
Imprint Name(s): Publication: 2016- : Philadelphia, PA : Taylor & Francis
Original Publication: [E. Lansing, MI] : Medical Education Online, [1996-
MeSH terms: Educational Measurement*/methods; Radiology*/education; Artificial Intelligence*; Humans; Generative Artificial Intelligence
Abstract: The rapid advancement of artificial intelligence (AI) chatbots has generated significant interest regarding their potential applications within medical education. This study sought to assess the performance of the open-source large language model DeepSeek-V3 in answering radiology board-style questions and to compare its accuracy with that of ChatGPT-3.5. A total of 161 questions (comprising 207 items) were randomly selected from the Exercise Book for the National Senior Health Professional Qualification Examination: Radiology. The question set included single-choice, multiple-choice, shared-stem, and case analysis questions. Both DeepSeek-V3 and ChatGPT-3.5 were evaluated on the same question set over a seven-day testing period. Response accuracy was systematically assessed, and statistical analyses were performed using Pearson's chi-square test and Fisher's exact test. DeepSeek-V3 achieved an overall accuracy of 72%, significantly higher than the 55.6% achieved by ChatGPT-3.5 (P < 0.001). Analysis by question type revealed DeepSeek's superior accuracy on single-choice questions (87.1%), with comparatively lower performance on multiple-choice (55.7%) and case analysis (68.0%) questions. Across clinical subspecialties, DeepSeek consistently outperformed ChatGPT, particularly in the peripheral nervous system (P = 0.003), respiratory system (P = 0.008), circulatory system (P = 0.012), and musculoskeletal system (P = 0.021) domains. In conclusion, DeepSeek demonstrates considerable potential as an educational tool in radiology, particularly for knowledge recall and foundational learning. However, its relatively weaker performance on higher-order cognitive tasks and complex question formats suggests the need for further model refinement. Future research should investigate DeepSeek's capability to process image-based questions and perform comparative analyses with more advanced models (e.g., GPT-5) to better evaluate its potential for medical education.
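For context on the reported statistics, the following is a minimal Python sketch (not the authors' code) of how the headline accuracy comparison could be reproduced with SciPy; the correct/incorrect counts are back-calculated from the reported percentages over the 207 items and are therefore approximate.

# Hypothetical reconstruction of the overall-accuracy comparison from the
# abstract; counts are back-calculated from 72% and 55.6% of 207 items.
from scipy.stats import chi2_contingency, fisher_exact

# Rows: model; columns: [correct, incorrect] out of 207 items.
table = [
    [149, 58],   # DeepSeek-V3: 149/207 ~ 72.0% correct
    [115, 92],   # ChatGPT-3.5: 115/207 ~ 55.6% correct
]

# Pearson's chi-square (SciPy applies Yates' correction for 2x2 tables).
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, P = {p:.5f}")  # P < 0.001, as reported

# Fisher's exact test, the study's stated complement for small expected counts.
odds_ratio, p_exact = fisher_exact(table)
print(f"Fisher's exact: OR = {odds_ratio:.2f}, P = {p_exact:.5f}")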
Contributed Indexing: Keywords: Artificial intelligence; ChatGPT; DeepSeek; medical imaging education; radiology examination
Entry Date(s): Date Created: 20251128 Date Completed: 20251128 Latest Revision: 20251203
Update Code: 20251203
PubMed Central ID: PMC12667340
DOI: 10.1080/10872981.2025.2589679
PMID: 41311245
Database: MEDLINE