Evaluation of AI models for radiology exam preparation: DeepSeek vs. ChatGPT-3.5.
| Title: | Evaluation of AI models for radiology exam preparation: DeepSeek vs. ChatGPT-3.5. |
|---|---|
| Authors: | Hu N; Department of Radiology, The Affiliated Hospital of Guizhou Medical University, Guiyang, Guizhou Province, People's Republic of China., Luo Y; Department of Anesthesiology, Guizhou Provincial People's Hospital, Guiyang, Guizhou Province, People's Republic of China., Lei P; Department of Radiology, The Affiliated Hospital of Guizhou Medical University, Guiyang, Guizhou Province, People's Republic of China. |
| Source: | Medical education online [Med Educ Online] 2025 Dec 31; Vol. 30 (1), pp. 2589679. Date of Electronic Publication: 2025 Nov 28. |
| Publication Type: | Journal Article; Comparative Study |
| Language: | English |
| Journal Information: | Publisher: Taylor & Francis Country of Publication: United States NLM ID: 9806550 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1087-2981 (Electronic) Linking ISSN: 10872981 NLM ISO Abbreviation: Med Educ Online Subsets: MEDLINE |
| Imprint Name(s): | Publication: 2016- : Philadelphia, PA : Taylor & Francis Original Publication: [E. Lansing, MI] : Medical Education Online, [1996- |
| MeSH Terms: | Educational Measurement*/methods, Radiology*/education, Artificial Intelligence*, Humans; Generative Artificial Intelligence |
| Abstract: | The rapid advancement of artificial intelligence (AI) chatbots has generated significant interest regarding their potential applications within medical education. This study sought to assess the performance of the open-source large language model DeepSeek-V3 in answering radiology board-style questions and to compare its accuracy with that of ChatGPT-3.5. A total of 161 questions (comprising 207 items) were randomly selected from the Exercise Book for the National Senior Health Professional Qualification Examination: Radiology. The question set included single-choice, multiple-choice, shared-stem, and case analysis questions. Both DeepSeek-V3 and ChatGPT-3.5 were evaluated using the same question set over a seven-day testing period. Response accuracy was systematically assessed, and statistical analyses were performed using Pearson's chi-square test and Fisher's exact test. DeepSeek-V3 achieved an overall accuracy of 72%, which was significantly higher than the 55.6% accuracy achieved by ChatGPT-3.5 (P < 0.001). Performance analysis by question type revealed DeepSeek's superior accuracy in single-choice questions (87.1%), though with comparatively lower performance in multiple-choice (55.7%) and case analysis questions (68.0%). Across clinical subspecialties, DeepSeek consistently outperformed ChatGPT, particularly in the peripheral nervous system (P = 0.003), respiratory system (P = 0.008), circulatory system (P = 0.012), and musculoskeletal system (P = 0.021) domains. In conclusion, DeepSeek demonstrates considerable potential as an educational tool in radiology, particularly for knowledge recall and foundational learning applications. However, its relatively weaker performance on higher-order cognitive tasks and complex question formats suggests the need for further model refinement. Future research should investigate DeepSeek's capability in processing image-based questions and perform comparative analyses with more advanced models (e.g., GPT-5) to better evaluate its potential for medical education. (A hedged worked example of the reported chi-square comparison appears after this record.) |
| Contributed Indexing: | Keywords: Artificial intelligence; ChatGPT; DeepSeek; medical imaging education; radiology examination |
| Entry Date(s): | Date Created: 20251128 Date Completed: 20251128 Latest Revision: 20251203 |
| Update Code: | 20251203 |
| PubMed Central ID: | PMC12667340 |
| DOI: | 10.1080/10872981.2025.2589679 |
| PMID: | 41311245 |
| Database: | MEDLINE |
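The abstract reports that overall accuracy (72% for DeepSeek-V3 vs. 55.6% for ChatGPT-3.5 on 207 items) was compared with Pearson's chi-square and Fisher's exact tests. The sketch below is only an illustration of that kind of comparison: the correct/incorrect counts are not taken from the paper but reconstructed from the reported percentages, so the exact statistics are assumptions, not the study's published values.

```python
# Minimal sketch of a 2x2 accuracy comparison (Pearson's chi-square and
# Fisher's exact test), as described in the abstract.
# NOTE: counts are reconstructed from the reported percentages (72% and
# 55.6% of 207 items) for illustration only; they are not the paper's data.
from scipy.stats import chi2_contingency, fisher_exact

total_items = 207
deepseek_correct = round(0.720 * total_items)   # ~149 (assumed)
chatgpt_correct = round(0.556 * total_items)    # ~115 (assumed)

# 2x2 contingency table: rows = model, columns = [correct, incorrect]
table = [
    [deepseek_correct, total_items - deepseek_correct],
    [chatgpt_correct, total_items - chatgpt_correct],
]

chi2, p_chi2, dof, _expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)

print(f"Chi-square = {chi2:.2f}, dof = {dof}, p = {p_chi2:.4f}")
print(f"Fisher's exact test: OR = {odds_ratio:.2f}, p = {p_fisher:.4f}")
```

With counts of this size the two tests give similar p-values well below 0.05, consistent with the significant difference (P < 0.001) reported in the abstract, though the exact figures depend on the true item-level counts.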