Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments
| Title: | Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments |
|---|---|
| Authors: | Künzle, Paul; Paris, Sebastian |
| Source: | Clin Oral Investig |
| Publisher Information: | Springer Science and Business Media LLC, 2024. |
| Publication Year: | 2024 |
| Subject Terms: | Surveys and Questionnaires [MeSH], Endodontics/education [MeSH], Humans [MeSH], Artificial intelligence, Education, Dental/methods [MeSH], Dentistry, Operative/education [MeSH], GenAI, Artificial Intelligence [MeSH], ChatGPT, Students, Dental [MeSH], Gemini, Research, Natural language processing, Clinical Competence [MeSH], Educational Measurement/methods [MeSH], Artificial Intelligence, Dentistry, Operative, Surveys and Questionnaires, Students, Dental, Humans, Educational Measurement, Clinical Competence, Education, Dental, Endodontics |
| Description: | Objectives: The advent of artificial intelligence (AI) and large language model (LLM)-based AI applications (LLMAs) has tremendous implications for our society. This study analyzed the performance of LLMAs on solving restorative dentistry and endodontics (RDE) student assessment questions. Materials and methods: 151 questions from an RDE question pool were prepared for prompting using LLMAs from OpenAI (ChatGPT-3.5, -4.0 and -4.0o) and Google (Gemini 1.0). Multiple-choice questions were sorted into four question subcategories and entered into the LLMAs, and the answers were recorded for analysis. P-value and chi-square statistical analyses were performed using Python 3.9.16. Results: The total answer accuracy of ChatGPT-4.0o was the highest, followed by ChatGPT-4.0, Gemini 1.0 and ChatGPT-3.5 (72%, 62%, 44% and 25%, respectively), with significant differences between all LLMAs except between the two GPT-4.0 models. Performance was highest in the subcategories direct restorations and caries, followed by indirect restorations and endodontics. Conclusions: Overall, there are large performance differences among LLMAs. Only the ChatGPT-4 models achieved a success ratio high enough to be used, with caution, to support the dental academic curriculum. Clinical relevance: While LLMAs could support clinicians in answering dental field-related questions, this capacity depends strongly on the model employed. The best-performing model, ChatGPT-4.0o, achieved acceptable accuracy rates in some of the subject subcategories analyzed. |
| Document Type: | Article; Other literature type |
| Language: | English |
| ISSN: | 1436-3771 |
| DOI: | 10.1007/s00784-024-05968-w |
| Access URL: | https://pubmed.ncbi.nlm.nih.gov/39373739 https://repository.publisso.de/resource/frl:6522168 |
| Rights: | CC BY |
| Accession Number: | edsair.doi.dedup.....54f3947e6ddb338f4a99785ee6892f3b |
| Database: | OpenAIRE |
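
The abstract reports pairwise chi-square comparisons of answer accuracy but not the underlying counts. The following is a minimal sketch of such a comparison, not the authors' actual analysis. It assumes the number of correct answers can be approximated by rounding each reported percentage of the 151 questions, and that a Pearson chi-square test on a 2x2 table without continuity correction was used; the function name `chi_square_2x2` and the dictionary `counts` are hypothetical.

```python
import math

TOTAL_QUESTIONS = 151  # from the abstract

# Reported total accuracies from the abstract.
reported_accuracy = {
    "ChatGPT-4.0o": 0.72,
    "ChatGPT-4.0": 0.62,
    "Gemini 1.0": 0.44,
    "ChatGPT-3.5": 0.25,
}

# ASSUMPTION: correct-answer counts are reconstructed by rounding;
# the record does not state the true counts.
counts = {name: round(acc * TOTAL_QUESTIONS) for name, acc in reported_accuracy.items()}


def chi_square_2x2(correct_a, correct_b, total):
    """Pearson chi-square test (no continuity correction) comparing two
    models that each answered `total` questions.

    Returns (chi2, p_value) with one degree of freedom.
    """
    observed = [
        [correct_a, total - correct_a],
        [correct_b, total - correct_b],
    ]
    column_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    grand_total = 2 * total
    chi2 = 0.0
    for row in observed:
        for j, obs in enumerate(row):
            # Both rows have the same row total (`total`).
            expected = total * column_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    names = list(counts)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            chi2, p = chi_square_2x2(counts[a], counts[b], TOTAL_QUESTIONS)
            print(f"{a} vs {b}: chi2={chi2:.2f}, p={p:.4f}")
```

Under these rounded counts, the comparison between the two GPT-4.0 models yields a p-value above 0.05 while the other pairs fall below it, which is consistent with the pattern the abstract describes. The exact statistics in the paper may differ, since the true counts and any correction applied are not given in this record.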