Aim: The objective of the present study was to investigate the clinical understanding and reasoning abilities of three large language models (LLMs), namely ChatGPT, GPT-4, and New Bing, by evaluating their performance on the Chinese National Dental Licensing Examination (NDLE).
Materials and methods: Questions from the 2020 to 2022 NDLE were selected according to subject weightings. Standardized prompts were used to regulate the output of the LLMs and elicit more precise answers. The performance of each model, both within each subject category and across all subjects overall, was analyzed using McNemar's test.
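As a minimal illustration of the paired comparison described above (not the authors' actual analysis code), the sketch below shows how per-question correctness for two models could be compared with McNemar's test; the variable names and input data are hypothetical, and the exact binomial variant is assumed.

```python
# Sketch only: paired comparison of two models' per-question correctness
# with McNemar's test. Data and names are hypothetical, not from the study.
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(correct_a, correct_b):
    """Compare two models on the same questions (lists of booleans)."""
    # Build the 2x2 agreement/disagreement table for the paired answers
    both = sum(a and b for a, b in zip(correct_a, correct_b))
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    neither = sum((not a) and (not b) for a, b in zip(correct_a, correct_b))
    table = [[both, only_a], [only_b, neither]]
    # Exact binomial test on the discordant pairs (only_a vs. only_b)
    result = mcnemar(table, exact=True)
    return result.statistic, result.pvalue

# Hypothetical usage with correctness vectors over the 324 exam questions:
# stat, p = mcnemar_compare(chatgpt_correct, gpt4_correct)
```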
Results: The percentage scores obtained by ChatGPT, GPT-4, and New Bing were 42.6% (138/324), 63.0% (204/324), and 72.5% (235/324), respectively. The performance of New Bing differed significantly from that of both ChatGPT and GPT-4. GPT-4 and New Bing outperformed ChatGPT across all subjects, and New Bing surpassed GPT-4 in most subjects.
Conclusion: GPT-4 and New Bing exhibited promising capabilities on the NDLE. However, their performance in specific subjects, such as prosthodontics and oral and maxillofacial surgery, requires improvement. This performance gap can be attributed to the limited dental training data available and the inherent complexity of these subjects.
Keywords: artificial intelligence, big data, evidence-based dental/health care, dental education, deep learning/machine learning