Are Large Language Models Ready for Specialty-Level Periodontology? A Comparative Evaluation Across Question Types and Difficulty Strata
Abstract
Background: Artificial intelligence (AI), particularly large language models (LLMs), has emerged as a promising tool in healthcare, with potential applications in clinical decision support and dental education. Despite increasing interest, evidence regarding the performance of LLMs in periodontology, especially in clinically oriented, scenario-based assessments, remains limited. This study aimed to evaluate and compare the accuracy of multiple LLMs in answering knowledge-based and scenario-based multiple-choice questions (MCQs) in periodontology across different difficulty levels.
Methods: A total of 100 periodontology MCQs were selected from validated academic sources and divided into two categories: knowledge-based questions (n = 50) and scenario-based questions (n = 50). Each category was further stratified into easy and moderate–difficult levels (25 questions each) based on expert consensus. Four publicly available LLMs (GPT-4o, Gemini 1.5 Flash, DeepSeek-V3, and Microsoft Copilot) were evaluated using a standardized prompting framework. Model responses were assessed for accuracy against verified answer keys. Statistical analysis was performed using Pearson’s Chi-square test, with significance set at p < 0.05.
Results: Overall accuracy ranged from 63% to 71%, with Gemini achieving the highest overall performance (71%), followed by GPT (70%), DeepSeek (65%), and Copilot (63%), without statistically significant differences (p = 0.80). All models demonstrated higher accuracy in scenario-based MCQs compared to knowledge-based questions, with statistically significant improvements observed for GPT (p < 0.001), Gemini (p = 0.014), and DeepSeek (p = 0.022). Accuracy decreased with increasing question difficulty, with significant performance declines observed for Gemini (p = 0.015) and Copilot (p = 0.022), while GPT and DeepSeek showed more stable performance.
Conclusions: LLMs demonstrate baseline competency in periodontology and show improved performance in context-rich, scenario-based questions. However, their accuracy remains variable and task-dependent, particularly under increasing difficulty. While these models may serve as useful adjuncts in dental education and clinical support, they are not yet reliable as standalone tools for clinical decision-making.
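As an illustration of the kind of comparison described in the Methods, the following is a minimal sketch of Pearson's Chi-square test on a models-by-outcome contingency table. The correct/incorrect counts are derived here from the reported overall percentages on 100 questions; the function names and table layout are assumptions for illustration, not the authors' analysis code, and the closed-form p-value helper covers only 1 and 3 degrees of freedom.

```python
import math


def pearson_chi_square(table):
    """Return (statistic, degrees_of_freedom) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of model and outcome.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi_square_sf(x, dof):
    """Upper-tail p-value using closed forms (hypothetical helper, dof 1 or 3 only)."""
    if dof == 1:
        return math.erfc(math.sqrt(x / 2))
    if dof == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("closed form implemented only for dof 1 and 3")


# Assumed counts: reported overall accuracy (%) applied to 100 questions per model.
correct = {"GPT": 70, "Gemini": 71, "DeepSeek": 65, "Copilot": 63}
table = [[n, 100 - n] for n in correct.values()]

statistic, dof = pearson_chi_square(table)
p_value = chi_square_sf(statistic, dof)
print(f"chi2 = {statistic:.3f}, dof = {dof}, p = {p_value:.3f}")
```

The abstract does not state how each test was set up (table layout, pairwise versus omnibus comparison, or any correction), so the output of this sketch is illustrative and need not match the reported p-values.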
Citation Information
@article{helmimostafaabdaljabbarkhatib2026,
title={Are Large Language Models Ready for Specialty-Level Periodontology? A Comparative Evaluation Across Question Types and Difficulty Strata},
author={Helmi Mostafa Abdaljabbar Khatib},
journal={Research Square},
year={2026},
doi={https://doi.org/10.21203/rs.3.rs-9468440/v1}
}