Research Article 2026-04-21 posted v1

Are Large Language Models Ready for Specialty-Level Periodontology? A Comparative Evaluation Across Question Types and Difficulty Strata

Helmi Mostafa Abdaljabbar Khatib An-Najah National University

Abstract

Background: Artificial intelligence (AI), particularly large language models (LLMs), has emerged as a promising tool in healthcare, with potential applications in clinical decision support and dental education. Despite increasing interest, evidence on LLM performance in periodontology remains limited, especially for clinically oriented, scenario-based assessments. This study aimed to evaluate and compare the accuracy of multiple LLMs in answering knowledge-based and scenario-based multiple-choice questions (MCQs) in periodontology across different difficulty levels.

Methods: A total of 100 periodontology MCQs were selected from validated academic sources and divided into two categories: knowledge-based questions (n = 50) and scenario-based questions (n = 50). Each category was further stratified into easy and moderate-to-difficult levels (25 questions each) based on expert consensus. Four publicly available LLMs (GPT-4o, Gemini 1.5 Flash, DeepSeek-V3, and Microsoft Copilot) were evaluated using a standardized prompting framework. Model responses were scored for accuracy against verified answer keys. Statistical analysis was performed using Pearson's Chi-square test, with significance set at p < 0.05.

Results: Overall accuracy ranged from 63% to 71%. Gemini achieved the highest overall accuracy (71%), followed by GPT (70%), DeepSeek (65%), and Copilot (63%); the differences between models were not statistically significant (p = 0.80). All models were more accurate on scenario-based MCQs than on knowledge-based questions, and the difference was statistically significant for GPT (p < 0.001), Gemini (p = 0.014), and DeepSeek (p = 0.022). Accuracy decreased with increasing question difficulty, with significant declines for Gemini (p = 0.015) and Copilot (p = 0.022), while GPT and DeepSeek showed more stable performance.

Conclusions: LLMs demonstrate baseline competency in periodontology and perform better on context-rich, scenario-based questions. However, their accuracy remains variable and task-dependent, particularly as difficulty increases. While these models may serve as useful adjuncts in dental education and clinical support, they are not yet reliable as standalone tools for clinical decision-making.
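
As an illustration of the kind of test named in the Methods, the following is a minimal sketch of a Pearson Chi-square test on the overall accuracies, assuming each percentage corresponds to a count of correct answers out of the 100 questions. The function names (`pearson_chi2`, `chi2_sf_dof3`) and the 4 x 2 table layout are assumptions made for this example, not the authors' analysis code, and because the abstract does not state exactly how the comparison was set up, the p-value it prints should not be expected to reproduce the reported p = 0.80.

```python
import math

# Correct-answer counts out of 100 questions per model, derived from the
# overall accuracies reported in the abstract (assumption: one response
# per question per model).
correct = {"Gemini": 71, "GPT": 70, "DeepSeek": 65, "Copilot": 63}
N_QUESTIONS = 100


def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of model and correctness.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi2_sf_dof3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# One row per model: [correct, incorrect].
table = [[c, N_QUESTIONS - c] for c in correct.values()]
stat, dof = pearson_chi2(table)
p_value = chi2_sf_dof3(stat)  # closed form valid only because dof == 3 here
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")
```

Under these assumptions the statistic is small relative to its 3 degrees of freedom and the p-value lies well above the 0.05 threshold, which agrees with the reported absence of a significant difference in overall accuracy between models.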

Keywords

Artificial intelligence; Large language models; Periodontology; Dental education; Clinical reasoning; Multiple-choice questions

Citation Information

@article{helmimostafaabdaljabbarkhatib2026,
  title={Are Large Language Models Ready for Specialty-Level Periodontology? A Comparative Evaluation Across Question Types and Difficulty Strata},
  author={Helmi Mostafa Abdaljabbar Khatib},
  journal={Research Square},
  year={2026},
  doi={10.21203/rs.3.rs-9468440/v1}
}

Helmi Mostafa Abdaljabbar Khatib et al. (2026). Are Large Language Models Ready for Specialty-Level Periodontology? A Comparative Evaluation Across Question Types and Difficulty Strata. Research Square. https://doi.org/10.21203/rs.3.rs-9468440/v1

Helmi Mostafa Abdaljabbar Khatib, et al. "Are Large Language Models Ready for Specialty-Level Periodontology? A Comparative Evaluation Across Question Types and Difficulty Strata." Research Square, 2026.

[1] Helmi Mostafa Abdaljabbar Khatib. Are Large Language Models Ready for Specialty-Level Periodontology? A Comparative Evaluation Across Question Types and Difficulty Strata [Research Article]. Research Square, 2026.
