๐Ÿง‘๐Ÿผโ€๐Ÿ’ป Research - January 17, 2026

Leading large language models on a periodontology knowledge test.

⚡ Quick Summary

This study evaluated the performance of large language models (LLMs) in a specialized dental domain, specifically periodontology, using a validated set of 50 multiple-choice questions. The overall accuracy across all models was 65.0%, indicating moderate domain knowledge but insufficient reliability for unsupervised clinical decision support.

๐Ÿ” Key Details

  • 📊 Models Tested: ChatGPT-4o, DeepSeek-R1, Consensus, Perplexity
  • 🧩 Question Set: 50 validated multiple-choice questions in periodontology
  • ⚙️ Methodology: Five independent trials under primed and non-primed conditions
  • 🏆 Overall Accuracy: 65.0% (95% CI: 63.4-66.6%)

🔑 Key Takeaways

  • 📊 LLMs show moderate performance in periodontology knowledge tests.
  • 💡 No significant differences were found between the models (p = 0.336).
  • 👩‍🔬 Role-specific priming did not enhance performance (p = 0.836).
  • 🏆 Four questions were never answered correctly by any model.
  • 🤖 Several questions were answered correctly in fewer than 13% of trials.
  • 🌍 Findings suggest LLMs are not yet reliable for unsupervised clinical decision support.
  • 🆔 Study published in Swiss Dent J, 2026.

📚 Background

The integration of large language models (LLMs) into clinical and educational settings is on the rise. However, there is a notable lack of data regarding their performance in specialized fields such as dentistry. Understanding how these models perform in specific domains is crucial for their effective application in clinical decision-making and education.

๐Ÿ—’๏ธ Study

This study assessed the performance of four LLMs on a validated periodontology knowledge test. Each model completed five independent trials over a set of 50 multiple-choice questions, both with and without specific instructions to respond as a board-certified periodontist. The responses were benchmarked against a validated answer key.
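
To make that protocol concrete, here is a minimal sketch of how such an evaluation could be scripted in Python. This is not the authors' code: ask_model, the question format, and the priming prompt are placeholder assumptions standing in for whatever interface each LLM actually exposes.

```python
# Minimal sketch of the trial protocol described above (not the authors' code).
import random

PRIMING_PREFIX = (
    "You are a board-certified periodontist. "
    "Answer the following multiple-choice question with a single letter.\n\n"
)

def ask_model(model_name: str, prompt: str) -> str:
    """Placeholder: send `prompt` to the named LLM and return its answer letter."""
    return random.choice("ABCD")  # stand-in so the sketch runs end to end

def run_trial(model_name: str, questions: list, primed: bool) -> float:
    """Score one full pass over the question set against the answer key."""
    correct = 0
    for q in questions:
        prompt = (PRIMING_PREFIX if primed else "") + q["text"]
        if ask_model(model_name, prompt).strip().upper().startswith(q["key"]):
            correct += 1
    return correct / len(questions)

# Five independent trials per model under both primed and non-primed conditions.
models = ["ChatGPT-4o", "DeepSeek-R1", "Consensus", "Perplexity"]
questions = [{"text": f"Question {i} ...", "key": "A"} for i in range(1, 51)]  # placeholder items
results = {
    (model, primed): [run_trial(model, questions, primed) for _ in range(5)]
    for model in models
    for primed in (True, False)
}
```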

📈 Results

The overall accuracy of the models was 65.0% (95% CI: 63.4-66.6%), with a standard deviation of 3.0. Statistical analyses revealed no significant differences in performance among the models (p = 0.336), indicating that all four exhibited similar levels of domain knowledge. Role-specific priming did not yield any improvement in accuracy (p = 0.836). At the item level, four questions were never answered correctly, and several others were answered correctly in fewer than 13% of trials.
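
For readers who want to see how such statistics could be computed, the snippet below sketches a one-way ANOVA across models, an independent-samples t-test for primed versus non-primed trials, and a normal-approximation 95% confidence interval with SciPy. The per-trial accuracies are simulated placeholders near the reported 65% mean, not the study's data, and the primed/non-primed split shown here is purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
models = ["ChatGPT-4o", "DeepSeek-R1", "Consensus", "Perplexity"]
# Placeholder per-trial accuracies (10 trials per model: 5 primed + 5 non-primed);
# substitute the real trial scores here.
per_model = {m: rng.normal(0.65, 0.03, size=10) for m in models}

# One-way ANOVA: do mean accuracies differ between models?
_, p_models = stats.f_oneway(*per_model.values())

# Independent-samples t-test: primed vs. non-primed trials.
# The first five trials per model are treated as primed purely for illustration.
primed = np.concatenate([acc[:5] for acc in per_model.values()])
non_primed = np.concatenate([acc[5:] for acc in per_model.values()])
_, p_priming = stats.ttest_ind(primed, non_primed)

# Normal-approximation 95% confidence interval for overall accuracy.
all_trials = np.concatenate(list(per_model.values()))
mean, sem = all_trials.mean(), stats.sem(all_trials)
ci = (mean - 1.96 * sem, mean + 1.96 * sem)
print(f"models p={p_models:.3f}, priming p={p_priming:.3f}, "
      f"accuracy {mean:.1%} (95% CI {ci[0]:.1%}-{ci[1]:.1%})")
```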

๐ŸŒ Impact and Implications

The findings of this study highlight the current limitations of LLMs in providing reliable clinical decision support in periodontology. While these models demonstrate some level of domain knowledge, their performance is not yet sufficient for unsupervised use in clinical settings. This underscores the need for further research and development to enhance the capabilities of LLMs in specialized fields.

🔮 Conclusion

This study sheds light on the moderate performance of large language models in periodontology. While they show promise, the current level of accuracy is not adequate for clinical decision-making without supervision. Continued advancements in LLM technology and further research are essential to improve their reliability and applicability in specialized medical domains.

💬 Your comments

What are your thoughts on the use of large language models in specialized fields like dentistry? We would love to hear your insights! 💬 Leave your comments below or connect with us on social media.

Leading large language models on a periodontology knowledge test.

Abstract

Large language models (LLMs) are increasingly used in clinical and educational settings. However, there is a paucity of data on LLMs' performance in specialized dental domains. This study assessed the performance of four LLMs, including two general-purpose models, ChatGPT-4o and DeepSeek-R1, and two research-focused models, Consensus and Perplexity, using a validated set of 50 multiple-choice questions in periodontology. Each LLM completed five independent trials encompassing the full question set under both primed and non-primed conditions. A validated answer key served as the benchmark. Performance was analyzed using one-way and two-way analysis of variance and independent-samples t-tests, with additional item-level analyses to identify questions that were consistently difficult. Overall accuracy across all models and trials was 65.0% (95% confidence interval: 63.4-66.6%) with a standard deviation of 3.0. There were no significant differences between models (p = 0.336). Role-specific priming, in which models were instructed to respond as board-certified periodontists, did not improve performance (p = 0.836). At the item level, four questions were never answered correctly, and several others were answered correctly in fewer than 13% of trials. These difficult items generally required detailed procedural knowledge, rare factual recall, or application of classification frameworks. Overall, these findings suggest that current LLMs demonstrate moderate domain knowledge in periodontology but fall short of the reliability required for unsupervised clinical decision support.
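
As a rough illustration of the item-level analysis described in the abstract, the sketch below computes the fraction of trials in which each question was answered correctly and flags items that were never correct or correct in fewer than 13% of trials. The 0/1 outcome matrix is a hypothetical placeholder, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_questions, n_trials = 50, 40            # 4 models x 2 conditions x 5 trials
# Placeholder 0/1 outcomes; in practice this would hold the graded responses.
correct = (rng.random((n_questions, n_trials)) < 0.65).astype(int)

per_item_rate = correct.mean(axis=1)                      # fraction of trials correct per question
never_correct = np.flatnonzero(per_item_rate == 0) + 1    # 1-based question numbers
hard_items = np.flatnonzero(per_item_rate < 0.13) + 1

print("Never answered correctly:", never_correct.tolist())
print("Correct in <13% of trials:", hard_items.tolist())
```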

Authors: Rusa AM, Schmidlin PR, Sarkar P, Eggmann F

Journal: Swiss Dent J

Citation: Rusa AM, Schmidlin PR, Sarkar P, Eggmann F. Leading large language models on a periodontology knowledge test. Swiss Dent J. 2026;135:37-51. doi: 10.61872/sdj-2025-04-04
