⚡ Quick Summary
This study compared six large language models (LLMs) on the prosthodontics section of the Korean Dental Licensing Examination (KDLE). The Chain-of-Thought (CoT) reasoning model ChatGPT-o1 outperformed the others with an accuracy of 80.54%, closely matching the published human average of 79.51%.
📌 Key Details
- 📊 Dataset: 161 Korean-language multiple-choice questions from the KDLE (2020-2024)
- 🧩 Features used: Questions classified into five domains of prosthodontics
- ⚙️ Technology: Six LLMs (ChatGPT-o1, ChatGPT-4o, DeepSeek-R1, DeepSeek-V3, Gemini 1.5 Flash, and CLOVA X)
- 📈 Performance: ChatGPT-o1: 80.54%; human average: 79.51%
🔑 Key Takeaways
- 🏆 CoT reasoning models demonstrated superior performance in this specialized clinical domain.
- 💡 ChatGPT-o1 achieved the highest accuracy, surpassing the human average.
- 👩‍🔬 ChatGPT-4o and DeepSeek-R1 also reached passing-level performance, though neither matched ChatGPT-o1.
- 📉 CLOVA X scored the lowest (34.37%), showing that localized language optimization does not guarantee domain expertise.
- 📊 Performance varied significantly with the architectural philosophy of the models.
- 🗂️ Questions were categorized into five domains, reflecting the breadth of prosthodontics.
- 📈 Statistical analysis confirmed significant differences in model performance (P<.001).
- 🔍 Contextual benchmarking against human performance provides valuable insights for future AI applications.

📚 Background
The integration of artificial intelligence (AI) in clinical settings has been a topic of growing interest, particularly in specialized fields like dentistry. However, there remains a critical gap in understanding how different architectural philosophies of large language models (LLMs) perform in non-Indo-European languages. This study aims to bridge that gap by evaluating the effectiveness of CoT reasoning models compared with those optimized for local languages, in the context of the Korean Dental Licensing Examination.
🏛️ Study
Conducted with a focus on the prosthodontics section of the KDLE, the study presented 161 Korean-language multiple-choice questions to six LLMs; each test set was posed to each model six times. The questions were categorized into five distinct domains: diagnosis and treatment planning, mandibular movements and occlusal relationships, removable complete denture, removable partial denture, and fixed prosthodontics. The performance of each model was measured by percentage accuracy and analyzed statistically.
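To make the protocol concrete, here is a minimal sketch of how such a benchmark loop could be implemented. The paper does not publish its harness, so everything here is illustrative: `query_model` is a hypothetical placeholder for a real LLM API client, and each question record is assumed to carry its text, answer options, answer key, and domain label.

```python
from collections import defaultdict

def query_model(model: str, question: str, choices: list) -> str:
    """Hypothetical stand-in for a real LLM API call.

    Here it simply returns the first option so the sketch runs end to end;
    swap in an actual client (OpenAI, DeepSeek, Gemini, CLOVA) to mirror
    the study's setup.
    """
    return choices[0]

def run_benchmark(model: str, questions: list, runs: int = 6):
    """Pose every question `runs` times (the study used 6) and return
    percentage accuracy overall and per prosthodontic domain."""
    correct, total = defaultdict(int), defaultdict(int)
    for _ in range(runs):
        for q in questions:
            reply = query_model(model, q["text"], q["choices"])
            total[q["domain"]] += 1
            if reply.strip() == q["answer"]:
                correct[q["domain"]] += 1
    overall = 100 * sum(correct.values()) / sum(total.values())
    per_domain = {d: 100 * correct[d] / total[d] for d in total}
    return overall, per_domain

if __name__ == "__main__":
    # One toy record; the real test set had 161 questions across 5 domains.
    toy = [{"text": "Which connector design ...?",
            "choices": ["A", "B", "C", "D", "E"],
            "answer": "A", "domain": "removable partial denture"}]
    print(run_benchmark("ChatGPT-o1", toy))
```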
📊 Results
The results indicated significant performance differences among the models (P<.001). The CoT-based model, ChatGPT-o1, achieved the highest overall accuracy at 80.54%; the human average of 79.51% fell within this model's 95% confidence interval, making the two contextually comparable. ChatGPT-4o (71.84%) and DeepSeek-R1 (70.19%) also demonstrated passing-level performance, while the Korean-optimized model, CLOVA X, lagged far behind with an accuracy of only 34.37%.
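As a sketch of how this kind of analysis fits together: with each model's per-question correctness coded as 0/1, the omnibus comparison across six related proportions is a Cochran Q test, the pairwise follow-ups are McNemar tests, and the human benchmark reduces to checking whether 79.51% lies inside a 95% confidence interval for the top model's accuracy. The snippet below uses `statsmodels`; the simulated 0/1 matrix is illustrative only (not the study's data), and the Wilson interval is an assumption, since the abstract does not name the CI method.

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.proportion import proportion_confint

MODELS = ["ChatGPT-o1", "ChatGPT-4o", "DeepSeek-R1",
          "DeepSeek-V3", "Gemini 1.5 Flash", "CLOVA X"]
N_QUESTIONS = 161

# Simulated correctness matrix (questions x models) -- NOT the study's data.
rng = np.random.default_rng(0)
p_correct = [0.81, 0.72, 0.70, 0.65, 0.60, 0.34]
correct = rng.binomial(1, p_correct, size=(N_QUESTIONS, len(MODELS)))

# Omnibus test: do the six related proportions differ? (Cochran Q)
q_res = cochrans_q(correct)
print(f"Cochran Q = {q_res.statistic:.2f}, P = {q_res.pvalue:.4g}")

# Post hoc pairwise McNemar tests on the paired 0/1 outcomes
# (apply a multiple-comparison correction across the 15 pairs if desired).
for i in range(len(MODELS)):
    for j in range(i + 1, len(MODELS)):
        a, b = correct[:, i], correct[:, j]
        table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
                 [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
        res = mcnemar(table, exact=True)
        print(f"{MODELS[i]} vs {MODELS[j]}: P = {res.pvalue:.4g}")

# Contextual benchmark: does the human average sit inside the top model's 95% CI?
lo, hi = proportion_confint(correct[:, 0].sum(), N_QUESTIONS,
                            alpha=0.05, method="wilson")
print(f"ChatGPT-o1 95% CI: [{100*lo:.2f}%, {100*hi:.2f}%]; human average = 79.51%")
```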
🚀 Impact and Implications
The findings of this study have important implications for the future of AI in clinical education and practice. The ability of CoT reasoning models to achieve passing-level accuracy on a non-English dental licensing examination suggests that AI can be effectively integrated into specialized fields. However, the variability in performance across model architectures highlights the need for careful consideration when selecting AI tools for clinical applications. This research opens the door for further exploration into how AI can enhance educational outcomes and clinical decision-making in dentistry and beyond.
🔮 Conclusion
This study underscores the potential of Chain-of-Thought reasoning models to achieve high accuracy in specialized clinical domains, particularly in non-English contexts. As AI continues to evolve, understanding the strengths and limitations of different architectural philosophies will be crucial for optimizing their use in healthcare. The future of AI in dentistry looks promising, and ongoing research will be essential to harness its full potential.
Chain-of-Thought reasoning versus linguistic optimization for artificial intelligence models on the prosthodontics section of a dental licensing examination.
Abstract
STATEMENT OF PROBLEM: A critical gap remains in understanding how the architectural philosophies of different large language models (LLMs) perform in specialized clinical domains conducted in non-Indo-European languages, particularly regarding the effectiveness of Chain-of-Thought (CoT) reasoning models compared with those optimized for local languages.
PURPOSE: The purpose of this study was to compare the performance of 6 LLMs with distinct architectural philosophies (CoT-reasoning, general-purpose, and Korean-optimized) on the prosthodontics section of the Korean Dental Licensing Examination (KDLE) and to contextualize their performance with published human averages.
MATERIAL AND METHODS: A total of 161 Korean-language, text-only multiple-choice questions from the prosthodontics section of the KDLE (2020-2024) were presented to 6 LLMs (ChatGPT-o1, ChatGPT-4o, DeepSeek-R1, DeepSeek-V3, Gemini 1.5 Flash, and CLOVA X). Each test set was posed 6 times. The questions were further classified into 5 domains: diagnosis and treatment planning, mandibular movements and occlusal relationships, removable complete denture, removable partial denture, and fixed prosthodontics. Performance was measured by percentage accuracy and analyzed using the Cochran Q and post hoc McNemar tests (α=.05). LLM scores were contextually benchmarked against the average performance of human examinees.
RESULTS: Significant performance differences were observed among the models (P<.001). The CoT-based model, ChatGPT-o1, achieved the highest overall accuracy (80.54%); the total human average (79.51%) fell within this LLM's 95% confidence interval. ChatGPT-4o (71.84%) and DeepSeek-R1 (70.19%) also demonstrated consistent passing-level performance. The Korean-language-optimized model, CLOVA X, obtained the lowest score (34.37%). The performance ranking of the models within individual domains generally mirrored the overall ranking.
CONCLUSIONS: LLMs with CoT-reasoning architectures can achieve passing-level accuracy on non-English dental licensing examinations at a level contextually comparable to the human average, but performance varied significantly by architecture, and localized language optimization did not ensure domain expertise.
Authors: Hlaing NHMM, Park K, Hahn S, Lee SY, Yeo IL, Lee JH
Journal: J Prosthet Dent
Citation: Hlaing NHMM, et al. Chain-of-Thought reasoning versus linguistic optimization for artificial intelligence models on the prosthodontics section of a dental licensing examination. J Prosthet Dent. 2025. doi: 10.1016/j.prosdent.2025.10.004