๐Ÿง‘๐Ÿผโ€๐Ÿ’ป Research - May 3, 2026

Evaluating large language models for orthodontic consultation in patients with periodontitis: a study of reliability, quality, and readability.

🌟 Stay Updated!
Join AI Health Hub to receive the latest insights in health and AI.

⚡ Quick Summary

This study evaluated the performance of five large language models (LLMs) in providing orthodontic consultation for patients with periodontitis. Notably, Grok-3 excelled in reliability and quality, while DeepSeek-V3 was the most readable, highlighting the potential of AI in patient education.

๐Ÿ” Key Details

  • 📊 Models evaluated: ChatGPT-4o, DeepSeek-V3, Claude-Sonnet-4, Gemini-2.0 Flash, Grok-3
  • 🔍 Evaluation metrics: Reliability (mDISCERN), Quality (GQS), Readability (FRE, FKGL)
  • 📅 Question set: 30 frequently asked questions sourced from social media and health websites
  • 📈 Statistical analysis: Linear mixed-effects models with Bonferroni's adjustment
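With five models, the post-hoc step implies 10 pairwise comparisons, each penalized by the Bonferroni correction. The sketch below is an illustration of that arithmetic, not the authors' code; the `raw_p` values are invented for demonstration.

```python
from itertools import combinations

models = ["ChatGPT-4o", "DeepSeek-V3", "Claude-Sonnet-4",
          "Gemini-2.0 Flash", "Grok-3"]

# Five models yield C(5, 2) = 10 pairwise comparisons.
pairs = list(combinations(models, 2))

def bonferroni_adjust(p_values, alpha=0.05):
    """Multiply each raw p-value by the number of comparisons (capped at 1.0)
    and flag which comparisons remain significant at the family-wise alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant

# Hypothetical raw p-values, one per model pair (for illustration only).
raw_p = [0.0001, 0.004, 0.03, 0.20, 0.0008,
         0.01, 0.45, 0.002, 0.06, 0.0005]
adjusted, significant = bonferroni_adjust(raw_p)
```

The correction simply trades power for control of the family-wise error rate: a raw p-value must fall below 0.05/10 = 0.005 to survive all ten comparisons.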

🔑 Key Takeaways

  • 📊 Grok-3 achieved the highest reliability (mDISCERN: 4.20) and quality (GQS: 4.38).
  • 📉 Claude-Sonnet-4 scored the lowest in both reliability (mDISCERN: 3.54) and quality (GQS: 3.63).
  • 📖 DeepSeek-V3 was rated the most readable (FRE: 33.61; FKGL: 10.10).
  • 📉 Claude-Sonnet-4 was the least readable (FRE: 4.73; FKGL: 13.72).
  • 📚 All models produced responses with university-level readability.
  • ⚠️ Caution advised: these models should be treated as supplementary resources due to misinformation risks.
  • 🌍 Published in: BMC Oral Health.
  • 🆔 PMID: 42062980.

📚 Background

The integration of large language models (LLMs) into healthcare has opened new avenues for patient education, particularly in specialized fields like orthodontics. Patients with periodontitis often seek guidance on orthodontic treatment, making it crucial to evaluate the reliability and quality of AI-generated responses to ensure they receive accurate and comprehensible information.

๐Ÿ—’๏ธ Study

This study aimed to assess the performance of five publicly accessible LLM-based chatbots in addressing common inquiries from patients with periodontitis. By compiling thirty frequently asked questions from social media and health-related websites, researchers evaluated each model’s responses for reliability, quality, and readability using established metrics.
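The two readability indices used in the study, Flesch Reading Ease and Flesch-Kincaid Grade Level, are closed-form formulas over sentence, word, and syllable counts. A minimal sketch of both, using a naive vowel-group syllable counter (so scores will differ slightly from dedicated readability tools):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups; every word >= 1 syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str):
    """Return (FRE, FKGL) for a passage of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Flesch Reading Ease: higher = easier (60-70 is plain English).
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade required.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return round(fre, 2), round(fkgl, 2)
```

An FRE in the low 30s (DeepSeek-V3's 33.61) corresponds to "difficult" text, and an FKGL above roughly 13 (Claude-Sonnet-4's 13.72) indicates college-level reading, which is how the study concludes that all models exceed patient-education readability recommendations.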

📈 Results

The findings revealed significant performance differences among the evaluated models (P < 0.001). Grok-3 emerged as the top performer in both reliability and quality, while DeepSeek-V3 produced the most readable responses. All models, however, wrote at a university reading level, which exceeds the readability typically recommended for patient education materials.

๐ŸŒ Impact and Implications

The results of this study underscore the potential of LLMs in enhancing patient education, particularly for complex topics like orthodontic treatment for periodontitis. While these models can provide valuable information, it is essential to approach their use with caution, recognizing their limitations and the risks of misinformation. As AI continues to evolve, integrating these technologies into patient care could significantly improve the quality of information available to patients.

🔮 Conclusion

This study highlights the promising role of large language models in orthodontic consultations for patients with periodontitis. With Grok-3 demonstrating higher reliability and quality, and DeepSeek-V3 offering more readable responses, these tools can serve as valuable supplementary resources. However, ongoing research and careful implementation are necessary to ensure their safe and effective use in healthcare settings.

💬 Your comments

What are your thoughts on the use of AI in patient education? Do you believe these models can enhance the quality of information provided to patients? 💬 Share your insights in the comments below.

Evaluating large language models for orthodontic consultation in patients with periodontitis: a study of reliability, quality, and readability.

Abstract

BACKGROUND: This study aimed to evaluate and compare the performance of five publicly accessible large language model (LLM)-based chatbots (ChatGPT-4o, DeepSeek-V3, Claude-Sonnet-4, Gemini-2.0 Flash, and Grok-3) in addressing inquiries from patients with periodontitis seeking orthodontic treatment. The primary objective was to assess the reliability, quality, and readability of the LLM-generated responses.

METHODS: Thirty frequently asked questions regarding orthodontic treatment for patients with periodontitis were sourced from social media platforms and health-related websites and compiled for this study. Each LLM response was evaluated for reliability using the modified DISCERN (mDISCERN) tool, quality using the Global Quality Score (GQS), and readability using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Differences among models were analysed using linear mixed-effects models, with model treated as a fixed effect and question as a random effect. Post-hoc pairwise comparisons of estimated marginal means were performed with Bonferroni's adjustment. Significance was set at P < 0.05.

RESULTS: Significant performance differences were observed among the evaluated LLMs across all metrics (P < 0.001). Grok-3 provided the highest reliability and quality (mDISCERN: 4.20 ± 0.48; GQS: 4.38 ± 0.61), whereas Claude-Sonnet-4 scored the lowest (mDISCERN: 3.54 ± 0.50; GQS: 3.63 ± 0.59). DeepSeek-V3 was rated as most readable (FRE: 33.61 ± 6.11; FKGL: 10.10 ± 1.14), whereas Claude-Sonnet-4 was the least readable (FRE: 4.73 ± 4.14; FKGL: 13.72 ± 1.22). All models produced responses with university-level readability.

CONCLUSIONS: Grok-3 demonstrates higher reliability and quality, whereas DeepSeek-V3 generates more readable responses. All models exceed recommended readability thresholds for patient education. Given the risks of misinformation and limited readability, these chatbots should be considered supplementary educational resources rather than primary sources of medical information.

Authors: Li J, Ni J, Zheng T, Zuo Z, Wang Y

Journal: BMC Oral Health

Citation: Li J, et al. Evaluating large language models for orthodontic consultation in patients with periodontitis: a study of reliability, quality, and readability. BMC Oral Health. 2026. doi: 10.1186/s12903-026-08448-7

