๐Ÿง‘๐Ÿผโ€๐Ÿ’ป Research - May 6, 2026

Evaluating the clinical safety of large language models in oral cancer-related patient communication: a repeated-prompt observational study.


⚡ Quick Summary

This study evaluated the clinical safety of two large language models (LLMs), Google Gemini and xAI Grok, in providing information related to oral cancer. The findings revealed that while both models demonstrated acceptable levels of scientific accuracy and referral safety, variability in their responses highlighted potential clinical risks.

๐Ÿ” Key Details

  • 📊 Study Duration: 7 days
  • 🧩 Patient Scenarios: 20 standardized Turkish-language scenarios related to oral cancer
  • ⚙️ Models Evaluated: Google Gemini (Pro version) and xAI Grok (Grok-1)
  • 🏆 Evaluation Metrics: Scientific accuracy, completeness, readability, and referral safety

🔑 Key Takeaways

  • 📊 Both models showed comparable scientific accuracy (Gemini: 3.52; Grok: 3.39).
  • 💡 Referral safety was high, with Gemini at 90.0% and Grok at 92.1%.
  • 👩‍🔬 Grok consistently recommended referral in all high-risk scenarios, unlike Gemini.
  • 🏆 Readability scores were similar, but Grok produced longer sentences, indicating greater linguistic complexity.
  • 🤖 Internal consistency was high for both models (Gemini α = 0.942; Grok α = 0.886).
  • 🌍 Inter-model agreement was moderate (ICC: 0.50–0.58).
  • ⚠️ Variability in referral behavior raises concerns about clinical risk in high-stakes scenarios.
  • 🔍 Further research is needed to enhance AI-assisted healthcare communication.

📚 Background

As patients increasingly turn to large language models (LLMs) for health-related information, it is crucial to assess the clinical safety of these AI-generated responses, especially in sensitive areas like oral oncology. Despite the growing interest in AI applications in medicine, there is limited evidence regarding the consistency and safety of these responses in patient communication.

๐Ÿ—’๏ธ Study

This observational study was conducted over seven days: researchers submitted 20 standardized Turkish-language scenarios involving suspected oral cancer to both Google Gemini and xAI Grok once per day, yielding 280 responses in total (20 scenarios × 7 days × 2 models). Each response was evaluated for scientific accuracy, completeness, readability, and referral safety by two independent oral and maxillofacial radiologists.
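The 280-response total follows directly from this design; a quick enumeration sketch (identifiers are illustrative, not taken from the paper's materials):

```python
# 20 scenarios, re-submitted once per day for 7 days, to each of 2 models.
SCENARIOS = 20
DAYS = 7
MODELS = ("Gemini", "Grok")

responses = [(model, day, scenario)
             for model in MODELS
             for day in range(1, DAYS + 1)
             for scenario in range(1, SCENARIOS + 1)]

print(len(responses))  # 280, matching the study's response count
```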

📈 Results

The results indicated that both models had comparable levels of scientific accuracy and completeness. However, while the overall referral safety was high, the Gemini model did not recommend professional consultation in two high-risk scenarios involving suspected malignancy, raising concerns about its clinical applicability. Grok, on the other hand, consistently recommended referral across all scenarios.

๐ŸŒ Impact and Implications

The findings from this study underscore the potential of LLMs to assist in patient education and triage in oral cancer-related communication. However, the variability in referral behavior and the occasional under-referral in high-risk scenarios highlight the need for caution. These systems should be viewed as adjunctive tools rather than replacements for professional evaluation, emphasizing the importance of further research to optimize their use in healthcare communication.

🔮 Conclusion

This study illustrates the generally acceptable accuracy of contemporary LLMs in oral cancer-related communication, while also revealing significant areas for improvement. As AI continues to evolve, it is essential to balance clinical caution with the need for appropriate guidance in AI-assisted healthcare communication. The future of AI in healthcare holds promise, but ongoing research is vital to ensure patient safety and effective communication.

💬 Your comments

What are your thoughts on the use of AI in healthcare communication? Do you believe LLMs can enhance patient education while ensuring safety? Let’s start a conversation! 💬 Leave your thoughts in the comments below or connect with us on social media.

Evaluating the clinical safety of large language models in oral cancer-related patient communication: a repeated-prompt observational study.

Abstract

BACKGROUND: As patients increasingly consult large language models (LLMs) for health-related information, evaluating the clinical safety of AI-generated responses has become essential, particularly in high-risk domains such as oral oncology. Despite growing interest in AI applications in medicine, evidence regarding response consistency and referral safety in patient communication remains limited. This study aimed to assess the clinical safety of two contemporary LLMs in oral cancer-related patient scenarios using a multidimensional evaluation framework.
METHODS: This repeated-prompt observational study evaluated Google Gemini (Pro version) and xAI Grok (Grok-1) over a 7-day period. Twenty standardized Turkish-language patient scenarios related to suspected oral cancer were submitted daily to each model, generating 280 responses. Scientific accuracy and completeness were assessed using a 5-point Likert scale by two independent oral and maxillofacial radiologists. Readability was evaluated using validated Turkish indices (Ateşman and Bezirci-Yılmaz). Referral safety was assessed as a binary outcome. Internal consistency across repeated prompts was measured using Cronbach’s alpha, and inter-model agreement was analyzed using intraclass correlation coefficients (ICC).
RESULTS: Both models demonstrated comparable levels of scientific accuracy (Gemini: 3.52 ± 0.57; Grok: 3.39 ± 0.68; p = 0.072) and completeness (3.40 ± 0.70 vs. 3.25 ± 0.78; p = 0.091). Overall referral safety was high (Gemini: 90.0%; Grok: 92.1%), although the Gemini model failed to recommend professional consultation in two high-risk scenarios involving suspected malignancy. In contrast, Grok consistently recommended referral across all scenarios. Readability scores were similar between models; however, Grok generated significantly longer sentences (p = 0.0005; Cohen’s d = 2.50), indicating increased linguistic complexity. Internal consistency was high for both models (Gemini α = 0.942; Grok α = 0.886), whereas inter-model agreement was moderate (ICC: 0.50–0.58).
CONCLUSIONS: Contemporary LLMs demonstrate generally acceptable accuracy and a precautionary approach in oral cancer-related communication. However, variability in referral behavior and linguistic structure, along with occasional under-referral in high-risk scenarios, highlights potential clinical risks. While these systems may support patient education and triage, they should be considered adjunctive tools and not substitutes for professional evaluation. Further research is needed to optimize the balance between clinical caution and appropriate guidance in AI-assisted healthcare communication.
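The abstract's readability analysis relies on the Ateşman index, a Turkish adaptation of the Flesch formula. The sketch below uses the commonly cited coefficients; the vowel-based syllable counter and naive sentence splitter are simplifying assumptions for illustration, not the authors' actual tooling.

```python
TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def count_syllables(word: str) -> int:
    # In Turkish, syllable count generally equals the number of vowels.
    return sum(ch in TURKISH_VOWELS for ch in word) or 1

def atesman_score(text: str) -> float:
    # Ateşman (1997): 198.825 - 40.175*(syllables/word) - 2.610*(words/sentence)
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = text.split()
    avg_syllables = sum(count_syllables(w) for w in words) / len(words)
    avg_words = len(words) / len(sentences)
    return 198.825 - 40.175 * avg_syllables - 2.610 * avg_words

print(round(atesman_score("Bu bir deneme cümlesidir."), 1))
```

Higher scores indicate easier text, so longer, more complex sentences (as Grok produced) pull the score down.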

Authors: Kollayan BY, Cebeci T

Journal: BMC Oral Health

Citation: Kollayan BY, Cebeci T. Evaluating the clinical safety of large language models in oral cancer-related patient communication: a repeated-prompt observational study. BMC Oral Health. 2026; (unknown volume):(unknown pages). doi: 10.1186/s12903-026-08530-0

