๐Ÿง‘๐Ÿผโ€๐Ÿ’ป Research - May 15, 2026

Assessment of frontier Large Language Models in sleep medicine.

🌟 Stay Updated!
Join AI Health Hub to receive the latest insights in health and AI.

⚡ Quick Summary

This study assessed the performance of two frontier large language models (LLMs), ChatGPT-5 and Grok-4, in sleep medicine. Both models demonstrated high accuracy on diagnostic tasks, each reaching a final-diagnosis accuracy of 92.4%, but both struggled to generate comprehensive differential diagnoses.

๐Ÿ” Key Details

  • ๐Ÿ“Š Dataset: 79 clinical vignettes and 897 multiple-choice questions (MCQs)
  • ๐Ÿงฉ Evaluation Metrics: Concept-level exact match, precision, recall, F1-score
  • โš™๏ธ Models Tested: ChatGPT-5 and Grok-4
  • ๐Ÿ† Performance: Final diagnosis accuracy: 92.4%; MCQs: ChatGPT-5: 93.0%, Grok-4: 92.8%

🔑 Key Takeaways

  • 🤖 Both models achieved high accuracy in sleep medicine tasks requiring knowledge recall.
  • 📉 Performance in generating comprehensive differential diagnoses was suboptimal.
  • 📊 F1-scores for differential diagnosis: ChatGPT-5 (0.55) and Grok-4 (0.59).
  • 🔍 No significant differences were found between the two models (p > 0.05).
  • 💡 Current models may be more reliable for focused knowledge support than broad hypothesis generation.
  • 🔮 Future research should explore domain-adapted models for improved diagnostic usefulness.
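The abstract notes that the two models were compared with the Mann-Whitney U test. As a rough illustration of how such a comparison works, here is a minimal pure-Python sketch using a normal approximation for the p-value; the sample data below are invented for demonstration, not taken from the study:

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test with a normal approximation.

    Returns (U, p). Tied values receive averaged ranks; no tie
    correction is applied to the variance, so this is only a sketch.
    """
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank for a tie group
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(ranks[:n1])            # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)       # conventional U statistic
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    # Two-sided p-value via the normal CDF (u <= mu by construction).
    p = 2 * 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return u, min(p, 1.0)
```

With per-vignette scores from two models as `x` and `y`, a p-value above 0.05 — as reported here — means the observed difference is consistent with chance.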

📚 Background

The integration of artificial intelligence in healthcare is rapidly evolving, with large language models (LLMs) emerging as potential tools for enhancing diagnostic reasoning. In the specialty of sleep medicine, accurate diagnosis is crucial, and the ability of LLMs to assist in this process could significantly impact patient care.

🗒️ Study

This study aimed to evaluate the diagnostic reasoning capabilities of ChatGPT-5 and Grok-4 using a set of clinical vignettes and multiple-choice questions derived from established board review materials. The researchers employed rigorous scoring methods to assess the models’ performance in both final diagnosis and differential diagnosis tasks.
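The scoring approach described above — concept-level matching of a fixed top-5 differential against a gold-standard list, with synonym normalization, yielding precision, recall, and F1 — can be sketched roughly as follows. The synonym table and diagnosis names here are invented placeholders, not the study's actual rubric:

```python
def normalize(dx, synonyms):
    """Map a diagnosis string to a canonical concept via a synonym table."""
    dx = dx.strip().lower()
    return synonyms.get(dx, dx)

def ddx_scores(predicted_top5, gold, synonyms):
    """Concept-level precision, recall, and F1 for a top-5 differential."""
    pred = {normalize(d, synonyms) for d in predicted_top5}
    true = {normalize(d, synonyms) for d in gold}
    tp = len(pred & true)                      # concepts the model got right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical synonym table: abbreviations map to canonical concepts.
SYNONYMS = {
    "osa": "obstructive sleep apnea",
    "rls": "restless legs syndrome",
}
```

Averaging F1 across all 79 vignettes would produce summary figures like the 0.55 and 0.59 reported here.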

📈 Results

Both ChatGPT-5 and Grok-4 achieved a final diagnosis accuracy of 92.4% (95% CI 86.4–98.4). On the MCQs, ChatGPT-5 scored 93.0% while Grok-4 followed closely at 92.8%. However, their performance in generating comprehensive differential diagnoses was weaker, with modest F1-scores (0.55 and 0.59, respectively) indicating room for improvement.
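As a quick sanity check on the reported 95% confidence interval (86.4–98.4 for 92.4% accuracy), a standard Wald approximation lands close to, though not exactly on, those bounds — the authors may have used a different interval method. This back-of-the-envelope sketch assumes 73 of the 79 vignettes were diagnosed correctly (73/79 ≈ 92.4%), which is an inference, not a figure stated in the paper:

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lo, hi = wald_ci(73, 79)
print(f"{73/79:.1%} accuracy, 95% CI {lo:.1%} to {hi:.1%}")
```

This yields roughly 86.6% to 98.2%, in the same ballpark as the published interval.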

๐ŸŒ Impact and Implications

The findings of this study highlight the potential of LLMs in enhancing diagnostic accuracy in sleep medicine. While they excel in knowledge recall and pattern recognition, their limitations in complex clinical reasoning suggest that further advancements are necessary. The exploration of domain-adapted models or clinician-in-the-loop workflows could pave the way for more effective and safe applications in real-world settings.

🔮 Conclusion

This study underscores the promising role of frontier LLMs in the field of sleep medicine, particularly in tasks requiring knowledge recall. However, the challenges in generating comprehensive differential diagnoses indicate that ongoing research and development are essential. As we look to the future, the integration of AI in clinical practice holds great potential for improving patient outcomes, but careful consideration of its limitations is crucial.

💬 Your comments

What are your thoughts on the use of large language models in sleep medicine? Do you believe they can enhance diagnostic processes? Let's engage in a discussion! 💬 Share your insights in the comments below.

Assessment of frontier Large Language Models in sleep medicine.

Abstract

STUDY OBJECTIVES: To evaluate and compare the performance of two proprietary frontier large language models (LLMs), ChatGPT-5 and Grok-4, on diagnostic reasoning and foundational knowledge tasks within the specialty domain of sleep medicine.
METHODS: The models were evaluated on two tasks: case-based reasoning using 79 clinical vignettes from the AASM Case Book of Sleep Medicine and knowledge assessment using 897 multiple-choice questions (MCQs) from board review materials. For vignettes, final diagnosis was scored by concept-level exact match, and differential diagnosis (DDx) was scored on a fixed top-5 output using concept-level matching with synonym normalization to compute precision, recall, and F1-score. MCQ performance was the proportion correct. Inter-model performance was compared using the Mann-Whitney U test.
RESULTS: Both models achieved high accuracy for final diagnosis (92.4% for both; 95% CI 86.4, 98.4) and MCQs (ChatGPT-5: 93.0%; Grok-4: 92.8%). However, performance on generating a comprehensive differential diagnosis was suboptimal, with modest F1-scores for both ChatGPT-5 (0.55 ± 0.20) and Grok-4 (0.59 ± 0.20). There were no statistically significant differences in performance between the two models across any metric (p > 0.05).
CONCLUSIONS: Frontier LLMs demonstrated high accuracy in sleep medicine tasks requiring knowledge recall and direct pattern recognition but showed more limited performance in complex clinical reasoning tasks such as generating a comprehensive differential diagnosis. These findings suggest that current general-purpose models may be more reliable for focused knowledge support than for broad hypothesis generation. Future studies should evaluate whether domain-adapted models or clinician-in-the-loop workflows can improve real-world diagnostic usefulness and safety.

Authors: Patel A, Contractor H, Heninger H, Vallamchetla SK, Li P, Tao C, Cheung J

Journal: Front Digit Health

Citation: Patel A, et al. Assessment of frontier Large Language Models in sleep medicine. Front Digit Health. 2026;8:1769386. doi: 10.3389/fdgth.2026.1769386

