⚡ Quick Summary
This study assessed the performance of two frontier large language models (LLMs), ChatGPT-5 and Grok-4, in sleep medicine. Both models achieved high final-diagnosis accuracy (92.4% each) but struggled to generate comprehensive differential diagnoses.
📌 Key Details
- 📂 Dataset: 79 clinical vignettes and 897 multiple-choice questions (MCQs)
- 🧩 Evaluation Metrics: Concept-level exact match, precision, recall, F1-score
- ⚙️ Models Tested: ChatGPT-5 and Grok-4
- 📊 Performance: Final diagnosis accuracy: 92.4%; MCQs: ChatGPT-5: 93.0%, Grok-4: 92.8%
🔑 Key Takeaways
- 🤖 Both models achieved high accuracy in sleep medicine tasks requiring knowledge recall.
- 📉 Performance in generating comprehensive differential diagnoses was suboptimal.
- 📊 F1-scores for differential diagnosis: ChatGPT-5 (0.55) and Grok-4 (0.59).
- ⚖️ No significant differences were found between the two models (p > 0.05).
- 💡 Current models may be more reliable for focused knowledge support than broad hypothesis generation.
- 🔮 Future research should explore domain-adapted models for improved diagnostic usefulness.

📖 Background
The integration of artificial intelligence in healthcare is rapidly evolving, with large language models (LLMs) emerging as potential tools for enhancing diagnostic reasoning. In the specialty of sleep medicine, accurate diagnosis is crucial, and the ability of LLMs to assist in this process could significantly impact patient care.
🏛️ Study
This study aimed to evaluate the diagnostic reasoning capabilities of ChatGPT-5 and Grok-4 using a set of clinical vignettes and multiple-choice questions derived from established board review materials. The researchers employed rigorous scoring methods to assess the models’ performance in both final diagnosis and differential diagnosis tasks.
📊 Results
Both ChatGPT-5 and Grok-4 achieved a commendable final diagnosis accuracy of 92.4%. In terms of MCQs, ChatGPT-5 scored 93.0% while Grok-4 followed closely at 92.8%. However, their performance in generating comprehensive differential diagnoses was less impressive, with modest F1-scores indicating room for improvement.
🌍 Impact and Implications
The findings of this study highlight the potential of LLMs in enhancing diagnostic accuracy in sleep medicine. While they excel in knowledge recall and pattern recognition, their limitations in complex clinical reasoning suggest that further advancements are necessary. The exploration of domain-adapted models or clinician-in-the-loop workflows could pave the way for more effective and safe applications in real-world settings.
🔮 Conclusion
This study underscores the promising role of frontier LLMs in the field of sleep medicine, particularly in tasks requiring knowledge recall. However, the challenges in generating comprehensive differential diagnoses indicate that ongoing research and development are essential. As we look to the future, the integration of AI in clinical practice holds great potential for improving patient outcomes, but careful consideration of its limitations is crucial.
💬 Your comments
What are your thoughts on the use of large language models in sleep medicine? Do you believe they can enhance diagnostic processes? Let’s engage in a discussion! 💬 Share your insights in the comments below.
Assessment of frontier Large Language Models in sleep medicine.
Abstract
STUDY OBJECTIVES: To evaluate and compare the performance of two proprietary frontier large language models (LLMs), ChatGPT-5 and Grok-4, on diagnostic reasoning and foundational knowledge tasks within the specialty domain of sleep medicine.
METHODS: The models were evaluated on two tasks: case-based reasoning using 79 clinical vignettes from the AASM Case Book of Sleep Medicine and knowledge assessment using 897 multiple-choice questions (MCQs) from board review materials. For vignettes, final diagnosis was scored by concept-level exact match, and differential diagnosis (DDx) was scored on a fixed top-5 output using concept-level matching with synonym normalization to compute precision, recall, and F1-score. MCQ performance was the proportion correct. Inter-model performance was compared using the Mann-Whitney U test.
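The DDx scoring described above (concept-level matching with synonym normalization over a fixed top-5 output) can be sketched as follows. The synonym map, diagnosis strings, and function names are illustrative placeholders, not taken from the study:

```python
# Minimal sketch of concept-level DDx scoring with synonym normalization.
# SYNONYMS and the example diagnoses below are invented for illustration.

SYNONYMS = {
    "osa": "obstructive sleep apnea",
    "obstructive sleep apnoea": "obstructive sleep apnea",
    "rls": "restless legs syndrome",
}

def normalize(concept: str) -> str:
    """Lowercase, trim, and map known synonyms to a canonical concept."""
    c = concept.strip().lower()
    return SYNONYMS.get(c, c)

def ddx_scores(predicted_top5: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over normalized concept sets."""
    pred = {normalize(c) for c in predicted_top5}
    ref = {normalize(c) for c in gold}
    tp = len(pred & ref)                     # concept-level true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 2 of 5 predictions match the 3 gold concepts after normalization.
p, r, f1 = ddx_scores(
    ["OSA", "insomnia", "narcolepsy", "RLS", "parasomnia"],
    ["obstructive sleep apnea", "narcolepsy", "idiopathic hypersomnia"],
)
```

In practice the study's normalization likely covers far more synonym variants; the set-based matching itself is the core of the precision/recall/F1 computation.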
RESULTS: Both models achieved high accuracy for final diagnosis (92.4% for both; 95% CI 86.4, 98.4) and MCQs (ChatGPT-5: 93.0%; Grok-4: 92.8%). However, performance on generating a comprehensive differential diagnosis was suboptimal, with modest F1-scores for both ChatGPT-5 (0.55 ± 0.20) and Grok-4 (0.59 ± 0.20). There were no statistically significant differences in performance between the two models across any metric (p > 0.05).
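The Mann-Whitney U comparison reported above can be sketched in pure Python using the normal approximation without tie correction; the study's exact procedure may differ, and the per-case score arrays below are invented for illustration:

```python
import math

def average_ranks(values):
    """1-based ranks of a sequence, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the run's first element.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """U statistic and two-sided p-value via the normal approximation.
    No tie correction is applied, so this is a sketch rather than a
    drop-in replacement for a statistics library."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:n1])                  # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2           # U statistic for sample x
    mu = n1 * n2 / 2                      # mean of U under the null
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u1, p

# Invented per-case F1 scores for two models; similar distributions
# should yield a large p-value (no significant difference).
u, p = mann_whitney_u([0.5, 0.6, 0.55, 0.4], [0.52, 0.58, 0.5, 0.45])
```

For real analyses, a library implementation with exact p-values and tie correction (such as SciPy's `mannwhitneyu`) is the safer choice; this sketch just shows what the test computes.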
CONCLUSIONS: Frontier LLMs demonstrated high accuracy in sleep medicine tasks requiring knowledge recall and direct pattern recognition but showed more limited performance in complex clinical reasoning tasks such as generating a comprehensive differential diagnosis. These findings suggest that current general-purpose models may be more reliable for focused knowledge support than for broad hypothesis generation. Future studies should evaluate whether domain-adapted models or clinician-in-the-loop workflows can improve real-world diagnostic usefulness and safety.
Authors: Patel A, Contractor H, Heninger H, Vallamchetla SK, Li P, Tao C, Cheung J
Journal: Front Digit Health
Citation: Patel A, et al. Assessment of frontier Large Language Models in sleep medicine. Front Digit Health. 2026; 8:1769386. doi: 10.3389/fdgth.2026.1769386