⚡ Quick Summary
This study evaluated the diagnostic accuracy of two large language models (LLMs), ChatGPT (GPT-4) and DeepSeek, in pediatric patients with genetically confirmed skeletal dysplasias. Both models reached a top-3 diagnostic accuracy of roughly 62-64%, below the 82.2% achieved by a clinical expert panel, which points to a supportive rather than standalone role for AI in rare disease diagnostics.
🔍 Key Details
- 👶 Patient Cohort: 45 children with confirmed skeletal dysplasias
- 🧠 AI Models: ChatGPT (GPT-4) and DeepSeek
- 👨‍⚕️ Expert Panel: Pediatric endocrinologist and pediatric orthopedic surgeon
- 📊 Diagnostic Accuracy: ChatGPT 62.2% and DeepSeek 64.4% (top-3), Expert Panel 82.2%
🔑 Key Takeaways
- 🤖 AI Integration: LLMs can complement human expertise in diagnosing rare diseases.
- 📈 High Intermodel Agreement: ChatGPT and DeepSeek showed a Cohen’s κ of 0.95.
- 🔍 Diagnostic Challenges: LLMs performed well on more common disorders but struggled with ultra-rare and multisystemic conditions.
- 🏆 Expert Performance: The clinical expert panel outperformed both LLMs with an accuracy of 82.2%.
- 💡 Unique Case: DeepSeek identified a correct diagnosis in a complex case missed by experts.
- 🌍 Resource Implications: AI tools may enhance diagnostics in under-resourced medical settings.
- 🧬 Molecular Diagnosis: Used as the gold standard for comparison in the study.
📚 Background
Skeletal dysplasias are a group of rare genetic disorders characterized by abnormal bone and cartilage development. These conditions present significant diagnostic challenges due to their heterogeneous nature and overlapping clinical features. Traditional diagnostic methods often rely heavily on clinical expertise, which can lead to delays in diagnosis and treatment, particularly in pediatric patients.
🗒️ Study
This prospective vignette-based benchmarking study was conducted with a cohort of 45 children diagnosed with skeletal dysplasias across two tertiary centers. The researchers aimed to assess the diagnostic capabilities of two advanced LLMs, ChatGPT and DeepSeek, by prompting them with standardized clinical case vignettes. Their outputs were then compared to those of a clinical expert panel, which included a pediatric endocrinologist and a pediatric orthopedic surgeon.
📈 Results
The study found that both ChatGPT and DeepSeek achieved comparable diagnostic top-3 accuracy rates of 62.2% and 64.4%, respectively. The expert panel, however, outperformed both models with an accuracy of 82.2%. Notably, while LLMs performed well with more common disorders, they faced difficulties with ultra-rare and multisystemic conditions. Interestingly, in one complex case that the experts missed, DeepSeek successfully proposed the correct diagnosis.
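The two statistics reported above, top-3 accuracy and Cohen's κ, can be sketched in a few lines of Python. This is only an illustration, not the authors' analysis code. The counts of 28 and 29 correct cases out of 45 are inferred from the reported percentages (37 of 45 would likewise give the panel's 82.2%), the per-case outcomes are hypothetical, and computing κ on correct/incorrect labels per case is an assumption, since the summary does not say which categories the agreement was measured on.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int = 3) -> float:
    """Fraction of cases whose confirmed diagnosis is among the first k suggestions."""
    hits = sum(1 for preds, gold in zip(ranked, truth) if gold in preds[:k])
    return hits / len(truth)


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same cases."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance: both True or both False
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example with placeholder labels: case 1 is a hit at rank 3,
# case 2 has the confirmed diagnosis at rank 4, so it counts as a miss.
toy_ranked = [["dx_a", "dx_b", "dx_c"], ["dx_d", "dx_e", "dx_f", "dx_g"]]
toy_truth = ["dx_c", "dx_g"]
print(top_k_accuracy(toy_ranked, toy_truth))  # 0.5

# Hypothetical per-case outcomes (True = confirmed diagnosis in the top 3).
# The two models disagree on exactly one case in this made-up configuration.
model_a = [True] * 28 + [False] * 17
model_b = [True] * 29 + [False] * 16

print(round(100 * sum(model_a) / len(model_a), 1))  # 62.2
print(round(100 * sum(model_b) / len(model_b), 1))  # 64.4
print(round(cohen_kappa(model_a, model_b), 2))      # 0.95
```

Under these assumptions, a single disagreement across 45 cases is enough to give κ ≈ 0.95, which is consistent with the reported value but does not prove that this was the actual pattern of agreement.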
🌍 Impact and Implications
The findings of this study suggest that LLMs can serve as valuable tools in the diagnostic process for skeletal dysplasias, particularly in settings where resources are limited. By integrating AI into multidisciplinary diagnostic workflows, healthcare providers may enhance early recognition of these rare diseases, ultimately reducing diagnostic delays and improving patient outcomes.
🔮 Conclusion
This study highlights the promising role of artificial intelligence in the diagnosis of rare genetic disorders such as skeletal dysplasias. While LLMs like ChatGPT and DeepSeek show potential as supportive diagnostic tools, the expertise of clinical professionals remains crucial. Future research should focus on refining these AI models and exploring their integration into clinical practice to enhance diagnostic accuracy and patient care.
The Artificial Intelligence-Assisted Diagnosis of Skeletal Dysplasias in Pediatric Patients: A Comparative Benchmark Study of Large Language Models and a Clinical Expert Group.
Abstract
BACKGROUND/OBJECTIVES: Skeletal dysplasias are a heterogeneous group of rare genetic disorders with diverse and overlapping clinical presentations, posing diagnostic challenges even for experienced clinicians. With the increasing availability of artificial intelligence (AI) in healthcare, large language models (LLMs) offer a novel opportunity to assist in rare disease diagnostics. This study aimed to compare the diagnostic accuracy of two advanced LLMs, ChatGPT (version GPT-4) and DeepSeek, with that of a clinical expert panel in a cohort of pediatric patients with genetically confirmed skeletal dysplasias.
METHODS: We designed a prospective vignette-based diagnostic benchmarking study including 45 children with confirmed skeletal dysplasias from two tertiary centers. Both LLMs were prompted to provide primary and differential diagnoses based on standardized clinical case vignettes. Their outputs were compared with those of two human experts (a pediatric endocrinologist and a pediatric orthopedic surgeon), using molecular diagnosis as the gold standard.
RESULTS: ChatGPT and DeepSeek achieved a comparable diagnostic top-3 accuracy (62.2% and 64.4%, respectively), with a high intermodel agreement (Cohen’s κ = 0.95). The expert panel outperformed both models (82.2%). While LLMs performed well on more common disorders, they struggled with ultra-rare and multisystemic conditions. In one complex case missed by experts, the DeepSeek model successfully proposed the correct diagnosis.
CONCLUSIONS: LLMs offer a complementary diagnostic value in skeletal dysplasias, especially in under-resourced medical settings. Their integration as a supportive tool in multidisciplinary diagnostic workflows may enhance early recognition and reduce diagnostic delays in rare disease care.
Authors: Ilić N, Marić N, Cvetković D, Bogosavljević M, Bukara-Radujković G, Krstić J, Paunović Z, Begović N, Panić Zarić S, Todorović S, Mitrović K, Vlahović A, Sarajlija A
Journal: Genes (Basel)
Citation: Ilić N, et al. The Artificial Intelligence-Assisted Diagnosis of Skeletal Dysplasias in Pediatric Patients: A Comparative Benchmark Study of Large Language Models and a Clinical Expert Group. Genes (Basel). 2025; 16:(unknown pages). doi: 10.3390/genes16070762