๐Ÿง‘๐Ÿผโ€๐Ÿ’ป Research - September 27, 2025

Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method.

🌟 Stay Updated!
Join AI Health Hub to receive the latest insights in health and AI.

⚡ Quick Summary

This study evaluated the diagnostic capabilities of 21 large language models (LLMs) using an LLM-as-a-judge method on real patient data from the MIMIC-IV database. The top performer, Gemini 2.5, achieved a hit rate of 97.4% (as assessed by a GPT-4.1 judge), highlighting the potential of LLMs to enhance diagnostic accuracy in healthcare.

๐Ÿ” Key Details

  • 📊 Dataset: 1,000 randomly selected hospital admissions from MIMIC-IV
  • 🧩 Models evaluated: 21 LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic
  • ⚙️ Methodology: LLM-as-a-judge approach for automated evaluation
  • 🏆 Top performer: Gemini 2.5 with a hit rate of 97.4% (95% CI 97.0%-97.8%)

🔑 Key Takeaways

  • 📊 Diagnostic errors are a leading cause of patient mortality, underscoring the need for improved diagnostic tools.
  • 💡 LLMs can aid in diagnosing comorbid patients, potentially reducing diagnostic errors.
  • 🏆 Gemini 2.5 outperformed the other models, including GPT-4.1 and Claude-4 Opus, when assessed by a GPT-4.1 judge; GPT-4.1 ranked highest when GPT-4 Turbo served as the judge.
  • 🔍 Significant variation in diagnostic hit rates was observed across different prompts.
  • 📈 Retrieval-augmented generation (RAG) improved the hit rate of GPT-4o 05-13 by an average of 0.8% (P<.006).
  • 🤖 Temperature settings generally had little effect on model performance.
  • 🌍 Collaboration with healthcare professionals is essential for further validation of these models.
  • 🔮 Future research should focus on more diverse datasets and real-world hospital pilots.

📚 Background

Diagnostic errors are a significant concern in healthcare, with studies indicating that approximately 1 in 10 patients may die due to such errors. The integration of large language models (LLMs) into clinical practice has been proposed as a potential solution to enhance diagnostic accuracy. However, prior to this study, there was a lack of comprehensive research comparing the diagnostic capabilities of various LLMs on real patient data.

๐Ÿ—’๏ธ Study

This comparative study aimed to benchmark the diagnostic abilities of 21 LLMs using the LLM-as-a-judge method. The researchers utilized data from the MIMIC-IV database, which contains detailed patient records, to assess how well these models could predict diagnoses based on available clinical information. The study involved multiple prompts and temperature settings to evaluate the models’ performance comprehensively.
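
For readers who want to see how such a harness might be wired up, below is a minimal sketch of the benchmarking loop in Python. The model list, prompt texts, and the `call_llm` stub are illustrative assumptions, not the authors' actual code or prompts:

```python
# Minimal sketch of a models x prompts x temperatures benchmarking loop.
# All names here (call_llm, the prompt texts, the model list) are illustrative.
from itertools import product

MODELS = ["gemini-2.5", "gpt-4.1", "claude-4-opus"]  # 3 of the 21 evaluated
PROMPTS = {
    "direct":   "List the most likely diagnoses for this patient:\n{record}",
    "stepwise": "Reason step by step, then list the likely diagnoses:\n{record}",
    "ranked":   "Rank the most likely diagnoses, most probable first:\n{record}",
}
TEMPERATURES = [0.0, 1.0]  # the study compared 2 temperature settings

def call_llm(model: str, prompt: str, temperature: float) -> str:
    """Placeholder: plug in your provider's chat/completion API here."""
    raise NotImplementedError

def run_benchmark(admissions: list[dict]) -> list[dict]:
    """Collect one prediction per (model, prompt, temperature, admission)."""
    results = []
    for model, (pname, ptext), temp in product(MODELS, PROMPTS.items(), TEMPERATURES):
        for adm in admissions:  # e.g. 1000 MIMIC-IV admissions
            prediction = call_llm(model, ptext.format(record=adm["clinical_text"]), temp)
            results.append({"model": model, "prompt": pname, "temperature": temp,
                            "hadm_id": adm["hadm_id"], "prediction": prediction})
    return results
```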

📈 Results

The results revealed that Gemini 2.5 achieved the highest diagnostic hit rate of 97.4%, significantly outperforming GPT-4.1, Claude-4 Opus, and Claude Sonnet when assessed by a GPT-4.1 judge; in a separate set of experiments judged by GPT-4 Turbo, GPT-4.1 ranked highest. Performance varied significantly across the prompts used, while temperature settings had minimal impact on the outcomes. Additionally, RAG led to a small but statistically significant improvement in the hit rate of GPT-4o 05-13.
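
To make the RAG step concrete, here is a hedged sketch of the idea reported for GPT-4o 05-13: retrieve the reference range for each lab test found in a record and prepend those ranges to the prompt, so the model can tell whether values are abnormal. The ranges and function names below are illustrative placeholders, not the American Board of Internal Medicine's actual tables or the authors' implementation:

```python
# Illustrative sketch of reference-range RAG; the ranges are placeholders.
REFERENCE_RANGES = {
    "sodium":     "136-145 mEq/L",
    "potassium":  "3.5-5.0 mEq/L",
    "creatinine": "0.7-1.3 mg/dL",
}

def augment_with_ranges(record_text: str, labs: list[str]) -> str:
    """Prepend retrieved reference ranges for the labs present in the record."""
    retrieved = [f"- {lab}: normal range {REFERENCE_RANGES[lab]}"
                 for lab in labs if lab in REFERENCE_RANGES]
    context = "Reference ranges:\n" + "\n".join(retrieved)
    return f"{context}\n\nPatient record:\n{record_text}"

# The augmented prompt then goes to the predictor LLM instead of the raw record.
```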

๐ŸŒ Impact and Implications

The findings from this study underscore the potential of LLMs to transform diagnostic practices in healthcare. By leveraging advanced AI technologies, clinicians could improve the accuracy of diagnoses, ultimately leading to better patient outcomes. Integrating LLMs into clinical workflows could reduce the incidence of diagnostic errors, thereby improving overall patient safety and care quality.

🔮 Conclusion

This study highlights the promising role of large language models in diagnosing comorbid patients. With a top-performing model achieving a hit rate of 97.4%, the future of AI in healthcare looks bright. Continued research and collaboration with healthcare professionals are essential to fully realize the potential of these technologies in clinical settings. We encourage further exploration into the capabilities of LLMs to enhance diagnostic accuracy and patient care.

💬 Your comments

What are your thoughts on the use of AI in healthcare diagnostics? We would love to hear your insights! 💬 Leave your comments below.

Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method.

Abstract

BACKGROUND: On average, 1 in 10 patients die because of a diagnostic error, and medical errors represent the third largest cause of death in the United States. While large language models (LLMs) have been proposed to aid doctors in diagnoses, no research results have been published comparing the diagnostic abilities of many popular LLMs on a large, openly accessible real-patient cohort.
OBJECTIVE: In this study, we set out to compare the diagnostic ability of 18 LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic, using 3 prompts, 2 temperature settings, and 1000 randomly selected Medical Information Mart for Intensive Care-IV (MIMIC-IV) hospital admissions. We also explore improving the diagnostic hit rate of GPT-4o 05-13 with retrieval-augmented generation (RAG) by utilizing reference ranges provided by the American Board of Internal Medicine.
METHODS: We evaluated the diagnostic ability of 21 LLMs, using an LLM-as-a-judge approach (an automated, LLM-based evaluation) on MIMIC-IV patient records, which contain final diagnostic codes. For each case, a separate assessor LLM (“judge”) compared the predictor LLM’s diagnostic output to the true diagnoses from the patient record. The assessor determined whether each true diagnosis was inferable from the available data and, if so, whether it was correctly predicted (“hit”) or not (“miss”). Diagnoses not inferable from the patient record were excluded from the hit rate analysis. The reported hit rate was defined as the number of hits divided by the total number of hits and misses. The statistical significance of the differences in model performance was assessed using a pooled z-test for proportions.
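
As a rough illustration of this scoring pipeline, the stdlib-only Python sketch below computes a hit rate with a normal-approximation 95% CI and a pooled two-proportion z-test, assuming the judge's hit/miss/not-inferable labels have already been collected; it is not the authors' code:

```python
import math

def hit_rate(labels: list[str]) -> tuple[float, tuple[float, float]]:
    """Hit rate over inferable diagnoses, with a normal-approximation 95% CI."""
    hits, misses = labels.count("hit"), labels.count("miss")
    n = hits + misses                      # "not_inferable" labels are excluded
    p = hits / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, (p - half, p + half)

def pooled_z_test(hits_a: int, n_a: int, hits_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided pooled z-test for the difference between two hit rates."""
    p_pool = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal CDF
    return z, p_value
```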
RESULTS: Gemini 2.5 was the top performer with a hit rate of 97.4% (95% CI 97.0%-97.8%) as assessed by GPT-4.1, significantly outperforming GPT-4.1, Claude-4 Opus, and Claude Sonnet. However, GPT-4.1 ranked the highest in a separate set of experiments evaluated by GPT-4 Turbo, which tended to be less conservative than GPT-4.1 in its assessments. Significant variation in diagnostic hit rates was observed across different prompts, while changes in temperature generally had little effect. Finally, RAG significantly improved the hit rate of GPT-4o 05-13 by an average of 0.8% (P<.006).
CONCLUSIONS: While the results are promising, more diverse datasets and hospital pilots, as well as close collaborations with physicians, are needed to obtain a better understanding of the diagnostic abilities of these models.

Authors: Sarvari P, Al-Fagih Z

Journal: JMIRx Med

Citation: Sarvari P, Al-Fagih Z. Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method. JMIRx Med. 2025;6:e67661. doi: 10.2196/67661
