๐Ÿง‘๐Ÿผโ€๐Ÿ’ป Research - May 18, 2025

Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study.


⚡ Quick Summary

This study evaluated the confidence levels of 12 large language models (LLMs) in answering clinical questions across five medical specialties. The findings revealed a significant inverse correlation between model accuracy and confidence, highlighting potential challenges in their clinical application.

๐Ÿ” Key Details

  • 📊 Dataset: 1965 multiple-choice questions
  • 🧩 Medical specialties: Internal medicine, obstetrics and gynecology, psychiatry, pediatrics, general surgery
  • ⚙️ Models evaluated: 12 LLMs, including GPT-4o and Qwen2-7B
  • 🏆 Key metrics: Mean accuracy and confidence scores

🔑 Key Takeaways

  • 📉 Inverse correlation found between confidence scores and model accuracy (r=-0.40; P=.001).
  • 🏅 Top-performing model GPT-4o achieved a mean accuracy of 74% with a mean confidence of 63%.
  • 📉 Low-performing model Qwen2-7B had a mean accuracy of 46% but a mean confidence of 76%.
  • 🔍 Minimal variation in confidence between correct and incorrect answers (mean differences of 0.6% to 5.4% across models).
  • 💡 Highest mean difference in confidence for GPT-4o at 5.4% (SD 2.3%; P=.003).
  • ⚠️ Overconfidence in lower-performing models raises concerns for clinical use.
  • 🔄 Recommendations include refining calibration methods and involving human oversight.
  • 🔬 Further research is essential for improving LLMs before clinical adoption.

📚 Background

The use of large language models in the biomedical field is rapidly evolving, yet their ability to accurately assess their own confidence in clinical scenarios remains largely unexamined. Understanding how these models gauge their responses is crucial for ensuring their safe and effective integration into healthcare settings.

๐Ÿ—’๏ธ Study

This cross-sectional evaluation study aimed to benchmark the confidence of various LLMs in answering clinical questions. Researchers utilized a comprehensive dataset of 1965 multiple-choice questions spanning five medical specialties, prompting models to provide both answers and confidence scores ranging from 0% to 100%.
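To make the setup concrete, here is a minimal sketch of what such an evaluation loop could look like. The prompt wording, the `query_model` helper, and the expected response format are illustrative assumptions on our part, not the authors' actual protocol:

```python
# Illustrative sketch of the evaluation loop; the prompt wording and
# query_model() are assumptions, not the study's actual protocol.
import re

PROMPT = (
    "Answer the multiple-choice question below. Reply with the letter of "
    "your answer and your confidence that it is correct (0-100%).\n\n"
    "{question}\n\nFormat: Answer: <letter>, Confidence: <number>%"
)

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    raise NotImplementedError

def evaluate(model: str, questions: list[dict]) -> list[dict]:
    """Collect correctness and self-reported confidence per question."""
    records = []
    for q in questions:
        reply = query_model(model, PROMPT.format(question=q["text"]))
        match = re.search(r"Answer:\s*([A-E]).*?Confidence:\s*(\d+)", reply, re.S)
        if match:
            records.append({
                "correct": match.group(1) == q["answer"],
                "confidence": int(match.group(2)),
            })
    return records
```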

📈 Results

The analysis revealed a statistically significant inverse correlation between the mean confidence scores for correct answers and the overall accuracy of the models. For example, while GPT-4o demonstrated a commendable mean accuracy of 74%, it maintained a mean confidence of only 63%. In contrast, the less accurate Qwen2-7B exhibited a higher confidence level of 76%, despite its lower accuracy of 46%.
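The reported statistics map onto a straightforward SciPy pipeline: a Pearson correlation across models, and a two-sample, two-tailed t test within each model. The sketch below shows the computation only; all numbers are placeholders, not the study's data:

```python
# Sketch of the statistical analysis: Pearson correlation between each
# model's accuracy and its mean confidence on correct answers, plus a
# two-sample, two-tailed t test comparing confidence on correct vs.
# incorrect answers. All numbers are placeholders, not the study's data.
from scipy import stats

# One (accuracy, mean confidence on correct answers) pair per model.
accuracy = [74, 46, 60, 68, 55]          # % -- placeholder values
confidence = [63, 76, 70, 66, 72]        # % -- placeholder values

r, p = stats.pearsonr(confidence, accuracy)
print(f"r = {r:.2f}, P = {p:.3f}")       # the study reports r=-0.40, P=.001

# Per-question confidence for one model, split by correctness.
conf_correct = [68, 71, 65, 70]          # % -- placeholder values
conf_incorrect = [62, 66, 60, 64]        # % -- placeholder values
t, p = stats.ttest_ind(conf_correct, conf_incorrect)
print(f"t = {t:.2f}, P = {p:.3f}")
```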

๐ŸŒ Impact and Implications

The findings of this study underscore the importance of aligning confidence levels with actual performance in LLMs. The tendency for lower-performing models to exhibit higher confidence could pose risks in clinical decision-making. Addressing these discrepancies through improved calibration and oversight is vital for the safe deployment of LLMs in healthcare environments.
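One common way to quantify the kind of miscalibration discussed here is the expected calibration error (ECE), which compares average confidence with observed accuracy within confidence bins. A minimal sketch follows; this is our illustration of the general technique, not a method from the paper:

```python
# Minimal expected calibration error (ECE) sketch: bin predictions by
# confidence, then average the |accuracy - confidence| gap per bin,
# weighted by bin size. Our illustration, not a method from the paper.
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    confidence = np.asarray(confidence, dtype=float)  # in [0, 1]
    correct = np.asarray(correct, dtype=float)        # 1 if right, else 0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap                # weight by bin share
    return ece

# Overconfident toy example: high confidence, 50% accuracy -> large ECE.
print(expected_calibration_error([0.9, 0.85, 0.8, 0.95], [1, 0, 0, 1]))
```

A well-calibrated model drives this value toward zero; the pattern reported for Qwen2-7B (76% mean confidence at 46% accuracy) would surface as a large gap.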

🔮 Conclusion

This study highlights the critical need for further research into the confidence assessment capabilities of LLMs in clinical contexts. Although better-performing models show more aligned confidence overall, even the most accurate ones display limited variation in confidence between correct and incorrect answers. Enhancing these models through targeted calibration strategies will be essential for their broader clinical adoption.

💬 Your comments

What are your thoughts on the confidence levels of large language models in clinical settings? Let's engage in a discussion! 💬 Share your insights in the comments below or connect with us on social media.

Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study.

Abstract

BACKGROUND: The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored.
OBJECTIVE: This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses.
METHODS: We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%-100%). We calculated the correlation between each model’s mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests.
RESULTS: The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model, GPT-4o, had a mean accuracy of 74% (SD 9.4%) with a mean confidence of 63% (SD 8.3%), whereas a low-performing model, Qwen2-7B, showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003).
CONCLUSIONS: Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.

Authors: Omar M, Agbareia R, Glicksberg BS, Nadkarni GN, Klang E

Journal: JMIR Med Inform

Citation: Omar M, et al. Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study. JMIR Med Inform. 2025;13:e66917. doi: 10.2196/66917

