๐Ÿง‘๐Ÿผโ€๐Ÿ’ป Research - November 4, 2025

Evaluating the performance of five large language models in answering Delphi consensus questions relating to patellar instability and medial patellofemoral ligament reconstruction.

🌟 Stay Updated!
Join AI Health Hub to receive the latest insights in health and AI.

⚡ Quick Summary

This study evaluated the performance of five large language models (LLMs) in answering questions related to patellar instability and medial patellofemoral ligament reconstruction. The findings revealed that ChatGPT4o and Claude2 provided the most accurate responses, while Google Gemini significantly underperformed.

๐Ÿ” Key Details

  • 📊 Models Tested: ChatGPT4o, Perplexity AI, Bing CoPilot, Claude2, Google Gemini
  • 🧩 Questions: Ten questions from an international Delphi Consensus study
  • ⚙️ Evaluation Method: Responses assessed using the validated Mika score by eight orthopedic surgeons
  • ๐Ÿ† Statistical Analysis: Kruskal-Wallis test, Dunn’s post-hoc tests, Pearson’s chi-square test

🔑 Key Takeaways

  • 📈 ChatGPT4o and Claude2 achieved the highest percentage of “excellent responses” (47.5%).
  • 📉 Google Gemini had the highest percentage of “unsatisfactory responses” (21.3%).
  • ๐Ÿ” Median Mika scores were 2 for ChatGPT4o and Perplexity AI, and 3 for Google Gemini.
  • ๐Ÿค Inter-rater agreement was moderate for all models except Google Gemini.
  • 💡 Findings support the use of LLMs as adjuncts in patient education regarding patellar instability.

📚 Background

The integration of artificial intelligence (AI) in healthcare has gained momentum, particularly with the rise of large language models (LLMs). These models have the potential to enhance patient communication and education. However, ensuring the accuracy of the information they provide is crucial to prevent the dissemination of misinformation.

๐Ÿ—’๏ธ Study

This study aimed to assess the accuracy of five freely accessible LLMs in responding to questions about patellofemoral instability (PFI). Ten questions were selected from a Delphi Consensus study, and responses were evaluated by eight orthopedic surgeons trained in sports medicine. The study employed rigorous statistical methods to compare the performance of the models.
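
As a rough illustration of the chi-square comparison of score distributions, the sketch below builds a hypothetical 5 × 4 contingency table of Mika score counts (80 reviews per model, i.e., ten questions scored by eight reviewers) and runs Pearson's chi-square test with scipy. Only the 38/80 "excellent" counts for ChatGPT4o and Claude2 and the 17/80 "unsatisfactory" count for Google Gemini echo the abstract; the remaining counts are invented.

```python
# Hypothetical contingency table: rows = models, columns = counts of
# Mika scores 1, 2, 3, 4 (each row sums to 80 reviews).
import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([
    [38, 30, 10, 2],   # ChatGPT4o
    [30, 35, 12, 3],   # Perplexity AI
    [25, 33, 16, 6],   # Bing CoPilot
    [38, 28, 11, 3],   # Claude2
    [10, 23, 30, 17],  # Google Gemini
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"Pearson chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```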

📈 Results

The results indicated that ChatGPT4o and Claude2 provided the most satisfactory responses, with the largest share classified as “excellent” (Mika score of 1). In contrast, Google Gemini was significantly less accurate, with the largest share of responses rated unsatisfactory and requiring substantial clarification (Mika score of 4) and a median score indicating that moderate clarification was needed. Inter-rater agreement was moderate for the four better-performing models, while Google Gemini showed no agreement.
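
For readers unfamiliar with the agreement statistic, the following is a simplified, hypothetical sketch of Gwet's AC2 with quadratic weights for a single pair of raters scoring ten responses on the 1-4 Mika scale. The study used eight reviewers and a multi-rater form of the coefficient, so this is only meant to show how the weighted observed agreement and chance agreement enter the formula; the ratings are invented.

```python
# Simplified two-rater Gwet's AC2 with quadratic weights (illustrative only).
import numpy as np

def gwet_ac2_two_raters(r1, r2, categories=(1, 2, 3, 4)):
    q = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # Quadratic weights: w[k, l] = 1 - ((k - l) / (q - 1))**2
    k = np.arange(q)
    w = 1.0 - ((k[:, None] - k[None, :]) / (q - 1)) ** 2
    # Joint distribution of the two raters' scores
    p = np.zeros((q, q))
    for a, b in zip(r1, r2):
        p[idx[a], idx[b]] += 1
    p /= len(r1)
    pa = float((w * p).sum())                   # weighted observed agreement
    pi = (p.sum(axis=1) + p.sum(axis=0)) / 2    # average marginal proportions
    pe = w.sum() / (q * (q - 1)) * float((pi * (1 - pi)).sum())  # chance agreement
    return (pa - pe) / (1 - pe)

rater_a = [1, 2, 2, 1, 3, 2, 2, 1, 2, 2]
rater_b = [2, 2, 1, 1, 3, 2, 3, 1, 2, 2]
print(f"Gwet's AC2 (quadratic weights): {gwet_ac2_two_raters(rater_a, rater_b):.3f}")
```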

๐ŸŒ Impact and Implications

The findings of this study highlight the potential of LLMs in assisting healthcare professionals in delivering accurate information to patients regarding patellar instability. By leveraging these technologies, physicians can enhance patient understanding and engagement, ultimately improving the quality of care. However, the underperformance of models like Google Gemini underscores the need for careful selection and validation of AI tools in clinical settings.

🔮 Conclusion

This study underscores the importance of evaluating AI tools in healthcare, particularly in the context of patient education. The promising performance of ChatGPT4o and Claude2 suggests that LLMs can serve as valuable adjuncts for healthcare providers. Continued research and validation are essential to ensure that these technologies are reliable and effective in enhancing patient care.

💬 Your comments

What are your thoughts on the use of AI in healthcare communication? Do you believe LLMs can significantly improve patient education? 💬 Share your insights in the comments below.

Evaluating the performance of five large language models in answering Delphi consensus questions relating to patellar instability and medial patellofemoral ligament reconstruction.

Abstract

PURPOSE: Artificial intelligence (AI) has become incredibly popular over the past several years, with large language models (LLMs) offering the possibility of revolutionizing the way healthcare information is shared with patients. However, to prevent the spread of misinformation, analyzing the accuracy of answers from these LLMs is essential. This study will aim to assess the accuracy of five freely accessible chatbots by specifically evaluating their responses to questions about patellofemoral instability (PFI). The secondary objective will be to compare the different chatbots, to distinguish which LLM offers the most accurate set of responses.
METHODS: Ten questions were selected from a previously published international Delphi Consensus study pertaining to patellar instability, and posed to ChatGPT4o, Perplexity AI, Bing CoPilot, Claude2, and Google Gemini. Responses were assessed for accuracy using the validated Mika score by eight Orthopedic surgeons who have completed fellowship training in sports-medicine. Median responses amongst the eight reviewers for each question were compared using the Kruskal-Wallis and Dunn's post-hoc tests. Percentages of each Mika score distribution were compared using Pearson's chi-square test. P-values less than or equal to 0.05 were considered significant. The Gwet's AC2 coefficient was calculated to assess for inter-rater agreement, corrected for chance and employing quadratic weights.
RESULTS: ChatGPT4o and Claude2 had the highest percentage of reviews (38/80, 47.5%) considered to be an “excellent response not requiring clarification”, or a Mika score of 1. Google Gemini had the highest percentage of reviews (17/80, 21.3%) considered to be “unsatisfactory requiring substantial clarification”, or a Mika score of 4 (p < 0.001). The median ± interquartile range (IQR) Mika scores were 2 (1) for ChatGPT4o and Perplexity AI, 2 (2) for Bing CoPilot and Claude2, and 3 (2) for Google Gemini. Median responses were not significantly different between ChatGPT4o, Perplexity AI, Bing CoPilot, and Claude2; however, all four statistically outperformed Google Gemini (p < 0.05). Inter-rater agreement was classified as moderate (0.60 > AC2 ≥ 0.40) for ChatGPT, Perplexity AI, Bing CoPilot, and Claude2, while there was no agreement for Google Gemini (AC2 < 0).
CONCLUSION: Current free access LLMs (ChatGPT4o, Perplexity AI, Bing CoPilot, and Claude2) predominantly provide satisfactory responses requiring minimal clarification to standardized questions relating to patellar instability. Google Gemini statistically underperformed in accuracy relative to the other four LLMs, with most answers requiring moderate clarification. Furthermore, inter-rater agreement was moderate for all LLMs apart from Google Gemini, which had no agreement. These findings advocate for the utility of existing LLMs in serving as an adjunct to physicians and surgeons in providing patients information pertaining to patellar instability.
LEVEL OF EVIDENCE: V.

Authors: Vivekanantha P, Cohen D, Slawaska-Eng D, Nagai K, Tarchala M, Matache B, Hiemstra L, Longstaffe R, Lesniak B, Meena A, Tapasvi S, Sillanpää P, Grzela P, Lamanna D, Samuelsson K, de Sa D

Journal: BMC Musculoskelet Disord

Citation: Vivekanantha P, et al. Evaluating the performance of five large language models in answering Delphi consensus questions relating to patellar instability and medial patellofemoral ligament reconstruction. BMC Musculoskelet Disord. 2025;26:1022. doi: 10.1186/s12891-025-09227-1

