🧑🏼‍💻 Research - December 24, 2024

Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.


⚡ Quick Summary

This study evaluated GPT-4 1106 Vision Preview and Bard Gemini Pro on medical visual question answering. GPT-4 answered 56.9% of questions correctly, outperforming Bard at 44.6%. Both models performed better on German-language questions than on English ones.

🔍 Key Details

  • 📊 Dataset: 1070 image-based multiple-choice questions from AMBOSS
  • 🧩 Features used: Medical images and customized prompts
  • ⚙️ Technology: GPT-4 1106 Vision Preview and Bard Gemini Pro
  • 🏆 Performance: GPT-4: 56.9%, Bard: 44.6%

🔑 Key Takeaways

  • 📊 GPT-4 1106 Vision Preview outperformed Bard Gemini Pro in medical visual question answering.
  • 💡 GPT-4 achieved an accuracy of 67.8% when considering only answered questions.
  • 🌍 Both models performed better in German than in English.
  • 📉 GPT-4 left 16.1% of questions unanswered, significantly higher than Bard’s 4.1%.
  • 🏆 Student majority vote achieved an overall accuracy of 94.5%, outperforming both AI models.
  • 🔍 The overall accuracy difference between the two models was statistically significant (P<.001).
  • 🗣️ Language-specific analysis revealed a consistent trend favoring German responses.
  • 🔄 Further optimization of these models is needed for diverse linguistic contexts.

📚 Background

The advent of large language models (LLMs) like OpenAI’s ChatGPT has transformed medical research and education. These models have shown promise in various applications, including radiological imaging interpretation and assisting with medical licensing examinations. The integration of image recognition capabilities into LLMs marks a significant advancement in their utility for medical diagnostics and training.

🗒️ Study

This study critically assessed the effectiveness of GPT-4 1106 Vision Preview and Bard Gemini Pro in answering image-based questions from medical licensing examinations. A total of 1070 image-based multiple-choice questions, sourced from the AMBOSS learning platform with prompts in both English and German, were analyzed with a focus on the models' accuracy and their utility in medical education.
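
The paper does not reproduce the authors' exact prompts or code, but the workflow it describes (sending a medical image plus a customized multiple-choice prompt to a vision-capable chat model and asking for the most likely diagnosis) can be illustrated with a short, hypothetical sketch using the OpenAI Python SDK. The prompt wording, model identifier, and helper function below are illustrative assumptions, not the study's implementation.

```python
# Hypothetical sketch of the kind of query described in the study:
# a medical image plus a multiple-choice prompt asking for the most
# likely diagnosis. Prompt text, model name, and helper are assumed.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_image_question(image_path: str, question: str, options: list[str]) -> str:
    # Encode the local image as a base64 data URL for the chat API.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "Interpret the medical image and choose the most likely diagnosis.\n"
        f"Question: {question}\n"
        + "\n".join(f"{chr(65 + i)}) {opt}" for i, opt in enumerate(options))
        + "\nAnswer with a single letter."
    )

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed identifier; the study used GPT-4 1106 Vision Preview
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=50,
    )
    return response.choices[0].message.content
```

A comparable request to Bard Gemini Pro would go through Google's API; in either case, the returned letter can be scored directly against the AMBOSS answer key.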

📈 Results

The results indicated that GPT-4 1106 Vision Preview achieved an accuracy of 56.9% (609/1070), significantly outperforming Bard Gemini Pro, which had an accuracy of 44.6% (477/1070). Notably, when only answered questions were considered, GPT-4’s accuracy rose to 67.8% (609/898), surpassing both Bard and the student passed mean of 63% (674/1070).
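
The reported significance levels are consistent with a standard chi-square test on a 2×2 contingency table of correct versus incorrect answers. As a rough check, here is a minimal sketch using SciPy (whose default applies Yates' continuity correction to 2×2 tables), built only from the counts quoted above:

```python
# Minimal sketch: chi-square test on the published counts
# (correct vs. incorrect answers out of 1070 questions per model).
from scipy.stats import chi2_contingency

total = 1070
gpt4_correct, bard_correct = 609, 477
table = [
    [gpt4_correct, total - gpt4_correct],  # GPT-4 1106 Vision Preview
    [bard_correct, total - bard_correct],  # Bard Gemini Pro
]

chi2, p, dof, _ = chi2_contingency(table)  # Yates correction by default for 2x2
print(f"chi2({dof}) = {chi2:.1f}, P = {p:.1e}")  # approx. chi2(1) = 32.1, P < .001
```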

🌍 Impact and Implications

The findings of this study highlight the potential of LLMs like GPT-4 and Bard Gemini Pro in enhancing medical education and diagnostics. Their varying performance based on language suggests that while these models can serve as valuable tools for students, further optimization is necessary to address their limitations, particularly in non-English contexts. The high accuracy of the student majority vote underscores the importance of human expertise in medical decision-making.

🔮 Conclusion

This study demonstrates the promising capabilities of GPT-4 1106 Vision Preview and Bard Gemini Pro in medical visual question-answering tasks. While GPT-4 shows superior performance, both models have room for improvement, especially in their handling of diverse linguistic content. Continued research and development in this area could lead to significant advancements in medical education and practice.

💬 Your comments

What are your thoughts on the use of AI in medical education? Do you believe these models can effectively support students in their learning journey? 💬 Share your insights in the comments below.

Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.

Abstract

BACKGROUND: The rapid development of large language models (LLMs) such as OpenAI’s ChatGPT has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological imaging interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities.
OBJECTIVE: This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations.
METHODS: This study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform, divided into 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret medical images and provide the most likely diagnosis. Student performance data were obtained from AMBOSS, including metrics such as the “student passed mean” and “majority vote.” Statistical analysis was conducted using Python (Python Software Foundation), with key libraries for data manipulation and visualization.
RESULTS: GPT-4 1106 Vision Preview (OpenAI) outperformed Bard Gemini Pro (Google), correctly answering 56.9% (609/1070) of questions compared to Bard's 44.6% (477/1070), a statistically significant difference (χ²₁=32.1, P<.001). However, GPT-4 1106 left 16.1% (172/1070) of questions unanswered, significantly higher than Bard's 4.1% (44/1070; χ²₁=83.1, P<.001). When considering only answered questions, GPT-4 1106's accuracy increased to 67.8% (609/898), surpassing both Bard (477/1026, 46.5%; χ²₁=87.7, P<.001) and the student passed mean of 63% (674/1070, SE 1.48%; χ²₁=4.8, P=.03). Language-specific analysis revealed both models performed better in German than English, with GPT-4 1106 showing greater accuracy in German (282/465, 60.65% vs 327/605, 54.1%; χ²₁=4.4, P=.04) and Bard Gemini Pro exhibiting a similar trend (255/465, 54.8% vs 222/605, 36.7%; χ²₁=34.3, P<.001). The student majority vote achieved an overall accuracy of 94.5% (1011/1070), significantly outperforming both artificial intelligence models (GPT-4 1106: χ²₁=408.5, P<.001; Bard Gemini Pro: χ²₁=626.6, P<.001).
CONCLUSIONS: Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro have potential in medical visual question-answering tasks and to serve as a support for students. However, their performance varies depending on the language used, with a preference for German. They also have limitations in responding to non-English content. The accuracy rates, particularly when compared to student responses, highlight the potential of these models in medical education, yet the need for further optimization and understanding of their limitations in diverse linguistic contexts remains critical.
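
The language-specific comparison can be checked the same way from the counts reported in the abstract; a minimal sketch for GPT-4 1106 (German vs English), under the same 2×2 chi-square assumption:

```python
# Minimal sketch: GPT-4 1106's German vs. English accuracy from the
# abstract's counts (282/465 correct in German, 327/605 in English).
from scipy.stats import chi2_contingency

german = [282, 465 - 282]   # correct, incorrect (German questions)
english = [327, 605 - 327]  # correct, incorrect (English questions)

chi2, p, dof, _ = chi2_contingency([german, english])
print(f"chi2({dof}) = {chi2:.1f}, P = {p:.2f}")  # approx. chi2(1) = 4.4, P = .04
```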

Authors: Roos J, Martin R, Kaczmarczyk R

Journal: JMIR Form Res

Citation: Roos J, Martin R, Kaczmarczyk R. Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study. JMIR Form Res. 2024;8:e57592. doi: 10.2196/57592

