🧑🏼‍💻 Research - November 30, 2025

When AI models take the exam: large language models vs medical students on multiple-choice course exams.


⚡ Quick Summary

A recent study compared the performance of large language models (LLMs) with that of medical students on multiple-choice exams across various clinical courses. The results revealed that LLMs consistently outperformed students, achieving mean scores ranging from 7.46 to 9.88 compared to student scores of 4.28 to 7.32.

🔍 Key Details

  • 📊 Study Location: Miguel Hernández University, Spain
  • 🧑‍🎓 Participants: 442 medical students
  • ⚙️ LLMs Tested: OpenAI o1, GPT-4o, DeepSeek R1, Microsoft Copilot, Google Gemini 1.5 Flash
  • 🏆 Performance Metrics: Mean LLM score (aggregated across courses): 8.75; mean student score: 5.76
  • 📅 Year of Study: 2025

🔑 Key Takeaways

  • 🤖 LLMs outperformed medical students on MCQs with negative marking.
  • 📈 LLM performance was highly reproducible across runs (Gwet's AC1 0.79-1.00).
  • 🏆 OpenAI o1 achieved the highest mean scores in three out of four courses.
  • 💡 Microsoft Copilot led in Cardiovascular Medicine (text-only subset, owing to image-handling limitations).
  • 📚 All LLMs answered every MCQ, leaving no items blank despite negative marking.
  • 🌍 Findings support the cautious use of LLMs as adjuncts in medical education.
  • 🔍 Further research is needed across different institutions and languages.

📚 Background

The integration of artificial intelligence in healthcare and medical education is rapidly evolving. As LLMs become more prevalent, understanding their capabilities and limitations in academic settings is crucial. This study aims to shed light on how these models perform in comparison to traditional medical students, particularly in the context of multiple-choice assessments.

🗒️ Study

Conducted in 2025 at Miguel Hernández University, this comparative cross-sectional study evaluated the performance of five contemporary LLMs against that of enrolled medical students. The exams covered four clinical courses: Infectious Diseases, Neurology, Respiratory Medicine, and Cardiovascular Medicine, all administered in Spanish under routine conditions and scored with negative marking. Each model completed every MCQ in two independent runs; scores were averaged and short-term test-retest agreement was estimated with Gwet's AC1.
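To make the reproducibility metric concrete, here is a minimal sketch of Gwet's AC1 between two runs of the same model on the same MCQs. The function name and toy answer vectors are illustrative assumptions, not data or code from the paper.

```python
# Minimal sketch: Gwet's AC1 agreement between two runs of one model.
# Each run is a list with the option chosen for every MCQ (or "omitted").
# Names and toy data are illustrative; this is not the authors' code.
from collections import Counter

def gwet_ac1(run1, run2):
    """Gwet's AC1 for two raters/runs over the same items."""
    n = len(run1)
    cats = set(run1) | set(run2)
    q = len(cats)

    # Observed agreement: share of items where the two runs match.
    pa = sum(a == b for a, b in zip(run1, run2)) / n

    # Chance agreement (Gwet, 2008): (1/(q-1)) * sum over categories of pi*(1-pi),
    # where pi is the mean proportion of answers falling in that category.
    c1, c2 = Counter(run1), Counter(run2)
    pe = sum(
        ((c1[k] + c2[k]) / (2 * n)) * (1 - (c1[k] + c2[k]) / (2 * n))
        for k in cats
    ) / (q - 1)

    return (pa - pe) / (1 - pe)

# Toy example: one model's answers to ten MCQs in two independent runs.
run1 = ["A", "C", "B", "D", "A", "B", "C", "A", "D", "B"]
run2 = ["A", "C", "B", "D", "A", "B", "C", "A", "D", "C"]
print(round(gwet_ac1(run1, run2), 2))  # 0.87 - high short-term agreement
```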

📈 Results

The results were striking: LLMs achieved mean scores ranging from 7.46 to 9.88, while students averaged between 4.28 and 7.32. Notably, LLMs not only exceeded the median scores of students but also matched or surpassed the highest student scores in several instances. Aggregated across courses, LLMs averaged 8.75 compared with 5.76 for students, a substantial performance gap.
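For readers unfamiliar with negative marking, the sketch below shows one common way a 0-10 MCQ score is computed when wrong answers carry a penalty. The 1/(options − 1) penalty and the toy numbers are assumptions for illustration; the paper does not state the department's exact scoring rule.

```python
# Illustrative only: a common negative-marking scheme on a 0-10 scale.
# Correct answers score +1, wrong answers cost 1/(options - 1), omissions are neutral.
# This is an assumed convention, not the exam's documented formula.

def negative_marking_score(correct, wrong, total_items, options=4, max_score=10):
    raw = correct - wrong / (options - 1)   # correction for guessing
    raw = max(raw, 0.0)                     # floor at zero
    return round(max_score * raw / total_items, 2)

# Toy example: 40-item exam, 32 correct, 6 wrong, 2 omitted, 4 options per item.
print(negative_marking_score(correct=32, wrong=6, total_items=40))  # 7.5
```

Under such a scheme, random guessing nets zero marks in expectation, which is why attempting every item (as the LLMs did) only pays off when accuracy is high.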

🌍 Impact and Implications

These findings suggest that LLMs could serve as valuable tools in medical education, particularly for automated pretesting and providing feedback. However, the authors emphasize the need for cautious, faculty-supervised implementation. The potential for LLMs to enhance learning and assessment in medical education is promising, but further validation across different contexts is essential.

🔮 Conclusion

This study highlights the remarkable capability of large language models on academic assessments designed for medical students. As we explore the integration of AI in education, it is vital to continue evaluating its impact on learning outcomes and to ensure that these technologies are used responsibly and effectively. The future of medical education may well include LLMs as supportive tools, enhancing both teaching and assessment methodologies.

💬 Your comments

What are your thoughts on the role of AI in medical education? Do you believe LLMs could enhance learning experiences for students? 💬 Share your insights in the comments below.

When AI models take the exam: large language models vs medical students on multiple-choice course exams.

Abstract

Large language models (LLMs) are increasingly used in healthcare and medical education, but their performance on institution-authored multiple-choice questions (MCQs), particularly with negative marking, remains unclear. To compare the examination performance of five contemporary LLMs with enrolled medical students on final multiple-choice (MCQ-style) course exams across four clinical courses. We conducted a comparative cross-sectional study at Miguel Hernández University (Spain) in 2025. Final exams in Infectious Diseases, Neurology, Respiratory Medicine, and Cardiovascular Medicine were administered under routine conditions in Spanish. Five LLMs (OpenAI o1, GPT-4o, DeepSeek R1, Microsoft Copilot, and Google Gemini 1.5 Flash) completed all MCQs in two independent runs. Scores were averaged and test-retest agreement was estimated with Gwet's AC1. Student scores (n = 442) were summarized as mean ± SD or median (IQR). Pairwise differences between models were explored with McNemar's test; student-LLM contrasts were descriptive. Across courses, LLMs consistently exceeded the student median and, in several instances, the highest student score. Mean LLM course scores ranged from 7.46 to 9.88, versus student means of 4.28 to 7.32. OpenAI o1 achieved the highest mean in three courses; Copilot led in Cardiovascular Medicine (text-only subset due to image limitations). All LLMs answered every MCQ, and short-term test-retest agreement was high (AC1 0.79-1.00). Aggregated across courses, LLMs averaged 8.75 compared with 5.76 for students. On department-set Spanish MCQ exams with negative marking, LLMs outperformed enrolled medical students, answered every item, and showed high short-term reproducibility. These findings support cautious, faculty-supervised use of LLMs as adjuncts to MCQ assessment (e.g., automated pretesting, feedback). Confirmation across institutions, languages, and image-rich formats, and evaluation of educational impact beyond accuracy, are needed.

Authors: Ros-Arlanzón P, Gutarra-Ávila R, Arrarte-Esteban V, Bertomeu-González V, Hernández-Blasco L, Masiá M, Navarro-Canto L, Nieto-Navarro J, Abarca J, Sempere AP

Journal: Med Educ Online

Citation: Ros-Arlanzón P, et al. When AI models take the exam: large language models vs medical students on multiple-choice course exams. Med Educ Online. 2025;30:2592430. doi: 10.1080/10872981.2025.2592430
