Quick Summary
This study evaluated the oncological knowledge of four large language models (LLMs) using the Turkish Society of Medical Oncology’s board examination questions. The results revealed that Claude 3.5 Sonnet outperformed the others, achieving an average score of 77.6%, while ChatGPT 4o followed with 67.8%.
Key Details
- Dataset: 790 multiple-choice questions from 2016 to 2024
- Models tested: Claude 3.5 Sonnet, ChatGPT 4o, Llama-3, Gemini 1.5
- Language: Turkish
- Performance metrics: average scores and pass rates across exams
Key Takeaways
- Claude 3.5 Sonnet achieved the highest average score of 77.6%.
- ChatGPT 4o passed seven out of eight exams with an average score of 67.8%.
- Llama-3 and Gemini 1.5 showed lower performance, passing only four and three exams, respectively.
- A decline in performance was noted in recent years, particularly on the 2024 exam.
- Differences among the models' performances were statistically significant (F = 17.39, p < 0.001).
- Advanced LLMs have potential as tools in oncology education and decision support.
- Regular updates are necessary to maintain relevance and accuracy in medical knowledge.
Background
The field of oncology is rapidly evolving, with increasing complexity in patient care and a vast amount of medical literature. As healthcare professionals seek efficient tools to assist in clinical decision-making and education, large language models (LLMs) have emerged as promising candidates. However, their effectiveness in oncology education and knowledge assessment remains underexplored.
Study
This study aimed to benchmark the oncological knowledge of four LLMs (Claude 3.5 Sonnet, ChatGPT 4o, Llama-3, and Gemini 1.5) using standardized questions from the Turkish Society of Medical Oncology’s annual board examinations. A total of 790 valid multiple-choice questions covering various oncology topics were analyzed to assess each model’s performance in answering these questions in Turkish.
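For readers who want to reproduce this kind of benchmark, the sketch below shows how per-exam scores and pass/fail outcomes might be tallied. It is a minimal illustration only: the data layout, the field names, and the 60% pass threshold are assumptions made for demonstration, not details taken from the paper.

```python
# Minimal sketch of per-exam scoring for one model.
# Data layout, field names, and the 60% pass mark are illustrative assumptions.
from collections import defaultdict

PASS_THRESHOLD = 0.60  # hypothetical pass mark, not from the study

def score_by_exam(questions, model_answers):
    """questions: list of dicts with 'exam_year' and 'correct_choice';
    model_answers: parallel list of the model's chosen options."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, answer in zip(questions, model_answers):
        total[q["exam_year"]] += 1
        if answer == q["correct_choice"]:
            correct[q["exam_year"]] += 1
    results = {}
    for year in sorted(total):
        pct = correct[year] / total[year]
        results[year] = {"score_pct": round(100 * pct, 1),
                         "passed": pct >= PASS_THRESHOLD}
    return results
```

Applied to each model's answer set, this yields the per-exam percentages and pass counts that the study reports as its headline metrics.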
Results
The results indicated that Claude 3.5 Sonnet was the top performer, successfully passing all eight exams with an average score of 77.6%. In contrast, ChatGPT 4o passed seven exams with an average score of 67.8%. Both Llama-3 and Gemini 1.5 demonstrated lower performance, with average scores below 50%. The study highlighted significant differences in performance among the models, emphasizing the need for regular updates to ensure accuracy.
Impact and Implications
The findings of this study suggest that advanced LLMs like Claude 3.5 Sonnet and ChatGPT 4o could serve as valuable tools in oncology education and clinical decision support. However, to maximize their potential, it is crucial to implement regular updates and enhancements to incorporate the latest medical advancements and maintain their relevance in a rapidly changing field.
Conclusion
This study underscores the significant differences in oncological knowledge among various LLMs, with Claude 3.5 Sonnet and ChatGPT 4o leading the pack. As we continue to explore the integration of AI in healthcare, it is essential to ensure that these models are regularly updated to reflect the latest knowledge and practices in oncology. The future of LLMs in medical education looks promising, and further research is encouraged to enhance their capabilities.
Your comments
What are your thoughts on the use of LLMs in oncology education? Do you believe they can significantly impact clinical decision-making? Share your insights in the comments below or connect with us on social media.
Benchmarking LLM chatbots’ oncological knowledge with the Turkish Society of Medical Oncology’s annual board examination questions.
Abstract
BACKGROUND: Large language models (LLMs) have shown promise in various medical applications, including clinical decision-making and education. In oncology, the increasing complexity of patient care and the vast volume of medical literature require efficient tools to assist practitioners. However, the use of LLMs in oncology education and knowledge assessment remains underexplored. This study aims to evaluate and compare the oncological knowledge of four LLMs using standardized board examination questions.
METHODS: We assessed the performance of four LLMs, Claude 3.5 Sonnet (Anthropic), ChatGPT 4o (OpenAI), Llama-3 (Meta), and Gemini 1.5 (Google), using the Turkish Society of Medical Oncology’s annual board examination questions from 2016 to 2024. A total of 790 valid multiple-choice questions covering various oncology topics were included. Each model was tested on its ability to answer these questions in Turkish. Performance was analyzed based on the number of correct answers, with statistical comparisons made using chi-square tests and one-way ANOVA.
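The statistical comparisons described here (one-way ANOVA across models and chi-square tests on answer counts) can be sketched with standard tools. The snippet below uses scipy.stats with placeholder numbers; the score arrays and correct-answer counts are illustrative assumptions, not the study's data.

```python
# Hedged sketch of the reported statistical comparisons: one-way ANOVA on
# per-exam percentage scores and a chi-square test on correct/incorrect counts.
# All numbers below are placeholders, not values from the study.
import numpy as np
from scipy import stats

# Per-exam percentage scores for each model (8 exams each) -- illustrative only.
claude  = np.array([80, 75, 78, 79, 76, 77, 80, 76])
chatgpt = np.array([70, 68, 66, 69, 67, 70, 65, 67])
llama   = np.array([52, 48, 50, 47, 49, 45, 44, 46])
gemini  = np.array([50, 46, 44, 45, 47, 43, 42, 44])

# One-way ANOVA: do mean scores differ across the four models?
f_stat, p_anova = stats.f_oneway(claude, chatgpt, llama, gemini)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4g}")

# Chi-square on pooled correct vs. incorrect counts out of 790 questions
# (counts are again placeholders).
correct = np.array([613, 536, 380, 355])
incorrect = 790 - correct
chi2, p_chi2, dof, expected = stats.chi2_contingency(np.vstack([correct, incorrect]))
print(f"Chi-square: chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi2:.4g}")
```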
RESULTS: Claude 3.5 Sonnet outperformed the other models, passing all eight exams with an average score of 77.6%. ChatGPT 4o passed seven out of eight exams, with an average score of 67.8%. Llama-3 and Gemini 1.5 showed lower performance, passing four and three exams respectively, with average scores below 50%. Significant differences were observed among the models’ performances (F = 17.39, p < 0.001). Claude 3.5 Sonnet and ChatGPT 4o demonstrated higher accuracy across most oncology topics. A decline in performance in recent years, particularly in the 2024 exam, suggests limitations due to outdated training data.
CONCLUSIONS: Significant differences in oncological knowledge were observed among the four LLMs, with Claude 3.5 Sonnet and ChatGPT 4o demonstrating superior performance. These findings suggest that advanced LLMs have the potential to serve as valuable tools in oncology education and decision support. However, regular updates and enhancements are necessary to maintain their relevance and accuracy, especially to incorporate the latest medical advancements.
Authors: Erdat EC, Kavak EE
Journal: BMC Cancer
Citation: Erdat EC, Kavak EE. Benchmarking LLM chatbots’ oncological knowledge with the Turkish Society of Medical Oncology’s annual board examination questions. BMC Cancer. 2025;25:197. doi: 10.1186/s12885-025-13596-0