⚡ Quick Summary
This study assessed the performance of large language models (LLMs) in answering clinical quizzes related to clinical chemistry and laboratory management. Notably, GPT-4o achieved the highest accuracy at 81.7%, demonstrating significant potential for enhancing clinical decision-making in healthcare.
Key Details
- Dataset: 109 clinical problem-based quizzes from peer-reviewed articles
- Topics covered: Clinical chemistry, toxicology, laboratory management
- Models evaluated: GPT-4o, GPT-4 Turbo, Claude 3 Opus, Gemini 1.5 Pro, and others
- Top performer: GPT-4o with 81.7% accuracy
Key Takeaways
- LLMs can effectively answer specialized medical quizzes without fine-tuning.
- GPT-4o outperformed other models across quiz types.
- Performance metrics: GPT-4 Turbo (76.1%), Claude 3 Opus (74.3%), Gemini 1.5 Pro (69.7%), Gemini 1.0 Pro (51.4%).
- Strengths: Particularly effective in quizzes involving figures, tables, or calculations.
- Implications: Potential utility in assisting healthcare professionals with clinical decision-making.
- Study published in the Scandinavian Journal of Clinical Laboratory Investigation.
- PMID: 39970086.
Background
The integration of artificial intelligence in healthcare is rapidly evolving, with large language models emerging as powerful tools capable of understanding and generating human language. However, their application in specialized fields like clinical chemistry and laboratory management has not been extensively explored, prompting the need for this study.
Study
This research evaluated the performance of nine LLMs using a zero-shot prompting approach on a set of 109 clinical quizzes sourced from the Laboratory Medicine Online (LMO) database. The quizzes were designed to simulate real-world scenarios faced by clinical chemists and laboratory managers, providing a robust framework for assessing the models’ capabilities.
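The evaluation pipeline described above, zero-shot prompts that assign the model a professional role and a simple accuracy tally over the quiz set, can be sketched as follows. This is a minimal illustration only: the prompt wording, data format, and example answers are assumptions, not taken from the paper.

```python
# Sketch of a zero-shot, role-assigned quiz evaluation.
# The prompt template and sample data below are illustrative assumptions.

def build_zero_shot_prompt(role: str, question: str) -> str:
    """Compose a zero-shot prompt that assigns the model a professional role
    (e.g. clinical chemist or laboratory manager), with no worked examples."""
    return (
        f"You are a {role}. Answer the following quiz question "
        f"using only your existing knowledge.\n\nQuestion: {question}"
    )

def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of quizzes answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Hypothetical 5-quiz run for one model (answers are made up):
prompt = build_zero_shot_prompt("clinical chemist", "Which analyte is elevated?")
preds = ["A", "C", "B", "D", "A"]
key = ["A", "C", "B", "A", "A"]
print(f"accuracy = {accuracy(preds, key):.1%}")  # accuracy = 80.0%
```

In the study, each of the nine models would be queried once per quiz with such a prompt (one call per model–quiz pair, no fine-tuning), and the per-model accuracy computed over all 109 quizzes.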
Results
Among the models tested, GPT-4o achieved the highest overall accuracy of 81.7%, handling a range of question types effectively. Other models, such as GPT-4 Turbo and Claude 3 Opus, also performed well, though with lower accuracy. The results indicate that LLMs can leverage their pre-existing knowledge to address specialized inquiries in clinical settings.
Impact and Implications
The findings from this study suggest that LLMs, particularly GPT-4o, could play a transformative role in healthcare by assisting professionals in making informed clinical decisions. The ability of these models to accurately interpret and respond to complex medical queries could enhance the efficiency and effectiveness of laboratory management and clinical chemistry practices.
Conclusion
This study highlights the significant potential of large language models in the medical field, particularly in clinical chemistry and laboratory management. As these technologies continue to evolve, they may provide invaluable support to healthcare professionals, ultimately leading to improved patient outcomes and more efficient healthcare delivery. Further research is encouraged to explore the full capabilities of LLMs in various medical domains.
Your comments
What are your thoughts on the use of large language models in healthcare? Do you see potential for these technologies to enhance clinical decision-making? Share your insights in the comments below or connect with us on social media:
Assessment of large language models in medical quizzes for clinical chemistry and laboratory management: implications and applications for healthcare artificial intelligence.
Abstract
Large language models (LLMs) have demonstrated high performance across various fields due to their ability to understand, generate, and manipulate human language. However, their potential in specialized medical domains, such as clinical chemistry and laboratory management, remains underexplored. This study evaluated the performance of nine LLMs using zero-shot prompting on 109 clinical problem-based quizzes from peer-reviewed journal articles in the Laboratory Medicine Online (LMO) database. These quizzes covered topics in clinical chemistry, toxicology, and laboratory management. The models, including GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, along with their earlier or smaller versions, were assigned roles as clinical chemists or laboratory managers to simulate real-world decision-making scenarios. Among the evaluated models, GPT-4o achieved the highest overall accuracy, correctly answering 81.7% of the quizzes, followed by GPT-4 Turbo (76.1%), Claude 3 Opus (74.3%), and Gemini 1.5 Pro (69.7%), while the lowest performance was observed with Gemini 1.0 Pro (51.4%). GPT-4o performed exceptionally well across all quiz types, including single-select, open-ended, and multiple-select questions, and demonstrated particular strength in quizzes involving figures, tables, or calculations. These findings highlight the ability of LLMs to effectively apply their pre-existing knowledge base to specialized clinical chemistry inquiries without additional fine-tuning. Among the evaluated models, GPT-4o exhibited superior performance across different quiz types, underscoring its potential utility in assisting healthcare professionals in clinical decision-making.
Authors: Heo WY, Park HD
Journal: Scand J Clin Lab Invest
Citation: Heo WY, Park HD. Assessment of large language models in medical quizzes for clinical chemistry and laboratory management: implications and applications for healthcare artificial intelligence. Scand J Clin Lab Invest. 2025; (unknown volume):1-8. doi: 10.1080/00365513.2025.2466054