Quick Summary
This study explored the effectiveness of Retrieval Augmented Generation (RAG) in enhancing the performance of Large Language Models (LLMs) for assessing surgical fitness and providing preoperative instructions. The GPT-4 LLM-RAG model demonstrated superior accuracy (96.4%) compared to human-generated responses (86.6%), highlighting its potential in medical applications.
Key Details
- Models Tested: 10 LLMs, including GPT-3.5, GPT-4, Gemini, and Claude
- Guidelines Used: 35 local and 23 international surgical fitness guidelines
- Methodology: Comparison of 3,234 LLM-generated responses with 448 human-generated answers
- Performance: GPT-4 achieved 96.4% accuracy, significantly outperforming humans (p = 0.016)
Key Takeaways
- LLMs can be effectively customized for medical applications using RAG.
- GPT-4 showed the highest accuracy in assessing surgical fitness.
- Response Time: GPT-4 generated answers within 20 seconds.
- Consistency: The model produced more consistent outputs than human responses.
- Hallucinations: The GPT-4 model produced no hallucinations in its responses.
- Global Relevance: The study included both local and international guidelines, enhancing its applicability.
- Implications: This technology could transform preoperative assessments in clinical settings.
Background
The integration of Large Language Models (LLMs) into medical practice has been met with enthusiasm due to their potential to streamline processes and enhance decision-making. However, these models often lack the necessary domain-specific knowledge required for accurate medical assessments. The introduction of Retrieval Augmented Generation (RAG) offers a promising solution by allowing LLMs to access and utilize specialized knowledge, thereby improving their performance in medical contexts.
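The retrieval-plus-generation pattern described above can be sketched in a few lines of Python. This is a toy illustration with hypothetical guideline snippets, using simple token overlap in place of the embedding-based similarity search a real RAG system would use; it does not reflect the study's actual pipeline.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Rank documents by token overlap with the query (a stand-in
    for the vector similarity search a production system would use)."""
    scored = sorted(
        documents,
        key=lambda doc: len(tokenize(query) & tokenize(doc)),
        reverse=True,
    )
    return scored[:k]

def build_augmented_prompt(query, documents, k=1):
    """Prepend the retrieved guideline text to the question so the
    LLM answers grounded in domain-specific knowledge."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical guideline snippets, not the study's corpus.
guidelines = [
    "Patients on anticoagulants should stop warfarin 5 days before surgery.",
    "Fasting: no solid food for 6 hours before anesthesia.",
]

prompt = build_augmented_prompt(
    "What are the fasting rules before anesthesia?", guidelines
)
print(prompt)
```

The augmented prompt is then sent to the LLM, which answers using the retrieved guideline text rather than relying solely on its pretraining.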
Study
This study aimed to evaluate the effectiveness of RAG in enhancing the capabilities of ten different LLMs in determining surgical fitness and delivering preoperative instructions. Conducted using a comprehensive set of guidelines, the research involved generating a substantial number of responses and comparing them to those produced by human experts. The focus was on assessing accuracy, consistency, and safety in clinical scenarios.
Results
The results revealed that the GPT-4 LLM-RAG model outperformed human-generated responses with an accuracy of 96.4%, compared to 86.6% for humans (p = 0.016). Additionally, the model generated responses quickly (within 20 seconds), and its consistent output further underscores its potential as a reliable tool in medical assessments.
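For intuition about the reported significance, a comparison of two accuracy rates can be sketched with a standard two-proportion z-test. The counts below are illustrative assumptions chosen only to match the reported percentages; the paper's actual raw counts and choice of statistical test may differ.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative counts: 216/224 is roughly 96.4% vs. 194/224 roughly 86.6%.
z, p = two_proportion_z_test(216, 224, 194, 224)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these assumed counts the difference is statistically significant at the conventional 0.05 level, consistent in direction with the study's finding.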
Impact and Implications
The findings from this study suggest that GPT-4-based LLM-RAG models could significantly enhance the accuracy and efficiency of preoperative assessments in clinical settings. By minimizing the risk of errors and providing consistent outputs, these models could improve patient safety and streamline surgical preparations. The implications extend beyond surgical fitness assessments, potentially influencing various areas of healthcare where rapid and accurate information is crucial.
Conclusion
This study highlights the transformative potential of Retrieval Augmented Generation in the realm of medical applications, particularly in enhancing the performance of Large Language Models. The ability of the GPT-4 model to deliver accurate, efficient, and consistent preoperative assessments marks a significant advancement in the integration of AI technologies in healthcare. Continued research and development in this area could pave the way for broader applications and improved patient outcomes.
Your comments
What are your thoughts on the integration of AI in medical assessments? Do you believe technologies like RAG can enhance patient care? Share your insights in the comments below or connect with us on social media:
Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness.
Abstract
Large Language Models (LLMs) hold promise for medical applications but often lack domain-specific expertise. Retrieval Augmented Generation (RAG) enables customization by integrating specialized knowledge. This study assessed the accuracy, consistency, and safety of LLM-RAG models in determining surgical fitness and delivering preoperative instructions using 35 local and 23 international guidelines. Ten LLMs (e.g., GPT-3.5, GPT-4, GPT-4o, Gemini, Llama 2, Llama 3, and Claude) were tested across 14 clinical scenarios. A total of 3,234 responses were generated and compared to 448 human-generated answers. The GPT-4 LLM-RAG model with international guidelines generated answers within 20 s and achieved the highest accuracy, significantly better than human-generated responses (96.4% vs. 86.6%, p = 0.016). Additionally, the model exhibited no hallucinations and produced more consistent output than humans. This study underscores the potential of GPT-4-based LLM-RAG models to deliver highly accurate, efficient, and consistent preoperative assessments.
Authors: Ke YH, Jin L, Elangovan K, Abdullah HR, Liu N, Sia ATH, Soh CR, Tung JYM, Ong JCL, Kuo CF, Wu SC, Kovacheva VP, Ting DSW
Journal: NPJ Digit Med
Citation: Ke YH, et al. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness. NPJ Digit Med. 2025; 8:187. doi: 10.1038/s41746-025-01519-z