Quick Summary
This study explores the use of large language models (LLMs) to improve retrieval-augmented generation (RAG) in oncology, using a retriever encoder fine-tuned on synthetic LLM-generated query-passage pairs. The results show a significant improvement in retrieval performance: the model outperforms the runner-up baseline by 9% in NDCG, 7% in Precision, and 6% in Recall (all evaluated at the top 10 results).
Key Details
- Dataset: Over 6 million oncology notes from 209,135 patients
- Features used: Query-passage pairs synthesized by LLMs
- Technology: Fine-tuned sentence transformer model
- Performance: 9% improvement in NDCG, 7% in Precision, and 6% in Recall compared to the runner-up model
Key Takeaways
- Enhanced retrieval performance for oncology-specific queries using LLMs.
- Innovative use of synthetic data for fine-tuning models.
- Model trained on a vast dataset of oncology notes.
- Significant metrics achieved in NDCG, Precision, and Recall.
- Potential applications in healthcare where annotated data is scarce.
- Categories covered include biomarkers, diagnosis, disease status, and tumor characteristics.
- Study conducted at City of Hope, a leading cancer research center.
Background
The field of oncology is increasingly reliant on data-driven insights derived from electronic health records (EHRs). However, extracting relevant information from unstructured data remains a challenge. Recent advancements in large language models and retrieval-augmented generation offer promising solutions to enhance the precision and relevance of clinical note retrieval, particularly in oncology.
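At its core, the retrieval step of a RAG pipeline ranks candidate note passages by the similarity of their embeddings to the query embedding. A minimal illustration in pure Python, using toy 3-dimensional vectors rather than a real encoder (the paper's actual model and embeddings are not reproduced here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_emb, passage_embs, k=2):
    """Return indices of the top-k passages by cosine similarity."""
    ranked = sorted(range(len(passage_embs)),
                    key=lambda i: cosine(query_emb, passage_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings; a real system would embed notes with a fine-tuned encoder.
passages = [[0.9, 0.1, 0.0],   # note about tumor staging
            [0.0, 0.8, 0.2],   # note about lab results
            [0.7, 0.3, 0.1]]   # note about disease status
query = [1.0, 0.2, 0.0]
print(retrieve(query, passages))  # → [0, 2]
```

The retrieved passages would then be passed to an LLM as context for answer generation, which is the "augmented generation" half of RAG.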
Study
This study aimed to improve the effectiveness of RAG by implementing a specialized retriever encoder for oncology EHRs. The model was pretrained on a substantial dataset of oncology notes and fine-tuned using 12,371 query-passage pairs to enhance its performance in answering oncology-related queries.
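The paper does not publish its training code, but fine-tuning a sentence transformer on query-passage pairs typically uses an in-batch-negatives contrastive objective (e.g. MultipleNegativesRankingLoss in the sentence-transformers library): each query's paired passage is the positive, and the other passages in the batch act as negatives. A pure-Python sketch of that loss on toy 2-dimensional embeddings, as an assumption about the training setup rather than the authors' exact method:

```python
import math

def in_batch_negatives_loss(query_embs, passage_embs, scale=20.0):
    """Mean cross-entropy over a batch of (query, passage) pairs,
    where the i-th passage is the correct match for the i-th query
    and all other passages in the batch serve as negatives."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    total = 0.0
    n = len(query_embs)
    for i, q in enumerate(query_embs):
        # Scaled cosine similarities act as classification logits.
        logits = [scale * cos(q, p) for p in passage_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[i] - log_denom)  # cross-entropy, target = i
    return total / n

# Well-aligned pairs give a near-zero loss; mismatched pairs do not.
queries  = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.1, 0.9]]
print(in_batch_negatives_loss(queries, passages))
```

During training, gradients of this loss would update the encoder so that each synthesized query moves closer to its source passage in embedding space.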
Results
The results demonstrated that our model outperformed the runner-up by 9% in NDCG, 7% in Precision, and 6% in Recall when evaluated on a test dataset of 53 patients. This indicates a robust retrieval capability across various oncology-specific categories, including tumor characteristics and laboratory results.
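The three reported metrics have standard binary-relevance definitions that are easy to reproduce. A short sketch using hypothetical note IDs (the definitions are the textbook ones, not the authors' exact scoring code):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items found within the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the DCG
    of an ideal ranking that lists all relevant items first."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranking of note IDs vs. the set of truly relevant notes.
retrieved = ["n3", "n7", "n1", "n9", "n2"]
relevant = {"n3", "n1", "n5"}
print(precision_at_k(retrieved, relevant, 5))      # → 0.4
print(round(recall_at_k(retrieved, relevant, 5), 3))  # → 0.667
print(round(ndcg_at_k(retrieved, relevant, 5), 3))    # → 0.704
```

Unlike Precision and Recall, NDCG rewards placing relevant notes near the top of the list, which is why it is a common headline metric for retrievers.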
Impact and Implications
The findings from this study underscore the potential of fine-tuned embeddings and synthetic data in enhancing the retrieval of pertinent clinical notes from oncology EHRs. This approach not only improves the accuracy of information retrieval but also holds promise for broader applications in healthcare, particularly in areas where annotated data is limited.
Conclusion
This study highlights the transformative potential of large language models in oncology-specific question answering. By leveraging pretrained contextual embeddings and innovative data augmentation techniques, we can significantly enhance the retrieval of relevant clinical information. Continued exploration in this domain could lead to improved patient outcomes and more efficient healthcare delivery.
Your comments
What are your thoughts on the integration of large language models in oncology? We would love to hear your insights! Leave your comments below or connect with us on social media.
Enhancing Oncology-Specific Question Answering With Large Language Models Through Fine-Tuned Embeddings With Synthetic Data.
Abstract
PURPOSE: The recent advancements of retrieval-augmented generation (RAG) and large language models (LLMs) have revolutionized the extraction of real-world evidence from unstructured electronic health records (EHRs) in oncology. This study aims to enhance RAG’s effectiveness by implementing a retriever encoder specifically designed for oncology EHRs, with the goal of improving the precision and relevance of retrieved clinical notes for oncology-related queries.
METHODS: Our model was pretrained with more than six million oncology notes from 209,135 patients at City of Hope. The model was subsequently fine-tuned into a sentence transformer model using 12,371 query-passage training pairs. Specifically, the passages were obtained from actual patient notes, whereas the query was synthesized by an LLM. We evaluated the retrieval performance of our model by comparing it with six widely used embedding models on 50 oncology questions across 10 categories based on Normalized Discounted Cumulative Gain (NDCG), Precision, and Recall.
RESULTS: In our test data set comprising 53 patients, our model exceeded the performance of the runner-up model by 9% for NDCG (evaluated at the top 10 results), 7% for Precision (top 10), and 6% for Recall (top 10). Our model showed exceptional retrieval performance across all metrics for oncology-specific categories, including biomarkers assessed, current diagnosis, disease status, laboratory results, tumor characteristics, and tumor staging.
CONCLUSION: Our findings highlight the effectiveness of pretrained contextual embeddings and sentence transformers in retrieving pertinent notes from oncology EHRs. The innovative use of LLM-synthesized query-passage pairs for data augmentation was proven to be effective. This fine-tuning approach holds significant promise in specialized fields like health care, where acquiring annotated data is challenging.
Authors: Lu KH, Mehdinia S, Man K, Wong CW, Mao A, Eftekhari Z
Journal: JCO Clin Cancer Inform
Citation: Lu KH, et al. Enhancing Oncology-Specific Question Answering With Large Language Models Through Fine-Tuned Embeddings With Synthetic Data. JCO Clin Cancer Inform. 2025;9:e2500011. doi: 10.1200/CCI-25-00011