Quick Summary
This study evaluates the use of large language models (LLMs) for annotating radiology reports across multiple institutions, demonstrating that a single human-optimized prompt yields consistent and accurate diagnosis extraction. Among the eight models compared at one site, Llama 3.1 70b achieved the highest accuracy in identifying specified findings, illustrating the potential of optimized prompt engineering in radiology.
Key Details
- Dataset: 500 radiology reports collected at six institutions, 100 in each of five categories.
- Technology: Locally executed large language models (LLMs), including Llama 3.1 70b.
- Performance metric: Accuracy of LLM outputs against investigator-provided reference labels.
- Methodology: A standardized Python script, distributed to all sites, ran a common LLM with the same human-optimized prompt.
Key Takeaways
- LLMs show promise for automating the annotation of radiology reports.
- A human-optimized prompt yields consistent and accurate LLM outputs across sites and pathologies.
- The study involved collaboration among six institutions, strengthening the generalizability of the findings.
- In a systematic comparison of eight LLMs at one site, Llama 3.1 70b achieved the highest accuracy in identifying the specified findings.
- Comparable performance at two additional centers indicates robust adaptability to different report structures and institutional practices.
- Future work will focus on refining prompts and testing model robustness across diverse report formats.
Background
The integration of large language models (LLMs) into healthcare, particularly in radiology, presents a transformative opportunity for enhancing diagnostic accuracy. Traditional methods of report annotation can be labor-intensive and prone to human error. By leveraging LLMs, we can streamline the process, potentially leading to improved patient outcomes and more efficient workflows in radiology departments.
Study
This study was conducted across six institutions, which collected 500 radiology reports, 100 in each of five categories. A standardized Python script was distributed to the participating sites to run a common, locally executed LLM with a shared human-optimized prompt, enabling consistent analysis and comparison of results against reference labels provided by local investigators; an illustrative sketch of such a workflow is shown below.
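The shared script itself is not reproduced here, so the following is only a minimal sketch of the kind of per-site labeling workflow described above. It assumes a hypothetical `query_local_llm()` helper standing in for whatever local LLM runtime a site uses (for example, an on-premise Llama 3.1 70b deployment), a `reports.csv` file containing report text and investigator reference labels, and an illustrative finding name; none of these names come from the paper.

```python
# Minimal sketch of a per-site labeling script (illustrative only, not the authors' code).
# Assumes reports.csv has columns "report_text" and "reference_label" ("present"/"absent").
import csv

PROMPT_TEMPLATE = (
    "You are labeling a radiology report for a single finding.\n"
    "Answer with exactly one word, 'present' or 'absent', for: {finding}\n\n"
    "Report:\n{report}"
)

def query_local_llm(prompt: str) -> str:
    """Placeholder for the locally executed model (e.g., a local Llama 3.1 70b deployment).

    Replace this with a call to your local LLM runtime; it should return the model's text answer.
    """
    raise NotImplementedError("Wire this function to your local LLM runtime.")

def evaluate_site(csv_path: str, finding: str) -> float:
    """Run the prompt over every report and return accuracy against reference labels."""
    correct = total = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            prompt = PROMPT_TEMPLATE.format(finding=finding, report=row["report_text"])
            prediction = query_local_llm(prompt).strip().lower()
            correct += int(prediction == row["reference_label"].strip().lower())
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Illustrative finding name; each site would substitute its own five categories.
    print(f"Accuracy: {evaluate_site('reports.csv', finding='pneumothorax'):.3f}")
```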
Results
The analysis revealed that the human-optimized prompt produced highly consistent outputs across sites and pathologies. In a systematic comparison of eight LLMs at one site, Llama 3.1 70b achieved the highest accuracy in identifying the specified findings, with comparable performance observed at two additional centers and significant agreement between LLM outputs and investigator-provided references. This points to a promising avenue for applying LLMs in clinical settings; a sketch of how per-site results could be pooled centrally follows.
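As a rough illustration of how per-site results might be aggregated centrally, the snippet below assumes each site exports a small CSV of accuracies with `model`, `category`, and `accuracy` columns; the file and column names are hypothetical, not taken from the study.

```python
# Hypothetical central aggregation of per-site accuracy tables (names are illustrative).
import pandas as pd

site_files = ["site_a.csv", "site_b.csv", "site_c.csv"]  # one results table per institution
results = pd.concat([pd.read_csv(f) for f in site_files], ignore_index=True)

# Mean accuracy per model and finding category, averaged across sites.
summary = results.groupby(["model", "category"])["accuracy"].mean().unstack()
print(summary.round(3))
```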
Impact and Implications
The findings from this study underscore the potential of optimized prompt engineering in enhancing the utility of LLMs for radiology report labeling. By improving the accuracy and adaptability of these models, we can facilitate better diagnostic processes across institutions. This could lead to more standardized reporting practices and ultimately improve patient care in radiology.
Conclusion
This study highlights the significant advancements that large language models can bring to the field of radiology through optimized prompt engineering. As we continue to refine these technologies, the potential for improved diagnostic accuracy and efficiency in healthcare settings becomes increasingly attainable. Future research will be crucial in exploring the robustness of these models across diverse report structures.
Your comments
What are your thoughts on the integration of LLMs in radiology? Do you see this as a game-changer for diagnostic processes? Share your insights in the comments below or connect with us on social media.
Cross-Institutional Evaluation of Large Language Models for Radiology Diagnosis Extraction: A Prompt-Engineering Perspective.
Abstract
The rapid evolution of large language models (LLMs) offers promising opportunities for radiology report annotation, aiding in determining the presence of specific findings. This study evaluates the effectiveness of a human-optimized prompt in labeling radiology reports across multiple institutions using LLMs. Six distinct institutions collected 500 radiology reports: 100 in each of 5 categories. A standardized Python script was distributed to participating sites, allowing the use of one common locally executed LLM with a standard human-optimized prompt. The script executed the LLM's analysis for each report and compared predictions to reference labels provided by local investigators. Model performance was calculated as accuracy, and results were aggregated centrally. The human-optimized prompt demonstrated high consistency across sites and pathologies. Preliminary analysis indicates significant agreement between the LLM's outputs and investigator-provided references across multiple institutions. At one site, eight LLMs were systematically compared, with Llama 3.1 70b achieving the highest performance in accurately identifying the specified findings. Comparable performance with Llama 3.1 70b was observed at two additional centers, demonstrating the model's robust adaptability to variations in report structures and institutional practices. Our findings illustrate the potential of optimized prompt engineering in leveraging LLMs for cross-institutional radiology report labeling. This approach is straightforward while maintaining high accuracy and adaptability. Future work will explore model robustness to diverse report structures and further refine prompts to improve generalizability.
Authors: Moassefi M, Houshmand S, Faghani S, Chang PD, Sun SH, Khosravi B, Triphati AG, Rasool G, Bhatia NK, Folio L, Andriole KP, Gichoya JW, Erickson BJ
Journal: J Imaging Inform Med
Citation: Moassefi M, et al. Cross-Institutional Evaluation of Large Language Models for Radiology Diagnosis Extraction: A Prompt-Engineering Perspective. J Imaging Inform Med. 2025. doi: 10.1007/s10278-025-01523-5