🧑🏼‍💻 Research - June 20, 2025

Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines.


⚡ Quick Summary

This study evaluated the performance of GPT-4o and Llama-3.3-70B in extracting data from stroke CT reports, highlighting the impact of annotation guidelines on accuracy. Incorporating the guidelines into the prompt improved precision for both models, while GPT-4o consistently outperformed Llama-3.3-70B.

🔍 Key Details

  • 📊 Datasets: Dataset A (n=200) and Dataset B (n=100) from a single academic stroke center
  • 🧩 Findings extracted: Ten imaging findings from stroke CT reports
  • ⚙️ Technologies used: GPT-4o and Llama-3.3-70B
  • 🏆 Performance metrics: Micro-averaged precision ranged from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B

🔑 Key Takeaways

  • 📊 LLMs can effectively extract data from radiology reports.
  • 💡 Annotation guidelines significantly enhance the accuracy of data extraction.
  • 👩‍🔬 GPT-4o consistently outperformed Llama-3.3-70B in precision metrics.
  • 🏆 Precision rates improved notably when guidelines were included in the model prompts.
  • 🌍 Study conducted at a single academic stroke center.
  • 🔍 Inter-annotator disagreements informed the development of the annotation guidelines.
  • 📈 Overall classification performance showed significant differences in five out of six conditions when guidelines were applied.

📚 Background

The extraction of relevant data from medical reports, particularly in the field of radiology, is crucial for improving patient care and optimizing treatment pathways. However, traditional methods often face challenges due to the variability in report formats and the subjective nature of data interpretation. The advent of large language models (LLMs) like GPT-4o and Llama-3.3-70B presents a promising avenue for automating this process, but their effectiveness can be significantly influenced by the clarity of the instructions provided to them.

🗒️ Study

This study assessed the performance of GPT-4o and Llama-3.3-70B in extracting ten imaging findings from stroke CT reports. The researchers designed an annotation guideline based on a review of cases with inter-annotator disagreements in Dataset A, and then tested each model under two conditions: with the annotation guideline included in the prompt and without it (a hedged sketch of this setup follows below).
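To make the two-condition setup concrete, here is a minimal sketch of prompting GPT-4o for structured finding extraction with and without a guideline block in the system prompt. The finding labels, guideline wording, and example report are illustrative placeholders, not the prompts or schema used by the authors; only the OpenAI Python SDK calls are standard.

```python
# Minimal sketch: extract findings from a stroke CT report with GPT-4o,
# with and without an annotation guideline in the prompt.
# FINDINGS, GUIDELINE, and the example report are illustrative placeholders,
# not the study's actual labels, guideline text, or data.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FINDINGS = ["infarct demarcation", "intracranial hemorrhage", "vessel occlusion"]

GUIDELINE = (
    "Label a finding as present only if the report explicitly states it is present; "
    "treat negations such as 'no evidence of X' as absent; "
    "ignore findings mentioned only as clinical history."
)

def extract_findings(report_text: str, use_guideline: bool) -> dict:
    """Ask the model for a JSON object mapping each finding to 'present' or 'absent'."""
    system = "You extract imaging findings from stroke CT reports and answer only with JSON."
    if use_guideline:
        system += "\n\nAnnotation guideline:\n" + GUIDELINE
    user = (
        "Report:\n" + report_text + "\n\n"
        "Return a JSON object with the keys "
        + ", ".join(FINDINGS)
        + ", each set to 'present' or 'absent'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        response_format={"type": "json_object"},  # constrain the reply to valid JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

report = "CT head: hypodensity in the left MCA territory. No intracranial hemorrhage."
print(extract_findings(report, use_guideline=False))
print(extract_findings(report, use_guideline=True))
```

Comparing the two outputs over an annotated dataset is what allows the effect of the guideline on precision and recall to be measured.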

📈 Results

The findings revealed that GPT-4o consistently outperformed Llama-3.3-70B, with micro-averaged precision rates reaching as high as 0.95 for GPT-4o in Dataset B. The incorporation of the annotation guideline led to improved precision rates across both models, while recall rates remained stable. This indicates that clear instructions can enhance the models’ ability to accurately extract relevant data from complex reports.
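The headline metric here is micro-averaged precision, which pools true and false positives across all ten findings before dividing, so a guideline that suppresses spurious "present" labels raises precision even when recall is unchanged. Below is a small sketch of that computation; the label and prediction arrays are dummy values for illustration only, not study data.

```python
# Minimal sketch of micro-averaged precision and recall over several findings.
# The labels and predictions below are illustrative dummy values, not study data.
def micro_precision_recall(y_true: list[list[int]], y_pred: list[list[int]]) -> tuple[float, float]:
    """Pool true positives, false positives, and false negatives across all findings, then divide."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(y_true, y_pred):
        for truth, pred in zip(truth_row, pred_row):
            if pred == 1 and truth == 1:
                tp += 1
            elif pred == 1 and truth == 0:
                fp += 1
            elif pred == 0 and truth == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Each row is one report, each column one finding (1 = present, 0 = absent).
labels      = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
predictions = [[1, 0, 1], [1, 0, 1], [1, 1, 0]]
print(micro_precision_recall(labels, predictions))  # (0.833..., 1.0)
```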

🌍 Impact and Implications

The implications of this study are significant for the field of radiology and beyond. By demonstrating that well-defined annotation guidelines can improve the accuracy of LLMs in extracting findings from radiological reports, this research paves the way for more efficient data extraction processes. This could ultimately lead to better patient outcomes and more streamlined workflows in clinical settings, enhancing the overall quality of care.

🔮 Conclusion

This study highlights the potential of LLMs like GPT-4o and Llama-3.3-70B in revolutionizing data extraction from medical reports. The evidence that annotation guidelines can significantly enhance extraction accuracy is a crucial insight for future research and application in healthcare. As we continue to explore the capabilities of AI in medicine, the integration of clear guidelines will be essential for maximizing the benefits of these technologies.

💬 Your comments

What are your thoughts on the role of annotation guidelines in improving AI performance in healthcare? We would love to hear your insights! 💬 Leave your comments below.

Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines.

Abstract

BACKGROUND: To evaluate the impact of an annotation guideline on the performance of large language models (LLMs) in extracting data from stroke computed tomography (CT) reports.
METHODS: The performance of GPT-4o and Llama-3.3-70B in extracting ten imaging findings from stroke CT reports was assessed in two datasets from a single academic stroke center. Dataset A (n = 200) was a stratified cohort including various pathological findings, whereas dataset B (n = 100) was a consecutive cohort. Initially, an annotation guideline providing clear data extraction instructions was designed based on a review of cases with inter-annotator disagreements in dataset A. For each LLM, data extraction was performed under two conditions: with the annotation guideline included in the prompt and without it.
RESULTS: GPT-4o consistently demonstrated superior performance over Llama-3.3-70B under identical conditions, with micro-averaged precision ranging from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B. Across both models and both datasets, incorporating the annotation guideline into the LLM input resulted in higher precision rates, while recall rates largely remained stable. In dataset B, the precision of GPT-4o and Llama-3.3-70B improved from 0.83 to 0.95 and from 0.87 to 0.94, respectively. Overall classification performance with and without the annotation guideline was significantly different in five out of six conditions.
CONCLUSION: GPT-4o and Llama-3.3-70B show promising performance in extracting imaging findings from stroke CT reports, although GPT-4o steadily outperformed Llama-3.3-70B. We also provide evidence that well-defined annotation guidelines can enhance LLM data extraction accuracy.
RELEVANCE STATEMENT: Annotation guidelines can improve the accuracy of LLMs in extracting findings from radiological reports, potentially optimizing data extraction for specific downstream applications.
KEY POINTS:
  • LLMs have utility in data extraction from radiology reports, but the role of annotation guidelines remains underexplored.
  • Data extraction accuracy from stroke CT reports by GPT-4o and Llama-3.3-70B improved when well-defined annotation guidelines were incorporated into the model prompt.
  • Well-defined annotation guidelines can improve the accuracy of LLMs in extracting imaging findings from radiological reports.

Authors: Wihl J, Rosenkranz E, Schramm S, Berberich C, Griessmair M, Woźnicki P, Pinto F, Ziegelmayer S, Adams LC, Bressem KK, Kirschke JS, Zimmer C, Wiestler B, Hedderich D, Kim SH

Journal: Eur Radiol Exp

Citation: Wihl J, et al. Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines. Eur Radiol Exp. 2025;9:61. doi: 10.1186/s41747-025-00600-2

