🧑🏼‍💻 Research - March 12, 2026

Automated Tumor International Classification of Diseases Coding of Real-World Pathology Reports Using Self-Hosted Large Language Models.

⚡ Quick Summary

This study evaluated the performance of five state-of-the-art open-source large language models (LLMs) in automating the coding of pathology reports with International Classification of Diseases for Oncology, third edition (ICD-O-3) codes. The findings suggest that while LLMs show promise, their current performance is not yet adequate for fully automated clinical use.

๐Ÿ” Key Details

  • 📊 Dataset: 21,364 pathology reports from 10,823 patients
  • 🧩 Models evaluated: Llama-3.3-70B-Instruct, DeepSeek-R1-Distill-Llama (8B and 70B), Qwen3-235B-A22B, Gemma-3-12B-it
  • ⚙️ Evaluation metrics: exact code matches and first three-position matches (illustrated in the sketch after this list)
  • 🏆 Best performance: Qwen3-235B-A22B for exact topography codes (microaverage F1: 71.6%)

🔑 Key Takeaways

  • 📊 LLMs can assist in pathology coding but are not yet ready for full automation.
  • 💡 Qwen3-235B-A22B achieved the highest performance for exact topography code prediction.
  • 🏆 DeepSeek-R1-Distill-Llama-70B excelled at predicting morphology codes.
  • 📉 Large disparities between micro- and macroaverage F1 scores indicated poor generalization to rare conditions.
  • 🔍 Contextual information, such as anatomic context in the prompt, proved crucial for accurate coding.
  • ⚠️ Morphology classification scored substantially lower than topography classification.
  • 🌍 Study conducted at a large German hospital on reports documented between 2013 and 2025.
  • 🆔 PMID: 41812101.

📚 Background

Manual coding of pathology reports is a time-consuming and error-prone process that places a significant burden on healthcare institutions. ICD-O-3 coding is essential for accurate cancer documentation and downstream treatment decisions, yet traditional coding workflows are often inefficient. The advent of large language models (LLMs) offers a potential way to streamline this process.

🗒️ Study

This study analyzed 21,364 pathology reports from 10,823 patients at a large German hospital. Five LLMs were evaluated on their ability to extract ICD-O-3 codes from real-world pathology reports. The models were deployed on secure hospital infrastructure, and three prompts were developed: topography extraction with anatomic context, topography extraction without it, and morphology extraction. A minimal sketch of what such an extraction call could look like follows.
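
The paper does not publish its exact prompts or serving stack, so the following is only a hedged sketch. It assumes the models sit behind an OpenAI-compatible endpoint, as vLLM and similar self-hosted servers provide; the internal URL, prompt wording, and helper name are assumptions for illustration.

```python
# Minimal sketch of querying a self-hosted LLM for an ICD-O-3 topography code.
# Endpoint URL, model name, and prompt text are illustrative assumptions.
from openai import OpenAI

# vLLM and comparable self-hosted servers expose an OpenAI-compatible API.
client = OpenAI(base_url="http://llm.hospital.internal:8000/v1", api_key="EMPTY")

def extract_topography(report_text: str, anatomic_context: str | None = None) -> str:
    """Ask the model for a single ICD-O-3 topography code (e.g., 'C50.4')."""
    context = f"Anatomic context: {anatomic_context}\n" if anatomic_context else ""
    response = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",
        messages=[
            {"role": "system",
             "content": "You are a tumor documentation assistant. "
                        "Answer with exactly one ICD-O-3 topography code."},
            {"role": "user", "content": f"{context}Pathology report:\n{report_text}"},
        ],
        temperature=0.0,  # deterministic output is preferable for coding tasks
    )
    return response.choices[0].message.content.strip()
```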

📈 Results

For exact ICD-O topography code prediction, Qwen3-235B-A22B achieved the highest performance with a microaverage F1 score of 71.6%, whereas Llama-3.3-70B-Instruct performed best at predicting the first three characters (microaverage F1: 84.6%). For morphology codes, DeepSeek-R1-Distill-Llama-70B outperformed the other models, with an exact microaverage F1 score of 34.7% and a microaverage F1 score of 77.8% on the first three characters. Large disparities between micro- and macroaverage F1 scores indicated poor generalization to rare conditions.
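
Why does a micro/macro gap implicate rare codes? Micro-averaging pools every prediction, so frequent codes dominate the score, while macro-averaging weights each code equally. The toy example below, with invented labels, shows how a model that only ever predicts the common code can still look strong on the micro average.

```python
# Toy illustration of the micro- vs. macroaverage F1 gap (invented data).
from sklearn.metrics import f1_score

# One frequent topography code dominates; the rare one is never predicted.
y_true = ["C50.4"] * 95 + ["C37.9"] * 5  # C37.9 (thymus) is the rare class here
y_pred = ["C50.4"] * 100                 # model always outputs the common code

print(f1_score(y_true, y_pred, average="micro"))                   # 0.95
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.49
```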

๐ŸŒ Impact and Implications

The findings from this study highlight the potential of LLMs as support systems for expert-guided pathology coding. However, the current limitations indicate that while these models can assist in the coding process, they are not yet suitable for fully automated use in routine clinical workflows. Addressing the challenges related to rare conditions and contextual dependencies will be crucial for future advancements in this area.

🔮 Conclusion

This study underscores the promising capabilities of large language models in the realm of pathology coding. While they show potential as supportive tools, further research and development are necessary to enhance their performance and reliability for clinical applications. The integration of AI in healthcare continues to evolve, and with it, the hope for more efficient and accurate coding practices.

💬 Your comments

What are your thoughts on the use of AI in pathology coding? Do you believe LLMs can eventually replace manual coding? Let’s discuss! 💬 Leave your thoughts in the comments below.


Abstract

PURPOSE: Manual coding of pathology reports with International Classification of Diseases for Oncology (ICD-O)-3 codes is time-consuming, error-prone, and resource-intensive for health care institutions. To evaluate the performance of multiple state-of-the-art large language models (LLMs) in extracting ICD-O-3 topography and morphology codes from real-world pathology reports and assess their potential for clinical implementation, this study compares the performance of state-of-the-art open-source models in multiple evaluation setups.
METHODS: We analyzed 21,364 pathology reports from 10,823 patients documented between 2013 and 2025 at a large German hospital. Five LLMs were evaluated: Llama-3.3-70B-Instruct, DeepSeek-R1-Distill-Llama (8B and 70B variants), Qwen3-235B-A22B, and Gemma-3-12B-it. All models were deployed on secured private information technology hospital infrastructure. Three different prompts were developed for topography extraction (with and without anatomic context) and morphology extraction. Performance was evaluated using exact code matches and first three-position matches.
RESULTS: For exact ICD-O topography code prediction, Qwen3-235B-A22B achieved the highest performance (microaverage F1: 71.6%), whereas Llama-3.3-70B-Instruct performed best at predicting the first three characters (microaverage F1: 84.6%). For morphology codes, DeepSeek-R1-Distill-Llama-70B outperformed other models (exact microaverage F1: 34.7%; first three characters’ microaverage F1: 77.8%). Large disparities between micro- and macroaverage F1-scores indicated poor generalization to rare conditions.
CONCLUSION: Although LLMs demonstrate promising capabilities as support systems for expert-guided pathology coding, their performance is not yet sufficient for fully automated, unsupervised use in routine clinical workflows. LLMs showed poor performance on rare conditions, heavy dependence on contextual information, and substantially lower scores for morphology versus topography classification.

Authors: Arzideh K, Hosch R, Turki A, Eryilmaz B, Bahn M, Schäfer H, Idrissi-Yaghir A, Khattab S, Dada A, Baba HA, Schadendorf D, Schuler M, Kleesiek J, Hartmann S, Nensa F, Keyl J

Journal: JCO Clin Cancer Inform

Citation: Arzideh K, et al. Automated Tumor International Classification of Diseases Coding of Real-World Pathology Reports Using Self-Hosted Large Language Models. JCO Clin Cancer Inform. 2026;10:e2500254. doi: 10.1200/CCI-25-00254

