⚡ Quick Summary
This study introduces GPTRadScore, a novel evaluation framework for assessing how well multimodal large language models (MLLMs) interpret CT scans. The findings show that models such as GPT-4 with Vision and Gemini Pro Vision deliver promising results, but that limitations in their training datasets leave significant room for improvement.
🔍 Key Details
- 📊 Dataset: Subset of the public DeepLesion dataset
- 🧩 Models evaluated: GPT-4 with Vision, Gemini Pro Vision, LLaVA-Med, RadFM
- ⚙️ Framework: GPTRadScore for evaluation
- 🏆 Validation: Pearson’s correlation coefficients with clinician assessments ranging from 0.75 to 0.91
🔑 Key Takeaways
- 📊 GPTRadScore provides a more clinically informed assessment than traditional metrics.
- 💡 High correlation with clinician assessments indicates reliability.
- 🏆 Fine-tuning RadFM led to significant accuracy improvements: location accuracy from 3.41% to 12.8%, body part accuracy from 29.12% to 53%, and type accuracy from 9.24% to 30%.
- 🤖 GPT-4V and Gemini Pro Vision outperformed other models in interpreting CT scans.
- 🌍 Limitations in training datasets highlight areas for future improvement.
- 🔍 Traditional metrics such as BLEU, METEOR, and ROUGE were less effective than GPTRadScore at capturing clinical accuracy.
- 📈 The study reinforces the importance of fine-tuning in enhancing model performance.
📚 Background
The interpretation of CT scans is a critical aspect of radiological diagnostics, often requiring expert analysis to ensure accurate findings. Traditional evaluation methods can be limited in their ability to capture the nuances of clinical accuracy. The emergence of multimodal large language models (MLLMs) offers a new avenue for improving the interpretation of medical imaging, but a robust evaluation framework is essential to assess their effectiveness.
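To make that limitation concrete, here is a toy illustration (not from the study) of how n-gram metrics such as BLEU reward surface word overlap rather than clinical meaning. The findings, and the claim about relative scores, are invented for the example:

```python
# Toy illustration (not from the study): BLEU scores surface n-gram overlap,
# so a clinically equivalent paraphrase can score as poorly as, or worse
# than, a clinically wrong description. Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "small hypodense lesion in the left hepatic lobe".split()
paraphrase = "tiny low-attenuation focus in the left lobe of the liver".split()  # same finding, different wording
wrong = "small hyperdense mass in the left adrenal gland".split()                # different finding, similar wording

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # low, despite clinical equivalence
print(sentence_bleu([reference], wrong, smoothing_function=smooth))       # may score comparably, despite the error
```

An LLM-based judge, by contrast, can recognize that "hypodense lesion" and "low-attenuation focus" describe the same finding, which is the gap GPTRadScore is designed to close.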
🗒️ Study
This retrospective study utilized a subset of the DeepLesion dataset to evaluate the performance of various MLLMs in generating descriptions of CT scan findings. The newly developed GPTRadScore served as the primary evaluation metric, allowing for a more nuanced assessment of the models’ capabilities compared to traditional language-specific methods.
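For readers curious about the mechanics, the sketch below shows the general LLM-as-judge pattern that GPTRadScore follows. The paper's actual prompts and rubric are not reproduced in this summary, so the rubric text, scoring scale, and example findings here are illustrative assumptions:

```python
# Minimal sketch of the LLM-as-judge pattern behind GPTRadScore-style scoring.
# The rubric wording and 0-1 scale below are assumptions for illustration,
# not the study's actual prompt. Requires: pip install openai, plus an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are a radiologist grading a model-generated CT finding against a "
    "ground-truth finding. For each of three attributes -- location, body "
    "part, and lesion type -- answer 'correct' or 'incorrect', then give an "
    "overall score from 0 to 1."
)

def score_finding(ground_truth: str, generated: str) -> str:
    """Ask the judge model to compare a generated finding with the reference."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study used GPT-4 as the judge
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Ground truth: {ground_truth}\nGenerated: {generated}"},
        ],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content

print(score_finding(
    "small hypodense lesion in the left hepatic lobe",
    "tiny low-attenuation focus in the left lobe of the liver",
))
```

Grading the three attributes separately mirrors how the study reports results, with distinct accuracy figures for location, body part, and type.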
📈 Results
The evaluations revealed a strong correlation between GPTRadScore and clinician assessments, with Pearson’s correlation coefficients between 0.75 and 0.91 indicating high reliability. Notably, fine-tuning the RadFM model resulted in substantial accuracy gains for location, body part, and type descriptions, demonstrating the potential for improved performance through targeted training.
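As a hypothetical sketch of this validation step, the snippet below computes Pearson’s correlation between automatic scores and clinician ratings. The score values are made up for illustration; the study itself reports coefficients between 0.75 and 0.91:

```python
# Hypothetical validation step: correlate automatic scores with clinician
# ratings, as the study does to validate GPTRadScore. All values below are
# invented for illustration. Requires: pip install scipy
from scipy.stats import pearsonr

gptrad_scores = [0.9, 0.2, 0.7, 0.4, 1.0, 0.5]     # automatic scores (illustrative)
clinician_scores = [1.0, 0.3, 0.6, 0.5, 0.9, 0.4]  # expert ratings (illustrative)

r, p_value = pearsonr(gptrad_scores, clinician_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")  # high r => the metric tracks expert judgment
```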
🌍 Impact and Implications
The findings from this study have significant implications for the future of radiological diagnostics. By leveraging advanced evaluation frameworks like GPTRadScore, healthcare professionals can better assess the capabilities of MLLMs in interpreting CT scans. This could lead to enhanced diagnostic accuracy and improved patient outcomes, ultimately transforming the landscape of medical imaging.
🔮 Conclusion
This study highlights the potential of GPTRadScore as a reliable metric for evaluating MLLMs in radiological diagnostics. The results underscore the importance of fine-tuning approaches in enhancing the descriptive accuracy of LLM-generated medical imaging findings. As research in this area continues to evolve, we can anticipate further advancements in the integration of AI technologies in healthcare.
💬 Your comments
What are your thoughts on the use of multimodal large language models in interpreting CT scans? We would love to hear your insights! 💬 Leave your comments below or connect with us on social media.
How well do multimodal LLMs interpret CT scans? An auto-evaluation framework for analyses.
Abstract
OBJECTIVE: This study introduces a novel evaluation framework, GPTRadScore, to systematically assess the performance of multimodal large language models (MLLMs) in generating clinically accurate findings from CT imaging. Specifically, GPTRadScore leverages LLMs as an evaluation metric, aiming to provide a more accurate and clinically informed assessment than traditional language-specific methods. Using this framework, we evaluate the capability of several MLLMs, including GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, to interpret findings in CT scans.
METHODS: This retrospective study leverages a subset of the public DeepLesion dataset to evaluate the performance of several multimodal LLMs in describing findings in CT slices. GPTRadScore was developed to assess the generated descriptions (location, body part, and type) using GPT-4, alongside traditional metrics. RadFM was fine-tuned using a subset of the DeepLesion dataset with additional labeled examples targeting complex findings. Post fine-tuning, performance was reassessed using GPTRadScore to measure accuracy improvements.
RESULTS: Evaluations demonstrated a high correlation of GPTRadScore with clinician assessments, with Pearson’s correlation coefficients of 0.87, 0.91, 0.75, 0.90, and 0.89. These results highlight its superiority over traditional metrics, such as BLEU, METEOR, and ROUGE, and indicate that GPTRadScore can serve as a reliable evaluation metric. Using GPTRadScore, it was observed that while GPT-4V and Gemini Pro Vision outperformed other models, significant areas for improvement remain, primarily due to limitations in the datasets used for training. Fine-tuning RadFM resulted in substantial accuracy gains: location accuracy increased from 3.41% to 12.8%, body part accuracy improved from 29.12% to 53%, and type accuracy rose from 9.24% to 30%. These findings reinforce the hypothesis that fine-tuning RadFM can significantly enhance its performance.
CONCLUSION: GPT-4 effectively correlates with expert assessments, validating its use as a reliable metric for evaluating multimodal LLMs in radiological diagnostics. Additionally, the results underscore the efficacy of fine-tuning approaches in improving the descriptive accuracy of LLM-generated medical imaging findings.
Authors: Zhu Q, Hou B, Mathai TS, Mukherjee P, Jin Q, Chen X, Wang Z, Cheng R, Summers RM, Lu Z
Journal: J Biomed Inform
Citation: Zhu Q, Hou B, Mathai TS, et al. How well do multimodal LLMs interpret CT scans? An auto-evaluation framework for analyses. J Biomed Inform. 2025:104864. doi: 10.1016/j.jbi.2025.104864