Research - March 14, 2025

[Evaluating the accuracy of large language models in answering mammography screening questions in Italian and English: a study based on the Eusobi guidelines.].

Quick Summary

A recent study evaluated the accuracy of large language models (LLMs) like ChatGPT, Gemini, and Copilot in answering mammography screening questions in both Italian and English. The findings revealed that while LLMs can provide useful information, they often struggle with highly specialized medical topics, particularly in delivering complete and accurate answers.

Key Details

  • Study Focus: Accuracy of LLMs in answering breast imaging-related questions.
  • Languages Tested: Italian and English.
  • Methodology: Responses evaluated by expert radiologists using a Likert scale (1 to 5).
  • Key Findings: Average scores ranged from 3.6 to 4 out of 5.

Key Takeaways

  • General Concepts: LLMs performed better on general mammography concepts.
  • Specific Questions: Incomplete responses were noted for specific queries, especially regarding dense breast definitions.
  • Source Quality: Many sources used, particularly in Italian, were non-specialized in radiology.
  • Collaboration Needed: Enhancing LLM accuracy requires collaboration between AI experts and healthcare professionals.
  • Implications: The study highlights the potential and limitations of AI in medical communication.

Background

The integration of artificial intelligence (AI) into healthcare is reshaping how patients access medical information. Large language models (LLMs) are at the forefront of this transformation, providing simplified explanations of complex medical topics. However, the accuracy of these models in delivering reliable medical information remains a critical concern, particularly in specialized fields like breast cancer screening.

Study

This study was conducted by a team of five breast radiologists who developed nine questions based on the Eusobi guidelines related to breast cancer screening. These questions were posed to three LLMs (ChatGPT, Gemini, and Copilot) in both Italian and English. The responses were then evaluated by two expert radiologists to assess their accuracy and completeness.
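
To make the scoring design concrete, here is a minimal sketch of how two readers' Likert ratings for one model in one language could be averaged and checked for agreement. The ratings are hypothetical placeholders, not data from the study. Cohen's kappa is used only as one common agreement statistic; the source does not say which statistic the authors applied, and the function name `cohen_kappa` is our own.

```python
from collections import Counter
from statistics import mean


def cohen_kappa(reader_a, reader_b):
    """Unweighted Cohen's kappa for two equal-length lists of ratings."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(reader_a)
    # Fraction of items on which the two readers gave the same score.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance, from each reader's score frequencies.
    counts_a, counts_b = Counter(reader_a), Counter(reader_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both readers used a single identical score for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical placeholder ratings (NOT data from the study):
# nine questions, scored 1 to 5 by each of two readers.
reader_1 = [4, 4, 5, 3, 4, 3, 4, 5, 3]
reader_2 = [4, 3, 5, 3, 4, 4, 4, 5, 3]

# Average the two readers per question, then across the nine questions.
mean_score = mean(mean(pair) for pair in zip(reader_1, reader_2))
kappa = cohen_kappa(reader_1, reader_2)

print(f"mean score: {mean_score:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Because Likert scores are ordinal, a weighted kappa or a correlation-based measure might be preferred in practice. The unweighted form is shown here only because it is the simplest to read.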

Results

The average scores for the responses from the LLMs were consistent across both languages, falling between 3.6 and 4 out of 5. While general questions about mammography were answered more accurately, specific inquiries, particularly those concerning the definition of dense breast tissue, often yielded incomplete responses. Additionally, the sources referenced by the models, especially in Italian, were frequently non-specialized, indicating a significant limitation in the models’ ability to provide detailed medical information.

Impact and Implications

The findings of this study underscore the potential of LLMs as tools for medical communication, yet they also highlight the need for caution when relying on AI for specialized medical information. The collaboration between AI developers and healthcare professionals is essential to enhance the accuracy and reliability of AI-generated medical content, particularly in critical areas such as breast cancer prevention and screening.

Conclusion

This study illustrates the promise and limitations of large language models in the realm of medical communication. While LLMs can serve as valuable resources for general information, their shortcomings in specialized topics necessitate further development and collaboration with healthcare experts. The future of AI in healthcare looks promising, but it is crucial to ensure that the information provided is both accurate and reliable.

Abstract

INTRODUCTION: Artificial intelligence (AI) is transforming various aspects of everyday life, including healthcare, through large language models (LLMs) like ChatGPT, Gemini, and Copilot. These systems are increasingly used to disseminate medical information, allowing patients to access simplified explanations. This study aims to compare responses to breast imaging-related questions formulated in Italian and English, based on Eusobi guidelines, evaluating the LLMs’ ability to provide accurate and complete answers on mammography screening concepts.
MATERIALS AND METHODS: Nine questions related to breast cancer screening were developed by five breast radiologists based on Eusobi recommendations. These questions were submitted to ChatGPT, Gemini, and Copilot in both Italian and English. Responses were evaluated by two expert breast radiologists using a Likert scale (1 to 5), with statistical analysis performed to compare the accuracy, average length of responses, use of radiological sources and the agreement among readers.
RESULTS: The average scores for responses were similar in both languages, ranging from 3.6 to 4 out of 5. Questions on general mammography concepts received more accurate answers, while more specific questions based on the latest guidelines showed incomplete responses, especially about the definition of dense breast. The sources used, particularly in Italian, were often non-specialized in radiology, highlighting a limitation of LLMs in providing detailed and up-to-date medical answers.
CONCLUSIONS: The study shows that LLMs are useful tools for medical communication, but they have limitations in delivering accurate answers on highly specialized medical topics. To improve the quality of information, collaboration between AI experts and healthcare professionals is necessary, especially in breast cancer prevention and screening.

Authors: Signorini M, Fontani S, Minichetti P, Teggi S, Barusco A, Favat M

Journal: Recenti Prog Med

Citation: Signorini M, et al. [Evaluating the accuracy of large language models in answering mammography screening questions in Italian and English: a study based on the Eusobi guidelines.]. Recenti Prog Med. 2025; 116:162-167. doi: 10.1701/4460.44556
