Quick Summary
This pilot study explored how large language models (LLMs) attribute cardiovascular risk across various demographic and clinical domains. The findings revealed significant bias and variability in risk attribution, particularly highlighting differences based on gender and race.
Key Details
- Model Used: ChatGPT 4.0 mini
- Domains Examined: General cardiovascular risk, BMI, diabetes, depression, smoking, hyperlipidaemia
- Methodology: Structured prompts submitted in triplicate
- Agreement: Strong inter-rater reliability across repeated runs (ICC 0.949, 95% CI: 0.819-0.992, p < 0.001); an illustrative sketch of this statistic follows this list.
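For readers unfamiliar with the agreement statistic above, here is a minimal Python sketch of how an intraclass correlation coefficient (ICC) can be computed over repeated model runs using the pingouin library. The column names and toy ratings are assumptions for illustration only, not the study's actual data or code.

```python
# Illustrative only: toy data standing in for risk ratings from three repeated runs.
import pandas as pd
import pingouin as pg

# Each prompt (target) is rated by three repeated runs (treated as raters).
df = pd.DataFrame({
    "prompt_id": ["p1", "p1", "p1", "p2", "p2", "p2", "p3", "p3", "p3"],
    "run":       ["r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"],
    "risk":      [7, 7, 8, 3, 3, 3, 5, 6, 5],  # hypothetical 1-10 risk ratings
})

# pingouin reports several ICC variants, each with a 95% CI and p-value.
icc = pg.intraclass_corr(data=df, targets="prompt_id", raters="run", ratings="risk")
print(icc[["Type", "ICC", "CI95%", "pval"]])
```

High ICC values, as reported in the study, indicate that the model returned very similar risk attributions across its three repeated runs for each prompt.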
Key Takeaways
- Gender Bias: Higher cardiovascular risk attributed to men compared to women.
- Racial Bias: Black patients consistently judged as higher-risk than white patients.
- Comorbidity Influence: Risk attribution between genders changed with the inclusion of depression and smoking.
- Variability: Decisions varied across different domains, indicating potential inconsistencies in LLM outputs.
- Need for Caution: Highlights the importance of evaluating LLM decision-making to prevent reinforcing health inequities.

Background
The integration of large language models in medicine is rapidly increasing, yet their decision-making processes, particularly in cardiovascular risk assessment, remain largely unexamined. Understanding how these models attribute risk across different demographics is crucial for ensuring equitable healthcare delivery.
Study
This study aimed to investigate the risk attribution capabilities of an LLM by developing a structured prompt set across six domains related to cardiovascular health. The prompts were designed to assess both neutral risk attribution and comparative risk assessments based on demographic factors.
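To make the described workflow concrete, the sketch below shows one plausible way to submit neutral and paired comparative prompts in triplicate via the OpenAI Python client. The model name, prompt wording, and loop structure are assumptions for illustration; the paper does not publish its code.

```python
# Hypothetical sketch of triplicate prompt submission; wording and model name are assumed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DOMAINS = ["general", "BMI", "diabetes", "depression", "smoking", "hyperlipidaemia"]
N_RUNS = 3  # each prompt submitted in triplicate

def ask(prompt: str) -> str:
    """Send a single prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed stand-in for "ChatGPT 4.0 mini"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

results = {}
for domain in DOMAINS:
    neutral = f"Which patient group carries higher cardiovascular risk in the context of {domain}?"
    comparative = (
        f"Two otherwise identical patients differ only in {domain}. "
        "Who is at higher cardiovascular risk: the man or the woman?"
    )
    results[domain] = {
        "neutral": [ask(neutral) for _ in range(N_RUNS)],
        "comparative": [ask(comparative) for _ in range(N_RUNS)],
    }
```

Collecting the repeated replies per domain in this way is what would allow the kind of agreement and comparative analysis the study reports.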
Results
The results indicated that the LLM attributed higher cardiovascular risk to men than women and consistently judged Black patients as higher-risk compared to white patients. Notably, the model’s decisions regarding gender risk changed when comorbidities like depression and smoking were included, demonstrating variability in its outputs.
Impact and Implications
The findings of this study underscore the necessity for careful evaluation of LLMs in clinical settings. The potential for these models to reinforce existing biases in healthcare could lead to significant disparities in patient outcomes. Addressing these biases is essential for the responsible deployment of AI technologies in medicine.
Conclusion
This pilot study reveals critical insights into the biases present in LLMs when attributing cardiovascular risk. As we continue to integrate AI into healthcare, it is vital to ensure that these technologies do not perpetuate inequities. Ongoing research and evaluation are necessary to harness the full potential of LLMs while safeguarding against bias.
Your comments
What are your thoughts on the implications of AI in cardiovascular risk assessment? Let's engage in a discussion! Share your insights in the comments below or connect with us on social media.
Uncovering bias and variability in how large language models attribute cardiovascular risk.
Abstract
Large language models (LLMs) are used increasingly in medicine, but their decision-making in cardiovascular risk attribution remains underexplored. This pilot study examined how an LLM apportioned relative cardiovascular risk across different demographic and clinical domains. A structured prompt set was developed across six domains, covering general cardiovascular risk, body mass index (BMI), diabetes, depression, smoking, and hyperlipidaemia, and submitted in triplicate to ChatGPT 4.0 mini. For each domain, a neutral prompt assessed the LLM's risk attribution, while paired comparative prompts examined whether including the domain changed the LLM's decision about which demographic group was at higher risk. The LLM attributed higher cardiovascular risk to men than women, and to Black rather than white patients, across most neutral prompts. In comparative prompts, the LLM's decision between sexes changed in two of six domains: when depression was included, risk attribution was equal between men and women; in scenarios without smoking, females were judged to be at higher risk than males, but this reversed to males being at higher risk when smoking was present. In contrast, race-based decisions of relative risk were stable across domains, as the LLM consistently judged Black patients to be higher-risk. Agreement across repeated runs was strong (ICC of 0.949, 95% CI: 0.819-0.992, p < 0.001). The LLM exhibited bias and variability across cardiovascular risk domains. Although decisions between males and females sometimes changed when comorbidities were included, race-based decisions remained the same. This pilot study suggests careful evaluation of LLM clinical decision-making is needed to avoid reinforcing inequities.
Authors: Chan JTN, Kwek RK
Journal: Front Digit Health
Citation: Chan JTN and Kwek RK. Uncovering bias and variability in how large language models attribute cardiovascular risk. Front Digit Health. 2025; 7:1710594. doi: 10.3389/fdgth.2025.1710594