Quick Summary
This study evaluated how closely three popular large language model (LLM) chatbots (ChatGPT, Claude, and Gemini) aligned with expert clinicians in assessing suicide risk. The findings show that the chatbots handled very-low-risk and very-high-risk queries in line with expert judgment but were inconsistent at intermediate risk levels, highlighting the need for further refinement.
Key Details
- Dataset: 30 hypothetical suicide-related queries, categorized by 13 clinical experts
- Risk levels: very high, high, medium, low, and very low
- Technology: ChatGPT, Claude, and Gemini chatbots
- Total responses: 9,000 (each chatbot answered each query 100 times)
Key Takeaways
- ChatGPT and Claude provided direct responses to very-low-risk queries 100% of the time.
- None of the three chatbots gave a direct response to any very-high-risk query.
- Responses to intermediate-risk queries were inconsistent.
- Claude was more likely than ChatGPT to provide direct responses (AOR=2.01).
- Gemini was less likely than ChatGPT to provide direct responses (AOR=0.09); see the worked example after this list for how to read these odds ratios.
- Mixed-effects logistic regression was used to analyze the data.
- Chatbot behavior aligned with expert judgment at the extremes of suicide risk.
- The need to further refine LLMs for suicide risk assessment was emphasized.
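The adjusted odds ratios above compare odds, not probabilities, which can be unintuitive. The snippet below is a minimal illustration of the conversion, assuming a purely hypothetical 50% baseline rate of direct responses for ChatGPT (the paper reports model-based AORs, not this baseline figure).

```python
def apply_odds_ratio(p_baseline: float, odds_ratio: float) -> float:
    """Shift a baseline probability by an odds ratio and return the implied probability."""
    odds = p_baseline / (1 - p_baseline)   # probability -> odds
    new_odds = odds_ratio * odds           # apply the (adjusted) odds ratio
    return new_odds / (1 + new_odds)       # odds -> probability

# Hypothetical baseline: ChatGPT answers a given query directly 50% of the time.
print(apply_odds_ratio(0.50, 2.01))  # Claude, AOR=2.01 -> ~0.67
print(apply_odds_ratio(0.50, 0.09))  # Gemini, AOR=0.09 -> ~0.08
```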

Background
Suicide risk assessment is a critical area in mental health care, where accurate evaluation can lead to timely interventions. The integration of artificial intelligence and large language models into this field holds promise, but it is essential to understand how these technologies align with expert clinical judgment, especially in sensitive areas like suicide risk.
Study
Thirteen clinical experts categorized 30 hypothetical suicide-related queries into five levels of self-harm risk. Each of the three LLM-based chatbots responded to every query 100 times, yielding 9,000 responses in total. Each response was coded as direct (answering the query) or indirect (e.g., declining to answer or referring to a hotline), and mixed-effects logistic regression was used to estimate how the likelihood of a direct response varied with the question's risk level. A rough sketch of this kind of analysis follows.
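The paper does not publish its analysis code, so the following is only a sketch of how such a mixed-effects logistic regression could be set up, assuming simulated stand-in data and statsmodels' Bayesian binomial mixed GLM as an approximation of the frequentist model the authors describe. Column names such as risk_level, chatbot, and query_id are illustrative, not taken from the study.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)

# Simulated stand-in data: 30 queries x 3 chatbots x 100 repetitions,
# with a binary outcome indicating whether the response was "direct".
risk_levels = ["very_low", "low", "medium", "high", "very_high"]
p_direct_by_risk = {"very_low": 0.95, "low": 0.70, "medium": 0.60,
                    "high": 0.50, "very_high": 0.02}
rows = []
for query_id in range(30):
    risk = risk_levels[query_id % 5]
    for chatbot in ["ChatGPT", "Claude", "Gemini"]:
        p = p_direct_by_risk[risk]
        rows += [{"query_id": query_id, "risk_level": risk, "chatbot": chatbot,
                  "direct": rng.binomial(1, p)} for _ in range(100)]
data = pd.DataFrame(rows)

# Fixed effects for risk level and chatbot (very-low risk and ChatGPT as references),
# plus a random intercept for each query.
model = BinomialBayesMixedGLM.from_formula(
    "direct ~ C(risk_level, Treatment('very_low')) + C(chatbot, Treatment('ChatGPT'))",
    {"query": "0 + C(query_id)"},
    data,
)
result = model.fit_vb()
print(result.summary())
# Exponentiating the fixed-effect coefficients gives approximate adjusted odds ratios,
# analogous to the AORs reported in the paper.
```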
Results
Both ChatGPT and Claude consistently gave direct responses to very-low-risk queries, whereas none of the chatbots gave direct responses to very-high-risk queries. However, the models did not meaningfully differentiate between intermediate risk levels, which raises concerns about their reliability in more nuanced situations. Claude was more likely than ChatGPT to provide direct responses, while Gemini was markedly less likely to do so.
Impact and Implications
The implications of this study are significant for the future of mental health assessment. While LLMs may become valuable tools in suicide risk assessment, their current inconsistency at intermediate risk levels underscores the need for ongoing refinement. As these technologies evolve, they could enhance the accuracy and efficiency of mental health evaluations, ultimately supporting better patient outcomes.
Conclusion
This study underscores the potential of large language models in suicide risk assessment while revealing critical gaps in their performance. As AI is integrated into mental health care, these models will need further refinement before they can reliably support clinicians' decision making, and further research is essential.
Your comments
What are your thoughts on the use of AI in suicide risk assessment? Do you believe these technologies can be refined to better support clinicians? Share your insights in the comments below.
Evaluation of Alignment Between Large Language Models and Expert Clinicians in Suicide Risk Assessment.
Abstract
OBJECTIVE: This study aimed to evaluate whether three popular chatbots powered by large language models (LLMs), namely ChatGPT, Claude, and Gemini, provided direct responses to suicide-related queries and how these responses aligned with clinician-determined risk levels for each question.
METHODS: Thirteen clinical experts categorized 30 hypothetical suicide-related queries into five levels of self-harm risk: very high, high, medium, low, and very low. Each LLM-based chatbot responded to each query 100 times (N=9,000 total responses). Responses were coded as “direct” (answering the query) or “indirect” (e.g., declining to answer or referring to a hotline). Mixed-effects logistic regression was used to assess the relationship between question risk level and the likelihood of a direct response.
RESULTS: ChatGPT and Claude provided direct responses to very-low-risk queries 100% of the time, and none of the three chatbots provided direct responses to any very-high-risk query. LLM-based chatbots did not meaningfully distinguish intermediate risk levels. Compared with very-low-risk queries, the odds of a direct response were not statistically different for low-risk, medium-risk, or high-risk queries. Across models, Claude was more likely (adjusted odds ratio [AOR]=2.01, 95% CI=1.71-2.37, p<0.001) and Gemini less likely (AOR=0.09, 95% CI=0.08-0.11, p<0.001) than ChatGPT to provide direct responses.
CONCLUSIONS: LLM-based chatbots' responses to queries aligned with experts' judgment about whether to respond to queries at the extremes of suicide risk (very low and very high), but the chatbots showed inconsistency in addressing intermediate-risk queries, underscoring the need to further refine LLMs.
Authors: McBain RK, Cantor JH, Zhang LA, Baker O, Zhang F, Burnett A, Kofner A, Breslau J, Stein BD, Mehrotra A, Yu H
Journal: Psychiatr Serv
Citation: McBain RK, et al. Evaluation of Alignment Between Large Language Models and Expert Clinicians in Suicide Risk Assessment. Psychiatr Serv. 2025;76:944-950. doi: 10.1176/appi.ps.20250086