๐Ÿง‘๐Ÿผโ€๐Ÿ’ป Research - August 22, 2025

Evaluation of large language models on mental health: from knowledge test to illness diagnosis.


⚡ Quick Summary

This study evaluates 15 state-of-the-art large language models (LLMs) for their effectiveness in mental health assessment and diagnosis within the Chinese context. Notably, DeepSeek-R1, QwQ, and GPT-4.1 demonstrated superior performance in both knowledge accuracy and diagnostic performance.

๐Ÿ” Key Details

  • 📊 Datasets: Publicly available datasets, including Dreaddit, SDCNL, and CAS Counsellor Qualification Exam questions.
  • 🧩 Models evaluated: 15 LLMs, including DeepSeek-R1/V3, GPT-4.1, Llama4, and QwQ.
  • ⚙️ Key tasks: Mental health knowledge testing and mental illness diagnosis (a minimal evaluation sketch follows this list).
  • 🏆 Top performers: DeepSeek-R1, QwQ, and GPT-4.1.

🔑 Key Takeaways

  • 📊 LLMs are increasingly being applied to mental health assessment, offering new support for diagnosis and counseling.
  • 💡 DeepSeek-R1, QwQ, and GPT-4.1 outperformed the other models in both knowledge accuracy and diagnostic performance.
  • 👩‍🔬 The study drew on diverse datasets to ensure a comprehensive evaluation of model capabilities.
  • 🏆 Results indicate significant potential for LLMs to enhance mental health services in China.
  • 🌍 The findings highlight both strengths and limitations of current LLMs in the mental health domain.
  • 🔍 The study offers guidance for selecting and improving models in sensitive mental health contexts.

📚 Background

The integration of large language models (LLMs) into mental health care is a burgeoning field, offering new avenues for assessment, diagnosis, and psychological counseling. As mental health issues continue to rise globally, the need for innovative solutions becomes increasingly critical. This study aims to systematically evaluate the capabilities of various LLMs in addressing these challenges, particularly within the Chinese context.

๐Ÿ—’๏ธ Study

Conducted by a team of researchers, this study systematically assessed 15 advanced LLMs on their performance in mental health knowledge testing and illness diagnosis. The evaluation utilized publicly available datasets, ensuring a robust analysis of each model’s capabilities. The focus on the Chinese context provides valuable insights into the applicability of these technologies in diverse cultural settings.
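As a rough illustration of the knowledge-testing side (multiple-choice items in the style of the CAS Counsellor Qualification Exam), the sketch below scores single-answer questions by extracting an option letter from the model's reply. The A–D option format and the extraction heuristic are assumptions for illustration, not the authors' pipeline.

```python
# Hedged sketch: scoring multiple-choice knowledge-test items (single correct option).
# The A-D option format and extract_choice() heuristic are illustrative assumptions.

import re
from typing import Callable, List, Tuple

def extract_choice(reply: str) -> str:
    """Return the first standalone option letter (A-D) found in a free-form reply."""
    match = re.search(r"\b([ABCD])\b", reply.upper())
    return match.group(1) if match else ""

def knowledge_test_accuracy(
    items: List[Tuple[str, str]],         # (question text with options, gold letter "A".."D")
    ask_model: Callable[[str], str],      # wraps the model under test
) -> float:
    """Fraction of items where the extracted option letter matches the gold answer."""
    correct = sum(
        extract_choice(ask_model(question + "\n\nAnswer with a single letter.")) == gold
        for question, gold in items
    )
    return correct / len(items)
```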

📈 Results

The results revealed that DeepSeek-R1, QwQ, and GPT-4.1 significantly outperformed other models in both knowledge accuracy and diagnostic performance. These findings underscore the potential of LLMs to enhance mental health assessments and provide reliable diagnostic support, paving the way for future advancements in this field.

๐ŸŒ Impact and Implications

The implications of this study are profound. By leveraging the strengths of LLMs, mental health professionals can improve the accuracy and efficiency of assessments and diagnoses. This could lead to better patient outcomes and more personalized care. As mental health continues to be a pressing global issue, the integration of advanced technologies like LLMs could revolutionize the way we approach mental health care.

🔮 Conclusion

This study highlights the transformative potential of large language models in the field of mental health. With their ability to provide accurate assessments and diagnoses, LLMs represent a significant advancement in mental health care. Continued research and development in this area are essential to fully realize their capabilities and address the ongoing challenges in mental health services.

💬 Your comments

What are your thoughts on the use of large language models in mental health assessment and diagnosis? We would love to hear your insights! 💬 Share your comments below or connect with us on social media.

Evaluation of large language models on mental health: from knowledge test to illness diagnosis.

Abstract

Large language models (LLMs) have opened up new possibilities in the field of mental health, offering applications in areas such as mental health assessment, psychological counseling, and education. This study systematically evaluates 15 state-of-the-art LLMs, including DeepSeek-R1/V3 (March 24, 2025), GPT-4.1 (April 15, 2025), Llama4 (April 5, 2025), and QwQ (March 6, 2025, developed by Alibaba), on two key tasks: mental health knowledge testing and mental illness diagnosis in the Chinese context. We use publicly available datasets, including Dreaddit, SDCNL, and questions from the CAS Counsellor Qualification Exam. Results indicate that DeepSeek-R1, QwQ, and GPT-4.1 outperform other models in both knowledge accuracy and diagnostic performance. Our findings highlight the strengths and limitations of current LLMs in Chinese mental health scenarios and provide clear guidance for selecting and improving models in this sensitive domain.

Authors: Xu Y, Fang Z, Lin W, Jiang Y, Jin W, Balaji P, Wang J, Xia T

Journal: Front Psychiatry

Citation: Xu Y, et al. Evaluation of large language models on mental health: from knowledge test to illness diagnosis. Front Psychiatry. 2025; 16:1646974. doi: 10.3389/fpsyt.2025.1646974

