Quick Summary
This study evaluated the reliability and readability of responses generated by five large language models (LLMs) to frequently asked questions about perinatal depression. While most models demonstrated moderate to high reliability, their readability levels exceeded recommended benchmarks, indicating potential challenges for individuals with lower health literacy.
Key Details
- Questions analyzed: 27 frequently asked questions about perinatal depression
- Models tested: ChatGPT-5, Gemini-2.5, Microsoft Copilot, Grok4, DeepSeek
- Evaluation tools: DISCERN, EQIP, JAMA, GQS, HONCODE
- Readability indices: ARI, GFI, CLI, OLWF, LWGLF, FRF
Key Takeaways
- LLMs can provide reliable information on perinatal depression.
- Inter-rater agreement was high, with ICC values ranging from 0.729 to 0.847.
- Grok4 scored highest on the DISCERN scale, with an average of 60.33.
- Readability was a common limitation: all models exceeded the NIH-recommended sixth-grade level.
- Most models produced empathetic and informative content.
- Significant differences were found among models for DISCERN, EQIP, and HONCODE (p < 0.001).
- The study highlights the need for improvements in readability and ethical transparency.
- Future research should focus on standardizing safety behaviors in high-risk mental health contexts.

Background
Perinatal depression is a critical public health issue that affects many individuals during and after pregnancy. With the rise of generative artificial intelligence (AI), large language models (LLMs) are increasingly being utilized to provide health information. However, concerns about the reliability and readability of AI-generated content remain, particularly for vulnerable populations with varying levels of health literacy.
Study
This study aimed to assess the reliability and readability of responses generated by five LLMs to common questions about perinatal depression. Researchers derived 27 frequently asked questions from Google Trends and resources from the American College of Obstetricians and Gynecologists (ACOG). Each question was submitted to the selected LLMs, and responses were rated by two obstetricians using validated instruments.
Results
The results indicated that most LLMs demonstrated moderate to high reliability in their responses. Grok4 achieved the highest score on the DISCERN scale, while DeepSeek excelled in the EQIP assessment. However, all models struggled with readability, exceeding the recommended sixth-grade level, which may hinder comprehension for individuals with lower health literacy.
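Grade-based readability indices such as ARI estimate the U.S. school grade needed to understand a passage from surface features alone (character, word, and sentence counts). As a rough illustration of why LLM answers score above the sixth-grade benchmark, here is a minimal pure-Python sketch of the standard ARI formula; this is not the tooling the authors used, and the sample sentence is invented for demonstration:

```python
import re

def ari(text: str) -> float:
    """Automated Readability Index: approximates the U.S. grade level
    needed to understand the text. Higher scores mean harder reading."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    chars = sum(len(w) for w in words)  # letters only, per the ARI definition
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43

# Hypothetical patient-facing sentence, for illustration only.
sample = "Perinatal depression is treatable. Talk to your clinician about options."
print(round(ari(sample), 2))
```

Even this short, plainly worded sample scores well above grade 6, which illustrates how easily clinical vocabulary ("perinatal", "clinician") pushes grade-based indices upward.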
Impact and Implications
The findings of this study underscore the potential of LLMs as supplementary sources of health information. However, the challenges related to readability highlight the need for further enhancements to ensure that AI-generated content is accessible to all individuals, particularly those with lower health literacy. Improving readability and ethical transparency will be crucial for maximizing the public benefit of these technologies.
Conclusion
This study reveals that while LLMs can provide reliable information on perinatal depression, their readability levels pose a significant barrier for many users. As we continue to integrate AI into health communication, it is essential to focus on improving the clarity and accessibility of the information provided. Future research should aim to establish safety standards in high-risk mental health contexts to facilitate the reliable clinical deployment of these technologies.
Your comments
What are your thoughts on the use of large language models in health communication? Do you believe they can effectively support individuals seeking information on perinatal depression? Share your insights in the comments below or connect with us on social media.
Can large language models be trusted? Reliability and readability of responses to perinatal depression FAQs.
Abstract
OBJECTIVE: Large language models (LLMs), a core technology of generative artificial intelligence (AI), are increasingly used in health education and promotion. Although they may expand access to medical information, concerns remain regarding the reliability and readability of AI-generated content for the public. This study evaluated the reliability and readability of answers generated by five LLMs to common questions about perinatal depression. The primary aims were to determine (1) the reliability of LLM responses to frequently asked questions about perinatal depression and (2) whether the readability of the generated content aligns with public health literacy levels.
METHODS: Twenty-seven frequently asked questions were derived from Google Trends and patient-facing resources from the American College of Obstetricians and Gynecologists (ACOG). Each question was submitted to ChatGPT-5, Gemini-2.5, Microsoft Copilot, Grok4, and DeepSeek. Two obstetricians independently rated responses using five validated instruments (DISCERN, EQIP, JAMA, GQS, and HONCODE), and inter-rater agreement was quantified using the intraclass correlation coefficient (ICC). Readability was assessed using six indices: ARI, GFI, CLI, OLWF, LWGLF, and FRF. Differences among models were analyzed using the Friedman test.
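The Friedman test suits this design because the same 27 questions (blocks) are scored for all five models (treatments): each model's score is ranked within every question, and the statistic tests whether mean ranks differ across models. A minimal pure-Python sketch of the classic statistic, omitting the tie-correction factor that full implementations such as scipy.stats.friedmanchisquare apply:

```python
def rank(values):
    """Average ranks (1-based) within one block; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied rank positions
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def friedman_stat(scores):
    """scores: n blocks (questions) x k treatments (models).
    Returns the Friedman chi-square statistic (df = k - 1, no tie correction)."""
    n, k = len(scores), len(scores[0])
    col_rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(rank(row)):
            col_rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(R * R for R in col_rank_sums) - 3.0 * n * (k + 1)
```

With n = 3 blocks and k = 3 treatments, a perfectly consistent ordering yields the maximum statistic n(k - 1) = 6, while scores whose ranks balance out across blocks yield 0; a large statistic relative to the chi-square distribution with k - 1 degrees of freedom gives the small p-values reported here.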
RESULTS: Inter-rater agreement was high across the 27 perinatal depression questions, with ICC values ranging from 0.729 to 0.847. Significant between-model differences emerged for DISCERN, EQIP, and HONCODE (all p < 0.001); no overall differences were found for JAMA and GQS. Grok4 scored highest on DISCERN (60.33 ± 5.48), DeepSeek scored highest on EQIP (53.04 ± 4.91), and Copilot scored highest on HONCODE (9.26 ± 1.85), highlighting distinct strengths across quality constructs. Readability posed a common limitation: all models exceeded the NIH-recommended sixth-grade level on grade-based indices (for example, ARI ranged from 13.49 ± 2.92 to 15.81 ± 3.25), and OLWF scores fell well below the sixth-grade benchmark of 94 (ranging from 61.44 ± 6.80 to 72.96 ± 10.39, where higher scores denote easier reading). Most models produced empathetic and informative content but fell short of fully addressing clinical safety standards.
CONCLUSION: Most LLMs demonstrated moderate to high reliability when responding to perinatal depression questions, supporting their potential as supplementary sources of health information. However, readability levels above recommended benchmarks suggest that current outputs may remain challenging for individuals with lower health literacy. While LLMs improve information accessibility, further improvements in readability, source attribution, and ethical transparency are needed to maximize public benefit and support equitable health communication. Future work should focus on defining and standardizing safety behaviors in high-risk mental health contexts to enable reliable clinical deployment.
Authors: Huang J, Yu H, Chen J, Wang X, Huang L, Wen J, Li H
Journal: Front Public Health
Citation: Huang J, et al. Can large language models be trusted? Reliability and readability of responses to perinatal depression FAQs. Front Public Health. 2026;14:1760872. doi: 10.3389/fpubh.2026.1760872