⚡ Quick Summary
This study evaluated the alignment of responses generated by a Large Language Model (LLM) with those of human therapists during Motivational Interviewing (MI) sessions. The findings revealed that while LLM responses showed general alignment, there were significant limitations in long-range coherence and stylistic alignment.
📌 Key Details
- 📊 Dataset: 3706 therapist turns from 154 MI sessions
- 🧩 Features used: High-fidelity therapist-client transcripts
- ⚙️ Technology: GPT-4o LLM
- 📈 Performance Metrics: Mean DeepEval score: 0.72; Mean cosine similarity score: 0.29
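To make the two headline numbers concrete: the cosine similarity metric is the normalized dot product of two sentence-embedding vectors. The sketch below uses toy four-dimensional vectors, not the study's actual sentence embeddings, purely to show how the score is computed:

```python
import math

def cosine_similarity(a, b):
    # Normalized dot product of two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a therapist turn and an LLM-generated turn
# (illustrative values only; the study embedded full sentences).
therapist_vec = [0.2, 0.7, 0.1, 0.4]
llm_vec = [0.3, 0.6, 0.2, 0.5]
score = cosine_similarity(therapist_vec, llm_vec)
```

A score near 1 means the two responses point in nearly the same direction in embedding space; the study's mean of 0.29 indicates the LLM's wording diverged substantially from the therapists'.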
🔑 Key Takeaways
- 🤖 LLMs can generate responses that align with human therapists in MI contexts.
- 📊 DeepEval scores indicated greater contextual appropriateness compared to cosine similarity scores.
- 📈 Therapist topic consistency significantly moderated response alignment.
- 📉 Performance declined slightly in longer conversations, indicating challenges in maintaining coherence.
- ⚠️ Distinct metrics captured different aspects of response alignment, highlighting the need for comprehensive evaluation methods.
- 🧠 Improved prompt design and MI-specific evaluation methods are necessary for better integration into mental health care.

📖 Background
The integration of Large Language Models (LLMs) into mental health care is a burgeoning field, with potential applications in various therapeutic contexts. Motivational Interviewing (MI) is a structured counseling approach that emphasizes collaboration and empathy, making it an ideal framework for assessing the effectiveness of LLM-generated responses in therapeutic settings.
🏛️ Study
Conducted between March and May 2025, this cross-sectional study utilized high-fidelity transcripts from publicly available counseling videos. The researchers employed the Motivational Interviewing Treatment Integrity system to annotate therapist-client interactions, allowing for a robust evaluation of LLM responses generated using a standardized MI-informed prompt.
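The per-turn generation setup can be sketched as follows. The prompt wording and function name here are hypothetical stand-ins, since the study's exact standardized MI-informed prompt is not reproduced in this summary:

```python
def build_mi_prompt(context_turns):
    """Assemble an MI-informed prompt from the preceding conversation
    context (hypothetical template, not the study's actual prompt)."""
    system = (
        "You are a therapist using Motivational Interviewing. "
        "Respond with empathy, reflective listening, and open questions."
    )
    history = "\n".join(f"{speaker}: {text}" for speaker, text in context_turns)
    return f"{system}\n\nConversation so far:\n{history}\n\nTherapist:"

# Example: context preceding one annotated therapist turn.
context = [
    ("Client", "I know I should cut back on drinking, but it's hard."),
    ("Therapist", "It sounds like part of you really wants a change."),
    ("Client", "Yeah, especially for my kids."),
]
prompt = build_mi_prompt(context)
# `prompt` would then be sent to GPT-4o, and the generated reply
# scored against the human therapist's actual next turn.
```

The key design point from the paper is that the same standardized prompt was applied at every therapist turn, so differences in alignment reflect the conversation context rather than prompt variation.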
📊 Results
The analysis of 3706 therapist turns revealed that mean DeepEval scores (0.72) were significantly higher than mean cosine similarity scores (0.29), indicating that LLM responses were judged contextually appropriate but showed limited semantic overlap with the therapists' actual wording. Therapist topic consistency also significantly moderated alignment: both cosine similarity and DeepEval scores were higher in sessions with greater topic consistency.
🌍 Impact and Implications
The findings of this study underscore the potential for LLMs to assist in therapeutic contexts, particularly in MI. However, the limitations identified in long-range coherence and stylistic alignment suggest that further refinement of LLMs and evaluation methods is essential before their widespread adoption in mental health care. This research opens the door for future innovations in therapeutic technologies, potentially enhancing the quality of care provided to clients.
🔮 Conclusion
This study highlights the promising alignment of LLM responses with those of human therapists in MI sessions, yet it also points to critical areas for improvement. As we continue to explore the integration of AI in mental health, it is vital to focus on enhancing prompt design and developing MI-specific evaluation methods to ensure that these technologies can effectively support therapeutic practices. The future of AI in mental health care is bright, but careful consideration and validation are necessary for successful implementation.
💬 Your comments
What are your thoughts on the use of AI in therapeutic settings? Do you believe LLMs can enhance the quality of mental health care? 💬 Share your insights in the comments below or connect with us on social media.
Alignment of Large Language Model Responses With Human Therapists in Motivational Interviewing.
Abstract
IMPORTANCE: Large language models (LLMs) are increasingly applied to mental health contexts, yet their capacity to generate responses that align with evidence-based psychotherapy remains uncertain. Motivational interviewing (MI), a structured counseling approach, provides an empirically grounded setting for evaluating alignment between LLM-generated and human therapist responses.
OBJECTIVE: To evaluate how closely an LLM’s responses align with therapist responses in MI sessions, using automated similarity metrics.
DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study used high-fidelity therapist-client transcripts annotated with the Motivational Interviewing Treatment Integrity system. Transcripts were sourced from publicly available counseling videos. For each therapist turn, the GPT-4o LLM generated a response using a standardized, MI-informed prompt based on the preceding conversation context. Analyses were conducted between March and May 2025.
MAIN OUTCOMES AND MEASURES: Alignment between LLM-generated and therapist responses was assessed using (1) cosine similarity based on sentence embeddings to capture semantic overlap and (2) DeepEval, a contextual deep-learning-based metric assessing coherence and contextual appropriateness. A therapist topic-consistency index quantified within-session thematic coherence and was examined as a moderator of alignment.
RESULTS: A total of 3706 therapist turns from 154 MI sessions were evaluated. Mean (SD) DeepEval scores were higher than mean (SD) cosine similarity scores (0.72 [0.31] vs 0.29 [0.20]; P < .001), suggesting limited semantic overlap despite greater contextual appropriateness. Therapist topic consistency significantly moderated similarity, where cosine similarity was higher in high-consistency than low-consistency sessions (mean [SD] difference, 0.027 [0.007]; t(3706) = 3.987; P < .001), as was DeepEval score (mean [SD] difference, 0.038 [0.010]; t(3706) = 3.747; P < .001). Correlation between metrics was negligible (Spearman ρ, -0.01), indicating that they captured distinct aspects of response alignment. LLM performance declined slightly across longer conversations (mean [SD] slope reduction for cosine similarity, -0.0005 [0.0016], and for DeepEval, -0.0005 [0.0022]), with increased verbosity and signs of reduced contextual grounding.
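The negligible Spearman correlation between the two metrics can be illustrated with a minimal rank-correlation sketch. The per-turn scores below are invented toy values, not data from the study, and the implementation assumes no tied values:

```python
def spearman_rho(xs, ys):
    # Spearman's rank correlation: Pearson correlation of the ranks.
    # Minimal version; assumes no tied values in either list.
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n + 1) / 2  # mean rank
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # equal for both rank lists when no ties
    return cov / var

# Toy per-turn scores: cosine similarity vs DeepEval (illustrative only).
cosine_scores = [0.21, 0.35, 0.28, 0.40, 0.25]
deepeval_scores = [0.80, 0.55, 0.90, 0.60, 0.70]
rho = spearman_rho(cosine_scores, deepeval_scores)
```

A ρ near zero, as the study reports (-0.01), means a turn that scores high on one metric is no more likely to score high on the other, which is why the authors treat the metrics as capturing distinct aspects of alignment.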
CONCLUSIONS AND RELEVANCE: In this cross-sectional study of 154 MI sessions, prompted LLMs showed general alignment with therapist responses in MI-oriented conversations, as judged by automated similarity metrics. However, limitations in long-range coherence, stylistic alignment, and the use of indirect proxies for therapeutic quality highlight the need for improved prompt design, MI-specific evaluation methods, and clinical validation before integration into mental health care.
Authors: Teferra BG, Huang S, Johny N, Perivolaris A, Al-Shamali H, Parkington K, Rueda A, Zeifman RJ, Sharma D, Krishnan S, Monson C, Bhat V
Journal: JAMA Netw Open
Citation: Teferra BG, et al. Alignment of Large Language Model Responses With Human Therapists in Motivational Interviewing. JAMA Netw Open. 2026;9:e262750. doi: 10.1001/jamanetworkopen.2026.2750