โก Quick Summary
This systematic review evaluated the use of Large Language Models (LLMs) in virtual patient systems for medical history-taking, highlighting their potential to enhance medical education. The findings indicate that while LLMs show promise in simulating various diseases, they currently lack adequate representation of multimorbidity scenarios.
๐ Key Details
- ๐ Dataset: 39 studies reviewed, focusing on internal medicine and mental health disorders
- โ๏ธ Technology: LLMs such as GPT-3.5 and GPT-4
- ๐ Performance Metrics: Top-k accuracy ranged from 0.45 to 0.98, with a hallucination rate of 0.31%-5%
- ๐งฉ Techniques Used: Role-based prompts, few-shot learning, multiagent frameworks, and knowledge graph integration
๐ Key Takeaways
- ๐ LLMs are transforming virtual patient systems in medical education.
- ๐ก Techniques like knowledge graph integration improved diagnostic accuracy.
- ๐งโ๐ Evaluations typically involved 10-50 students and 3-10 experts.
- ๐ Common datasets like MIMIC-III showed ICU bias and limited diversity.
- ๐ Future research should focus on multimodal LLMs and standardized metrics.
- โ ๏ธ Limitations include small sample sizes and inconsistent evaluation metrics.
- ๐๏ธ Risk of bias was moderate across included studies.
- ๐ Open-access datasets are essential for enhancing reproducibility and external validity.

๐ Background
The integration of Large Language Models (LLMs) in medical education represents a significant advancement, offering scalable and cost-effective alternatives to traditional standardized patients. However, the effectiveness of these models, particularly in complex scenarios involving multiple coexisting diseases, remains underexplored. This review aims to fill that gap by systematically evaluating LLM-based virtual patient systems for history-taking.
๐๏ธ Study
Conducted following the PRISMA guidelines, this systematic review analyzed data from nine databases, focusing on studies published between January 1, 2020, and August 18, 2025. The review specifically targeted LLMs used for history-taking tasks, excluding non-transformer models and unrelated tasks. A total of 39 studies were included, providing a comprehensive overview of the current landscape.
๐ Results
The review revealed that LLM-based virtual patient systems primarily simulated internal medicine and mental health disorders. While many studies focused on distinct single disease types, few addressed multimorbidity or rare conditions. Techniques such as role-based prompts and knowledge graph integration significantly enhanced dialogue and diagnostic accuracy, with top-k accuracy reaching 16.02%. However, the evaluations were limited by small sample sizes and inconsistent metrics, which hindered generalizability.
๐ Impact and Implications
The findings from this review underscore the potential of LLMs to revolutionize medical education by providing realistic and immersive virtual patient interactions. However, the lack of representation for multimorbidity scenarios highlights a critical area for future research. By addressing these gaps, we can enhance the educational utility of LLMs and improve the training of future healthcare professionals.
๐ฎ Conclusion
This systematic review illustrates the transformative potential of LLMs in medical education, particularly in history-taking. While current systems excel in simulating various disease types, there is a pressing need for further research to incorporate multimorbidity and enhance the realism of these virtual interactions. The future of medical education could greatly benefit from the integration of advanced LLM technologies and standardized evaluation metrics.
๐ฌ Your comments
What are your thoughts on the use of LLMs in medical education? Do you see potential for improvement in representing complex patient scenarios? Let’s discuss! ๐ฌ Leave your comments below or connect with us on social media:
Large Language Model-Based Virtual Patient Systems for History-Taking in Medical Education: Comprehensive Systematic Review.
Abstract
BACKGROUND: Large language models (LLMs), such as GPT-3.5 and GPT-4 (OpenAI), have been transforming virtual patient systems in medical education by providing scalable and cost-effective alternatives to standardized patients. However, systematic evaluations of their performance, particularly for multimorbidity scenarios involving multiple coexisting diseases, are still limited.
OBJECTIVE: This systematic review aimed to evaluate LLM-based virtual patient systems for medical history-taking, addressing four research questions: (1) simulated patient types and disease scope, (2) performance-enhancing techniques, (3) experimental designs and evaluation metrics, and (4) dataset characteristics and availability.
METHODS: Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020, 9 databases were searched (January 1, 2020, to August 18, 2025). Nontransformer LLMs and non-history-taking tasks were excluded. Multidimensional quality and bias assessments were conducted.
RESULTS: A total of 39 studies were included, screened by one computer science researcher under supervision. LLM-based virtual patient systems mainly simulated internal medicine and mental health disorders, with many addressing distinct single disease types but few covering multimorbidity or rare conditions. Techniques like role-based prompts, few-shot learning, multiagent frameworks, knowledge graph (KG) integration (top-k accuracy 16.02%), and fine-tuning enhanced dialogue and diagnostic accuracy. Multimodal inputs (eg, speech and imaging) improved immersion and realism. Evaluations, typically involving 10-50 students and 3-10 experts, demonstrated strong performance (top-k accuracy: 0.45-0.98, hallucination rate: 0.31%-5%, System Usability Scale [SUS] โฅ80). However, small samples, inconsistent metrics, and limited controls restricted generalizability. Common datasets such as MIMIC-III (Medical Information Mart for Intensive Care-III) exhibited intensive care unit (ICU) bias and lacked diversity, affecting reproducibility and external validity.
CONCLUSIONS: Included studies showed moderate risk of bias, inconsistent metrics, small cohorts, and limited dataset transparency. LLM-based virtual patient systems excel in simulating multiple disease types but lack multimorbidity patient representation. KGs improve top-k accuracy and support structured disease representation and reasoning. Future research should prioritize hybrid KG-chain-of-thought architectures integrated with open-source KGs (eg, UMLS [Unified Medical Language System] and SNOMED-CT [Systematized Nomenclature of Medicine – Clinical Terms]), parameter-efficient fine-tuning, dialogue compression, multimodal LLMs, standardized metrics, larger cohorts, and open-access multimodal datasets to further enhance realism, diagnostic accuracy, fairness, and educational utility.
Author: [‘Li D’, ‘Lebai Lutfi S’]
Journal: JMIR Med Inform
Citation: Li D and Lebai Lutfi S. Large Language Model-Based Virtual Patient Systems for History-Taking in Medical Education: Comprehensive Systematic Review. Large Language Model-Based Virtual Patient Systems for History-Taking in Medical Education: Comprehensive Systematic Review. 2026; 14:e79039. doi: 10.2196/79039