⚡ Quick Summary
This study evaluated the accuracy and efficiency of two large language models, GPT-4o and OpenBioLLM-70B, in fact-checking responses to a health-related prompt and a set of medical claims. Both models showed high agreement with human experts, demonstrating their potential to enhance patient safety and the quality of health information.
📌 Key Details
- 👥 Participants: 2 large language models (GPT-4o and OpenBioLLM-70B) and 3 human experts
- 🧩 Scenario: A 23-year-old woman inquiring about the safety of retinoid treatment for acne
- ⚙️ Methodology: Parallel comparison of automated fact-checking against clinical experts, plus a prompt-redrafting exercise
- 📊 Performance Metrics: Fact-checking accuracy (% agreement with experts), time to complete fact-checking, and a binary prompt-redrafting outcome
🔑 Key Takeaways
- 🤖 GPT-4o and OpenBioLLM-70B each achieved 86% agreement with clinical experts in the acne scenario.
- 💡 GPT-4o reached 100% agreement on a set of 20 medical claims, while OpenBioLLM-70B reached 95%.
- ⏱️ Fact-checking efficiency: GPT-4o completed checks in 42 seconds, OpenBioLLM-70B took 33 minutes, and human experts averaged 18 minutes.
- ⚠️ Important omissions: Both models failed to consistently convey the urgency of discontinuing isotretinoin when pregnancy is suspected, or the importance of folic acid supplementation during pregnancy.
- ✅ Zero fabrication: GPT-4o's responses contained some inconsistencies but no fabrications or obvious omissions.
- 🧑‍⚕️ Human experts still play a crucial role in ensuring accuracy and safety in health-related information.
- 🚀 Potential applications: These models could enhance patient education and clinical decision-making.

📖 Background
The integration of large language models in healthcare presents exciting opportunities for improving patient education and clinical decision-making. However, the accuracy and reliability of these models in providing health-related information remain underexplored. This study aims to bridge that gap by evaluating the performance of two prominent models in a real-world clinical scenario.
🔬 Study
Conducted by researchers Ryan P, Davoren O, and Elwyn G, this study involved a parallel comparison of GPT-4o and OpenBioLLM-70B against human experts. The clinical scenario focused on a 23-year-old woman questioning the safety of retinoid treatment for acne. The models were tasked with improving her initial prompt and fact-checking the responses generated.
📈 Results
The findings revealed that both GPT-4o and OpenBioLLM-70B achieved 86% agreement with the human experts in the acne scenario. Notably, GPT-4o excelled in fact-checking, achieving 100% agreement on a set of 20 medical claims, while OpenBioLLM-70B reached 95%. GPT-4o was also markedly faster, completing its checks in 42 seconds, compared with a mean of 18 minutes for the human experts and 33 minutes for OpenBioLLM-70B.
🌍 Impact and Implications
The implications of this study are profound. By demonstrating the ability of large language models to conduct efficient and accurate fact-checking, we can envision a future where these technologies play a pivotal role in enhancing patient safety and education. The potential for integrating AI in healthcare could lead to improved clinical decision-making and better patient outcomes.
🔮 Conclusion
This study highlights the significant potential of large language models like GPT-4o in supporting healthcare professionals and patients alike. While these models show promise in improving the quality of health-related information, the necessity for human oversight remains critical to ensure accuracy and safety. Continued research in this area is essential for harnessing the full capabilities of AI in healthcare.
💬 Your comments
What are your thoughts on the use of large language models in healthcare? Do you believe they can enhance patient safety and education? 💬 Share your insights in the comments below.
Fact-Checking Large Language Model Responses to a Health Care Prompt: Comparative Study.
Abstract
BACKGROUND: Large language models use machine learning to produce natural language. These models have a range of potential applications in health care, such as patient education and diagnosis. However, evaluations of large language models in health care are still scarce.
OBJECTIVE: This study aimed to (1) evaluate the accuracy and efficiency of automated fact-checking by 2 large language models and (2) illustrate a process through which a large language model might support a patient in redrafting a prompt to include key information needed for patient safety.
METHODS: A parallel comparison of 2 large language models and 3 human experts was conducted. A clinical scenario was devised in which a woman aged 23 years questions the safety of retinoid treatment for acne by sending prompts to 2 large language models (GPT-4o and OpenBioLLM-70B). GPT-4o and OpenBioLLM-70B were asked to suggest improvements to the patient’s initial prompt to elicit key information for clinical decision-making. After the patient sent the revised prompt to the large language models, the models were then asked to fact-check the final response. To test the generalizability of automated fact-checking, a set of 20 clinical statements on disparate topics, mostly related to drug indications, contraindications, and side effects, was developed. The large language models also fact-checked these 20 medical statements. The results were compared against the evaluations of 3 clinical experts. The outcome measures were as follows: (1) percentage of accuracy of automated fact-checking, (2) time to complete fact-checking, and (3) a binary outcome for prompt redrafting (advising the patient to revise her prompt by naming her acne medication to address safety concerns).
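(Illustration only, not drawn from the paper.) The first outcome measure is the percentage agreement between automated and expert fact-checking. A minimal sketch of how that figure could be computed, assuming each fact-check verdict is reduced to a single true/false label per claim; the claim counts and verdicts below are hypothetical placeholders, not study data:

```python
# Illustrative sketch: compare hypothetical model fact-check verdicts against
# an expert consensus to compute percentage agreement (outcome measure 1).

def percent_agreement(model_verdicts: list[bool], expert_verdicts: list[bool]) -> float:
    """Share of claims on which the model's verdict matches the expert consensus."""
    if len(model_verdicts) != len(expert_verdicts):
        raise ValueError("Verdict lists must cover the same claims")
    matches = sum(m == e for m, e in zip(model_verdicts, expert_verdicts))
    return 100.0 * matches / len(expert_verdicts)

# Hypothetical example with 20 claims: the model disagrees on one claim,
# giving 95% agreement (the figure reported for OpenBioLLM-70B).
expert = [True] * 20
model = [True] * 19 + [False]
print(f"Agreement: {percent_agreement(model, expert):.0f}%")  # Agreement: 95%
```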
RESULTS: For the scenario of a patient with acne, GPT-4o and OpenBioLLM-70B both had 86% agreement with the clinical experts’ fact-checking. The large language models did not consistently convey the urgency of discontinuing isotretinoin treatment when pregnancy is suspected. In addition, the models did not adequately convey the importance of folic acid supplementation during pregnancy. For the set of 20 medical claims, GPT-4o fact-checking had 100% agreement with that of human experts, whereas OpenBioLLM-70B had 95% agreement. OpenBioLLM-70B diverged from human experts and GPT-4o on 1 question related to pediatric use of antihistamines. The expert fact-checks took a mean time of 18 (SD 3.74) minutes, GPT-4o took 42 seconds, and OpenBioLLM-70B took 33 minutes. The GPT-4o responses for the acne scenario had some inconsistencies but zero fabrication and no obvious omissions. In contrast, OpenBioLLM-70B omitted 1 key information item needed for patient safety.
CONCLUSIONS: GPT-4o can interact with patients to improve the quality and comprehensiveness of the information contained in health-related prompts. GPT-4o and OpenBioLLM-70B can conduct efficient fact-checking that is close to the level of accuracy of human experts. Human experts need to perform additional checks for accuracy and safety.
Authors: Ryan P, Davoren O, Elwyn G
Journal: JMIR Form Res
Citation: Ryan P, Davoren O, Elwyn G. Fact-Checking Large Language Model Responses to a Health Care Prompt: Comparative Study. JMIR Form Res. 2026;10:e68223. doi: 10.2196/68223