Research - May 5, 2025

Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions.

Quick Summary

Recent large language models (LLMs) have shown promising results in physical medicine and rehabilitation (PM&R). GPT-4o, the newer of the two OpenAI models tested, answered up to 84.1% of 744 PM&R knowledge questions correctly, far ahead of the older GPT-3.5, which peaked at 56.9%.

Key Details

  • Dataset: 744 PM&R knowledge questions
  • Topics covered: 11 PM&R subject areas, from general rehabilitation and stroke to pharmacology
  • Technology: OpenAI’s GPT-3.5 and GPT-4o
  • Performance across three runs: GPT-3.5: 56.3% – 56.9%, GPT-4o: 83.6% – 84.1%

Key Takeaways

  • LLMs are evolving rapidly, with significant improvements in accuracy.
  • GPT-4o demonstrated a marked increase in performance over GPT-3.5.
  • Applications of LLMs in healthcare could enhance clinical practice and medical education.
  • Caution is advised when integrating LLMs into clinical settings due to existing limitations.
  • The study highlights the need for further evaluation of LLMs in specialized medical fields.

Background

The integration of machine learning and artificial intelligence in healthcare has opened new avenues for improving patient care and medical education. However, before these technologies can be effectively utilized, it is essential to assess their accuracy and reliability, particularly in specialized fields like physical medicine and rehabilitation (PM&R).

Study

This cross-sectional study evaluated the accuracy and precision of two OpenAI LLMs, GPT-3.5 and GPT-4o, on the same set of 744 PM&R knowledge questions. The questions spanned the field, including general rehabilitation, stroke, traumatic brain injury, spinal cord injury, musculoskeletal medicine, and pain medicine, among other topics. Each model was run three times on the full question set to assess precision, that is, how consistent its scores were from run to run.
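The outcome measure, percentage of correctly answered questions, can be illustrated with a minimal Python sketch. The study does not describe its scoring procedure, so the function name `percent_correct`, the five-question answer key, and the three sets of responses below are all hypothetical and stand in for the real 744-question set.

```python
def percent_correct(answer_key, responses):
    """Return the percentage of responses that match the answer key."""
    if not answer_key:
        raise ValueError("answer key must not be empty")
    if len(answer_key) != len(responses):
        raise ValueError("answer key and responses must be the same length")
    correct = sum(key == resp for key, resp in zip(answer_key, responses))
    return 100 * correct / len(answer_key)


# Hypothetical data: a five-question key and one model's answers on three runs.
answer_key = ["A", "C", "B", "D", "A"]
runs = [
    ["A", "C", "B", "A", "A"],  # 4 of 5 correct
    ["A", "C", "D", "D", "A"],  # 4 of 5 correct
    ["A", "C", "B", "A", "B"],  # 3 of 5 correct
]

# Accuracy per run, and the run-to-run spread as a simple precision check.
accuracies = [percent_correct(answer_key, run) for run in runs]
spread = max(accuracies) - min(accuracies)
print(accuracies, spread)
```

A small spread between runs indicates that a model gives consistent answers when asked the same questions repeatedly, which is what the study means by precision.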

Results

Across three runs, GPT-3.5 answered 56.3%, 56.5%, and 56.9% of the questions correctly. GPT-4o performed substantially better, answering 83.6%, 84%, and 84.1% correctly, a gain of roughly 27 percentage points. Scores for each model varied by no more than 0.6 percentage points between runs. GPT-4o also outperformed GPT-3.5 in every subcategory of PM&R questions.
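The reported percentages can be summarized with a short Python sketch. The accuracy figures and the question count come from the study. The per-run counts of correct answers are not reported, so the code back-calculates them by rounding and they should be read as approximations. The names `summarize` and `reported_accuracy` are illustrative.

```python
from statistics import mean

TOTAL_QUESTIONS = 744  # size of the question set reported in the study

# Per-run accuracy (percent correct) as reported for each model.
reported_accuracy = {
    "GPT-3.5": [56.3, 56.5, 56.9],
    "GPT-4o": [83.6, 84.0, 84.1],
}


def summarize(runs):
    """Mean accuracy, run-to-run spread, and estimated correct counts."""
    return {
        "mean": round(mean(runs), 2),
        "spread": round(max(runs) - min(runs), 2),
        # Assumption: counts are estimated from the rounded percentages.
        "approx_correct": [round(p / 100 * TOTAL_QUESTIONS) for p in runs],
    }


summary = {model: summarize(runs) for model, runs in reported_accuracy.items()}
for model, stats in summary.items():
    print(model, stats)
```

Under these assumptions, GPT-3.5 answered roughly 419 to 423 questions correctly per run and GPT-4o roughly 622 to 626, with each model's accuracy varying by about half a percentage point between runs.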

Impact and Implications

The findings from this study suggest that LLM technology has the potential to augment clinical practice, enhance medical training, and improve patient education in PM&R. However, it is crucial for healthcare professionals to remain cautious in adopting these technologies, as they still have limitations that need to be addressed before widespread implementation.

Conclusion

The advancements in LLMs, particularly with the introduction of GPT-4o, highlight the exciting potential of artificial intelligence in healthcare. As these technologies continue to evolve, they may play a vital role in improving the accuracy of medical knowledge assessments and enhancing patient care. Ongoing research and evaluation will be essential to fully realize their benefits while ensuring safe and effective use in clinical settings.

Abstract

BACKGROUND: There have been significant advances in machine learning and artificial intelligence technology over the past few years, leading to the release of large language models (LLMs) such as ChatGPT. There are many potential applications for LLMs in health care, but it is critical to first determine how accurate LLMs are before putting them into practice. No studies have evaluated the accuracy and precision of LLMs in responding to questions related to the field of physical medicine and rehabilitation (PM&R).
OBJECTIVE: To determine the accuracy and precision of two OpenAI LLMs (GPT-3.5, released in November 2022, and GPT-4o, released in May 2024) in answering questions related to PM&R knowledge.
DESIGN: Cross-sectional study. Both LLMs were tested on the same 744 PM&R knowledge questions that covered all aspects of the field (general rehabilitation, stroke, traumatic brain injury, spinal cord injury, musculoskeletal medicine, pain medicine, electrodiagnostic medicine, pediatric rehabilitation, prosthetics and orthotics, rheumatology, and pharmacology). Each LLM was tested three times on the same question set to assess for precision.
SETTING: N/A.
PATIENTS: N/A.
INTERVENTIONS: N/A.
MAIN OUTCOME MEASURE: Percentage of correctly answered questions.
RESULTS: For three runs of the 744-question set, GPT-3.5 answered 56.3%, 56.5%, and 56.9% of the questions correctly. For three runs of the same question set, GPT-4o answered 83.6%, 84%, and 84.1% of the questions correctly. GPT-4o outperformed GPT-3.5 in all subcategories of PM&R questions.
CONCLUSIONS: LLM technology is rapidly advancing, with the more recent GPT-4o model performing much better on PM&R knowledge questions compared to GPT-3.5. There is potential for LLMs in augmenting clinical practice, medical training, and patient education. However, the technology has limitations and physicians should remain cautious in using it in practice at this time.

Authors: Bitterman J, D’Angelo A, Holachek A, Eubanks JE

Journal: PM R

Citation: Bitterman J, et al. Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions. PM R. 2025. doi: 10.1002/pmrj.13386
