🧑🏼‍💻 Research - January 11, 2025

Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study.


⚡ Quick Summary

A recent study evaluated the performance of seven large language models (LLMs) on the Chinese National Nursing Licensing Examination (CNNLE), revealing that Qwen-2.5 outperformed its peers with an impressive accuracy of 88.9%. The integration of machine learning techniques further enhanced accuracy to 90.8%, showcasing the transformative potential of LLMs in healthcare education.

🔍 Key Details

  • 📊 Dataset: 1200 multiple-choice questions from CNNLE (2019-2023)
  • 🧩 Models evaluated: GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, Qwen-2.5
  • ⚙️ Machine learning models used: Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, CatBoost
  • 🏆 Top performer: Qwen-2.5 with 88.9% accuracy

🔑 Key Takeaways

  • 🏆 Qwen-2.5 achieved the highest accuracy among evaluated LLMs.
  • 📈 XGBoost improved combined model accuracy to 90.8%.
  • 🔍 Qwen-2.5 excelled in the Practical Skills section of the exam.
  • 📊 Area under the curve (AUC) for XGBoost was 0.961.
  • 💡 Sensitivity of XGBoost was 0.905 and specificity 0.978.
  • 🤖 LLMs demonstrated potential in handling complex clinical decision-making.
  • 🌍 First study to assess LLMs on the CNNLE, paving the way for future research.

📚 Background

The Chinese National Nursing Licensing Examination (CNNLE) poses unique challenges due to its requirement for both deep nursing knowledge and complex clinical decision-making. As large language models (LLMs) gain traction in medical education, their application in such specialized examinations remains largely unexplored, highlighting a significant gap in research and practice.

🗒️ Study

This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. The performance of seven LLMs was evaluated, focusing on their ability to manage domain-specific nursing knowledge and clinical decision-making. Additionally, the study explored whether combining outputs from these models using machine learning techniques could enhance overall accuracy.
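The combination step described above is a form of stacking: each exam question becomes one training row whose features are the seven LLMs' chosen options, and a meta-learner is trained to predict the correct option. Below is a minimal sketch of that idea, assuming scikit-learn is available and using Logistic Regression as the meta-learner (one of the nine models the study tested; its best performer was XGBoost). The data here is synthetic, purely for illustration — it is not the study's dataset.

```python
# Sketch of stacking LLM answers: features are each model's chosen
# option (A-E), one-hot encoded; the target is the correct option.
# Synthetic data for illustration only -- not the study's dataset.
import random

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

random.seed(0)
OPTIONS = ["A", "B", "C", "D", "E"]
N_MODELS = 7          # seven LLMs, as in the study
N_QUESTIONS = 1200    # CNNLE 2019-2023 question count

# Simulate each LLM answering: mostly correct, with independent errors.
truth = [random.choice(OPTIONS) for _ in range(N_QUESTIONS)]
answers = [
    [t if random.random() < 0.8 else random.choice(OPTIONS)
     for _ in range(N_MODELS)]
    for t in truth
]

# One-hot encode the seven categorical answers into a feature matrix.
enc = OneHotEncoder(categories=[OPTIONS] * N_MODELS)
X = enc.fit_transform(answers)

# The meta-learner maps the ensemble of answers to the correct option.
meta = LogisticRegression(max_iter=1000)
meta.fit(X, truth)
print(f"training accuracy: {meta.score(X, truth):.3f}")
```

Because the seven models make largely independent errors, the learned combination can beat any single model — the same effect the study observed when XGBoost lifted accuracy above Qwen-2.5's individual score.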

📈 Results

The results indicated that Qwen-2.5 achieved the highest overall accuracy of 88.9%, significantly outperforming other models such as GPT-4o at 80.7% and ERNIE Bot-3.5 at 78.1%. The integration of outputs from the seven LLMs using XGBoost resulted in an impressive accuracy of 90.8%, with a remarkable AUC of 0.961.
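The figures reported alongside accuracy (sensitivity, specificity, F1-score, and the predictive values given in the abstract) all derive from the confusion matrix of the ensemble's answers. As a quick reference, here is a sketch of those standard definitions; the counts passed in at the bottom are illustrative, not the study's.

```python
# Standard confusion-matrix metrics, as reported for the XGBoost
# ensemble. The example counts below are illustrative, not from the study.
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }

print(confusion_metrics(tp=90, fp=10, tn=970, fn=10))
```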

🌍 Impact and Implications

The findings from this study underscore the transformative potential of LLMs in revolutionizing healthcare education. By demonstrating their ability to accurately navigate complex nursing scenarios, LLMs like Qwen-2.5 could significantly enhance examination preparation and professional training, ultimately improving the quality of care in nursing practice.

🔮 Conclusion

This pioneering study highlights the significant capabilities of large language models in the context of nursing examinations. The integration of machine learning techniques not only boosts accuracy but also opens new avenues for research and application in healthcare education. As we look to the future, further exploration of LLMs in medical training could lead to substantial advancements in the field.

💬 Your comments

What are your thoughts on the use of large language models in nursing education? We would love to hear your insights! 💬 Leave your comments below.


Abstract

BACKGROUND: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored.
OBJECTIVE: This study aims to evaluate the accuracy of 7 LLMs, including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy.
METHODS: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques.
RESULTS: Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977.
CONCLUSIONS: This study is the first to evaluate the performance of 7 LLMs on the CNNLE and shows that integrating the models via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training.

Authors: Zhu S, Hu W, Yang Z, Yan J, Zhang F

Journal: JMIR Med Inform

Citation: Zhu S, et al. Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study. JMIR Med Inform. 2025; 13:e63731. doi: 10.2196/63731

