Quick Summary
This study compared the performance of ChatGPT and DeepSeek in answering orthopedic multiple-choice questions (MCQs), revealing that ChatGPT achieved a correctness rate of 80.38% compared to DeepSeek’s 74.2%. Additionally, ChatGPT demonstrated a significantly shorter average response time of 10.40 seconds versus DeepSeek’s 34.42 seconds.
Key Details
- Dataset: 209 orthopedic MCQs from the 2023-2024 academic year
- Technologies used: ChatGPT (with the “Reason” function) and DeepSeek (with the “DeepThink” function)
- Evaluation metrics: Correctness, response time, and reliability
- Performance: ChatGPT, 80.38% correctness; DeepSeek, 74.2% correctness
Key Takeaways
- ChatGPT outperformed DeepSeek in both correctness and response time.
- ChatGPT’s average response time was significantly shorter, at 10.40 seconds versus 34.42 seconds for DeepSeek.
- ChatGPT’s reliability showed almost perfect agreement (κ=0.81), compared with DeepSeek’s substantial agreement (κ=0.78).
- A completely false response was recorded in 7.7% (16/209) of responses for both models.
- Review by orthopedic faculty indicated that some MCQs require revisions for clarity.
- Potential for AI integration into medical assessments is promising.
- Further research is needed to explore AI’s role in other medical disciplines.

Background
Multiple-choice questions (MCQs) are a cornerstone of medical education, serving as a vital tool for assessing knowledge and clinical reasoning. However, traditional MCQ development can be labor-intensive and prone to bias. The emergence of large language models (LLMs) like ChatGPT and DeepSeek presents an innovative opportunity to enhance the efficiency and accuracy of MCQ assessments in medical education.
Study
This cross-sectional study involved 209 orthopedic MCQs from summative assessments conducted during the 2023-2024 academic year. The researchers aimed to compare the performance of ChatGPT and DeepSeek in terms of correctness, response time, and reliability when answering these MCQs. Correctness and response times were compared using a χ2 test and a Mann-Whitney U test where appropriate, reliability was assessed with the Cohen κ coefficient, and MCQs answered incorrectly by both models were reviewed by orthopedic faculty.
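The paper’s raw data and analysis code are not reproduced here, but the described workflow maps onto standard statistical routines. Below is a minimal Python sketch, run on hypothetical (randomly generated) per-question data rather than the study’s results, of how correctness could be compared with a χ2 test, response times with a Mann-Whitney U test, and reliability quantified with the Cohen κ coefficient; the pairing used for κ (two runs of the same model) is an assumption for illustration only.

```python
# Illustrative re-creation of the analysis workflow; all data below are
# randomly generated stand-ins, not the study's actual results.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_questions = 209  # number of MCQs in the study

# Hypothetical per-question correctness (1 = correct, 0 = incorrect).
chatgpt_correct = rng.binomial(1, 0.80, n_questions)
deepseek_correct = rng.binomial(1, 0.74, n_questions)

# Chi-square test on the 2x2 table of correct/incorrect counts per model.
table = [
    [chatgpt_correct.sum(), n_questions - chatgpt_correct.sum()],
    [deepseek_correct.sum(), n_questions - deepseek_correct.sum()],
]
chi2, p_correct, _, _ = chi2_contingency(table)

# Mann-Whitney U test on hypothetical response times (seconds).
chatgpt_times = rng.gamma(shape=2.0, scale=5.0, size=n_questions)    # ~10 s mean
deepseek_times = rng.gamma(shape=2.0, scale=17.0, size=n_questions)  # ~34 s mean
u_stat, p_time = mannwhitneyu(chatgpt_times, deepseek_times)

# Cohen's kappa for reliability: agreement between two answer sets
# (here, two simulated runs of the same model over options A-D).
run1 = rng.integers(0, 4, n_questions)
run2 = np.where(rng.random(n_questions) < 0.85, run1, rng.integers(0, 4, n_questions))
kappa = cohen_kappa_score(run1, run2)

print(f"Correctness: chi2={chi2:.2f}, P={p_correct:.3f}")
print(f"Response time: U={u_stat:.0f}, P={p_time:.3g}")
print(f"Cohen kappa: {kappa:.2f}")
```

Because the sketch uses simulated data, its P values and κ will not match the paper’s; it only shows the shape of the analysis the Methods describe.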
Results
The findings revealed that ChatGPT achieved a correctness rate of 80.38% (168 out of 209 questions), while DeepSeek scored 74.2% (155 out of 209 questions), with a statistically significant difference (P=.04). ChatGPT’s “Reason” function also outperformed DeepSeek’s “DeepThink” function (84.7% vs 80.4% correctness; P=.12). Furthermore, ChatGPT’s average response time was significantly shorter at 10.40 seconds compared to DeepSeek’s 34.42 seconds (P<.001).
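As a quick sanity check, the headline percentages follow directly from the counts reported in the abstract; the short snippet below simply reproduces that arithmetic.

```python
# Arithmetic check of the reported rates (counts taken from the abstract).
total = 209
print(f"ChatGPT correctness:     {168 / total:.2%}")  # 80.38%
print(f"DeepSeek correctness:    {155 / total:.2%}")  # 74.16% (reported as 74.2%)
print(f"ChatGPT 'Reason':        {177 / total:.2%}")  # 84.69% (reported as 84.7%)
print(f"DeepSeek 'DeepThink':    {168 / total:.2%}")  # 80.38% (reported as 80.4%)
print(f"Completely false (both): {16 / total:.2%}")   # 7.66%  (reported as 7.7%)
```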
Impact and Implications
The results of this study highlight the potential of integrating AI tools like ChatGPT into medical education assessments. By demonstrating superior performance in both correctness and response time, ChatGPT could streamline the evaluation process for medical students, ultimately enhancing educational outcomes. However, the need for revisions in some MCQs indicates that human oversight remains crucial in ensuring clarity and accuracy in assessments.
Conclusion
This study underscores the promising role of AI in medical education, particularly in the assessment of MCQs. With ChatGPT showing higher correctness and efficiency, it opens the door for further exploration of AI’s capabilities in various medical disciplines. Continued research is essential to validate these findings and expand the application of LLMs in educational settings.
Your comments
What are your thoughts on the use of AI in medical education assessments? We would love to hear your insights! Share your comments below or connect with us on social media.
Comparing ChatGPT and DeepSeek for Assessment of Multiple-Choice Questions in Orthopedic Medical Education: Cross-Sectional Study.
Abstract
BACKGROUND: Multiple-choice questions (MCQs) are essential in medical education for assessing knowledge and clinical reasoning. Traditional MCQ development involves expert reviews and revisions, which can be time-consuming and subject to bias. Large language models (LLMs) have emerged as potential tools for evaluating MCQ accuracy and efficiency. However, direct comparisons of these models in orthopedic MCQ assessments are limited.
OBJECTIVE: This study compared the performance of ChatGPT and DeepSeek in terms of correctness, response time, and reliability when answering MCQs from an orthopedic examination for medical students.
METHODS: This cross-sectional study included 209 orthopedic MCQs from summative assessments during the 2023-2024 academic year. ChatGPT (including the “Reason” function) and DeepSeek (including the “DeepThink” function) were used to identify the correct answers. Correctness and response times were recorded and compared using a χ2 test and Mann-Whitney U test where appropriate. The two LLMs’ reliability was assessed using the Cohen κ coefficient. The MCQs incorrectly answered by both models were reviewed by orthopedic faculty to identify ambiguities or content issues.
RESULTS: ChatGPT achieved a correctness rate of 80.38% (168/209), while DeepSeek achieved 74.2% (155/209; P=.04). ChatGPT’s Reason function also outperformed DeepSeek’s DeepThink function (177/209, 84.7% vs 168/209, 80.4%; P=.12). The average response time for ChatGPT was 10.40 (SD 13.29) seconds, significantly shorter than DeepSeek’s 34.42 (SD 25.48) seconds (P<.001). Regarding reliability, ChatGPT demonstrated an almost perfect agreement (κ=0.81), whereas DeepSeek showed substantial agreement (κ=0.78). A completely false response was recorded in 7.7% (16/209) of responses for both models.
CONCLUSIONS: ChatGPT outperformed DeepSeek in correctness and response time, demonstrating its efficiency in evaluating orthopedic MCQs. This high reliability suggests its potential for integration into medical assessments. However, our results indicate that some MCQs will require revisions by instructors to improve their clarity. Further studies are needed to evaluate the role of artificial intelligence in other disciplines and to validate other LLMs.
Authors: Anusitviwat C, Suwannaphisit S, Bvonpanttarananon J, Tangtrakulwanich B
Journal: JMIR Form Res
Citation: Anusitviwat C, et al. Comparing ChatGPT and DeepSeek for Assessment of Multiple-Choice Questions in Orthopedic Medical Education: Cross-Sectional Study. JMIR Form Res. 2025;9:e75607. doi: 10.2196/75607