Quick Summary
This study evaluated the performance of ChatGPT models in detecting errors in nucleic acid test reports and providing treatment recommendations. The results indicated that these models achieved an impressive average error detection rate of 88.9% to 91.7% for various error types, showcasing their potential to enhance clinical decision-making and reduce workload.
Key Details
- Dataset: 86 nucleic acid test reports with 285 intentionally introduced errors
- Error categories: omission, time sequence, and dual-role errors
- Technology: ChatGPT models (4o, o1, o1 mini)
- Performance: error detection rates of 88.9% (omission), 91.6% (time sequence), and 91.7% (dual role)
Key Takeaways
- GPT models demonstrated strong capabilities in identifying errors in medical laboratory reports.
- Detection rates varied by error type, dropping to 51.9% for result input format errors.
- Reading time for GPT models was significantly shorter than for human experts (p<0.001).
- GPT-o1 mini outperformed the other models in consistency of error identification.
- Treatment recommendations were most accurate from the GPT-o1 model (p<0.0001).
- Kappa statistics indicated substantial to almost perfect agreement with senior medical laboratory scientists (0.778 to 0.837).
- These findings point to AI's potential to enhance healthcare efficiency and clinical decision-making.
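The agreement figures above are kappa statistics, presumably Cohen's kappa for two raters. As a minimal sketch of how such a value is computed, the function below implements the standard Cohen's kappa formula; the rater labels are hypothetical (1 = error flagged, 0 = missed), not the study's data.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical detections over ten report items: 1 = error flagged, 0 = missed.
gpt = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
smls = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
kappa = cohens_kappa(gpt, smls)
```

Values near 0.8, as reported for GPT models versus SMLS, are conventionally read as substantial to almost perfect agreement.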
Background
Accurate medical laboratory reports are crucial for delivering high-quality healthcare. With the rise of advanced artificial intelligence, particularly models like ChatGPT, there is a growing interest in leveraging these technologies to improve the accuracy and efficiency of medical reporting. This study aims to explore the capabilities of these models in detecting errors and providing reliable treatment recommendations.
Study
Conducted as a retrospective analysis, this study compiled 86 nucleic acid test reports for seven upper respiratory tract pathogens. Researchers introduced 285 errors across four common categories to assess the performance of three ChatGPT models: 4o, o1, and o1 mini. The models’ error detection capabilities were compared against those of three senior medical laboratory scientists and three medical laboratory interns.
Results
The results revealed that the ChatGPT models achieved impressive error detection rates, with averages of 88.9% for omissions, 91.6% for time sequence errors, and 91.7% for instances where the same individual acted as both inspector and reviewer. However, the models struggled with result input format errors, achieving only a 51.9% detection rate. Notably, the GPT-o1 mini model demonstrated superior consistency in error identification compared to its counterparts.
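A per-category detection rate is simply the share of introduced errors a model flagged, averaged across models. The sketch below illustrates that arithmetic with hypothetical counts (the study reports only the resulting percentages, not the underlying tallies).

```python
def detection_rate_pct(detected, total):
    """Percentage of introduced errors that a model flagged."""
    return 100.0 * detected / total

# Hypothetical per-model detection counts for one error category
# containing 24 introduced errors.
detections = {"gpt-4o": 21, "gpt-o1": 22, "gpt-o1-mini": 23}
total_errors = 24

rates = {m: detection_rate_pct(d, total_errors) for m, d in detections.items()}
# Category-level figure reported in such studies: the mean across models.
average_rate = sum(rates.values()) / len(rates)
```

The low 51.9% figure for result input format errors would arise from the same calculation with far fewer detections per model.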
Impact and Implications
The findings from this study highlight the potential of AI technologies like ChatGPT to significantly enhance the accuracy of medical laboratory reporting. By improving error detection and providing reliable treatment recommendations, these models could reduce the workload for healthcare professionals and facilitate better clinical decision-making. The implications for healthcare efficiency are profound, suggesting a future where AI plays a pivotal role in laboratory diagnostics.
Conclusion
This study underscores the competence of ChatGPT models in detecting errors in medical laboratory reports and providing accurate treatment recommendations. As AI continues to evolve, its integration into healthcare could lead to improved patient outcomes and more efficient workflows. Continued research in this area is essential to fully realize the benefits of AI in clinical settings.
Your comments
What are your thoughts on the integration of AI in medical laboratory reporting? We would love to hear your insights! Join the conversation in the comments below or connect with us on social media.
Evaluation of error detection and treatment recommendations in nucleic acid test reports using ChatGPT models.
Abstract
OBJECTIVES: Accurate medical laboratory reports are essential for delivering high-quality healthcare. Recently, advanced artificial intelligence models, such as those in the ChatGPT series, have shown considerable promise in this domain. This study assessed the performance of specific GPT models, namely 4o, o1, and o1 mini, in identifying errors within medical laboratory reports and in providing treatment recommendations.
METHODS: In this retrospective study, 86 nucleic acid test reports for seven upper respiratory tract pathogens were compiled. A total of 285 errors from four common error categories were intentionally and randomly introduced, generating 86 incorrect reports. GPT models were tasked with detecting these errors, with three senior medical laboratory scientists (SMLS) and three medical laboratory interns (MLI) serving as control groups. Additionally, GPT models were tasked with generating accurate and reliable treatment recommendations for positive test outcomes based on the 86 corrected reports. χ2 tests, Kruskal-Wallis tests, and Wilcoxon tests were used for statistical analysis where appropriate.
RESULTS: Compared with SMLS or MLI, GPT models accurately detected three error types; the average detection rates of the three GPT models were 88.9% (omission), 91.6% (time sequence), and 91.7% (the same individual acting as both inspector and reviewer). However, the average detection rate for result input format errors was only 51.9%, indicating relatively poor performance in this category. GPT models exhibited substantial to almost perfect agreement with SMLS in detecting total errors (kappa [min, max]: 0.778, 0.837), whereas agreement between GPT models and MLI was moderately lower (kappa [min, max]: 0.632, 0.696). In reading all 86 reports, GPT models required markedly less time than SMLS or MLI (all p<0.001). Notably, the GPT-o1 mini model showed better consistency of error identification than the GPT-o1 model, which in turn was better than the GPT-4o model. Pairwise comparisons of the same GPT model's outputs across three repeated runs showed almost perfect agreement (kappa [min, max]: 0.912, 0.996). GPT-o1 mini also required markedly less reading time than GPT-4o or GPT-o1 (all p<0.001). Additionally, GPT-o1 significantly outperformed GPT-4o and o1 mini in providing accurate and reliable treatment recommendations (all p<0.0001).
CONCLUSIONS: GPT models were competent at detecting several categories of medical laboratory report errors and at providing accurate and reliable treatment recommendations, potentially reducing work hours and enhancing clinical decision-making.
Authors: Han W, Wan C, Shan R, Xu X, Chen G, Zhou W, Yang Y, Feng G, Li X, Yang J, Jin K, Chen Q
Journal: Clin Chem Lab Med
Citation: Han W, et al. Evaluation of error detection and treatment recommendations in nucleic acid test reports using ChatGPT models. Clin Chem Lab Med. 2025; (unknown volume):(unknown pages). doi: 10.1515/cclm-2025-0089