⚡ Quick Summary
This study evaluated how well large language models (LLMs) interpret arterial blood gas (ABG) results when used by medical students, and found strong concordance with expert interpretation for primary acid-base disorders. The authors conclude that LLMs may enhance the understanding of acid-base disorders among medical students while reducing calculation-related errors.
🔍 Key Details
- 📊 Dataset: 200 ABG datasets, 40 in each of five diagnostic categories
- 🧩 Participants: Three medical students, each assigned to one LLM
- ⚙️ Technologies: ChatGPT GPT-4o, Copilot GPT-4, Gemini 1.5-flash/2.5-flash
- 🏆 Reference Standard: Evaluations by two clinical pathologists
🔑 Key Takeaways
- 📊 Strong agreement (Cohen’s κ ≥ 0.88) for identifying primary acid-base disorders across all approaches.
- 💡 LLM-I (LLM interpretation) showed moderate agreement for identifying both primary and secondary disorders (ChatGPT κ = 0.65, Copilot κ = 0.61, Gemini κ = 0.62).
- 🏆 LLM-S (LLM interpretation with supervision) achieved strong agreement on the same task (ChatGPT κ = 0.91, Copilot κ = 0.81, Gemini κ = 0.81).
- 🤖 LLM-assisted interpretation may enhance learning and reduce calculation-related errors among medical students.
- 🌍 The study was a single-center retrospective analysis.
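The κ values above are Cohen's kappa, which corrects the raw agreement rate between two raters for the agreement expected by chance. As a minimal sketch of how such a value could be computed, the function below takes two lists of categorical labels. The function name, the label strings, and the six example cases are made up for illustration and are not data or code from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, multiply the two raters' marginal rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label for every case.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six cases (illustration only, not study data).
reference = ["met_acid", "met_acid", "resp_acid", "resp_alk", "normal", "met_alk"]
llm_based = ["met_acid", "resp_acid", "resp_acid", "resp_alk", "normal", "met_alk"]

kappa = cohens_kappa(reference, llm_based)
print(round(kappa, 3))
```

In this toy example the raters agree on five of six cases, but κ is lower than 5/6 because part of that agreement would be expected by chance alone.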
📚 Background
Interpreting acid-base disorders is a complex task that often challenges medical professionals, particularly in cases with mixed disorders. The advent of large language models (LLMs) presents an opportunity to assist in these cognitively demanding tasks, potentially improving educational outcomes and clinical decision-making.
🗒️ Study
In this single-center retrospective study, 200 ABG datasets were curated to include 40 cases in each of five diagnostic categories: metabolic acidosis, respiratory acidosis, metabolic alkalosis, respiratory alkalosis, and no acid-base disorder. Three medical students, each assigned to one LLM, interpreted the datasets using two methods: LLM interpretation (LLM-I) and LLM interpretation with supervision (LLM-S). Their results were compared with a reference standard established independently by two clinical pathologists.
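To make the five diagnostic categories concrete, here is a deliberately simplified sketch of a first-pass classification from pH, pCO2, and bicarbonate. It is not the study's method: the function name, the reference ranges (common textbook adult values), and the extra "undetermined" label are assumptions for illustration, and the sketch ignores compensation, anion gap, and mixed disorders entirely.

```python
# Assumed adult reference ranges (textbook values, not taken from the study).
PH_RANGE = (7.35, 7.45)
PCO2_RANGE = (35.0, 45.0)  # mmHg
HCO3_RANGE = (22.0, 26.0)  # mmol/L


def primary_disorder(ph, pco2, hco3):
    """Return a first-pass category, or 'undetermined' when the pattern is not simple."""
    if ph < PH_RANGE[0]:
        # Acidemia: look for a single abnormality that explains it.
        respiratory = pco2 > PCO2_RANGE[1]
        metabolic = hco3 < HCO3_RANGE[0]
        if respiratory and not metabolic:
            return "respiratory acidosis"
        if metabolic and not respiratory:
            return "metabolic acidosis"
        return "undetermined"

    if ph > PH_RANGE[1]:
        # Alkalemia: same idea with the directions reversed.
        respiratory = pco2 < PCO2_RANGE[0]
        metabolic = hco3 > HCO3_RANGE[1]
        if respiratory and not metabolic:
            return "respiratory alkalosis"
        if metabolic and not respiratory:
            return "metabolic alkalosis"
        return "undetermined"

    # Normal pH: only call it normal when pCO2 and HCO3 are also in range.
    pco2_ok = PCO2_RANGE[0] <= pco2 <= PCO2_RANGE[1]
    hco3_ok = HCO3_RANGE[0] <= hco3 <= HCO3_RANGE[1]
    if pco2_ok and hco3_ok:
        return "no acid-base disorder"
    return "undetermined"


print(primary_disorder(7.25, 30.0, 14.0))
```

The "undetermined" branch covers exactly the cases that a rule this simple cannot resolve, such as a normal pH with abnormal pCO2 and bicarbonate.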
📈 Results
Agreement with the reference standard for identifying the primary acid-base disorder was strong across all approaches (Cohen’s κ ≥ 0.88). For identifying both primary and secondary disorders, LLM-I reached only moderate agreement (κ = 0.61 to 0.65), whereas LLM-S reached strong agreement (κ = 0.81 to 0.91), suggesting that supervision improves the accuracy of LLM-assisted interpretation.
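According to the paper's abstract, agreement on primary and secondary disorders was scored regardless of the order in which the two disorders were listed. One way to sketch that kind of order-insensitive comparison is to reduce each interpretation to a canonical label before comparing. The helper names and example labels below are hypothetical and only illustrate the idea.

```python
def canonical_label(disorders):
    """Collapse a list of disorder names into one order-independent label."""
    cleaned = {name.strip().lower() for name in disorders}
    return " + ".join(sorted(cleaned))


def same_disorders(interpretation, reference):
    """True when both lists name the same set of disorders, ignoring order and case."""
    return canonical_label(interpretation) == canonical_label(reference)


# Hypothetical example: the same two disorders listed in a different order.
llm_answer = ["Respiratory alkalosis", "Metabolic acidosis"]
expert_answer = ["metabolic acidosis", "respiratory alkalosis"]

print(canonical_label(llm_answer))
print(same_disorders(llm_answer, expert_answer))
```

Canonical labels built this way can be passed to any categorical agreement statistic, so that a swapped primary and secondary disorder is not counted as a disagreement.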
🌍 Impact and Implications
The findings are most relevant to medical education. Used as a learning aid, LLMs may help medical students understand acid-base disorders and avoid calculation-related errors. Whether this translates into better diagnostic performance or patient outcomes in clinical settings was not assessed in this study.
🔮 Conclusion
This study highlights the promising role of LLMs in assisting with the interpretation of ABG results. As these technologies continue to evolve, they hold the potential to transform medical education and practice, making complex tasks more manageable for healthcare professionals. Further research is encouraged to explore the broader applications of LLMs in clinical settings.
Human-in-the-Loop Performance of LLM-Assisted Arterial Blood Gas Interpretation: A Single-Center Retrospective Study.
Abstract
Background and Objectives: Interpreting acid-base disorders is challenging, particularly in complex or mixed cases. Given the growing potential of large language models (LLMs) to assist in cognitively demanding tasks, this study evaluated their performance in interpreting arterial blood gas (ABG) results. Materials and Methods: In this single-center retrospective study, 200 ABG datasets were curated to include 40 cases in each of five diagnostic categories: metabolic acidosis, respiratory acidosis, metabolic alkalosis, respiratory alkalosis, and no acid-base disorder. Three medical students, each assigned to one LLM (ChatGPT GPT-4o, Copilot GPT-4, or Gemini 1.5-flash/2.5-flash), performed ABG interpretation using two evaluation methods: interpretation (LLM-I) and interpretation with supervision model (LLM-S). Two clinical pathologists independently performed the conventional evaluation to serve as the reference standard. Results: Agreement for identifying the primary acid-base disorder (APD) was strong across all approaches (Cohen’s κ ≥ 0.88). For identifying both primary and secondary disorders regardless of order (APSD), LLM-I showed moderate agreement (ChatGPT κ = 0.65, Copilot κ = 0.61, Gemini κ = 0.62), whereas LLM-S achieved strong agreement (ChatGPT κ = 0.91, Copilot κ = 0.81, Gemini κ = 0.81). Conclusions: LLM-assisted ABG interpretation demonstrates strong concordance with expert interpretation in detecting primary acid-base disorders. These tools may enhance the understanding of acid-base disorders while reducing calculation-related errors among medical students.
Authors: Ayala-De la Cruz S, Arenas-Hernández PE, Fernández-Herrera MF, Quiñones-Díaz RA, Llaca-Díaz JM, Díaz-Chuc EA, Robles-Espino DG, San Miguel-Garay EA
Journal: J Clin Med
Citation: Ayala-De la Cruz S, et al. Human-in-the-Loop Performance of LLM-Assisted Arterial Blood Gas Interpretation: A Single-Center Retrospective Study. J Clin Med. 2025; 14:(unknown pages). doi: 10.3390/jcm14186676