⚡ Quick Summary
This study evaluated the performance of large language models (LLMs) in the context of aerospace medicine, specifically testing ChatGPT-4, Google Gemini Advanced, and a custom Retrieval-Augmented Generation (RAG) LLM. The results indicated that while these models showed promise, they also exhibited significant gaps in factual knowledge and clinical reasoning.
📋 Key Details
- 📊 Dataset: 857 free-response questions from Aerospace Medicine Boards Questions and Answers and 20 multiple-choice board questions provided by the American College of Preventive Medicine.
- 🤖 Models Tested: ChatGPT-4, Google Gemini Advanced (1.0 Ultra), and a custom RAG LLM.
- 📈 Performance Metrics: On the multiple-choice questions, ChatGPT-4 answered 70% correctly, Gemini Advanced 55%, and the RAG LLM 85%.
- 📝 Reader Scores: ChatGPT-4 earned mean reader scores of 4.23 to 5.00 across chapters on a 1-5 Likert scale.
🔑 Key Takeaways
- 🚀 LLMs have significant potential as clinical decision-support tools in aerospace medicine.
- ⚠️ Gaps in Knowledge: All models exhibited factual inaccuracies that could be harmful in a clinical setting.
- 🧠 Clinical Reasoning: The models may not meet the standards required for the aerospace medicine board exam.
- 🔄 Performance Variability: ChatGPT-4 and Gemini Advanced showed inconsistencies in answering self-generated questions.
- 🔧 Future Development: Continued advancements in model training and data quality are essential for improving LLM performance.
- 🔍 Research Implications: The findings highlight the need for careful evaluation before deploying LLMs in critical medical environments.
📚 Background
The integration of large language models (LLMs) into clinical settings, particularly in space medicine, presents a unique opportunity to enhance decision-making processes. However, the potential for generating incorrect information raises concerns about patient safety and care quality. Understanding the capabilities and limitations of these models is crucial for their effective application in autonomous medical operations.
🗂️ Study
This study aimed to evaluate the factual knowledge and clinical reasoning abilities of LLMs in the context of aerospace medicine. Researchers tested ChatGPT-4, Google Gemini Advanced, and a custom RAG LLM using a comprehensive set of questions derived from the Aerospace Medicine Boards. The goal was to assess their performance and identify any critical gaps in knowledge that could impact clinical decision-making.
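For readers unfamiliar with the retrieval-augmented approach, the sketch below shows a minimal RAG loop in Python: retrieve the reference passages most relevant to a question, then prepend them to the prompt the model answers from. The toy corpus, the `retrieve` and `build_prompt` helpers, and the bag-of-words ranking are illustrative assumptions for this post; the study does not describe its retriever or prompts at this level of detail.

```python
# Minimal RAG sketch (illustrative only): retrieve relevant passages, then
# build a grounded prompt for an LLM. The corpus and ranking method are
# placeholders, not the study's actual retriever or reference material.
from collections import Counter

CORPUS = [
    "Decompression sickness risk rises with altitude exposure above 18,000 feet.",
    "Spatial disorientation is a leading contributor to aviation mishaps.",
    "Hypoxia symptoms include impaired judgment, euphoria, and cyanosis.",
]

def tokenize(text: str) -> Counter:
    """Bag-of-words token counts, used for a crude relevance score."""
    return Counter(text.lower().split())

def retrieve(question: str, corpus: list, k: int = 2) -> list:
    """Rank passages by token overlap with the question and return the top k."""
    q_tokens = tokenize(question)
    ranked = sorted(
        corpus,
        key=lambda passage: sum((tokenize(passage) & q_tokens).values()),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str, passages: list) -> str:
    """Prepend retrieved context so the model grounds its answer in source text."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Use only the context below to answer the question.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    question = "What are the early symptoms of hypoxia at altitude?"
    prompt = build_prompt(question, retrieve(question, CORPUS))
    print(prompt)  # in a full pipeline this prompt would be sent to the LLM
```

The grounding step is what distinguishes the custom RAG LLM from the off-the-shelf models: its answers are conditioned on retrieved source text rather than on parametric memory alone.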
📊 Results
The results revealed that ChatGPT-4 achieved mean reader scores ranging from 4.23 to 5.00 across chapters, while Google Gemini Advanced and the RAG LLM scored from 3.30 to 4.91 and from 4.69 to 5.00, respectively. On the multiple-choice questions, ChatGPT-4 answered correctly 70% of the time, Gemini Advanced 55%, and the RAG LLM 85%. Despite these promising scores, the models still demonstrated significant gaps in factual knowledge and clinical reasoning.
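As a point of reference, these figures reduce to two simple aggregates: per-chapter means of the 1-5 reader scores for the free-response set, and the fraction of multiple-choice questions answered correctly. The sketch below computes both from hypothetical grading data; the score lists and correctness flags are placeholders, not the study's raw data.

```python
# Illustrative metric computation with placeholder grading data.
from statistics import mean

# Likert reader scores (1-5) per chapter for the free-response questions.
reader_scores_by_chapter = {
    "chapter_01": [5, 4, 5, 4, 5],
    "chapter_02": [4, 5, 5, 5, 4],
}
# Correctness flags for the multiple-choice set (the study graded 20 items).
mcq_correct = [True, True, False, True, True]

chapter_means = {ch: mean(scores) for ch, scores in reader_scores_by_chapter.items()}
mcq_accuracy = sum(mcq_correct) / len(mcq_correct)

print(chapter_means)                        # per-chapter mean reader scores
print(f"MCQ accuracy: {mcq_accuracy:.0%}")  # e.g. 70% (ChatGPT-4) or 85% (RAG LLM) in the study
```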
🌍 Impact and Implications
The findings underscore the promise of LLMs to support autonomous medical operations in spaceflight. However, the identified gaps in knowledge and reasoning highlight the importance of rigorous evaluation and ongoing development. As model training and data quality continue to advance, the integration of LLMs into clinical practice could improve decision-making and patient care in challenging environments.
🔮 Conclusion
This study illustrates the promise and challenges associated with using large language models in aerospace medicine. While the potential for these technologies to enhance clinical decision-making is significant, it is essential to address the existing gaps in knowledge and reasoning. Continued research and development will be crucial in harnessing the full capabilities of LLMs for safe and effective medical operations in space.
💬 Your comments
What are your thoughts on the use of large language models in aerospace medicine? Do you see potential for improvement in their application? 💬 Share your insights in the comments below or connect with us on social media.
Evaluating Large Language Models on Aerospace Medicine Principles.
Abstract
Introduction: Large language models (LLMs) hold immense potential to serve as clinical decision-support tools for Earth-independent medical operations. However, the generation of incorrect information may be misleading or even harmful when applied to care in this setting.
Method: To better understand this risk, this work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced (1.0 Ultra), as well as a custom Retrieval-Augmented Generation (RAG) LLM on factual knowledge and clinical reasoning in accordance with published material in aerospace medicine. We also evaluated the consistency of the two public LLMs when answering self-generated board-style questions.
Results: When queried with 857 free-response questions from Aerospace Medicine Boards Questions and Answers, ChatGPT-4 had a mean reader score from 4.23 to 5.00 (Likert scale of 1-5) across chapters, whereas Gemini Advanced and the RAG LLM scored 3.30 to 4.91 and 4.69 to 5.00, respectively. When queried with 20 multiple-choice aerospace medicine board questions provided by the American College of Preventive Medicine, ChatGPT-4 and Gemini Advanced responded correctly 70% and 55% of the time, respectively, while the RAG LLM answered 85% correctly. Despite this quantitative measure of high performance, the LLMs tested still exhibited gaps in factual knowledge that potentially could be harmful, a degree of clinical reasoning that may not pass the aerospace medicine board exam, and some inconsistency when answering self-generated questions.
Conclusion: There is considerable promise for LLM use in autonomous medical operations in spaceflight given the anticipated continued rapid pace of development, including advancements in model training, data quality, and fine-tuning methods.
Authors: Anderson KD, Davis CA, Pickett SM, Pohlen MS
Journal: Wilderness Environ Med
Citation: Anderson KD, et al. Evaluating Large Language Models on Aerospace Medicine Principles. Wilderness Environ Med. 2025:10806032251330628. doi: 10.1177/10806032251330628