Quick Summary
This study evaluated the effectiveness of large language models (LLMs) like OpenAI’s ChatGPT-5 Pro and Google’s Gemini 2.5 Pro in assisting with decision-making in minimally invasive spine surgery (MISS). The findings indicate that while these models can differentiate between surgical and non-surgical triage, expert oversight remains essential until further advancements are made.
Key Details
- Study Design: Vignette-based cross-sectional study
- Models Tested: OpenAI’s ChatGPT-5 Pro and Google’s Gemini 2.5 Pro
- Methodology: 90 clinical vignettes derived from published case reports
- Metrics Used: Jensen-Shannon divergence, Cohen’s κ, Stuart-Maxwell tests, and McNemar’s test
Key Takeaways
- LLMs show promise in assisting with surgical decision-making.
- Case-level agreement was slight for ChatGPT-5 Pro (κ = 0.146) and fair for Gemini 2.5 Pro (κ = 0.245).
- Collapsing categories to surgical vs non-surgical substantially improved agreement (κ = 0.415 and 0.587).
- Jensen-Shannon divergence from the reference was small (0.115 for ChatGPT-5 Pro and 0.112 for Gemini 2.5 Pro).
- Expert oversight remains crucial for procedure selection until LLMs mature.
- The study establishes a baseline for integrating LLMs into surgical triage workflows.
- Published in: Global Spine Journal, 2025.

Background
The integration of artificial intelligence (AI) in healthcare is rapidly evolving, particularly in the realm of surgical decision-making. However, the application of generative AI in minimally invasive spine surgery (MISS) remains underexplored. This study aims to bridge that gap by assessing the capabilities of LLMs in categorizing surgical cases and aiding in triage decisions.
Study
The researchers constructed 90 clinical vignettes from published case reports. Each LLM was prompted to assign one or more of ten predefined management categories to every vignette, together with a two-sentence rationale for each assignment (a sketch of such a prompt appears below). The models’ outputs were then compared with expert management categories at both the procedural level and the binary surgical vs non-surgical triage level.
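To make the methodology concrete, here is a minimal sketch of how a vignette might be submitted to one of the models via the OpenAI Python SDK. The category names, prompt wording, and model identifier are illustrative assumptions for this sketch, not the authors’ actual protocol or category list.

```python
# Illustrative sketch only: the category list, prompt wording, and model
# identifier ("gpt-5-pro") are assumptions, not the study's protocol.
from openai import OpenAI

CATEGORIES = [
    "conservative management", "epidural steroid injection",
    "microdiscectomy", "laminectomy/decompression", "MIS fusion",
    # ... the study used ten predefined categories; the full list is not given here
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_vignette(vignette_text: str) -> str:
    """Ask the model for one or more categories plus two-sentence rationales."""
    prompt = (
        "You are assisting with minimally invasive spine surgery triage.\n"
        f"Clinical vignette:\n{vignette_text}\n\n"
        "Assign one or more of the following management categories and give a "
        "two-sentence rationale for each: " + "; ".join(CATEGORIES)
    )
    response = client.chat.completions.create(
        model="gpt-5-pro",  # assumed identifier; swap in the model under test
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The same prompt structure can be reused across both models so that their category assignments remain directly comparable.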
Results
Both LLMs showed only a small divergence from the reference categories, with Jensen-Shannon divergence values of 0.115 for ChatGPT-5 Pro and 0.112 for Gemini 2.5 Pro (and 0.073 between the two models). Although the category distributions differed significantly from the reference (Stuart-Maxwell χ²(9) = 24.8 and 26.0; P = 0.007 and 0.006), case-level agreement was only slight for ChatGPT-5 Pro (κ = 0.146) and fair for Gemini 2.5 Pro (κ = 0.245). Collapsing the categories to surgical vs non-surgical improved agreement substantially (κ = 0.415 and 0.587 vs the reference; 0.692 between models), with no bias in triage rates (McNemar P ≥ 0.401). A sketch of how these agreement metrics can be computed follows below.
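For readers who want to run this style of analysis on their own label sets, the following is a minimal sketch of the agreement metrics named above (Jensen-Shannon divergence, Cohen’s κ, and McNemar’s test on the collapsed triage). It assumes model and reference labels are available as parallel Python lists with a single primary category per vignette; it is not the authors’ analysis code.

```python
# Minimal sketch of the agreement metrics, assuming one primary category per
# vignette. Illustrative only; not the study's actual analysis pipeline.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar


def category_distribution(labels, categories):
    """Relative frequency of each category across the vignettes."""
    counts = np.array([labels.count(c) for c in categories], dtype=float)
    return counts / counts.sum()


def agreement_summary(model_labels, reference_labels, categories):
    p = category_distribution(model_labels, categories)
    q = category_distribution(reference_labels, categories)
    # scipy returns the Jensen-Shannon *distance*; squaring gives the divergence.
    jsd = jensenshannon(p, q, base=2) ** 2
    kappa = cohen_kappa_score(model_labels, reference_labels)
    return jsd, kappa


def binary_triage_mcnemar(model_labels, reference_labels, surgical_categories):
    """McNemar's test on the collapsed surgical vs non-surgical triage calls."""
    m = [label in surgical_categories for label in model_labels]
    r = [label in surgical_categories for label in reference_labels]
    table = np.zeros((2, 2), dtype=int)  # paired (reference, model) 2x2 table
    for mi, ri in zip(m, r):
        table[int(ri), int(mi)] += 1
    return mcnemar(table, exact=True).pvalue
```

The Stuart-Maxwell test of marginal homogeneity on the full 10-category contingency table can be obtained in a similar way with statsmodels’ SquareTable.homogeneity.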
Impact and Implications
The findings highlight the potential for LLMs to assist in surgical triage and decision-making. They also underscore the importance of maintaining expert oversight in procedure selection: as LLMs continue to evolve, their integration into surgical workflows could enhance precision in spine care, but caution is warranted until these systems mature.
Conclusion
This study marks a significant step in evaluating the role of large language models in minimally invasive spine surgery. While the results are promising, they also emphasize the necessity for expert involvement in surgical decision-making. As we look to the future, further research and development in this area could lead to transformative changes in how surgical care is delivered.
Your comments
What are your thoughts on the integration of AI in surgical decision-making? We would love to hear your insights! Leave your comments below or connect with us on social media.
Evaluating Large Language Models for Decision Support in Minimally Invasive Spine Surgery Triage and Procedural Categories.
Abstract
Study Design: Vignette-based cross-sectional study.
Objective: Generative artificial intelligence (AI) programs such as large language models (LLMs) are reshaping treatment decision-making, yet applications in minimally invasive spine surgery (MISS) are still scarce. This study examines whether OpenAI’s ChatGPT-5 Pro and Google’s Gemini 2.5 Pro reproduce expert management categories from published MISS cases and measures agreement at procedural and binary triage levels.
Methods: We constructed 90 clinical vignettes from published case reports and prompted each LLM to assign 1 or more of ten predefined categories with two-sentence rationales. Agreement with reference was assessed using Jensen-Shannon divergence (JSD), Stuart-Maxwell tests, Cohen’s κ, and McNemar’s test for surgical vs non-surgical triage.
Results: Divergence from reference was small, with Jensen-Shannon divergence 0.115 (ChatGPT-5 Pro) and 0.112 (Gemini 2.5 Pro), and smaller between models at 0.073. Paired multinomial tests found differences from the reference (Stuart-Maxwell χ²(9) = 24.8 and 26.0; P = 0.007 and 0.006) but not between models (14.4; P = 0.108). Case-level agreement was slight for ChatGPT-5 Pro and fair for Gemini 2.5 Pro (κ = 0.146 and 0.245). Collapsing categories to surgical vs non-surgical improved agreement (κ = 0.415 and 0.587 vs reference; 0.692 between models) with no bias in rates (P ≥ 0.401).
Conclusions: LLMs may differentiate between surgical and non-surgical triage, but procedure selection should remain expert-led until systems mature. These findings establish a baseline for integrating LLMs into surgical triage workflows and highlight promise and limitations of generative AI in precision spine care.
Authors: Kartal A, Manalil NF, Cheng CD, Chung LK, Gebhard H, Greenberg M, Härtl R, Elsayed GA
Journal: Global Spine J
Citation: Kartal A, et al. Evaluating Large Language Models for Decision Support in Minimally Invasive Spine Surgery Triage and Procedural Categories. Global Spine J. 2025;(unknown volume):21925682251411225. doi: 10.1177/21925682251411225