⚡ Quick Summary
This study evaluated the performance of two generative AI tools in responding to medical questions related to spine treatment. While both tools achieved an overall alignment score of 3.5±1.1 against the established guidelines, 24.0% of the references they cited were fabricated. This highlights the need for caution when utilizing AI-generated content in clinical settings.
🔍 Key Details
- 📊 Study Design: Comparative study using 33 North American Spine Society (NASS) guideline questions.
- 🧩 Tools Used: Two freely available generative AI tools, Claude and Gemini (referred to in the paper as Tools I and II).
- ⚙️ Evaluation Method: Responses scored on a five-point alignment scale against NASS guidelines.
- 🏆 Performance Score: Overall alignment score of 3.5±1.1.
🔑 Key Takeaways
- 🤖 Generative AI tools can provide responses that align with clinical guidelines.
- 📚 Authenticity of references is a significant concern, with 24.0% being fabricated.
- 📝 Claude outperformed Gemini in citing authentic peer-reviewed papers (91.0% vs. 50.7%).
- 📅 Reference Age: Claude's references were older on average (mean publication year 2008±6) than Gemini's (2014±6).
- 📊 Clinical Relevance: Only 24.3% of Claude's references were also cited in the NASS guidelines, versus 2.8% for Gemini.
- ⚠️ Caution Advised: The presence of inauthentic references limits the clinical applicability of AI-generated responses.
- 🌐 Study Context: Conducted on spine treatment questions drawn from published NASS guidelines, a critical area of healthcare.
📚 Background
The rise of generative artificial intelligence (AI) tools, such as ChatGPT, has sparked interest in their potential applications in medicine. These tools can quickly generate responses to complex medical queries, but their reliability and the authenticity of their references remain under scrutiny. Understanding the limitations and capabilities of these AI systems is crucial for their safe integration into clinical practice.
🗒️ Study
This study aimed to assess the scientific basis of generative AI tools by comparing their responses to established guidelines from the North American Spine Society (NASS). A total of 33 guideline questions were posed to two AI tools, and their responses were evaluated for correctness and the authenticity of cited references.
📈 Results
Both AI tools achieved an overall alignment score of 3.5±1.1, indicating acceptable agreement with the NASS guidelines. However, of the 254 references generated, only 76.0% (n = 193) were authentic, while 24.0% (n = 61) were fabricated. Notably, Claude cited a significantly higher proportion of authentic peer-reviewed papers than Gemini (91.0% vs. 50.7%; p < 0.001), highlighting disparities in the reliability of the two tools.
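For readers who want to sanity-check the headline statistics, the sketch below shows how the two main group comparisons could be reproduced from the summary numbers alone. This is not the authors' analysis code: the abstract does not name the tests used, so the chi-square test and Welch t-test here are assumptions, and the per-tool denominators (122 and 71 authentic references) are back-calculated from the reported percentages.

```python
# A minimal sketch (assumed tests, not the authors' code) reproducing the
# reported group differences from the published summary numbers.
from scipy.stats import chi2_contingency, ttest_ind_from_stats

# Peer-reviewed vs. other authentic sources (rows: Claude, Gemini).
# Denominators inferred from the reported percentages:
# Claude 111/122 = 91.0%, Gemini 36/71 = 50.7%.
table = [[111, 122 - 111],
         [36, 71 - 36]]
chi2, p_prop, _, _ = chi2_contingency(table)
print(f"proportion test: chi2 = {chi2:.1f}, p = {p_prop:.2g}")

# Mean publication year, computed from the reported mean ± SD
# (2008±6 vs. 2014±6), assuming a Welch t-test.
t, p_year = ttest_ind_from_stats(mean1=2008, std1=6, nobs1=122,
                                 mean2=2014, std2=6, nobs2=71,
                                 equal_var=False)
print(f"publication-year test: t = {t:.2f}, p = {p_year:.2g}")
# Both p-values come out well below 0.001, consistent with the paper.
```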
🌍 Impact and Implications
The findings of this study underscore the dual nature of generative AI in medicine. While these tools can assist in generating guideline-aligned responses, the presence of fabricated references poses a significant risk in clinical settings. Healthcare professionals must exercise caution and critically evaluate AI-generated content before application in practice, ensuring that patient care is not compromised by misinformation.
🔮 Conclusion
This study illustrates the potential of generative AI tools to support medical decision-making while also highlighting their limitations. The 24.0% rate of inauthentic references serves as a reminder that AI should not replace expert judgment in clinical practice. Ongoing research and development are essential to enhance the reliability of these technologies, paving the way for their safe integration into healthcare.
💬 Your comments
What are your thoughts on the use of generative AI in clinical settings? Do you believe the benefits outweigh the risks? 💬 Share your insights in the comments below or connect with us on social media.
The Double-Edged Sword of Generative AI: Surpassing an Expert or a Deceptive “False Friend”?
Abstract
BACKGROUND CONTEXT: Generative artificial intelligence (AI), ChatGPT being the most popular example, has been extensively assessed for its capability to respond to medical questions, such as queries about spine treatment approaches or technological advances. However, its output often lacks scientific foundation or cites fabricated, inauthentic references, a phenomenon known as AI hallucination.
PURPOSE: To assess the scientific basis of generative AI tools by evaluating the authenticity of their cited references and the alignment of their responses with evidence-based guidelines.
STUDY DESIGN: Comparative study.
METHODS: Thirty-three previously published North American Spine Society (NASS) guideline questions were posed as prompts to two freely available generative AI tools (Tools I and II). The responses were scored for correctness compared with the published NASS guideline responses using a five-point “alignment score.” Furthermore, all cited references were evaluated for authenticity, source type, year of publication, and inclusion in the scientific guidelines.
RESULTS: Both tools’ responses to guideline questions achieved an overall score of 3.5±1.1, which is considered acceptable alignment with the guideline. The tools generated 254 references to support their responses, of which 76.0% (n = 193) were authentic and 24.0% (n = 61) were fabricated. The authentic references comprised peer-reviewed scientific research papers (147, 76.2%), guidelines (16, 8.3%), educational websites (9, 4.7%), books (9, 4.7%), a government website (1, 0.5%), insurance websites (6, 3.1%), and newspaper websites (5, 2.6%). Claude referenced significantly more authentic peer-reviewed scientific papers (Claude: n = 111, 91.0%; Gemini: n = 36, 50.7%; p < 0.001). The year of publication among all references ranged from 1988 to 2023, with significantly older references provided by Claude (Claude: 2008±6; Gemini: 2014±6; p < 0.001). Lastly, significantly more of Claude’s references were also cited in the published NASS guidelines (Claude: n = 27, 24.3%; Gemini: n = 1, 2.8%; p = 0.04).
CONCLUSIONS: Both generative AI tools provided responses that had acceptable alignment with NASS evidence-based guideline recommendations and offered references, though nearly a quarter of the references were inauthentic or non-scientific sources. This deficiency of legitimate scientific references does not meet standards for clinical implementation. Considering this limitation, caution should be exercised when applying the output of generative AI tools to clinical applications.
Authors: Altorfer FCS, Kelly MJ, Avrumova F, Rohatgi V, Zhu J, Bono CM, Lebl DR
Journal: Spine J
Citation: Altorfer FCS, et al. The Double-Edged Sword of Generative AI: Surpassing an Expert or a Deceptive “False Friend”? Spine J. 2025. doi: 10.1016/j.spinee.2025.02.010