🧑🏼‍💻 Research - June 11, 2026

New benchmark exposes AI medication safety flaws

🌟 Stay Updated!
Join AI Health Hub to receive the latest insights in health and AI.

Hospital IT departments eyeing generative AI for medication safety may want to pause, as a new benchmark reveals that some top-tier models behave like hyper-anxious assistants that flag almost everything as a hazard.

What happens when you ask a cutting-edge AI to police hospital prescriptions? It turns out many of them choose the easiest path to avoid blame. They cry wolf.

If a system flags every single drug order as a potential killer, it technically achieves perfect safety. But in a real hospital, this triggers massive alert fatigue. Doctors quickly learn to ignore the warnings, which defeats the entire purpose of clinical decision support. This is the core tension exposed by a new evaluation framework.

To expose this flaw, researchers built PsiBench, a medication-safety benchmark containing 492 clinical scenarios across 11 safety categories. The test material is based on auditing standards used by more than 2,000 U.S. hospitals. The team ran 40 frontier models from 10 different providers through 59,040 total evaluations to see how they handled high-stakes medical decisions.

The benchmark split these scenarios into three distinct tiers. The discrimination tier tested 98 scenarios, balancing 50 fatal errors against 48 deceptive cases. The operational tier evaluated 394 scenarios, containing 261 serious unsafe situations and 133 safe ones, including 41 cases designed to trigger excessive alerts. Finally, the attribution tier focused on 311 scenarios where an alert was strictly required.

The illusion of safety

The raw numbers look impressive at first glance, but the details reveal a troubling pattern. While top-performing models like OpenAI’s o4-mini and o3 achieved F1 scores of 92.3% and 92.2% respectively, other models took a lazy shortcut.

Specifically, three models hit a perfect 100% sensitivity rate but paired it with a specificity score below 25%.

This is a critical failure. A model with a specificity as low as 6.1% is practically indistinguishable from a naive program that simply alerts on every single patient file. It is not practicing medicine. It is covering its own tracks.

Key performance metrics

  • Overall accuracy across the 40 models ranged widely from 65.4% to 89.8%.
  • F1 performance spanned from a mediocre 78.5% to a strong 92.3%.
  • Specificity dropped as low as 6.1%, though the best model reached 81.8%.
  • Sensitivity remained high across the board, starting at 81.4% and topping out at 100%.

Rethinking clinical AI

Hospital administrators must look past headline accuracy numbers. A model that never misses an error but constantly interrupts doctors with false alarms is a liability, not an asset. If we deploy these systems today, we risk worsening the exact clinician burnout we are trying to solve.

This benchmark proves we need to evaluate AI on operational viability, not just raw clinical knowledge. Until AI developers prioritize specificity, these models remain too noisy for the wards. We must demand models that know when to stay silent.

Read the full preprint on medRxiv.

Share on facebook
Facebook
Share on twitter
Twitter
Share on linkedin
LinkedIn
Share on whatsapp
WhatsApp

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.