🧑🏼‍💻 Research - June 8, 2026

AI models mistake Pokemon for real prescription drugs

🌟 Stay Updated!
Join AI Health Hub to receive the latest insights in health and AI.

A new study reveals that major AI models routinely prescribe dosing instructions for fictional Pokemon characters, exposing a critical safety flaw in clinical automation.

If a clinician asks an AI to review a patient’s medication list, they expect the software to flag an error. They do not expect the AI to write a detailed dosing schedule for a Pokemon.

Yet that is exactly what happens when today’s most advanced large language models encounter fabricated drug names. The assumption that AI can safely parse medical charts is hit by a harsh reality check. This is not just a funny quirk of machine learning. It is a major liability for any health system rushing to automate clinical workflows.

The industry has assumed that setting AI temperature to zero—making it deterministic—prevents wild hallucinations. This study proves that assumption wrong. Deterministic settings did not stop the models from hallucinating. Instead, the real solution lies in specialized architecture or aggressive mitigation prompting. If general-purpose models are deployed without these specific guardrails, they will confidently invent medical facts. This shifts the burden of safety from the model’s inherent reasoning to how strictly we prompt it.

Testing the models

Researchers built two datasets of 250 medication lists. Each list contained four to six real medications and one fake drug, which was actually a Pokemon character. They tested six models: GPT-5-Chat, GPT-4o-mini, DrugGPT, Gemma-3-27B-IT, Llama-3.3-70B-Instruct, and Qwen3-32B. The researchers asked the models for dosing information and disease indications under three different prompting conditions.

The results showed a massive gap between general models and specialized software.

  • Standard AI models failed spectacularly, showing baseline confabulation rates between 47.2% and 99.6%.
  • DrugGPT, a specialized model, performed much better with a baseline error rate of only 2.7% to 6.4%.
  • Adding mitigation prompting dropped GPT-5-Chat’s error rate to 0%, down from a baseline of 66.4% to 88.5%.
  • Changing the temperature to zero (deterministic decoding) did not substantially reduce the errors.

The safety bottleneck

This finding matters because healthcare startups are rapidly plugging general-purpose LLMs into electronic health records. If an AI cannot distinguish a pocket monster from a real beta-blocker, it cannot be trusted to reconcile medication lists. The study shows we cannot rely on raw model power alone. GPT-5-Chat still failed up to 88.5% of the time without specific mitigation instructions.

We must acknowledge the study’s limits. It used a highly specific adversarial test. In real clinics, doctors rarely type Pokemon names into charts. However, typos and obscure compound names are common. If a model treats every unfamiliar string of text as a real drug, the potential for patient harm is severe. Developers must implement hard-coded verification databases rather than relying on an LLM’s internal knowledge.

Read the full analysis in medRxiv.

Share on facebook
Facebook
Share on twitter
Twitter
Share on linkedin
LinkedIn
Share on whatsapp
WhatsApp

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.