ChatGPT, Grok, Gemini Fail Medical Misinformation Audit
Key insights
- 30% of responses were rated "somewhat problematic" and 19.6% "highly problematic" across 250 medical prompts in a BMJ Open audit.
- Only 2 of the 250 prompts were refused across the five chatbots; the rest were answered with high confidence despite widespread inaccuracy.
- Hallucinated citations prevented all five systems from producing a fully accurate medical reference list.
Why this matters
AI chatbots are being adopted in consumer health settings at scale, and this audit shows that all five systems tested, including those from OpenAI, Google, and xAI, answered medical queries with high confidence even when the answers were inaccurate. A 19.6% highly problematic response rate across 250 structured prompts, measured under a peer-reviewed adversarial framework, points to a systemic failure rather than an edge case. Publication in BMJ Open puts these findings in front of clinicians and policymakers, which may increase the likelihood of regulatory attention on AI medical reliability in the near term.
Summary
A BMJ Open audit of five AI chatbots found nearly half of responses to medical queries problematic: 30% "somewhat problematic" and 19.6% "highly problematic."
Researchers tested Gemini 2.0, DeepSeek V3, Meta AI Llama 3.3, ChatGPT 3.5, and Grok 2 on 50 prompts each (250 in total) across cancer, vaccines, stem cells, nutrition, and athletic performance. Only 2 of the 250 prompts were refused; otherwise, all five chatbots responded with high confidence despite widespread inaccuracy.
In short, all five chatbots (Gemini, DeepSeek, Meta AI, ChatGPT, and Grok) failed the audit, and hallucinated citations prevented any of them from producing a fully accurate reference list.
- Nutrition queries produced the weakest results; vaccine and cancer questions fared best.
- All five chatbots wrote at a "difficult" reading level, equivalent to college-level text, limiting accessibility for general audiences.
- Researchers warn that continued deployment without public education and oversight risks amplifying misinformation at scale.
The study appears in BMJ Open, not a niche AI outlet, making these findings harder for AI companies to dismiss.
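As a quick sanity check on the headline figures, the percentages quoted above can be converted into approximate response counts. This is a minimal sketch, assuming both percentages are computed over all 250 responses (5 chatbots x 50 prompts); the resulting counts are derived here for illustration and are not figures reported in the source. The variable and function names are hypothetical.

```python
# Figures quoted in the summary; the derived counts are illustrative only.
CHATBOTS = 5
PROMPTS_PER_CHATBOT = 50
SOMEWHAT_PROBLEMATIC_PCT = 30.0
HIGHLY_PROBLEMATIC_PCT = 19.6
REFUSED_PROMPTS = 2

total_prompts = CHATBOTS * PROMPTS_PER_CHATBOT


def pct_to_count(pct: float, total: int) -> int:
    """Convert a percentage of `total` into the nearest whole count."""
    return round(pct / 100 * total)


# Assumption: both percentages use all 250 responses as the denominator.
somewhat_count = pct_to_count(SOMEWHAT_PROBLEMATIC_PCT, total_prompts)
highly_count = pct_to_count(HIGHLY_PROBLEMATIC_PCT, total_prompts)
problematic_count = somewhat_count + highly_count

problematic_pct = round(100 * problematic_count / total_prompts, 1)
refusal_pct = round(100 * REFUSED_PROMPTS / total_prompts, 1)

print(f"Total prompts: {total_prompts}")
print(f"Somewhat problematic: about {somewhat_count} responses")
print(f"Highly problematic: about {highly_count} responses")
print(f"Problematic overall: about {problematic_count} ({problematic_pct}%)")
print(f"Refusal rate: {refusal_pct}%")
```

Under that assumption, the two problematic categories together cover just under half of all responses, which matches the "nearly half" framing, while refusals account for less than 1% of prompts.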
Potential risks and opportunities
Risks
- Google, OpenAI, and xAI face increased regulatory and hospital-procurement pressure to add explicit health-query disclaimers or refusal protocols following a peer-reviewed BMJ Open failure audit.
- Consumer reliance on nutrition advice from these chatbots, the weakest-performing category in the study, could accelerate health harm before platform-level content filters or warnings are deployed.
- Hallucinated citations across all five systems create medical liability exposure for institutions or clinicians who recommend AI-assisted health guidance to patients without disclosure of fabrication risk.
Opportunities
- Medical AI verification and clinical decision-support vendors can use this BMJ Open audit as a procurement lever, pitching audited and citation-grounded alternatives to general-purpose chatbots for hospital systems.
- Third-party AI health-reliability audit firms gain a reproducible benchmark from the study's adversarial 50-prompt framework, usable as a standardized baseline for testing future model versions at release.
- AI literacy and public health education organizations have a high-credibility peer-reviewed anchor to build consumer awareness campaigns around the limits of chatbot medical advice.
What we don't know yet
- Whether ChatGPT 3.5 (released November 2022) was the right version to test, given that newer OpenAI models have since been released, and whether the highly problematic rate would differ on current versions.
- Which specific claims within the five topic categories generated the most hallucinated citations, and whether certain topics are structurally harder for large language models to handle accurately.
- Whether any of the five AI providers (Google, High-Flyer, Meta, OpenAI, xAI) were notified of the findings pre-publication and whether product changes have since followed.
Originally reported by psypost.org
Original headline: Five AI Chatbots Fail Medical Misinformation Test — Nearly 50% of Responses Problematic, Grok Performs Worst, Hallucinated Citations Universal