psu.edu via Reddit May 29th 2026

Penn State AI health study flags dermatology risk

healthcare hallucinations ai-healthcare clinical-accuracy

Key insights

AI chatbots answered 76% of general health questions correctly, but accuracy varied sharply across medical specialties.
OB/GYN and ENT scored highest while neurology, dermatology, and internal medicine had the lowest validity and highest harm scores.
Prompts between 60 and 250 characters consistently produced more accurate responses than vague or very long queries.

Why this matters

Consumer-facing health AI is already deployed at scale by companies like Amazon, Google, and numerous telehealth platforms, and this study provides the first specialty-level accuracy map they can benchmark against. The 76% average conceals categories with elevated harm ratings, such as dermatology, where errors carry real clinical risk. That creates liability exposure for any company recommending AI health tools without specialty-specific disclaimers. Regulators including the FDA and FTC, which are actively developing AI health guidance in 2026, now have peer-reviewed evidence that accuracy diverges sharply by specialty, a finding likely to accelerate disclosure requirements.

Summary

A Penn State study presented at ACM FAccT puts a number on consumer health AI reliability: 76% accuracy across general queries, with that figure masking a wide gap between specialties. OB/GYN and ENT scored highest, while neurology, dermatology, and internal medicine showed the lowest validity and elevated harm ratings. Query length also mattered: prompts between 60 and 250 characters outperformed vague or very long queries.

Essentially:

- The study (Penn State, ACM FAccT) produced the first large-scale specialty-stratified accuracy benchmark for consumer health AI.
- Dermatology is flagged specifically as a category that should carry explicit caution warnings to users.
- Harm ratings vary by specialty, meaning some errors carry more clinical risk than others.
- The findings apply to consumer health AI broadly, not a single chatbot.

For a space where companies are racing to embed AI into patient-facing tools, a specialty-level accuracy map gives regulators and liability lawyers concrete evidence to work with.

Potential risks and opportunities

Risks

Telehealth platforms (Teladoc, Babylon Health) that have embedded AI chatbots for triage without specialty-level disclaimers face increased regulatory scrutiny following this publication.
Incorrect neurological or dermatological guidance given to patients by consumer health AI could become the basis for product liability claims against chatbot providers in the next 12-18 months.
Apple (Health app AI features) and Google (Search AI health responses) risk FTC enforcement action if they do not add specialty-specific accuracy warnings within the next regulatory cycle.

Opportunities

Specialty-focused health AI startups targeting OB/GYN and ENT verticals can now cite this study to justify their validated, narrow-scope approach to enterprise hospital and insurer buyers.
Medical AI auditing vendors (Viz.ai or new entrants) gain a clear commercial hook: third-party specialty-stratified accuracy certification for consumer health AI products.
Malpractice and product liability insurers can develop AI health tool coverage tiers priced by specialty risk, using this study's harm-rating framework as an actuarial baseline.

What we don't know yet

Which specific AI chatbots were evaluated: the study benchmarks the category broadly but does not disclose per-product performance breakdowns.
Whether elevated harm ratings in dermatology reflect image-based diagnostic limitations or text-only query failures, since the study used text prompts.
How accuracy scores shift as frontier models are updated through 2025-2026: results may already be partially outdated given the pace of model releases.

Originally reported by psu.edu

Original headline: Penn State Study: AI Chatbots Answer 76% of Everyday Health Questions Correctly, But Neurology and Dermatology Score Lowest With Elevated Harm Ratings