psu.edu via Reddit

Penn State AI health study flags dermatology risk

healthcare hallucinations ai-healthcare clinical-accuracy

Key insights

  • AI chatbots answered 76% of general health questions correctly, but accuracy varied sharply across medical specialties.
  • OB/GYN and ENT scored highest while neurology, dermatology, and internal medicine had the lowest validity and highest harm scores.
  • Prompts between 60 and 250 characters consistently produced more accurate responses than vague or very long queries.
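
For teams that want to act on the prompt-length finding, a pre-submission check is one option. The sketch below is illustrative only: the 60 and 250 character bounds come from the study summary above, while the function name, the labels, and the choice to treat both bounds as inclusive are assumptions, not anything the study specifies.

```python
# Hypothetical helper: only the 60/250 character bounds come from the study
# summary; the name, labels, and inclusive bounds are assumptions.
MIN_CHARS = 60
MAX_CHARS = 250


def classify_prompt_length(prompt: str) -> str:
    """Label a health query by whether its length falls in the reported range."""
    n = len(prompt.strip())  # ignore leading and trailing whitespace
    if n < MIN_CHARS:
        return "too_short"
    if n > MAX_CHARS:
        return "too_long"
    return "in_range"
```

A product could use a label like this to nudge users toward adding detail to a terse query or trimming a very long one before it reaches the chatbot. Length is only a proxy here: the study reports a correlation with accuracy, not a guarantee for any single prompt.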

Why this matters

Consumer-facing health AI is already deployed at scale by companies like Amazon, Google, and numerous telehealth platforms, and this study provides the first specialty-level accuracy map they can benchmark against. The 76% average conceals harm-elevated categories like dermatology, where errors carry real clinical risk, creating liability exposure for any company recommending AI health tools without specialty-specific disclaimers. Regulators including the FDA and FTC, actively developing AI health guidance in 2026, now have peer-reviewed evidence that accuracy diverges sharply by specialty, which will likely accelerate disclosure requirements.

Summary

A Penn State study presented at ACM FAccT puts a number on consumer health AI reliability: 76% accuracy across general queries, with that figure masking a wide gap between specialties. OB/GYN and ENT scored highest, while neurology, dermatology, and internal medicine showed the lowest validity and elevated harm ratings. Query length also mattered: prompts between 60 and 250 characters outperformed vague or very long queries.

Essentially:

  • The study (Penn State, ACM FAccT) produced the first large-scale specialty-stratified accuracy benchmark for consumer health AI.
  • Dermatology is flagged specifically as a category that should carry explicit caution warnings to users.
  • Harm ratings vary by specialty, meaning some errors carry more clinical risk than others.
  • The findings apply to consumer health AI broadly, not a single chatbot.

For a space where companies are racing to embed AI into patient-facing tools, a specialty-level accuracy map gives regulators and liability lawyers concrete evidence to work with.

Potential risks and opportunities

Risks

  • Telehealth platforms (Teladoc, Babylon Health) that have embedded AI chatbots for triage without specialty-level disclaimers face increased regulatory scrutiny following this publication.
  • Incorrect neurological or dermatological guidance given to patients by consumer health AI could become the basis for product liability claims against chatbot providers in the next 12-18 months.
  • Apple (Health app AI features) and Google (Search AI health responses) risk FTC enforcement action if they do not add specialty-specific accuracy warnings within the next regulatory cycle.

Opportunities

  • Specialty-focused health AI startups targeting OB/GYN and ENT verticals can now cite this study to justify their validated, narrow-scope approach to enterprise hospital and insurer buyers.
  • Medical AI auditing vendors (Viz.ai or new entrants) gain a clear commercial hook: third-party specialty-stratified accuracy certification for consumer health AI products.
  • Malpractice and product liability insurers can develop AI health tool coverage tiers priced by specialty risk, using this study's harm-rating framework as an actuarial baseline.

What we don't know yet

  • Which specific AI chatbots were evaluated: the study benchmarks the category broadly but does not disclose per-product performance breakdowns.
  • Whether elevated harm ratings in dermatology reflect image-based diagnostic limitations or text-only query failures, since the study used text prompts.
  • How accuracy scores shift as frontier models are updated through 2025-2026: results may already be partially outdated given the pace of model releases.