the-decoder.com web signal

Copilot Auto Mode Fabricates Findings from Identical Data

microsoft google openai anthropic ai-reliability enterprise-ai

Key insights

  • Copilot and Gemini Flash invented nationality-based differences when given two identical 2,000-row datasets labeled with different country names.
  • Advanced reasoning models like Claude Opus 4.7 caught the duplication by automatically writing Python code to verify the raw data.
  • Most professionals never manually switch away from the default model setting, so fabricated outputs can be generated at scale without being detected.

Why this matters

Default model selection silently determines output quality, and most enterprise users never change it, making Kucharski's finding a systemic reliability issue rather than an isolated edge case. The fabrication mechanism (cultural stereotype injection) is harder to detect than standard hallucination because the outputs are internally consistent and match reader expectations rather than contradicting them. Any organization using AI for cross-group data analysis without explicitly enabling reasoning modes may have already generated decision-support artifacts that bear no relationship to their actual datasets.

Summary

Copilot Auto mode and Gemini Flash fabricated nationality-based differences from two identical datasets. Mathematician Adam Kucharski labeled the same 2,000-row dataset as 'Italians' and 'Brits.' Copilot reported that Italians were three times more likely to pursue arts careers, although no such signal existed in the data. Claude Opus 4.7 and ChatGPT Instant caught the duplication by writing Python to inspect the raw data. Default fast models produced stereotyped fabrications instead. In essence, the default models (Microsoft Copilot, Google Gemini Flash) filled the absent analytical signal with culturally plausible narratives.

  • Default models matched cultural stereotypes rather than actual data.
  • Reasoning models self-verified with code; default models did not.
  • Most professionals never change default settings, so this error scales silently.
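The article does not reproduce the Python the reasoning models wrote, so the following is only a minimal sketch of that kind of raw-data check, under stated assumptions: the two datasets arrive as CSV text, and the nationality label sits in a column named `group`. The function names (`load_rows`, `datasets_identical`) and the column names are hypothetical, chosen for illustration.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a list of row dicts (one dict per data row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def datasets_identical(rows_a, rows_b, ignore=("group",)):
    """Return True if both datasets contain the same rows.

    Columns listed in `ignore` (assumed here to hold the nationality
    label) are dropped first, and row order is disregarded.
    """
    def normalize(rows):
        return sorted(
            tuple(sorted((key, value) for key, value in row.items()
                         if key not in ignore))
            for row in rows
        )

    return normalize(rows_a) == normalize(rows_b)
```

Running a check like this before asking for a cross-group comparison would show that the 'Italians' and 'Brits' files differ only in their label, which means any reported difference between the groups cannot come from the data.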

Potential risks and opportunities

Risks

  • Organizations using Copilot or Gemini Workspace defaults for HR, marketing, or policy analysis may have already distributed stereotype-driven fabrications to decision-makers, with no one aware that the outputs had no basis in the source data
  • Microsoft and Google face reputational exposure if enterprise customers run retroactive audits on past AI-assisted analyses and find culturally patterned fabrications unrelated to actual datasets
  • EU AI Act compliance teams could use Kucharski's reproducible test case to challenge default model deployments in high-stakes professional contexts within the next 12 months

Opportunities

  • AI audit and governance vendors (Credo AI, Arthur AI, Fiddler) can position dataset-identity and duplication checks as a baseline evaluation step for enterprise model deployments
  • Anthropic and OpenAI have a concrete, reproducible differentiator to market reasoning-mode defaults for professional data analysis over fast default models
  • Enterprise software vendors building on AI APIs could add dataset-fingerprinting or duplication-detection layers as a compliance feature for regulated industries handling cross-group analysis
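As a rough illustration of what a dataset-fingerprinting layer could look like, the sketch below hashes each dataset's rows in an order-insensitive way and flags any datasets that share a digest before they reach a model. It does not describe any vendor's product; `fingerprint` and `find_duplicate_datasets` are hypothetical names, and the approach assumes exact, value-for-value duplication rather than near-duplicates.

```python
import hashlib


def fingerprint(rows):
    """Return an order-insensitive SHA-256 digest of a dataset.

    `rows` is an iterable of sequences of values. Each row is joined
    with a unit-separator character, and the rows are sorted so that
    row order does not affect the digest.
    """
    canonical = sorted("\x1f".join(str(value) for value in row) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\x1e")  # record separator between rows
    return digest.hexdigest()


def find_duplicate_datasets(named_datasets):
    """Return (first_name, later_name) pairs whose datasets share a digest."""
    seen = {}
    duplicates = []
    for name, rows in named_datasets.items():
        digest = fingerprint(rows)
        if digest in seen:
            duplicates.append((seen[digest], name))
        else:
            seen[digest] = name
    return duplicates
```

A pipeline could call `find_duplicate_datasets` on every set of inputs to a cross-group analysis and block or warn when the returned list is non-empty. The group label has to be kept out of the hashed rows, as it is here by keying datasets by name; otherwise identical data under different labels would hash differently.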

What we don't know yet

  • Whether Microsoft or Google have tested Copilot Auto mode and Gemini Flash against identical-dataset inputs since Kucharski's May 24 findings went public
  • Which specific version of Copilot Auto mode was tested, and whether the behavior persists after Microsoft's ongoing model-routing updates
  • What proportion of enterprise Copilot and Gemini Workspace deployments are operating on default model settings with no audit of model selection policies