ucsd.edu via Reddit

GPT-4.5 Passes Turing Test, Fools Judges 73% of the Time

openai generative ai ai-research llm-benchmarks

Key insights

  • GPT-4.5 was judged human 73% of the time, exceeding the ~50% baseline real humans achieved under identical conditions.
  • Persona prompting drove the performance gain; without it, scores dropped sharply, which points to social mimicry rather than language quality as the deciding factor.
  • This is the first peer-reviewed three-party Turing test study published in a major journal, based on nearly 500 participants.

Why this matters

The study closes a long-standing methodological gap: prior Turing test claims lacked peer review, controlled baselines, and the correct three-party format, so this result is harder to dismiss than previous headlines. For AI product builders, the finding that persona prompting (not model scale alone) drives human-likeness changes where optimization effort should go in conversational agents, customer-facing systems, and synthetic data generation. For anyone building trust or verification infrastructure, the implication is direct: a text-only conversation can no longer be treated as reliable proof of a human when the other party may be a frontier model.

Summary

GPT-4.5 just beat humans at being human. A peer-reviewed study from UC San Diego, published in PNAS, ran the first rigorous three-party Turing test at scale and found GPT-4.5 was identified as human 73% of the time, while actual humans only cleared roughly 50% under the same conditions.

The mechanism matters here. Performance collapsed without persona prompting, which points to social behavioral mimicry as the primary driver rather than raw linguistic sophistication. The model wasn't winning on grammar or vocabulary; it was winning by acting the way a person acts in conversation.

LLaMA-3.1-405B scored 56%, clearing the baseline but well behind GPT-4.5. In short, both OpenAI's GPT-4.5 and Meta's LLaMA-3.1 now score above the human baseline in structured deception tests, with OpenAI holding a significant lead.

  • The study ran across nearly 500 participants using both university and Prolific online samples, making the results more generalizable than prior lab-only work.
  • The three-party format (one human judge, one AI, one human) is the methodologically valid version of the test; prior studies used weaker two-party designs.
  • Persona prompting was the single biggest performance lever, outweighing model scale or fluency.

The Turing test was always a social benchmark, not a linguistic one, and these results support that reading.
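Whether a rate such as 73% or 56% really sits above the roughly 50% baseline depends on how many judge verdicts stand behind it. The sketch below is illustrative only: it computes a Wilson score interval for a proportion and checks whether the baseline falls inside it. The trial count (`HYPOTHETICAL_TRIALS`) is a placeholder chosen for the example, not a figure from the study, and the helper names are invented here.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return centre - half_width, centre + half_width


def compare_to_baseline(successes: int, trials: int, baseline: float = 0.5) -> str:
    """Classify an observed rate as above, below, or indistinguishable from baseline."""
    low, high = wilson_interval(successes, trials)
    if low > baseline:
        return "above"
    if high < baseline:
        return "below"
    return "indistinguishable"


# ASSUMPTION: 100 verdicts per model is a made-up count for illustration.
HYPOTHETICAL_TRIALS = 100

for model, rate in [("GPT-4.5", 0.73), ("LLaMA-3.1-405B", 0.56)]:
    judged_human = round(rate * HYPOTHETICAL_TRIALS)
    low, high = wilson_interval(judged_human, HYPOTHETICAL_TRIALS)
    verdict = compare_to_baseline(judged_human, HYPOTHETICAL_TRIALS)
    print(f"{model}: {rate:.0%} judged human, interval {low:.2f} to {high:.2f}, {verdict}")
```

With this placeholder count, 73% lands clearly above 50%, while 56% cannot be separated from it. The real conclusion for each model depends on the study's actual number of verdicts, which is not given in this summary.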

Potential risks and opportunities

Risks

  • Online platforms that rely on text-based signals of humanity (Reddit, Quora, review sites) face immediate credibility exposure, since the result is a peer-reviewed demonstration that AI impersonation can succeed at scale.
  • Legal and compliance teams at firms deploying conversational AI in regulated contexts (financial advice, healthcare triage) now face a peer-reviewed benchmark that plaintiffs can cite to argue the AI was indistinguishable from a human agent.
  • Academic integrity systems built on conversational interview tools may be judged insufficient, and institutions could face pressure to overhaul oral examination and verification protocols within the next academic cycle.

Opportunities

  • Identity verification vendors building behavioral and multimodal signals (Persona, Sardine, Socure) gain a strong sales narrative now that a published benchmark undercuts pure-text verification.
  • Researchers and startups working on AI detection tooling can anchor product claims to a specific, citable figure (the 73% pass rate) when pitching to enterprise and government buyers.
  • OpenAI has a concrete third-party validation point for GPT-4.5's conversational capability that its sales and partnership teams can use against Anthropic and Meta when competing for customer-facing agent deployments.

What we don't know yet

  • Whether GPT-4.5's 73% rate holds in non-English conversations or in culturally distinct participant pools not sampled in this study.
  • Which specific persona prompting strategies drove the gains: the paper flags the mechanism, but the prompts themselves may not be fully disclosed.
  • Whether, and how, current CAPTCHA and identity-verification vendors (Arkose Labs, hCaptcha) are factoring this result into their threat assessments.