pnas.org via Reddit

GPT-4.5 Clears Turing Test With 73% Human Rating

openai research safety ai ethics turing-test llm-evaluation ai-capabilities

Key insights

  • GPT-4.5 with a humanlike persona was identified as human 73% of the time, surpassing actual human participants in the same experiment.
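  • LLaMA-3.1-405B, given the same persona prompt, was judged human 56% of the time, statistically indistinguishable from the humans it was compared against, while the baseline systems GPT-4o and ELIZA scored well below chance (21% and 23%), suggesting that prompting and output style shape detectability.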
  • The study used Turing's original three-party design with preregistration, making it one of the most methodologically rigorous Turing tests published to date.

Why this matters

This is the first peer-reviewed, preregistered confirmation that LLMs pass the Turing test under controlled conditions, which removes the methodological objections that let practitioners dismiss earlier claims. Fraud detection pipelines, content moderation systems, and identity verification tools built on the assumption that AI-generated conversational text is detectable now face PNAS-level evidence that the assumption fails against a persona-prompted GPT-4.5. The gap between GPT-4o scoring below chance and GPT-4.5 scoring 73% suggests that persona instruction design is a first-order variable in misuse risk, so operators cannot rely on raw model benchmarks alone to estimate deceptive-use exposure.

Summary

GPT-4.5, given a humanlike persona prompt, was judged human 73% of the time in the first controlled, preregistered Turing test, outperforming the real humans it competed against. The UCSD study, published in PNAS, used Turing's original three-party design across two independent populations. LLaMA-3.1-405B, at 56%, was statistically indistinguishable from its human counterparts, while GPT-4o and ELIZA both fell below chance. In short, two models (OpenAI's GPT-4.5 and Meta's LLaMA-3.1-405B) now clear human detection thresholds in structured conversation.

  • Persona framing, not raw model scale alone, drove the 73% result.
  • GPT-4o's formal, hedging output style may have signaled AI to judges, which would help explain its below-chance score.
  • The preregistered design removes the methodological objections used to dismiss prior Turing test claims.

Fraud detection and online identity systems now operate on assumptions about human communication that no longer hold.
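
To make "above chance" and "below chance" concrete: in a three-party test the judge picks one of two witnesses as the human, so a system that is indistinguishable from a person should be picked about 50% of the time. The sketch below runs an exact two-sided binomial test against that 50% line. The trial count is a hypothetical placeholder, not the study's actual sample size, and this is not necessarily the analysis the authors used; only the 73% and 56% rates come from the summary above.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binomial_p(wins: int, trials: int, chance: float = 0.5) -> float:
    """Exact two-sided p-value: total probability of every outcome that is
    no more likely than the observed one under the chance hypothesis."""
    observed = binomial_pmf(wins, trials, chance)
    pmfs = [binomial_pmf(k, trials, chance) for k in range(trials + 1)]
    # Small relative tolerance so float rounding does not drop ties.
    total = sum(p for p in pmfs if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# ASSUMPTION: 100 trials per system is a made-up illustrative number.
HYPOTHETICAL_TRIALS = 100

# Win rates (share of trials judged human) quoted in the summary.
win_rates = {
    "GPT-4.5 (persona)": 0.73,
    "LLaMA-3.1-405B (persona)": 0.56,
}

for name, rate in win_rates.items():
    wins = round(rate * HYPOTHETICAL_TRIALS)
    p_value = two_sided_binomial_p(wins, HYPOTHETICAL_TRIALS)
    print(f"{name}: {wins}/{HYPOTHETICAL_TRIALS} judged human, p = {p_value:.4g}")
```

Under this assumed trial count, a 73% rate is very unlikely to arise from coin-flip guessing, while a 56% rate is compatible with it, which is the sense in which one model "beats" the humans and the other merely "matches" them. With a different sample size the p-values would change, so treat the output as an illustration of the logic only.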

Potential risks and opportunities

Risks

  • Trust and safety teams at Reddit, X, and LinkedIn face immediate pressure to audit detection systems that rely on behavioral signals the study shows a persona-prompted GPT-4.5 can now replicate.
  • Conversational identity verification vendors (Persona, Jumio, Socure) that use text-based challenge-response methods now face published, peer-reviewed evidence that such approaches can fail against a persona-prompted GPT-4.5 deployment.
  • Platforms and courts relying on text-based attestations of human identity in legal or regulatory contexts have no established remediation path following this finding.

Opportunities

  • AI detection vendors (Originality.ai, Hive Moderation, Writer) can anchor new enterprise contracts on conversational AI detection as a named, PNAS-validated threat category.
  • Authentication companies adding biometric or multi-modal verification gain immediate pricing leverage as text-only identity verification loses institutional credibility.
  • EU AI Act compliance consultancies gain a concrete enforcement hook, since the study's results strengthen the case for mandatory AI disclosure obligations in conversational deployments.

What we don't know yet

  • Whether the 73% pass rate holds in asynchronous contexts like email or forums, as opposed to the real-time chat format used in the study.
  • Which specific linguistic features led judges to identify GPT-4o as AI; the published findings give no feature-level analysis behind its below-chance result.
  • How detection rates change when judges are trained in or briefed on AI-detection strategies, as opposed to the untrained judges used across both populations.