pnas.org via Reddit

GPT-4.5 Clears Turing Test With 73% Human Rating

By Alexis Dufresne. Published May 29, 2026 at 11:05 UTC. Updated May 29, 2026 at 11:10 UTC.


Key insights

GPT-4.5 with a humanlike persona was identified as human 73% of the time, surpassing actual human participants in the same experiment.
LLaMA-3.1-405B was judged human 56% of the time, a rate statistically indistinguishable from that of the human participants, while GPT-4o and ELIZA scored below chance, suggesting that output style shapes detectability.
The study used Turing's original three-party design with preregistration, making it one of the most methodologically rigorous Turing tests published to date.

Why this matters

This is the first peer-reviewed, preregistered confirmation that LLMs pass the Turing test under controlled conditions, which removes the methodological objections that let practitioners dismiss earlier claims. Fraud detection pipelines, content moderation systems, and identity verification tools built on the assumption that AI-generated conversational text is detectable now face PNAS-published evidence that the assumption fails against GPT-4.5. The gap between GPT-4o scoring below chance and GPT-4.5 scoring 73% suggests that persona instruction design is a first-order variable in misuse risk, so operators cannot rely on raw model benchmarks alone to estimate deceptive-use exposure.

Summary

GPT-4.5, given a humanlike persona prompt, was judged human 73% of the time in the first controlled, preregistered Turing test, outperforming the real humans it competed against. The UCSD study, published in PNAS, used Turing's original three-party design across two independent populations. LLaMA-3.1-405B was judged human 56% of the time, statistically indistinguishable from the human participants, while GPT-4o and ELIZA both fell below chance. In short, two models (OpenAI's GPT-4.5 and Meta's LLaMA-3.1-405B) now clear human detection thresholds in structured conversation.

Persona framing drove the 73% result, not raw model scale alone.
GPT-4o's formal, hedging output style appears to have signaled AI to judges, which would account for its below-chance score.
The preregistered design removes the methodological objections used to dismiss prior Turing test claims.

Fraud detection and online identity systems now operate on assumptions about human communication that no longer hold.
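For readers unsure what "below chance" means here: in a three-party test the judge picks one of two witnesses as the human, so a system that is indistinguishable from a person should be picked about 50% of the time. The sketch below shows how a pass rate can be compared with that 50% mark using an exact binomial test. The trial counts are hypothetical placeholders chosen only to reproduce the reported percentages; they are not the study's sample sizes, and the function names are illustrative rather than taken from the paper.

```python
from math import comb


def win_rate(times_judged_human: int, trials: int) -> float:
    """Share of trials in which the judge picked this witness as the human."""
    return times_judged_human / trials


def two_sided_binomial_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test against chance level p.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null hypothesis.
    """
    pmf = [
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small tolerance so ties are not lost to floating-point rounding.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# HYPOTHETICAL counts: 100 trials per system, chosen to match the
# reported percentages. These are NOT the study's actual sample sizes.
hypothetical_results = {
    "GPT-4.5 (persona)": (73, 100),
    "LLaMA-3.1-405B (persona)": (56, 100),
    "GPT-4o": (21, 100),
}

for name, (wins, n) in hypothetical_results.items():
    rate = win_rate(wins, n)
    p_value = two_sided_binomial_p(wins, n)
    print(f"{name}: judged human {rate:.0%} of the time, p vs. chance = {p_value:.4g}")
```

Under these assumed counts, a 56% rate cannot be told apart from a coin flip, while 73% sits well above chance and 21% well below it. The real significance levels depend on the study's actual number of trials.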

Potential risks and opportunities

Risks

Trust and safety teams at Reddit, X, and LinkedIn face immediate pressure to audit detection systems that relied on behavioral signals the study shows GPT-4.5 now replicates at scale
Conversational identity verification vendors (Persona, Jumio, Socure) using challenge-response methods now have published peer-reviewed evidence their text-based approach fails against a prompted GPT-4.5 deployment
Platforms and courts relying on text-based attestations of human identity in legal or regulatory contexts have no disclosed remediation path following this finding

Opportunities

AI detection vendors (Originality.ai, Hive Moderation, Writer) can anchor new enterprise contracts on conversational AI detection as a named, PNAS-validated threat category
Authentication companies adding biometric or multi-modal verification gain immediate pricing leverage as text-only identity verification loses institutional credibility
EU AI Act compliance consultancies gain a concrete enforcement hook, since the study's results strengthen the case for mandatory AI disclosure obligations in conversational deployments

What we don't know yet

Whether the 73% pass rate holds in asynchronous contexts like email or forums versus the real-time chat format used in the study
Which specific linguistic features led judges to identify GPT-4o as AI; the published findings include no feature-level analysis of its below-chance result
No data on detection rates when judges are explicitly instructed to look for AI, versus the naive-judge setup used across both populations

Originally reported by pnas.org


Original headline: PNAS: GPT-4.5 Is Judged Human 73% of the Time — First Controlled Turing Test Study Confirms LLMs Now Pass the Benchmark