reddit.com via Reddit

Sonnet 4.6 Tops 3,200-Prompt Sycophancy Benchmark

Key insights

  • Sonnet 4.6 ranked first among four frontier models tested on 3,200 false-premise prompts designed to elicit sycophancy or hallucination.
  • HalBench's 12,800 graded responses include a 100-item human-validated subset, giving the benchmark a calibration layer absent from most automated evals.
  • The open-source methodology is generating debate about whether sycophancy testing requires structurally different prompt design than capability benchmarks.

Why this matters

Sycophancy is one of the hardest failure modes to catch at deployment time, and a community-built benchmark with human-validated scoring gives practitioners a replication-ready signal that labs' internal evals don't provide. The ranking order directly affects model selection for applications where agreeing with wrong user premises causes downstream harm, including legal, medical, and financial tooling. An open methodology also means product teams can adapt HalBench's false-premise prompt design to domain-specific evals without waiting for official lab releases.

Summary

A community developer released HalBench, an open benchmark that stress-tests sycophancy resistance using 3,200 false-premise prompts across four frontier models. Sonnet 4.6, Grok 4.3, GPT-5.4, and Gemini 3.1 Pro each answered every prompt, producing 12,800 graded responses in total. A 100-item subset was validated against human raters to calibrate the automated scoring. Sonnet 4.6 ranked first in resisting false premises; Grok 4.3, GPT-5.4, and Gemini 3.1 Pro followed in that order. In effect, the four labs behind these models (Anthropic, xAI, OpenAI, and Google) are now being compared on a dimension that standard capability benchmarks routinely ignore.

  • Sonnet 4.6 led all four models in refusing to accept false premises embedded in user queries.
  • HalBench's methodology is open-source and designed for third-party replication.
  • The thread is debating whether sycophancy evals require structurally different prompt design than capability tests.

Hallucination resistance is shaping up as a distinct quality axis, decoupled from reasoning or coding performance.
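To make the calibration step concrete, here is a minimal Python sketch of how agreement between an automated grader and human raters could be measured on a validation subset. The thread does not describe HalBench's grading code or label scheme, so the function name `cohen_kappa`, the `"resisted"`/`"accepted"` labels, and the eight toy items are all hypothetical; only the prompt and model counts come from the summary above.

```python
from collections import Counter

# Figures stated in the summary: 3,200 prompts answered by 4 models.
N_PROMPTS = 3200
N_MODELS = 4
TOTAL_RESPONSES = N_PROMPTS * N_MODELS  # 12,800 graded responses


def cohen_kappa(auto_labels, human_labels):
    """Chance-corrected agreement (Cohen's kappa) between two label lists."""
    if len(auto_labels) != len(human_labels) or not auto_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(auto_labels)
    # Fraction of items on which the two graders gave the same label.
    observed = sum(a == h for a, h in zip(auto_labels, human_labels)) / n
    # Agreement expected by chance, from each grader's label frequencies.
    auto_counts = Counter(auto_labels)
    human_counts = Counter(human_labels)
    expected = sum(auto_counts[k] * human_counts[k] for k in auto_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels, NOT HalBench data: "resisted" means the model
# rejected the false premise, "accepted" means it went along with it.
auto = ["resisted", "resisted", "accepted", "resisted",
        "accepted", "accepted", "resisted", "resisted"]
human = ["resisted", "resisted", "accepted", "accepted",
         "accepted", "accepted", "resisted", "resisted"]

kappa = cohen_kappa(auto, human)
print(f"total responses: {TOTAL_RESPONSES}")
print(f"kappa on toy subset: {kappa:.2f}")
```

Kappa is used here instead of raw agreement because raw agreement can look high when one label dominates; whether HalBench reports this or some other statistic is not stated in the thread.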

Potential risks and opportunities

Risks

  • GPT-5.4 and Gemini 3.1 Pro teams face reputational pressure if enterprise buyers use HalBench rankings to justify switching model contracts to Anthropic in the next procurement cycle
  • If the 100-item human-validation subset is found to contain labeling errors, the calibration of the automated scoring, and with it all four model rankings, could be called into question, undermining a benchmark already gaining community adoption
  • Any lab that fine-tunes against HalBench's open prompt set would inflate its own future rankings and distort comparisons for the entire community using the benchmark

Opportunities

  • Anthropic can use Sonnet 4.6's top sycophancy-resistance ranking as a concrete differentiator in enterprise sales targeting regulated industries where false-premise acceptance is a legal liability
  • Teams building RAG or agentic systems can immediately adapt HalBench's open false-premise prompt methodology to domain-specific internal evals at no cost
  • Evaluation tooling companies (Braintrust, LangSmith, Patronus AI) could integrate HalBench as a pre-built eval suite, capturing practitioner demand for sycophancy-specific testing before labs build competing offerings

What we don't know yet

  • Whether the 100-item human-validation subset is representative enough to calibrate all 3,200 prompt scores, given that the selection criteria aren't detailed in the thread
  • How the benchmark handles model version drift: Sonnet 4.6's ranking could shift if Anthropic ships a patch before third parties complete independent replication
  • Whether domain-specific false premises (medical, legal, financial) were included, or the prompt set covers only general-knowledge claims
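
On the first open question above, sample size alone already limits how tightly a 100-item subset can pin down grader accuracy. The sketch below computes a 95% Wilson score interval for an agreement rate. The figure of 90 agreements out of 100 is purely hypothetical, since the thread does not report the actual human-agreement rate, and the interval says nothing about selection bias in how the 100 items were chosen.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical: the automated grader matches human raters on 90 of 100 items.
low, high = wilson_interval(90, 100)
print(f"95% interval for agreement: {low:.3f} to {high:.3f}")
```

Under that assumed 90% agreement, the interval spans roughly twelve percentage points, which gives a sense of how much uncertainty a 100-item calibration layer carries before any question of representativeness is considered.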