reddit.com via Reddit

r/ClaudeAI: Developer Runs Claude Fable 5 Against 4 Private Benchmarks Using Hidden Docker/Playwright Tests — Passes 3, Fails One Sonnet 4.6 Partially Caught

anthropic coding tools agents benchmarks claude coding-agents

Summary

A developer tested Claude Fable 5 on four private coding benchmarks built from real production bugs. Each run was graded by hidden Playwright tests executed inside Docker after the agent finished: one attempt, no retries, and the model never sees the test assertions. Fable 5 passed three benchmarks cleanly but failed the fourth, a bug that Claude Sonnet 4.6 had partially caught, which suggests the new flagship model does not uniformly outperform its predecessor on hard regression-style bug reproduction. The thread is gaining attention as a methodologically controlled counterpoint to the showcase demos that dominate community coverage of Fable 5.
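The grading protocol described above (hidden tests, run once in Docker after the agent is done, no retries) could be sketched roughly as follows. This is an illustrative assumption, not the developer's actual harness: the image name `playwright-grader:local`, the mount points, and the helper names `build_grading_command` and `grade_once` are all hypothetical, and the thread does not publish its setup.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path

# Hypothetical image name; the thread does not say which image was used.
PLAYWRIGHT_IMAGE = "playwright-grader:local"


@dataclass(frozen=True)
class Verdict:
    benchmark: str
    passed: bool


def build_grading_command(workdir: Path, hidden_tests: Path) -> list[str]:
    """Build a single `docker run` invocation for the grader.

    The agent's finished work is mounted at /app and the hidden tests are
    mounted read-only at /tests, so they only appear after the agent stops.
    Both paths are assumed to be absolute (Docker bind mounts expect that).
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/app",
        "-v", f"{hidden_tests}:/tests:ro",
        "-w", "/tests",
        PLAYWRIGHT_IMAGE,
        # --retries=0 keeps Playwright from re-running failed tests.
        "npx", "playwright", "test", "--retries=0",
    ]


def grade_once(benchmark: str, workdir: Path, hidden_tests: Path,
               run=subprocess.run) -> Verdict:
    """Run the hidden tests exactly once and map the exit code to pass/fail.

    `run` is injectable so the logic can be exercised without Docker.
    """
    result = run(build_grading_command(workdir, hidden_tests), check=False)
    return Verdict(benchmark, result.returncode == 0)
```

Under these assumptions the verdict is all-or-nothing per benchmark, so a "partial catch" like the one attributed to Sonnet 4.6 would need per-test reporting that this sketch does not model.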