anthropic.com web signal

Anthropic Reports AI Now Authors Over 80% of Its Production Code

TL;DR

  • As of May 2026, over 80% of Anthropic's production code was authored by Claude, up from single digits before February 2025.
  • Claude Mythos Preview outperformed human researchers on next-step research judgment 64% of the time as of April 2026.
  • Anthropic engineers shipped 8x more code per quarter in Q2 2026 versus 2024, driven by Claude-assisted development.

The speed at which a company ships software is usually limited by the number of engineers. Anthropic's essay on recursive self-improvement suggests that relationship is breaking down for them: as of May 2026, over 80% of code merged into their production systems was authored by Claude, up from single digits before February 2025. Engineers meanwhile shipped roughly 8x more code per quarter in Q2 2026 than in 2024. The piece is framed as a public accounting of what it looks like when a frontier AI lab uses its own models to accelerate AI development.

The underlying numbers sketch a trend that compounds rather than plateaus. Task completion time horizons, meaning the length of real-world tasks a model can handle reliably, have been doubling roughly every four months, versus every seven months previously. Claude Opus 3 in March 2024 could handle tasks that take a human about four minutes; Claude Opus 4.6 in March 2026 extended that to twelve hours. On software engineering benchmarks, models went from single-digit scores to saturation in two years; on research reproduction benchmarks, from roughly 20% success in 2024 to saturation in fifteen months.

The more striking claim is what happens when Claude is pointed at research judgment rather than code. According to the essay, Claude Mythos Preview chose better next steps than human researchers 64% of the time in April 2026, up from 51% with Opus 4.5 in November 2025. A demonstration in the same month reportedly had Claude-powered agents recover 97% of the performance gap on an AI safety research problem over 800 cumulative hours of work. A March 2026 internal survey of 130 Anthropic researchers estimated a median 4x output multiplier from access to Mythos Preview. These are internal figures reported by the lab itself, not independent peer-reviewed benchmarks.

The honest caveat is that Anthropic is assessing its own progress on its own models, and the incentives all tilt toward optimism. The coordination proposal in the essay, that Anthropic would support a verifiable global slowdown if other frontier labs agreed, comes with an asterisk the authors include themselves: verification would be "much more challenging" than with other technologies, because training runs are easier to conceal than physical infrastructure. The piece does not explain who would verify or what thresholds would trigger action.

What the reporting does not give you is a clear picture of failure modes at scale. Claude-powered code review reportedly catches around one-third of the bugs behind past production incidents, which is a real improvement, but also means two-thirds slip through in code that is now mostly AI-authored. Teams evaluating AI-assisted development workflows have the most to gain from studying these numbers closely: the feedback loop Anthropic describes is already commercially available and accelerating.

Shared on Bluesky by 15 AI experts (top 5 by trust)