transformernews.ai via Reddit

Researcher Finds Fatal Flaws in METR AI Progress Graph

safety ai-benchmarks ai-capabilities evaluation

Key insights

  • NYU Stern's Nathan Witkin identifies severe errors in METR's Long Tasks benchmark, the data behind the widely-cited AI time horizons graph.
  • The METR graph has been used in safety, policy, and investment contexts to argue AI task-completion horizons are expanding rapidly.
  • If the critique holds, a primary empirical pillar supporting frontier AI progress narratives loses its evidentiary weight.

Why this matters

The METR time horizons graph has functioned as a load-bearing reference across AI governance debates, venture investment theses, and safety research. Errors that invalidate its conclusions would therefore force a re-evaluation of how urgently regulators, funders, and researchers should treat near-term AI risk. Practitioners who have built technical roadmaps or resource-allocation arguments on this data need to know whether the underlying benchmark is reliable before citing it further. The episode also highlights a systemic problem: a single graph with contested methodology can become canonical in a field that moves too fast for normal academic error-correction cycles.

Summary

NYU Stern researcher Nathan Witkin has published a detailed takedown of the METR AI time horizons graph, one of the most-cited empirical exhibits in AI safety, policy, and investment circles. The graph, which purports to show AI agents completing increasingly long-horizon tasks at an accelerating rate, has been treated as near-canonical evidence that frontier AI capabilities are on an explosive trajectory. Witkin's critique, published in Transformer News, argues that the Long Tasks benchmark underlying the graph contains numerous severe methodological errors that make it impossible to draw meaningful conclusions from the data. If those errors hold up to scrutiny, the graph cannot support the capability-growth claims routinely built on top of it. METR and Witkin are now at the center of a dispute over whether the empirical foundation for rapid-AI-progress narratives is sound.

  • The METR graph has been cited in AI safety arguments, regulatory submissions, and investment theses to justify claims of near-term transformative AI.
  • Witkin's focus is the Long Tasks benchmark specifically, not METR's broader research program.
  • The critique has not yet been formally peer-reviewed, and METR has not publicly responded to it.

How the AI research community responds to this challenge will determine whether one of the field's most-repeated data points survives or gets quietly retired.

Potential risks and opportunities

Risks

  • AI safety organizations and policy advocates who cited the METR graph in regulatory comments or congressional testimony face credibility damage if the errors are confirmed and those citations are surfaced.
  • Venture funds that used the time horizons graph to justify AI infrastructure or agent-company valuations in 2024-2025 rounds may face LP scrutiny if the capability-growth narrative weakens.
  • METR's standing as a neutral empirical authority in AI evaluation could erode at a critical moment when benchmark credibility is already under broad challenge from the research community.

Opportunities

  • Independent AI evaluation groups (Epoch AI, Stanford CRFM's HELM, EleutherAI) gain leverage to position their benchmarking methodologies as more rigorous alternatives to METR's Long Tasks approach.
  • Researchers who can produce a credible, reproducible long-horizon task benchmark in the next 90 days could fill the vacuum and become the new reference point for capability-growth claims.
  • Policy analysts and think tanks with technical staff (RAND, CSET, UK AISI) could accelerate influence by auditing which existing regulatory arguments rest on the METR graph and publishing corrections before those arguments are used in active rulemaking.

What we don't know yet

  • METR has not publicly responded to Witkin's specific methodological objections as of publication, so it is unresolved whether the organization disputes the errors or plans corrections.
  • Which specific regulatory submissions, safety papers, or investment documents cited the METR graph directly, and whether those authors have been notified.
  • Whether Witkin's critique has been reviewed by independent benchmark researchers, or whether it is itself subject to methodological gaps not addressed in the Transformer News piece.