huggingface.co web signal

ByteDance's EdgeBench measures 12-hour AI agent progress

TL;DR

  • EdgeBench holds 134 real-world tasks, 51 of them open, and tracks agents over 12 or more hours rather than scoring one-shot performance.
  • Claude Opus 4.8 leads the 12-hour leaderboard for the open subset at 43.6, ahead of GPT-5.5 at 42.7 and GPT-5.4 at 34.3.
  • ByteDance reports a log-sigmoid scaling law between interaction time and score, fit at R² = 0.998 across all 134 tasks.

Most agent benchmarks measure one-shot performance. ByteDance Seed's new EdgeBench dataset on Hugging Face does something less common: it measures how autonomous agents improve when given twelve hours or more on the same real problem, and it tracks the shape of that improvement curve across 134 tasks.

The 51-task open subset covers six categories: scientific problems and ML, systems and software engineering, combinatorial optimization, professional knowledge work, formal math and theorem proving in Lean 4 and Coq, and interactive games and simulators like NetHack, OpenTTD and text adventures. On the twelve-hour leaderboard for that subset, Claude Opus 4.8 tops the table at 43.6, ahead of GPT-5.5 at 42.7 and GPT-5.4 at 34.3, with GLM-5.1 and DS-V4-Pro trailing. Systems and software engineering is where every listed model scores highest, with Claude Opus 4.8 reaching 62.0 in that category.

The claim that will draw the most attention is the scaling result. ByteDance reports that performance follows a log-sigmoid scaling law as a function of interaction time, and puts the fit at R² = 0.998 across all 134 tasks. If that curve holds under independent evaluation, deciding how long to let an agent run stops being a vibe and starts being something you can extrapolate. The dataset card also flags an evaluation harness called SForge, released alongside the benchmark, and cites roughly 38,000 hours of recorded agent interaction as the underlying data.
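To make the extrapolation idea concrete, here is a minimal sketch of fitting a saturating curve to score versus interaction time and checking the fit with R². Everything in it is an assumption for illustration: the source does not give ByteDance's exact functional form, so "log-sigmoid" is read here as a sigmoid in log-time; the function names, the fixed ceiling `s_max`, and the parameter values are hypothetical; and the data points are synthetic, generated from the curve itself rather than taken from EdgeBench.

```python
import math

def log_sigmoid_score(t_hours, s_max, k, t_mid):
    """Sigmoid in log-time. ASSUMED form, not taken from the EdgeBench paper."""
    z = k * (math.log(t_hours) - math.log(t_mid))
    return s_max / (1.0 + math.exp(-z))

def fit_log_sigmoid(times, scores, s_max):
    """Fit k and t_mid with s_max held fixed.

    Uses the logit transform, which turns the assumed curve into a straight
    line in ln(t), then ordinary least squares on that line.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(s / (s_max - s)) for s in scores]  # logit(s / s_max)
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    k = sxy / sxx
    intercept = y_mean - k * x_mean      # equals -k * ln(t_mid)
    t_mid = math.exp(-intercept / k)
    return k, t_mid

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# SYNTHETIC, noise-free data from hypothetical parameters (not EdgeBench results).
TRUE_S_MAX, TRUE_K, TRUE_T_MID = 60.0, 1.2, 4.0
times = [0.5, 1.0, 2.0, 4.0, 8.0, 12.0]
scores = [log_sigmoid_score(t, TRUE_S_MAX, TRUE_K, TRUE_T_MID) for t in times]

k_hat, t_mid_hat = fit_log_sigmoid(times, scores, TRUE_S_MAX)
fitted = [log_sigmoid_score(t, TRUE_S_MAX, k_hat, t_mid_hat) for t in times]
fit_r2 = r_squared(scores, fitted)

# Extrapolate beyond the observed window, e.g. to a 24-hour run.
score_24h = log_sigmoid_score(24.0, TRUE_S_MAX, k_hat, t_mid_hat)
```

Because the synthetic points come straight from the assumed curve, the fit recovers the parameters almost exactly. Real trajectories would be noisy, the ceiling would need to be estimated rather than fixed, and a high R² on the observed window would not by itself guarantee that the extrapolation to longer runs holds.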

The honest caveat is that this is a self-published benchmark, with results reported by the same lab that wrote the paper. Only the 51-task subset is currently open (the full 134 is available by request via email), and the dataset card does not spell out how each competing model was scaffolded or how many seeds it was run with. The release also does not give the compute cost of reproducing those long runs, which matters if a smaller lab wants to check the scaling claim.

For anyone picking a coding assistant on the basis of raw leaderboard positions, the interesting shift is the axis, not the ranking. Twelve-hour trajectories reward tool use, memory and self-correction in a way single-turn evals do not, and that is where the next round of vendor differentiation is likely to happen.
