CMU's PACE proxy predicts agent benchmark scores 100x cheaper

TL;DR

  • Pace predicts agent benchmark scores (GAIA, SWE-Bench Verified, SWE-Bench Multimodal, SWT-Bench) with mean absolute error under 4% and pairwise ranking accuracy around 85%.
  • The method fits a regression from a model's scores on about 100 atomic instances, drawn from 19 non-agentic benchmarks, to its full agent score.
  • Authors report roughly 100x lower dollar cost than a random target-sampling baseline, and PlanBench emerges as the biggest source contributor across all four targets.

Full agent evals, such as a SWE-Bench Verified pass or a GAIA browser run, have become expensive enough that they limit how often teams can measure progress during training. A new paper on Hugging Face from Carnegie Mellon and Salesforce AI Research proposes a workaround: predict those agent scores from a small set of cheap atomic evals instead of running the full rollouts.

The method, called Pace, picks around 100 instances from a pool of 19 non-agentic benchmarks (things like PlanBench, BFCL, IFEval, MMMU) and fits a lightweight regression from a model's scores on those instances to its score on the target agent benchmark. The authors report a mean absolute error under 4% and a pairwise model-ranking accuracy of about 85% across four target benchmarks: GAIA, SWE-Bench Verified, SWE-Bench Multimodal, and SWT-Bench. Their headline claim is that at equal prediction quality, Pace runs roughly 100x cheaper than a random target-sampling baseline, evaluated across 14 models with leave-one-out cross-validation.
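To make the evaluation protocol concrete, here is a minimal sketch of the regression-plus-leave-one-out loop described above. It is not the authors' code: the write-up only says "lightweight regression", so ridge regression is an assumption, the instance-selection step is skipped entirely, and the data is synthetic (a made-up `skill` variable drives both the atomic scores and the agent score). The function names are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept (assumed model)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Dual form: cheaper when there are fewer models (rows) than instances (columns).
    K = Xc @ Xc.T
    dual = np.linalg.solve(K + alpha * np.eye(len(y)), yc)
    w = Xc.T @ dual
    return w, y_mean - x_mean @ w

def leave_one_out_predictions(X, y, alpha=1.0):
    """Predict each model's agent score from a fit on all the other models."""
    preds = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        w, b = fit_ridge(X[train], y[train], alpha)
        preds[i] = X[i] @ w + b
    return preds

def mean_absolute_error(y, preds):
    return float(np.mean(np.abs(y - preds)))

def pairwise_ranking_accuracy(y, preds):
    """Fraction of model pairs whose predicted order matches the true order."""
    correct, total = 0, 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue  # skip ties in the true scores
            total += 1
            correct += int((y[i] - y[j]) * (preds[i] - preds[j]) > 0)
    return correct / total if total else float("nan")

# Synthetic stand-in data: 14 models, 100 pass/fail atomic instances.
rng = np.random.default_rng(0)
n_models, n_instances = 14, 100
skill = rng.uniform(0.2, 0.9, size=n_models)  # hypothetical latent ability
X = (rng.uniform(size=(n_models, n_instances)) < skill[:, None]).astype(float)
y = 100 * (0.6 * skill + rng.normal(0, 0.02, size=n_models))  # fake agent score, in %

preds = leave_one_out_predictions(X, y, alpha=10.0)
print("MAE (points):", mean_absolute_error(y, preds))
print("Pairwise ranking accuracy:", pairwise_ranking_accuracy(y, preds))
```

The numbers this prints say nothing about Pace itself; the point is only the shape of the protocol: hold out one model, fit on the other 13, predict the held-out score, then summarize with absolute error and pairwise ordering.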

One useful thing that falls out of the selection step is a per-benchmark capability profile. GAIA's proxy leans on instruction-following and verification instances; SWE-Bench Multimodal is dominated by long-context aggregation; SWT-Bench concentrates on planning and test verification. PlanBench shows up as the single biggest contributor across all four targets, which suggests that planning is the atomic skill that most closely tracks agent performance in this setup.

The honest caveats sit in the paper's own limitations section. Everything here was measured on agents built with the OpenHands framework, so transfer to other scaffolds is untested. The calibration set is only 14 models, small relative to the roughly 100 selected instances used as features, which the authors flag as making individual regression weights less reliable. And the usual proxy hazard applies: once a small, fixed public set becomes the leaderboard people iterate against, it can be optimized without genuine capability gains. The paper also leaves open whether the proxy still holds on model families released after that 14-model calibration.
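The point about 14 models versus 100 features can be illustrated with a little linear algebra. The sketch below uses synthetic random data and a plain unregularized linear fit with no intercept (both assumptions, not the paper's setup) to show that with fewer models than features, very different weight vectors reproduce the calibration scores equally well.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(14, 100))   # synthetic: 14 models x 100 instance scores
y = rng.uniform(size=14)          # synthetic agent scores

# Minimum-norm weights that fit the 14 calibration scores exactly.
w_min = np.linalg.pinv(X) @ y

# Any direction in the null space of X can be added without changing the fit.
_, _, vt = np.linalg.svd(X)       # full_matrices=True by default, so vt is 100x100
null_direction = vt[-1]           # X @ null_direction is ~0 because rank(X) <= 14
w_alt = w_min + 5.0 * null_direction

same_fit = np.allclose(X @ w_min, X @ w_alt)
max_weight_gap = np.max(np.abs(w_min - w_alt))
print("identical fit on calibration models:", same_fit)
print("largest difference in a single weight:", max_weight_gap)
```

Regularization picks one solution out of this family, which can still predict well in aggregate, but it does not make any single weight a trustworthy measure of how much one instance matters.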

If it holds up, the upside is a much tighter iteration loop for anyone doing agent post-training or base-model selection, where the price of a single SWE-Bench Verified sweep has been enough to make you think twice about running it. Cheap enough to run per checkpoint is a different regime than expensive enough to run once at the end.