
SelfCompact Cuts Agent Token Cost 30-70% With No Fine-Tuning

By Alexis Dufresne | Published June 23, 2026 at 05:03 UTC | Updated June 23, 2026 at 05:05 UTC

agents inference context-compression

TL;DR

SelfCompact lets models decide when to compress their own context, guided by a two-part rubric requiring no fine-tuning.
Tested across 7 open-weight models and 6 benchmarks, it cuts per-question token cost 30-70% while gaining up to 18.1 points on competitive math.
Neither component alone works: the compaction tool needs the rubric to avoid being invoked at unhelpful moments.

Long agent runs quietly rot from within. Chains of thought and tool call results pile up, stale content anchors future generations, and eventually the whole trace outgrows the context window. The standard fix is blunt: summarize everything once a token threshold is crossed and keep going. A new paper on Hugging Face from researchers including Tianjian Li, Jingyu Zhang, and Daniel Khashabi argues that the timing of compaction matters as much as the mechanism, and proposes a cleaner alternative called SelfCompact.

The core diagnosis the paper offers is what it calls a "meta-cognitive gap": unprompted models cannot reliably detect when their accumulated context is degrading their performance. Left to their own devices, they invoke compaction unevenly -- too early, too late, or not at all. SelfCompact closes that gap with two inference-time components used together. The first is a compaction tool the model can call to summarize its trace. The second is a lightweight rubric that specifies when to fire the tool (when a sub-task has resolved, or when the trajectory is converging) and when to suppress it (mid-derivation, or when the model is stuck). The paper is explicit that neither works without the other: the tool alone is used unevenly across open-weight models and is often invoked at unhelpful moments, while the rubric alone cannot act.
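To make the tool-plus-rubric pairing concrete, here is a minimal Python sketch. It is not the paper's implementation: the rubric wording, the `TraceState` labels, and the `compact`, `should_compact`, and `maybe_compact` helpers are all hypothetical names invented for illustration, and the summarizer is a stand-in for a model call. In the paper's setup the model itself decides when to call the tool after reading the rubric; the explicit state label here only makes the fire/suppress logic visible.

```python
from enum import Enum
from typing import Callable, List

# Illustrative rubric text for the system prompt. The wording is an
# assumption based on the article's summary, not the paper's actual prompt.
RUBRIC = (
    "You may call the `compact` tool to summarize your trace.\n"
    "Fire it when a sub-task has resolved or the trajectory is converging.\n"
    "Do not fire it mid-derivation or when you are stuck."
)


class TraceState(Enum):
    """Hypothetical labels for where the agent is in its trajectory."""
    SUBTASK_RESOLVED = "subtask_resolved"
    CONVERGING = "converging"
    MID_DERIVATION = "mid_derivation"
    STUCK = "stuck"


# States in which the rubric says compaction is appropriate.
FIRE_STATES = {TraceState.SUBTASK_RESOLVED, TraceState.CONVERGING}


def should_compact(state: TraceState) -> bool:
    """Return True when the rubric says to fire, False when it says to suppress."""
    return state in FIRE_STATES


def compact(
    trace: List[str],
    summarize: Callable[[List[str]], str],
    keep_last: int = 1,
) -> List[str]:
    """Replace all but the last `keep_last` entries with a single summary entry."""
    split = len(trace) - keep_last
    if split <= 0:
        # Nothing old enough to summarize; return an unchanged copy.
        return list(trace)
    return [summarize(trace[:split])] + trace[split:]


def maybe_compact(
    trace: List[str],
    state: TraceState,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Apply the compaction tool only when the rubric allows it."""
    if should_compact(state):
        return compact(trace, summarize)
    return list(trace)


def fake_summarize(entries: List[str]) -> str:
    """Stand-in for a model-generated summary of earlier steps."""
    return f"[summary of {len(entries)} earlier steps]"


if __name__ == "__main__":
    trace = ["step 1", "step 2", "step 3", "current step"]
    print(maybe_compact(trace, TraceState.SUBTASK_RESOLVED, fake_summarize))
    print(maybe_compact(trace, TraceState.MID_DERIVATION, fake_summarize))
```

The sketch mirrors the paper's claim that the two parts depend on each other: `compact` on its own would fire whenever it is called, and the rubric on its own changes nothing in the trace.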

Across 7 open-weight models and 6 benchmarks, the results show an improvement of up to 18.1 points on competitive math and 5-9 points on agentic search tasks, with a 30-70% reduction in per-question token cost compared to fixed-interval approaches. No fine-tuning is required.

The honest caveat is that competitive math and agentic search are among the more structured agent tasks available, with relatively clean sub-task boundaries that make fire/suppress decisions tractable. The paper does not address whether the rubric transfers to open-ended or adversarial trajectories, or to closed proprietary models outside the 7 tested. It also does not report the latency cost of the compaction calls themselves, which matters for anyone whose wall-clock time is as constrained as their token budget.

For practitioners running open-weight agents on math, research, or similar structured pipelines today, the cost reduction is immediate and requires no retraining. The broader implication -- that context management can be solved at the scaffolding layer rather than the training layer -- is the result worth tracking as agent traces grow longer and token budgets stay finite.

Originally reported by huggingface.co


Original headline: SelfCompact: No-Fine-Tune Scaffolding Lets Agents Self-Manage Context Compaction, Cutting Token Cost 30-70% and Adding Up to 18 Points on Math Benchmarks