arxiv.org web signal

SelfCompact Cuts Agent Token Costs Up to 70% Without Fine-Tuning

TL;DR

  • SelfCompact lets language model agents autonomously decide when to summarize accumulated context, requiring no fine-tuning.
  • The method reduces per-question token usage by 30-70% across six benchmarks and seven models tested.
  • Performance gains reach up to 18.1 points on mathematical problems and 5-9 points on agentic search tasks.

One of the quieter cost drivers in production AI agents is not the model itself but the context it drags along: every tool call, reasoning step, and intermediate result accumulates, and by the time an agent is deep into a multi-step task, a large share of its context window may be stale information constraining every new generation. According to a paper from researchers at Johns Hopkins and Google DeepMind, models left to their own devices struggle to recognize when their context has deteriorated to this point, yet the fix turns out to be surprisingly lightweight.

The proposed system, SelfCompact, gives the agent a compaction tool it can invoke to summarize its accumulated context, paired with a rubric specifying when to trigger that tool (at sub-task resolution or trajectory convergence) and when to suppress it (mid-derivation or when the agent is stuck). The critical design choice is that the whole thing requires no fine-tuning or external supervision -- it is scaffolding, not a retrained model.

The reported results span six benchmarks and seven models. Per-question token usage falls 30-70% compared to baselines, and the method matches or exceeds fixed-interval summarization. On mathematical problems, the gains reach up to 18.1 points; on agentic search tasks, 5-9 points. The framing the authors offer is that context management is a capability structured prompts can provide, not one that requires training.

The honest caveat is that benchmark structure tends to be cleaner than production workloads, and a rubric built around concepts like 'sub-task resolution' may misfire in tasks where those boundaries are genuinely blurry. What the paper does not address is latency -- the compaction tool call is itself a generation step, and for latency-sensitive applications the wall-clock trade-off against token savings is an open question.

For teams already running agents at scale and watching token bills compound across long traces, this is worth a close look. The no-retraining requirement means it can be dropped into existing pipelines without a fine-tuning investment, which is the kind of low-adoption-barrier improvement that tends to spread quickly once someone benchmarks it against their own workload.

Shared on Bluesky by 2 AI experts