Plan Eviction Cuts ALFWorld Agent Success by 34.7 Points
TL;DR
- Naive plan eviction reduces ALFWorld task success by 34.7 percentage points in experiments on Llama-3.1-70B.
- On Llama-3.1-70B, plan signal in hidden states falls 4.1x within a single action-observation step after eviction.
- Probe-gated re-surfacing, where a probe detects signal decay and reintroduces the plan, fails to recover lost performance.
Long-horizon agents, the ones you deploy to browse, research, or execute multi-step tasks, depend on context management to function. When the token window fills, something has to go. The conventional wisdom is that older information, specifically conversational history and intermediate observations, can be safely dropped first. A new paper by Aman Mehta and Anupam Datta challenges where that assumption breaks: it breaks with plans.
The paper, posted to arXiv, introduces "replay pairing," a diagnostic that runs the same agent trajectory twice, once with the plan retained in context history and once without, measuring the difference in the model's hidden states. On Llama-3.1-70B, plan signal spikes immediately after the plan is introduced, then falls 4.1x within a single action-observation step. On HotpotQA, the decay is 12.4x in that same step. The core finding is that "standard LLM agents do not carry plans forward as persistent state, and instead depend on the plan remaining in context."
The practical cost surfaces in a compression stress test: naive plan eviction cuts success on the ALFWorld household task benchmark by 34.7 percentage points. The authors also tested a probe-gated re-surfacing strategy, where a diagnostic probe detects plan signal decay and reintroduces the plan, and found it does not recover performance. Detection alone does not solve the problem.
Reasoning models complicate the picture further. Models that use explicit reasoning traces re-derive plan content during inference, which creates what the authors call a "reasoning-trace confound": naive stripping of the plan from the test condition still leaves plan evidence behind in those traces, masking the decay signal. Correcting for this with strict stripping of prior reasoning blocks recovers substantial measurement signal.
The honest caveat is that this is a measurement and diagnostic framework, not a solution. The paper gives practitioners a tool for identifying when plans are at risk of being evicted, but the probe-gated approach explicitly fails to recover lost performance. What the paper does not give you is a working fix. For teams building agent infrastructure, the practical implication is that plan tokens may need first-class protection in eviction policies rather than being dropped by recency, and architectures that genuinely persist plan representations outside the context window remain an open research direction.
Shared on Bluesky by 1 AI expert
Originally reported by paper
Read the original article →Original headline: Evict the Plan, Lose 34 Points: New Study Makes Context Management Load-Bearing for Agents