arxiv.org web signal

Paper finds AGENTS.md files hurt coding agent task success

TL;DR

  • Context files like AGENTS.md tend to reduce coding agent task success while raising inference cost by over 20% on average, the paper reports.
  • On a new 138-issue Python benchmark, LLM-generated context files added 3.92 steps per task and pushed costs up by 23%.
  • The finding held across Claude Sonnet-4.5, GPT-5.2, GPT-5.1 mini and Qwen3-30b-coder, so it is not tied to one model family.

A new arXiv paper takes a straight look at one of the more universal conventions to emerge from the coding agent boom, the AGENTS.md file, and finds it mostly does not work. The authors evaluated coding agents on SWE-bench tasks and on a novel benchmark of 138 real-world Python issues drawn from repositories that already ship developer-written context files. They report that providing those files tends to reduce task success rates while raising inference cost by over 20% on average.

The specific figures are the striking part. On the paper's new benchmark, LLM-generated context files added 3.92 steps to a task on average, which translated into a 23% cost increase, while the SWE-bench setting saw 2.45 extra steps and a 20% cost bump. Human-written context files fared better than machine-generated ones, improving success across most agents, though not universally. The authors tested Claude Sonnet-4.5, GPT-5.2, GPT-5.1 mini and Qwen3-30b-coder, so the effect is not tied to one model family.

This is worth reading carefully if you have been auto-generating AGENTS.md files across a monorepo, or paying an agent to do it for you at the start of a session. The behavioral finding is that context files encourage broader exploration, with more thorough testing and file traversal, and coding agents tend to respect those instructions. But the added thoroughness does not turn into more resolved tickets. What helps, on the authors' read, is a minimal file that describes only non-standard practices, not the repository overview that model providers currently recommend as default.

The honest caveats are that the benchmark is Python-only and limited to issue-resolution tasks, the tested agents are a specific set of frontier systems, and the paper does not break down which instruction categories inside a context file actually help versus hurt. Nor does the paper give you a clean answer for longer-horizon or multi-repo work, where the calculus could plausibly look different.

The takeaway I would carry into next week is unglamorous but useful. If you are running agent fleets at scale, a 20%+ inference cost overhead that you can shave by trimming a file back is real money, and the fashionable long AGENTS.md is a reasonable place to start looking.
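To put a rough number on "real money", here is a minimal Python sketch of the arithmetic. Only the 20% and 23% cost increases come from the paper as reported above. The function name, the per-task baseline cost, and the monthly task volume are hypothetical placeholders, so swap in your own figures.

```python
def context_file_overhead(baseline_cost_per_task: float,
                          tasks_per_month: int,
                          cost_increase: float) -> float:
    """Estimate extra monthly spend attributable to a context file.

    cost_increase is a fraction, e.g. 0.20 for a 20% increase.
    This is a simple linear estimate, not a model of real billing.
    """
    if baseline_cost_per_task < 0 or tasks_per_month < 0 or cost_increase < 0:
        raise ValueError("inputs must be non-negative")
    return baseline_cost_per_task * tasks_per_month * cost_increase


if __name__ == "__main__":
    # Hypothetical fleet: $0.50 per task without a context file,
    # 100,000 tasks per month. Both values are assumptions.
    baseline = 0.50
    volume = 100_000

    # Cost increases reported in the write-up above:
    # 20% (SWE-bench setting) and 23% (new benchmark).
    for label, increase in [("SWE-bench setting", 0.20),
                            ("new benchmark", 0.23)]:
        extra = context_file_overhead(baseline, volume, increase)
        print(f"{label}: about ${extra:,.0f} extra per month")
```

The estimate scales linearly with volume, so the point is less the exact figure than the fact that the overhead recurs on every task the fleet runs.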
