arxiv.org web signal

Nam et al.: bad context 'pigeonholes' LLMs, 38-40% accuracy drop

TL;DR

  • Across 10 models and 10 tasks, repeating an incorrect answer from context dropped LLM performance by 38 to 40 percent.
  • Performance degraded a further 14 percent or more as repeated mistakes accumulated from one to five conversation turns.
  • The authors' proposed mitigation, RLVR trained with synthetic errors, improved models by 43 to 60 percent under bad contexts versus vanilla RLVR.

A new paper from Hyunji Nam, Keertana Chidambaram, Dorottya Demszky and Natasha Jaques, posted to arXiv, puts a name to a failure mode that does not need a jailbreak to show up: pigeonholing. The setup they describe is mundane, a user pushes a wrong solution, or the assistant's earlier reply was buggy and now sits in the context, and the model quietly degrades from there.

The numbers are the part that made me pay attention. Across ten models and ten tasks, repeating an incorrect answer from context dropped performance by 38 to 40 percent. With more turns it gets worse, an additional fourteen percent or more as repeated mistakes pile up from one turn to five. Coding and text generation collapse into a narrow band of answers, and on controversial topics the model flips its stance to align with whatever was said before. The detail I find most uncomfortable is that mode collapse showed up even when the example provided in context was correct.

Why this matters if you build with LLMs: most production chat and coding assistants retain history by design. If the trajectory of a session matters as much as this paper says, the conventional approach of feeding everything back is leaking quality you cannot see from a single-turn eval. Long sessions, and few-shot prompting that includes prior user input, start to look like risk surfaces rather than free wins.

The authors propose a mitigation, reinforcement learning with verifiable rewards augmented with synthetic errors during training, which they report improves models by 43 to 60 percent under bad contexts compared with vanilla RLVR. Take that as the authors' own claim, not yet independently replicated. The abstract also does not tell you which ten models were tested, how closed frontier systems compared with open weights, or what the clean-prompt cost of the training tweak is.

If the finding holds, the win goes to the teams building eval harnesses and to application teams running long-session products. A benchmark that explicitly measures degradation across turns, plus a bit of context pruning at runtime, would be a cheap defense against what is otherwise an invisible decay.

Shared on Bluesky by 2 AI experts