Stanford AutoMem lifts Qwen 32B to match Claude Opus 4.5

TL;DR

  • Stanford's AutoMem trains memory management as a standalone skill, lifting a Qwen2.5-32B-Instruct agent by roughly 2x to 4x without changing its task-action weights.
  • On Crafter, MiniHack and NetHack the tuned 32B scores 51.36, 30.00 and 1.85 respectively, versus 49.5, 27.5 and 2.0 for Claude Opus 4.5.
  • Two outer loops drive the gains: a strong LLM rewrites memory structure from trajectories, and good memory decisions are distilled back into the agent.

A new Stanford paper is worth reading if you care about where open-weight agents are heading. In AutoMem, Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang and Serena Yeung-Levy argue that memory management (knowing what to write down, when to read it back, and how to organise a working file system) is not fixed scaffolding around an agent but a distinct, trainable skill.

Their setup takes Qwen2.5-32B-Instruct as the base agent and leaves its task-action policy alone. Two outer loops do the work: one has a stronger LLM review complete agent trajectories and iteratively rewrite the memory structure (the prompts, file schemas, and memory-action vocabulary), and the other harvests the model's own good memory decisions across many episodes and trains directly on them. Optimising memory alone, the authors report, gives roughly 2x to 4x improvements on three procedurally generated long-horizon games: Crafter, MiniHack and NetHack.
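To make the two loops concrete, here is a minimal Python sketch of how they could fit together. It is not the authors' code: every name in it (MemoryStructure, Trajectory, refine_structure, harvest_decisions, the toy rollout and reviewer) is hypothetical, and ranking episodes by score before harvesting is an assumption rather than the paper's stated selection rule. The fine-tuning step that would consume the harvested decisions is left out.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class MemoryStructure:
    """What loop 1 rewrites: prompt, file schemas, memory-action vocabulary."""
    prompt: str
    file_schema: tuple
    memory_actions: tuple


@dataclass(frozen=True)
class Trajectory:
    memory_decisions: tuple  # memory actions the agent took, in order
    score: float             # episode progress (hypothetical metric)


Rollout = Callable[[MemoryStructure], Trajectory]
Reviewer = Callable[[MemoryStructure, List[Trajectory]], MemoryStructure]


def refine_structure(structure: MemoryStructure, rollout: Rollout,
                     reviewer: Reviewer, rounds: int = 3,
                     episodes: int = 4) -> MemoryStructure:
    """Loop 1: a reviewer (an LLM in the paper, any callable here) reads
    full trajectories and returns a rewritten memory structure."""
    for _ in range(rounds):
        trajectories = [rollout(structure) for _ in range(episodes)]
        structure = reviewer(structure, trajectories)
    return structure


def harvest_decisions(trajectories: List[Trajectory],
                      keep_fraction: float = 0.25) -> List[str]:
    """Loop 2, data side: collect memory decisions from the best episodes.
    Selecting by episode score is an assumption made for this sketch."""
    ranked = sorted(trajectories, key=lambda t: t.score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [d for t in ranked[:n_keep] for d in t.memory_decisions]


# Toy stand-ins so the sketch runs without a model or a game environment.
def toy_rollout(structure: MemoryStructure) -> Trajectory:
    # Pretend a richer memory-action vocabulary yields a higher score.
    return Trajectory(structure.memory_actions,
                      float(len(structure.memory_actions)))


def toy_reviewer(structure: MemoryStructure,
                 trajectories: List[Trajectory]) -> MemoryStructure:
    # Pretend the reviewer adds one new memory action per round.
    new_action = f"action_{len(structure.memory_actions)}"
    return replace(structure,
                   memory_actions=structure.memory_actions + (new_action,))


start = MemoryStructure("You may read and write notes.",
                        ("notes.md",), ("write_note",))
tuned = refine_structure(start, toy_rollout, toy_reviewer, rounds=2)
dataset = harvest_decisions([toy_rollout(start), toy_rollout(tuned)],
                            keep_fraction=0.5)
```

The point of the split is that the task-action policy never appears in either function: loop 1 only touches the structure the agent works within, and loop 2 only collects memory decisions as training data.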

Where it gets interesting is the head-to-head. On the reported progression-rate table, Qwen2.5-32B + AutoMem lands at 51.36 on Crafter, 30.00 on MiniHack and 1.85 on NetHack, against Claude Opus 4.5 at 49.5, 27.5 and 2.0, and Gemini 3.1 Pro Thinking at 55.0, 27.5 and 2.6. That puts the 32B slightly ahead of Opus on Crafter and MiniHack and slightly behind on NetHack, with Gemini leading on Crafter and NetHack. Take the specifics as reported, not as settled: the error bars are wide (plus or minus 3.81 and 7.25 in places), so the match on some benchmarks is well inside noise. It is still a striking result for a 32B open-weight model.
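For a sense of scale, the gaps can be worked out directly from the figures quoted above. The short Python sketch below does only that arithmetic; the function name gaps_vs is hypothetical, and because the text does not say which error bar belongs to which benchmark, no error bars are attached to individual scores.

```python
# Progression rates exactly as quoted in the text.
AUTOMEM = "Qwen2.5-32B + AutoMem"
scores = {
    "Crafter":  {AUTOMEM: 51.36, "Claude Opus 4.5": 49.5,
                 "Gemini 3.1 Pro Thinking": 55.0},
    "MiniHack": {AUTOMEM: 30.00, "Claude Opus 4.5": 27.5,
                 "Gemini 3.1 Pro Thinking": 27.5},
    "NetHack":  {AUTOMEM: 1.85, "Claude Opus 4.5": 2.0,
                 "Gemini 3.1 Pro Thinking": 2.6},
}


def gaps_vs(baseline: str) -> dict:
    """AutoMem score minus the baseline's score on each benchmark."""
    return {bench: round(row[AUTOMEM] - row[baseline], 2)
            for bench, row in scores.items()}


vs_claude = gaps_vs("Claude Opus 4.5")
vs_gemini = gaps_vs("Gemini 3.1 Pro Thinking")
print(vs_claude)  # {'Crafter': 1.86, 'MiniHack': 2.5, 'NetHack': -0.15}
print(vs_gemini)  # {'Crafter': -3.64, 'MiniHack': 2.5, 'NetHack': -0.75}
```

The Crafter and MiniHack leads over Claude Opus 4.5 (1.86 and 2.5 points) are both smaller than either quoted error bar, which is why "match" is a fairer word than "beat".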

The honest caveat is scope. These are three long-horizon game benchmarks, not customer-support agents, coding tasks, or open-web research, and the structure-rewriting loop leans on a strong frontier LLM as reviewer, so the approach isn't free of proprietary dependencies. What the paper doesn't give you is the compute cost, the specific reviewer model, or evidence that the memory-as-skill treatment transfers beyond game-like environments.

If it does transfer, though, the picture for anyone building on open weights shifts. It suggests you can chase frontier-level long-horizon behaviour by investing in the memory layer rather than another round of task fine-tuning, which could be a much cheaper direction for small teams and open-source projects to push on.
