DuoMem distillation lifts 4B model to 77.9% on ALFWorld

TL;DR

  • DuoMem improves a 4B student model from 4.3% to 77.9% task success on the ALFWorld embodied benchmark, versus a 72B teacher at 87.1%.
  • The method pairs context-space distillation (prepending teacher-generated procedural memories to the student's input) with parameter-space distillation (LoRA fine-tuning on successful teacher trajectories).
  • The enhanced 4B model runs over 3x faster than the 72B teacher, adds fewer than 10M trainable parameters, and stores only several megabytes of pre-computed memories.

A 4-billion-parameter model going from 4.3% to 77.9% task success on a hard embodied benchmark, with fewer than 10 million added trainable parameters, is the kind of result that tends to get lost in the flagship-model news cycle. That is a shame, because the recipe is more interesting than most of the headline scores this month.

The paper, DuoMem, posted to arXiv on June 29, calls the technique dual-space distillation. There are two moves. The first is context-space: the small student model is fed procedural memories that were generated by a much larger 72B teacher, prepended to its input. The second is parameter-space: LoRA adapters are fine-tuned on successful trajectories from that same teacher. The authors report fewer than 10 million trainable parameters and only several megabytes of pre-computed teacher memories, and the resulting 4B system reaches 77.9% on ALFWorld against the 72B teacher's 87.1%, at over 3x the inference speed.
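To make the two moves concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the Memory schema, the word-overlap retrieval, the prompt layout, and the function names are all hypothetical, since the summary above does not say how memories are stored or selected. The second function is just the standard LoRA parameter count, rank * (d_in + d_out) per adapted weight matrix.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    """One teacher-written procedural note (hypothetical schema)."""
    task_hint: str  # short description of the task type the note covers
    steps: str      # the procedure the teacher distilled for that task type


def score(memory: Memory, task: str) -> int:
    """Toy retrieval: count words shared by the task and the memory's hint."""
    hint_words = set(memory.task_hint.lower().split())
    task_words = set(task.lower().split())
    return len(hint_words & task_words)


def build_prompt(task: str, bank: list[Memory], k: int = 2) -> str:
    """Context-space move: prepend up to k matching memories to the input."""
    ranked = sorted(bank, key=lambda m: score(m, task), reverse=True)
    chosen = [m for m in ranked[:k] if score(m, task) > 0]
    if not chosen:
        return f"Task: {task}"
    notes = "\n".join(f"- {m.steps}" for m in chosen)
    return f"Procedural memories:\n{notes}\n\nTask: {task}"


def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """Parameter-space move: LoRA adds a (d_in x rank) and a (rank x d_out)
    factor per adapted matrix, so rank * (d_in + d_out) trainable weights each."""
    return rank * (d_in + d_out) * n_matrices


if __name__ == "__main__":
    bank = [
        Memory("heat object microwave",
               "Find the object, take it, go to the microwave, heat it."),
        Memory("clean object sink",
               "Take the object to the sinkbasin and clean it."),
    ]
    print(build_prompt("heat the egg in the microwave", bank, k=1))
    # Hypothetical shapes, not taken from the paper: 72 square 2560x2560
    # matrices adapted at rank 8.
    print(lora_trainable_params(2560, 2560, 8, 72))
```

With those made-up shapes the adapter count comes to about 2.9 million, which shows how a LoRA setup can plausibly stay under the reported 10 million; the paper's actual rank and target modules are not given here.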

Why this matters if you are not writing agent papers: capable agents have mostly required either a big hosted model or a lot of task-specific fine-tuning. If you can transplant most of a 72B agent's task success into a 4B model with LoRA plus a memory bank, the deployment story for on-device assistants and edge robots gets meaningfully cheaper, and less dependent on a round-trip to a cloud endpoint.
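For a rough sense of what "most of" means, the three reported success rates can be turned into two simple ratios. The function names are mine, and the figures are the ones quoted above; nothing else is assumed.

```python
# Reported ALFWorld task success rates, in percent.
STUDENT_BASE = 4.3        # 4B student before distillation
STUDENT_DISTILLED = 77.9  # 4B student after DuoMem
TEACHER = 87.1            # 72B teacher


def fraction_of_teacher(student: float, teacher: float) -> float:
    """Student success expressed as a share of the teacher's success."""
    return student / teacher


def gap_closed(base: float, distilled: float, teacher: float) -> float:
    """Share of the base-to-teacher gap recovered by distillation."""
    return (distilled - base) / (teacher - base)


if __name__ == "__main__":
    print(round(fraction_of_teacher(STUDENT_DISTILLED, TEACHER), 3))
    print(round(gap_closed(STUDENT_BASE, STUDENT_DISTILLED, TEACHER), 3))
```

Both ratios land at roughly 89%: the distilled student reaches about 89% of the teacher's success rate and closes about 89% of the gap between the untuned student and the teacher.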

The honest caveat is that ALFWorld is a specific, structured embodied benchmark. The paper as reported does not show that dual-space distillation carries over to open-ended web agents, real robotics, or long-horizon tool use, and the approach still assumes you have access to a strong teacher whose trajectories and memories you can harvest. Take the specifics as reported, not settled.

What the abstract does not give you is how much of the gain comes from memory prepending versus the LoRA fine-tune, or what the memory bank costs to keep fresh as tasks drift. If the reported numbers hold up, the biggest beneficiaries are teams shipping agent behavior onto phones, appliances, and edge robots, which is exactly the class of hardware that cannot afford a 72B call per step.