Allen AI ships Tmax, open RL recipe for terminal agents
TL;DR
- Allen AI released Tmax, an open recipe for terminal agents at 2B, 9B, and 27B parameters, with code, data, and checkpoints public.
- Tmax-9B hits 27.2% on Terminal-Bench 2.0, which the team calls the strongest open-weights result below 10B parameters under official settings.
- The release bundles TMax-15k, a 14,600-environment RL dataset described as over 2.5 times larger than any prior open terminal-agent dataset.
A new release from Allen AI lands squarely in one of the more practical gaps in open AI research right now: how do you actually train a small open model to drive a terminal? Tmax, published on Hugging Face, is the full recipe (dataset, training code, rollouts, and checkpoints) for doing exactly that, with model sizes from 2B up to 27B parameters.
The headline result, according to Nathan Lambert's writeup of the project, is that the 9B model reaches 27.2% on Terminal-Bench 2.0, which the team calls the strongest result among open-weights models under 10B parameters under official settings. The 27B variant climbs to about 42.7%. Those scores come from reinforcement learning on top of Qwen 3.5 9B and Qwen 3.6 27B base models, with the 9B improving by roughly six points over its base.
The more interesting part sits underneath the leaderboard line. The release bundles TMax-15k, a dataset of 14,600 RL training environments built from a compositional pipeline that samples across nine structured axes including domain, skills, personas, and complexity. The reporting describes it as over 2.5 times larger than any previously released comparable terminal-agent dataset. The training side is a divergence-penalized variant of policy optimization the team labels DPPO, which the authors say buys over five Terminal-Bench points compared with vanilla GRPO.
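To make the idea of compositional sampling concrete, here is a minimal Python sketch of drawing environment specs one axis at a time. It is an illustration under stated assumptions, not the Tmax pipeline: the reporting names only four of the nine axes and none of their values, so every value in `AXES`, along with the `sample_spec` and `sample_specs` helpers, is hypothetical.

```python
import math
import random

# Hypothetical axis values. The article names four of the nine axes
# (domain, skills, personas, complexity) but none of their values.
AXES = {
    "domain": ["data-processing", "sysadmin", "build-tooling"],
    "skills": ["grep", "git", "shell-scripting"],
    "persona": ["junior developer", "site reliability engineer"],
    "complexity": ["easy", "medium", "hard"],
}


def sample_spec(rng):
    """Draw one value per axis to form a single environment spec."""
    return {axis: rng.choice(values) for axis, values in AXES.items()}


def sample_specs(n, seed=0):
    """Sample n distinct specs, rejecting duplicate combinations."""
    max_combos = math.prod(len(values) for values in AXES.values())
    if n > max_combos:
        raise ValueError(f"only {max_combos} distinct combinations exist")
    rng = random.Random(seed)  # seeded so a run is reproducible
    seen = set()
    specs = []
    while len(specs) < n:
        spec = sample_spec(rng)
        key = tuple(spec[axis] for axis in AXES)
        if key not in seen:
            seen.add(key)
            specs.append(spec)
    return specs


if __name__ == "__main__":
    for spec in sample_specs(3, seed=1):
        print(spec)
```

The point of the sketch is the combinatorics: even these four toy axes yield 54 distinct combinations, which suggests how nine axes with richer value sets could plausibly span thousands of environments. How the real pipeline turns a sampled spec into a runnable task is not covered in the reporting.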
Why this matters if you are not in the agent training business: the strong terminal agents practitioners actually use today are mostly wrappers around closed frontier models. A fully open recipe that works at the 9B scale changes the cost picture for anyone wanting to run coding or shell automation on hardware they control, and it gives every other open-model team a concrete loop to layer on their own bases.
The honest caveat is that a Terminal-Bench score is not the same as production reliability on a developer's actual machine. Lambert's own post also notes that nailing down the recipe took roughly a hundred training jobs, each run on 8 H100 nodes, at about a thousand dollars per RL step. What the reporting does not give you is a head-to-head against closed agents from Anthropic or OpenAI on the same benchmark, or evidence that the loop transfers to browser and IDE coding tasks. The forward bet is that other open-weights teams pick this up quickly and the gap between open and closed coding agents narrows from the small-model end first.
Originally reported by huggingface.co
Original headline: Tmax - a allenai Collection