Allen AI's Tmax Trains Open Terminal Agent to 27% Accuracy
TL;DR
- TMax-9B scores 27.2% on Terminal-Bench 2.0, the strongest result among open-weights models under 10 billion parameters under official settings.
- The TMax-15k dataset contains 14,600 RL environments, over 2.5 times larger than any previously released comparable terminal-agent dataset.
- Supervised fine-tuning before RL degraded performance on the stronger Qwen 3.5 base model, even though it helped the older Qwen 3-8B model.
Terminal agents -- software that navigates and operates within command-line environments autonomously -- have mostly lived behind proprietary APIs, with open alternatives trailing significantly. A collaboration between the University of Washington and the Allen Institute for AI, described on Hugging Face and published June 22, introduces Tmax: a simplified reinforcement learning recipe that trains a 9B-parameter model to 27.2% on Terminal-Bench 2.0, which the authors describe as the strongest result among open-weights models under 10B parameters under official settings. The larger TMax-27B variant reaches 42.7%.
The recipe rests on two things: a cleaner training loop and a substantially larger dataset. The team built TMax-15k, containing 14,600 RL environments generated through a compositional pipeline that samples across nine structured axes, including domain, skills, personas, and complexity levels. That dataset is over 2.5 times larger than comparable open alternatives, reportedly harder, and more evenly spread across domains, with a domain balance score of 0.998 against 0.481 to 0.646 for prior work. Training uses GRPO (Group Relative Policy Optimization) plus stability fixes -- specifically a divergence-penalized variant rather than vanilla GRPO -- which the authors say produces over five points of improvement on Terminal-Bench.
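To make the compositional idea concrete, here is a minimal Python sketch of sampling environment specs across axes and then checking how evenly the domains come out. It is an illustration under stated assumptions, not the authors' pipeline: the axis values are invented, only the four axes the article names are included (the other five are not listed), and the article does not define its domain balance score, so normalized Shannon entropy is used here purely as one plausible stand-in.

```python
import math
import random
from collections import Counter

# Hypothetical axis values for illustration only. The article names four of
# the nine axes (domain, skills, personas, complexity) and none of their values.
AXES = {
    "domain": ["filesystem", "networking", "data-processing", "build-systems"],
    "skill": ["text-search", "scripting", "debugging"],
    "persona": ["sysadmin", "data-analyst", "student"],
    "complexity": ["easy", "medium", "hard"],
}


def sample_spec(rng):
    """Draw one environment spec by picking a value independently per axis."""
    return {axis: rng.choice(values) for axis, values in AXES.items()}


def balance_score(labels):
    """Normalized Shannon entropy of the label counts; 1.0 is perfectly even.

    Assumed stand-in metric, not necessarily the one behind the reported 0.998.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0  # a single category carries no spread to measure
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
specs = [sample_spec(rng) for _ in range(2000)]
print(round(balance_score([s["domain"] for s in specs]), 3))
```

Under this stand-in metric, uniform sampling per axis pushes the score close to 1.0, while a dataset dominated by one domain scores much lower. That is the kind of gap the reported 0.998 versus 0.481 to 0.646 comparison suggests, though the actual metric may differ.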
One of the sharper sub-findings is what the team calls the SFT trap: supervised fine-tuning before RL actually degraded Qwen 3.5's performance, even though it benefited the older Qwen 3-8B. That asymmetry is a real practical caution for anyone trying to copy this recipe onto a different base model without testing first.
The gains also transferred beyond terminal tasks. TMax-9B improved from 44.0% to 53.5% on SWE-Bench Verified (a 9.5-point gain), and AIME scores rose from 73.3% to 91.1% (17.8 points), suggesting the RL recipe sharpens general reasoning rather than only terminal-specific behavior.
All three checkpoints (2B, 9B, 27B), the full training code, and the TMax-15k dataset are open-sourced at the project repository. The honest caveat is that even 42.7% on Terminal-Bench means more than half of tasks remain unsolved, and benchmark conditions rarely match real enterprise shell environments. But for teams building coding automation or DevOps agents without the budget for proprietary APIs, the combination of open weights, a well-documented dataset, and a reproducible training recipe gives something concrete to build on.
Originally reported by huggingface.co
Original headline: Allen AI Releases Tmax: Simplified RL Recipe Trains 9B Terminal Agent to 27% on Terminal-Bench 2.0, Largest Open Terminal-Agent Dataset Released