AutoTrainess agent beats CLI-only baseline on unattended LLM post-training
TL;DR
- AutoTrainess wraps post-training steps in agent-computer interfaces, lifting GPT-5.4 Codex from 23.21 to 26.94 average on PostTrainBench.
- The benchmark caps each agent at 10 hours on a single H100 GPU to post-train a base model on a downstream task.
- DeepSeek-V4-Flash under OpenCode rose from 12.13 to 19.58 with the same structured interfaces, suggesting the workflow generalises across driver models.
The gap between 'an agent can run one bash command' and 'an agent can supervise a full post-training run of a language model, unattended, for hours' is where the interesting problem lives right now. A new arXiv paper introduces AutoTrainess, an agent designed to close that gap by structuring post-training as a repository of agent-computer interfaces rather than dropping the model into a raw shell.
The setup is deliberately austere. The authors evaluate on PostTrainBench, a companion benchmark that gives an agent 10 hours on one H100 GPU to lift a base model on a downstream task. On that harness, GPT-5.4 driving Codex scores 26.94 on average when it uses AutoTrainess's interfaces for planning, data preparation, training, evaluation and logging, versus 23.21 when the same agent works from a CLI-only baseline. The lift also shows up with a second driver model: DeepSeek-V4-Flash inside OpenCode goes from 12.13 to 19.58 under the same setup, which suggests the workflow is not tied to one driver.
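To put those scores on a common footing, the short Python sketch below works out the absolute and relative lift for each pairing. The scores are the ones quoted above; the function name `relative_lift` and the dictionary layout are illustrative choices, not anything taken from the paper.

```python
def relative_lift(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# PostTrainBench averages as quoted in the article:
# (CLI-only baseline, with AutoTrainess interfaces)
results = {
    "GPT-5.4 / Codex": (23.21, 26.94),
    "DeepSeek-V4-Flash / OpenCode": (12.13, 19.58),
}

for name, (cli_only, structured) in results.items():
    points = structured - cli_only
    percent = relative_lift(cli_only, structured)
    print(f"{name}: +{points:.2f} points, {percent:.1f}% relative")

# GPT-5.4 / Codex: +3.73 points, 16.1% relative
# DeepSeek-V4-Flash / OpenCode: +7.45 points, 61.4% relative
```

The first figure is consistent with the roughly 16% quoted in the original headline. The second shows that the weaker driver gains far more in relative terms, although it starts from a much lower baseline.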
The pitch, in the authors' own framing, is that AutoTrainess 'externalizes prior human experience as explicit workflows, rules, and execution constraints.' In practice that means the agent isn't rediscovering mid-run that it needs to checkpoint, or that its data loader is misconfigured. Those decisions are pre-shaped by the tool surface it is given. For anyone building fine-tunes on a budget, this is closer to how a human ML engineer actually works than the 'give a model a terminal and hope' pattern.
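As a rough illustration of what 'pre-shaped by the tool surface' could mean, here is a minimal Python sketch of a structured training interface that rejects a job before launch if it breaks a few encoded rules. Everything in it is hypothetical: `TrainJobSpec`, `check_spec`, `submit` and the specific rules are assumptions made for this example and are not AutoTrainess's actual interface. Only the 10-hour budget comes from the benchmark description.

```python
from dataclasses import dataclass

# The 10-hour cap comes from the benchmark description; all else is assumed.
TIME_BUDGET_HOURS = 10.0


@dataclass(frozen=True)
class TrainJobSpec:
    """Hypothetical job description an agent must fill in before training."""
    dataset_rows: int            # rows produced by the data preparation step
    est_hours: float             # agent's own estimate of training time
    checkpoint_every_steps: int  # 0 or less means checkpointing is off
    eval_split: str              # name of the held-out split to evaluate on


def check_spec(spec: TrainJobSpec, hours_used: float = 0.0) -> list[str]:
    """Return a list of rule violations; an empty list means the job may run."""
    problems = []
    if spec.dataset_rows <= 0:
        problems.append("dataset is empty; run data preparation first")
    if spec.checkpoint_every_steps <= 0:
        problems.append("checkpointing is disabled")
    if hours_used + spec.est_hours > TIME_BUDGET_HOURS:
        problems.append("estimated run exceeds the remaining time budget")
    if not spec.eval_split:
        problems.append("no evaluation split named")
    return problems


def submit(spec: TrainJobSpec, hours_used: float = 0.0) -> str:
    """Gate the launch on the rules instead of letting the agent find out mid-run."""
    problems = check_spec(spec, hours_used)
    if problems:
        return "REJECTED: " + "; ".join(problems)
    return "ACCEPTED"


ok = TrainJobSpec(dataset_rows=5000, est_hours=4.0,
                  checkpoint_every_steps=200, eval_split="validation")
risky = TrainJobSpec(dataset_rows=5000, est_hours=4.0,
                     checkpoint_every_steps=0, eval_split="validation")

print(submit(ok, hours_used=3.0))     # ACCEPTED
print(submit(risky, hours_used=7.0))  # REJECTED: checkpointing is disabled; estimated run exceeds the remaining time budget
```

The design point is that the rules live in the interface, not in the agent's prompt or memory: a raw shell would happily start the second job and let it fail hours later, whereas a gated call returns the reason immediately.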
The honest caveat is that the reported wins are still a long way from a human-engineered pipeline. PostTrainBench's own numbers show the best agent averaging 23.2% versus 51.1% for instruction-tuned baselines across its tasks, even if agents can occasionally beat human engineering on narrow ones like function calling. What the reporting doesn't give you is a per-task breakdown of where the AutoTrainess interfaces helped and where they hurt, or how the workflow behaves beyond the 10-hour, single-GPU envelope.
Still, the direction is the useful part. If a cheap unattended fine-tune on one GPU becomes reliably better than a raw-CLI agent, the population of people who can produce a serviceable domain-specific model quietly widens, and the interesting question shifts from 'can an agent do this at all' to 'how much of the ML engineer's playbook is worth externalising into the tool surface.'
Original headline: AutoTrainess Runs Full LLM Post-Training Loop Autonomously on One H100, Beats CLI Baseline by 16%