Dockerless matches Docker-based coding agent post-training
TL;DR
- Dockerless resolves 62.0% of SWE-bench Verified, 50.0% of SWE-bench Multilingual, and 35.2% of SWE-bench Pro without executing patches in per-repo Docker containers.
- On a verifier-evaluation benchmark, Dockerless outperforms the strongest open-source verifier by 14.3 AUC points using agentic repository exploration.
- Used as both the SFT trajectory filter and the RL reward, Dockerless lifts a Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points on the three benchmarks, respectively.
The unglamorous heart of coding-agent training is a Docker registry the size of a small planet, one image per repository, each with its own quirks about Python versions and system libraries. A paper on arXiv, Dockerless: Environment-Free Program Verifier for Coding Agents, argues you can throw all of that out.
The authors' verifier decides whether a candidate patch is correct by having an agent explore the repository for evidence, rather than by running the test suite inside a container. On a verifier-evaluation benchmark they report a 14.3 AUC-point lead over the strongest open-source baseline. Plugged into a full training pipeline as both the SFT trajectory filter and the RL reward, the resulting model resolves 62.0% of SWE-bench Verified, 50.0% of Multilingual, and 35.2% of Pro. Those are improvements of 2.4, 8.7, and 2.9 points over their Qwen3.5-9B baseline.
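To make the moving parts concrete, here is a minimal sketch of the three places a verifier score shows up in that paragraph: the AUC used to evaluate it, the SFT filter, and the RL reward. It is not the paper's implementation. The `Verifier` signature, the `Trajectory` type, the 0.5 threshold, and `toy_verifier` are all hypothetical stand-ins, and the sketch assumes the verifier returns a score in [0, 1] where higher means "more likely correct".

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical interface: (repo_path, patch_text) -> score in [0, 1].
Verifier = Callable[[str, str], float]


@dataclass
class Trajectory:
    """Illustrative container for one agent rollout and its final patch."""
    repo_path: str
    patch: str


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a correct patch outscores an incorrect one.

    labels use 1 for a correct patch and 0 for an incorrect one; ties count
    as half a win. The result is on a 0-1 scale.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect patch")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def filter_for_sft(
    trajectories: List[Trajectory],
    verifier: Verifier,
    threshold: float = 0.5,  # assumed cutoff, not a value from the paper
) -> List[Trajectory]:
    """Keep only trajectories whose final patch the verifier accepts."""
    return [
        t for t in trajectories
        if verifier(t.repo_path, t.patch) >= threshold
    ]


def rl_reward(trajectory: Trajectory, verifier: Verifier) -> float:
    """Use the verifier score directly as the scalar reward for a rollout."""
    return verifier(trajectory.repo_path, trajectory.patch)


def toy_verifier(repo_path: str, patch: str) -> float:
    """Stand-in only: a real verifier would explore the repository."""
    return 0.9 if "fix" in patch else 0.2


rollouts = [
    Trajectory("repo_a", "fix off-by-one in parser"),
    Trajectory("repo_b", "rename variable"),
]
kept = filter_for_sft(rollouts, toy_verifier)
print([t.repo_path for t in kept])
print(rl_reward(rollouts[1], toy_verifier))
print(pairwise_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The point of the sketch is that nothing in the filter or the reward needs a container. The only expensive call is the verifier itself, which in the paper's setup is an agent exploring the repository.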
The paper claims these numbers match environment-based post-training rather than beat it, and that is the honest framing. The interesting result here is not 'a verifier without Docker is better' but 'a verifier without Docker is not worse', which is a different and far more useful claim if you are the one paying to keep the Docker images alive.
Take the specifics as reported, not settled. What the abstract doesn't give you is the per-patch cost of the exploration agent, which matters a lot inside an RL loop that calls it thousands of times. Nor does it address whether an agent that judges by evidence can catch patches that look plausible on inspection but silently break behaviour a real test would flag. Both are the sort of question the full paper would need to answer before anyone reworks a pipeline around this.
For a lab that can't afford a fleet of per-repo container images just to train a coding model, though, this is the direction worth watching. The largest reported gain lands on SWE-bench Multilingual, where non-Python container setup tends to be the ugliest part of the job.
Originally reported by paper
Original headline: Dockerless: No-Container Coding Agent Verifier Hits 62% on SWE-bench Verified, Beats Best Open-Source Verifier by 14.3 AUC