Arbor AI Optimization Framework Outperforms Claude Code and Codex by 2.5× on Same Compute Budget via Hypothesis-Tree Architecture
Summary
Researchers from Renmin University of China and Microsoft Research have published Arbor, an autonomous agent optimization framework that runs each improvement hypothesis as an isolated git-worktree experiment, so successful changes merge cleanly and failed ones are pruned without contaminating other results. In benchmark comparisons published June 19, Arbor achieved 2.5× the average performance gain of Claude Code and Codex on the same compute budget. On held-out BrowseComp, it raised accuracy from a 45.3% baseline to 67.7%, while the competing systems stalled at 50–53%. The approach generalizes across model training, harness engineering, and data synthesis tasks, and works with multiple LLM backends, including Claude Opus 4.6 and GPT-5.5.
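To make the worktree-per-hypothesis idea concrete, here is a minimal sketch of how one experiment's lifecycle could map onto standard git commands. It is not Arbor's implementation: the `Hypothesis` class, the `start_commands` and `finish_commands` helpers, the `hyp/` branch prefix, the `../worktrees/` path, and the "score must beat the parent" rule are all assumptions made for illustration. The functions only build command lists, so the logic can be inspected without touching a repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    """One experiment node (hypothetical structure, not taken from Arbor)."""
    name: str
    parent_branch: str
    score: Optional[float] = None  # filled in after the experiment is evaluated

    @property
    def branch(self) -> str:
        # Assumed naming scheme: one branch per hypothesis.
        return f"hyp/{self.name}"

    @property
    def worktree(self) -> str:
        # Assumed location: a sibling directory holding one checkout per hypothesis.
        return f"../worktrees/{self.name}"


def start_commands(h: Hypothesis) -> List[List[str]]:
    """Commands that give the hypothesis its own isolated checkout."""
    # git worktree add -b <new-branch> <path> <start-point>
    return [["git", "worktree", "add", "-b", h.branch, h.worktree, h.parent_branch]]


def finish_commands(h: Hypothesis, parent_score: float) -> List[List[str]]:
    """Commands that merge a winning hypothesis or prune a losing one.

    Assumes they are run from the checkout where the parent branch is
    currently checked out.
    """
    if h.score is None:
        raise ValueError("hypothesis has not been evaluated yet")
    remove_worktree = ["git", "worktree", "remove", "--force", h.worktree]
    if h.score > parent_score:
        # Success: fold the change into the parent, then drop the checkout.
        return [["git", "merge", "--no-ff", h.branch], remove_worktree]
    # Failure: drop the checkout first, then delete the unmerged branch.
    return [remove_worktree, ["git", "branch", "-D", h.branch]]
```

In a real repository, each returned command could be executed with `subprocess.run(cmd, check=True)`. Because every hypothesis lives on its own branch and directory, discarding a failed one never touches the files of any other experiment.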
Originally reported by venturebeat.com