Sakana's Fugu orchestrators top frontier LLMs on SWE-Bench Pro
TL;DR
- Fugu-Ultra reportedly scores 73.7% on SWE-Bench Pro and 82.1% on Terminal Bench 2.1, beating Claude-Opus-4.8, Gemini-3.1-Pro, and GPT-5.5 baselines.
- Fugu is itself a language model that learns to orchestrate Gemini-3.1-Pro, Claude-Opus-4.8, and GPT-5.5 as expert workers per query.
- Training combines supervised fine-tuning with sep-CMA-ES evolutionary optimization for Fugu, and GRPO reinforcement learning for Fugu-Ultra.
Sakana AI's new technical report makes a claim that cuts against the usual frontier-model framing. Rather than train a bigger standalone model, the lab built what it calls a "family of learned orchestrators that expose a multi-agent system through a single model interface." According to the arXiv report, the coordinator learns to construct query-adaptive workflows over a pool of expert LLM workers, and it names Gemini-3.1-Pro, Claude-Opus-4.8, and GPT-5.5 as the frontier models it routes across.
The headline numbers are why the report is worth a look. The heavier variant, Fugu-Ultra, reportedly hits 73.7% on SWE-Bench Pro against 69.2% for Claude-Opus-4.8 (a 4.5-point margin), and 82.1% on Terminal Bench 2.1 against 74.6% for the same baseline (a 7.5-point margin). Both Fugu and Fugu-Ultra reach 95.5% on GPQA-Diamond, and on Humanity's Last Exam they reach 50.0% and 47.2% respectively, ahead of the three frontier baselines as reported. The lighter Fugu trails Ultra on the hardest tasks but is pitched as the latency-friendly variant for everyday use.
The training recipe comes in two halves. Fugu is built with large-scale supervised fine-tuning on single-step tasks, then optimized end-to-end on interactive workflows with an evolution strategy, specifically sep-CMA-ES (the separable variant of CMA-ES). Fugu-Ultra is trained with GRPO (Group Relative Policy Optimization), a reinforcement-learning method, to output natural-language agentic workflows that designate subtasks and assign them to agents. The orchestrator is itself a language model; the bet is that the value lives in the routing rather than in any single backbone.
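To make the GRPO step more concrete, here is a minimal sketch of the group-relative advantage at the core of the method as it is commonly formulated: several candidate outputs are sampled for the same query, each is scored, and each score is normalized against the mean and standard deviation of its own group. This is not taken from the Fugu report. The function name, the `eps` constant, and the example rewards are illustrative assumptions, and the report's actual reward design is not described here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar score per candidate output sampled for the
    same query (hypothetical values; e.g. 1.0 if a workflow solved the task).
    `eps` is an assumed small constant that avoids division by zero when
    every candidate in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled workflows for one query:
# two succeeded (reward 1.0) and two failed (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Candidates that beat their group average get a positive advantage and are reinforced, while those below average are pushed down. If every candidate scores the same, all advantages are zero and that query contributes no learning signal.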
Take the specifics as reported, not settled. These are the authors' own benchmark runs against named frontier baselines, and the report does not give you per-query API cost, latency, or how the system behaves when one of the underlying expert models is upgraded or pulled. The interesting thing to watch is whether "learned orchestrator" becomes its own product category, or whether the frontier labs simply absorb the idea into their next round of native agentic releases.
Originally reported by arxiv.org
Original headline: Sakana Fugu Technical Report