DART trims hybrid model thinking budgets without training
TL;DR
- DART samples two cheap no-think drafts and accepts the direct answer when both agree, skipping the extended thinking step entirely.
- On Olympiad-level math, the authors report up to +9.0 accuracy points with 15-69% fewer thinking tokens; on code, up to +22.5 with 51-63% reductions.
- The method is training-free and the paper reports it extends across 0.6B-32B parameter models, multiple model families, and API-only hosted settings.
Hybrid reasoning models give you a knob: answer immediately, or burn extra tokens thinking before answering. Picking the right setting per query is awkward because you usually do not know in advance which problems actually need the thinking pass. A new arxiv preprint, DART, proposes a small, training-free trick for exactly that routing decision.
The idea, as the paper describes it, is to sample two cheap no-think drafts, accept the direct answer when both drafts agree, and predict a thinking budget from the draft entropy when they disagree. No labeled training data, no gradient updates, no extra model on the side.
The reported numbers are the headline. On math reasoning, the authors report accuracy improving by up to +9.0 points on Olympiad-level problems while thinking tokens drop 15-69%. On code reasoning under execution-based equivalence, they report up to +22.5 points of accuracy with thinking tokens dropping 51-63%. They also report that the signal extends across model scales from 0.6B to 32B parameters, across model families, and into API-only hosted settings where you cannot touch the weights at all.
For anyone paying per-token on a reasoning model, the practical implication is straightforward. If this generalizes, an inexpensive router can decide when a problem actually deserves the expensive thinking pass, and reportedly keep accuracy intact or improve it. Thinking tokens are where the bill on hybrid models grows fastest, so a method that trims them without retraining is the kind of thing platform teams notice quickly.
The honest caveats. The reported gains concentrate on math and code, the two domains where automatic grading is cleanest, and the preprint as summarized does not tell you how DART behaves on the messier work practitioners deploy, like multi-step agent flows or open-ended drafting. The two-draft sampling step also has its own latency cost, and the available summary does not give you a wall-clock breakdown of where the token savings end up versus where they get eaten back. Take the specifics as reported, not as settled production results.
The forward-looking part is who can move on this fastest. Because the method is training-free and works in API-only settings, the immediate beneficiaries are inference platforms and routing layers sitting above hosted reasoning APIs. They can wrap the same trick around any provider's hybrid model without coordination from the model owner, which is how methods like this usually slide into production stacks well before they become widely cited.
Shared on Bluesky by 2 AI experts
Originally reported by arxiv.org
Read the original article →Original headline: DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models