TACO gives each tool call credit without an external judge
TL;DR
- TACO scores each tool call by comparing the agent's answer before and after the call, using only the training pipeline's existing rule-based answer checker.
- Built on Qwen2.5-VL-7B, it reports a 68.1 macro-average across twelve perception, reasoning and general benchmarks, ahead of PyVision at 63.7.
- On perception tasks the paper reports 89.6 on V* and 81.6 on HR-Bench-8K; GPT-4o's reported average across the full twelve-benchmark suite is 58.5.
A new paper from a team led by Mingkuan Feng, posted to Hugging Face's daily papers feed, takes on a small but stubborn problem in training multimodal agents that "think with images". These are the systems that crop, zoom or run code on a picture and then reason over the result. A tool call in that setting can be useful, redundant, or actively misleading, and an outcome-only reward, which is what most reinforcement learning recipes use, cannot tell the three apart.
The proposed fix, Tool-Augmented Credit Optimization or TACO, is a variant of GRPO with two ingredients. The first is a Differential Answer-Probe Reward: before the model runs its code, the system prefills the answer header and asks it to commit to an answer; after the tool call returns and is reasoned over, it asks again. The difference between the two scores from the same rule-based checker is the credit. A useful crop turns a wrong pre-tool answer into a right post-tool one and earns positive credit, a misleading crop does the opposite and earns negative credit, and a redundant one gets zero. The second piece, Outcome-Gated Advantage Routing, sends the final-answer credit only to the tokens responsible for the outcome, so a correct chain of pre-tool reasoning is not penalised when a later tool call spoils it.
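The before-and-after probe described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names `exact_match` and `tool_call_credit` are hypothetical, and the checker is assumed to be a binary exact-match grader that returns 1 for a correct answer and 0 otherwise, so the difference falls in {-1, 0, +1}.

```python
from typing import Callable


def exact_match(prediction: str, reference: str) -> int:
    """Hypothetical rule-based checker (an assumption, not the paper's grader).

    Returns 1 when the normalized prediction equals the reference, else 0.
    """
    return int(prediction.strip().lower() == reference.strip().lower())


def tool_call_credit(
    answer_before: str,
    answer_after: str,
    reference: str,
    checker: Callable[[str, str], int] = exact_match,
) -> int:
    """Credit for one tool call: post-tool score minus pre-tool score.

    Both probe answers are graded by the same checker, so anything that
    raises both scores equally cancels out in the difference.
    """
    return checker(answer_after, reference) - checker(answer_before, reference)


# Three illustrative cases with the reference answer "dog".
useful = tool_call_credit("cat", "dog", "dog")      # wrong -> right
misleading = tool_call_credit("dog", "cat", "dog")  # right -> wrong
redundant = tool_call_credit("dog", "dog", "dog")   # right -> right

print(useful, misleading, redundant)
```

Under these assumptions a useful call earns +1, a misleading call earns -1, and a redundant call earns 0. A call that leaves a wrong answer wrong also earns 0, since the score does not change.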
The numbers are interesting because the model doing the work is small. Built on Qwen2.5-VL-7B with two epochs of supervised fine-tuning and one epoch of GRPO, TACO reports a 68.1 macro-average across twelve perception, reasoning and general multimodal benchmarks. That clears the best comparable code-tool agent the authors test, PyVision at 63.7, by 4.4 points, and lands ahead of GPT-4o's reported 58.5 on the same suite. On perception the paper highlights 89.6 on V* and 81.6 on HR-Bench-8K.
What is genuinely new is not the score but the cost structure. The closest competitor on the process-reward side, CodeV, uses GPT-4o as an external judge to score tool use during training. TACO uses two short greedy decodes from the same agent and the same answer checker, with no extra API calls. The authors also call out a failure mode they label probe-hacking, where a model writes its conclusion early into its reasoning to inflate the probe, and argue that taking a before-after difference cancels the inflation because both probes read the same pre-tool text.
The honest caveat is that the entire evaluation lives in visual question answering with verifiable, mostly exact-match graders that return values in the set {-1, 0, +1}. That works for math benchmarks and structured perception tasks but does not obviously transfer to open-ended generation, and all reported results rest on a single base model. The paper's page also does not show a latency table directly; it only refers to one. If the approach holds outside this setup, the appealing part is that it gives smaller teams a way to train tool-using visual agents without paying for a judge model on every step.
Originally reported by huggingface.co
Original headline: TACO: Tool-Augmented Credit Optimization Introduces Judge-Free 'Differential Answer-Probe' Reward for Multimodal Code-Tool Agents