arxiv.org web signal July 1st 2026

Dynamo lets frozen VLMs grow tools, closes 65-99% of RL gap

TL;DR

Dynamo is a training-free framework that adapts a frozen vision-language model by evolving reusable reasoning skills and executable visual tools from a small labeled set.
Across four visual reasoning benchmarks and five VLM backbones (20 model-benchmark settings), the paper reports an average accuracy gain of +5.6.
Against task-specific RL methods VTool-R1 and DeepEyes, the authors claim Dynamo closes 65-99% of the gap at a fraction of the compute, and combines additively with RL.

A quiet claim in a new paper on arXiv caught my eye, because it points at where practical progress on multimodal agents may actually come from. Instead of retraining a vision-language model to be better at visual reasoning, the authors of Dynamo let the model watch its own correct and incorrect attempts on a small labeled set and grow a persistent library of reusable reasoning skills and executable visual tools. The underlying model is never updated.
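To make the "library grows, weights stay put" idea concrete, here is a toy sketch. It is not the paper's algorithm. The text-only questions, the fixed candidate pool, the greedy keep-if-accuracy-improves rule, and every name in it (frozen_model, evolve, count_items, max_value) are assumptions made up for illustration.

```python
from typing import Callable, Optional

Tool = Callable[[str], Optional[str]]

def count_items(q: str) -> Optional[str]:
    # Hypothetical tool: answers questions shaped like "count: a,b,c".
    if q.startswith("count:"):
        return str(len(q.split(":", 1)[1].split(",")))
    return None

def max_value(q: str) -> Optional[str]:
    # Hypothetical tool: answers questions shaped like "max: 3,9,2".
    if q.startswith("max:"):
        return str(max(int(x) for x in q.split(":", 1)[1].split(",")))
    return None

def frozen_model(q: str, library: dict[str, Tool]) -> str:
    # Stand-in for the frozen VLM: this function never changes.
    # Only the library it is allowed to call grows.
    for tool in library.values():
        out = tool(q)
        if out is not None:
            return out
    return "unknown"

def accuracy(labeled: list[tuple[str, str]], library: dict[str, Tool]) -> float:
    hits = sum(frozen_model(q, library) == gold for q, gold in labeled)
    return hits / len(labeled)

def evolve(labeled: list[tuple[str, str]],
           candidates: dict[str, Tool]) -> dict[str, Tool]:
    # Assumed rule: keep a candidate only if it raises accuracy
    # on the small labeled set.
    library: dict[str, Tool] = {}
    for name, tool in candidates.items():
        trial = {**library, name: tool}
        if accuracy(labeled, trial) > accuracy(labeled, library):
            library = trial
    return library

labeled = [("count: a,b,c", "3"), ("max: 3,9,2", "9"), ("count: x", "1")]
candidates = {
    "count_items": count_items,
    "max_value": max_value,
    "noop": lambda q: None,  # never helps, so it should be rejected
}
library = evolve(labeled, candidates)
print(sorted(library), accuracy(labeled, library))
```

In this toy the candidate tools are handed in up front. In the setting the paper describes, the tools and skills are evolved from the model's own attempts, which is the hard part and the part this sketch does not attempt.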

The reported numbers are what make the framing interesting. Across four visual reasoning benchmarks and five different VLM backbones, so twenty model-benchmark settings in total, the authors report an average accuracy gain of +5.6. The more striking comparison is against task-specific reinforcement learning methods like VTool-R1 and DeepEyes, where they say Dynamo closes 65 to 99 percent of that RL gap at what they describe as a fraction of the compute, and combines additively with RL when it is available.
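One plausible way to read a "share of the RL gap closed" figure is as the fraction of the distance between the base model and the RL-tuned model that a method recovers. The short sketch below assumes that definition, which this summary does not confirm, and uses made-up accuracies rather than numbers from the paper.

```python
def gap_closed(base_acc: float, method_acc: float, rl_acc: float) -> float:
    """Fraction of the base-to-RL accuracy gap recovered by a method.

    Assumed definition: (method - base) / (rl - base).
    """
    gap = rl_acc - base_acc
    if gap <= 0:
        raise ValueError("RL accuracy must exceed the base accuracy")
    return (method_acc - base_acc) / gap

# Illustrative accuracies only, not figures from the paper.
low = gap_closed(base_acc=60.0, method_acc=66.5, rl_acc=70.0)
high = gap_closed(base_acc=60.0, method_acc=69.9, rl_acc=70.0)
print(f"{low:.0%} {high:.0%}")
```

With a hypothetical 10-point RL gap, the two ends of a 65 to 99 percent range sit 3.4 accuracy points apart, so the same headline claim can describe quite different outcomes.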

Why this matters if you are not training frontier models: task-specific RL on VLMs is expensive and out of reach for a lot of teams. If a training-free agent loop can capture most of that gain by accumulating tools and learning when to call each one, the cost picture for making an off-the-shelf VLM useful in a specific visual domain shifts. The paper also notes that when the tool set is given in advance, per-step tool choice improves on every tested backbone, which is the practical hook for teams that already have their own tools.

The honest caveat is that the evaluation is narrow. Four benchmarks and five backbones is a reasonable sweep for a paper, but the 65 to 99 percent range is wide enough that where any given setup lands matters a lot. What the reporting does not give you is how large the persistent skill and tool library grows in practice, how the agent chooses between tools once that library gets long, or how well the gains hold on tasks that look nothing like the small labeled set the agent learned from.

The direction is the part worth watching. Frozen-model deployments, regulated settings, and small teams without an RL budget are all places where an improvement that leaves the weights alone, and still stacks with RL where RL is available, is the more interesting story.

Originally reported by arxiv.org

Original headline: [2606.30185] Dynamo: Dynamic Skill-Tool Evolution for Vision-Language Agents