research.nvidia.com via Reddit May 27th 2026

NVIDIA LocateAnything Ships Open at 10x Rival Speed

nvidia computer vision open source multimodal computer-vision open-source multimodal inference

Key insights

NVIDIA's Parallel Box Decoding generates all bounding boxes in one forward pass, achieving 10x throughput over Qwen3-VL with matching accuracy.
LocateAnything trained on 138M samples covering 785M bounding boxes, the largest publicly disclosed grounding training set to date.
The 3B model ships fully open on Hugging Face with weights, paper, and code targeting agentic computer-use and GUI automation pipelines.

Why this matters

Throughput has been the binding constraint for visual grounding in production agentic systems, and a 10x improvement at 3B parameters means real-time computer-use pipelines previously dependent on expensive proprietary APIs can now run on a single consumer GPU. The fully open release, including weights, training methodology, and code, removes vendor lock-in for teams building GUI agents and document-intelligence workflows at scale. Parallel Box Decoding as an architectural approach could propagate across the vision-language research community, compressing the gap between open and closed model capability on spatial reasoning tasks faster than prior release cycles suggested.

Summary

NVIDIA Research has open-sourced LocateAnything, a 3B vision-language model that delivers 10x the throughput of Qwen3-VL on visual grounding tasks with no accuracy loss. The mechanism is Parallel Box Decoding (PBD): all bounding boxes are generated in a single forward pass rather than one token at a time. The model trained on 138M samples covering 785M bounding boxes across GUI grounding, OCR, document understanding, and dense object detection. Essentially: (NVIDIA) ships an open grounding model that out-throughputs Qwen3-VL by 10x without the accuracy penalty. - Weights, paper, GitHub, and an interactive demo are live on Hugging Face now. - At 3B parameters it runs on a single consumer GPU with no API dependency. - GUI grounding benchmarks make it a direct fit for agentic computer-use pipelines. The open release gives teams building production computer-use agents a fast grounding backbone that was previously only achievable through proprietary APIs.

Potential risks and opportunities

Risks

Teams deploying LocateAnything commercially face IP uncertainty if NVIDIA's 138M-sample training set contains scraped web content, as AI training-data copyright litigation expands through 2026.
Qwen3-VL's commercial operator (Alibaba Cloud) faces pricing pressure on grounding API products if open alternatives sustain comparable accuracy across downstream fine-tuning scenarios within 6-12 months.
NVIDIA competitors building proprietary computer-use vision stacks (Google DeepMind, Microsoft) face faster commoditization of grounding as a differentiator if open models continue closing the accuracy gap on GUI automation benchmarks.

Opportunities

Computer-use platform builders (Browserbase, Steel, Playwright-based agent startups) can replace proprietary grounding APIs immediately, cutting inference costs and eliminating rate-limit constraints.
Document intelligence vendors (Reducto, LlamaParse, Unstructured) gain a fast open grounding backbone for OCR and layout-parsing pipelines without licensing constraints or upstream API dependencies.
NVIDIA reinforces its position as the preferred open-AI research lab for enterprise GPU buyers, with LocateAnything serving as a deployment showcase for Hopper and Blackwell hardware efficiency at production scale.

What we don't know yet

Latency at batch size 1 is unreported; published benchmarks cover throughput in batch scenarios that may not reflect real-time interactive agent latency requirements.
Training data licensing for 138M samples is undisclosed, leaving commercial deployers uncertain about IP exposure as AI training-data litigation expands through 2026.
Whether PBD accuracy degrades under heavy occlusion or densely overlapping bounding boxes in cluttered scenes is not addressed in the published benchmarks.

Originally reported by research.nvidia.com

Read the original article →

Original headline: NVIDIA LocateAnything: Open 3B Vision-Language Grounding Model Runs 10× Faster Than Qwen3-VL Using Parallel Box Decoding