NVIDIA LocateAnything Ships Open at 10x Rival Speed
Key insights
- NVIDIA's Parallel Box Decoding generates all bounding boxes in one forward pass, achieving 10x throughput over Qwen3-VL with matching accuracy.
- LocateAnything was trained on 138M samples covering 785M bounding boxes, the largest publicly disclosed grounding training set to date.
- The 3B model ships fully open on Hugging Face with weights, paper, and code targeting agentic computer-use and GUI automation pipelines.
Why this matters
Throughput has been the binding constraint for visual grounding in production agentic systems. A 10x improvement at 3B parameters means real-time computer-use pipelines that previously depended on expensive proprietary APIs can now run on a single consumer GPU. The fully open release, including weights, training methodology, and code, removes vendor lock-in for teams building GUI agents and document-intelligence workflows at scale. If other vision-language researchers adopt Parallel Box Decoding as an architectural approach, the gap between open and closed models on spatial reasoning tasks could close faster than prior release cycles suggested.
Summary
NVIDIA Research has open-sourced LocateAnything, a 3B vision-language model that delivers 10x the throughput of Qwen3-VL on visual grounding tasks with no accuracy loss.
The mechanism is Parallel Box Decoding (PBD): all bounding boxes are generated in a single forward pass rather than one token at a time. The model was trained on 138M samples covering 785M bounding boxes across GUI grounding, OCR, document understanding, and dense object detection.
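The contrast between the two decoding styles can be sketched by counting forward passes. This is a toy illustration, not NVIDIA's implementation: the function names, the placeholder coordinates, and the assumption of four coordinate tokens per box are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PassCounter:
    """Counts how many model forward passes a decoder performs."""
    forward_passes: int = 0


def decode_autoregressive(num_boxes, counter, tokens_per_box=4):
    """Hypothetical baseline: one forward pass per coordinate token."""
    boxes = []
    for _ in range(num_boxes):
        coords = []
        for _ in range(tokens_per_box):
            counter.forward_passes += 1  # each token needs its own pass
            coords.append(0.0)           # placeholder coordinate value
        boxes.append(tuple(coords))
    return boxes


def decode_parallel(num_boxes, counter, tokens_per_box=4):
    """Hypothetical parallel decoder: every box comes out of one pass."""
    counter.forward_passes += 1          # a single pass covers all boxes
    return [tuple(0.0 for _ in range(tokens_per_box)) for _ in range(num_boxes)]


ar_counter, par_counter = PassCounter(), PassCounter()
ar_boxes = decode_autoregressive(20, ar_counter)
par_boxes = decode_parallel(20, par_counter)
print(ar_counter.forward_passes, par_counter.forward_passes)  # prints "80 1"
```

With 20 boxes and four tokens per box, the toy baseline needs 80 sequential passes and the parallel decoder needs one. That step-count ratio is not the same thing as the reported 10x throughput figure, which depends on the full model and the hardware it runs on.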
In short: NVIDIA has shipped an open grounding model with 10x the throughput of Qwen3-VL and no accuracy penalty.
- Weights, the paper, the GitHub repository, and an interactive demo are available now via Hugging Face.
- At 3B parameters it runs on a single consumer GPU with no API dependency.
- GUI grounding benchmarks make it a direct fit for agentic computer-use pipelines.
The open release gives teams building production computer-use agents a grounding backbone with speed that was previously available only through proprietary APIs.
Potential risks and opportunities
Risks
- Teams deploying LocateAnything commercially face IP uncertainty if NVIDIA's 138M-sample training set contains scraped web content, as AI training-data copyright litigation expands through 2026.
- Qwen3-VL's commercial operator (Alibaba Cloud) faces pricing pressure on grounding API products if open alternatives sustain comparable accuracy across downstream fine-tuning scenarios within 6-12 months.
- NVIDIA competitors building proprietary computer-use vision stacks (Google DeepMind, Microsoft) face faster commoditization of grounding as a differentiator if open models continue closing the accuracy gap on GUI automation benchmarks.
Opportunities
- Computer-use platform builders (Browserbase, Steel, Playwright-based agent startups) can replace proprietary grounding APIs immediately, cutting inference costs and eliminating rate-limit constraints.
- Document intelligence vendors (Reducto, LlamaParse, Unstructured) gain a fast open grounding backbone for OCR and layout-parsing pipelines without licensing constraints or upstream API dependencies.
- NVIDIA reinforces its position as the preferred open AI research lab for enterprise GPU buyers, with LocateAnything serving as a deployment showcase for Hopper and Blackwell hardware efficiency at production scale.
What we don't know yet
- Latency at batch size 1 is unreported; published benchmarks cover throughput in batch scenarios, which may not reflect the latency that real-time interactive agents require.
- Training data licensing for 138M samples is undisclosed, leaving commercial deployers uncertain about IP exposure as AI training-data litigation expands through 2026.
- Whether PBD accuracy degrades under heavy occlusion or densely overlapping bounding boxes in cluttered scenes is not addressed in the published benchmarks.
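Teams that want to close the batch-size-1 gap themselves could time their own deployment. The harness below is a minimal sketch under stated assumptions: `measure` and `fake_grounding_model` are hypothetical names, and the stand-in model simply sleeps rather than calling LocateAnything, so the numbers it prints say nothing about the real model.

```python
import time
from statistics import median


def measure(run_batch, batch_size, repeats=5):
    """Time run_batch(batch_size) several times.

    Returns (median seconds per batch, items processed per second).
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_batch(batch_size)
        timings.append(time.perf_counter() - start)
    latency = median(timings)
    return latency, batch_size / latency


def fake_grounding_model(batch_size):
    """Hypothetical stand-in: fixed overhead plus a small per-item cost."""
    time.sleep(0.002 + 0.0005 * batch_size)


for size in (1, 8):
    latency, throughput = measure(fake_grounding_model, size)
    print(f"batch={size} latency={latency:.4f}s throughput={throughput:.1f}/s")
```

Swapping the stand-in for a real inference call would show whether batched throughput gains carry over to the single-request latency an interactive agent experiences.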
Originally reported by research.nvidia.com
Original headline: NVIDIA LocateAnything: Open 3B Vision-Language Grounding Model Runs 10× Faster Than Qwen3-VL Using Parallel Box Decoding