research.nvidia.com via Reddit

NVIDIA LocateAnything Ships Open at 10x Rival Speed

nvidia computer-vision open-source multimodal inference

Key insights

  • NVIDIA's Parallel Box Decoding generates all bounding boxes in one forward pass, achieving 10x throughput over Qwen3-VL with matching accuracy.
  • LocateAnything was trained on 138M samples covering 785M bounding boxes, the largest publicly disclosed grounding training set to date.
  • The 3B model ships fully open on Hugging Face with weights, paper, and code targeting agentic computer-use and GUI automation pipelines.

Why this matters

Throughput has been the binding constraint for visual grounding in production agentic systems. A 10x improvement at 3B parameters means real-time computer-use pipelines that previously depended on expensive proprietary APIs can now run on a single consumer GPU. The fully open release, including weights, training methodology, and code, removes vendor lock-in for teams building GUI agents and document-intelligence workflows at scale. If Parallel Box Decoding spreads across the vision-language research community, it could narrow the gap between open and closed models on spatial reasoning tasks faster than prior release cycles suggested.

Summary

NVIDIA Research has open-sourced LocateAnything, a 3B vision-language model that delivers 10x the throughput of Qwen3-VL on visual grounding tasks with no accuracy loss. The mechanism is Parallel Box Decoding (PBD): all bounding boxes are generated in a single forward pass rather than one token at a time. The model was trained on 138M samples covering 785M bounding boxes across GUI grounding, OCR, document understanding, and dense object detection.

  • Weights, paper, GitHub code, and an interactive demo are live on Hugging Face now.
  • At 3B parameters it runs on a single consumer GPU with no API dependency.
  • GUI grounding benchmarks make it a direct fit for agentic computer-use pipelines.

The open release gives teams building production computer-use agents a fast grounding backbone that was previously only achievable through proprietary APIs.
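The difference between token-by-token decoding and single-pass decoding can be illustrated with a toy count of sequential forward passes. This is a minimal sketch, not NVIDIA's implementation: the function names are hypothetical, and the value of 5 tokens per box (four coordinates plus a separator) is an assumption rather than a figure from the paper.

```python
# Toy cost model: counts sequential forward passes only, not wall-clock time.
# TOKENS_PER_BOX is an assumed value (4 coordinates + 1 separator),
# not a figure taken from the LocateAnything paper.
TOKENS_PER_BOX = 5


def autoregressive_passes(num_boxes: int, tokens_per_box: int = TOKENS_PER_BOX) -> int:
    """Token-by-token decoding: one forward pass per emitted token."""
    return num_boxes * tokens_per_box


def parallel_passes(num_boxes: int) -> int:
    """Single-pass decoding as the summary describes it: one pass for any box count."""
    return 1 if num_boxes > 0 else 0


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, autoregressive_passes(n), parallel_passes(n))
```

The pass count is not the reported 10x throughput figure. Real speedups depend on per-pass cost, batching, and hardware, none of which this sketch models. It only shows why sequential cost grows with the number of boxes in one scheme and stays flat in the other.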

Potential risks and opportunities

Risks

  • Teams deploying LocateAnything commercially face IP uncertainty if NVIDIA's 138M-sample training set contains scraped web content, as AI training-data copyright litigation expands through 2026.
  • Qwen3-VL's commercial operator (Alibaba Cloud) faces pricing pressure on grounding API products if open alternatives sustain comparable accuracy across downstream fine-tuning scenarios within 6-12 months.
  • NVIDIA competitors building proprietary computer-use vision stacks (Google DeepMind, Microsoft) face faster commoditization of grounding as a differentiator if open models continue closing the accuracy gap on GUI automation benchmarks.

Opportunities

  • Computer-use platform builders (Browserbase, Steel, Playwright-based agent startups) can replace proprietary grounding APIs immediately, cutting inference costs and eliminating rate-limit constraints.
  • Document intelligence vendors (Reducto, LlamaParse, Unstructured) gain a fast open grounding backbone for OCR and layout-parsing pipelines without licensing constraints or upstream API dependencies.
  • NVIDIA reinforces its position as the preferred open-source AI research lab for enterprise GPU buyers, with LocateAnything serving as a deployment showcase for Hopper and Blackwell hardware efficiency at production scale.

What we don't know yet

  • Latency at batch size 1 is unreported; published benchmarks cover throughput in batch scenarios that may not reflect real-time interactive agent latency requirements.
  • Training data licensing for 138M samples is undisclosed, leaving commercial deployers uncertain about IP exposure as AI training-data litigation expands through 2026.
  • Whether PBD accuracy degrades under heavy occlusion or densely overlapping bounding boxes in cluttered scenes is not addressed in the published benchmarks.
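The batch-size-1 gap noted above matters because high batched throughput does not imply low per-request latency. Teams evaluating the model for interactive agents could measure both themselves. The harness below is a hedged sketch: `fake_grounder` is a hypothetical stand-in that simulates a fixed per-call overhead plus a per-item cost, and it would need to be replaced with a real inference call.

```python
import statistics
import time
from typing import Callable, Sequence


def measure(run_batch: Callable[[Sequence[object]], object],
            batch: Sequence[object],
            repeats: int = 5) -> dict:
    """Time run_batch(batch) several times; report median latency and throughput."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_batch(batch)
        timings.append(time.perf_counter() - start)
    latency = statistics.median(timings)
    return {
        "batch_size": len(batch),
        "latency_s": latency,                      # time until the whole batch returns
        "throughput_per_s": len(batch) / latency,  # items completed per second
    }


def fake_grounder(batch: Sequence[object]) -> None:
    """Hypothetical stand-in for a model call: fixed overhead plus per-item cost."""
    time.sleep(0.010 + 0.001 * len(batch))


if __name__ == "__main__":
    print(measure(fake_grounder, [0]))              # interactive case: batch size 1
    print(measure(fake_grounder, list(range(32))))  # batched case
```

With the simulated costs, the batch of 32 shows higher throughput but also higher latency than the single request, which is the pattern an interactive agent would need to check for on real hardware.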