NVIDIA SpatialClaw Scores 59.9% on Spatial Tasks, Training-Free
TL;DR
- SpatialClaw achieves 59.9% average accuracy across 20 spatial benchmarks, outperforming the prior leading agent, SpaceTools, by 11.2 percentage points.
- The framework is training-free, using a stateful Python kernel with tools including Depth Anything 3 and SAM3 to compose 3D perception via code.
- The largest gains appear on dynamic 4D and multi-view tasks: DSI-Bench improved by 17.6 points and MindCube by 15.3 points.
For vision-language models, knowing where things are in three-dimensional space has been the persistent gap that benchmarks keep exposing. Most fixes have involved retraining or fine-tuning, which means committing compute and dataset overhead before you see any return. NVIDIA Research's SpatialClaw, covered by MarkTechPost, takes a different line: treat code as the action interface and let an existing model write the perception pipeline itself.
The architecture wraps a stateful Python kernel preloaded with specialized tools, including Depth Anything 3 for depth and camera geometry, and SAM3, which produces masks from text, point, or box prompts. Rather than asking the model to answer spatial questions directly, SpatialClaw runs a five-stage loop of planning, code generation, execution, feedback, and answer submission. According to the research, 52.2% of its wins over baseline approaches are attributed specifically to code composition: the ability to iteratively chain perception outputs across executable cells.
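To make the loop concrete, here is a minimal sketch of a stateful kernel driven by a plan, generate, execute, feedback, submit cycle. It is not SpatialClaw's implementation: `StatefulKernel`, `solve`, `scripted_model`, and `fake_depth` are hypothetical names invented for illustration, the "model" is a hard-coded script standing in for a VLM, and `fake_depth` returns made-up distances rather than calling Depth Anything 3.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Keeps one namespace alive across cells, like a notebook kernel."""

    def __init__(self, tools):
        # Preload tool callables so generated code can call them by name.
        self.namespace = dict(tools)

    def run(self, code):
        """Execute one cell and return its stdout, plus a traceback on error."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            return buffer.getvalue() + traceback.format_exc()
        return buffer.getvalue()


def fake_depth(name):
    """Hypothetical stand-in for a depth tool; distances are invented."""
    return {"mug": 1.2, "lamp": 3.4}[name]


def scripted_model(question, history):
    """Hypothetical stand-in for the VLM: emits a plan plus code, or an answer."""
    if len(history) == 0:
        return {
            "plan": "measure the depth of each object",
            "code": "depths = {n: fake_depth(n) for n in ('mug', 'lamp')}\nprint(depths)",
        }
    if len(history) == 1:
        # This cell reuses `depths`, which only exists because state persists.
        return {
            "plan": "pick the object with the smallest depth",
            "code": "closer = min(depths, key=depths.get)\nprint(closer)",
        }
    return {"plan": "submit", "answer": history[-1]["feedback"].strip()}


def solve(question, model, kernel, max_steps=5):
    """Run the loop until the model submits an answer or steps run out."""
    history = []
    for _ in range(max_steps):
        step = model(question, history)              # planning + code generation
        if "answer" in step:                         # answer submission
            return step["answer"], history
        feedback = kernel.run(step["code"])          # execution
        history.append({**step, "feedback": feedback})  # feedback to the model
    return None, history


kernel = StatefulKernel({"fake_depth": fake_depth})
answer, history = solve("Which is closer, the mug or the lamp?", scripted_model, kernel)
print(answer)
```

The point of the sketch is the second cell: it can build on `depths` from the first because the kernel keeps its namespace between executions, which is the kind of chaining the article attributes to code composition.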
Across 20 benchmarks, SpatialClaw reaches 59.9% average accuracy, an 11.2 percentage point gain over SpaceTools, the prior leading spatial agent. The largest single-task improvements appear on dynamic 4D and multi-view problems, with DSI-Bench gaining 17.6 points and MindCube gaining 15.3 points. Those results reportedly hold across six model backbones in the 26B to 397B parameter range from the Qwen3.5/3.6 and Gemma4 families, using the same system prompt and hyperparameters throughout.
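Because the gain is stated in percentage points rather than percent, it helps to work out what it implies. The short calculation below assumes the 11.2-point gain is measured against SpaceTools' average on the same 20 benchmarks; the baseline figure is derived here, not quoted from the reporting.

```python
spatialclaw_avg = 59.9  # reported average accuracy, in percent
gain_points = 11.2      # reported gain over SpaceTools, in percentage points

# Implied SpaceTools average (an inference, not a reported number).
implied_baseline = round(spatialclaw_avg - gain_points, 1)

# Relative improvement over that implied baseline, in percent.
relative_gain = round(100 * gain_points / implied_baseline, 1)

print(implied_baseline, relative_gain)
```

Under that assumption, SpaceTools would sit at about 48.7%, and the 11.2-point gain corresponds to a relative improvement of roughly 23%.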
The honest caveat is that curated benchmark gains don't always survive contact with real deployment conditions. The reporting gives no latency figures, an omission that matters considerably in robotics, where response time is part of the spec alongside accuracy. The system also depends on Depth Anything 3 and SAM3 producing clean outputs; robustness under noisy or ambiguous real-world scenes isn't addressed.
The groups that stand to benefit most in the near term are robotics and multi-view inspection teams already running large-scale VLMs who want structured 3D reasoning without the overhead of another training run. If the training-free claim holds up outside benchmark conditions, this kind of plug-in spatial reasoning scaffold could meaningfully shorten the path from a capable foundation model to a deployment-ready perception system.
Originally reported by marktechpost.com
Original headline: NVIDIA Research Releases SpatialClaw: Training-Free Agent That Composes 3D Perception Tools via Code, Gains 11+ Points Over Prior Best on Spatial Benchmarks