H Company Holo3.1 Delivers Mobile and Local Agents
Key insights
- Holo3.1's 35B-A3B model scores 79.3% on AndroidWorld, up from 67%, the largest single benchmark gain in the release.
- NVFP4 checkpoints on DGX Spark cut average agent step time from 6.8 seconds to 3.3 seconds.
- Four model sizes from 0.8B to 35B-A3B are released, with quantized FP8, NVFP4, and Q4 GGUF checkpoints available for the 35B-A3B.
Why this matters
Holo3.1's mobile expansion and local deployment options push computer use agents closer to production parity across every device category, not just browser and desktop. The quantized inference path means organizations can deploy AI agents on consumer hardware or DGX Spark without data leaving their network, removing a key privacy barrier that has slowed enterprise adoption. The addition of native function-calling closes the integration gap with third-party agent orchestration stacks, meaning teams can adopt Holo3.1 without rebuilding existing pipelines.
Summary
H Company released Holo3.1 on June 2, a four-model open-weight family (0.8B, 4B, 9B, 35B-A3B) built on the Qwen architecture for computer use agents across web, desktop, and mobile.
Mobile is the headline gain: the 35B-A3B model climbs from 67% to 79.3% on AndroidWorld, while 4B and 9B variants rise from 58% to 72%. Added native function-calling support closes a key integration gap for third-party agent stacks, with more than 25% improvement inside the Holotab product harness.
In short, H Company and NVIDIA co-developed agent harness optimizations and quantized checkpoints that enable fully local deployment for the first time.
- NVFP4 on DGX Spark cuts average agent step time from 6.8s to 3.3s, a compound speedup of roughly 2x over the FP8 baseline.
- Q4 GGUF checkpoints run on Windows or Mac with nothing leaving the user's network.
- FP8 and NVFP4 checkpoints score within two points of full-precision BF16 on OSWorld.
Production computer use agents can now run entirely at the edge.
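The "roughly 2x" figure follows directly from the two reported step times. As a minimal sketch, the arithmetic can be checked as below; the 30-step task length is a hypothetical value chosen only for illustration, not a number from the release.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Ratio of the old step time to the new step time."""
    return before_s / after_s


def time_saved(before_s: float, after_s: float, steps: int) -> float:
    """Total seconds saved over a task with the given number of agent steps."""
    return (before_s - after_s) * steps


# Step times reported for DGX Spark (seconds per agent step).
BASELINE_S = 6.8
NVFP4_S = 3.3

# Hypothetical task length, assumed for illustration only.
HYPOTHETICAL_STEPS = 30

if __name__ == "__main__":
    print(f"speedup: {speedup(BASELINE_S, NVFP4_S):.2f}x")
    saved = time_saved(BASELINE_S, NVFP4_S, HYPOTHETICAL_STEPS)
    print(f"time saved over {HYPOTHETICAL_STEPS} steps: {saved:.0f} s")
```

Because the saving is per step, the absolute benefit scales with task length: longer agent runs gain more wall-clock time from the same per-step improvement.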
Potential risks and opportunities
Risks
- Teams that optimized pipelines around Holo3's structured JSON output format may face integration rework now that function-calling exists as a parallel execution path with near-parity performance.
- NVFP4 quantization depends on NVIDIA's Model Optimizer toolchain; users on non-NVIDIA hardware have no equivalent optimized path and will see the slower BF16 latency figures.
- The roughly two-point OSWorld accuracy drop from BF16 to quantized FP8/NVFP4 could compound in long multi-step agentic tasks, where small per-step errors accumulate across sequential actions.
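The compounding concern in the last risk can be illustrated with a simple model. The sketch below assumes each agent step succeeds independently with a fixed probability, which is a simplification, and the per-step rates of 0.99 and 0.98 are hypothetical. The reported two-point gap is a task-level OSWorld score, not a per-step error rate, so this shows only the general mechanism and is not an estimate for Holo3.1.

```python
def task_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical per-step success rates, assumed for illustration only.
FULL_PRECISION_RATE = 0.99
QUANTIZED_RATE = 0.98

if __name__ == "__main__":
    for steps in (10, 30, 50):
        full = task_success(FULL_PRECISION_RATE, steps)
        quant = task_success(QUANTIZED_RATE, steps)
        print(f"{steps:>2} steps: {full:.3f} vs {quant:.3f} (gap {full - quant:.3f})")
```

Under these assumptions, a one-point difference per step widens into a much larger gap in whole-task success as the number of sequential actions grows, which is why long-horizon tasks are the place to validate quantized checkpoints first.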
Opportunities
- NVIDIA gains a direct integration showcase on DGX Spark and its Model Optimizer toolchain, with a concrete production agent inference demo for enterprise hardware sales.
- Third-party agent orchestration platforms benefit immediately from Holo3.1's native function-calling support, removing the need to wrap structured JSON outputs or maintain custom adapters.
- Organizations with cloud-egress or data-privacy restrictions now have a viable fully-private computer use path via Q4 GGUF on Windows or Mac, with no data leaving the local network.
What we don't know yet
- How Holo3.1 performs on iOS automation benchmarks is not addressed; AndroidWorld is the only mobile benchmark cited.
- Specific Apple Silicon inference latency figures are referenced in the article but not reported in detail; only DGX Spark step times are given.
- Licensing terms for commercial deployment of the open-weight model family are not specified in the release post.
Originally reported by huggingface.co
Original headline: H Company Releases Holo3.1 — Open VLM Family (0.8B–35B) for Computer Use Agents With Mobile, Desktop, and Web Support