NVIDIA ships Nemotron 3 Ultra, a 550B open MoE for agents
TL;DR
- NVIDIA released Nemotron 3 Ultra with 550B total parameters and 55B active per token, under the permissive OpenMDW-1.1 license that allows commercial use.
- The model uses a hybrid architecture interleaving Mamba-2, Mixture-of-Experts, and attention layers, trained with an NVFP4 pre-training recipe.
- NVIDIA reports 1.6x to 5.9x the inference throughput of other open frontier models on a single benchmark shape, with an Artificial Analysis intelligence score near 48.
The most interesting thing about NVIDIA's Nemotron 3 Ultra release on Hugging Face is not the parameter count, even though 550 billion is the headline. It is that the company is shipping a frontier-scale Mixture-of-Experts under a permissive open-weight license, and that the same checkpoint comes pre-quantized to NVFP4, NVIDIA's own 4-bit format that Blackwell can run natively.
The architecture is unusual for a model in this class. According to MarkTechPost's writeup, it interleaves Mamba-2 layers, MoE layers and select attention layers, and adds Multi-Token Prediction on top. Total parameters land at 550B with 55B active per token, so roughly 10% of the weights are used for any given token. The NVFP4 checkpoint averages 5.03 bits-per-element by mixing NVFP4 routed experts with FP8 shared experts and Mamba linears, while attention stays in BF16. Pre-training data cuts off in September 2025 and post-training data in May 2026.
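The parameter and bit-width figures above translate into a rough weight footprint. The following is a back-of-the-envelope sketch, not an official sizing: it assumes the 5.03 bits-per-element figure is an average over all 550B parameters, and the function names are made up for illustration.

```python
# Figures quoted in the article.
TOTAL_PARAMS = 550e9       # total parameters
ACTIVE_PARAMS = 55e9       # parameters active per token
BITS_PER_ELEMENT = 5.03    # reported average for the NVFP4 checkpoint


def active_fraction(total: float, active: float) -> float:
    """Share of the weights used for a single token."""
    return active / total


def checkpoint_gb(params: float, bits_per_element: float) -> float:
    """Approximate weight size in decimal gigabytes (weights only).

    Assumption: bits_per_element is an average over every parameter.
    KV cache, activations and runtime overhead are not included.
    """
    return params * bits_per_element / 8 / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    mixed = checkpoint_gb(TOTAL_PARAMS, BITS_PER_ELEMENT)
    bf16 = checkpoint_gb(TOTAL_PARAMS, 16)
    print(f"active fraction: {frac:.0%}")
    print(f"mixed-precision weights: {mixed:.1f} GB")
    print(f"BF16 weights: {bf16:.1f} GB")
```

Under that assumption the weights come to about 346 GB, against 1,100 GB for the same parameter count in BF16.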
Why this matters if you are building agents. The reported numbers, echoed by vLLM, put inference throughput at 5.9x, 4.8x and 1.6x that of other open frontier models (GLM-5.1, Kimi-K2.6, Qwen-3.5) on an 8K input / 64K output shape, with comparable quality on agentic and reasoning benchmarks: 90.0 on PinchBench, 56.0 on ProfBench Search, and 65 to 70.4% on SWE-Bench Verified across several scaffolds. The license is the other half of the story. OpenMDW-1.1 grants royalty-free commercial use and explicitly carves model outputs out of the license obligations, which matters if you are shipping a product on top.
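One way to read the throughput ratios is as relative GPU time for a fixed token budget. The sketch below makes two assumptions that the article does not confirm: that the three ratios map to the three named models in the order listed, and that the workload matches the 8K input / 64K output shape on identical hardware. The dictionary and function names are illustrative only.

```python
# Reported throughput ratios; the model-to-ratio mapping is an assumption
# based on the order in which the article lists them.
REPORTED_SPEEDUP = {
    "GLM-5.1": 5.9,
    "Kimi-K2.6": 4.8,
    "Qwen-3.5": 1.6,
}


def relative_gpu_time(speedup: float) -> float:
    """GPU time needed relative to the baseline model for the same tokens.

    Only meaningful if the workload matches the benchmark shape and both
    models run on the same hardware.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    for baseline, speedup in REPORTED_SPEEDUP.items():
        share = relative_gpu_time(speedup)
        print(f"vs {baseline}: {share:.1%} of the baseline GPU time")
```

On these assumptions, the same token budget would take roughly 17%, 21% and 63% of the baseline GPU time respectively. A different input/output mix can shift those figures substantially.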
The honest caveat is that the headline speed numbers come from a single benchmark shape against three named competitors, and your sequence patterns will likely differ. The NVFP4 advantage is conditional too. Native FP4 math only kicks in on Blackwell; on Hopper the same checkpoint falls back to W4A16 (4-bit weights, 16-bit activations), because Hopper lacks FP4 tensor cores. And on raw intelligence, ChatForest's tracking puts Nemotron 3 Ultra around 48 on the Artificial Analysis index, top of the US open-weight pile but six points behind Kimi-K2.6.
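The hardware condition can be expressed as a simple check on CUDA compute capability. This is a hedged sketch, not how any serving stack actually dispatches kernels: it assumes Hopper reports compute capability 9.x and Blackwell 10.x or higher, and the returned labels are hypothetical names chosen for this example.

```python
from typing import Tuple


def nvfp4_execution_path(compute_capability: Tuple[int, int]) -> str:
    """Classify how the NVFP4 checkpoint would run on a given GPU.

    Assumptions (not from the article): Hopper is compute capability 9.x,
    Blackwell is 10.x or higher. The return values are illustrative labels.
    """
    major, _minor = compute_capability
    if major >= 10:
        return "native-fp4"       # FP4 tensor cores available
    if major == 9:
        return "w4a16-fallback"   # 4-bit weights, 16-bit activations
    return "unspecified"          # the article does not cover older GPUs


if __name__ == "__main__":
    for cap in [(9, 0), (10, 0), (8, 0)]:
        print(cap, "->", nvfp4_execution_path(cap))
```

In a PyTorch environment the tuple could come from `torch.cuda.get_device_capability()`. It is passed in here as a plain argument so the sketch runs without a GPU.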
The forward-looking read is that this is the first credible US-origin open frontier MoE that is genuinely cheap to serve at scale, if you have the right hardware. For teams that were paying hosted-API prices for agent workloads and wanted weights they could self-host under a clean commercial license, the calculus just changed.
Originally reported by huggingface.co
Original headline: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 · Hugging Face