NVIDIA GB300 Tops All Seven MLPerf 6.0 Benchmarks
Key insights
- MLCommons counted 95 unique systems from 24 organizations and 13 hardware accelerators in Training 6.0, yet NVIDIA claimed all seven benchmark wins.
- GB300 NVL72 ran up to 1.6x faster than GB200 NVL72 at identical GPU scale, with gains from higher power headroom and NVFP4 compute density.
- NVIDIA's DeepSeek-V3 throughput improved 1.3x in three months through software alone, via CUDA graphs, MXFP8 attention, and near-100% all-to-all overlap.
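To put the speedup factors above in wall-clock terms: a run that is s times faster takes 1/s of the original time, so the time saved is 1 - 1/s. The short Python sketch below does that conversion. The function name is illustrative, and it assumes that "1.6x faster" and "1.3x throughput" both describe a ratio of rates on a fixed workload, which the source does not spell out.

```python
def time_reduction_pct(speedup: float) -> float:
    """Percent reduction in wall-clock time for a given speedup factor.

    Assumes a fixed amount of work, so time scales as 1 / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    # Factors quoted in the key insights above.
    quoted = [
        ("GB300 NVL72 vs GB200 NVL72 (up to)", 1.6),
        ("DeepSeek-V3 software-only gain", 1.3),
    ]
    for label, factor in quoted:
        print(f"{label}: {factor}x -> {time_reduction_pct(factor):.1f}% less time")
```

Under that assumption, an up-to-1.6x speedup corresponds to at most 37.5% less training time, and a 1.3x throughput gain to roughly 23.1% less.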
Why this matters
Summary
Potential risks and opportunities
Risks
- Competing hardware vendors (AMD, Google) could challenge NVFP4 as a non-comparable precision format, narrowing the perceived performance gap if benchmarks are re-run at matched numerical precision.
- Enterprises and cloud customers who purchased GB200 NVL72 systems face near-term obsolescence pressure now that GB300 NVL72 results have been publicly verified with a 1.6x delta.
- The concentration of nineteen partner submissions across Google Cloud, Microsoft Azure, CoreWeave, HPE, and Dell creates a single-vendor hardware dependency that raises supply-chain risk for buyers building multi-year training infrastructure plans.
Opportunities
- CoreWeave and Microsoft Azure, which posted the headline DeepSeek-V3 671B and Llama 3.1 405B numbers respectively, can use these MLPerf-verified results as direct sales collateral in competitive GPU cloud procurement.
- HPE, Dell Technologies, and other OEM partners submitting results gain certification-adjacent positioning for enterprise buyers evaluating on-premises GB300 NVL72 deployments.
- The NVIDIA Resiliency Extension (NVRx), which covers fault detection and checkpoint-based recovery validated across 30-plus manufacturing tests, creates an upsell path for managed resiliency software on top of GB300 NVL72 hardware in long-running frontier training jobs.
What we don't know yet
- No non-NVIDIA platform submissions were reported for the two new mixture-of-experts benchmarks (DeepSeek-V3 671B and GPT-OSS-20B) -- whether AMD or Google submitted competing results is unaddressed.
- The up-to-1.6x speedup over GB200 NVL72 is not broken out per benchmark, so it is unclear whether it holds across all seven or reflects best-case gains on NVFP4-optimized workloads.
- Cost per training run at 8,192-GPU scale is entirely absent -- no pricing context is given for CoreWeave's 2.02-minute DeepSeek-V3 result or Microsoft Azure's 7.07-minute Llama result.
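Although no pricing is given, the GPU-hours behind a reported result follow directly from GPU count and time to train. The Python sketch below works through that arithmetic for the 2.02-minute DeepSeek-V3 run on 8,192 GPUs. The function names are illustrative, and the price per GPU-hour is a placeholder input, not a figure from any vendor or from the source. The calculation covers only the reported time to train, not setup or failed attempts.

```python
def gpu_hours(num_gpus: int, minutes: float) -> float:
    """GPU-hours consumed by a run of `minutes` on `num_gpus` GPUs."""
    return num_gpus * minutes / 60.0


def run_cost(num_gpus: int, minutes: float, price_per_gpu_hour: float) -> float:
    """Cost of one run, given a caller-supplied price per GPU-hour."""
    return gpu_hours(num_gpus, minutes) * price_per_gpu_hour


if __name__ == "__main__":
    # Figures taken from the text: 8,192 GPUs, 2.02 minutes.
    hours = gpu_hours(8192, 2.02)
    print(f"DeepSeek-V3 run: {hours:.1f} GPU-hours")

    # Placeholder price, purely to show how the formula is used.
    placeholder_price = 1.0  # NOT a real quote; substitute an actual rate
    print(f"Cost at placeholder rate: {run_cost(8192, 2.02, placeholder_price):.2f}")
```

The 8,192-GPU, 2.02-minute run works out to about 275.8 GPU-hours. Any cost figure depends entirely on the rate supplied.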
What others are reporting
- MLCommons via GlobeNewsWire
The benchmark organizer's neutral release counts 24 submitters and 13 accelerator types, framing NVIDIA's sweep as a market structure finding rather than a vendor claim.
Sparse computation is a dominant trend in AI right now. All of the major new generative AI models have utilized a sparse computation architecture.
- NVIDIA Developer Blog
Technical blog details six software optimizations driving the gains and documents DeepSeek-V3 throughput rising 1.3x in three months without any hardware changes.
achieved the fastest time to train at scale, and also delivered the highest performance when normalized on a per-accelerator basis on every benchmark.
- CoreWeave
Confirms its 2.02-minute DeepSeek-V3 result on 8,192 GB300 GPUs ran on production customer infrastructure, not a benchmark-only configuration, closing the deployment credibility gap.
The gap between benchmark performance and production reality remains one of the most persistent challenges in AI infrastructure.
- Lambda
Names specific per-system training times across GB300 NVL72 competitors and quantifies an 18.7% software-only speed gain over the previous round on the same hardware class.
This represents an 18.7% improvement in training speed attributed purely to software improvements over the last round.
Originally reported by nvidia.com
Original headline: NVIDIA Blackwell Sweeps All Seven MLPerf Training 6.0 Benchmarks at 8,192-GPU Scale — 1.6× Faster Than GB200