reddit.com via Reddit June 1st 2026

mistral.rs v0.8.2: Up to 2.8x Faster CUDA Inference Than llama.cpp on GB10, B200, and H100

inference open source nvidia cuda performance-benchmark

Summary

The mistral.rs maintainer released v0.8.2 on June 1, 2026, claiming up to 2.8x faster CUDA throughput than llama.cpp on Gemma 4 dense and MoE models, with benchmarks run on NVIDIA GB10, B200, and H100 hardware. The release focuses exclusively on CUDA performance optimization. The developer's self-posted benchmark tables show mistral.rs ahead of llama.cpp at every tested point on all three platforms. mistral.rs is an actively developed open-source inference engine built for local model serving.

Originally reported by reddit.com
