reddit.com via Reddit May 23rd 2026

Qwen3.6 35B MoE Hits 249 tok/s on RTX 5090 Laptop

alibaba nvidia inference local-llm benchmarks mtp blackwell

Key insights

Qwen3.6-35B-A3B with MTP achieves 249 tok/s on a 24GB RTX 5090 laptop, 3.4x the throughput of a dense 27B model on the same hardware.
Multi-token prediction (MTP) compounds the laptop RTX 5090's roughly 896 GB/s of memory bandwidth by emitting several tokens per forward pass, so each pass over the weights yields more output.
The MoE model activates only 3B parameters per token, making a 35B model affordable to run locally at near-cloud speeds.
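
A rough way to see why the active-parameter count matters: single-stream decoding is often limited by how fast weights can be read from memory, so an idealized ceiling is bandwidth divided by the bytes read per token. The sketch below applies that to the figures in the post. The 3.5 bits-per-weight value is an assumption chosen for illustration (the post does not state the effective size of Q3_K_XL), and the function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float) -> float:
    """Idealized tokens/s if every active weight is read once per token.

    Ignores KV-cache reads, attention compute, expert routing and
    kernel overhead, so it is an upper bound, not a prediction.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 896.0          # laptop RTX 5090 figure quoted in the post
ASSUMED_BITS_PER_WEIGHT = 3.5   # ASSUMPTION: illustrative, not from the post

moe_ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, 3.0, ASSUMED_BITS_PER_WEIGHT)
dense_ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, 27.0, ASSUMED_BITS_PER_WEIGHT)
ceiling_ratio = moe_ceiling / dense_ceiling

print(f"MoE (3B active) ceiling:  {moe_ceiling:.0f} tok/s")
print(f"Dense 27B ceiling:        {dense_ceiling:.0f} tok/s")
print(f"Ceiling ratio:            {ceiling_ratio:.1f}x")
```

Because both models are given the same assumed bits per weight, the ratio reduces to 27 / 3 regardless of the value chosen. Treat the absolute numbers as an upper limit under the stated assumptions, not as an estimate of measured throughput.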

Why this matters

Consumer hardware is crossing a threshold where a single laptop GPU can run a 35B MoE model at throughput previously requiring server-class multi-GPU rigs, which directly changes the cost calculus for local inference deployments and offline AI products. MTP on Blackwell GPUs appears to deliver multiplicative rather than incremental gains, meaning teams that benchmark their inference stacks against Ada-era hardware baselines may be significantly underestimating what is now achievable without cloud spend. For founders and infrastructure teams, the practical implication is that privacy-sensitive or latency-critical workloads that previously required cloud APIs can now be reconsidered as edge deployments on standard developer hardware.

Summary

Qwen3.6-35B-A3B with multi-token prediction (MTP) just posted 249 tokens per second on a consumer laptop RTX 5090 with 24GB VRAM, marking one of the first published MTP benchmarks on Nvidia's new Blackwell sm_120 architecture. The model ran at Q3_K_XL quantization, and its throughput is 3.4 times what a dense 27B model achieves on the same hardware. The RTX 5090 laptop variant delivers roughly 896 GB/s of memory bandwidth, and MTP compounds that advantage by predicting multiple tokens per forward pass, reducing the number of passes needed to generate a given output. Essentially, the Qwen team's model design and Nvidia's Blackwell hardware are converging to make frontier-class inference viable on single consumer-GPU setups.

Q3_K_XL quantization keeps the 35B MoE model within 24GB VRAM while preserving enough fidelity for practical use.
MTP gains are architecture-sensitive; Blackwell's higher memory bandwidth amplifies the effect compared to prior Ampere or Ada hardware.
The 35B-A3B MoE model activates only 3B parameters per token, so the effective compute cost is far lower than the parameter count implies.

As Blackwell GPU availability expands to consumer laptops, the gap between cloud and local inference throughput continues to narrow faster than most deployment roadmaps assumed.
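
To make the "fewer passes per output" claim concrete, here is a minimal sketch of one common way to model draft-and-verify decoding, which is how MTP heads are typically used. It assumes each drafted token is accepted independently with a fixed probability and that acceptance stops at the first rejection. The acceptance rate and draft length below are hypothetical, since the post reports neither. The last lines simply back out the dense 27B baseline implied by the two published figures.

```python
def expected_tokens_per_pass(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    One token is always produced by the main model; each of the
    `draft_tokens` extra tokens only counts if all earlier drafts
    were accepted, giving the sum p^0 + p^1 + ... + p^k.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


# HYPOTHETICAL inputs for illustration only (not from the post).
tokens_per_pass = expected_tokens_per_pass(accept_prob=0.8, draft_tokens=2)
print(f"Expected tokens per pass: {tokens_per_pass:.2f}")

# Figures from the post: 249 tok/s is 3.4 times the dense 27B result.
REPORTED_TOK_S = 249.0
REPORTED_RATIO = 3.4
implied_dense_tok_s = REPORTED_TOK_S / REPORTED_RATIO
print(f"Implied dense 27B throughput: {implied_dense_tok_s:.1f} tok/s")
```

With no drafted tokens the function returns 1.0, which corresponds to ordinary one-token-per-pass decoding. Real speedups are lower than the tokens-per-pass figure suggests, because drafting and verifying extra tokens adds work to each pass.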

Potential risks and opportunities

Risks

Developers who ship local inference products benchmarked on RTX 4090 hardware may find their latency and throughput estimates are no longer representative of the Blackwell installed base, creating user experience gaps at product launch.
The aggressive Q3_K_XL quantization behind this speed may introduce quality regressions in precision-sensitive tasks (code generation, structured output), and no quality benchmarks were shared alongside the throughput claims, risking over-deployment in production contexts.
If Nvidia's Blackwell laptop GPU allocation remains constrained through late 2026, only a narrow slice of local developers can replicate these results, making community benchmarks non-generalizable for most hardware planning decisions.

Opportunities

Inference optimization tooling vendors (llama.cpp maintainers, Ollama, LM Studio) can prioritize Blackwell sm_120 MTP support to capture the growing installed base of RTX 5090 laptop users seeking maximum throughput.
Founders building privacy-first or air-gapped AI products (legal tech, healthcare, defense-adjacent) now have a concrete hardware spec to sell around, with a single consumer laptop supporting near-cloud throughput on a frontier-class model.
Quantization tooling developers (GGUF ecosystem, Unsloth, AutoGPTQ) can differentiate by publishing systematic quality-vs-speed tradeoff tables specifically for Blackwell MTP configurations, filling the gap this benchmark left open.

What we don't know yet

Whether MTP throughput gains hold at higher-precision quantization levels (Q4, Q5, Q8) or degrade as model fidelity increases on Blackwell sm_120.
No output quality or benchmark accuracy scores were published alongside the throughput figures, leaving the fidelity-speed tradeoff at Q3_K_XL unquantified.
Whether these Blackwell MTP gains replicate on desktop RTX 5090 variants with different thermal and power envelopes compared to the laptop chip tested.
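
One reason the higher-precision question is open is that weight size grows linearly with bits per weight. The sketch below estimates weight-only footprints for a nominal 35B-parameter model. The Q4, Q5 and Q8 values use the block layouts of llama.cpp's legacy Q4_0, Q5_0 and Q8_0 formats (4.5, 5.5 and 8.5 bits per weight); the post does not say which Q4/Q5/Q8 variants it has in mind, so treat the mapping as an assumption. The function name is hypothetical.

```python
def weight_footprint_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Weight-only size in decimal GB; excludes KV cache and activations."""
    return total_params_b * bits_per_weight / 8.0


TOTAL_PARAMS_B = 35.0  # nominal parameter count from the model name

# ASSUMPTION: bits per weight for the legacy llama.cpp block formats.
ASSUMED_BITS_PER_WEIGHT = {
    "Q4_0": 4.5,
    "Q5_0": 5.5,
    "Q8_0": 8.5,
}

footprints = {
    name: weight_footprint_gb(TOTAL_PARAMS_B, bpw)
    for name, bpw in ASSUMED_BITS_PER_WEIGHT.items()
}

for name, size_gb in footprints.items():
    print(f"{name}: ~{size_gb:.1f} GB of weights")
```

Any format whose weights approach or exceed the card's 24GB would need partial CPU offload, which changes the bandwidth picture and makes a like-for-like throughput comparison harder.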

Originally reported by reddit.com


Original headline: r/LocalLLaMA: Qwen3.6 35B-A3B With MTP Hits 249 tok/s on 24GB Laptop RTX 5090 — 3.4× Faster Than Dense 27B on Same Hardware