Reddit via Reddit May 27th 2026

TritonMoE closes AMD gap in MoE inference kernels


Key insights

TritonMoE fuses MoE gate and up GEMM projections into one SwiGLU pass, cutting kernel launch overhead during expert routing.
The kernel benchmarks competitively against vendor-optimized baselines on both NVIDIA and AMD hardware using only Triton.
AMD Instinct MI-series and RDNA GPUs can now run Qwen3.6 and Gemma 4 MoE variants without ROCm-specific kernel development.

Why this matters

The vast majority of production MoE kernels are CUDA-only, which effectively excludes AMD hardware from competitive inference deployments for frontier models like Qwen3.6 and Gemma 4. TritonMoE indicates that hardware-portable kernel optimization is achievable through Triton's abstraction layer without sacrificing benchmark competitiveness, which undercuts the central objection to cross-platform approaches: that portability costs performance. For AI infrastructure teams and model serving vendors, this opens a realistic path to AMD GPU procurement without waiting for ROCm-specific kernel development cycles to catch up.

Summary

TritonMoE is an open-source MoE inference kernel in pure OpenAI Triton that runs on both NVIDIA and AMD GPUs without a line of CUDA or ROCm code. It fuses gate and up GEMM projections into a single SwiGLU pass, reducing the kernel launch overhead that stacks up during expert routing. Benchmarks show competitive results against vendor-optimized baselines on Qwen3.6 and Gemma 4 MoE variants, covering AMD Instinct MI-series and RDNA consumer hardware where efficient inference tooling has historically lagged. Essentially: one Triton codebase now covers both NVIDIA and AMD inference targets.

- Fused SwiGLU dispatch cuts kernel launch overhead per expert routing step.
- AMD Instinct MI-series and RDNA GPUs gain capable MoE inference tooling without waiting for separate ROCm-specific ports.
- Triton's abstraction layer handles hardware-specific lowering, removing the need to maintain parallel CUDA and ROCm codebases.

Most production MoE kernels remain CUDA-only, and AMD has been a second-class inference target as a direct result.
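
To make the gate/up fusion described in the summary concrete, here is a minimal pure-Python reference sketch of the arithmetic only. It is not TritonMoE's kernel: the function names, the toy shapes, and the choice to concatenate the gate and up weights column-wise into one matrix are all assumptions made for illustration. It shows that a single GEMM against the concatenated weights, followed by a split and SiLU(gate) * up, gives the same result as two separate GEMMs.

```python
import math

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_unfused(x, w_gate, w_up):
    """Baseline: two separate GEMMs, then the elementwise SiLU(gate) * up."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]

def swiglu_fused(x, w_gate_up, hidden):
    """Fused form: one GEMM against [w_gate | w_up], then split and apply SwiGLU.

    `hidden` is the width of each half; the column-wise layout is an assumption.
    """
    fused = matmul(x, w_gate_up)
    return [[silu(g) * u for g, u in zip(row[:hidden], row[hidden:])]
            for row in fused]

# Toy data (hypothetical): 2 tokens, model dim 3, expert hidden dim 2.
x = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]
w_gate = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
w_up = [[0.7, 0.8], [-0.9, 1.0], [1.1, -1.2]]

# Concatenate each weight row so the fused matrix is 3 x 4: [w_gate | w_up].
w_gate_up = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]

ref = swiglu_unfused(x, w_gate, w_up)
out = swiglu_fused(x, w_gate_up, hidden=2)
max_abs_diff = max(abs(a - b) for r_row, o_row in zip(ref, out)
                   for a, b in zip(r_row, o_row))
```

On a GPU, the benefit of this layout would come from launching one kernel instead of several and reading the input activations once, with the activation applied before results are written back. How TritonMoE actually tiles, schedules, and lays out the weights is not described in the source.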

Potential risks and opportunities

Risks

If TritonMoE's benchmark competitiveness holds only at small batch sizes, inference serving vendors (Together AI, Fireworks AI) may reject it for throughput-critical production deployments.
AMD MI300X hardware supply constraints could limit adoption even as the kernel matures, leaving the project without a large enough hardware base to attract sustained community optimization.
Triton compiler updates on either NVIDIA or AMD platforms could break kernel correctness or regress performance, creating a maintenance burden that small research teams may not sustain.

Opportunities

AMD and its OEM partners (Dell, HPE) gain a concrete open-source benchmark to reference when selling AI inference clusters against NVIDIA H100/H200 configurations.
Inference serving startups building AMD-native or multi-cloud offerings (RunPod, Lambda Labs) can adopt TritonMoE to support Qwen3.6 and Gemma 4 MoE deployments and compete on hardware cost.
Cloud providers with AMD GPU capacity (Azure ND MI300X instances, Oracle Cloud) have a new kernel they can surface to attract cost-sensitive model serving customers locked out by CUDA-only tooling.

What we don't know yet

Benchmark methodology is underspecified: whether throughput comparisons use identical batch sizes and sequence lengths on matched NVIDIA and AMD SKUs is not confirmed in the preprint.
No latency data is presented for long-context MoE routing with many experts simultaneously active, which is the stress case for production serving workloads.
Whether major inference runtimes (vLLM, SGLang, TensorRT-LLM) plan to integrate TritonMoE or require significant adaptation work remains unaddressed.

Originally reported by Reddit


Original headline: r/MachineLearning: TritonMoE — Cross-Platform Fused MoE Inference Kernel in Pure Triton Achieves NVIDIA and AMD Portability Without Vendor-Specific Code