Reddit via Reddit

TritonMoE closes AMD gap in MoE inference kernels

open source inference chips mixture-of-experts inference triton portability

Key insights

  • TritonMoE fuses MoE gate and up GEMM projections into one SwiGLU pass, cutting kernel launch overhead during expert routing.
  • The kernel benchmarks competitively against vendor-optimized baselines on both NVIDIA and AMD hardware using only Triton.
  • AMD Instinct MI-series and RDNA GPUs can now run Qwen3.6 and Gemma 4 MoE variants without ROCm-specific kernel development.
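
As an illustration of the first point, the gate and up projections of a SwiGLU layer can share one GEMM if their weights are concatenated along the output dimension. The NumPy sketch below is a reference-level model of that idea, not TritonMoE's actual kernel; the function names and toy sizes are hypothetical.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_unfused(x, w_gate, w_up):
    # Baseline: two separate GEMMs, then an elementwise combine.
    gate = x @ w_gate
    up = x @ w_up
    return silu(gate) * up

def swiglu_fused(x, w_gate_up):
    # Fused variant: one GEMM against the concatenated [gate | up] weight,
    # then split the result and apply SwiGLU in the same pass.
    d_ff = w_gate_up.shape[1] // 2
    gate_up = x @ w_gate_up
    return silu(gate_up[:, :d_ff]) * gate_up[:, d_ff:]

# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
tokens, d_model, d_ff = 4, 8, 16
x = rng.standard_normal((tokens, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_gate_up = np.concatenate([w_gate, w_up], axis=1)

# Both paths compute the same values; only the number of GEMMs differs.
assert np.allclose(swiglu_unfused(x, w_gate, w_up), swiglu_fused(x, w_gate_up))
```

On a GPU, the fused form would read the input activations once and avoid writing the intermediate gate and up tensors back to memory between launches, which is where the saving would be expected to come from.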

Why this matters

Most production MoE kernels are CUDA-only, which in practice shuts AMD hardware out of competitive inference deployments for frontier models like Qwen3.6 and Gemma 4. TritonMoE demonstrates that hardware-portable kernel optimization is achievable through Triton's abstraction layer without sacrificing benchmark competitiveness, undercutting the central objection to cross-platform approaches. For AI infrastructure teams and model serving vendors, this opens a realistic path to procuring AMD GPUs without waiting for ROCm-specific kernel development to catch up.

Summary

TritonMoE is an open-source MoE inference kernel in pure OpenAI Triton that runs on both NVIDIA and AMD GPUs without a line of CUDA or ROCm code. It fuses gate and up GEMM projections into a single SwiGLU pass, reducing the kernel launch overhead that stacks up during expert routing. Benchmarks show competitive results against vendor-optimized baselines on Qwen3.6 and Gemma 4 MoE variants, covering AMD Instinct MI-series and RDNA consumer hardware where efficient inference tooling has historically lagged. Essentially: one Triton codebase now covers both NVIDIA and AMD inference targets.

  • Fused SwiGLU dispatch cuts kernel launch overhead per expert routing step.
  • AMD Instinct MI-series and RDNA GPUs gain capable MoE inference tooling without waiting for separate ROCm-specific ports.
  • Triton's abstraction layer handles hardware-specific lowering, removing the need to maintain parallel CUDA and ROCm codebases.

Most production MoE kernels remain CUDA-only, and AMD has been a second-class inference target as a direct result.
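
To make the "launch overhead stacks up during expert routing" claim concrete, here is a toy NumPy model of top-k routing that tallies how many kernel launches each active expert would need. The cost model is an assumption made for illustration (one launch per GEMM and one per elementwise step, with experts processed one at a time); it does not describe how TritonMoE schedules work, and all names and sizes are hypothetical.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def route_top_k(router_logits, k):
    # Pick the k highest-scoring experts per token and softmax over those k.
    top_idx = np.argsort(-router_logits, axis=1)[:, :k]
    top_logits = np.take_along_axis(router_logits, top_idx, axis=1)
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return top_idx, weights

def moe_forward(x, w_router, w_gate_up, w_down, k):
    top_idx, weights = route_top_k(x @ w_router, k)
    d_ff = w_down.shape[1]
    out = np.zeros_like(x)
    launches = {"fused": 0, "unfused": 0}
    for e in range(w_router.shape[1]):
        # Tokens that selected expert e, and which top-k slot it occupied.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue
        gate_up = x[token_ids] @ w_gate_up[e]
        hidden = silu(gate_up[:, :d_ff]) * gate_up[:, d_ff:]
        out[token_ids] += weights[token_ids, slot][:, None] * (hidden @ w_down[e])
        # Assumed cost model, not a measurement:
        launches["fused"] += 2    # fused gate/up + SwiGLU, then down GEMM
        launches["unfused"] += 4  # gate GEMM, up GEMM, SwiGLU, down GEMM
    return out, launches

# Hypothetical toy configuration.
rng = np.random.default_rng(0)
tokens, d_model, d_ff, n_experts, k = 8, 16, 32, 4, 2
x = rng.standard_normal((tokens, d_model))
w_router = rng.standard_normal((d_model, n_experts))
w_gate_up = rng.standard_normal((n_experts, d_model, 2 * d_ff))
w_down = rng.standard_normal((n_experts, d_ff, d_model))

out, launches = moe_forward(x, w_router, w_gate_up, w_down, k)
print(launches)
```

Under this model the per-expert saving is constant, so the absolute gap grows with the number of experts that receive tokens in a step. Whether that translates into wall-clock gains depends on batch size and on how the real kernel groups experts, which the brief does not specify.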

Potential risks and opportunities

Risks

  • If TritonMoE's benchmark competitiveness holds only at small batch sizes, inference serving vendors (Together AI, Fireworks AI) may reject it for throughput-critical production deployments.
  • AMD MI300X hardware supply constraints could limit adoption even as the kernel matures, leaving the project without a large enough hardware base to attract sustained community optimization.
  • Triton compiler updates on either NVIDIA or AMD platforms could break kernel correctness or regress performance, creating a maintenance burden that a small research team may not be able to sustain.

Opportunities

  • AMD and its OEM partners (Dell, HPE) gain a concrete open-source benchmark to reference when selling AI inference clusters against NVIDIA H100/H200 configurations.
  • Inference serving startups building AMD-native or multi-cloud offerings (RunPod, Lambda Labs) can adopt TritonMoE to support Qwen3.6 and Gemma 4 MoE deployments and compete on hardware cost.
  • Cloud providers with AMD GPU capacity (Azure ND MI300X instances, Oracle Cloud) have a new kernel they can surface to attract cost-sensitive model serving customers locked out by CUDA-only tooling.

What we don't know yet

  • Benchmark methodology is underspecified: whether throughput comparisons use identical batch sizes and sequence lengths on matched NVIDIA and AMD SKUs is not confirmed in the preprint.
  • No latency data is presented for long-context MoE routing with many experts simultaneously active, which is the stress case for production serving workloads.
  • Whether major inference runtimes (vLLM, SGLang, TensorRT-LLM) plan to integrate TritonMoE or require significant adaptation work remains unaddressed.