reddit.com via Reddit

AMD MI300X monokernel hits 3,300 tokens per second

amd inference open source inference ai-infrastructure

Key insights

  • A monokernel eliminates kernel-launch overhead by running the entire LLM decode sequence as one persistent GPU-resident program on AMD MI300X.
  • Mapping memory access to HBM die topology and grouping compute units by physical adjacency yields 3,300 output tokens per second per request.
  • The reported figure exceeds what standard multi-kernel HIP (CUDA-style) pipelines reach on identical hardware, suggesting MI300X has significant untapped inference throughput.

Why this matters

The standard GPU software stack introduces measurable latency at every kernel boundary, and this result indicates what that overhead costs on MI300X-class hardware at production token rates. For AI infrastructure teams evaluating AMD as an alternative to Nvidia, a single-developer result of 3,300 tokens per second per request suggests the hardware is capable but the default software path leaves substantial throughput unrealized. Inference platform builders and cloud operators with AMD inventory should treat this as a signal that hardware-native programming on MI300X can yield step-change improvements that framework-level tuning alone cannot reach.
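
To make the kernel-boundary cost concrete, the sketch below works out the time budget per token at the reported rate and the share of it that launch overhead alone would consume. Only the 3,300 tokens per second figure comes from the post; the launch count and per-launch cost are hypothetical placeholders, not measurements.

```python
# Illustrative arithmetic only. The launch count and per-launch overhead
# below are hypothetical assumptions, not numbers from the post.

TOKENS_PER_SECOND = 3_300  # rate reported in the post (per request)


def per_token_budget_us(tokens_per_second: float) -> float:
    """Wall-clock time available for one output token, in microseconds."""
    return 1_000_000 / tokens_per_second


def launch_overhead_fraction(launches_per_token: int,
                             overhead_us_per_launch: float,
                             tokens_per_second: float) -> float:
    """Share of the per-token budget spent purely on kernel launches."""
    budget_us = per_token_budget_us(tokens_per_second)
    return launches_per_token * overhead_us_per_launch / budget_us


if __name__ == "__main__":
    budget = per_token_budget_us(TOKENS_PER_SECOND)
    print(f"budget per token: {budget:.1f} us")  # 303.0 us

    # Hypothetical multi-kernel pipeline: 100 launches per token, 2 us each.
    share = launch_overhead_fraction(100, 2.0, TOKENS_PER_SECOND)
    print(f"launch overhead share: {share:.0%}")  # 66%
```

Under these assumed inputs, launches alone would take about two thirds of the roughly 303 microsecond budget, which is the kind of cost a single persistent kernel avoids. Real launch counts and overheads vary with the model and runtime.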

Summary

A developer has built a monokernel that runs the full LLM decode sequence as one GPU-resident program on AMD MI300X, achieving 3,300 output tokens per second per request. The approach maps memory access directly to the HBM die topology and groups compute units by physical adjacency, eliminating the kernel-launch overhead that taxes standard multi-kernel HIP (CUDA-style) pipelines on the same hardware. In short, one independent developer working on AMD MI300X has shown that the hardware ceiling is higher than current software abstractions allow.

  • 3,300 tokens per second per request outperforms standard multi-kernel HIP (CUDA-style) pipelines on identical MI300X hardware.
  • The MI300X unified memory pool across GPU dies is the key enabler; the monokernel exploits inter-die locality explicitly rather than through a generic dispatch layer.
  • The post drew heavy r/MachineLearning engagement and is one of the first published results at this scale built around the MI300X physical memory layout.

AMD has positioned MI300X as a datacenter rival to Nvidia H100/H200, and results like this suggest the performance gap is as much a software problem as a hardware one.
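
The post does not publish its mapping, so the following is only a sketch of what grouping work by physical adjacency could look like. It assumes the hardware hands consecutive workgroup IDs to the 8 MI300X compute dies (XCDs) in round-robin order, and it remaps IDs so that neighbouring logical tiles, which tend to touch the same data, land on the same die. The function names are hypothetical, and the round-robin dispatch rule is an assumption of this sketch.

```python
# Minimal sketch, not the author's method. Assumption: workgroup i is
# dispatched to XCD (i % NUM_XCDS), i.e. round-robin across dies.

NUM_XCDS = 8  # MI300X has 8 accelerator complex dies (XCDs)


def physical_xcd(workgroup_id: int, num_xcds: int = NUM_XCDS) -> int:
    """Die that a workgroup runs on under the assumed round-robin dispatch."""
    return workgroup_id % num_xcds


def remap_to_logical_tile(workgroup_id: int, num_workgroups: int,
                          num_xcds: int = NUM_XCDS) -> int:
    """Pick the logical tile for a workgroup so each die gets a contiguous range."""
    if num_workgroups % num_xcds != 0:
        raise ValueError("sketch assumes num_workgroups is a multiple of num_xcds")
    tiles_per_xcd = num_workgroups // num_xcds
    xcd = physical_xcd(workgroup_id, num_xcds)
    slot = workgroup_id // num_xcds  # position of this workgroup within its die
    return xcd * tiles_per_xcd + slot


def tiles_by_xcd(num_workgroups: int, num_xcds: int = NUM_XCDS) -> dict:
    """Group the remapped logical tiles by the die they end up on."""
    groups = {xcd: [] for xcd in range(num_xcds)}
    for wid in range(num_workgroups):
        tile = remap_to_logical_tile(wid, num_workgroups, num_xcds)
        groups[physical_xcd(wid, num_xcds)].append(tile)
    return groups


if __name__ == "__main__":
    for xcd, tiles in tiles_by_xcd(16).items():
        print(f"XCD {xcd}: logical tiles {tiles}")
```

Without the remap, logical tiles 0 and 1 would sit on different dies; with it, each die owns a contiguous block of tiles. A real kernel would apply the same index arithmetic on the device, and the actual dispatch order should be checked against AMD's documentation.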

Potential risks and opportunities

Risks

  • vLLM and SGLang maintainers face competitive pressure if hardware-native monokernels show sustained 2x+ throughput gaps on MI300X, potentially fragmenting the AMD inference software ecosystem
  • AMD risks this result remaining a research artifact if productizing die-topology-aware programming requires toolchain investment the company has not committed to publicly
  • Inference providers standardized on Nvidia CUDA stacks could face customer re-evaluation requests if MI300X monokernel performance is replicated at production scale within the next 6 to 12 months

Opportunities

  • AMD could accelerate developer evangelism and low-level tooling around MI300X topology-aware programming to convert this benchmark into a sustained competitive differentiator against Nvidia H100/H200 deployments
  • Inference compiler teams (Modular, Triton contributors) could productize die-topology memory mapping as a reusable library or compiler pass targeting AMD hardware
  • Cloud providers with MI300X capacity (Microsoft Azure, Oracle Cloud) could re-benchmark their AMD instances using this technique and market them more aggressively to latency-sensitive inference customers

What we don't know yet

  • Whether the 3,300 tokens/second figure holds across different model sizes and architectures, or is specific to one undisclosed benchmark configuration
  • Whether the approach generalizes to multi-GPU and multi-node setups beyond a single MI300X, given that the gains rely on on-package die topology that changes at device and node boundaries
  • Whether AMD's ROCm team or any major inference framework (vLLM, SGLang) is actively tracking this work for upstream integration as of mid-2026
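
The model-size question can be bounded with simple bandwidth arithmetic. The sketch below assumes single-request decode is memory-bound and reads every weight byte from HBM once per token, and it uses AMD's quoted 5.3 TB/s peak bandwidth for MI300X as an input. Both are assumptions of this example; it ignores KV-cache traffic, on-chip caches, and the gap between peak and achieved bandwidth, so treat the outputs as rough ceilings rather than predictions.

```python
# Rough upper bounds only. Assumes memory-bound decode that streams all
# weights from HBM once per token; the bandwidth is a vendor peak figure.

PEAK_HBM_BANDWIDTH_BYTES_PER_S = 5.3e12  # assumed peak for MI300X


def max_bytes_per_token(tokens_per_second: float,
                        bandwidth: float = PEAK_HBM_BANDWIDTH_BYTES_PER_S) -> float:
    """Most bytes that can be read per token while sustaining the given rate."""
    return bandwidth / tokens_per_second


def max_tokens_per_second(bytes_read_per_token: float,
                          bandwidth: float = PEAK_HBM_BANDWIDTH_BYTES_PER_S) -> float:
    """Highest single-request rate if each token reads this many bytes."""
    return bandwidth / bytes_read_per_token


if __name__ == "__main__":
    limit_gb = max_bytes_per_token(3_300) / 1e9
    print(f"at 3,300 tok/s: at most {limit_gb:.2f} GB read per token")  # 1.61 GB

    # Hypothetical model: 8 billion parameters stored at 2 bytes each.
    weights_bytes = 8e9 * 2
    print(f"16 GB of weights: at most {max_tokens_per_second(weights_bytes):.0f} tok/s")  # 331
```

Under these assumptions the reported rate is only reachable when each token reads about 1.6 GB or less from HBM, which is why the undisclosed model size and precision matter for interpreting the benchmark.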