FeatherOps fp8 Emulation Reaches Four More Models on RDNA3
Key insights
- FeatherOps added emulated fp8 matmul support for Anima, LTX 2.3, Qwen-Image, and Wan on RDNA3 GPUs lacking native fp8 hardware.
- Custom kernels emulate fp8 matmul in software, giving RX 7900-class cards diffusion throughput gains without hardware-level fp8 acceleration.
- The ComfyUI integration has been substantially reworked since March, and the developer indicates that other, untested models are likely to work as well.
Why this matters
AMD's RDNA3 cards make up a significant share of high-end consumer GPU installations, and the lack of native fp8 support has been a concrete performance ceiling for local diffusion workflows on that hardware. FeatherOps' software emulation shows that a missing low-precision matrix format can be bridged at the kernel level, which changes how model developers and inference tooling authors should weigh investment in AMD support. If emulated fp8 proves stable and accurate across a wider set of models, RDNA3 users become a viable secondary market for AI inference tooling that has historically been designed and benchmarked Nvidia-first.
Summary
ComfyUI-FeatherOps now supports emulated fp8 matrix multiplication on four additional models: Anima, LTX 2.3, Qwen-Image, and Wan. RDNA3 GPUs like the RX 7900 series lack native fp8 hardware, leaving AMD users without the throughput improvements available to owners of fp8-capable Nvidia GPUs.
The update uses custom kernels to emulate fp8 matmul in software, working around the hardware limitation rather than waiting for it to be fixed. The ComfyUI integration itself has been substantially reworked since the original March kernel release, making this more than a simple model addition.
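To make "emulated fp8 matmul" concrete, here is a minimal pure-Python sketch of one common approach: weights are stored as 8-bit E4M3 codes with a per-tensor scale, then decoded to a wider float type just before the multiply-accumulate. This is an illustration under stated assumptions, not FeatherOps' actual kernel. The report does not say which fp8 variant or scaling scheme the project uses, and the function names (`decode_e4m3`, `encode_e4m3`, `quantize_weights`, `emulated_fp8_matmul`) are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 encoding assumed here


def decode_e4m3(code: int) -> float:
    """Decode one fp8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if code & 0x80 else 1.0
    exponent = (code >> 3) & 0xF
    mantissa = code & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan  # this variant has NaN but no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Lookup table over all 256 codes; decoding becomes a table read.
DECODE_LUT = [decode_e4m3(c) for c in range(256)]
FINITE_CODES = [c for c in range(256) if not math.isnan(DECODE_LUT[c])]


def encode_e4m3(x: float) -> int:
    """Brute-force nearest finite code (ties follow table order, not round-to-even)."""
    return min(FINITE_CODES, key=lambda c: abs(DECODE_LUT[c] - x))


def quantize_weights(w):
    """Quantize a K x N weight matrix to fp8 codes plus one per-tensor scale."""
    max_abs = max(abs(v) for row in w for v in row)
    scale = max_abs / E4M3_MAX if max_abs > 0 else 1.0
    codes = [[encode_e4m3(v / scale) for v in row] for row in w]
    return codes, scale


def emulated_fp8_matmul(a, w_codes, scale):
    """Multiply M x K activations by K x N fp8-coded weights.

    Weights are decoded to Python floats and accumulated at full precision,
    which is the "emulation": no fp8 arithmetic unit is involved.
    """
    k, n = len(w_codes), len(w_codes[0])
    out = []
    for row in a:
        acc = [0.0] * n
        for i in range(k):
            for j in range(n):
                acc[j] += row[i] * DECODE_LUT[w_codes[i][j]]
        out.append([v * scale for v in acc])
    return out


weights = [[448.0, 224.0], [-112.0, 56.0]]  # chosen to be exactly representable
codes, scale = quantize_weights(weights)
result = emulated_fp8_matmul([[1.0, 2.0]], codes, scale)
```

In this toy version the benefit is only that each weight occupies one byte. A GPU kernel following the same idea would decode in registers and feed the matrix units in a format the hardware does support, which is where any speedup would come from.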
In short, AMD RDNA3 users in the ComfyUI community gained diffusion throughput on hardware that previously could not use fp8 acceleration at all.
- Four models now officially supported: Anima, LTX 2.3, Qwen-Image, Wan
- Developer suggests compatibility may extend to additional untested models beyond these four
- RX 7900-class cards are the primary beneficiary, gaining speed improvements that required Nvidia hardware before this release
For the local AI community, this narrows a gap between consumer AMD and Nvidia hardware that has persisted since fp8 acceleration became a standard optimization target for diffusion workloads.
Potential risks and opportunities
Risks
- Emulated fp8 could produce subtle numerical divergence that only surfaces in long inference runs, potentially corrupting outputs for RX 7900 users without obvious error signals
- AMD ROCm updates that alter low-level kernel behavior could silently break FeatherOps custom kernels, leaving users with no upstream compatibility path
- ComfyUI workflows built around FeatherOps speedups depend on a single developer; abandonment or stalled maintenance would strand those users if AMD does not formalize an equivalent approach in its official stack
Opportunities
- Model developers targeting AMD (Stability AI, Black Forest Labs) can now document RDNA3 fp8 compatibility, expanding addressable user base without hardware-side changes
- ComfyUI node and workflow developers can integrate FeatherOps support to differentiate their tools for the AMD segment, which currently lacks equivalent optimization coverage
- AMD could formally absorb this emulation approach into ROCm or its AI stack, using community proof-of-concept work to accelerate its official tooling
What we don't know yet
- Quantified throughput numbers: no published benchmark comparisons between emulated fp8 on RDNA3 and native fp8 on equivalent Nvidia hardware for these four models
- Whether RDNA4 (RX 9000 series, which reportedly includes native fp8 support) makes this emulation layer obsolete for the next GPU generation before it matures
- Numerical accuracy parity: no independent verification that emulated fp8 outputs match native fp8 quality across extended or high-step inference runs
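On the accuracy question, one way a user could start checking for themselves is to run the same workflow with a fixed seed through a higher-precision path and through the emulated fp8 path, flatten both outputs, and compare them numerically. The sketch below shows only the comparison step. The function name `compare_outputs` and the choice of metrics are assumptions for illustration, not part of FeatherOps or ComfyUI, and no acceptance thresholds are implied.

```python
import math


def compare_outputs(reference, candidate):
    """Compare two equal-length flat lists of floats (e.g. flattened latents).

    Returns max absolute error, relative L2 error, and cosine similarity.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")

    diff_sq = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    ref_sq = sum(r * r for r in reference)
    cand_sq = sum(c * c for c in candidate)
    dot = sum(r * c for r, c in zip(reference, candidate))

    max_abs = max(abs(r - c) for r, c in zip(reference, candidate))
    if ref_sq > 0:
        rel_l2 = math.sqrt(diff_sq / ref_sq)
    else:
        rel_l2 = 0.0 if diff_sq == 0 else math.inf
    if ref_sq > 0 and cand_sq > 0:
        cosine = dot / math.sqrt(ref_sq * cand_sq)
    else:
        cosine = math.nan  # undefined when either vector is all zeros

    return {
        "max_abs_error": max_abs,
        "relative_l2_error": rel_l2,
        "cosine_similarity": cosine,
    }


# Toy values standing in for real model outputs.
report = compare_outputs([3.0, 4.0], [3.0, 5.0])
```

Tracking these numbers at several step counts would show whether divergence grows over longer runs, which is the scenario flagged under Risks.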
Originally reported by reddit.com
Original headline: r/StableDiffusion: FeatherOps Expands Emulated fp8 Matmul for RDNA3: Anima, LTX 2.3, Qwen-Image, and Wan Now Supported Without Native fp8 Hardware